linux-nvme.lists.infradead.org archive mirror
* [PATCH 0/5] nvmet-rdma/srpt: SRQ per completion vector
@ 2020-03-17 13:40 Max Gurtovoy
  2020-03-17 13:40 ` [PATCH 1/5] IB/core: add a simple SRQ set per PD Max Gurtovoy
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 13:40 UTC (permalink / raw)
  To: linux-nvme, sagi, hch, loberman, bvanassche, linux-rdma
  Cc: vladimirk, shlomin, leonro, dledford, jgg, oren, kbusch,
	Max Gurtovoy, idanb

This set is a renewed version of the SRQ feature for the NVMEoF/RDMA
target. In this series I've decided to implement it also for the SRP
target, which had a similar implementation (a single SRQ per HCA),
following previous requests from the community. The goal is to save
resources (by sharing receive buffers) and to exploit the locality of
completions in order to get the best performance out of Shared Receive
Queues (SRQs). We create an SRQ per completion vector (rather than per
device) using a new API (a basic SRQ set, added in this patchset as
well) and associate each created QP/CQ/channel with an appropriate SRQ.
This also reduces the lock contention on the single per-device SRQ
(today's solution).
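
Roughly, the association looks like the following (a simplified sketch of
what the nvmet-rdma and srpt patches below do; the names here are
placeholders rather than the final code):

	/* one SRQ per completion vector, bounded by what the device supports */
	srq_count = min(ibdev->num_comp_vectors, ibdev->attrs.max_srq);
	ret = rdma_srq_set_init(pd, srq_count, &srq_attr);	/* new API, patch 1 */

	/* the ULP pulls the SRQs out of the set into its own array ... */
	for (i = 0; i < srq_count; i++)
		srqs[i] = rdma_srq_get(pd);

	/* ... and ties every new queue/channel to the SRQ of its vector */
	queue->srq = srqs[queue->comp_vector % srq_count];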

For NVMEoF, my testing environment included 4 initiators (CX5, CX5, CX4,
CX3) connected to 4 subsystems (1 namespace per subsystem) through 2
ports (each initiator connected to a unique subsystem backed by a
different null_blk device) via a switch to the NVMEoF target (CX5).
I used the RoCE link layer. For SRP, I tested on 1 server with a
loopback connection (those results are not included below). Hopefully
I'll get a Tested-by signature and feedback from Laurence on the SRP
part during the review process.

The results below were gathered a while ago using NVMEoF.

Configuration:
 - irqbalance stopped on each server
 - set_irq_affinity.sh run on each interface
 - 2 initiators running traffic through port 1
 - 2 initiators running traffic through port 2
 - register_always=N set on the initiators
 - fio with 12 jobs, iodepth 128

Memory consumption calculation for recv buffers (target):
 - Multiple SRQ: SRQ_size * comp_num * ib_devs_num * inline_buffer_size
 - Single SRQ: SRQ_size * 1 * ib_devs_num * inline_buffer_size
 - MQ: RQ_size * CPU_num * ctrl_num * inline_buffer_size
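
For reference, the Multiple-SRQ formula as plain code (just arithmetic,
a sketch to make the cases below easier to follow):

	/* recv buffer memory for the multiple-SRQ cases (sketch) */
	static size_t srq_recv_buf_mem(size_t srq_size, int comp_num,
				       int ib_devs_num, size_t inline_buf_size)
	{
		/* e.g. 1024 * 24 * 2 * 4096 bytes = 192 MiB (case 1 below) */
		return srq_size * comp_num * ib_devs_num * inline_buf_size;
	}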

Cases:
 1. Multiple SRQ with 1024 entries:
    - Mem = 1024 * 24 * 2 * 4k = 192MiB (constant - does not depend on the number of initiators)
 2. Multiple SRQ with 256 entries:
    - Mem = 256 * 24 * 2 * 4k = 48MiB (constant - does not depend on the number of initiators)
 3. MQ:
    - Mem = 256 * 24 * 8 * 4k = 192MiB (memory grows with every newly created ctrl)
 4. Single SRQ (current SRQ implementation):
    - Mem = 4096 * 1 * 2 * 4k = 32MiB (constant - does not depend on the number of initiators)

Results:

BS    1.read (target CPU)    2.read (target CPU)    3.read (target CPU)    4.read (target CPU)
---   ---------------------  ---------------------  ---------------------  ---------------------
1k    5.88M (80%)            5.45M (72%)            6.77M (91%)            2.2M  (72%)
2k    3.56M (65%)            3.45M (59%)            3.72M (64%)            2.12M (59%)
4k    1.8M  (33%)            1.87M (32%)            1.88M (32%)            1.59M (34%)

BS    1.write (target CPU)   2.write (target CPU)   3.write (target CPU)   4.write (target CPU)
---   ---------------------  ---------------------  ---------------------  ---------------------
1k    5.42M (63%)            5.14M (55%)            7.75M (82%)            2.14M (74%)
2k    4.15M (56%)            4.14M (51%)            4.16M (52%)            2.08M (73%)
4k    2.17M (28%)            2.17M (27%)            2.16M (28%)            1.62M (24%)


We can see the performance improvement between Case 2 and Case 4 (same order of resources).
We can see the benefit in resource consumption (memory and CPU) with a small performance
loss between Cases 2 and 3.
There is still an open question about the 1k performance difference between Case 1 and
Case 3, but I guess we can investigate and improve it incrementally.

Thanks to Idan Burstein and Oren Duer for suggesting this nice feature.


Max Gurtovoy (5):
  IB/core: add a simple SRQ set per PD
  nvmet-rdma: add srq pointer to rdma_cmd
  nvmet-rdma: use SRQ per completion vector
  IB/core: cache the CQ completion vector
  RDMA/srpt: use SRQ per completion vector

 drivers/infiniband/core/Makefile      |   2 +-
 drivers/infiniband/core/cq.c          |   1 +
 drivers/infiniband/core/srq_set.c     |  78 +++++++++++++
 drivers/infiniband/core/verbs.c       |   4 +
 drivers/infiniband/ulp/srpt/ib_srpt.c | 169 +++++++++++++++++++++-------
 drivers/infiniband/ulp/srpt/ib_srpt.h |  26 ++++-
 drivers/nvme/target/rdma.c            | 202 +++++++++++++++++++++++++---------
 include/rdma/ib_verbs.h               |   6 +
 include/rdma/srq_set.h                |  18 +++
 9 files changed, 409 insertions(+), 97 deletions(-)
 create mode 100644 drivers/infiniband/core/srq_set.c
 create mode 100644 include/rdma/srq_set.h

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-17 13:40 [PATCH 0/5] nvmet-rdma/srpt: SRQ per completion vector Max Gurtovoy
@ 2020-03-17 13:40 ` Max Gurtovoy
  2020-03-17 13:55   ` Leon Romanovsky
  2020-03-17 13:40 ` [PATCH 2/5] nvmet-rdma: add srq pointer to rdma_cmd Max Gurtovoy
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 13:40 UTC (permalink / raw)
  To: linux-nvme, sagi, hch, loberman, bvanassche, linux-rdma
  Cc: vladimirk, shlomin, leonro, dledford, jgg, oren, kbusch,
	Max Gurtovoy, idanb

ULPs can use this API to create/destroy SRQs with the same
characteristics in order to implement logic that aims to save resources
without a significant performance penalty (e.g. create an SRQ per
completion vector and use shared receive buffers for multiple
controllers of the ULP).
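
For illustration, the expected usage by a ULP is along these lines (a
sketch only; srq_size and num_srqs are placeholder values chosen by the
ULP):

	struct ib_srq_init_attr srq_attr = {
		.attr.max_wr	= srq_size,
		.attr.max_sge	= 2,
		.srq_type	= IB_SRQT_BASIC,
	};

	/* device setup: create num_srqs identical SRQs on the PD */
	ret = rdma_srq_set_init(pd, num_srqs, &srq_attr);

	/* queue setup: take an SRQ out of the set (NULL if the set is empty) */
	srq = rdma_srq_get(pd);

	/* queue teardown: return the SRQ to the set */
	rdma_srq_put(pd, srq);

	/* device teardown: destroy all SRQs that are back in the set */
	rdma_srq_set_destroy(pd);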

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/infiniband/core/Makefile  |  2 +-
 drivers/infiniband/core/srq_set.c | 78 +++++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/verbs.c   |  4 ++
 include/rdma/ib_verbs.h           |  5 +++
 include/rdma/srq_set.h            | 18 +++++++++
 5 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100644 drivers/infiniband/core/srq_set.c
 create mode 100644 include/rdma/srq_set.h

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index d1b14887..1d3eaec 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -12,7 +12,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
 				roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
 				multicast.o mad.o smi.o agent.o mad_rmpp.o \
 				nldev.o restrack.o counters.o ib_core_uverbs.o \
-				trace.o
+				trace.o srq_set.o
 
 ib_core-$(CONFIG_SECURITY_INFINIBAND) += security.o
 ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
diff --git a/drivers/infiniband/core/srq_set.c b/drivers/infiniband/core/srq_set.c
new file mode 100644
index 0000000..d143561
--- /dev/null
+++ b/drivers/infiniband/core/srq_set.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2020 Mellanox Technologies. All rights reserved.
+ */
+
+#include <rdma/srq_set.h>
+
+struct ib_srq *rdma_srq_get(struct ib_pd *pd)
+{
+	struct ib_srq *srq;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pd->srq_lock, flags);
+	srq = list_first_entry_or_null(&pd->srqs, struct ib_srq, pd_entry);
+	if (srq) {
+		list_del(&srq->pd_entry);
+		pd->srqs_used++;
+	}
+	spin_unlock_irqrestore(&pd->srq_lock, flags);
+
+	return srq;
+}
+EXPORT_SYMBOL(rdma_srq_get);
+
+void rdma_srq_put(struct ib_pd *pd, struct ib_srq *srq)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&pd->srq_lock, flags);
+	list_add(&srq->pd_entry, &pd->srqs);
+	pd->srqs_used--;
+	spin_unlock_irqrestore(&pd->srq_lock, flags);
+}
+EXPORT_SYMBOL(rdma_srq_put);
+
+int rdma_srq_set_init(struct ib_pd *pd, int nr,
+		struct ib_srq_init_attr *srq_attr)
+{
+	struct ib_srq *srq;
+	unsigned long flags;
+	int ret, i;
+
+	for (i = 0; i < nr; i++) {
+		srq = ib_create_srq(pd, srq_attr);
+		if (IS_ERR(srq)) {
+			ret = PTR_ERR(srq);
+			goto out;
+		}
+
+		spin_lock_irqsave(&pd->srq_lock, flags);
+		list_add_tail(&srq->pd_entry, &pd->srqs);
+		spin_unlock_irqrestore(&pd->srq_lock, flags);
+	}
+
+	return 0;
+out:
+	rdma_srq_set_destroy(pd);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_srq_set_init);
+
+void rdma_srq_set_destroy(struct ib_pd *pd)
+{
+	struct ib_srq *srq;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pd->srq_lock, flags);
+	while (!list_empty(&pd->srqs)) {
+		srq = list_first_entry(&pd->srqs, struct ib_srq, pd_entry);
+		list_del(&srq->pd_entry);
+
+		spin_unlock_irqrestore(&pd->srq_lock, flags);
+		ib_destroy_srq(srq);
+		spin_lock_irqsave(&pd->srq_lock, flags);
+	}
+	spin_unlock_irqrestore(&pd->srq_lock, flags);
+}
+EXPORT_SYMBOL(rdma_srq_set_destroy);
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index e62c9df..6950abf 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -272,6 +272,9 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
 	pd->__internal_mr = NULL;
 	atomic_set(&pd->usecnt, 0);
 	pd->flags = flags;
+	pd->srqs_used = 0;
+	spin_lock_init(&pd->srq_lock);
+	INIT_LIST_HEAD(&pd->srqs);
 
 	pd->res.type = RDMA_RESTRACK_PD;
 	rdma_restrack_set_task(&pd->res, caller);
@@ -340,6 +343,7 @@ void ib_dealloc_pd_user(struct ib_pd *pd, struct ib_udata *udata)
 		pd->__internal_mr = NULL;
 	}
 
+	WARN_ON_ONCE(pd->srqs_used > 0);
 	/* uverbs manipulates usecnt with proper locking, while the kabi
 	   requires the caller to guarantee we can't race here. */
 	WARN_ON(atomic_read(&pd->usecnt));
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 1f779fa..fc8207d 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1517,6 +1517,10 @@ struct ib_pd {
 
 	u32			unsafe_global_rkey;
 
+	spinlock_t		srq_lock;
+	int			srqs_used;
+	struct list_head	srqs;
+
 	/*
 	 * Implementation details of the RDMA core, don't use in drivers:
 	 */
@@ -1585,6 +1589,7 @@ struct ib_srq {
 	void		       *srq_context;
 	enum ib_srq_type	srq_type;
 	atomic_t		usecnt;
+	struct list_head	pd_entry; /* srq set entry */
 
 	struct {
 		struct ib_cq   *cq;
diff --git a/include/rdma/srq_set.h b/include/rdma/srq_set.h
new file mode 100644
index 0000000..834c4c6
--- /dev/null
+++ b/include/rdma/srq_set.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB) */
+/*
+ * Copyright (c) 2020 Mellanox Technologies. All rights reserved.
+ */
+
+#ifndef _RDMA_SRQ_SET_H
+#define _RDMA_SRQ_SET_H 1
+
+#include <rdma/ib_verbs.h>
+
+struct ib_srq *rdma_srq_get(struct ib_pd *pd);
+void rdma_srq_put(struct ib_pd *pd, struct ib_srq *srq);
+
+int rdma_srq_set_init(struct ib_pd *pd, int nr,
+		struct ib_srq_init_attr *srq_attr);
+void rdma_srq_set_destroy(struct ib_pd *pd);
+
+#endif /* _RDMA_SRQ_SET_H */
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/5] nvmet-rdma: add srq pointer to rdma_cmd
  2020-03-17 13:40 [PATCH 0/5] nvmet-rdma/srpt: SRQ per completion vector Max Gurtovoy
  2020-03-17 13:40 ` [PATCH 1/5] IB/core: add a simple SRQ set per PD Max Gurtovoy
@ 2020-03-17 13:40 ` Max Gurtovoy
  2020-03-17 13:40 ` [PATCH 3/5] nvmet-rdma: use SRQ per completion vector Max Gurtovoy
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 13:40 UTC (permalink / raw)
  To: linux-nvme, sagi, hch, loberman, bvanassche, linux-rdma
  Cc: vladimirk, shlomin, leonro, dledford, jgg, oren, kbusch,
	Max Gurtovoy, idanb

This is a preparation patch for the SRQ per completion vector feature.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/target/rdma.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 37d262a..04062af 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -38,6 +38,7 @@ struct nvmet_rdma_cmd {
 	struct scatterlist	inline_sg[NVMET_RDMA_MAX_INLINE_SGE];
 	struct nvme_command     *nvme_cmd;
 	struct nvmet_rdma_queue	*queue;
+	struct ib_srq		*srq;
 };
 
 enum {
@@ -461,8 +462,8 @@ static int nvmet_rdma_post_recv(struct nvmet_rdma_device *ndev,
 		cmd->sge[0].addr, cmd->sge[0].length,
 		DMA_FROM_DEVICE);
 
-	if (ndev->srq)
-		ret = ib_post_srq_recv(ndev->srq, &cmd->wr, NULL);
+	if (cmd->srq)
+		ret = ib_post_srq_recv(cmd->srq, &cmd->wr, NULL);
 	else
 		ret = ib_post_recv(cmd->queue->cm_id->qp, &cmd->wr, NULL);
 
@@ -882,6 +883,7 @@ static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
 	ndev->srq_size = srq_size;
 
 	for (i = 0; i < srq_size; i++) {
+		ndev->srq_cmds[i].srq = srq;
 		ret = nvmet_rdma_post_recv(ndev, &ndev->srq_cmds[i]);
 		if (ret)
 			goto out_free_cmds;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/5] nvmet-rdma: use SRQ per completion vector
  2020-03-17 13:40 [PATCH 0/5] nvmet-rdma/srpt: SRQ per completion vector Max Gurtovoy
  2020-03-17 13:40 ` [PATCH 1/5] IB/core: add a simple SRQ set per PD Max Gurtovoy
  2020-03-17 13:40 ` [PATCH 2/5] nvmet-rdma: add srq pointer to rdma_cmd Max Gurtovoy
@ 2020-03-17 13:40 ` Max Gurtovoy
  2020-03-18  6:53   ` Leon Romanovsky
  2020-03-17 13:40 ` [PATCH 4/5] IB/core: cache the CQ " Max Gurtovoy
  2020-03-17 13:40 ` [PATCH 5/5] RDMA/srpt: use SRQ per " Max Gurtovoy
  4 siblings, 1 reply; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 13:40 UTC (permalink / raw)
  To: linux-nvme, sagi, hch, loberman, bvanassche, linux-rdma
  Cc: vladimirk, shlomin, leonro, dledford, jgg, oren, kbusch,
	Max Gurtovoy, idanb

In order to save resource allocation and utilize completion locality in
a better way (compared to the single SRQ per device that exists today),
allocate Shared Receive Queues (SRQs) per completion vector. Associate
each created QP/CQ with an appropriate SRQ according to the queue index.
This association reduces the lock contention in the fast path (compared
to the SRQ-per-device solution) and increases the locality of memory
buffers. Add a new module parameter for the SRQ size so that it can be
adjusted to the expected load. Users should make sure the size is >= 256
to avoid running out of resources.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/target/rdma.c | 204 +++++++++++++++++++++++++++++++++------------
 1 file changed, 153 insertions(+), 51 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 04062af..f560257 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -18,6 +18,7 @@
 #include <asm/unaligned.h>
 
 #include <rdma/ib_verbs.h>
+#include <rdma/srq_set.h>
 #include <rdma/rdma_cm.h>
 #include <rdma/rw.h>
 
@@ -31,6 +32,8 @@
 #define NVMET_RDMA_MAX_INLINE_SGE		4
 #define NVMET_RDMA_MAX_INLINE_DATA_SIZE		max_t(int, SZ_16K, PAGE_SIZE)
 
+struct nvmet_rdma_srq;
+
 struct nvmet_rdma_cmd {
 	struct ib_sge		sge[NVMET_RDMA_MAX_INLINE_SGE + 1];
 	struct ib_cqe		cqe;
@@ -38,7 +41,7 @@ struct nvmet_rdma_cmd {
 	struct scatterlist	inline_sg[NVMET_RDMA_MAX_INLINE_SGE];
 	struct nvme_command     *nvme_cmd;
 	struct nvmet_rdma_queue	*queue;
-	struct ib_srq		*srq;
+	struct nvmet_rdma_srq   *nsrq;
 };
 
 enum {
@@ -80,6 +83,7 @@ struct nvmet_rdma_queue {
 	struct ib_cq		*cq;
 	atomic_t		sq_wr_avail;
 	struct nvmet_rdma_device *dev;
+	struct nvmet_rdma_srq   *nsrq;
 	spinlock_t		state_lock;
 	enum nvmet_rdma_queue_state state;
 	struct nvmet_cq		nvme_cq;
@@ -97,17 +101,24 @@ struct nvmet_rdma_queue {
 
 	int			idx;
 	int			host_qid;
+	int			comp_vector;
 	int			recv_queue_size;
 	int			send_queue_size;
 
 	struct list_head	queue_list;
 };
 
+struct nvmet_rdma_srq {
+	struct ib_srq            *srq;
+	struct nvmet_rdma_cmd    *cmds;
+	struct nvmet_rdma_device *ndev;
+};
+
 struct nvmet_rdma_device {
 	struct ib_device	*device;
 	struct ib_pd		*pd;
-	struct ib_srq		*srq;
-	struct nvmet_rdma_cmd	*srq_cmds;
+	struct nvmet_rdma_srq	**srqs;
+	int			srq_count;
 	size_t			srq_size;
 	struct kref		ref;
 	struct list_head	entry;
@@ -119,6 +130,16 @@ struct nvmet_rdma_device {
 module_param_named(use_srq, nvmet_rdma_use_srq, bool, 0444);
 MODULE_PARM_DESC(use_srq, "Use shared receive queue.");
 
+static int srq_size_set(const char *val, const struct kernel_param *kp);
+static const struct kernel_param_ops srq_size_ops = {
+	.set = srq_size_set,
+	.get = param_get_int,
+};
+
+static int nvmet_rdma_srq_size = 1024;
+module_param_cb(srq_size, &srq_size_ops, &nvmet_rdma_srq_size, 0644);
+MODULE_PARM_DESC(srq_size, "set Shared Receive Queue (SRQ) size, should >= 256 (default: 1024)");
+
 static DEFINE_IDA(nvmet_rdma_queue_ida);
 static LIST_HEAD(nvmet_rdma_queue_list);
 static DEFINE_MUTEX(nvmet_rdma_queue_mutex);
@@ -139,6 +160,17 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev,
 
 static const struct nvmet_fabrics_ops nvmet_rdma_ops;
 
+static int srq_size_set(const char *val, const struct kernel_param *kp)
+{
+	int n = 0, ret;
+
+	ret = kstrtoint(val, 10, &n);
+	if (ret != 0 || n < 256)
+		return -EINVAL;
+
+	return param_set_int(val, kp);
+}
+
 static int num_pages(int len)
 {
 	return 1 + (((len - 1) & PAGE_MASK) >> PAGE_SHIFT);
@@ -462,8 +494,8 @@ static int nvmet_rdma_post_recv(struct nvmet_rdma_device *ndev,
 		cmd->sge[0].addr, cmd->sge[0].length,
 		DMA_FROM_DEVICE);
 
-	if (cmd->srq)
-		ret = ib_post_srq_recv(cmd->srq, &cmd->wr, NULL);
+	if (cmd->nsrq)
+		ret = ib_post_srq_recv(cmd->nsrq->srq, &cmd->wr, NULL);
 	else
 		ret = ib_post_recv(cmd->queue->cm_id->qp, &cmd->wr, NULL);
 
@@ -841,30 +873,82 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 	nvmet_rdma_handle_command(queue, rsp);
 }
 
-static void nvmet_rdma_destroy_srq(struct nvmet_rdma_device *ndev)
+static void nvmet_rdma_destroy_srq(struct nvmet_rdma_srq *nsrq)
+{
+	nvmet_rdma_free_cmds(nsrq->ndev, nsrq->cmds, nsrq->ndev->srq_size, false);
+	rdma_srq_put(nsrq->ndev->pd, nsrq->srq);
+
+	kfree(nsrq);
+}
+
+static void nvmet_rdma_destroy_srqs(struct nvmet_rdma_device *ndev)
 {
-	if (!ndev->srq)
+	int i;
+
+	if (!ndev->srqs)
 		return;
 
-	nvmet_rdma_free_cmds(ndev, ndev->srq_cmds, ndev->srq_size, false);
-	ib_destroy_srq(ndev->srq);
+	for (i = 0; i < ndev->srq_count; i++)
+		nvmet_rdma_destroy_srq(ndev->srqs[i]);
+
+	rdma_srq_set_destroy(ndev->pd);
+	kfree(ndev->srqs);
+	ndev->srqs = NULL;
+	ndev->srq_count = 0;
+	ndev->srq_size = 0;
 }
 
-static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
+static struct nvmet_rdma_srq *
+nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
 {
-	struct ib_srq_init_attr srq_attr = { NULL, };
+	size_t srq_size = ndev->srq_size;
+	struct nvmet_rdma_srq *nsrq;
 	struct ib_srq *srq;
-	size_t srq_size;
 	int ret, i;
 
-	srq_size = 4095;	/* XXX: tune */
+	nsrq = kzalloc(sizeof(*nsrq), GFP_KERNEL);
+	if (!nsrq)
+		return ERR_PTR(-ENOMEM);
 
-	srq_attr.attr.max_wr = srq_size;
-	srq_attr.attr.max_sge = 1 + ndev->inline_page_count;
-	srq_attr.attr.srq_limit = 0;
-	srq_attr.srq_type = IB_SRQT_BASIC;
-	srq = ib_create_srq(ndev->pd, &srq_attr);
-	if (IS_ERR(srq)) {
+	srq = rdma_srq_get(ndev->pd);
+	if (!srq) {
+		ret = -EAGAIN;
+		goto out_free_nsrq;
+	}
+
+	nsrq->cmds = nvmet_rdma_alloc_cmds(ndev, srq_size, false);
+	if (IS_ERR(nsrq->cmds)) {
+		ret = PTR_ERR(nsrq->cmds);
+		goto out_put_srq;
+	}
+
+	nsrq->srq = srq;
+	nsrq->ndev = ndev;
+
+	for (i = 0; i < srq_size; i++) {
+		nsrq->cmds[i].nsrq = nsrq;
+		ret = nvmet_rdma_post_recv(ndev, &nsrq->cmds[i]);
+		if (ret)
+			goto out_free_cmds;
+	}
+
+	return nsrq;
+
+out_free_cmds:
+	nvmet_rdma_free_cmds(ndev, nsrq->cmds, srq_size, false);
+out_put_srq:
+	rdma_srq_put(ndev->pd, srq);
+out_free_nsrq:
+	kfree(nsrq);
+	return ERR_PTR(ret);
+}
+
+static int nvmet_rdma_init_srqs(struct nvmet_rdma_device *ndev)
+{
+	struct ib_srq_init_attr srq_attr = { NULL, };
+	int i, ret;
+
+	if (!ndev->device->attrs.max_srq_wr || !ndev->device->attrs.max_srq) {
 		/*
 		 * If SRQs aren't supported we just go ahead and use normal
 		 * non-shared receive queues.
@@ -873,31 +957,44 @@ static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
 		return 0;
 	}
 
-	ndev->srq_cmds = nvmet_rdma_alloc_cmds(ndev, srq_size, false);
-	if (IS_ERR(ndev->srq_cmds)) {
-		ret = PTR_ERR(ndev->srq_cmds);
-		goto out_destroy_srq;
-	}
+	ndev->srq_size = min(ndev->device->attrs.max_srq_wr,
+			     nvmet_rdma_srq_size);
+	ndev->srq_count = min(ndev->device->num_comp_vectors,
+			      ndev->device->attrs.max_srq);
 
-	ndev->srq = srq;
-	ndev->srq_size = srq_size;
+	ndev->srqs = kcalloc(ndev->srq_count, sizeof(*ndev->srqs), GFP_KERNEL);
+	if (!ndev->srqs)
+		return -ENOMEM;
 
-	for (i = 0; i < srq_size; i++) {
-		ndev->srq_cmds[i].srq = srq;
-		ret = nvmet_rdma_post_recv(ndev, &ndev->srq_cmds[i]);
-		if (ret)
-			goto out_free_cmds;
+	srq_attr.attr.max_wr = ndev->srq_size;
+	srq_attr.attr.max_sge = 2;
+	srq_attr.attr.srq_limit = 0;
+	srq_attr.srq_type = IB_SRQT_BASIC;
+	ret = rdma_srq_set_init(ndev->pd, ndev->srq_count, &srq_attr);
+	if (ret)
+		goto err_free;
+
+	for (i = 0; i < ndev->srq_count; i++) {
+		ndev->srqs[i] = nvmet_rdma_init_srq(ndev);
+		if (IS_ERR(ndev->srqs[i]))
+			goto err_srq;
 	}
 
 	return 0;
 
-out_free_cmds:
-	nvmet_rdma_free_cmds(ndev, ndev->srq_cmds, ndev->srq_size, false);
-out_destroy_srq:
-	ib_destroy_srq(srq);
+err_srq:
+	while (--i >= 0)
+		nvmet_rdma_destroy_srq(ndev->srqs[i]);
+	rdma_srq_set_destroy(ndev->pd);
+err_free:
+	kfree(ndev->srqs);
+	ndev->srqs = NULL;
+	ndev->srq_count = 0;
+	ndev->srq_size = 0;
 	return ret;
 }
 
+
 static void nvmet_rdma_free_dev(struct kref *ref)
 {
 	struct nvmet_rdma_device *ndev =
@@ -907,7 +1004,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
 	list_del(&ndev->entry);
 	mutex_unlock(&device_list_mutex);
 
-	nvmet_rdma_destroy_srq(ndev);
+	nvmet_rdma_destroy_srqs(ndev);
 	ib_dealloc_pd(ndev->pd);
 
 	kfree(ndev);
@@ -953,7 +1050,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
 		goto out_free_dev;
 
 	if (nvmet_rdma_use_srq) {
-		ret = nvmet_rdma_init_srq(ndev);
+		ret = nvmet_rdma_init_srqs(ndev);
 		if (ret)
 			goto out_free_pd;
 	}
@@ -977,14 +1074,8 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 {
 	struct ib_qp_init_attr qp_attr;
 	struct nvmet_rdma_device *ndev = queue->dev;
-	int comp_vector, nr_cqe, ret, i;
+	int nr_cqe, ret, i;
 
-	/*
-	 * Spread the io queues across completion vectors,
-	 * but still keep all admin queues on vector 0.
-	 */
-	comp_vector = !queue->host_qid ? 0 :
-		queue->idx % ndev->device->num_comp_vectors;
 
 	/*
 	 * Reserve CQ slots for RECV + RDMA_READ/RDMA_WRITE + RDMA_SEND.
@@ -992,7 +1083,7 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 	nr_cqe = queue->recv_queue_size + 2 * queue->send_queue_size;
 
 	queue->cq = ib_alloc_cq(ndev->device, queue,
-			nr_cqe + 1, comp_vector,
+			nr_cqe + 1, queue->comp_vector,
 			IB_POLL_WORKQUEUE);
 	if (IS_ERR(queue->cq)) {
 		ret = PTR_ERR(queue->cq);
@@ -1014,8 +1105,8 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 	qp_attr.cap.max_send_sge = max(ndev->device->attrs.max_sge_rd,
 					ndev->device->attrs.max_send_sge);
 
-	if (ndev->srq) {
-		qp_attr.srq = ndev->srq;
+	if (queue->nsrq) {
+		qp_attr.srq = queue->nsrq->srq;
 	} else {
 		/* +1 for drain */
 		qp_attr.cap.max_recv_wr = 1 + queue->recv_queue_size;
@@ -1034,7 +1125,7 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 		 __func__, queue->cq->cqe, qp_attr.cap.max_send_sge,
 		 qp_attr.cap.max_send_wr, queue->cm_id);
 
-	if (!ndev->srq) {
+	if (!queue->nsrq) {
 		for (i = 0; i < queue->recv_queue_size; i++) {
 			queue->cmds[i].queue = queue;
 			ret = nvmet_rdma_post_recv(ndev, &queue->cmds[i]);
@@ -1070,7 +1161,7 @@ static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
 	nvmet_sq_destroy(&queue->nvme_sq);
 
 	nvmet_rdma_destroy_queue_ib(queue);
-	if (!queue->dev->srq) {
+	if (!queue->nsrq) {
 		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
 				queue->recv_queue_size,
 				!queue->host_qid);
@@ -1182,13 +1273,22 @@ static int nvmet_rdma_cm_reject(struct rdma_cm_id *cm_id,
 		goto out_destroy_sq;
 	}
 
+	/*
+	 * Spread the io queues across completion vectors,
+	 * but still keep all admin queues on vector 0.
+	 */
+	queue->comp_vector = !queue->host_qid ? 0 :
+		queue->idx % ndev->device->num_comp_vectors;
+
 	ret = nvmet_rdma_alloc_rsps(queue);
 	if (ret) {
 		ret = NVME_RDMA_CM_NO_RSC;
 		goto out_ida_remove;
 	}
 
-	if (!ndev->srq) {
+	if (ndev->srqs) {
+		queue->nsrq = ndev->srqs[queue->comp_vector % ndev->srq_count];
+	} else {
 		queue->cmds = nvmet_rdma_alloc_cmds(ndev,
 				queue->recv_queue_size,
 				!queue->host_qid);
@@ -1209,10 +1309,12 @@ static int nvmet_rdma_cm_reject(struct rdma_cm_id *cm_id,
 	return queue;
 
 out_free_cmds:
-	if (!ndev->srq) {
+	if (!queue->nsrq) {
 		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
 				queue->recv_queue_size,
 				!queue->host_qid);
+	} else {
+		queue->nsrq = NULL;
 	}
 out_free_responses:
 	nvmet_rdma_free_rsps(queue);
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/5] IB/core: cache the CQ completion vector
  2020-03-17 13:40 [PATCH 0/5] nvmet-rdma/srpt: SRQ per completion vector Max Gurtovoy
                   ` (2 preceding siblings ...)
  2020-03-17 13:40 ` [PATCH 3/5] nvmet-rdma: use SRQ per completion vector Max Gurtovoy
@ 2020-03-17 13:40 ` Max Gurtovoy
  2020-03-17 15:19   ` Chuck Lever
  2020-03-17 13:40 ` [PATCH 5/5] RDMA/srpt: use SRQ per " Max Gurtovoy
  4 siblings, 1 reply; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 13:40 UTC (permalink / raw)
  To: linux-nvme, sagi, hch, loberman, bvanassche, linux-rdma
  Cc: vladimirk, shlomin, leonro, dledford, jgg, oren, kbusch,
	Max Gurtovoy, idanb

In some cases, e.g. when using ib_alloc_cq_any(), one would like to know
which completion vector was eventually assigned to the CQ. Cache this
value during CQ creation.
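
For example, a ULP that allocates its CQs with ib_alloc_cq_any() can use
the cached value to spread per-vector resources (a sketch of the pattern
the srpt patch in this series uses; srqs/srq_count are ULP fields, not
part of this patch):

	cq = ib_alloc_cq_any(ibdev, queue, nr_cqe, IB_POLL_WORKQUEUE);
	if (IS_ERR(cq))
		return PTR_ERR(cq);

	/* pick the SRQ that matches the vector the core chose for this CQ */
	queue->srq = srqs[cq->comp_vector % srq_count];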

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/infiniband/core/cq.c | 1 +
 include/rdma/ib_verbs.h      | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 4f25b24..a7cbf52 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -217,6 +217,7 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
 	cq->device = dev;
 	cq->cq_context = private;
 	cq->poll_ctx = poll_ctx;
+	cq->comp_vector = comp_vector;
 	atomic_set(&cq->usecnt, 0);
 
 	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index fc8207d..0d61772 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1558,6 +1558,7 @@ struct ib_cq {
 	struct ib_device       *device;
 	struct ib_ucq_object   *uobject;
 	ib_comp_handler   	comp_handler;
+	u32			comp_vector;
 	void                  (*event_handler)(struct ib_event *, void *);
 	void                   *cq_context;
 	int               	cqe;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 5/5] RDMA/srpt: use SRQ per completion vector
  2020-03-17 13:40 [PATCH 0/5] nvmet-rdma/srpt: SRQ per completion vector Max Gurtovoy
                   ` (3 preceding siblings ...)
  2020-03-17 13:40 ` [PATCH 4/5] IB/core: cache the CQ " Max Gurtovoy
@ 2020-03-17 13:40 ` Max Gurtovoy
  2020-03-17 13:58   ` Leon Romanovsky
  2020-03-17 19:58   ` Leon Romanovsky
  4 siblings, 2 replies; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 13:40 UTC (permalink / raw)
  To: linux-nvme, sagi, hch, loberman, bvanassche, linux-rdma
  Cc: vladimirk, shlomin, leonro, dledford, jgg, oren, kbusch,
	Max Gurtovoy, idanb

In order to save resource allocation and utilize completion locality in
a better way (compared to the single SRQ per device that exists today),
allocate Shared Receive Queues (SRQs) per completion vector. Associate
each created channel with an appropriate SRQ according to the completion
vector index. This association reduces the lock contention in the fast
path (compared to the SRQ-per-device solution) and increases the
locality of memory buffers.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/infiniband/ulp/srpt/ib_srpt.c | 169 +++++++++++++++++++++++++---------
 drivers/infiniband/ulp/srpt/ib_srpt.h |  26 +++++-
 2 files changed, 148 insertions(+), 47 deletions(-)

diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 9855274..34869b7 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -811,6 +811,31 @@ static bool srpt_test_and_set_cmd_state(struct srpt_send_ioctx *ioctx,
 }
 
 /**
+ * srpt_srq_post_recv - post an initial IB receive request for SRQ
+ * @srq: SRPT SRQ context.
+ * @ioctx: Receive I/O context pointer.
+ */
+static int srpt_srq_post_recv(struct srpt_srq *srq, struct srpt_recv_ioctx *ioctx)
+{
+	struct srpt_device *sdev = srq->sdev;
+	struct ib_sge list;
+	struct ib_recv_wr wr;
+
+	BUG_ON(!srq);
+	list.addr = ioctx->ioctx.dma + ioctx->ioctx.offset;
+	list.length = srp_max_req_size;
+	list.lkey = sdev->lkey;
+
+	ioctx->ioctx.cqe.done = srpt_recv_done;
+	wr.wr_cqe = &ioctx->ioctx.cqe;
+	wr.next = NULL;
+	wr.sg_list = &list;
+	wr.num_sge = 1;
+
+	return ib_post_srq_recv(srq->ibsrq, &wr, NULL);
+}
+
+/**
  * srpt_post_recv - post an IB receive request
  * @sdev: SRPT HCA pointer.
  * @ch: SRPT RDMA channel.
@@ -823,6 +848,7 @@ static int srpt_post_recv(struct srpt_device *sdev, struct srpt_rdma_ch *ch,
 	struct ib_recv_wr wr;
 
 	BUG_ON(!sdev);
+	BUG_ON(!ch);
 	list.addr = ioctx->ioctx.dma + ioctx->ioctx.offset;
 	list.length = srp_max_req_size;
 	list.lkey = sdev->lkey;
@@ -834,7 +860,7 @@ static int srpt_post_recv(struct srpt_device *sdev, struct srpt_rdma_ch *ch,
 	wr.num_sge = 1;
 
 	if (sdev->use_srq)
-		return ib_post_srq_recv(sdev->srq, &wr, NULL);
+		return ib_post_srq_recv(ch->srq->ibsrq, &wr, NULL);
 	else
 		return ib_post_recv(ch->qp, &wr, NULL);
 }
@@ -1820,7 +1846,8 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 					SRPT_MAX_SG_PER_WQE);
 	qp_init->port_num = ch->sport->port;
 	if (sdev->use_srq) {
-		qp_init->srq = sdev->srq;
+		ch->srq = sdev->srqs[ch->cq->comp_vector % sdev->srq_count];
+		qp_init->srq = ch->srq->ibsrq;
 	} else {
 		qp_init->cap.max_recv_wr = ch->rq_size;
 		qp_init->cap.max_recv_sge = min(attrs->max_recv_sge,
@@ -1878,6 +1905,8 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 
 static void srpt_destroy_ch_ib(struct srpt_rdma_ch *ch)
 {
+	if (ch->srq)
+		ch->srq = NULL;
 	ib_destroy_qp(ch->qp);
 	ib_free_cq(ch->cq);
 }
@@ -3018,20 +3047,75 @@ static struct se_wwn *srpt_lookup_wwn(const char *name)
 	return wwn;
 }
 
-static void srpt_free_srq(struct srpt_device *sdev)
+static void srpt_free_srq(struct srpt_srq *srq)
 {
-	if (!sdev->srq)
-		return;
 
-	ib_destroy_srq(sdev->srq);
-	srpt_free_ioctx_ring((struct srpt_ioctx **)sdev->ioctx_ring, sdev,
-			     sdev->srq_size, sdev->req_buf_cache,
+	srpt_free_ioctx_ring((struct srpt_ioctx **)srq->ioctx_ring, srq->sdev,
+			     srq->sdev->srq_size, srq->sdev->req_buf_cache,
 			     DMA_FROM_DEVICE);
+	rdma_srq_put(srq->sdev->pd, srq->ibsrq);
+	kfree(srq);
+
+}
+
+static void srpt_free_srqs(struct srpt_device *sdev)
+{
+	int i;
+
+	if (!sdev->srqs)
+		return;
+
+	for (i = 0; i < sdev->srq_count; i++)
+		srpt_free_srq(sdev->srqs[i]);
 	kmem_cache_destroy(sdev->req_buf_cache);
-	sdev->srq = NULL;
+	rdma_srq_set_destroy(sdev->pd);
+	kfree(sdev->srqs);
+	sdev->srqs = NULL;
 }
 
-static int srpt_alloc_srq(struct srpt_device *sdev)
+static struct srpt_srq *srpt_alloc_srq(struct srpt_device *sdev)
+{
+	struct srpt_srq	*srq;
+	int i, ret;
+
+	srq = kzalloc(sizeof(*srq), GFP_KERNEL);
+	if (!srq) {
+		pr_debug("failed to allocate SRQ context\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	srq->ibsrq = rdma_srq_get(sdev->pd);
+	if (!srq) {
+		ret = -EAGAIN;
+		goto free_srq;
+	}
+
+	srq->ioctx_ring = (struct srpt_recv_ioctx **)
+		srpt_alloc_ioctx_ring(sdev, sdev->srq_size,
+				      sizeof(*srq->ioctx_ring[0]),
+				      sdev->req_buf_cache, 0, DMA_FROM_DEVICE);
+	if (!srq->ioctx_ring) {
+		ret = -ENOMEM;
+		goto put_srq;
+	}
+
+	srq->sdev = sdev;
+
+	for (i = 0; i < sdev->srq_size; ++i) {
+		INIT_LIST_HEAD(&srq->ioctx_ring[i]->wait_list);
+		srpt_srq_post_recv(srq, srq->ioctx_ring[i]);
+	}
+
+	return srq;
+
+put_srq:
+	rdma_srq_put(sdev->pd, srq->ibsrq);
+free_srq:
+	kfree(srq);
+	return ERR_PTR(ret);
+}
+
+static int srpt_alloc_srqs(struct srpt_device *sdev)
 {
 	struct ib_srq_init_attr srq_attr = {
 		.event_handler = srpt_srq_event,
@@ -3041,46 +3125,45 @@ static int srpt_alloc_srq(struct srpt_device *sdev)
 		.srq_type = IB_SRQT_BASIC,
 	};
 	struct ib_device *device = sdev->device;
-	struct ib_srq *srq;
-	int i;
+	int i, j, ret;
 
-	WARN_ON_ONCE(sdev->srq);
-	srq = ib_create_srq(sdev->pd, &srq_attr);
-	if (IS_ERR(srq)) {
-		pr_debug("ib_create_srq() failed: %ld\n", PTR_ERR(srq));
-		return PTR_ERR(srq);
-	}
+	WARN_ON_ONCE(sdev->srqs);
+	sdev->srqs = kcalloc(sdev->srq_count, sizeof(*sdev->srqs), GFP_KERNEL);
+	if (!sdev->srqs)
+		return -ENOMEM;
 
-	pr_debug("create SRQ #wr= %d max_allow=%d dev= %s\n", sdev->srq_size,
-		 sdev->device->attrs.max_srq_wr, dev_name(&device->dev));
+	pr_debug("create SRQ set #wr= %d max_allow=%d dev= %s\n",
+		 sdev->srq_size, sdev->device->attrs.max_srq_wr,
+		 dev_name(&device->dev));
+
+	ret = rdma_srq_set_init(sdev->pd, sdev->srq_count, &srq_attr);
+	if (ret)
+		goto out_free;
 
 	sdev->req_buf_cache = kmem_cache_create("srpt-srq-req-buf",
 						srp_max_req_size, 0, 0, NULL);
 	if (!sdev->req_buf_cache)
-		goto free_srq;
+		goto out_free_set;
 
-	sdev->ioctx_ring = (struct srpt_recv_ioctx **)
-		srpt_alloc_ioctx_ring(sdev, sdev->srq_size,
-				      sizeof(*sdev->ioctx_ring[0]),
-				      sdev->req_buf_cache, 0, DMA_FROM_DEVICE);
-	if (!sdev->ioctx_ring)
-		goto free_cache;
+	for (i = 0; i < sdev->srq_count; i++) {
+		sdev->srqs[i] = srpt_alloc_srq(sdev);
+		if (IS_ERR(sdev->srqs[i]))
+			goto free_srq;
+	}
 
 	sdev->use_srq = true;
-	sdev->srq = srq;
-
-	for (i = 0; i < sdev->srq_size; ++i) {
-		INIT_LIST_HEAD(&sdev->ioctx_ring[i]->wait_list);
-		srpt_post_recv(sdev, NULL, sdev->ioctx_ring[i]);
-	}
 
 	return 0;
 
-free_cache:
-	kmem_cache_destroy(sdev->req_buf_cache);
-
 free_srq:
-	ib_destroy_srq(srq);
+	for (j = 0; j < i; j++)
+		srpt_free_srq(sdev->srqs[j]);
+	kmem_cache_destroy(sdev->req_buf_cache);
+out_free_set:
+	rdma_srq_set_destroy(sdev->pd);
+out_free:
+	kfree(sdev->srqs);
+	sdev->srqs = NULL;
 	return -ENOMEM;
 }
 
@@ -3090,10 +3173,10 @@ static int srpt_use_srq(struct srpt_device *sdev, bool use_srq)
 	int ret = 0;
 
 	if (!use_srq) {
-		srpt_free_srq(sdev);
+		srpt_free_srqs(sdev);
 		sdev->use_srq = false;
-	} else if (use_srq && !sdev->srq) {
-		ret = srpt_alloc_srq(sdev);
+	} else if (use_srq && !sdev->srqs) {
+		ret = srpt_alloc_srqs(sdev);
 	}
 	pr_debug("%s(%s): use_srq = %d; ret = %d\n", __func__,
 		 dev_name(&device->dev), sdev->use_srq, ret);
@@ -3127,6 +3210,8 @@ static void srpt_add_one(struct ib_device *device)
 	sdev->lkey = sdev->pd->local_dma_lkey;
 
 	sdev->srq_size = min(srpt_srq_size, sdev->device->attrs.max_srq_wr);
+	sdev->srq_count = min(sdev->device->num_comp_vectors,
+			      sdev->device->attrs.max_srq);
 
 	srpt_use_srq(sdev, sdev->port[0].port_attrib.use_srq);
 
@@ -3204,7 +3289,7 @@ static void srpt_add_one(struct ib_device *device)
 	if (sdev->cm_id)
 		ib_destroy_cm_id(sdev->cm_id);
 err_ring:
-	srpt_free_srq(sdev);
+	srpt_free_srqs(sdev);
 	ib_dealloc_pd(sdev->pd);
 free_dev:
 	kfree(sdev);
@@ -3255,7 +3340,7 @@ static void srpt_remove_one(struct ib_device *device, void *client_data)
 	for (i = 0; i < sdev->device->phys_port_cnt; i++)
 		srpt_release_sport(&sdev->port[i]);
 
-	srpt_free_srq(sdev);
+	srpt_free_srqs(sdev);
 
 	ib_dealloc_pd(sdev->pd);
 
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.h b/drivers/infiniband/ulp/srpt/ib_srpt.h
index 2e1a698..a637d4f 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.h
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.h
@@ -42,6 +42,7 @@
 #include <rdma/ib_verbs.h>
 #include <rdma/ib_sa.h>
 #include <rdma/ib_cm.h>
+#include <rdma/srq_set.h>
 #include <rdma/rdma_cm.h>
 #include <rdma/rw.h>
 
@@ -56,6 +57,7 @@
 #define SRP_SERVICE_NAME_PREFIX		"SRP.T10:"
 
 struct srpt_nexus;
+struct srpt_srq;
 
 enum {
 	/*
@@ -255,6 +257,7 @@ enum rdma_ch_state {
 /**
  * struct srpt_rdma_ch - RDMA channel
  * @nexus:         I_T nexus this channel is associated with.
+ * @srq:           SRQ that this channel is associated with (if use_srq=True).
  * @qp:            IB queue pair used for communicating over this channel.
  * @ib_cm:	   See below.
  * @ib_cm.cm_id:   IB CM ID associated with the channel.
@@ -295,6 +298,7 @@ enum rdma_ch_state {
  */
 struct srpt_rdma_ch {
 	struct srpt_nexus	*nexus;
+	struct srpt_srq		*srq;
 	struct ib_qp		*qp;
 	union {
 		struct {
@@ -432,17 +436,29 @@ struct srpt_port {
 };
 
 /**
+ * struct srpt_srq - SRQ (shared receive queue) context for SRPT
+ * @ibsrq:         verbs SRQ pointer.
+ * @ioctx_ring:    Per SRQ ring.
+ * @sdev:          backpointer to the HCA information.
+ */
+struct srpt_srq {
+	struct ib_srq		*ibsrq;
+	struct srpt_recv_ioctx	**ioctx_ring;
+	struct srpt_device	*sdev;
+};
+
+/**
  * struct srpt_device - information associated by SRPT with a single HCA
  * @device:        Backpointer to the struct ib_device managed by the IB core.
  * @pd:            IB protection domain.
  * @lkey:          L_Key (local key) with write access to all local memory.
- * @srq:           Per-HCA SRQ (shared receive queue).
  * @cm_id:         Connection identifier.
- * @srq_size:      SRQ size.
+ * @srqs:          SRQ's array.
+ * @srq_count:     SRQs array size.
+ * @srq_size:      SRQ size for each in SRQ the array.
  * @sdev_mutex:	   Serializes use_srq changes.
  * @use_srq:       Whether or not to use SRQ.
  * @req_buf_cache: kmem_cache for @ioctx_ring buffers.
- * @ioctx_ring:    Per-HCA SRQ.
  * @event_handler: Per-HCA asynchronous IB event handler.
  * @list:          Node in srpt_dev_list.
  * @port:          Information about the ports owned by this HCA.
@@ -451,13 +467,13 @@ struct srpt_device {
 	struct ib_device	*device;
 	struct ib_pd		*pd;
 	u32			lkey;
-	struct ib_srq		*srq;
 	struct ib_cm_id		*cm_id;
+	struct srpt_srq		**srqs;
+	int			srq_count;
 	int			srq_size;
 	struct mutex		sdev_mutex;
 	bool			use_srq;
 	struct kmem_cache	*req_buf_cache;
-	struct srpt_recv_ioctx	**ioctx_ring;
 	struct ib_event_handler	event_handler;
 	struct list_head	list;
 	struct srpt_port        port[];
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-17 13:40 ` [PATCH 1/5] IB/core: add a simple SRQ set per PD Max Gurtovoy
@ 2020-03-17 13:55   ` Leon Romanovsky
  2020-03-17 16:37     ` Max Gurtovoy
  0 siblings, 1 reply; 29+ messages in thread
From: Leon Romanovsky @ 2020-03-17 13:55 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi

On Tue, Mar 17, 2020 at 03:40:26PM +0200, Max Gurtovoy wrote:
> ULP's can use this API to create/destroy SRQ's with the same
> characteristics for implementing a logic that aimed to save resources
> without significant performance penalty (e.g. create SRQ per completion
> vector and use shared receive buffers for multiple controllers of the
> ULP).
>
> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> ---
>  drivers/infiniband/core/Makefile  |  2 +-
>  drivers/infiniband/core/srq_set.c | 78 +++++++++++++++++++++++++++++++++++++++
>  drivers/infiniband/core/verbs.c   |  4 ++
>  include/rdma/ib_verbs.h           |  5 +++
>  include/rdma/srq_set.h            | 18 +++++++++
>  5 files changed, 106 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/infiniband/core/srq_set.c
>  create mode 100644 include/rdma/srq_set.h
>
> diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
> index d1b14887..1d3eaec 100644
> --- a/drivers/infiniband/core/Makefile
> +++ b/drivers/infiniband/core/Makefile
> @@ -12,7 +12,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
>  				roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
>  				multicast.o mad.o smi.o agent.o mad_rmpp.o \
>  				nldev.o restrack.o counters.o ib_core_uverbs.o \
> -				trace.o
> +				trace.o srq_set.o

Why did you call it "srq_set.c" and not "srq.c"?

>
>  ib_core-$(CONFIG_SECURITY_INFINIBAND) += security.o
>  ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
> diff --git a/drivers/infiniband/core/srq_set.c b/drivers/infiniband/core/srq_set.c
> new file mode 100644
> index 0000000..d143561
> --- /dev/null
> +++ b/drivers/infiniband/core/srq_set.c
> @@ -0,0 +1,78 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/*
> + * Copyright (c) 2020 Mellanox Technologies. All rights reserved.
> + */
> +
> +#include <rdma/srq_set.h>
> +
> +struct ib_srq *rdma_srq_get(struct ib_pd *pd)
> +{
> +	struct ib_srq *srq;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&pd->srq_lock, flags);
> +	srq = list_first_entry_or_null(&pd->srqs, struct ib_srq, pd_entry);
> +	if (srq) {
> +		list_del(&srq->pd_entry);
> +		pd->srqs_used++;

I don't see any real usage of srqs_used.

> +	}
> +	spin_unlock_irqrestore(&pd->srq_lock, flags);
> +
> +	return srq;
> +}
> +EXPORT_SYMBOL(rdma_srq_get);
> +
> +void rdma_srq_put(struct ib_pd *pd, struct ib_srq *srq)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&pd->srq_lock, flags);
> +	list_add(&srq->pd_entry, &pd->srqs);
> +	pd->srqs_used--;
> +	spin_unlock_irqrestore(&pd->srq_lock, flags);
> +}
> +EXPORT_SYMBOL(rdma_srq_put);
> +
> +int rdma_srq_set_init(struct ib_pd *pd, int nr,
> +		struct ib_srq_init_attr *srq_attr)
> +{
> +	struct ib_srq *srq;
> +	unsigned long flags;
> +	int ret, i;
> +
> +	for (i = 0; i < nr; i++) {
> +		srq = ib_create_srq(pd, srq_attr);
> +		if (IS_ERR(srq)) {
> +			ret = PTR_ERR(srq);
> +			goto out;
> +		}
> +
> +		spin_lock_irqsave(&pd->srq_lock, flags);
> +		list_add_tail(&srq->pd_entry, &pd->srqs);
> +		spin_unlock_irqrestore(&pd->srq_lock, flags);
> +	}
> +
> +	return 0;
> +out:
> +	rdma_srq_set_destroy(pd);
> +	return ret;
> +}
> +EXPORT_SYMBOL(rdma_srq_set_init);
> +
> +void rdma_srq_set_destroy(struct ib_pd *pd)
> +{
> +	struct ib_srq *srq;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&pd->srq_lock, flags);
> +	while (!list_empty(&pd->srqs)) {
> +		srq = list_first_entry(&pd->srqs, struct ib_srq, pd_entry);
> +		list_del(&srq->pd_entry);
> +
> +		spin_unlock_irqrestore(&pd->srq_lock, flags);
> +		ib_destroy_srq(srq);
> +		spin_lock_irqsave(&pd->srq_lock, flags);
> +	}
> +	spin_unlock_irqrestore(&pd->srq_lock, flags);
> +}
> +EXPORT_SYMBOL(rdma_srq_set_destroy);
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index e62c9df..6950abf 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -272,6 +272,9 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
>  	pd->__internal_mr = NULL;
>  	atomic_set(&pd->usecnt, 0);
>  	pd->flags = flags;
> +	pd->srqs_used = 0;
> +	spin_lock_init(&pd->srq_lock);
> +	INIT_LIST_HEAD(&pd->srqs);
>
>  	pd->res.type = RDMA_RESTRACK_PD;
>  	rdma_restrack_set_task(&pd->res, caller);
> @@ -340,6 +343,7 @@ void ib_dealloc_pd_user(struct ib_pd *pd, struct ib_udata *udata)
>  		pd->__internal_mr = NULL;
>  	}
>
> +	WARN_ON_ONCE(pd->srqs_used > 0);

It can be achieved by WARN_ON_ONCE(!list_empty(&pd->srqs))

>  	/* uverbs manipulates usecnt with proper locking, while the kabi
>  	   requires the caller to guarantee we can't race here. */
>  	WARN_ON(atomic_read(&pd->usecnt));
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 1f779fa..fc8207d 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -1517,6 +1517,10 @@ struct ib_pd {
>
>  	u32			unsafe_global_rkey;
>
> +	spinlock_t		srq_lock;
> +	int			srqs_used;
> +	struct list_head	srqs;
> +
>  	/*
>  	 * Implementation details of the RDMA core, don't use in drivers:
>  	 */
> @@ -1585,6 +1589,7 @@ struct ib_srq {
>  	void		       *srq_context;
>  	enum ib_srq_type	srq_type;
>  	atomic_t		usecnt;
> +	struct list_head	pd_entry; /* srq set entry */
>
>  	struct {
>  		struct ib_cq   *cq;
> diff --git a/include/rdma/srq_set.h b/include/rdma/srq_set.h
> new file mode 100644
> index 0000000..834c4c6
> --- /dev/null
> +++ b/include/rdma/srq_set.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB) */
> +/*
> + * Copyright (c) 2020 Mellanox Technologies. All rights reserved.
> + */
> +
> +#ifndef _RDMA_SRQ_SET_H
> +#define _RDMA_SRQ_SET_H 1

"1" is not needed.

> +
> +#include <rdma/ib_verbs.h>
> +
> +struct ib_srq *rdma_srq_get(struct ib_pd *pd);
> +void rdma_srq_put(struct ib_pd *pd, struct ib_srq *srq);

At the end, it is not get/put semantics but more add/remove.

Thanks

> +
> +int rdma_srq_set_init(struct ib_pd *pd, int nr,
> +		struct ib_srq_init_attr *srq_attr);
> +void rdma_srq_set_destroy(struct ib_pd *pd);
> +
> +#endif /* _RDMA_SRQ_SET_H */
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/5] RDMA/srpt: use SRQ per completion vector
  2020-03-17 13:40 ` [PATCH 5/5] RDMA/srpt: use SRQ per " Max Gurtovoy
@ 2020-03-17 13:58   ` Leon Romanovsky
  2020-03-17 16:43     ` Max Gurtovoy
  2020-03-17 19:58   ` Leon Romanovsky
  1 sibling, 1 reply; 29+ messages in thread
From: Leon Romanovsky @ 2020-03-17 13:58 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi

On Tue, Mar 17, 2020 at 03:40:30PM +0200, Max Gurtovoy wrote:
> In order to save resource allocation and utilize the completion
> locality in a better way (compared to SRQ per device that exist
> today), allocate Shared Receive Queues (SRQs) per completion vector.
> Associate each created channel with an appropriate SRQ according to the
> completion vector index. This association will reduce the lock
> contention in the fast path (compared to SRQ per device solution) and
> increase the locality in memory buffers.
>
> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> ---
>  drivers/infiniband/ulp/srpt/ib_srpt.c | 169 +++++++++++++++++++++++++---------
>  drivers/infiniband/ulp/srpt/ib_srpt.h |  26 +++++-
>  2 files changed, 148 insertions(+), 47 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
> index 9855274..34869b7 100644
> --- a/drivers/infiniband/ulp/srpt/ib_srpt.c
> +++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
> @@ -811,6 +811,31 @@ static bool srpt_test_and_set_cmd_state(struct srpt_send_ioctx *ioctx,
>  }
>
>  /**
> + * srpt_srq_post_recv - post an initial IB receive request for SRQ
> + * @srq: SRPT SRQ context.
> + * @ioctx: Receive I/O context pointer.
> + */
> +static int srpt_srq_post_recv(struct srpt_srq *srq, struct srpt_recv_ioctx *ioctx)
> +{
> +	struct srpt_device *sdev = srq->sdev;
> +	struct ib_sge list;
> +	struct ib_recv_wr wr;
> +
> +	BUG_ON(!srq);
> +	list.addr = ioctx->ioctx.dma + ioctx->ioctx.offset;
> +	list.length = srp_max_req_size;
> +	list.lkey = sdev->lkey;
> +
> +	ioctx->ioctx.cqe.done = srpt_recv_done;
> +	wr.wr_cqe = &ioctx->ioctx.cqe;
> +	wr.next = NULL;
> +	wr.sg_list = &list;
> +	wr.num_sge = 1;
> +
> +	return ib_post_srq_recv(srq->ibsrq, &wr, NULL);
> +}
> +
> +/**
>   * srpt_post_recv - post an IB receive request
>   * @sdev: SRPT HCA pointer.
>   * @ch: SRPT RDMA channel.
> @@ -823,6 +848,7 @@ static int srpt_post_recv(struct srpt_device *sdev, struct srpt_rdma_ch *ch,
>  	struct ib_recv_wr wr;
>
>  	BUG_ON(!sdev);
> +	BUG_ON(!ch);
>  	list.addr = ioctx->ioctx.dma + ioctx->ioctx.offset;
>  	list.length = srp_max_req_size;
>  	list.lkey = sdev->lkey;
> @@ -834,7 +860,7 @@ static int srpt_post_recv(struct srpt_device *sdev, struct srpt_rdma_ch *ch,
>  	wr.num_sge = 1;
>
>  	if (sdev->use_srq)
> -		return ib_post_srq_recv(sdev->srq, &wr, NULL);
> +		return ib_post_srq_recv(ch->srq->ibsrq, &wr, NULL);
>  	else
>  		return ib_post_recv(ch->qp, &wr, NULL);
>  }
> @@ -1820,7 +1846,8 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
>  					SRPT_MAX_SG_PER_WQE);
>  	qp_init->port_num = ch->sport->port;
>  	if (sdev->use_srq) {
> -		qp_init->srq = sdev->srq;
> +		ch->srq = sdev->srqs[ch->cq->comp_vector % sdev->srq_count];
> +		qp_init->srq = ch->srq->ibsrq;
>  	} else {
>  		qp_init->cap.max_recv_wr = ch->rq_size;
>  		qp_init->cap.max_recv_sge = min(attrs->max_recv_sge,
> @@ -1878,6 +1905,8 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
>
>  static void srpt_destroy_ch_ib(struct srpt_rdma_ch *ch)
>  {
> +	if (ch->srq)
> +		ch->srq = NULL;
>  	ib_destroy_qp(ch->qp);
>  	ib_free_cq(ch->cq);
>  }
> @@ -3018,20 +3047,75 @@ static struct se_wwn *srpt_lookup_wwn(const char *name)
>  	return wwn;
>  }
>
> -static void srpt_free_srq(struct srpt_device *sdev)
> +static void srpt_free_srq(struct srpt_srq *srq)
>  {
> -	if (!sdev->srq)
> -		return;
>
> -	ib_destroy_srq(sdev->srq);
> -	srpt_free_ioctx_ring((struct srpt_ioctx **)sdev->ioctx_ring, sdev,
> -			     sdev->srq_size, sdev->req_buf_cache,
> +	srpt_free_ioctx_ring((struct srpt_ioctx **)srq->ioctx_ring, srq->sdev,
> +			     srq->sdev->srq_size, srq->sdev->req_buf_cache,
>  			     DMA_FROM_DEVICE);
> +	rdma_srq_put(srq->sdev->pd, srq->ibsrq);
> +	kfree(srq);
> +
> +}
> +
> +static void srpt_free_srqs(struct srpt_device *sdev)
> +{
> +	int i;
> +
> +	if (!sdev->srqs)
> +		return;
> +
> +	for (i = 0; i < sdev->srq_count; i++)
> +		srpt_free_srq(sdev->srqs[i]);
>  	kmem_cache_destroy(sdev->req_buf_cache);
> -	sdev->srq = NULL;
> +	rdma_srq_set_destroy(sdev->pd);
> +	kfree(sdev->srqs);
> +	sdev->srqs = NULL;
>  }
>
> -static int srpt_alloc_srq(struct srpt_device *sdev)
> +static struct srpt_srq *srpt_alloc_srq(struct srpt_device *sdev)
> +{
> +	struct srpt_srq	*srq;
> +	int i, ret;
> +
> +	srq = kzalloc(sizeof(*srq), GFP_KERNEL);
> +	if (!srq) {
> +		pr_debug("failed to allocate SRQ context\n");

Please no to kzalloc prints and no to pr_* prints.

> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	srq->ibsrq = rdma_srq_get(sdev->pd);
> +	if (!srq) {

!srq->ibsrq ????

> +		ret = -EAGAIN;
> +		goto free_srq;
> +	}

Thanks


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/5] IB/core: cache the CQ completion vector
  2020-03-17 13:40 ` [PATCH 4/5] IB/core: cache the CQ " Max Gurtovoy
@ 2020-03-17 15:19   ` Chuck Lever
  2020-03-17 15:41     ` Max Gurtovoy
  0 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2020-03-17 15:19 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: loberman, bvanassche, vladimirk, idanb, linux-rdma, shlomin,
	linux-nvme, leonro, dledford, jgg, Oren Duer, kbusch, hch, sagi

Hi Max-

> On Mar 17, 2020, at 9:40 AM, Max Gurtovoy <maxg@mellanox.com> wrote:
> 
> In some cases, e.g. when using ib_alloc_cq_any, one would like to know
> the completion vector that eventually assigned to the CQ. Cache this
> value during CQ creation.

I'm confused by the mention of the ib_alloc_cq_any() API here.

Is your design somehow dependent on the way the current ib_alloc_cq_any()
selects comp_vectors? The contract for _any() is that it is an API for
callers that simply do not care about what comp_vector is chosen. There's
no guarantee that the _any() comp_vector allocator will continue to use
round-robin in the future, for instance.

If you want to guarantee that there is an SRQ for each comp_vector and a
comp_vector for each SRQ, stick with a CQ allocation API that enables
explicit selection of the comp_vector value, and cache that value in the
caller, not in the core data structures.


> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> ---
> drivers/infiniband/core/cq.c | 1 +
> include/rdma/ib_verbs.h      | 1 +
> 2 files changed, 2 insertions(+)
> 
> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
> index 4f25b24..a7cbf52 100644
> --- a/drivers/infiniband/core/cq.c
> +++ b/drivers/infiniband/core/cq.c
> @@ -217,6 +217,7 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
> 	cq->device = dev;
> 	cq->cq_context = private;
> 	cq->poll_ctx = poll_ctx;
> +	cq->comp_vector = comp_vector;
> 	atomic_set(&cq->usecnt, 0);
> 
> 	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index fc8207d..0d61772 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -1558,6 +1558,7 @@ struct ib_cq {
> 	struct ib_device       *device;
> 	struct ib_ucq_object   *uobject;
> 	ib_comp_handler   	comp_handler;
> +	u32			comp_vector;
> 	void                  (*event_handler)(struct ib_event *, void *);
> 	void                   *cq_context;
> 	int               	cqe;
> -- 
> 1.8.3.1
> 

--
Chuck Lever
chucklever@gmail.com





^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/5] IB/core: cache the CQ completion vector
  2020-03-17 15:19   ` Chuck Lever
@ 2020-03-17 15:41     ` Max Gurtovoy
  2020-03-17 20:36       ` Chuck Lever
  2020-03-17 22:50       ` Bart Van Assche
  0 siblings, 2 replies; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 15:41 UTC (permalink / raw)
  To: Chuck Lever
  Cc: loberman, bvanassche, vladimirk, idanb, linux-rdma, shlomin,
	linux-nvme, leonro, dledford, jgg, Oren Duer, kbusch, hch, sagi


On 3/17/2020 5:19 PM, Chuck Lever wrote:
> Hi Max-
>
>> On Mar 17, 2020, at 9:40 AM, Max Gurtovoy <maxg@mellanox.com> wrote:
>>
>> In some cases, e.g. when using ib_alloc_cq_any, one would like to know
>> the completion vector that eventually assigned to the CQ. Cache this
>> value during CQ creation.
> I'm confused by the mention of the ib_alloc_cq_any() API here.

Can't the user use ib_alloc_cq_any() and still want to know which comp
vector was eventually assigned, for his own needs?

>
> Is your design somehow dependent on the way the current ib_alloc_cq_any()
> selects comp_vectors? The contract for _any() is that it is an API for
> callers that simply do not care about what comp_vector is chosen. There's
> no guarantee that the _any() comp_vector allocator will continue to use
> round-robin in the future, for instance.

it's fine. I just want to make sure that I'll spread the SRQ's equally.

>
> If you want to guarantee that there is an SRQ for each comp_vector and a
> comp_vector for each SRQ, stick with a CQ allocation API that enables
> explicit selection of the comp_vector value, and cache that value in the
> caller, not in the core data structures.

I'm Ok with that as well. This is exactly what we do in the nvmf/rdma 
but I wanted to stick also with the SRP target approach.

Bart,

Any objection to removing the call to ib_alloc_cq_any() from ib_srpt and
using ib_alloc_cq() instead?


>
>
>> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>> ---
>> drivers/infiniband/core/cq.c | 1 +
>> include/rdma/ib_verbs.h      | 1 +
>> 2 files changed, 2 insertions(+)
>>
>> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
>> index 4f25b24..a7cbf52 100644
>> --- a/drivers/infiniband/core/cq.c
>> +++ b/drivers/infiniband/core/cq.c
>> @@ -217,6 +217,7 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
>> 	cq->device = dev;
>> 	cq->cq_context = private;
>> 	cq->poll_ctx = poll_ctx;
>> +	cq->comp_vector = comp_vector;
>> 	atomic_set(&cq->usecnt, 0);
>>
>> 	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index fc8207d..0d61772 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -1558,6 +1558,7 @@ struct ib_cq {
>> 	struct ib_device       *device;
>> 	struct ib_ucq_object   *uobject;
>> 	ib_comp_handler   	comp_handler;
>> +	u32			comp_vector;
>> 	void                  (*event_handler)(struct ib_event *, void *);
>> 	void                   *cq_context;
>> 	int               	cqe;
>> -- 
>> 1.8.3.1
>>
> --
> Chuck Lever
> chucklever@gmail.com
>
>
>


* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-17 13:55   ` Leon Romanovsky
@ 2020-03-17 16:37     ` Max Gurtovoy
  2020-03-17 18:10       ` Jason Gunthorpe
  2020-03-18  6:47       ` Leon Romanovsky
  0 siblings, 2 replies; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 16:37 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi


On 3/17/2020 3:55 PM, Leon Romanovsky wrote:
> On Tue, Mar 17, 2020 at 03:40:26PM +0200, Max Gurtovoy wrote:
>> ULP's can use this API to create/destroy SRQ's with the same
>> characteristics for implementing a logic that aimed to save resources
>> without significant performance penalty (e.g. create SRQ per completion
>> vector and use shared receive buffers for multiple controllers of the
>> ULP).
>>
>> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>> ---
>>   drivers/infiniband/core/Makefile  |  2 +-
>>   drivers/infiniband/core/srq_set.c | 78 +++++++++++++++++++++++++++++++++++++++
>>   drivers/infiniband/core/verbs.c   |  4 ++
>>   include/rdma/ib_verbs.h           |  5 +++
>>   include/rdma/srq_set.h            | 18 +++++++++
>>   5 files changed, 106 insertions(+), 1 deletion(-)
>>   create mode 100644 drivers/infiniband/core/srq_set.c
>>   create mode 100644 include/rdma/srq_set.h
>>
>> diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
>> index d1b14887..1d3eaec 100644
>> --- a/drivers/infiniband/core/Makefile
>> +++ b/drivers/infiniband/core/Makefile
>> @@ -12,7 +12,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
>>   				roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
>>   				multicast.o mad.o smi.o agent.o mad_rmpp.o \
>>   				nldev.o restrack.o counters.o ib_core_uverbs.o \
>> -				trace.o
>> +				trace.o srq_set.o
> Why did you call it "srq_set.c" and not "srq.c"?

because it's not an SRQ API but an SRQ-set API.


>>   ib_core-$(CONFIG_SECURITY_INFINIBAND) += security.o
>>   ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
>> diff --git a/drivers/infiniband/core/srq_set.c b/drivers/infiniband/core/srq_set.c
>> new file mode 100644
>> index 0000000..d143561
>> --- /dev/null
>> +++ b/drivers/infiniband/core/srq_set.c
>> @@ -0,0 +1,78 @@
>> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
>> +/*
>> + * Copyright (c) 2020 Mellanox Technologies. All rights reserved.
>> + */
>> +
>> +#include <rdma/srq_set.h>
>> +
>> +struct ib_srq *rdma_srq_get(struct ib_pd *pd)
>> +{
>> +	struct ib_srq *srq;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&pd->srq_lock, flags);
>> +	srq = list_first_entry_or_null(&pd->srqs, struct ib_srq, pd_entry);
>> +	if (srq) {
>> +		list_del(&srq->pd_entry);
>> +		pd->srqs_used++;
> I don't see any real usage of srqs_used.
>
>> +	}
>> +	spin_unlock_irqrestore(&pd->srq_lock, flags);
>> +
>> +	return srq;
>> +}
>> +EXPORT_SYMBOL(rdma_srq_get);
>> +
>> +void rdma_srq_put(struct ib_pd *pd, struct ib_srq *srq)
>> +{
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&pd->srq_lock, flags);
>> +	list_add(&srq->pd_entry, &pd->srqs);
>> +	pd->srqs_used--;
>> +	spin_unlock_irqrestore(&pd->srq_lock, flags);
>> +}
>> +EXPORT_SYMBOL(rdma_srq_put);
>> +
>> +int rdma_srq_set_init(struct ib_pd *pd, int nr,
>> +		struct ib_srq_init_attr *srq_attr)
>> +{
>> +	struct ib_srq *srq;
>> +	unsigned long flags;
>> +	int ret, i;
>> +
>> +	for (i = 0; i < nr; i++) {
>> +		srq = ib_create_srq(pd, srq_attr);
>> +		if (IS_ERR(srq)) {
>> +			ret = PTR_ERR(srq);
>> +			goto out;
>> +		}
>> +
>> +		spin_lock_irqsave(&pd->srq_lock, flags);
>> +		list_add_tail(&srq->pd_entry, &pd->srqs);
>> +		spin_unlock_irqrestore(&pd->srq_lock, flags);
>> +	}
>> +
>> +	return 0;
>> +out:
>> +	rdma_srq_set_destroy(pd);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(rdma_srq_set_init);
>> +
>> +void rdma_srq_set_destroy(struct ib_pd *pd)
>> +{
>> +	struct ib_srq *srq;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&pd->srq_lock, flags);
>> +	while (!list_empty(&pd->srqs)) {
>> +		srq = list_first_entry(&pd->srqs, struct ib_srq, pd_entry);
>> +		list_del(&srq->pd_entry);
>> +
>> +		spin_unlock_irqrestore(&pd->srq_lock, flags);
>> +		ib_destroy_srq(srq);
>> +		spin_lock_irqsave(&pd->srq_lock, flags);
>> +	}
>> +	spin_unlock_irqrestore(&pd->srq_lock, flags);
>> +}
>> +EXPORT_SYMBOL(rdma_srq_set_destroy);
>> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
>> index e62c9df..6950abf 100644
>> --- a/drivers/infiniband/core/verbs.c
>> +++ b/drivers/infiniband/core/verbs.c
>> @@ -272,6 +272,9 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
>>   	pd->__internal_mr = NULL;
>>   	atomic_set(&pd->usecnt, 0);
>>   	pd->flags = flags;
>> +	pd->srqs_used = 0;
>> +	spin_lock_init(&pd->srq_lock);
>> +	INIT_LIST_HEAD(&pd->srqs);
>>
>>   	pd->res.type = RDMA_RESTRACK_PD;
>>   	rdma_restrack_set_task(&pd->res, caller);
>> @@ -340,6 +343,7 @@ void ib_dealloc_pd_user(struct ib_pd *pd, struct ib_udata *udata)
>>   		pd->__internal_mr = NULL;
>>   	}
>>
>> +	WARN_ON_ONCE(pd->srqs_used > 0);
> It can be achieved by WARN_ON_ONCE(!list_empty(&pd->srqs))

ok, I'll change it for v2.


>
>>   	/* uverbs manipulates usecnt with proper locking, while the kabi
>>   	   requires the caller to guarantee we can't race here. */
>>   	WARN_ON(atomic_read(&pd->usecnt));
>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index 1f779fa..fc8207d 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -1517,6 +1517,10 @@ struct ib_pd {
>>
>>   	u32			unsafe_global_rkey;
>>
>> +	spinlock_t		srq_lock;
>> +	int			srqs_used;
>> +	struct list_head	srqs;
>> +
>>   	/*
>>   	 * Implementation details of the RDMA core, don't use in drivers:
>>   	 */
>> @@ -1585,6 +1589,7 @@ struct ib_srq {
>>   	void		       *srq_context;
>>   	enum ib_srq_type	srq_type;
>>   	atomic_t		usecnt;
>> +	struct list_head	pd_entry; /* srq set entry */
>>
>>   	struct {
>>   		struct ib_cq   *cq;
>> diff --git a/include/rdma/srq_set.h b/include/rdma/srq_set.h
>> new file mode 100644
>> index 0000000..834c4c6
>> --- /dev/null
>> +++ b/include/rdma/srq_set.h
>> @@ -0,0 +1,18 @@
>> +/* SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB) */
>> +/*
>> + * Copyright (c) 2020 Mellanox Technologies. All rights reserved.
>> + */
>> +
>> +#ifndef _RDMA_SRQ_SET_H
>> +#define _RDMA_SRQ_SET_H 1
> "1" is not needed.

agreed.


>
>> +
>> +#include <rdma/ib_verbs.h>
>> +
>> +struct ib_srq *rdma_srq_get(struct ib_pd *pd);
>> +void rdma_srq_put(struct ib_pd *pd, struct ib_srq *srq);
> At the end, it is not get/put semantics but more add/remove.

srq = rdma_srq_add ?

rdma_srq_remove(pd, srq) ?

Doesn't seem right to me.

Let's make it simple: to take an SRQ from the PD's set we'll use
rdma_srq_get(), and to return it we'll use rdma_srq_put().
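
To be clear, the intended ULP flow is simply (sketch):

	/* at PD setup: create <nr> identical SRQs and park them in the set */
	ret = rdma_srq_set_init(pd, nr, &srq_attr);

	/* per completion vector / per user: take one out of the set ... */
	srq = rdma_srq_get(pd);

	/* ... and return it on teardown */
	rdma_srq_put(pd, srq);

	/* once everything was returned, destroy the whole set */
	rdma_srq_set_destroy(pd);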

>
> Thanks
>
>> +
>> +int rdma_srq_set_init(struct ib_pd *pd, int nr,
>> +		struct ib_srq_init_attr *srq_attr);
>> +void rdma_srq_set_destroy(struct ib_pd *pd);
>> +
>> +#endif /* _RDMA_SRQ_SET_H */
>> --
>> 1.8.3.1
>>


* Re: [PATCH 5/5] RDMA/srpt: use SRQ per completion vector
  2020-03-17 13:58   ` Leon Romanovsky
@ 2020-03-17 16:43     ` Max Gurtovoy
  0 siblings, 0 replies; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 16:43 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi


On 3/17/2020 3:58 PM, Leon Romanovsky wrote:
> On Tue, Mar 17, 2020 at 03:40:30PM +0200, Max Gurtovoy wrote:
>> In order to save resource allocation and utilize the completion
>> locality in a better way (compared to SRQ per device that exist
>> today), allocate Shared Receive Queues (SRQs) per completion vector.
>> Associate each created channel with an appropriate SRQ according to the
>> completion vector index. This association will reduce the lock
>> contention in the fast path (compared to SRQ per device solution) and
>> increase the locality in memory buffers.
>>
>> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>> ---
>>   drivers/infiniband/ulp/srpt/ib_srpt.c | 169 +++++++++++++++++++++++++---------
>>   drivers/infiniband/ulp/srpt/ib_srpt.h |  26 +++++-
>>   2 files changed, 148 insertions(+), 47 deletions(-)
>>
>> diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
>> index 9855274..34869b7 100644
>> --- a/drivers/infiniband/ulp/srpt/ib_srpt.c
>> +++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
>> @@ -811,6 +811,31 @@ static bool srpt_test_and_set_cmd_state(struct srpt_send_ioctx *ioctx,
>>   }
>>
>>   /**
>> + * srpt_srq_post_recv - post an initial IB receive request for SRQ
>> + * @srq: SRPT SRQ context.
>> + * @ioctx: Receive I/O context pointer.
>> + */
>> +static int srpt_srq_post_recv(struct srpt_srq *srq, struct srpt_recv_ioctx *ioctx)
>> +{
>> +	struct srpt_device *sdev = srq->sdev;
>> +	struct ib_sge list;
>> +	struct ib_recv_wr wr;
>> +
>> +	BUG_ON(!srq);
>> +	list.addr = ioctx->ioctx.dma + ioctx->ioctx.offset;
>> +	list.length = srp_max_req_size;
>> +	list.lkey = sdev->lkey;
>> +
>> +	ioctx->ioctx.cqe.done = srpt_recv_done;
>> +	wr.wr_cqe = &ioctx->ioctx.cqe;
>> +	wr.next = NULL;
>> +	wr.sg_list = &list;
>> +	wr.num_sge = 1;
>> +
>> +	return ib_post_srq_recv(srq->ibsrq, &wr, NULL);
>> +}
>> +
>> +/**
>>    * srpt_post_recv - post an IB receive request
>>    * @sdev: SRPT HCA pointer.
>>    * @ch: SRPT RDMA channel.
>> @@ -823,6 +848,7 @@ static int srpt_post_recv(struct srpt_device *sdev, struct srpt_rdma_ch *ch,
>>   	struct ib_recv_wr wr;
>>
>>   	BUG_ON(!sdev);
>> +	BUG_ON(!ch);
>>   	list.addr = ioctx->ioctx.dma + ioctx->ioctx.offset;
>>   	list.length = srp_max_req_size;
>>   	list.lkey = sdev->lkey;
>> @@ -834,7 +860,7 @@ static int srpt_post_recv(struct srpt_device *sdev, struct srpt_rdma_ch *ch,
>>   	wr.num_sge = 1;
>>
>>   	if (sdev->use_srq)
>> -		return ib_post_srq_recv(sdev->srq, &wr, NULL);
>> +		return ib_post_srq_recv(ch->srq->ibsrq, &wr, NULL);
>>   	else
>>   		return ib_post_recv(ch->qp, &wr, NULL);
>>   }
>> @@ -1820,7 +1846,8 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
>>   					SRPT_MAX_SG_PER_WQE);
>>   	qp_init->port_num = ch->sport->port;
>>   	if (sdev->use_srq) {
>> -		qp_init->srq = sdev->srq;
>> +		ch->srq = sdev->srqs[ch->cq->comp_vector % sdev->srq_count];
>> +		qp_init->srq = ch->srq->ibsrq;
>>   	} else {
>>   		qp_init->cap.max_recv_wr = ch->rq_size;
>>   		qp_init->cap.max_recv_sge = min(attrs->max_recv_sge,
>> @@ -1878,6 +1905,8 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
>>
>>   static void srpt_destroy_ch_ib(struct srpt_rdma_ch *ch)
>>   {
>> +	if (ch->srq)
>> +		ch->srq = NULL;
>>   	ib_destroy_qp(ch->qp);
>>   	ib_free_cq(ch->cq);
>>   }
>> @@ -3018,20 +3047,75 @@ static struct se_wwn *srpt_lookup_wwn(const char *name)
>>   	return wwn;
>>   }
>>
>> -static void srpt_free_srq(struct srpt_device *sdev)
>> +static void srpt_free_srq(struct srpt_srq *srq)
>>   {
>> -	if (!sdev->srq)
>> -		return;
>>
>> -	ib_destroy_srq(sdev->srq);
>> -	srpt_free_ioctx_ring((struct srpt_ioctx **)sdev->ioctx_ring, sdev,
>> -			     sdev->srq_size, sdev->req_buf_cache,
>> +	srpt_free_ioctx_ring((struct srpt_ioctx **)srq->ioctx_ring, srq->sdev,
>> +			     srq->sdev->srq_size, srq->sdev->req_buf_cache,
>>   			     DMA_FROM_DEVICE);
>> +	rdma_srq_put(srq->sdev->pd, srq->ibsrq);
>> +	kfree(srq);
>> +
>> +}
>> +
>> +static void srpt_free_srqs(struct srpt_device *sdev)
>> +{
>> +	int i;
>> +
>> +	if (!sdev->srqs)
>> +		return;
>> +
>> +	for (i = 0; i < sdev->srq_count; i++)
>> +		srpt_free_srq(sdev->srqs[i]);
>>   	kmem_cache_destroy(sdev->req_buf_cache);
>> -	sdev->srq = NULL;
>> +	rdma_srq_set_destroy(sdev->pd);
>> +	kfree(sdev->srqs);
>> +	sdev->srqs = NULL;
>>   }
>>
>> -static int srpt_alloc_srq(struct srpt_device *sdev)
>> +static struct srpt_srq *srpt_alloc_srq(struct srpt_device *sdev)
>> +{
>> +	struct srpt_srq	*srq;
>> +	int i, ret;
>> +
>> +	srq = kzalloc(sizeof(*srq), GFP_KERNEL);
>> +	if (!srq) {
>> +		pr_debug("failed to allocate SRQ context\n");
> Please no to kzalloc prints and no to pr_* prints.

Sure, no problem.

I'll remove the above print.


>
>> +		return ERR_PTR(-ENOMEM);
>> +	}
>> +
>> +	srq->ibsrq = rdma_srq_get(sdev->pd);
>> +	if (!srq) {
> !srq->ibsrq ????

Yup good catch.
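
Will fix it to:

	srq->ibsrq = rdma_srq_get(sdev->pd);
	if (!srq->ibsrq) {
		ret = -EAGAIN;
		goto free_srq;
	}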


>
>> +		ret = -EAGAIN;
>> +		goto free_srq;
>> +	}
> Thanks


* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-17 16:37     ` Max Gurtovoy
@ 2020-03-17 18:10       ` Jason Gunthorpe
  2020-03-17 18:24         ` Max Gurtovoy
  2020-03-17 19:54         ` Leon Romanovsky
  2020-03-18  6:47       ` Leon Romanovsky
  1 sibling, 2 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2020-03-17 18:10 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: loberman, bvanassche, vladimirk, idanb, linux-rdma, shlomin,
	linux-nvme, Leon Romanovsky, dledford, oren, kbusch, hch, sagi

On Tue, Mar 17, 2020 at 06:37:57PM +0200, Max Gurtovoy wrote:

> > > +#include <rdma/ib_verbs.h>
> > > +
> > > +struct ib_srq *rdma_srq_get(struct ib_pd *pd);
> > > +void rdma_srq_put(struct ib_pd *pd, struct ib_srq *srq);
> > At the end, it is not get/put semantics but more add/remove.
> 
> srq = rdma_srq_add ?
> 
> rdma_srq_remove(pd, srq) ?
> 
> Doesn't seems right to me.
> 
> Lets make it simple. For asking a SRQ from the PD set lets use rdma_srq_get
> and returning to we'll use rdma_srq_put.

Is there reference counting here? get/put should be restricted to
refcounting APIs, IMHO.

Jason


* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-17 18:10       ` Jason Gunthorpe
@ 2020-03-17 18:24         ` Max Gurtovoy
  2020-03-17 18:43           ` Jason Gunthorpe
  2020-03-17 19:54         ` Leon Romanovsky
  1 sibling, 1 reply; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 18:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: loberman, bvanassche, vladimirk, idanb, linux-rdma, shlomin,
	linux-nvme, Leon Romanovsky, dledford, oren, kbusch, hch, sagi


On 3/17/2020 8:10 PM, Jason Gunthorpe wrote:
> On Tue, Mar 17, 2020 at 06:37:57PM +0200, Max Gurtovoy wrote:
>
>>>> +#include <rdma/ib_verbs.h>
>>>> +
>>>> +struct ib_srq *rdma_srq_get(struct ib_pd *pd);
>>>> +void rdma_srq_put(struct ib_pd *pd, struct ib_srq *srq);
>>> At the end, it is not get/put semantics but more add/remove.
>> srq = rdma_srq_add ?
>>
>> rdma_srq_remove(pd, srq) ?
>>
>> Doesn't seems right to me.
>>
>> Lets make it simple. For asking a SRQ from the PD set lets use rdma_srq_get
>> and returning to we'll use rdma_srq_put.
> Is there reference couting here? get/put should be restricted to
> refcounting APIs, IMHO.

I've added a counter (pd->srqs_used) that Leon asked to remove.

There is no call to kref get/put here.

Do you prefer that I change it to an array in the PD ("struct
ib_srq **srqs;")?

And update the ib_alloc_pd() API to take pd_attrs and allocate the array
during PD allocation?
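
i.e. something like this (hypothetical, just to show the idea):

	/* hypothetical pd_attrs passed at PD allocation time */
	struct ib_pd_attrs attrs = {
		.srq_count     = dev->num_comp_vectors,
		.srq_init_attr = &srq_attr,
	};

	pd = ib_alloc_pd(dev, flags, &attrs);
	...
	srq = pd->srqs[qidx % pd->srq_count];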

>
> Jason


* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-17 18:24         ` Max Gurtovoy
@ 2020-03-17 18:43           ` Jason Gunthorpe
  2020-03-17 21:56             ` Max Gurtovoy
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2020-03-17 18:43 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: loberman, bvanassche, vladimirk, idanb, linux-rdma, shlomin,
	linux-nvme, Leon Romanovsky, dledford, oren, kbusch, hch, sagi

On Tue, Mar 17, 2020 at 08:24:30PM +0200, Max Gurtovoy wrote:
> 
> On 3/17/2020 8:10 PM, Jason Gunthorpe wrote:
> > On Tue, Mar 17, 2020 at 06:37:57PM +0200, Max Gurtovoy wrote:
> > 
> > > > > +#include <rdma/ib_verbs.h>
> > > > > +
> > > > > +struct ib_srq *rdma_srq_get(struct ib_pd *pd);
> > > > > +void rdma_srq_put(struct ib_pd *pd, struct ib_srq *srq);
> > > > At the end, it is not get/put semantics but more add/remove.
> > > srq = rdma_srq_add ?
> > > 
> > > rdma_srq_remove(pd, srq) ?
> > > 
> > > Doesn't seems right to me.
> > > 
> > > Lets make it simple. For asking a SRQ from the PD set lets use rdma_srq_get
> > > and returning to we'll use rdma_srq_put.
> > Is there reference couting here? get/put should be restricted to
> > refcounting APIs, IMHO.
> 
> I've added a counter (pd->srqs_used) that Leon asked to remove .
> 
> There is no call to kref get/put here.

I didn't look closely, any kind of refcount scheme is reasonable, but
if add is supposed to create a new srq then that isn't 'get'..

> Do you prefer that I'll change it to be array in PD: "struct
> ib_srq           **srqs;" ?

Not particularly..

It actually feels a bit weird, should there be some numa-ness involved
here so that the SRQ with memory on the node that is going to be
polling it is returned?

> And update ib_alloc_pd API to get pd_attrs and allocate the array during PD
> allocation ?

The API is a bit more composable if things can be done as separate
follow-on function calls rather than that way.. I don't like the giant
multiplexor structs in general

Jason


* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-17 18:10       ` Jason Gunthorpe
  2020-03-17 18:24         ` Max Gurtovoy
@ 2020-03-17 19:54         ` Leon Romanovsky
  1 sibling, 0 replies; 29+ messages in thread
From: Leon Romanovsky @ 2020-03-17 19:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, oren, kbusch, Max Gurtovoy, hch, sagi

On Tue, Mar 17, 2020 at 03:10:36PM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 17, 2020 at 06:37:57PM +0200, Max Gurtovoy wrote:
>
> > > > +#include <rdma/ib_verbs.h>
> > > > +
> > > > +struct ib_srq *rdma_srq_get(struct ib_pd *pd);
> > > > +void rdma_srq_put(struct ib_pd *pd, struct ib_srq *srq);
> > > At the end, it is not get/put semantics but more add/remove.
> >
> > srq = rdma_srq_add ?
> >
> > rdma_srq_remove(pd, srq) ?
> >
> > Doesn't seems right to me.
> >
> > Lets make it simple. For asking a SRQ from the PD set lets use rdma_srq_get
> > and returning to we'll use rdma_srq_put.
>
> Is there reference couting here? get/put should be restricted to
> refcounting APIs, IMHO.

No, there is no refcounting. Max introduced a counter which he decreases
and increases but never otherwise uses, simply burning CPU cycles.

Thanks

>
> Jason


* Re: [PATCH 5/5] RDMA/srpt: use SRQ per completion vector
  2020-03-17 13:40 ` [PATCH 5/5] RDMA/srpt: use SRQ per " Max Gurtovoy
  2020-03-17 13:58   ` Leon Romanovsky
@ 2020-03-17 19:58   ` Leon Romanovsky
  1 sibling, 0 replies; 29+ messages in thread
From: Leon Romanovsky @ 2020-03-17 19:58 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi

On Tue, Mar 17, 2020 at 03:40:30PM +0200, Max Gurtovoy wrote:
> In order to save resource allocation and utilize the completion
> locality in a better way (compared to SRQ per device that exist
> today), allocate Shared Receive Queues (SRQs) per completion vector.
> Associate each created channel with an appropriate SRQ according to the
> completion vector index. This association will reduce the lock
> contention in the fast path (compared to SRQ per device solution) and
> increase the locality in memory buffers.
>
> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> ---
>  drivers/infiniband/ulp/srpt/ib_srpt.c | 169 +++++++++++++++++++++++++---------
>  drivers/infiniband/ulp/srpt/ib_srpt.h |  26 +++++-
>  2 files changed, 148 insertions(+), 47 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
> index 9855274..34869b7 100644
> --- a/drivers/infiniband/ulp/srpt/ib_srpt.c
> +++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
> @@ -811,6 +811,31 @@ static bool srpt_test_and_set_cmd_state(struct srpt_send_ioctx *ioctx,
>  }
>
>  /**
> + * srpt_srq_post_recv - post an initial IB receive request for SRQ
> + * @srq: SRPT SRQ context.
> + * @ioctx: Receive I/O context pointer.
> + */
> +static int srpt_srq_post_recv(struct srpt_srq *srq, struct srpt_recv_ioctx *ioctx)
> +{
> +	struct srpt_device *sdev = srq->sdev;
> +	struct ib_sge list;
> +	struct ib_recv_wr wr;
> +
> +	BUG_ON(!srq);

And of course no BUG_ONs in the code.

Thanks


* Re: [PATCH 4/5] IB/core: cache the CQ completion vector
  2020-03-17 15:41     ` Max Gurtovoy
@ 2020-03-17 20:36       ` Chuck Lever
  2020-03-17 22:18         ` Max Gurtovoy
  2020-03-17 22:50       ` Bart Van Assche
  1 sibling, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2020-03-17 20:36 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: loberman, bvanassche, vladimirk, idanb, linux-rdma, shlomin,
	linux-nvme, leonro, dledford, jgg, Oren Duer, kbusch, hch, sagi



> On Mar 17, 2020, at 11:41 AM, Max Gurtovoy <maxg@mellanox.com> wrote:
> 
> 
> On 3/17/2020 5:19 PM, Chuck Lever wrote:
>> Hi Max-
>> 
>>> On Mar 17, 2020, at 9:40 AM, Max Gurtovoy <maxg@mellanox.com> wrote:
>>> 
>>> In some cases, e.g. when using ib_alloc_cq_any, one would like to know
>>> the completion vector that eventually assigned to the CQ. Cache this
>>> value during CQ creation.
>> I'm confused by the mention of the ib_alloc_cq_any() API here.
> 
> Can't the user use ib_alloc_cq_any() and still want to know what is the final comp vector for his needs ?

If your caller cares about the final cv value, then it should not use
the _any API. The point of _any is that the caller really does not care,
the cv value is hidden because it is not consequential. Your design
breaks that assumption/contract.


>> Is your design somehow dependent on the way the current ib_alloc_cq_any()
>> selects comp_vectors? The contract for _any() is that it is an API for
>> callers that simply do not care about what comp_vector is chosen. There's
>> no guarantee that the _any() comp_vector allocator will continue to use
>> round-robin in the future, for instance.
> 
> it's fine. I just want to make sure that I'll spread the SRQ's equally.

The _any algorithm is simplistic, it spreads cvs for the system as a whole.
All devices, all callers. You're going to get some bad degenerate cases
as soon as you have multiple users of the SRQ facility.

So, you really want to have a more specialized comp_vector selector for
the SRQ facility; one that is careful to spread cvs per device, independent
of the global allocator, which is good enough for normal cases.

I think your tests perform well simply because there was no other contender
for comp_vectors on your test system.
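
Something like this, maybe (untested, just to illustrate a per-device
spread, with a hypothetical counter kept in the PD or in the SRQ set):

	static u32 srq_next_comp_vector(struct ib_pd *pd)
	{
		/* round-robin over this device's vectors only */
		return atomic_fetch_inc(&pd->srq_cv) %
			pd->device->num_comp_vectors;
	}

so SRQ users on one device don't get skewed by whatever the global _any
allocator happens to be doing for other devices and callers.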


>> If you want to guarantee that there is an SRQ for each comp_vector and a
>> comp_vector for each SRQ, stick with a CQ allocation API that enables
>> explicit selection of the comp_vector value, and cache that value in the
>> caller, not in the core data structures.
> 
> I'm Ok with that as well. This is exactly what we do in the nvmf/rdma but I wanted to stick also with the SRP target approach.
> 
> Bart,
> 
> Any objection to remove the call for ib_alloc_cq_any() from ib_srpt and use ib_alloc_cq() ?
> 
> 
>> 
>> 
>>> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>>> ---
>>> drivers/infiniband/core/cq.c | 1 +
>>> include/rdma/ib_verbs.h      | 1 +
>>> 2 files changed, 2 insertions(+)
>>> 
>>> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
>>> index 4f25b24..a7cbf52 100644
>>> --- a/drivers/infiniband/core/cq.c
>>> +++ b/drivers/infiniband/core/cq.c
>>> @@ -217,6 +217,7 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
>>> 	cq->device = dev;
>>> 	cq->cq_context = private;
>>> 	cq->poll_ctx = poll_ctx;
>>> +	cq->comp_vector = comp_vector;
>>> 	atomic_set(&cq->usecnt, 0);
>>> 
>>> 	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
>>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>>> index fc8207d..0d61772 100644
>>> --- a/include/rdma/ib_verbs.h
>>> +++ b/include/rdma/ib_verbs.h
>>> @@ -1558,6 +1558,7 @@ struct ib_cq {
>>> 	struct ib_device       *device;
>>> 	struct ib_ucq_object   *uobject;
>>> 	ib_comp_handler   	comp_handler;
>>> +	u32			comp_vector;
>>> 	void                  (*event_handler)(struct ib_event *, void *);
>>> 	void                   *cq_context;
>>> 	int               	cqe;
>>> -- 
>>> 1.8.3.1
>>> 
>> --
>> Chuck Lever
>> chucklever@gmail.com
>> 
>> 
>> 

--
Chuck Lever
chucklever@gmail.com





* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-17 18:43           ` Jason Gunthorpe
@ 2020-03-17 21:56             ` Max Gurtovoy
  0 siblings, 0 replies; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 21:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: loberman, bvanassche, vladimirk, idanb, linux-rdma, shlomin,
	linux-nvme, Leon Romanovsky, dledford, oren, kbusch, hch, sagi


On 3/17/2020 8:43 PM, Jason Gunthorpe wrote:
> On Tue, Mar 17, 2020 at 08:24:30PM +0200, Max Gurtovoy wrote:
>> On 3/17/2020 8:10 PM, Jason Gunthorpe wrote:
>>> On Tue, Mar 17, 2020 at 06:37:57PM +0200, Max Gurtovoy wrote:
>>>
>>>>>> +#include <rdma/ib_verbs.h>
>>>>>> +
>>>>>> +struct ib_srq *rdma_srq_get(struct ib_pd *pd);
>>>>>> +void rdma_srq_put(struct ib_pd *pd, struct ib_srq *srq);
>>>>> At the end, it is not get/put semantics but more add/remove.
>>>> srq = rdma_srq_add ?
>>>>
>>>> rdma_srq_remove(pd, srq) ?
>>>>
>>>> Doesn't seems right to me.
>>>>
>>>> Lets make it simple. For asking a SRQ from the PD set lets use rdma_srq_get
>>>> and returning to we'll use rdma_srq_put.
>>> Is there reference couting here? get/put should be restricted to
>>> refcounting APIs, IMHO.
>> I've added a counter (pd->srqs_used) that Leon asked to remove .
>>
>> There is no call to kref get/put here.
> I didn't look closely, any kind of refcount scheme is reasonable, but
> if add is supposed to create a new srq then that isn't 'get'..

No, we don't create a new srq during the "get". We create a set using 
"rdma_srq_set_init".

"get" will simply pull an srq from the set and "put" will push it back.

>
>> Do you prefer that I'll change it to be array in PD: "struct
>> ib_srq           **srqs;" ?
> Not particularly..
>
> It actually feels a bit weird, should there be some numa-ness involved
> here so that the SRQ with memory on the node that is going to be
> polling it is returned?

Maybe this will be the next improvement. But for now the receive buffers 
are allocated by the ULP.

The idea is to spread the SRQs as much as possible, as we do for QP/CQ, 
to reach almost the same performance.

With a single SRQ we can't reach good performance, since many cores are 
racing for one resource.

With regular RQs we allocate many buffers that are idle most of the time.

If we spread SRQs across all the cores/vectors we have, we get great 
performance while saving resources, which can be critical for MQ ULPs 
such as NVMf/SRP targets that may serve hundreds of initiators with 
hundreds of queues each.
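
In the nvmet-rdma patch that ends up as:

	ndev->srq_size = min(ndev->device->attrs.max_srq_wr,
			     nvmet_rdma_srq_size);
	ndev->srq_count = min(ndev->device->num_comp_vectors,
			      ndev->device->attrs.max_srq);
	...
	queue->nsrq = ndev->srqs[queue->comp_vector % ndev->srq_count];

so the receive-buffer footprint is bounded by the number of vectors, no 
matter how many controllers/queues are connected.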


>
>> And update ib_alloc_pd API to get pd_attrs and allocate the array during PD
>> allocation ?
> The API is a bit more composable if things can can be done as
> following function calls are done that way.. I don't like the giant
> multiplexor structs in general
>
> Jason


* Re: [PATCH 4/5] IB/core: cache the CQ completion vector
  2020-03-17 20:36       ` Chuck Lever
@ 2020-03-17 22:18         ` Max Gurtovoy
  0 siblings, 0 replies; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 22:18 UTC (permalink / raw)
  To: Chuck Lever
  Cc: loberman, bvanassche, vladimirk, idanb, linux-rdma, shlomin,
	linux-nvme, leonro, dledford, jgg, Oren Duer, kbusch, hch, sagi


On 3/17/2020 10:36 PM, Chuck Lever wrote:
>
>> On Mar 17, 2020, at 11:41 AM, Max Gurtovoy <maxg@mellanox.com> wrote:
>>
>>
>> On 3/17/2020 5:19 PM, Chuck Lever wrote:
>>> Hi Max-
>>>
>>>> On Mar 17, 2020, at 9:40 AM, Max Gurtovoy <maxg@mellanox.com> wrote:
>>>>
>>>> In some cases, e.g. when using ib_alloc_cq_any, one would like to know
>>>> the completion vector that eventually assigned to the CQ. Cache this
>>>> value during CQ creation.
>>> I'm confused by the mention of the ib_alloc_cq_any() API here.
>> Can't the user use ib_alloc_cq_any() and still want to know what is the final comp vector for his needs ?
> If your caller cares about the final cv value, then it should not use
> the _any API. The point of _any is that the caller really does not care,
> the cv value is hidden because it is not consequential. Your design
> breaks that assumption/contract.

How come it breaks?

If the ULP wants to let the rdma-core layer allocate the optimal vector, 
and relies on it to do so, why is it wrong to know which vector was 
finally assigned?

I can remove it and change the SRP target to use ib_alloc_cq(), but that 
doesn't break the contract.

>
>>> Is your design somehow dependent on the way the current ib_alloc_cq_any()
>>> selects comp_vectors? The contract for _any() is that it is an API for
>>> callers that simply do not care about what comp_vector is chosen. There's
>>> no guarantee that the _any() comp_vector allocator will continue to use
>>> round-robin in the future, for instance.
>> it's fine. I just want to make sure that I'll spread the SRQ's equally.
> The _any algorithm is simplistic, it spreads cvs for the system as a whole.
> All devices, all callers. You're going to get some bad degenerate cases
> as soon as you have multiple users of the SRQ facility.

How come? This facility is per PD, which is created by the ULP.
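
e.g. in nvmet-rdma the set is created on the target's own PD:

	srq_attr.attr.max_wr = ndev->srq_size;
	srq_attr.attr.max_sge = 2;
	srq_attr.srq_type = IB_SRQT_BASIC;
	ret = rdma_srq_set_init(ndev->pd, ndev->srq_count, &srq_attr);

so two ULPs never share an SRQ set, only the device itself.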


>
> So, you really want to have a more specialized comp_vector selector for
> the SRQ facility; one that is careful to spread cvs per device, independent
> of the global allocator, which is good enough for normal cases.
>
> I think your tests perform well simply because there was no other contender
> for comp_vectors on your test system.

For the test results I posted I used the NVMf target, which uses 
ib_alloc_cq, so it would be fine anyway.

According to your theory, ib_alloc_cq_any() will not perform well and 
will hit degenerate cases anyway, regardless of the SRQ feature, which 
was intended to save resources while keeping performance high.

>
>
>>> If you want to guarantee that there is an SRQ for each comp_vector and a
>>> comp_vector for each SRQ, stick with a CQ allocation API that enables
>>> explicit selection of the comp_vector value, and cache that value in the
>>> caller, not in the core data structures.
>> I'm Ok with that as well. This is exactly what we do in the nvmf/rdma but I wanted to stick also with the SRP target approach.
>>
>> Bart,
>>
>> Any objection to remove the call for ib_alloc_cq_any() from ib_srpt and use ib_alloc_cq() ?
>>
>>
>>>
>>>> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>>>> ---
>>>> drivers/infiniband/core/cq.c | 1 +
>>>> include/rdma/ib_verbs.h      | 1 +
>>>> 2 files changed, 2 insertions(+)
>>>>
>>>> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
>>>> index 4f25b24..a7cbf52 100644
>>>> --- a/drivers/infiniband/core/cq.c
>>>> +++ b/drivers/infiniband/core/cq.c
>>>> @@ -217,6 +217,7 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
>>>> 	cq->device = dev;
>>>> 	cq->cq_context = private;
>>>> 	cq->poll_ctx = poll_ctx;
>>>> +	cq->comp_vector = comp_vector;
>>>> 	atomic_set(&cq->usecnt, 0);
>>>>
>>>> 	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
>>>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>>>> index fc8207d..0d61772 100644
>>>> --- a/include/rdma/ib_verbs.h
>>>> +++ b/include/rdma/ib_verbs.h
>>>> @@ -1558,6 +1558,7 @@ struct ib_cq {
>>>> 	struct ib_device       *device;
>>>> 	struct ib_ucq_object   *uobject;
>>>> 	ib_comp_handler   	comp_handler;
>>>> +	u32			comp_vector;
>>>> 	void                  (*event_handler)(struct ib_event *, void *);
>>>> 	void                   *cq_context;
>>>> 	int               	cqe;
>>>> -- 
>>>> 1.8.3.1
>>>>
>>> --
>>> Chuck Lever
>>> chucklever@gmail.com
>>>
>>>
>>>
> --
> Chuck Lever
> chucklever@gmail.com
>
>
>


* Re: [PATCH 4/5] IB/core: cache the CQ completion vector
  2020-03-17 15:41     ` Max Gurtovoy
  2020-03-17 20:36       ` Chuck Lever
@ 2020-03-17 22:50       ` Bart Van Assche
  2020-03-17 23:26         ` Max Gurtovoy
  1 sibling, 1 reply; 29+ messages in thread
From: Bart Van Assche @ 2020-03-17 22:50 UTC (permalink / raw)
  To: Max Gurtovoy, Chuck Lever
  Cc: loberman, sagi, vladimirk, idanb, linux-rdma, shlomin,
	linux-nvme, leonro, dledford, jgg, Oren Duer, kbusch, hch

On 2020-03-17 08:41, Max Gurtovoy wrote:
> On 3/17/2020 5:19 PM, Chuck Lever wrote:
>> If you want to guarantee that there is an SRQ for each comp_vector and a
>> comp_vector for each SRQ, stick with a CQ allocation API that enables
>> explicit selection of the comp_vector value, and cache that value in the
>> caller, not in the core data structures.
> 
> I'm Ok with that as well. This is exactly what we do in the nvmf/rdma
> but I wanted to stick also with the SRP target approach.
> 
> Any objection to remove the call for ib_alloc_cq_any() from ib_srpt and
> use ib_alloc_cq() ?
Hi Max,

Wasn't that call introduced by Chuck (see also commit 20cf4e026730
("rdma: Enable ib_alloc_cq to spread work over a device's comp_vectors")
# v5.4)? Anyway, I'm fine with the proposed change.

Thanks,

Bart.


* Re: [PATCH 4/5] IB/core: cache the CQ completion vector
  2020-03-17 22:50       ` Bart Van Assche
@ 2020-03-17 23:26         ` Max Gurtovoy
  0 siblings, 0 replies; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-17 23:26 UTC (permalink / raw)
  To: Bart Van Assche, Chuck Lever
  Cc: loberman, sagi, vladimirk, idanb, linux-rdma, shlomin,
	linux-nvme, leonro, dledford, jgg, Oren Duer, kbusch, hch


On 3/18/2020 12:50 AM, Bart Van Assche wrote:
> On 2020-03-17 08:41, Max Gurtovoy wrote:
>> On 3/17/2020 5:19 PM, Chuck Lever wrote:
>>> If you want to guarantee that there is an SRQ for each comp_vector and a
>>> comp_vector for each SRQ, stick with a CQ allocation API that enables
>>> explicit selection of the comp_vector value, and cache that value in the
>>> caller, not in the core data structures.
>> I'm Ok with that as well. This is exactly what we do in the nvmf/rdma
>> but I wanted to stick also with the SRP target approach.
>>
>> Any objection to remove the call for ib_alloc_cq_any() from ib_srpt and
>> use ib_alloc_cq() ?
> Hi Max,
>
> Wasn't that call introduced by Chuck (see also commit 20cf4e026730
> ("rdma: Enable ib_alloc_cq to spread work over a device's comp_vectors")
> # v5.4)? Anyway, I'm fine with the proposed change.

Yes, this was introduced by Chuck, but no contract is broken :)

I'll update srpt in V2.
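
i.e. roughly (sketch, with the vector itself picked by srpt, e.g. per
channel):

	ch->cq = ib_alloc_cq(sdev->device, ch, nr_cqe, comp_vector,
			     IB_POLL_WORKQUEUE);

instead of ib_alloc_cq_any(), so the comp_vector caching in ib_cq can be 
dropped from the series.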


> Thanks,
>
> Bart.


* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-17 16:37     ` Max Gurtovoy
  2020-03-17 18:10       ` Jason Gunthorpe
@ 2020-03-18  6:47       ` Leon Romanovsky
  2020-03-18  9:46         ` Max Gurtovoy
  1 sibling, 1 reply; 29+ messages in thread
From: Leon Romanovsky @ 2020-03-18  6:47 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi

On Tue, Mar 17, 2020 at 06:37:57PM +0200, Max Gurtovoy wrote:
>
> On 3/17/2020 3:55 PM, Leon Romanovsky wrote:
> > On Tue, Mar 17, 2020 at 03:40:26PM +0200, Max Gurtovoy wrote:
> > > ULP's can use this API to create/destroy SRQ's with the same
> > > characteristics for implementing a logic that aimed to save resources
> > > without significant performance penalty (e.g. create SRQ per completion
> > > vector and use shared receive buffers for multiple controllers of the
> > > ULP).
> > >
> > > Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> > > ---
> > >   drivers/infiniband/core/Makefile  |  2 +-
> > >   drivers/infiniband/core/srq_set.c | 78 +++++++++++++++++++++++++++++++++++++++
> > >   drivers/infiniband/core/verbs.c   |  4 ++
> > >   include/rdma/ib_verbs.h           |  5 +++
> > >   include/rdma/srq_set.h            | 18 +++++++++
> > >   5 files changed, 106 insertions(+), 1 deletion(-)
> > >   create mode 100644 drivers/infiniband/core/srq_set.c
> > >   create mode 100644 include/rdma/srq_set.h
> > >
> > > diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
> > > index d1b14887..1d3eaec 100644
> > > --- a/drivers/infiniband/core/Makefile
> > > +++ b/drivers/infiniband/core/Makefile
> > > @@ -12,7 +12,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
> > >   				roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
> > >   				multicast.o mad.o smi.o agent.o mad_rmpp.o \
> > >   				nldev.o restrack.o counters.o ib_core_uverbs.o \
> > > -				trace.o
> > > +				trace.o srq_set.o
> > Why did you call it "srq_set.c" and not "srq.c"?
>
> because it's not a SRQ API but SRQ-set API.

I would say that it is an SRQ-pool API and not an SRQ-set API.

Thanks


* Re: [PATCH 3/5] nvmet-rdma: use SRQ per completion vector
  2020-03-17 13:40 ` [PATCH 3/5] nvmet-rdma: use SRQ per completion vector Max Gurtovoy
@ 2020-03-18  6:53   ` Leon Romanovsky
  2020-03-18  9:39     ` Max Gurtovoy
  0 siblings, 1 reply; 29+ messages in thread
From: Leon Romanovsky @ 2020-03-18  6:53 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi

On Tue, Mar 17, 2020 at 03:40:28PM +0200, Max Gurtovoy wrote:
> In order to save resource allocation and utilize the completion
> locality in a better way (compared to SRQ per device that exist today),
> allocate Shared Receive Queues (SRQs) per completion vector. Associate
> each created QP/CQ with an appropriate SRQ according to the queue index.
> This association will reduce the lock contention in the fast path
> (compared to SRQ per device solution) and increase the locality in
> memory buffers. Add new module parameter for SRQ size to adjust it
> according to the expected load. User should make sure the size is >= 256
> to avoid lack of resources.
>
> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> ---
>  drivers/nvme/target/rdma.c | 204 +++++++++++++++++++++++++++++++++------------
>  1 file changed, 153 insertions(+), 51 deletions(-)
>
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 04062af..f560257 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -18,6 +18,7 @@
>  #include <asm/unaligned.h>
>
>  #include <rdma/ib_verbs.h>
> +#include <rdma/srq_set.h>
>  #include <rdma/rdma_cm.h>
>  #include <rdma/rw.h>
>
> @@ -31,6 +32,8 @@
>  #define NVMET_RDMA_MAX_INLINE_SGE		4
>  #define NVMET_RDMA_MAX_INLINE_DATA_SIZE		max_t(int, SZ_16K, PAGE_SIZE)
>
> +struct nvmet_rdma_srq;
> +
>  struct nvmet_rdma_cmd {
>  	struct ib_sge		sge[NVMET_RDMA_MAX_INLINE_SGE + 1];
>  	struct ib_cqe		cqe;
> @@ -38,7 +41,7 @@ struct nvmet_rdma_cmd {
>  	struct scatterlist	inline_sg[NVMET_RDMA_MAX_INLINE_SGE];
>  	struct nvme_command     *nvme_cmd;
>  	struct nvmet_rdma_queue	*queue;
> -	struct ib_srq		*srq;
> +	struct nvmet_rdma_srq   *nsrq;
>  };
>
>  enum {
> @@ -80,6 +83,7 @@ struct nvmet_rdma_queue {
>  	struct ib_cq		*cq;
>  	atomic_t		sq_wr_avail;
>  	struct nvmet_rdma_device *dev;
> +	struct nvmet_rdma_srq   *nsrq;
>  	spinlock_t		state_lock;
>  	enum nvmet_rdma_queue_state state;
>  	struct nvmet_cq		nvme_cq;
> @@ -97,17 +101,24 @@ struct nvmet_rdma_queue {
>
>  	int			idx;
>  	int			host_qid;
> +	int			comp_vector;
>  	int			recv_queue_size;
>  	int			send_queue_size;
>
>  	struct list_head	queue_list;
>  };
>
> +struct nvmet_rdma_srq {
> +	struct ib_srq            *srq;
> +	struct nvmet_rdma_cmd    *cmds;
> +	struct nvmet_rdma_device *ndev;
> +};
> +
>  struct nvmet_rdma_device {
>  	struct ib_device	*device;
>  	struct ib_pd		*pd;
> -	struct ib_srq		*srq;
> -	struct nvmet_rdma_cmd	*srq_cmds;
> +	struct nvmet_rdma_srq	**srqs;
> +	int			srq_count;
>  	size_t			srq_size;
>  	struct kref		ref;
>  	struct list_head	entry;
> @@ -119,6 +130,16 @@ struct nvmet_rdma_device {
>  module_param_named(use_srq, nvmet_rdma_use_srq, bool, 0444);
>  MODULE_PARM_DESC(use_srq, "Use shared receive queue.");
>
> +static int srq_size_set(const char *val, const struct kernel_param *kp);
> +static const struct kernel_param_ops srq_size_ops = {
> +	.set = srq_size_set,
> +	.get = param_get_int,
> +};
> +
> +static int nvmet_rdma_srq_size = 1024;
> +module_param_cb(srq_size, &srq_size_ops, &nvmet_rdma_srq_size, 0644);
> +MODULE_PARM_DESC(srq_size, "set Shared Receive Queue (SRQ) size, should >= 256 (default: 1024)");
> +

Why do you need this? Do you expect users to change this parameter for
every different workload?


>  static DEFINE_IDA(nvmet_rdma_queue_ida);
>  static LIST_HEAD(nvmet_rdma_queue_list);
>  static DEFINE_MUTEX(nvmet_rdma_queue_mutex);
> @@ -139,6 +160,17 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev,
>
>  static const struct nvmet_fabrics_ops nvmet_rdma_ops;
>
> +static int srq_size_set(const char *val, const struct kernel_param *kp)
> +{
> +	int n = 0, ret;
> +
> +	ret = kstrtoint(val, 10, &n);
> +	if (ret != 0 || n < 256)
> +		return -EINVAL;
> +
> +	return param_set_int(val, kp);
> +}
> +
>  static int num_pages(int len)
>  {
>  	return 1 + (((len - 1) & PAGE_MASK) >> PAGE_SHIFT);
> @@ -462,8 +494,8 @@ static int nvmet_rdma_post_recv(struct nvmet_rdma_device *ndev,
>  		cmd->sge[0].addr, cmd->sge[0].length,
>  		DMA_FROM_DEVICE);
>
> -	if (cmd->srq)
> -		ret = ib_post_srq_recv(cmd->srq, &cmd->wr, NULL);
> +	if (cmd->nsrq)
> +		ret = ib_post_srq_recv(cmd->nsrq->srq, &cmd->wr, NULL);
>  	else
>  		ret = ib_post_recv(cmd->queue->cm_id->qp, &cmd->wr, NULL);
>
> @@ -841,30 +873,82 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
>  	nvmet_rdma_handle_command(queue, rsp);
>  }
>
> -static void nvmet_rdma_destroy_srq(struct nvmet_rdma_device *ndev)
> +static void nvmet_rdma_destroy_srq(struct nvmet_rdma_srq *nsrq)
> +{
> +	nvmet_rdma_free_cmds(nsrq->ndev, nsrq->cmds, nsrq->ndev->srq_size, false);
> +	rdma_srq_put(nsrq->ndev->pd, nsrq->srq);
> +
> +	kfree(nsrq);
> +}
> +
> +static void nvmet_rdma_destroy_srqs(struct nvmet_rdma_device *ndev)
>  {
> -	if (!ndev->srq)
> +	int i;
> +
> +	if (!ndev->srqs)
>  		return;
>
> -	nvmet_rdma_free_cmds(ndev, ndev->srq_cmds, ndev->srq_size, false);
> -	ib_destroy_srq(ndev->srq);
> +	for (i = 0; i < ndev->srq_count; i++)
> +		nvmet_rdma_destroy_srq(ndev->srqs[i]);
> +
> +	rdma_srq_set_destroy(ndev->pd);
> +	kfree(ndev->srqs);
> +	ndev->srqs = NULL;
> +	ndev->srq_count = 0;
> +	ndev->srq_size = 0;
>  }
>
> -static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
> +static struct nvmet_rdma_srq *
> +nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
>  {
> -	struct ib_srq_init_attr srq_attr = { NULL, };
> +	size_t srq_size = ndev->srq_size;
> +	struct nvmet_rdma_srq *nsrq;
>  	struct ib_srq *srq;
> -	size_t srq_size;
>  	int ret, i;
>
> -	srq_size = 4095;	/* XXX: tune */
> +	nsrq = kzalloc(sizeof(*nsrq), GFP_KERNEL);
> +	if (!nsrq)
> +		return ERR_PTR(-ENOMEM);
>
> -	srq_attr.attr.max_wr = srq_size;
> -	srq_attr.attr.max_sge = 1 + ndev->inline_page_count;
> -	srq_attr.attr.srq_limit = 0;
> -	srq_attr.srq_type = IB_SRQT_BASIC;
> -	srq = ib_create_srq(ndev->pd, &srq_attr);
> -	if (IS_ERR(srq)) {
> +	srq = rdma_srq_get(ndev->pd);
> +	if (!srq) {
> +		ret = -EAGAIN;
> +		goto out_free_nsrq;
> +	}
> +
> +	nsrq->cmds = nvmet_rdma_alloc_cmds(ndev, srq_size, false);
> +	if (IS_ERR(nsrq->cmds)) {
> +		ret = PTR_ERR(nsrq->cmds);
> +		goto out_put_srq;
> +	}
> +
> +	nsrq->srq = srq;
> +	nsrq->ndev = ndev;
> +
> +	for (i = 0; i < srq_size; i++) {
> +		nsrq->cmds[i].nsrq = nsrq;
> +		ret = nvmet_rdma_post_recv(ndev, &nsrq->cmds[i]);
> +		if (ret)
> +			goto out_free_cmds;
> +	}
> +
> +	return nsrq;
> +
> +out_free_cmds:
> +	nvmet_rdma_free_cmds(ndev, nsrq->cmds, srq_size, false);
> +out_put_srq:
> +	rdma_srq_put(ndev->pd, srq);
> +out_free_nsrq:
> +	kfree(nsrq);
> +	return ERR_PTR(ret);
> +}
> +
> +static int nvmet_rdma_init_srqs(struct nvmet_rdma_device *ndev)
> +{
> +	struct ib_srq_init_attr srq_attr = { NULL, };
> +	int i, ret;
> +
> +	if (!ndev->device->attrs.max_srq_wr || !ndev->device->attrs.max_srq) {
>  		/*
>  		 * If SRQs aren't supported we just go ahead and use normal
>  		 * non-shared receive queues.
> @@ -873,31 +957,44 @@ static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
>  		return 0;
>  	}
>
> -	ndev->srq_cmds = nvmet_rdma_alloc_cmds(ndev, srq_size, false);
> -	if (IS_ERR(ndev->srq_cmds)) {
> -		ret = PTR_ERR(ndev->srq_cmds);
> -		goto out_destroy_srq;
> -	}
> +	ndev->srq_size = min(ndev->device->attrs.max_srq_wr,
> +			     nvmet_rdma_srq_size);
> +	ndev->srq_count = min(ndev->device->num_comp_vectors,
> +			      ndev->device->attrs.max_srq);
>
> -	ndev->srq = srq;
> -	ndev->srq_size = srq_size;
> +	ndev->srqs = kcalloc(ndev->srq_count, sizeof(*ndev->srqs), GFP_KERNEL);
> +	if (!ndev->srqs)
> +		return -ENOMEM;
>
> -	for (i = 0; i < srq_size; i++) {
> -		ndev->srq_cmds[i].srq = srq;
> -		ret = nvmet_rdma_post_recv(ndev, &ndev->srq_cmds[i]);
> -		if (ret)
> -			goto out_free_cmds;
> +	srq_attr.attr.max_wr = ndev->srq_size;
> +	srq_attr.attr.max_sge = 2;
> +	srq_attr.attr.srq_limit = 0;
> +	srq_attr.srq_type = IB_SRQT_BASIC;
> +	ret = rdma_srq_set_init(ndev->pd, ndev->srq_count, &srq_attr);
> +	if (ret)
> +		goto err_free;
> +
> +	for (i = 0; i < ndev->srq_count; i++) {
> +		ndev->srqs[i] = nvmet_rdma_init_srq(ndev);
> +		if (IS_ERR(ndev->srqs[i]))
> +			goto err_srq;
>  	}
>
>  	return 0;
>
> -out_free_cmds:
> -	nvmet_rdma_free_cmds(ndev, ndev->srq_cmds, ndev->srq_size, false);
> -out_destroy_srq:
> -	ib_destroy_srq(srq);
> +err_srq:
> +	while (--i >= 0)
> +		nvmet_rdma_destroy_srq(ndev->srqs[i]);
> +	rdma_srq_set_destroy(ndev->pd);
> +err_free:
> +	kfree(ndev->srqs);
> +	ndev->srqs = NULL;
> +	ndev->srq_count = 0;
> +	ndev->srq_size = 0;
>  	return ret;
>  }
>
> +
>  static void nvmet_rdma_free_dev(struct kref *ref)
>  {
>  	struct nvmet_rdma_device *ndev =
> @@ -907,7 +1004,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
>  	list_del(&ndev->entry);
>  	mutex_unlock(&device_list_mutex);
>
> -	nvmet_rdma_destroy_srq(ndev);
> +	nvmet_rdma_destroy_srqs(ndev);
>  	ib_dealloc_pd(ndev->pd);
>
>  	kfree(ndev);
> @@ -953,7 +1050,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
>  		goto out_free_dev;
>
>  	if (nvmet_rdma_use_srq) {
> -		ret = nvmet_rdma_init_srq(ndev);
> +		ret = nvmet_rdma_init_srqs(ndev);
>  		if (ret)
>  			goto out_free_pd;
>  	}
> @@ -977,14 +1074,8 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
>  {
>  	struct ib_qp_init_attr qp_attr;
>  	struct nvmet_rdma_device *ndev = queue->dev;
> -	int comp_vector, nr_cqe, ret, i;
> +	int nr_cqe, ret, i;
>
> -	/*
> -	 * Spread the io queues across completion vectors,
> -	 * but still keep all admin queues on vector 0.
> -	 */
> -	comp_vector = !queue->host_qid ? 0 :
> -		queue->idx % ndev->device->num_comp_vectors;
>
>  	/*
>  	 * Reserve CQ slots for RECV + RDMA_READ/RDMA_WRITE + RDMA_SEND.
> @@ -992,7 +1083,7 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
>  	nr_cqe = queue->recv_queue_size + 2 * queue->send_queue_size;
>
>  	queue->cq = ib_alloc_cq(ndev->device, queue,
> -			nr_cqe + 1, comp_vector,
> +			nr_cqe + 1, queue->comp_vector,
>  			IB_POLL_WORKQUEUE);
>  	if (IS_ERR(queue->cq)) {
>  		ret = PTR_ERR(queue->cq);
> @@ -1014,8 +1105,8 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
>  	qp_attr.cap.max_send_sge = max(ndev->device->attrs.max_sge_rd,
>  					ndev->device->attrs.max_send_sge);
>
> -	if (ndev->srq) {
> -		qp_attr.srq = ndev->srq;
> +	if (queue->nsrq) {
> +		qp_attr.srq = queue->nsrq->srq;
>  	} else {
>  		/* +1 for drain */
>  		qp_attr.cap.max_recv_wr = 1 + queue->recv_queue_size;
> @@ -1034,7 +1125,7 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
>  		 __func__, queue->cq->cqe, qp_attr.cap.max_send_sge,
>  		 qp_attr.cap.max_send_wr, queue->cm_id);
>
> -	if (!ndev->srq) {
> +	if (!queue->nsrq) {
>  		for (i = 0; i < queue->recv_queue_size; i++) {
>  			queue->cmds[i].queue = queue;
>  			ret = nvmet_rdma_post_recv(ndev, &queue->cmds[i]);
> @@ -1070,7 +1161,7 @@ static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
>  	nvmet_sq_destroy(&queue->nvme_sq);
>
>  	nvmet_rdma_destroy_queue_ib(queue);
> -	if (!queue->dev->srq) {
> +	if (!queue->nsrq) {
>  		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
>  				queue->recv_queue_size,
>  				!queue->host_qid);
> @@ -1182,13 +1273,22 @@ static int nvmet_rdma_cm_reject(struct rdma_cm_id *cm_id,
>  		goto out_destroy_sq;
>  	}
>
> +	/*
> +	 * Spread the io queues across completion vectors,
> +	 * but still keep all admin queues on vector 0.
> +	 */
> +	queue->comp_vector = !queue->host_qid ? 0 :
> +		queue->idx % ndev->device->num_comp_vectors;
> +
>  	ret = nvmet_rdma_alloc_rsps(queue);
>  	if (ret) {
>  		ret = NVME_RDMA_CM_NO_RSC;
>  		goto out_ida_remove;
>  	}
>
> -	if (!ndev->srq) {
> +	if (ndev->srqs) {
> +		queue->nsrq = ndev->srqs[queue->comp_vector % ndev->srq_count];
> +	} else {
>  		queue->cmds = nvmet_rdma_alloc_cmds(ndev,
>  				queue->recv_queue_size,
>  				!queue->host_qid);
> @@ -1209,10 +1309,12 @@ static int nvmet_rdma_cm_reject(struct rdma_cm_id *cm_id,
>  	return queue;
>
>  out_free_cmds:
> -	if (!ndev->srq) {
> +	if (!queue->nsrq) {
>  		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
>  				queue->recv_queue_size,
>  				!queue->host_qid);
> +	} else {
> +		queue->nsrq = NULL;

I have no idea if it is right or not, but the logic seems strange.
If queue->nsrq exists, you nullify the pointer; is that done on purpose?

Thanks

>  	}
>  out_free_responses:
>  	nvmet_rdma_free_rsps(queue);
> --
> 1.8.3.1
>


* Re: [PATCH 3/5] nvmet-rdma: use SRQ per completion vector
  2020-03-18  6:53   ` Leon Romanovsky
@ 2020-03-18  9:39     ` Max Gurtovoy
  0 siblings, 0 replies; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-18  9:39 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi


On 3/18/2020 8:53 AM, Leon Romanovsky wrote:
> On Tue, Mar 17, 2020 at 03:40:28PM +0200, Max Gurtovoy wrote:
>> In order to save resource allocation and utilize the completion
>> locality in a better way (compared to SRQ per device that exist today),
>> allocate Shared Receive Queues (SRQs) per completion vector. Associate
>> each created QP/CQ with an appropriate SRQ according to the queue index.
>> This association will reduce the lock contention in the fast path
>> (compared to SRQ per device solution) and increase the locality in
>> memory buffers. Add new module parameter for SRQ size to adjust it
>> according to the expected load. User should make sure the size is >= 256
>> to avoid lack of resources.
>>
>> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>> ---
>>   drivers/nvme/target/rdma.c | 204 +++++++++++++++++++++++++++++++++------------
>>   1 file changed, 153 insertions(+), 51 deletions(-)
>>
>> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
>> index 04062af..f560257 100644
>> --- a/drivers/nvme/target/rdma.c
>> +++ b/drivers/nvme/target/rdma.c
>> @@ -18,6 +18,7 @@
>>   #include <asm/unaligned.h>
>>
>>   #include <rdma/ib_verbs.h>
>> +#include <rdma/srq_set.h>
>>   #include <rdma/rdma_cm.h>
>>   #include <rdma/rw.h>
>>
>> @@ -31,6 +32,8 @@
>>   #define NVMET_RDMA_MAX_INLINE_SGE		4
>>   #define NVMET_RDMA_MAX_INLINE_DATA_SIZE		max_t(int, SZ_16K, PAGE_SIZE)
>>
>> +struct nvmet_rdma_srq;
>> +
>>   struct nvmet_rdma_cmd {
>>   	struct ib_sge		sge[NVMET_RDMA_MAX_INLINE_SGE + 1];
>>   	struct ib_cqe		cqe;
>> @@ -38,7 +41,7 @@ struct nvmet_rdma_cmd {
>>   	struct scatterlist	inline_sg[NVMET_RDMA_MAX_INLINE_SGE];
>>   	struct nvme_command     *nvme_cmd;
>>   	struct nvmet_rdma_queue	*queue;
>> -	struct ib_srq		*srq;
>> +	struct nvmet_rdma_srq   *nsrq;
>>   };
>>
>>   enum {
>> @@ -80,6 +83,7 @@ struct nvmet_rdma_queue {
>>   	struct ib_cq		*cq;
>>   	atomic_t		sq_wr_avail;
>>   	struct nvmet_rdma_device *dev;
>> +	struct nvmet_rdma_srq   *nsrq;
>>   	spinlock_t		state_lock;
>>   	enum nvmet_rdma_queue_state state;
>>   	struct nvmet_cq		nvme_cq;
>> @@ -97,17 +101,24 @@ struct nvmet_rdma_queue {
>>
>>   	int			idx;
>>   	int			host_qid;
>> +	int			comp_vector;
>>   	int			recv_queue_size;
>>   	int			send_queue_size;
>>
>>   	struct list_head	queue_list;
>>   };
>>
>> +struct nvmet_rdma_srq {
>> +	struct ib_srq            *srq;
>> +	struct nvmet_rdma_cmd    *cmds;
>> +	struct nvmet_rdma_device *ndev;
>> +};
>> +
>>   struct nvmet_rdma_device {
>>   	struct ib_device	*device;
>>   	struct ib_pd		*pd;
>> -	struct ib_srq		*srq;
>> -	struct nvmet_rdma_cmd	*srq_cmds;
>> +	struct nvmet_rdma_srq	**srqs;
>> +	int			srq_count;
>>   	size_t			srq_size;
>>   	struct kref		ref;
>>   	struct list_head	entry;
>> @@ -119,6 +130,16 @@ struct nvmet_rdma_device {
>>   module_param_named(use_srq, nvmet_rdma_use_srq, bool, 0444);
>>   MODULE_PARM_DESC(use_srq, "Use shared receive queue.");
>>
>> +static int srq_size_set(const char *val, const struct kernel_param *kp);
>> +static const struct kernel_param_ops srq_size_ops = {
>> +	.set = srq_size_set,
>> +	.get = param_get_int,
>> +};
>> +
>> +static int nvmet_rdma_srq_size = 1024;
>> +module_param_cb(srq_size, &srq_size_ops, &nvmet_rdma_srq_size, 0644);
>> +MODULE_PARM_DESC(srq_size, "set Shared Receive Queue (SRQ) size, should >= 256 (default: 1024)");
>> +
> Why do you need this? Do you expect users to change this parameter for
> every different workload?

No. I want to help users control their memory consumption, which is
very important on systems/storage nodes with limited resources, instead
of hard-coding a value that might over-consume memory for them.
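
For example (illustrative only, using the srq_size parameter added by
this patch), a memory-constrained target could be loaded with a smaller
SRQ size:

	modprobe nvmet-rdma srq_size=256

Each device then creates min(num_comp_vectors, max_srq) SRQs of that
size, so the receive-buffer footprint is roughly
srq_size * srq_count * per-command buffer (nvme command + inline data
pages), independent of the number of connected controllers.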

>
>
>>   static DEFINE_IDA(nvmet_rdma_queue_ida);
>>   static LIST_HEAD(nvmet_rdma_queue_list);
>>   static DEFINE_MUTEX(nvmet_rdma_queue_mutex);
>> @@ -139,6 +160,17 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev,
>>
>>   static const struct nvmet_fabrics_ops nvmet_rdma_ops;
>>
>> +static int srq_size_set(const char *val, const struct kernel_param *kp)
>> +{
>> +	int n = 0, ret;
>> +
>> +	ret = kstrtoint(val, 10, &n);
>> +	if (ret != 0 || n < 256)
>> +		return -EINVAL;
>> +
>> +	return param_set_int(val, kp);
>> +}
>> +
>>   static int num_pages(int len)
>>   {
>>   	return 1 + (((len - 1) & PAGE_MASK) >> PAGE_SHIFT);
>> @@ -462,8 +494,8 @@ static int nvmet_rdma_post_recv(struct nvmet_rdma_device *ndev,
>>   		cmd->sge[0].addr, cmd->sge[0].length,
>>   		DMA_FROM_DEVICE);
>>
>> -	if (cmd->srq)
>> -		ret = ib_post_srq_recv(cmd->srq, &cmd->wr, NULL);
>> +	if (cmd->nsrq)
>> +		ret = ib_post_srq_recv(cmd->nsrq->srq, &cmd->wr, NULL);
>>   	else
>>   		ret = ib_post_recv(cmd->queue->cm_id->qp, &cmd->wr, NULL);
>>
>> @@ -841,30 +873,82 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
>>   	nvmet_rdma_handle_command(queue, rsp);
>>   }
>>
>> -static void nvmet_rdma_destroy_srq(struct nvmet_rdma_device *ndev)
>> +static void nvmet_rdma_destroy_srq(struct nvmet_rdma_srq *nsrq)
>> +{
>> +	nvmet_rdma_free_cmds(nsrq->ndev, nsrq->cmds, nsrq->ndev->srq_size, false);
>> +	rdma_srq_put(nsrq->ndev->pd, nsrq->srq);
>> +
>> +	kfree(nsrq);
>> +}
>> +
>> +static void nvmet_rdma_destroy_srqs(struct nvmet_rdma_device *ndev)
>>   {
>> -	if (!ndev->srq)
>> +	int i;
>> +
>> +	if (!ndev->srqs)
>>   		return;
>>
>> -	nvmet_rdma_free_cmds(ndev, ndev->srq_cmds, ndev->srq_size, false);
>> -	ib_destroy_srq(ndev->srq);
>> +	for (i = 0; i < ndev->srq_count; i++)
>> +		nvmet_rdma_destroy_srq(ndev->srqs[i]);
>> +
>> +	rdma_srq_set_destroy(ndev->pd);
>> +	kfree(ndev->srqs);
>> +	ndev->srqs = NULL;
>> +	ndev->srq_count = 0;
>> +	ndev->srq_size = 0;
>>   }
>>
>> -static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
>> +static struct nvmet_rdma_srq *
>> +nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
>>   {
>> -	struct ib_srq_init_attr srq_attr = { NULL, };
>> +	size_t srq_size = ndev->srq_size;
>> +	struct nvmet_rdma_srq *nsrq;
>>   	struct ib_srq *srq;
>> -	size_t srq_size;
>>   	int ret, i;
>>
>> -	srq_size = 4095;	/* XXX: tune */
>> +	nsrq = kzalloc(sizeof(*nsrq), GFP_KERNEL);
>> +	if (!nsrq)
>> +		return ERR_PTR(-ENOMEM);
>>
>> -	srq_attr.attr.max_wr = srq_size;
>> -	srq_attr.attr.max_sge = 1 + ndev->inline_page_count;
>> -	srq_attr.attr.srq_limit = 0;
>> -	srq_attr.srq_type = IB_SRQT_BASIC;
>> -	srq = ib_create_srq(ndev->pd, &srq_attr);
>> -	if (IS_ERR(srq)) {
>> +	srq = rdma_srq_get(ndev->pd);
>> +	if (!srq) {
>> +		ret = -EAGAIN;
>> +		goto out_free_nsrq;
>> +	}
>> +
>> +	nsrq->cmds = nvmet_rdma_alloc_cmds(ndev, srq_size, false);
>> +	if (IS_ERR(nsrq->cmds)) {
>> +		ret = PTR_ERR(nsrq->cmds);
>> +		goto out_put_srq;
>> +	}
>> +
>> +	nsrq->srq = srq;
>> +	nsrq->ndev = ndev;
>> +
>> +	for (i = 0; i < srq_size; i++) {
>> +		nsrq->cmds[i].nsrq = nsrq;
>> +		ret = nvmet_rdma_post_recv(ndev, &nsrq->cmds[i]);
>> +		if (ret)
>> +			goto out_free_cmds;
>> +	}
>> +
>> +	return nsrq;
>> +
>> +out_free_cmds:
>> +	nvmet_rdma_free_cmds(ndev, nsrq->cmds, srq_size, false);
>> +out_put_srq:
>> +	rdma_srq_put(ndev->pd, srq);
>> +out_free_nsrq:
>> +	kfree(nsrq);
>> +	return ERR_PTR(ret);
>> +}
>> +
>> +static int nvmet_rdma_init_srqs(struct nvmet_rdma_device *ndev)
>> +{
>> +	struct ib_srq_init_attr srq_attr = { NULL, };
>> +	int i, ret;
>> +
>> +	if (!ndev->device->attrs.max_srq_wr || !ndev->device->attrs.max_srq) {
>>   		/*
>>   		 * If SRQs aren't supported we just go ahead and use normal
>>   		 * non-shared receive queues.
>> @@ -873,31 +957,44 @@ static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
>>   		return 0;
>>   	}
>>
>> -	ndev->srq_cmds = nvmet_rdma_alloc_cmds(ndev, srq_size, false);
>> -	if (IS_ERR(ndev->srq_cmds)) {
>> -		ret = PTR_ERR(ndev->srq_cmds);
>> -		goto out_destroy_srq;
>> -	}
>> +	ndev->srq_size = min(ndev->device->attrs.max_srq_wr,
>> +			     nvmet_rdma_srq_size);
>> +	ndev->srq_count = min(ndev->device->num_comp_vectors,
>> +			      ndev->device->attrs.max_srq);
>>
>> -	ndev->srq = srq;
>> -	ndev->srq_size = srq_size;
>> +	ndev->srqs = kcalloc(ndev->srq_count, sizeof(*ndev->srqs), GFP_KERNEL);
>> +	if (!ndev->srqs)
>> +		return -ENOMEM;
>>
>> -	for (i = 0; i < srq_size; i++) {
>> -		ndev->srq_cmds[i].srq = srq;
>> -		ret = nvmet_rdma_post_recv(ndev, &ndev->srq_cmds[i]);
>> -		if (ret)
>> -			goto out_free_cmds;
>> +	srq_attr.attr.max_wr = ndev->srq_size;
>> +	srq_attr.attr.max_sge = 2;
>> +	srq_attr.attr.srq_limit = 0;
>> +	srq_attr.srq_type = IB_SRQT_BASIC;
>> +	ret = rdma_srq_set_init(ndev->pd, ndev->srq_count, &srq_attr);
>> +	if (ret)
>> +		goto err_free;
>> +
>> +	for (i = 0; i < ndev->srq_count; i++) {
>> +		ndev->srqs[i] = nvmet_rdma_init_srq(ndev);
>> +		if (IS_ERR(ndev->srqs[i]))
>> +			goto err_srq;
>>   	}
>>
>>   	return 0;
>>
>> -out_free_cmds:
>> -	nvmet_rdma_free_cmds(ndev, ndev->srq_cmds, ndev->srq_size, false);
>> -out_destroy_srq:
>> -	ib_destroy_srq(srq);
>> +err_srq:
>> +	while (--i >= 0)
>> +		nvmet_rdma_destroy_srq(ndev->srqs[i]);
>> +	rdma_srq_set_destroy(ndev->pd);
>> +err_free:
>> +	kfree(ndev->srqs);
>> +	ndev->srqs = NULL;
>> +	ndev->srq_count = 0;
>> +	ndev->srq_size = 0;
>>   	return ret;
>>   }
>>
>> +
>>   static void nvmet_rdma_free_dev(struct kref *ref)
>>   {
>>   	struct nvmet_rdma_device *ndev =
>> @@ -907,7 +1004,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
>>   	list_del(&ndev->entry);
>>   	mutex_unlock(&device_list_mutex);
>>
>> -	nvmet_rdma_destroy_srq(ndev);
>> +	nvmet_rdma_destroy_srqs(ndev);
>>   	ib_dealloc_pd(ndev->pd);
>>
>>   	kfree(ndev);
>> @@ -953,7 +1050,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
>>   		goto out_free_dev;
>>
>>   	if (nvmet_rdma_use_srq) {
>> -		ret = nvmet_rdma_init_srq(ndev);
>> +		ret = nvmet_rdma_init_srqs(ndev);
>>   		if (ret)
>>   			goto out_free_pd;
>>   	}
>> @@ -977,14 +1074,8 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
>>   {
>>   	struct ib_qp_init_attr qp_attr;
>>   	struct nvmet_rdma_device *ndev = queue->dev;
>> -	int comp_vector, nr_cqe, ret, i;
>> +	int nr_cqe, ret, i;
>>
>> -	/*
>> -	 * Spread the io queues across completion vectors,
>> -	 * but still keep all admin queues on vector 0.
>> -	 */
>> -	comp_vector = !queue->host_qid ? 0 :
>> -		queue->idx % ndev->device->num_comp_vectors;
>>
>>   	/*
>>   	 * Reserve CQ slots for RECV + RDMA_READ/RDMA_WRITE + RDMA_SEND.
>> @@ -992,7 +1083,7 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
>>   	nr_cqe = queue->recv_queue_size + 2 * queue->send_queue_size;
>>
>>   	queue->cq = ib_alloc_cq(ndev->device, queue,
>> -			nr_cqe + 1, comp_vector,
>> +			nr_cqe + 1, queue->comp_vector,
>>   			IB_POLL_WORKQUEUE);
>>   	if (IS_ERR(queue->cq)) {
>>   		ret = PTR_ERR(queue->cq);
>> @@ -1014,8 +1105,8 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
>>   	qp_attr.cap.max_send_sge = max(ndev->device->attrs.max_sge_rd,
>>   					ndev->device->attrs.max_send_sge);
>>
>> -	if (ndev->srq) {
>> -		qp_attr.srq = ndev->srq;
>> +	if (queue->nsrq) {
>> +		qp_attr.srq = queue->nsrq->srq;
>>   	} else {
>>   		/* +1 for drain */
>>   		qp_attr.cap.max_recv_wr = 1 + queue->recv_queue_size;
>> @@ -1034,7 +1125,7 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
>>   		 __func__, queue->cq->cqe, qp_attr.cap.max_send_sge,
>>   		 qp_attr.cap.max_send_wr, queue->cm_id);
>>
>> -	if (!ndev->srq) {
>> +	if (!queue->nsrq) {
>>   		for (i = 0; i < queue->recv_queue_size; i++) {
>>   			queue->cmds[i].queue = queue;
>>   			ret = nvmet_rdma_post_recv(ndev, &queue->cmds[i]);
>> @@ -1070,7 +1161,7 @@ static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
>>   	nvmet_sq_destroy(&queue->nvme_sq);
>>
>>   	nvmet_rdma_destroy_queue_ib(queue);
>> -	if (!queue->dev->srq) {
>> +	if (!queue->nsrq) {
>>   		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
>>   				queue->recv_queue_size,
>>   				!queue->host_qid);
>> @@ -1182,13 +1273,22 @@ static int nvmet_rdma_cm_reject(struct rdma_cm_id *cm_id,
>>   		goto out_destroy_sq;
>>   	}
>>
>> +	/*
>> +	 * Spread the io queues across completion vectors,
>> +	 * but still keep all admin queues on vector 0.
>> +	 */
>> +	queue->comp_vector = !queue->host_qid ? 0 :
>> +		queue->idx % ndev->device->num_comp_vectors;
>> +
>>   	ret = nvmet_rdma_alloc_rsps(queue);
>>   	if (ret) {
>>   		ret = NVME_RDMA_CM_NO_RSC;
>>   		goto out_ida_remove;
>>   	}
>>
>> -	if (!ndev->srq) {
>> +	if (ndev->srqs) {
>> +		queue->nsrq = ndev->srqs[queue->comp_vector % ndev->srq_count];
>> +	} else {
>>   		queue->cmds = nvmet_rdma_alloc_cmds(ndev,
>>   				queue->recv_queue_size,
>>   				!queue->host_qid);
>> @@ -1209,10 +1309,12 @@ static int nvmet_rdma_cm_reject(struct rdma_cm_id *cm_id,
>>   	return queue;
>>
>>   out_free_cmds:
>> -	if (!ndev->srq) {
>> +	if (!queue->nsrq) {
>>   		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
>>   				queue->recv_queue_size,
>>   				!queue->host_qid);
>> +	} else {
>> +		queue->nsrq = NULL;
> I have no idea if it is right or not, but the logic seems strange.
> If queue->nsrq exists, you nullify the pointer - is that done on purpose?

Yes. nsrq is a pointer taken from ndev->srqs; the SRQ itself was
allocated by the device and is freed together with it (see
nvmet_rdma_destroy_srqs), so on this error path the queue only has to
drop its reference.
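
To spell it out, the error-path hunk above reads like this (same code,
comments added just for this discussion):

out_free_cmds:
	if (!queue->nsrq) {
		/* queue-owned cmds, allocated a few lines earlier: free them */
		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
				queue->recv_queue_size,
				!queue->host_qid);
	} else {
		/*
		 * Borrowed per-device SRQ (ndev->srqs[]): just drop the
		 * reference; nvmet_rdma_destroy_srqs() frees it when the
		 * device goes away.
		 */
		queue->nsrq = NULL;
	}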


>
> Thanks
>
>>   	}
>>   out_free_responses:
>>   	nvmet_rdma_free_rsps(queue);
>> --
>> 1.8.3.1
>>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-18  6:47       ` Leon Romanovsky
@ 2020-03-18  9:46         ` Max Gurtovoy
  2020-03-18 10:29           ` Leon Romanovsky
  0 siblings, 1 reply; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-18  9:46 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi


On 3/18/2020 8:47 AM, Leon Romanovsky wrote:
> On Tue, Mar 17, 2020 at 06:37:57PM +0200, Max Gurtovoy wrote:
>> On 3/17/2020 3:55 PM, Leon Romanovsky wrote:
>>> On Tue, Mar 17, 2020 at 03:40:26PM +0200, Max Gurtovoy wrote:
>>>> ULP's can use this API to create/destroy SRQ's with the same
>>>> characteristics for implementing a logic that aimed to save resources
>>>> without significant performance penalty (e.g. create SRQ per completion
>>>> vector and use shared receive buffers for multiple controllers of the
>>>> ULP).
>>>>
>>>> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>>>> ---
>>>>    drivers/infiniband/core/Makefile  |  2 +-
>>>>    drivers/infiniband/core/srq_set.c | 78 +++++++++++++++++++++++++++++++++++++++
>>>>    drivers/infiniband/core/verbs.c   |  4 ++
>>>>    include/rdma/ib_verbs.h           |  5 +++
>>>>    include/rdma/srq_set.h            | 18 +++++++++
>>>>    5 files changed, 106 insertions(+), 1 deletion(-)
>>>>    create mode 100644 drivers/infiniband/core/srq_set.c
>>>>    create mode 100644 include/rdma/srq_set.h
>>>>
>>>> diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
>>>> index d1b14887..1d3eaec 100644
>>>> --- a/drivers/infiniband/core/Makefile
>>>> +++ b/drivers/infiniband/core/Makefile
>>>> @@ -12,7 +12,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
>>>>    				roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
>>>>    				multicast.o mad.o smi.o agent.o mad_rmpp.o \
>>>>    				nldev.o restrack.o counters.o ib_core_uverbs.o \
>>>> -				trace.o
>>>> +				trace.o srq_set.o
>>> Why did you call it "srq_set.c" and not "srq.c"?
>> because it's not a SRQ API but SRQ-set API.
> I would say that it is SRQ-pool and not SRQ-set API.

If you have some other idea for an API, please share with us.

I've created this API in the core layer to make life easier for the
ULPs, and we can see that it's very easy to add this feature to a ULP
and get a big benefit from it.
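
Roughly, this is all a ULP has to do with the proposed API (a sketch
only - error handling trimmed, and pd/ibdev/srq_size stand for whatever
the ULP already holds):

	struct ib_srq_init_attr srq_attr = {};
	struct ib_srq *srq;
	int ret;

	srq_attr.attr.max_wr = srq_size;
	srq_attr.attr.max_sge = 2;
	srq_attr.srq_type = IB_SRQT_BASIC;

	/* create one SRQ per completion vector, once per PD */
	ret = rdma_srq_set_init(pd, ibdev->num_comp_vectors, &srq_attr);
	if (ret)
		return ret;

	/* each queue/channel takes an SRQ from the set for its QP */
	srq = rdma_srq_get(pd);
	if (!srq)
		return -EAGAIN;

	/* ... and gives it back on teardown */
	rdma_srq_put(pd, srq);
	rdma_srq_set_destroy(pd);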

>
> Thanks

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-18  9:46         ` Max Gurtovoy
@ 2020-03-18 10:29           ` Leon Romanovsky
  2020-03-18 10:39             ` Max Gurtovoy
  0 siblings, 1 reply; 29+ messages in thread
From: Leon Romanovsky @ 2020-03-18 10:29 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi

On Wed, Mar 18, 2020 at 11:46:19AM +0200, Max Gurtovoy wrote:
>
> On 3/18/2020 8:47 AM, Leon Romanovsky wrote:
> > On Tue, Mar 17, 2020 at 06:37:57PM +0200, Max Gurtovoy wrote:
> > > On 3/17/2020 3:55 PM, Leon Romanovsky wrote:
> > > > On Tue, Mar 17, 2020 at 03:40:26PM +0200, Max Gurtovoy wrote:
> > > > > ULP's can use this API to create/destroy SRQ's with the same
> > > > > characteristics for implementing a logic that aimed to save resources
> > > > > without significant performance penalty (e.g. create SRQ per completion
> > > > > vector and use shared receive buffers for multiple controllers of the
> > > > > ULP).
> > > > >
> > > > > Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> > > > > ---
> > > > >    drivers/infiniband/core/Makefile  |  2 +-
> > > > >    drivers/infiniband/core/srq_set.c | 78 +++++++++++++++++++++++++++++++++++++++
> > > > >    drivers/infiniband/core/verbs.c   |  4 ++
> > > > >    include/rdma/ib_verbs.h           |  5 +++
> > > > >    include/rdma/srq_set.h            | 18 +++++++++
> > > > >    5 files changed, 106 insertions(+), 1 deletion(-)
> > > > >    create mode 100644 drivers/infiniband/core/srq_set.c
> > > > >    create mode 100644 include/rdma/srq_set.h
> > > > >
> > > > > diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
> > > > > index d1b14887..1d3eaec 100644
> > > > > --- a/drivers/infiniband/core/Makefile
> > > > > +++ b/drivers/infiniband/core/Makefile
> > > > > @@ -12,7 +12,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
> > > > >    				roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
> > > > >    				multicast.o mad.o smi.o agent.o mad_rmpp.o \
> > > > >    				nldev.o restrack.o counters.o ib_core_uverbs.o \
> > > > > -				trace.o
> > > > > +				trace.o srq_set.o
> > > > Why did you call it "srq_set.c" and not "srq.c"?
> > > because it's not a SRQ API but SRQ-set API.
> > I would say that it is SRQ-pool and not SRQ-set API.
>
> If you have some other idea for an API, please share with us.
>
> I've created this API in the core layer to make life easier for the
> ULPs, and we can see that it's very easy to add this feature to a ULP
> and get a big benefit from it.

No one here is arguing against the feature; we are trying to understand
the rationale behind the *_set name and why you chose the term "set"
rather than "pool", as was done for the MR pool.

So my suggestion is to name the file srq_pool.c and use rdma_srq_pool_*()
function names.

Thanks

>
> >
> > Thanks

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-18 10:29           ` Leon Romanovsky
@ 2020-03-18 10:39             ` Max Gurtovoy
  2020-03-18 10:46               ` Leon Romanovsky
  0 siblings, 1 reply; 29+ messages in thread
From: Max Gurtovoy @ 2020-03-18 10:39 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi


On 3/18/2020 12:29 PM, Leon Romanovsky wrote:
> On Wed, Mar 18, 2020 at 11:46:19AM +0200, Max Gurtovoy wrote:
>> On 3/18/2020 8:47 AM, Leon Romanovsky wrote:
>>> On Tue, Mar 17, 2020 at 06:37:57PM +0200, Max Gurtovoy wrote:
>>>> On 3/17/2020 3:55 PM, Leon Romanovsky wrote:
>>>>> On Tue, Mar 17, 2020 at 03:40:26PM +0200, Max Gurtovoy wrote:
>>>>>> ULP's can use this API to create/destroy SRQ's with the same
>>>>>> characteristics for implementing a logic that aimed to save resources
>>>>>> without significant performance penalty (e.g. create SRQ per completion
>>>>>> vector and use shared receive buffers for multiple controllers of the
>>>>>> ULP).
>>>>>>
>>>>>> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>>>>>> ---
>>>>>>     drivers/infiniband/core/Makefile  |  2 +-
>>>>>>     drivers/infiniband/core/srq_set.c | 78 +++++++++++++++++++++++++++++++++++++++
>>>>>>     drivers/infiniband/core/verbs.c   |  4 ++
>>>>>>     include/rdma/ib_verbs.h           |  5 +++
>>>>>>     include/rdma/srq_set.h            | 18 +++++++++
>>>>>>     5 files changed, 106 insertions(+), 1 deletion(-)
>>>>>>     create mode 100644 drivers/infiniband/core/srq_set.c
>>>>>>     create mode 100644 include/rdma/srq_set.h
>>>>>>
>>>>>> diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
>>>>>> index d1b14887..1d3eaec 100644
>>>>>> --- a/drivers/infiniband/core/Makefile
>>>>>> +++ b/drivers/infiniband/core/Makefile
>>>>>> @@ -12,7 +12,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
>>>>>>     				roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
>>>>>>     				multicast.o mad.o smi.o agent.o mad_rmpp.o \
>>>>>>     				nldev.o restrack.o counters.o ib_core_uverbs.o \
>>>>>> -				trace.o
>>>>>> +				trace.o srq_set.o
>>>>> Why did you call it "srq_set.c" and not "srq.c"?
>>>> because it's not a SRQ API but SRQ-set API.
>>> I would say that it is SRQ-pool and not SRQ-set API.
>> If you have some other idea for an API, please share with us.
>>
>> I've created this API in the core layer to make life easier for the
>> ULPs, and we can see that it's very easy to add this feature to a ULP
>> and get a big benefit from it.
> No one here is arguing against the feature; we are trying to understand
> the rationale behind the *_set name and why you chose the term "set"
> rather than "pool", as was done for the MR pool.

Ok. But srq_pool was the name I used 2 years ago and you didn't like 
this back then.

So I renamed it to srq_set.

>
> So my suggestion is to name the file srq_pool.c and use rdma_srq_pool_*()
> function names.

No problem.

>
> Thanks
>
>>> Thanks

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/5] IB/core: add a simple SRQ set per PD
  2020-03-18 10:39             ` Max Gurtovoy
@ 2020-03-18 10:46               ` Leon Romanovsky
  0 siblings, 0 replies; 29+ messages in thread
From: Leon Romanovsky @ 2020-03-18 10:46 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: loberman, bvanassche, vladimirk, shlomin, linux-rdma, linux-nvme,
	idanb, dledford, jgg, oren, kbusch, hch, sagi

On Wed, Mar 18, 2020 at 12:39:44PM +0200, Max Gurtovoy wrote:
>
> On 3/18/2020 12:29 PM, Leon Romanovsky wrote:
> > On Wed, Mar 18, 2020 at 11:46:19AM +0200, Max Gurtovoy wrote:
> > > On 3/18/2020 8:47 AM, Leon Romanovsky wrote:
> > > > On Tue, Mar 17, 2020 at 06:37:57PM +0200, Max Gurtovoy wrote:
> > > > > On 3/17/2020 3:55 PM, Leon Romanovsky wrote:
> > > > > > On Tue, Mar 17, 2020 at 03:40:26PM +0200, Max Gurtovoy wrote:
> > > > > > > ULP's can use this API to create/destroy SRQ's with the same
> > > > > > > characteristics for implementing a logic that aimed to save resources
> > > > > > > without significant performance penalty (e.g. create SRQ per completion
> > > > > > > vector and use shared receive buffers for multiple controllers of the
> > > > > > > ULP).
> > > > > > >
> > > > > > > Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> > > > > > > ---
> > > > > > >     drivers/infiniband/core/Makefile  |  2 +-
> > > > > > >     drivers/infiniband/core/srq_set.c | 78 +++++++++++++++++++++++++++++++++++++++
> > > > > > >     drivers/infiniband/core/verbs.c   |  4 ++
> > > > > > >     include/rdma/ib_verbs.h           |  5 +++
> > > > > > >     include/rdma/srq_set.h            | 18 +++++++++
> > > > > > >     5 files changed, 106 insertions(+), 1 deletion(-)
> > > > > > >     create mode 100644 drivers/infiniband/core/srq_set.c
> > > > > > >     create mode 100644 include/rdma/srq_set.h
> > > > > > >
> > > > > > > diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
> > > > > > > index d1b14887..1d3eaec 100644
> > > > > > > --- a/drivers/infiniband/core/Makefile
> > > > > > > +++ b/drivers/infiniband/core/Makefile
> > > > > > > @@ -12,7 +12,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
> > > > > > >     				roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
> > > > > > >     				multicast.o mad.o smi.o agent.o mad_rmpp.o \
> > > > > > >     				nldev.o restrack.o counters.o ib_core_uverbs.o \
> > > > > > > -				trace.o
> > > > > > > +				trace.o srq_set.o
> > > > > > Why did you call it "srq_set.c" and not "srq.c"?
> > > > > because it's not a SRQ API but SRQ-set API.
> > > > I would say that it is SRQ-pool and not SRQ-set API.
> > > If you have some other idea for an API, please share with us.
> > >
> > > I've created this API in the core layer to make life easier for the
> > > ULPs, and we can see that it's very easy to add this feature to a ULP
> > > and get a big benefit from it.
> > No one here is arguing against the feature; we are trying to understand
> > the rationale behind the *_set name and why you chose the term "set"
> > rather than "pool", as was done for the MR pool.
>
> Ok. But srq_pool was the name I used 2 years ago and you didn't like this
> back then.

I don't like it today either, but I don't have a better name to suggest.

Thanks

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-03-18 10:46 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-17 13:40 [PATCH 0/5] nvmet-rdma/srpt: SRQ per completion vector Max Gurtovoy
2020-03-17 13:40 ` [PATCH 1/5] IB/core: add a simple SRQ set per PD Max Gurtovoy
2020-03-17 13:55   ` Leon Romanovsky
2020-03-17 16:37     ` Max Gurtovoy
2020-03-17 18:10       ` Jason Gunthorpe
2020-03-17 18:24         ` Max Gurtovoy
2020-03-17 18:43           ` Jason Gunthorpe
2020-03-17 21:56             ` Max Gurtovoy
2020-03-17 19:54         ` Leon Romanovsky
2020-03-18  6:47       ` Leon Romanovsky
2020-03-18  9:46         ` Max Gurtovoy
2020-03-18 10:29           ` Leon Romanovsky
2020-03-18 10:39             ` Max Gurtovoy
2020-03-18 10:46               ` Leon Romanovsky
2020-03-17 13:40 ` [PATCH 2/5] nvmet-rdma: add srq pointer to rdma_cmd Max Gurtovoy
2020-03-17 13:40 ` [PATCH 3/5] nvmet-rdma: use SRQ per completion vector Max Gurtovoy
2020-03-18  6:53   ` Leon Romanovsky
2020-03-18  9:39     ` Max Gurtovoy
2020-03-17 13:40 ` [PATCH 4/5] IB/core: cache the CQ " Max Gurtovoy
2020-03-17 15:19   ` Chuck Lever
2020-03-17 15:41     ` Max Gurtovoy
2020-03-17 20:36       ` Chuck Lever
2020-03-17 22:18         ` Max Gurtovoy
2020-03-17 22:50       ` Bart Van Assche
2020-03-17 23:26         ` Max Gurtovoy
2020-03-17 13:40 ` [PATCH 5/5] RDMA/srpt: use SRQ per " Max Gurtovoy
2020-03-17 13:58   ` Leon Romanovsky
2020-03-17 16:43     ` Max Gurtovoy
2020-03-17 19:58   ` Leon Romanovsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).