* [PATCH 0/3] nvmet-rdma: SRQ per completion vector
@ 2017-09-05 10:59 Max Gurtovoy
  2017-09-05 10:59 ` [PATCH 1/3] nvmet-rdma: use srq pointer in rdma_cmd Max Gurtovoy
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Max Gurtovoy @ 2017-09-05 10:59 UTC (permalink / raw)


This is a new feature for the NVMEoF RDMA target, intended to save resource allocations
(by sharing them) and utilize locality (of completions and memory) to get the best
performance with Shared Receive Queues (SRQs). We'll create an SRQ per completion vector
(and not per device) and associate each created QP/CQ with an appropriate SRQ. This will
also reduce the lock contention on the single SRQ per device (today's solution).

My testing environment included 4 initiators (CX5, CX5, CX4, CX3) that were connected to 4
subsystems (1 ns per subsystem) through 2 ports (each initiator connected to a unique
subsystem backed by a different null_blk device) using a switch to the NVMEoF target (CX5).
I used the RoCE link layer.

Configuration:
 - irqbalance stopped on each server
 - set_irq_affinity.sh run on each interface
 - 2 initiators run traffic through port 1
 - 2 initiators run traffic through port 2
 - register_always=N set on the initiators
 - fio with 12 jobs, iodepth 128 (a sample invocation is sketched below)
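
For reference, a fio invocation along these lines matches the parameters
above (reconstructed; the exact flags, block size per run and device
names are assumptions, not copied from the test scripts):

  fio --name=nvmf-test --filename=/dev/nvme0n1 --direct=1 \
      --ioengine=libaio --rw=randread --bs=4k --numjobs=12 \
      --iodepth=128 --time_based --runtime=60 --group_reporting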

Memory consumption calculation for recv buffers (target):
 - Multiple SRQ: SRQ_size * comp_num * ib_devs_num * inline_buffer_size
 - Single SRQ: SRQ_size * 1 * ib_devs_num * inline_buffer_size
 - MQ: RQ_size * CPU_num * ctrl_num * inline_buffer_size

Cases:
 1. Multiple SRQ with 1024 entries:
    - Mem = 1024 * 24 * 2 * 4k = 192MiB (constant - does not depend on the number of initiators)
 2. Multiple SRQ with 256 entries:
    - Mem = 256 * 24 * 2 * 4k = 48MiB (constant - does not depend on the number of initiators)
 3. MQ:
    - Mem = 256 * 24 * 8 * 4k = 192MiB (memory grows with every newly created ctrl)
 4. Single SRQ (current SRQ implementation):
    - Mem = 4096 * 1 * 2 * 4k = 32MiB (constant - does not depend on the number of initiators)
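
As a quick sanity check of the arithmetic above, the following
stand-alone snippet reproduces the four numbers (illustration only; it
uses 24 for comp_num/CPU_num, 2 for ib_devs_num, 8 for ctrl_num and a
4k inline buffer, as in the cases above):

  #include <stdio.h>

  int main(void)
  {
          long buf = 4096;        /* inline_buffer_size (PAGE_SIZE = 4k) */

          printf("1. Multiple SRQ (1024): %ld MiB\n", 1024 * 24 * 2 * buf >> 20);
          printf("2. Multiple SRQ (256):  %ld MiB\n",  256 * 24 * 2 * buf >> 20);
          printf("3. MQ:                  %ld MiB\n",  256 * 24 * 8 * buf >> 20);
          printf("4. Single SRQ:          %ld MiB\n", 4096 *  1 * 2 * buf >> 20);
          return 0;
  }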

Results (columns refer to Cases 1-4 above; numbers are IOPS in millions, target CPU utilization in parentheses):

BS   1.read (target CPU)   2.read (target CPU)   3.read (target CPU)   4.read (target CPU)
---  --------------------  --------------------  --------------------  --------------------
1k   5.88M (80%)           5.45M (72%)           6.77M (91%)           2.2M  (72%)
2k   3.56M (65%)           3.45M (59%)           3.72M (64%)           2.12M (59%)
4k   1.8M  (33%)           1.87M (32%)           1.88M (32%)           1.59M (34%)

BS   1.write (target CPU)  2.write (target CPU)  3.write (target CPU)  4.write (target CPU)
---  --------------------  --------------------  --------------------  --------------------
1k   5.42M (63%)           5.14M (55%)           7.75M (82%)           2.14M (74%)
2k   4.15M (56%)           4.14M (51%)           4.16M (52%)           2.08M (73%)
4k   2.17M (28%)           2.17M (27%)           2.16M (28%)           1.62M (24%)


We can see the perf improvement between Case 2 and Case 4 (same order of resources).
We can see the benefit in resource consumption (mem and CPU), with a small perf loss,
between Cases 2 and 3.
There is still an open question about the perf difference at 1k between Case 1 and
Case 3, but I guess we can investigate and improve it incrementally.

Thanks to Idan Burstein and Oren Duer for suggesting this nice feature.

Max Gurtovoy (3):
  nvmet-rdma: use srq pointer in rdma_cmd
  nvmet-rdma: use SRQ per completion vector
  nvmet-rdma: add module parameter for SRQ size

 drivers/nvme/target/rdma.c |  156 ++++++++++++++++++++++++++++++++++----------
 1 files changed, 121 insertions(+), 35 deletions(-)


* [PATCH 1/3] nvmet-rdma: use srq pointer in rdma_cmd
  2017-09-05 10:59 [PATCH 0/3] nvmet-rdma: SRQ per completion vector Max Gurtovoy
@ 2017-09-05 10:59 ` Max Gurtovoy
  2017-09-06 10:30   ` Christoph Hellwig
  2017-09-05 10:59 ` [PATCH 2/3] nvmet-rdma: use SRQ per completion vector Max Gurtovoy
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Max Gurtovoy @ 2017-09-05 10:59 UTC (permalink / raw)


This is a preparation patch that creates the infrastructure in the post
recv mechanism for the SRQ per completion vector feature.

Signed-off-by: Max Gurtovoy <maxg at mellanox.com>
---
 drivers/nvme/target/rdma.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 56a4cba..fb322b3 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -45,6 +45,7 @@ struct nvmet_rdma_cmd {
 	struct page		*inline_page;
 	struct nvme_command     *nvme_cmd;
 	struct nvmet_rdma_queue	*queue;
+	struct ib_srq		*srq;
 };
 
 enum {
@@ -442,8 +443,8 @@ static int nvmet_rdma_post_recv(struct nvmet_rdma_device *ndev,
 		cmd->sge[0].addr, cmd->sge[0].length,
 		DMA_FROM_DEVICE);
 
-	if (ndev->srq)
-		return ib_post_srq_recv(ndev->srq, &cmd->wr, &bad_wr);
+	if (cmd->srq)
+		return ib_post_srq_recv(cmd->srq, &cmd->wr, &bad_wr);
 	return ib_post_recv(cmd->queue->cm_id->qp, &cmd->wr, &bad_wr);
 }
 
@@ -819,8 +820,10 @@ static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
 	ndev->srq = srq;
 	ndev->srq_size = srq_size;
 
-	for (i = 0; i < srq_size; i++)
+	for (i = 0; i < srq_size; i++) {
+		ndev->srq_cmds[i].srq = srq;
 		nvmet_rdma_post_recv(ndev, &ndev->srq_cmds[i]);
+	}
 
 	return 0;
 
-- 
1.7.1


* [PATCH 2/3] nvmet-rdma: use SRQ per completion vector
  2017-09-05 10:59 [PATCH 0/3] nvmet-rdma: SRQ per completion vector Max Gurtovoy
  2017-09-05 10:59 ` [PATCH 1/3] nvmet-rdma: use srq pointer in rdma_cmd Max Gurtovoy
@ 2017-09-05 10:59 ` Max Gurtovoy
  2017-09-06 10:38   ` Christoph Hellwig
  2017-09-06 14:50   ` Sagi Grimberg
  2017-09-05 10:59 ` [PATCH 3/3] nvmet-rdma: add module parameter for SRQ size Max Gurtovoy
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Max Gurtovoy @ 2017-09-05 10:59 UTC (permalink / raw)


In order to save resource allocation and utilize the completion
locality in a better way, allocate Shared Receive Queues (SRQs) per
completion vector (and not per device). Assosiate each created QP/CQ
with an appropriate SRQ according to the queue index. This association
will reduce the lock contention in the fast path and increase the
locality in memory buffers.

Signed-off-by: Max Gurtovoy <maxg at mellanox.com>
---
 drivers/nvme/target/rdma.c |  132 ++++++++++++++++++++++++++++++++------------
 1 files changed, 97 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index fb322b3..1b52080 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -37,6 +37,8 @@
  */
 #define NVMET_RDMA_INLINE_DATA_SIZE	PAGE_SIZE
 
+struct nvmet_rdma_srq;
+
 struct nvmet_rdma_cmd {
 	struct ib_sge		sge[2];
 	struct ib_cqe		cqe;
@@ -45,7 +47,7 @@ struct nvmet_rdma_cmd {
 	struct page		*inline_page;
 	struct nvme_command     *nvme_cmd;
 	struct nvmet_rdma_queue	*queue;
-	struct ib_srq		*srq;
+	struct nvmet_rdma_srq   *nsrq;
 };
 
 enum {
@@ -87,6 +89,7 @@ struct nvmet_rdma_queue {
 	struct ib_cq		*cq;
 	atomic_t		sq_wr_avail;
 	struct nvmet_rdma_device *dev;
+	struct nvmet_rdma_srq   *nsrq;
 	spinlock_t		state_lock;
 	enum nvmet_rdma_queue_state state;
 	struct nvmet_cq		nvme_cq;
@@ -104,18 +107,25 @@ struct nvmet_rdma_queue {
 
 	int			idx;
 	int			host_qid;
+	int			comp_vector;
 	int			recv_queue_size;
 	int			send_queue_size;
 
 	struct list_head	queue_list;
 };
 
+struct nvmet_rdma_srq {
+	struct ib_srq            *srq;
+	struct nvmet_rdma_cmd    *srq_cmds;
+	size_t                   srq_size;
+	struct nvmet_rdma_device *ndev;
+};
+
 struct nvmet_rdma_device {
 	struct ib_device	*device;
 	struct ib_pd		*pd;
-	struct ib_srq		*srq;
-	struct nvmet_rdma_cmd	*srq_cmds;
-	size_t			srq_size;
+	struct nvmet_rdma_srq	**srqs;
+	int			srq_count;
 	struct kref		ref;
 	struct list_head	entry;
 };
@@ -443,8 +453,8 @@ static int nvmet_rdma_post_recv(struct nvmet_rdma_device *ndev,
 		cmd->sge[0].addr, cmd->sge[0].length,
 		DMA_FROM_DEVICE);
 
-	if (cmd->srq)
-		return ib_post_srq_recv(cmd->srq, &cmd->wr, &bad_wr);
+	if (cmd->nsrq)
+		return ib_post_srq_recv(cmd->nsrq->srq, &cmd->wr, &bad_wr);
 	return ib_post_recv(cmd->queue->cm_id->qp, &cmd->wr, &bad_wr);
 }
 
@@ -779,22 +789,42 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 	nvmet_rdma_handle_command(queue, rsp);
 }
 
-static void nvmet_rdma_destroy_srq(struct nvmet_rdma_device *ndev)
+static void nvmet_rdma_destroy_srq(struct nvmet_rdma_srq *nsrq)
+{
+	if (!nsrq)
+		return;
+
+	nvmet_rdma_free_cmds(nsrq->ndev, nsrq->srq_cmds, nsrq->srq_size, false);
+	ib_destroy_srq(nsrq->srq);
+
+	kfree(nsrq);
+}
+
+static void nvmet_rdma_destroy_srqs(struct nvmet_rdma_device *ndev)
 {
-	if (!ndev->srq)
+	int i;
+
+	if (!ndev->srqs)
 		return;
 
-	nvmet_rdma_free_cmds(ndev, ndev->srq_cmds, ndev->srq_size, false);
-	ib_destroy_srq(ndev->srq);
+	for (i = 0; i < ndev->srq_count; i++)
+		nvmet_rdma_destroy_srq(ndev->srqs[i]);
+
+	kfree(ndev->srqs);
 }
 
-static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
+static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev, int index)
 {
 	struct ib_srq_init_attr srq_attr = { NULL, };
+	struct nvmet_rdma_srq *nsrq;
 	struct ib_srq *srq;
 	size_t srq_size;
 	int ret, i;
 
+	nsrq = kzalloc(sizeof(*nsrq), GFP_KERNEL);
+	if (!nsrq)
+		return -ENOMEM;
+
 	srq_size = 4095;	/* XXX: tune */
 
 	srq_attr.attr.max_wr = srq_size;
@@ -808,27 +838,57 @@ static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
 		 * non-shared receive queues.
 		 */
 		pr_info("SRQ requested but not supported.\n");
+		kfree(nsrq);
 		return 0;
 	}
 
-	ndev->srq_cmds = nvmet_rdma_alloc_cmds(ndev, srq_size, false);
-	if (IS_ERR(ndev->srq_cmds)) {
-		ret = PTR_ERR(ndev->srq_cmds);
+	nsrq->srq_cmds = nvmet_rdma_alloc_cmds(ndev, srq_size, false);
+	if (IS_ERR(nsrq->srq_cmds)) {
+		ret = PTR_ERR(nsrq->srq_cmds);
 		goto out_destroy_srq;
 	}
 
-	ndev->srq = srq;
-	ndev->srq_size = srq_size;
+	nsrq->srq = srq;
+	nsrq->srq_size = srq_size;
+	nsrq->ndev = ndev;
+	ndev->srqs[index] = nsrq;
 
 	for (i = 0; i < srq_size; i++) {
-		ndev->srq_cmds[i].srq = srq;
-		nvmet_rdma_post_recv(ndev, &ndev->srq_cmds[i]);
+		nsrq->srq_cmds[i].nsrq = nsrq;
+		nvmet_rdma_post_recv(ndev, &nsrq->srq_cmds[i]);
 	}
 
 	return 0;
 
 out_destroy_srq:
 	ib_destroy_srq(srq);
+	kfree(nsrq);
+	return ret;
+}
+
+static int nvmet_rdma_init_srqs(struct nvmet_rdma_device *ndev)
+{
+	int srq_count = ndev->device->num_comp_vectors;
+	int i, ret;
+
+	ndev->srqs = kzalloc(srq_count * sizeof(*ndev->srqs), GFP_KERNEL);
+	if (!ndev->srqs)
+		return -ENOMEM;
+
+	for (i = 0; i < srq_count; i++) {
+		ret = nvmet_rdma_init_srq(ndev, i);
+		if (ret)
+			goto err_srq;
+	}
+
+	ndev->srq_count = srq_count;
+	return 0;
+
+err_srq:
+	while (--i >= 0)
+		nvmet_rdma_destroy_srq(ndev->srqs[i]);
+
+	kfree(ndev->srqs);
 	return ret;
 }
 
@@ -841,7 +901,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
 	list_del(&ndev->entry);
 	mutex_unlock(&device_list_mutex);
 
-	nvmet_rdma_destroy_srq(ndev);
+	nvmet_rdma_destroy_srqs(ndev);
 	ib_dealloc_pd(ndev->pd);
 
 	kfree(ndev);
@@ -872,7 +932,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
 		goto out_free_dev;
 
 	if (nvmet_rdma_use_srq) {
-		ret = nvmet_rdma_init_srq(ndev);
+		ret = nvmet_rdma_init_srqs(ndev);
 		if (ret)
 			goto out_free_pd;
 	}
@@ -896,14 +956,7 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 {
 	struct ib_qp_init_attr qp_attr;
 	struct nvmet_rdma_device *ndev = queue->dev;
-	int comp_vector, nr_cqe, ret, i;
-
-	/*
-	 * Spread the io queues across completion vectors,
-	 * but still keep all admin queues on vector 0.
-	 */
-	comp_vector = !queue->host_qid ? 0 :
-		queue->idx % ndev->device->num_comp_vectors;
+	int nr_cqe, ret, i;
 
 	/*
 	 * Reserve CQ slots for RECV + RDMA_READ/RDMA_WRITE + RDMA_SEND.
@@ -911,7 +964,7 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 	nr_cqe = queue->recv_queue_size + 2 * queue->send_queue_size;
 
 	queue->cq = ib_alloc_cq(ndev->device, queue,
-			nr_cqe + 1, comp_vector,
+			nr_cqe + 1, queue->comp_vector,
 			IB_POLL_WORKQUEUE);
 	if (IS_ERR(queue->cq)) {
 		ret = PTR_ERR(queue->cq);
@@ -933,8 +986,8 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 	qp_attr.cap.max_send_sge = max(ndev->device->attrs.max_sge_rd,
 					ndev->device->attrs.max_sge);
 
-	if (ndev->srq) {
-		qp_attr.srq = ndev->srq;
+	if (queue->nsrq) {
+		qp_attr.srq = queue->nsrq->srq;
 	} else {
 		/* +1 for drain */
 		qp_attr.cap.max_recv_wr = 1 + queue->recv_queue_size;
@@ -953,7 +1006,7 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 		 __func__, queue->cq->cqe, qp_attr.cap.max_send_sge,
 		 qp_attr.cap.max_send_wr, queue->cm_id);
 
-	if (!ndev->srq) {
+	if (!queue->nsrq) {
 		for (i = 0; i < queue->recv_queue_size; i++) {
 			queue->cmds[i].queue = queue;
 			nvmet_rdma_post_recv(ndev, &queue->cmds[i]);
@@ -982,7 +1035,7 @@ static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
 	nvmet_sq_destroy(&queue->nvme_sq);
 
 	nvmet_rdma_destroy_queue_ib(queue);
-	if (!queue->dev->srq) {
+	if (!queue->nsrq) {
 		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
 				queue->recv_queue_size,
 				!queue->host_qid);
@@ -1099,13 +1152,22 @@ static int nvmet_rdma_cm_reject(struct rdma_cm_id *cm_id,
 		goto out_destroy_sq;
 	}
 
+	/*
+	 * Spread the io queues across completion vectors,
+	 * but still keep all admin queues on vector 0.
+	 */
+	queue->comp_vector = !queue->host_qid ? 0 :
+		queue->idx % ndev->device->num_comp_vectors;
+
 	ret = nvmet_rdma_alloc_rsps(queue);
 	if (ret) {
 		ret = NVME_RDMA_CM_NO_RSC;
 		goto out_ida_remove;
 	}
 
-	if (!ndev->srq) {
+	if (ndev->srqs && ndev->srqs[queue->comp_vector % ndev->srq_count]) {
+		queue->nsrq = ndev->srqs[queue->comp_vector % ndev->srq_count];
+	} else {
 		queue->cmds = nvmet_rdma_alloc_cmds(ndev,
 				queue->recv_queue_size,
 				!queue->host_qid);
@@ -1126,7 +1188,7 @@ static int nvmet_rdma_cm_reject(struct rdma_cm_id *cm_id,
 	return queue;
 
 out_free_cmds:
-	if (!ndev->srq) {
+	if (!queue->nsrq) {
 		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
 				queue->recv_queue_size,
 				!queue->host_qid);
-- 
1.7.1


* [PATCH 3/3] nvmet-rdma: add module parameter for SRQ size
  2017-09-05 10:59 [PATCH 0/3] nvmet-rdma: SRQ per completion vector Max Gurtovoy
  2017-09-05 10:59 ` [PATCH 1/3] nvmet-rdma: use srq pointer in rdma_cmd Max Gurtovoy
  2017-09-05 10:59 ` [PATCH 2/3] nvmet-rdma: use SRQ per completion vector Max Gurtovoy
@ 2017-09-05 10:59 ` Max Gurtovoy
  2017-09-06 10:39   ` Christoph Hellwig
  2017-09-06 15:02   ` Sagi Grimberg
  2017-09-06 14:40 ` [PATCH 0/3] nvmet-rdma: SRQ per completion vector Sagi Grimberg
  2017-11-07 14:40 ` Max Gurtovoy
  4 siblings, 2 replies; 15+ messages in thread
From: Max Gurtovoy @ 2017-09-05 10:59 UTC (permalink / raw)


Adjust SRQ size according to the expected load. Make sure the
size is >= 256 to avoid lack of resources.

Signed-off-by: Max Gurtovoy <maxg at mellanox.com>
---
 drivers/nvme/target/rdma.c |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 1b52080..34868ad 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -134,6 +134,16 @@ struct nvmet_rdma_device {
 module_param_named(use_srq, nvmet_rdma_use_srq, bool, 0444);
 MODULE_PARM_DESC(use_srq, "Use shared receive queue.");
 
+static int srq_size_set(const char *val, const struct kernel_param *kp);
+static const struct kernel_param_ops srq_size_ops = {
+	.set = srq_size_set,
+	.get = param_get_int,
+};
+
+static int nvmet_rdma_srq_size = 4095;
+module_param_cb(srq_size, &srq_size_ops, &nvmet_rdma_srq_size, 0644);
+MODULE_PARM_DESC(srq_size, "set Shared Receive Queue (SRQ) size, should >= 256 (default: 4095)");
+
 static DEFINE_IDA(nvmet_rdma_queue_ida);
 static LIST_HEAD(nvmet_rdma_queue_list);
 static DEFINE_MUTEX(nvmet_rdma_queue_mutex);
@@ -150,6 +160,17 @@ struct nvmet_rdma_device {
 
 static struct nvmet_fabrics_ops nvmet_rdma_ops;
 
+static int srq_size_set(const char *val, const struct kernel_param *kp)
+{
+	int n = 0, ret;
+
+	ret = kstrtoint(val, 10, &n);
+	if (ret != 0 || n < 256)
+		return -EINVAL;
+
+	return param_set_int(val, kp);
+}
+
 /* XXX: really should move to a generic header sooner or later.. */
 static inline u32 get_unaligned_le24(const u8 *p)
 {
@@ -825,7 +846,7 @@ static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev, int index)
 	if (!nsrq)
 		return -ENOMEM;
 
-	srq_size = 4095;	/* XXX: tune */
+	srq_size = nvmet_rdma_srq_size;
 
 	srq_attr.attr.max_wr = srq_size;
 	srq_attr.attr.max_sge = 2;
-- 
1.7.1


* [PATCH 1/3] nvmet-rdma: use srq pointer in rdma_cmd
  2017-09-05 10:59 ` [PATCH 1/3] nvmet-rdma: use srq pointer in rdma_cmd Max Gurtovoy
@ 2017-09-06 10:30   ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2017-09-06 10:30 UTC (permalink / raw)


This looks ok, but not very useful on its own.  I'd be tempted to
just squash it into the next patch.


* [PATCH 2/3] nvmet-rdma: use SRQ per completion vector
  2017-09-05 10:59 ` [PATCH 2/3] nvmet-rdma: use SRQ per completion vector Max Gurtovoy
@ 2017-09-06 10:38   ` Christoph Hellwig
  2017-09-06 14:57     ` Sagi Grimberg
  2017-09-06 14:50   ` Sagi Grimberg
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2017-09-06 10:38 UTC (permalink / raw)


> @@ -104,18 +107,25 @@ struct nvmet_rdma_queue {
>  
>  	int			idx;
>  	int			host_qid;
> +	int			comp_vector;

How will this interact with these changes:

http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/rdma-cq.2

which I'd really like to see in?

> +struct nvmet_rdma_srq {
> +	struct ib_srq            *srq;
> +	struct nvmet_rdma_cmd    *srq_cmds;
> +	size_t                   srq_size;

Please kill the srq_ prefix, given that it's already implied by the
structure.

> +static void nvmet_rdma_destroy_srq(struct nvmet_rdma_srq *nsrq)
> +{
> +	if (!nsrq)
> +		return;

We should never get here with a null pointer, right?

>  	srq_attr.attr.max_wr = srq_size;
> @@ -808,27 +838,57 @@ static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev)
>  		 * non-shared receive queues.
>  		 */
>  		pr_info("SRQ requested but not supported.\n");
> +		kfree(nsrq);
>  		return 0;

goto out_free_nsrq..

> +	ndev->srqs[index] = nsrq;

Move this to the caller so that we don't need to pass in the index.

> +	ndev->srqs = kzalloc(srq_count * sizeof(*ndev->srqs), GFP_KERNEL);

kcalloc / kmalloc_array?

> +	if (ndev->srqs && ndev->srqs[queue->comp_vector % ndev->srq_count]) {

ndev->srqs[queue->comp_vector % ndev->srq_count] should always be
non-NULL, right?


* [PATCH 3/3] nvmet-rdma: add module parameter for SRQ size
  2017-09-05 10:59 ` [PATCH 3/3] nvmet-rdma: add module parameter for SRQ size Max Gurtovoy
@ 2017-09-06 10:39   ` Christoph Hellwig
  2017-09-06 15:02   ` Sagi Grimberg
  1 sibling, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2017-09-06 10:39 UTC (permalink / raw)


On Tue, Sep 05, 2017 at 01:59:17PM +0300, Max Gurtovoy wrote:
> Adjust SRQ size according to the expected load. Make sure the
> size is >= 256 to avoid lack of resources.
> 
> Signed-off-by: Max Gurtovoy <maxg at mellanox.com>
> ---
>  drivers/nvme/target/rdma.c |   23 ++++++++++++++++++++++-
>  1 files changed, 22 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 1b52080..34868ad 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -134,6 +134,16 @@ struct nvmet_rdma_device {
>  module_param_named(use_srq, nvmet_rdma_use_srq, bool, 0444);
>  MODULE_PARM_DESC(use_srq, "Use shared receive queue.");
>  
> +static int srq_size_set(const char *val, const struct kernel_param *kp);
> +static const struct kernel_param_ops srq_size_ops = {
> +	.set = srq_size_set,
> +	.get = param_get_int,
> +};
> +
> +static int nvmet_rdma_srq_size = 4095;
> +module_param_cb(srq_size, &srq_size_ops, &nvmet_rdma_srq_size, 0644);
> +MODULE_PARM_DESC(srq_size, "set Shared Receive Queue (SRQ) size, should >= 256 (default: 4095)");
> +
>  static DEFINE_IDA(nvmet_rdma_queue_ida);
>  static LIST_HEAD(nvmet_rdma_queue_list);
>  static DEFINE_MUTEX(nvmet_rdma_queue_mutex);
> @@ -150,6 +160,17 @@ struct nvmet_rdma_device {
>  
>  static struct nvmet_fabrics_ops nvmet_rdma_ops;
>  
> +static int srq_size_set(const char *val, const struct kernel_param *kp)
> +{
> +	int n = 0, ret;
> +
> +	ret = kstrtoint(val, 10, &n);
> +	if (ret != 0 || n < 256)
> +		return -EINVAL;
> +
> +	return param_set_int(val, kp);
> +}

Can you add a new param_set_int_minmax helper to the core module code
instead of having to duplicate this sort of logic in all kinds of drivers?
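
A minimal sketch of such a helper (the name, signature and placement are
assumptions; nothing like this exists in the module core yet):

  static int param_set_int_minmax(const char *val,
                                  const struct kernel_param *kp,
                                  int min, int max)
  {
          int num, ret;

          ret = kstrtoint(val, 0, &num);
          if (ret)
                  return ret;
          if (num < min || num > max)
                  return -EINVAL;
          *((int *)kp->arg) = num;
          return 0;
  }

A driver's .set callback would then just forward to it with its own
limits.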

Otherwise this looks fine to me.


* [PATCH 0/3] nvmet-rdma: SRQ per completion vector
  2017-09-05 10:59 [PATCH 0/3] nvmet-rdma: SRQ per completion vector Max Gurtovoy
                   ` (2 preceding siblings ...)
  2017-09-05 10:59 ` [PATCH 3/3] nvmet-rdma: add module parameter for SRQ size Max Gurtovoy
@ 2017-09-06 14:40 ` Sagi Grimberg
  2017-11-07 14:40 ` Max Gurtovoy
  4 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2017-09-06 14:40 UTC (permalink / raw)


Hi Max,

> This is a new feature for the NVMEoF RDMA target, intended to save resource allocations
> (by sharing them) and utilize locality (of completions and memory) to get the best
> performance with Shared Receive Queues (SRQs). We'll create an SRQ per completion vector
> (and not per device) and associate each created QP/CQ with an appropriate SRQ. This will
> also reduce the lock contention on the single SRQ per device (today's solution).

I have a similar patch set which I've been using for some time now. I've
been reluctant to submit it because I think we need the rdma core to
create an API that helps everyone get it right, rather than adding
features to a subsystem we happen to be focused on at a given time.

Couple of thoughts:

First of all, allocating num_comp_vectors SRQs and pre-posted buffers
on the first connection establishment has a big impact on the time it
takes to establish it (with enough CPU cores the host might give up).
We should look into allocating lazily as we grow; a rough sketch of the
idea follows.
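
A rough sketch of the lazy approach (nvmet_rdma_get_srq() is a made-up
name; it assumes ndev->srqs[] is sized to num_comp_vectors up front and
that the caller serializes against concurrent queue creation):

  /*
   * Create the SRQ for a given completion vector the first time a
   * queue lands on it, instead of all of them on the first connect.
   */
  static struct nvmet_rdma_srq *
  nvmet_rdma_get_srq(struct nvmet_rdma_device *ndev, int comp_vector)
  {
          int idx = comp_vector % ndev->device->num_comp_vectors;

          if (!ndev->srqs[idx] && nvmet_rdma_init_srq(ndev, idx))
                  return NULL;

          return ndev->srqs[idx];
  }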

Second, the application might not want to run on all completion vectors
that the device has (I am working on modifying the CQ pool API to
accommodate that).

Third, given that SRQ is an optional feature, I'd be happy if we can
hide this information from the application and implement a fallback
mechanism in the rdma core.

What I had in mind was to add an srq_pool that the application can
allocate and pass to qp creation together with a proper affinity hint
(both the cq and the srq can then be selected from this hint); a
strawman sketch follows.
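
A strawman of what such an API could look like (none of these names
exist in the rdma core; purely illustrative):

  struct ib_srq_pool;     /* opaque, owned by the rdma core */

  struct ib_srq_pool *ib_srq_pool_create(struct ib_pd *pd, int nr_srqs,
                                         struct ib_srq_init_attr *srq_attr);
  void ib_srq_pool_destroy(struct ib_srq_pool *pool);

  /*
   * QP creation would then take the pool plus an affinity hint (e.g.
   * the queue index) and pick both the CQ and the SRQ from it, roughly:
   *
   *      qp_init_attr.srq = ib_srq_pool_get(pool, affinity_hint);
   */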

Having said that, I'm not convinced it will be easy to come up with the
API in one shot, so maybe it's worth moving forward with this, not sure.

Thoughts?

> We can see the perf improvement between Case 2 and Case 4 (same order of resources).

Yeah, srq in its current form is a giant performance hit on a multi-core
system (it is more suitable for processing-power-constrained targets).

> We can see the benefit in resource consumption (mem and CPU), with a small perf loss,
> between Cases 2 and 3.
> There is still an open question about the perf difference at 1k between Case 1 and
> Case 3, but I guess we can investigate and improve it incrementally.

Where is this loss coming from? Is this a HW limitation? If it's SW then
it's a good place to analyze where we can improve (I have a couple of
ideas).


* [PATCH 2/3] nvmet-rdma: use SRQ per completion vector
  2017-09-05 10:59 ` [PATCH 2/3] nvmet-rdma: use SRQ per completion vector Max Gurtovoy
  2017-09-06 10:38   ` Christoph Hellwig
@ 2017-09-06 14:50   ` Sagi Grimberg
  2017-09-07 10:47     ` Max Gurtovoy
  1 sibling, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2017-09-06 14:50 UTC (permalink / raw)



> In order to save resource allocation and utilize the completion
> locality in a better way, allocate Shared Receive Queues (SRQs) per
> completion vector (and not per device).

Something is backwards here: srq per vector is not saving resources
compared to srq per-device; maybe compared to normal receive queues
(if we have enough initiators).

And what do you mean by "utilize the completion locality in a better
way"? How does using an srq affect completion queue locality?

> Assosiate each created QP/CQ

associate

> with an appropriate SRQ according to the queue index. This association
> will reduce the lock contention in the fast path

It reduces lock contention compared to srq-per-device, not normal
receive queues.


> and increase the locality in memory buffers.

How does it increase locality in memory buffers?


* [PATCH 2/3] nvmet-rdma: use SRQ per completion vector
  2017-09-06 10:38   ` Christoph Hellwig
@ 2017-09-06 14:57     ` Sagi Grimberg
  2017-09-07 12:03       ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2017-09-06 14:57 UTC (permalink / raw)



>>   
>>   	int			idx;
>>   	int			host_qid;
>> +	int			comp_vector;
> 
> How will this interact with these changes:
> 
> http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/rdma-cq.2
> 
> which I'd really like to see in?

I have another spin on that one (plus the fine-grained affinity
assignments addition that I proposed to nvmet a while ago).

You can see it here:
git://git.infradead.org/users/sagi/linux.git rdma-cq-pool

It's on top of the pci_alloc_irq_vectors mlx5 conversion and
the nvme code centralization.


* [PATCH 3/3] nvmet-rdma: add module parameter for SRQ size
  2017-09-05 10:59 ` [PATCH 3/3] nvmet-rdma: add module parameter for SRQ size Max Gurtovoy
  2017-09-06 10:39   ` Christoph Hellwig
@ 2017-09-06 15:02   ` Sagi Grimberg
  1 sibling, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2017-09-06 15:02 UTC (permalink / raw)



> Adjust SRQ size according to the expected load. Make sure the
> size is >= 256 to avoid lack of resources.

Why? Who said that you cannot do with less than that?

This is the painful part of srq: you never know how to
size it... I wish we could find a way to know...

> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 1b52080..34868ad 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -134,6 +134,16 @@ struct nvmet_rdma_device {
>   module_param_named(use_srq, nvmet_rdma_use_srq, bool, 0444);
>   MODULE_PARM_DESC(use_srq, "Use shared receive queue.");
>   
> +static int srq_size_set(const char *val, const struct kernel_param *kp);
> +static const struct kernel_param_ops srq_size_ops = {
> +	.set = srq_size_set,
> +	.get = param_get_int,
> +};
> +
> +static int nvmet_rdma_srq_size = 4095;
> +module_param_cb(srq_size, &srq_size_ops, &nvmet_rdma_srq_size, 0644);
> +MODULE_PARM_DESC(srq_size, "set Shared Receive Queue (SRQ) size, should >= 256 (default: 4095)");
> +
>   static DEFINE_IDA(nvmet_rdma_queue_ida);
>   static LIST_HEAD(nvmet_rdma_queue_list);
>   static DEFINE_MUTEX(nvmet_rdma_queue_mutex);
> @@ -150,6 +160,17 @@ struct nvmet_rdma_device {
>   
>   static struct nvmet_fabrics_ops nvmet_rdma_ops;
>   
> +static int srq_size_set(const char *val, const struct kernel_param *kp)
> +{
> +	int n = 0, ret;
> +
> +	ret = kstrtoint(val, 10, &n);
> +	if (ret != 0 || n < 256)
> +		return -EINVAL;
> +
> +	return param_set_int(val, kp);
> +}
> +
>   /* XXX: really should move to a generic header sooner or later.. */
>   static inline u32 get_unaligned_le24(const u8 *p)
>   {
> @@ -825,7 +846,7 @@ static int nvmet_rdma_init_srq(struct nvmet_rdma_device *ndev, int index)
>   	if (!nsrq)
>   		return -ENOMEM;
>   
> -	srq_size = 4095;	/* XXX: tune */
> +	srq_size = nvmet_rdma_srq_size;
>   
>   	srq_attr.attr.max_wr = srq_size;
>   	srq_attr.attr.max_sge = 2;
> 

You need to check the device's max_srq_wr and also check max_srq_sge (and
please make sure it's correctly set by drivers that support it).

Also you need to take the minimum between the number of srqs you
want and the device's max_srq; a rough sketch of the clamping follows.
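
Something along these lines, as a sketch only (the attrs fields are from
struct ib_device_attr; the exact placement on top of this patch is
assumed):

  srq_size = min_t(size_t, nvmet_rdma_srq_size,
                   ndev->device->attrs.max_srq_wr);
  srq_attr.attr.max_sge = min(2, ndev->device->attrs.max_srq_sge);

  /* and in nvmet_rdma_init_srqs(): */
  srq_count = min(ndev->device->num_comp_vectors,
                  ndev->device->attrs.max_srq);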


* [PATCH 2/3] nvmet-rdma: use SRQ per completion vector
  2017-09-06 14:50   ` Sagi Grimberg
@ 2017-09-07 10:47     ` Max Gurtovoy
  0 siblings, 0 replies; 15+ messages in thread
From: Max Gurtovoy @ 2017-09-07 10:47 UTC (permalink / raw)




On 9/6/2017 5:50 PM, Sagi Grimberg wrote:
> 
>> In order to save resource allocation and utilize the completion
>> locality in a better way, allocate Shared Receive Queues (SRQs) per
>> completion vector (and not per device).
> 
> Something is backwards here: srq per vector is not saving resources
> compared to srq per-device; maybe compared to normal receive queues
> (if we have enough initiators).

I meant compared to the normal MQ case.
I also added a module param so that one can control the amount of
resources allocated.

> 
> And what do you mean by "utilize the completion locality in a better
> way"? How does using an srq affect completion queue locality?

I meant that for each completion that arrives on CPU N, the "local"
SRQ_N will be the one whose buffer is re-posted via post_recv (and whose
srq spin_lock is taken).
With an SRQ per device we get completions from all CPUs and a single
SRQ is "working" on all of them.

Maybe we can rephrase it.

> 
>> Assosiate each created QP/CQ
> 
> associate

:)

> 
>> with an appropriate SRQ according to the queue index. This association
>> will reduce the lock contention in the fast path
> 
> It reduces lock contention compared to srq-per-device, not normal
> receive queues.

of course.

> 
> 
>> and increase the locality in memory buffers.
> 
> How does it increase locality in memory buffers?

We are sharing buffers between many connections (compared to normal MQ).
Let's say that many initiators are running traffic: in that case the
shared buffers will be much "hotter" than in the case where each buffer
is used by a single consumer. I assume the memory subsystem can
recognize this situation.


* [PATCH 2/3] nvmet-rdma: use SRQ per completion vector
  2017-09-06 14:57     ` Sagi Grimberg
@ 2017-09-07 12:03       ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2017-09-07 12:03 UTC (permalink / raw)


On Wed, Sep 06, 2017 at 05:57:47PM +0300, Sagi Grimberg wrote:
>
>>>     	int			idx;
>>>   	int			host_qid;
>>> +	int			comp_vector;
>>
>> How will this interact with these changes:
>>
>> http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/rdma-cq.2
>>
>> which I'd really like to see in?
>
> I have another spin on that one (plus the fine-grained affinity
> assignments addition that I proposed to nvmet a while ago).
>
> You can see it here:
> git://git.infradead.org/users/sagi/linux.git rdma-cq-pool

Time to get it out to the list :)  Chuck was also really interested
in some of this for NFS.


* [PATCH 0/3] nvmet-rdma: SRQ per completion vector
  2017-09-05 10:59 [PATCH 0/3] nvmet-rdma: SRQ per completion vector Max Gurtovoy
                   ` (3 preceding siblings ...)
  2017-09-06 14:40 ` [PATCH 0/3] nvmet-rdma: SRQ per completion vector Sagi Grimberg
@ 2017-11-07 14:40 ` Max Gurtovoy
  2017-11-08 10:07   ` Sagi Grimberg
  4 siblings, 1 reply; 15+ messages in thread
From: Max Gurtovoy @ 2017-11-07 14:40 UTC (permalink / raw)


Hi Christoph/Sagi,

On 9/5/2017 1:59 PM, Max Gurtovoy wrote:
> This is a new feature for the NVMEoF RDMA target, intended to save resource allocations
> (by sharing them) and utilize locality (of completions and memory) to get the best
> performance with Shared Receive Queues (SRQs). We'll create an SRQ per completion vector
> (and not per device) and associate each created QP/CQ with an appropriate SRQ. This will
> also reduce the lock contention on the single SRQ per device (today's solution).
> 
> My testing environment included 4 initiators (CX5, CX5, CX4, CX3) that were connected to 4
> subsystems (1 ns per subsystem) through 2 ports (each initiator connected to a unique
> subsystem backed by a different null_blk device) using a switch to the NVMEoF target (CX5).
> I used the RoCE link layer.
> 

What is the plan for this SRQ feature?
I haven't seen much progress with the CQ pool patches (we first wanted
this series to be on top of the CQ pool patches).
Should I resend it with your comments addressed, or wait until the CQ
pool is in?

-Max.


* [PATCH 0/3] nvmet-rdma: SRQ per completion vector
  2017-11-07 14:40 ` Max Gurtovoy
@ 2017-11-08 10:07   ` Sagi Grimberg
  0 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2017-11-08 10:07 UTC (permalink / raw)


Max,

> What is the plan for this SRQ feature?
> I haven't seen much progress with the CQ pool patches (we first wanted
> this series to be on top of the CQ pool patches).
> Should I resend it with your comments addressed, or wait until the CQ
> pool is in?

Just sent out a v3 of the CQ pool patches.


Thread overview: 15+ messages
2017-09-05 10:59 [PATCH 0/3] nvmet-rdma: SRQ per completion vector Max Gurtovoy
2017-09-05 10:59 ` [PATCH 1/3] nvmet-rdma: use srq pointer in rdma_cmd Max Gurtovoy
2017-09-06 10:30   ` Christoph Hellwig
2017-09-05 10:59 ` [PATCH 2/3] nvmet-rdma: use SRQ per completion vector Max Gurtovoy
2017-09-06 10:38   ` Christoph Hellwig
2017-09-06 14:57     ` Sagi Grimberg
2017-09-07 12:03       ` Christoph Hellwig
2017-09-06 14:50   ` Sagi Grimberg
2017-09-07 10:47     ` Max Gurtovoy
2017-09-05 10:59 ` [PATCH 3/3] nvmet-rdma: add module parameter for SRQ size Max Gurtovoy
2017-09-06 10:39   ` Christoph Hellwig
2017-09-06 15:02   ` Sagi Grimberg
2017-09-06 14:40 ` [PATCH 0/3] nvmet-rdma: SRQ per completion vector Sagi Grimberg
2017-11-07 14:40 ` Max Gurtovoy
2017-11-08 10:07   ` Sagi Grimberg
