* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
@ 2017-02-07 22:39 Parav Pandit
  2017-02-08  8:27 ` Christoph Hellwig
  2017-02-08  9:45 ` Sagi Grimberg
  0 siblings, 2 replies; 14+ messages in thread
From: Parav Pandit @ 2017-02-07 22:39 UTC (permalink / raw)


This patch allows sending data for block write commands via a maximum
of 2 SGEs, using RDMA's inherent SGE support.

The number of send SGEs has almost no relation to the inline data size
advertised by the NVMe target, though multiple send SGEs are likely to
become more useful once the target advertises more than 4K (or a page
size) worth of inline data.

In the future, a more advanced implementation may consult the exposed
inline data size to decide the number of SGEs for the I/O queues.
Until then, just make use of the RDMA adapter's capability.
Most adapters are known to support 3 send SGEs (cmd + data) or more,
so this shouldn't break any current deployments.
The patch has been tested with ConnectX-4 RoCEv2 100Gb adapters, with
ext4 and xfs filesystems and with raw block I/O.

Reviewed-by: Max Gurtovoy <maxg at mellanox.com>
Signed-off-by: Parav Pandit <parav at mellanox.com>
---
 drivers/nvme/host/rdma.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 557f29b..fc66c9c 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -40,7 +40,7 @@
 
 #define NVME_RDMA_MAX_SEGMENTS		256
 
-#define NVME_RDMA_MAX_INLINE_SEGMENTS	1
+#define NVME_RDMA_MAX_INLINE_SEGMENTS	2
 
 static const char *const nvme_rdma_cm_status_strs[] = {
 	[NVME_RDMA_CM_INVALID_LEN]	= "invalid length",
@@ -913,16 +913,23 @@ static int nvme_rdma_set_sg_null(struct nvme_command *c)
 }
 
 static int nvme_rdma_map_sg_inline(struct nvme_rdma_queue *queue,
-		struct nvme_rdma_request *req, struct nvme_command *c)
+		struct nvme_rdma_request *req, struct nvme_command *c,
+		int count)
 {
 	struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
-
-	req->sge[1].addr = sg_dma_address(req->sg_table.sgl);
-	req->sge[1].length = sg_dma_len(req->sg_table.sgl);
-	req->sge[1].lkey = queue->device->pd->local_dma_lkey;
+	int i;
+	u32 len = 0;
+
+	for (i = 0; i < count; i++) {
+		req->sge[i + 1].addr = sg_dma_address(&req->sg_table.sgl[i]);
+		req->sge[i + 1].length = sg_dma_len(&req->sg_table.sgl[i]);
+		req->sge[i + 1].lkey = queue->device->pd->local_dma_lkey;
+		len += sg_dma_len(&req->sg_table.sgl[i]);
+		req->num_sge++;
+	}
 
 	sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
-	sg->length = cpu_to_le32(sg_dma_len(req->sg_table.sgl));
+	sg->length = cpu_to_le32(len);
 	sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;
 
 	req->inline_data = true;
@@ -1012,13 +1019,13 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 		return -EIO;
 	}
 
-	if (count == 1) {
+	if (count <= NVME_RDMA_MAX_INLINE_SEGMENTS) {
 		if (rq_data_dir(rq) == WRITE && nvme_rdma_queue_idx(queue) &&
 		    blk_rq_payload_bytes(rq) <=
 				nvme_rdma_inline_data_size(queue))
-			return nvme_rdma_map_sg_inline(queue, req, c);
+			return nvme_rdma_map_sg_inline(queue, req, c, count);
 
-		if (dev->pd->flags & IB_PD_UNSAFE_GLOBAL_RKEY)
+		if (count == 1 && dev->pd->flags & IB_PD_UNSAFE_GLOBAL_RKEY)
 			return nvme_rdma_map_sg_single(queue, req, c);
 	}
 
-- 
1.8.3.1


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-07 22:39 [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands Parav Pandit
@ 2017-02-08  8:27 ` Christoph Hellwig
  2017-02-08  9:48   ` Sagi Grimberg
  2017-02-08 15:15   ` Parav Pandit
  2017-02-08  9:45 ` Sagi Grimberg
  1 sibling, 2 replies; 14+ messages in thread
From: Christoph Hellwig @ 2017-02-08  8:27 UTC (permalink / raw)


This looks fine to me in general, but I'm a little curious how
you arrived at the number 2.  Why not 4 or 8?


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-07 22:39 [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands Parav Pandit
  2017-02-08  8:27 ` Christoph Hellwig
@ 2017-02-08  9:45 ` Sagi Grimberg
  1 sibling, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2017-02-08  9:45 UTC (permalink / raw)


>  static int nvme_rdma_map_sg_inline(struct nvme_rdma_queue *queue,
> -		struct nvme_rdma_request *req, struct nvme_command *c)
> +		struct nvme_rdma_request *req, struct nvme_command *c,
> +		int count)
>  {
>  	struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
> -
> -	req->sge[1].addr = sg_dma_address(req->sg_table.sgl);
> -	req->sge[1].length = sg_dma_len(req->sg_table.sgl);
> -	req->sge[1].lkey = queue->device->pd->local_dma_lkey;
> +	int i;
> +	u32 len = 0;
> +
> +	for (i = 0; i < count; i++) {
> +		req->sge[i + 1].addr = sg_dma_address(&req->sg_table.sgl[i]);
> +		req->sge[i + 1].length = sg_dma_len(&req->sg_table.sgl[i]);
> +		req->sge[i + 1].lkey = queue->device->pd->local_dma_lkey;
> +		len += sg_dma_len(&req->sg_table.sgl[i]);
> +		req->num_sge++;
> +	}

Nit: you could also use an sge iterator that starts at req->sge[1] instead.
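
For example, something along these lines (rough sketch only, not
compile-tested):

	struct scatterlist *sgl = req->sg_table.sgl;
	struct ib_sge *sge = &req->sge[1];	/* sge[0] carries the command */
	u32 len = 0;
	int i;

	for (i = 0; i < count; i++, sgl++, sge++) {
		sge->addr = sg_dma_address(sgl);
		sge->length = sg_dma_len(sgl);
		sge->lkey = queue->device->pd->local_dma_lkey;
		len += sge->length;
	}
	req->num_sge += count;
	sg->length = cpu_to_le32(len);	/* sg is the nvme_sgl_desc as above */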


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-08  8:27 ` Christoph Hellwig
@ 2017-02-08  9:48   ` Sagi Grimberg
  2017-02-08 15:15   ` Parav Pandit
  1 sibling, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2017-02-08  9:48 UTC (permalink / raw)



> This looks fine to me in general, but I'm a little curious how
> you arrived at the number 2.  Why not 4 or 8?

Well, it's preallocated with the request.

Maybe we can have a heuristic that will allocate dynamically according
to the max in-capsule data capability reported by the target (one sge
for each PAGE_SIZE)?
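
Roughly something like this (hand-wavy, untested sketch; it assumes the
pre-split max_sge device attribute and reuses nvme_rdma_inline_data_size()
from the driver):

	/*
	 * One data SGE per PAGE_SIZE of in-capsule data, capped by the
	 * device SGE limit minus the SGE used for the command itself.
	 */
	unsigned int nr_inline_sge;

	nr_inline_sge = DIV_ROUND_UP(nvme_rdma_inline_data_size(queue),
				     PAGE_SIZE);
	nr_inline_sge = min_t(unsigned int, nr_inline_sge,
			      queue->device->dev->attrs.max_sge - 1);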


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-08  8:27 ` Christoph Hellwig
  2017-02-08  9:48   ` Sagi Grimberg
@ 2017-02-08 15:15   ` Parav Pandit
  2017-02-08 15:41     ` Steve Wise
  1 sibling, 1 reply; 14+ messages in thread
From: Parav Pandit @ 2017-02-08 15:15 UTC (permalink / raw)




> -----Original Message-----
> From: Christoph Hellwig [mailto:hch at lst.de]
> Sent: Wednesday, February 8, 2017 2:28 AM
> To: Parav Pandit <parav at mellanox.com>
> Cc: hch at lst.de; sagi at grimberg.me; linux-nvme at lists.infradead.org;
> axboe at fb.com; keith.busch at intel.com
> Subject: Re: [PATCH] nvme-rdma: Support 2 inline data SGEs for write
> commands.
> 
> This looks fine to me in general, but I'm a little curious how you arrived at
> the number 2.  Why not 4 or 8?

Some adapters don't support more than 5 SGEs, so increasing it to 4 or 8 would break them.
Chuck reported that 1 out of 3 adapters supports only 3. That's likely due to min(send_sge, recv_sge).
With a total of 3 SGEs (64B cmd SGE + data SGEs), the WQE fits in 64 bytes at the provider driver level. So to strike a balance between cache line usage and performance, and to be nice to adapters with a low SGE limit (whether due to min(send_sge, recv_sge) or otherwise), I picked 2 data SGEs.

Side note: I have split out send and recv SGEs in another, unrelated patch which is in internal review, targeting the linux-rdma tree.


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-08 15:15   ` Parav Pandit
@ 2017-02-08 15:41     ` Steve Wise
  2017-02-08 15:42       ` Parav Pandit
  0 siblings, 1 reply; 14+ messages in thread
From: Steve Wise @ 2017-02-08 15:41 UTC (permalink / raw)


> >
> > This looks fine to me in general, but I'm a little curious how you
> > arrived at the number 2.  Why not 4 or 8?
> 
> Some adapters don't support more than 5 SGEs, so increasing it to 4 or 8
> would break them.
> Chuck reported that 1 out of 3 adapters supports only 3. That's likely due
> to min(send_sge, recv_sge).
> With a total of 3 SGEs (64B cmd SGE + data SGEs), the WQE fits in 64 bytes
> at the provider driver level. So to strike a balance between cache line
> usage and performance, and to be nice to adapters with a low SGE limit
> (whether due to min(send_sge, recv_sge) or otherwise), I picked 2 data
> SGEs.
> 
> Side note: I have split out send and recv SGEs in another, unrelated patch
> which is in internal review, targeting the linux-rdma tree.

I'm looking forward to this patch since cxgb4 has max_recv_sge of 4 and
max_send_sge of 17!

Steve.


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-08 15:41     ` Steve Wise
@ 2017-02-08 15:42       ` Parav Pandit
  2017-02-08 15:49         ` Steve Wise
  0 siblings, 1 reply; 14+ messages in thread
From: Parav Pandit @ 2017-02-08 15:42 UTC (permalink / raw)


Hi Steve,

> -----Original Message-----
> From: Steve Wise [mailto:swise at opengridcomputing.com]
> Sent: Wednesday, February 8, 2017 9:41 AM
> To: Parav Pandit <parav at mellanox.com>; 'Christoph Hellwig' <hch at lst.de>
> Cc: axboe at fb.com; keith.busch at intel.com; sagi at grimberg.me; linux-
> nvme at lists.infradead.org
> Subject: RE: [PATCH] nvme-rdma: Support 2 inline data SGEs for write
> commands.
> 
> > >
> > > This looks fine to me in general, but I'm a little curious how you
> > > arrived at the number 2.  Why not 4 or 8?
> >
> > Some adapters don't support more than 5 SGEs, so increasing it to 4
> > or 8 would break them.
> > Chuck reported that 1 out of 3 adapters supports only 3. That's
> > likely due to min(send_sge, recv_sge).
> > With a total of 3 SGEs (64B cmd SGE + data SGEs), the WQE fits in 64
> > bytes at the provider driver level. So to strike a balance between
> > cache line usage and performance, and to be nice to adapters with a
> > low SGE limit (whether due to min(send_sge, recv_sge) or otherwise),
> > I picked 2 data SGEs.
> >
> > Side note: I have split out send and recv SGEs in another, unrelated
> > patch which is in internal review, targeting the linux-rdma tree.
> 
> I'm looking forward to this patch since cxgb4 has max_recv_sge of 4 and
> max_send_sge of 17!
> 
Sure. I will send it out shortly once I'm done with review and testing with the nvme target.

> Steve.


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-08 15:42       ` Parav Pandit
@ 2017-02-08 15:49         ` Steve Wise
  0 siblings, 0 replies; 14+ messages in thread
From: Steve Wise @ 2017-02-08 15:49 UTC (permalink / raw)


> >
> > I'm looking forward to this patch since cxgb4 has max_recv_sge of 4 and
> > max_send_sge of 17!
> >
> Sure. I will send it out shortly once I'm done with review and testing
> with the nvme target.
> 

If you can send this and your 3 or 4 nvme/rdma patches as a series, I'll test
them on cxgb4.

Steve.


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-22 20:07         ` Sagi Grimberg
@ 2017-02-22 20:46           ` Parav Pandit
  0 siblings, 0 replies; 14+ messages in thread
From: Parav Pandit @ 2017-02-22 20:46 UTC (permalink / raw)


Hi Sagi,

> -----Original Message-----
> From: Sagi Grimberg [mailto:sagi at grimberg.me]
> Sent: Wednesday, February 22, 2017 2:07 PM
> To: Parav Pandit <parav at mellanox.com>; Christoph Hellwig <hch at lst.de>
> Cc: linux-nvme at lists.infradead.org; axboe at fb.com; keith.busch at intel.com
> Subject: Re: [PATCH] nvme-rdma: Support 2 inline data SGEs for write
> commands.
> 
> 
> >> I'm referring to struct nvme_rdma_request
> > Yes, I got that now.
> > That would need dynamic allocation of the sge array if we want to base
> > it on ioccsz.
> > Unless we go to 4 or 8 or more SGEs, an extra 8-byte pointer to
> > reference more than one SGE might not be worth it just for writes.
> 
> Which is what Christoph asked: why not 4 or 8?
Which I already explained in my reply to him, quoted below.

Snippet:
--------------
Some adapters don't support more than 5 SGEs, so increasing it to 4
or 8 would break them.
Chuck reported that 1 out of 3 adapters supports only 3. That's likely
due to min(send_sge, recv_sge).
With a total of 3 SGEs (64B cmd SGE + data SGEs), the WQE fits in 64
bytes at the provider driver level. So to strike a balance between
cache line usage and performance, and to be nice to adapters with a
low SGE limit (whether due to min(send_sge, recv_sge) or otherwise),
I picked 2 data SGEs.
--------------
I have had the patch that splits send and receive SGEs ready for a while now,
but it has been tested only in a very limited environment and needs more testing, which Steve has already offered.
I plan to finish these patches first before I push this additional enhancement.


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-22 20:02       ` Parav Pandit
@ 2017-02-22 20:07         ` Sagi Grimberg
  2017-02-22 20:46           ` Parav Pandit
  0 siblings, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2017-02-22 20:07 UTC (permalink / raw)



>> I'm referring to struct nvme_rdma_request
> Yes, I got that now.
> That would need dynamic allocation of the sge array if we want to base it on ioccsz.
> Unless we go to 4 or 8 or more SGEs, an extra 8-byte pointer to reference more than one SGE might not be worth it just for writes.

Which is what Christoph asked: why not 4 or 8?


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-22 19:55     ` Sagi Grimberg
@ 2017-02-22 20:02       ` Parav Pandit
  2017-02-22 20:07         ` Sagi Grimberg
  0 siblings, 1 reply; 14+ messages in thread
From: Parav Pandit @ 2017-02-22 20:02 UTC (permalink / raw)


> -----Original Message-----
> From: Sagi Grimberg [mailto:sagi at grimberg.me]
> Sent: Wednesday, February 22, 2017 1:56 PM
> To: Parav Pandit <parav at mellanox.com>; Christoph Hellwig <hch at lst.de>
> Cc: linux-nvme at lists.infradead.org; axboe at fb.com; keith.busch at intel.com
> Subject: Re: [PATCH] nvme-rdma: Support 2 inline data SGEs for write
> commands.
> 
> 
> >>> Hi Christoph, Sagi,
> >>
> >>> Does this host side patch look fine? It's unrelated to the target
> >>> side implementation.
> >>
> >> I'm still wondering if we want it dynamic according to the target
> >> ioccsz (say 1 per PAGE_SIZE?)
> >
> > I don't think we are consuming any more resources by increasing to
> > 2 SGEs because it still aligns to the hardware's minimum WQE size of
> > 64B or 128B.
> > Can we add that additional complexity incrementally, unless 2 SGEs
> > are less optimal than 1?
> 
> I'm referring to struct nvme_rdma_request
Yes, I got that now.
That would need dynamic allocation of the sge array if we want to base it on ioccsz.
Unless we go to 4 or 8 or more SGEs, an extra 8-byte pointer to reference more than one SGE might not be worth it just for writes.
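
Just to sketch what the dynamic variant would involve (purely illustrative;
it assumes the sge[] array is moved to the tail of the request pdu as a
flexible array, and that ioccsz is known when the tag set is allocated):

	/*
	 * Size the blk-mq pdu per controller so the SGE array matches the
	 * negotiated in-capsule data size instead of a compile-time
	 * NVME_RDMA_MAX_INLINE_SEGMENTS.
	 */
	unsigned int nr_inline_sge;

	nr_inline_sge = DIV_ROUND_UP(ctrl->ctrl.ioccsz * 16 -
				     sizeof(struct nvme_command), PAGE_SIZE);
	set->cmd_size = sizeof(struct nvme_rdma_request) +
			(1 + nr_inline_sge) * sizeof(struct ib_sge) +
			SG_CHUNK_SIZE * sizeof(struct scatterlist);

Whether that extra plumbing is worth it just for writes is exactly the
question, of course.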


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-22 19:52   ` Parav Pandit
@ 2017-02-22 19:55     ` Sagi Grimberg
  2017-02-22 20:02       ` Parav Pandit
  0 siblings, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2017-02-22 19:55 UTC (permalink / raw)



>>> Hi Christoph, Sagi,
>>
>>> Does this host side patch look fine? It's unrelated to the target side
>>> implementation.
>>
>> I'm still wondering if we want it dynamic according to the target ioccsz (say 1
>> per PAGE_SIZE?)
>
> I don't think we are consuming any more resources by increasing to 2 SGEs because it still aligns to the hardware's minimum WQE size of 64B or 128B.
> Can we add that additional complexity incrementally, unless 2 SGEs are less optimal than 1?

I'm referring to struct nvme_rdma_request


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
  2017-02-22 19:38 ` Sagi Grimberg
@ 2017-02-22 19:52   ` Parav Pandit
  2017-02-22 19:55     ` Sagi Grimberg
  0 siblings, 1 reply; 14+ messages in thread
From: Parav Pandit @ 2017-02-22 19:52 UTC (permalink / raw)


> -----Original Message-----
> From: Sagi Grimberg [mailto:sagi at grimberg.me]
> Sent: Wednesday, February 22, 2017 1:39 PM
> To: Parav Pandit <parav at mellanox.com>; Christoph Hellwig <hch at lst.de>
> Cc: linux-nvme at lists.infradead.org; axboe at fb.com; keith.busch at intel.com
> Subject: Re: [PATCH] nvme-rdma: Support 2 inline data SGEs for write
> commands.
> 
> 
> > Hi Christoph, Sagi,
> 
> > Does this host side patch look fine? It's unrelated to the target side
> > implementation.
> 
> I'm still wondering if we want it dynamic according to the target ioccsz (say 1
> per PAGE_SIZE?)

I don't think we are consuming any more resources by increasing to 2 SGEs because it still aligns to the hardware's minimum WQE size of 64B or 128B.
Can we add that additional complexity incrementally, unless 2 SGEs are less optimal than 1?


* [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands.
       [not found] <VI1PR0502MB3008FAE182CFAFA906ECF62DD1500@VI1PR0502MB3008.eurprd05.prod.outlook.com>
@ 2017-02-22 19:38 ` Sagi Grimberg
  2017-02-22 19:52   ` Parav Pandit
  0 siblings, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2017-02-22 19:38 UTC (permalink / raw)



> Hi Christoph, Sagi,

> Does this host side patch look fine? It's unrelated to the target side implementation.

I'm still wondering if we want it dynamic according to the target
ioccsz (say 1 per PAGE_SIZE?)


Thread overview: 14+ messages
2017-02-07 22:39 [PATCH] nvme-rdma: Support 2 inline data SGEs for write commands Parav Pandit
2017-02-08  8:27 ` Christoph Hellwig
2017-02-08  9:48   ` Sagi Grimberg
2017-02-08 15:15   ` Parav Pandit
2017-02-08 15:41     ` Steve Wise
2017-02-08 15:42       ` Parav Pandit
2017-02-08 15:49         ` Steve Wise
2017-02-08  9:45 ` Sagi Grimberg
     [not found] <VI1PR0502MB3008FAE182CFAFA906ECF62DD1500@VI1PR0502MB3008.eurprd05.prod.outlook.com>
2017-02-22 19:38 ` Sagi Grimberg
2017-02-22 19:52   ` Parav Pandit
2017-02-22 19:55     ` Sagi Grimberg
2017-02-22 20:02       ` Parav Pandit
2017-02-22 20:07         ` Sagi Grimberg
2017-02-22 20:46           ` Parav Pandit
