All of lore.kernel.org
 help / color / mirror / Atom feed
* [bug report] nvme sends invalid command capsule over rdma transport for 5KiB write when target supports MSDBD > 1
@ 2021-05-24 17:19 Walker, Benjamin
  2021-05-26  9:12 ` Sagi Grimberg
  0 siblings, 1 reply; 9+ messages in thread
From: Walker, Benjamin @ 2021-05-24 17:19 UTC (permalink / raw)
  To: linux-nvme

This bug was found using the iozone tool.

- Linux kernel initiator, SPDK target, RDMA (RoCEv2) transport
- iozone is performing a 5KiB write to a 512 byte block size nvme device
- The SPDK target has reported that it supports 4KiB of in-capsule data, MSDBD of 16 (number of SGL descriptors), and ICDOFF of 0.
- The Linux kernel sends an NVMe-oF capsule with a command that claims to have 5KiB of data in the command, but actually only has a single SGL element describing 4KiB of data in-capsule.
- The SPDK target correctly fails this I/O

This fails on at least 5.11 but worked prior to 5.4. A git bisect shows that this commit is responsible: 38e1800275d3af607e4df92ff49dc2cf442586a4

I believe the key is the use of MSDBD > 1 and in-capsule data support. This seems to trick the initiator into thinking it can do 5KiB in one command with two SGL elements, but then the initiator goes down the in-capsule data path and can only describe 4KiB that way.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [bug report] nvme sends invalid command capsule over rdma transport for 5KiB write when target supports MSDBD > 1
  2021-05-24 17:19 [bug report] nvme sends invalid command capsule over rdma transport for 5KiB write when target supports MSDBD > 1 Walker, Benjamin
@ 2021-05-26  9:12 ` Sagi Grimberg
  2021-05-26 14:10   ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Sagi Grimberg @ 2021-05-26  9:12 UTC (permalink / raw)
  To: Walker, Benjamin, linux-nvme


> This bug was found using the iozone tool.
> 
> - Linux kernel initiator, SPDK target, RDMA (RoCEv2) transport
> - iozone is performing a 5KiB write to a 512 byte block size nvme device
> - The SPDK target has reported that it supports 4KiB of in-capsule data, MSDBD of 16 (number of SGL descriptors), and ICDOFF of 0.
> - The Linux kernel sends an NVMe-oF capsule with a command that claims to have 5KiB of data in the command, but actually only has a single SGL element describing 4KiB of data in-capsule.
> - The SPDK target correctly fails this I/O
> 
> This fails on at least 5.11 but worked prior to 5.4. A git bisect shows that this commit is responsible: 38e1800275d3af607e4df92ff49dc2cf442586a4
> 
> I believe the key is the use of MSDBD > 1 and in-capsule data support. This seems to trick the initiator into thinking it can do 5KiB in one command with two SGL elements, but then the initiator goes down the in-capsule data path and can only describe 4KiB that way.

I think it may be a wrong iteration on the scatterlist (which means
that iozone generated a scatterlist with more than 2 entries, causing
the sg_table to be chained).

Does this make the issue go away?
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 8d107b201f16..44cfcaeb5f2e 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1319,22 +1319,22 @@ static int nvme_rdma_map_sg_inline(struct 
nvme_rdma_queue *queue,
                 struct nvme_rdma_request *req, struct nvme_command *c,
                 int count)
  {
-       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
-       struct scatterlist *sgl = req->data_sgl.sg_table.sgl;
+       struct nvme_sgl_desc *sgl = &c->common.dptr.sgl;
+       struct scatterlist *sg, *scat = req->data_sgl.sg_table.sgl;
         struct ib_sge *sge = &req->sge[1];
         u32 len = 0;
         int i;

-       for (i = 0; i < count; i++, sgl++, sge++) {
-               sge->addr = sg_dma_address(sgl);
-               sge->length = sg_dma_len(sgl);
+       for_each_sg(scat, sg, count, i) {
+               sge->addr = sg_dma_address(sg);
+               sge->length = sg_dma_len(sg);
                 sge->lkey = queue->device->pd->local_dma_lkey;
                 len += sge->length;
         }

-       sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
-       sg->length = cpu_to_le32(len);
-       sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;
+       sgl->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
+       sgl->length = cpu_to_le32(len);
+       sgl->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;

         req->num_sge += count;
         return 0;
--

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [bug report] nvme sends invalid command capsule over rdma transport for 5KiB write when target supports MSDBD > 1
  2021-05-26  9:12 ` Sagi Grimberg
@ 2021-05-26 14:10   ` Christoph Hellwig
  2021-05-26 14:49     ` Max Gurtovoy
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2021-05-26 14:10 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Walker, Benjamin, linux-nvme, Max Gurtovoy, Israel Rukshin

On Wed, May 26, 2021 at 02:12:08AM -0700, Sagi Grimberg wrote:
>  {
> -       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
> -       struct scatterlist *sgl = req->data_sgl.sg_table.sgl;
> +       struct nvme_sgl_desc *sgl = &c->common.dptr.sgl;
> +       struct scatterlist *sg, *scat = req->data_sgl.sg_table.sgl;
>         struct ib_sge *sge = &req->sge[1];
>         u32 len = 0;
>         int i;
> 
> -       for (i = 0; i < count; i++, sgl++, sge++) {
> -               sge->addr = sg_dma_address(sgl);
> -               sge->length = sg_dma_len(sgl);
> +       for_each_sg(scat, sg, count, i) {
> +               sge->addr = sg_dma_address(sg);
> +               sge->length = sg_dma_len(sg);
>                 sge->lkey = queue->device->pd->local_dma_lkey;
>                 len += sge->length;
>         }

We do need for_each_sg here indeed, but you also need to keep
incrementing sge for each loop iteration.  I think we can also
drop the scat local variable with just a single users and all the
renaming while we're at it.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [bug report] nvme sends invalid command capsule over rdma transport for 5KiB write when target supports MSDBD > 1
  2021-05-26 14:10   ` Christoph Hellwig
@ 2021-05-26 14:49     ` Max Gurtovoy
  2021-05-26 16:00       ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Max Gurtovoy @ 2021-05-26 14:49 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg
  Cc: Walker, Benjamin, linux-nvme, Israel Rukshin


On 5/26/2021 5:10 PM, Christoph Hellwig wrote:
> On Wed, May 26, 2021 at 02:12:08AM -0700, Sagi Grimberg wrote:
>>   {
>> -       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
>> -       struct scatterlist *sgl = req->data_sgl.sg_table.sgl;
>> +       struct nvme_sgl_desc *sgl = &c->common.dptr.sgl;
>> +       struct scatterlist *sg, *scat = req->data_sgl.sg_table.sgl;
>>          struct ib_sge *sge = &req->sge[1];
>>          u32 len = 0;
>>          int i;
>>
>> -       for (i = 0; i < count; i++, sgl++, sge++) {
>> -               sge->addr = sg_dma_address(sgl);
>> -               sge->length = sg_dma_len(sgl);
>> +       for_each_sg(scat, sg, count, i) {
>> +               sge->addr = sg_dma_address(sg);
>> +               sge->length = sg_dma_len(sg);
>>                  sge->lkey = queue->device->pd->local_dma_lkey;
>>                  len += sge->length;
>>          }
> We do need for_each_sg here indeed, but you also need to keep
> incrementing sge for each loop iteration.  I think we can also
> drop the scat local variable with just a single users and all the
> renaming while we're at it.

Is the above fixing the issue ?

Seems like code refactoring to me, right ?



_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [bug report] nvme sends invalid command capsule over rdma transport for 5KiB write when target supports MSDBD > 1
  2021-05-26 14:49     ` Max Gurtovoy
@ 2021-05-26 16:00       ` Christoph Hellwig
  2021-05-26 19:29         ` Walker, Benjamin
  2021-05-26 21:57         ` Max Gurtovoy
  0 siblings, 2 replies; 9+ messages in thread
From: Christoph Hellwig @ 2021-05-26 16:00 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Christoph Hellwig, Sagi Grimberg, Walker, Benjamin, linux-nvme,
	Israel Rukshin

On Wed, May 26, 2021 at 05:49:41PM +0300, Max Gurtovoy wrote:
> > We do need for_each_sg here indeed, but you also need to keep
> > incrementing sge for each loop iteration.  I think we can also
> > drop the scat local variable with just a single users and all the
> > renaming while we're at it.
> 
> Is the above fixing the issue ?
> 
> Seems like code refactoring to me, right ?

It fixes support for chained SGLs when using inline segments.  Not sure
if it fixes the original bug report, but the current code is broken.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [bug report] nvme sends invalid command capsule over rdma transport for 5KiB write when target supports MSDBD > 1
  2021-05-26 16:00       ` Christoph Hellwig
@ 2021-05-26 19:29         ` Walker, Benjamin
  2021-05-26 21:57         ` Max Gurtovoy
  1 sibling, 0 replies; 9+ messages in thread
From: Walker, Benjamin @ 2021-05-26 19:29 UTC (permalink / raw)
  To: Christoph Hellwig, Max Gurtovoy; +Cc: Sagi Grimberg, linux-nvme, Israel Rukshin

> From: Christoph Hellwig <hch@infradead.org>
 
> On Wed, May 26, 2021 at 05:49:41PM +0300, Max Gurtovoy wrote:
> > > We do need for_each_sg here indeed, but you also need to keep
> > > incrementing sge for each loop iteration.  I think we can also drop
> > > the scat local variable with just a single users and all the
> > > renaming while we're at it.
> >
> > Is the above fixing the issue ?
> >
> > Seems like code refactoring to me, right ?
> 
> It fixes support for chained SGLs when using inline segments.  Not sure if it fixes
> the original bug report, but the current code is broken.

I'll get this re-tested shortly, but I've spent some time on getting this to reproduce
and it is really easy to make it happen. Simply start the SPDK target,
create an RDMA transport (RoCEv2 is fine) with in-capsule data size of 8K and
add any kind of disk as a namespace (malloc works). Then connect to it from
the kernel initiator, create and mount a filesystem on it (I used xfs), generate
a large file and then write to it in 5k blocks using dd with the odirect flag. It
hits immediately. The command claims it is a write of 10 blocks, but the single
SGL element describes in-capsule data with length 4096.

It does not happen when:
1) The transport is TCP
2) The in-capsule data size is 4K

I'll apply the patch and re-test now.

Thanks,
Ben

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [bug report] nvme sends invalid command capsule over rdma transport for 5KiB write when target supports MSDBD > 1
  2021-05-26 16:00       ` Christoph Hellwig
  2021-05-26 19:29         ` Walker, Benjamin
@ 2021-05-26 21:57         ` Max Gurtovoy
  2021-05-27 17:43           ` Walker, Benjamin
  1 sibling, 1 reply; 9+ messages in thread
From: Max Gurtovoy @ 2021-05-26 21:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Walker, Benjamin, linux-nvme, Israel Rukshin


On 5/26/2021 7:00 PM, Christoph Hellwig wrote:
> On Wed, May 26, 2021 at 05:49:41PM +0300, Max Gurtovoy wrote:
>>> We do need for_each_sg here indeed, but you also need to keep
>>> incrementing sge for each loop iteration.  I think we can also
>>> drop the scat local variable with just a single users and all the
>>> renaming while we're at it.
>> Is the above fixing the issue ?
>>
>> Seems like code refactoring to me, right ?
> It fixes support for chained SGLs when using inline segments.  Not sure
> if it fixes the original bug report, but the current code is broken.

ohh, I see the usage of sg_next is missing.

Well the original bug is for sure using inline chained SGLs but I 
thought it's using 2 sg_nents and NVME_INLINE_SG_CNT == 2.

maybe there is a split and the IO is using sg_nents == 3 ? and then we 
chain.

Let's wait for Ben's test report (Ben please make sure to fix Sagi's 
suggestion with Christoph's comment in your test).


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [bug report] nvme sends invalid command capsule over rdma transport for 5KiB write when target supports MSDBD > 1
  2021-05-26 21:57         ` Max Gurtovoy
@ 2021-05-27 17:43           ` Walker, Benjamin
  2021-05-27 17:48             ` Max Gurtovoy
  0 siblings, 1 reply; 9+ messages in thread
From: Walker, Benjamin @ 2021-05-27 17:43 UTC (permalink / raw)
  To: Max Gurtovoy, Christoph Hellwig; +Cc: Sagi Grimberg, linux-nvme, Israel Rukshin

> From: Max Gurtovoy <mgurtovoy@nvidia.com>
>  
> On 5/26/2021 7:00 PM, Christoph Hellwig wrote:
> > On Wed, May 26, 2021 at 05:49:41PM +0300, Max Gurtovoy wrote:
> >>> We do need for_each_sg here indeed, but you also need to keep
> >>> incrementing sge for each loop iteration.  I think we can also drop
> >>> the scat local variable with just a single users and all the
> >>> renaming while we're at it.
> >> Is the above fixing the issue ?
> >>
> >> Seems like code refactoring to me, right ?
> > It fixes support for chained SGLs when using inline segments.  Not
> > sure if it fixes the original bug report, but the current code is broken.
> 
> ohh, I see the usage of sg_next is missing.
> 
> Well the original bug is for sure using inline chained SGLs but I thought it's using
> 2 sg_nents and NVME_INLINE_SG_CNT == 2.
> 
> maybe there is a split and the IO is using sg_nents == 3 ? and then we chain.
> 
> Let's wait for Ben's test report (Ben please make sure to fix Sagi's suggestion
> with Christoph's comment in your test).

This patch fixes the bug:

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 37943dc4c2c1..6a3f7f6bd1ab 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1320,16 +1320,17 @@ static int nvme_rdma_map_sg_inline(struct nvme_rdma_queue *queue,
                int count)
 {
        struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
-       struct scatterlist *sgl = req->data_sgl.sg_table.sgl;
+      struct scatterlist *sgl, *scat = req->data_sgl.sg_table.sgl;
        struct ib_sge *sge = &req->sge[1];
        u32 len = 0;
        int i;

-       for (i = 0; i < count; i++, sgl++, sge++) {
+      for_each_sg(scat, sgl, count, i) {
                sge->addr = sg_dma_address(sgl);
                sge->length = sg_dma_len(sgl);
                sge->lkey = queue->device->pd->local_dma_lkey;
                len += sge->length;
+              sge++;
        }
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [bug report] nvme sends invalid command capsule over rdma transport for 5KiB write when target supports MSDBD > 1
  2021-05-27 17:43           ` Walker, Benjamin
@ 2021-05-27 17:48             ` Max Gurtovoy
  0 siblings, 0 replies; 9+ messages in thread
From: Max Gurtovoy @ 2021-05-27 17:48 UTC (permalink / raw)
  To: Walker, Benjamin, Christoph Hellwig
  Cc: Sagi Grimberg, linux-nvme, Israel Rukshin


On 5/27/2021 8:43 PM, Walker, Benjamin wrote:
>> From: Max Gurtovoy <mgurtovoy@nvidia.com>
>>   
>> On 5/26/2021 7:00 PM, Christoph Hellwig wrote:
>>> On Wed, May 26, 2021 at 05:49:41PM +0300, Max Gurtovoy wrote:
>>>>> We do need for_each_sg here indeed, but you also need to keep
>>>>> incrementing sge for each loop iteration.  I think we can also drop
>>>>> the scat local variable with just a single users and all the
>>>>> renaming while we're at it.
>>>> Is the above fixing the issue ?
>>>>
>>>> Seems like code refactoring to me, right ?
>>> It fixes support for chained SGLs when using inline segments.  Not
>>> sure if it fixes the original bug report, but the current code is broken.
>> ohh, I see the usage of sg_next is missing.
>>
>> Well the original bug is for sure using inline chained SGLs but I thought it's using
>> 2 sg_nents and NVME_INLINE_SG_CNT == 2.
>>
>> maybe there is a split and the IO is using sg_nents == 3 ? and then we chain.
>>
>> Let's wait for Ben's test report (Ben please make sure to fix Sagi's suggestion
>> with Christoph's comment in your test).
> This patch fixes the bug:
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 37943dc4c2c1..6a3f7f6bd1ab 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1320,16 +1320,17 @@ static int nvme_rdma_map_sg_inline(struct nvme_rdma_queue *queue,
>                  int count)
>   {
>          struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
> -       struct scatterlist *sgl = req->data_sgl.sg_table.sgl;
> +      struct scatterlist *sgl, *scat = req->data_sgl.sg_table.sgl;
>          struct ib_sge *sge = &req->sge[1];
>          u32 len = 0;
>          int i;
>
> -       for (i = 0; i < count; i++, sgl++, sge++) {
> +      for_each_sg(scat, sgl, count, i) {
>                  sge->addr = sg_dma_address(sgl);
>                  sge->length = sg_dma_len(sgl);
>                  sge->lkey = queue->device->pd->local_dma_lkey;
>                  len += sge->length;
> +              sge++;

good news.

So either we use the sge[i].length or sge++.

Both will do the job.


>          }

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-05-27 19:03 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-24 17:19 [bug report] nvme sends invalid command capsule over rdma transport for 5KiB write when target supports MSDBD > 1 Walker, Benjamin
2021-05-26  9:12 ` Sagi Grimberg
2021-05-26 14:10   ` Christoph Hellwig
2021-05-26 14:49     ` Max Gurtovoy
2021-05-26 16:00       ` Christoph Hellwig
2021-05-26 19:29         ` Walker, Benjamin
2021-05-26 21:57         ` Max Gurtovoy
2021-05-27 17:43           ` Walker, Benjamin
2021-05-27 17:48             ` Max Gurtovoy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.