* [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
@ 2021-09-14 15:38 Sagi Grimberg
  2021-09-14 22:01 ` Keith Busch
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Sagi Grimberg @ 2021-09-14 15:38 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Chaitanya Kulkarni

When the controller sends us multiple r2t PDUs for a single
request we need to account for them correctly, as our send and recv
contexts run concurrently (i.e. we may get a new r2t with r2t_offset
before we have updated our iterator and the req->data_sent marker).
This can cause wrong offsets to be sent to the controller.

To fix that, note first that this can only happen while sending the
last page of the request; hence we use the r2t_offset as the h2c
data PDU data_offset, and in the nvme_tcp_try_send_data loop we make
sure to advance the request markers also when we completed a PDU but
still expect more r2t PDUs, since the entire data of the request has
not been sent yet.

Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
- Keith, can you ask the WD team to test this as well?

 drivers/nvme/host/tcp.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 645025620154..87b73eb94041 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -607,7 +607,7 @@ static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
 		cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
 	data->ttag = pdu->ttag;
 	data->command_id = nvme_cid(rq);
-	data->data_offset = cpu_to_le32(req->data_sent);
+	data->data_offset = pdu->r2t_offset;
 	data->data_length = cpu_to_le32(req->pdu_len);
 	return 0;
 }
@@ -939,7 +939,15 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 			nvme_tcp_ddgst_update(queue->snd_hash, page,
 					offset, ret);
 
-		/* fully successful last write*/
+		/*
+		 * update the request iterator except for the last payload send
+		 * in the request where we don't want to modify it as we may
+		 * compete with the RX path completing the request.
+		 */
+		if (req->data_sent + ret < req->data_len)
+			nvme_tcp_advance_req(req, ret);
+
+		/* fully successful last send in current PDU */
 		if (last && ret == len) {
 			if (queue->data_digest) {
 				nvme_tcp_ddgst_final(queue->snd_hash,
@@ -951,7 +959,6 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 			}
 			return 1;
 		}
-		nvme_tcp_advance_req(req, ret);
 	}
 	return -EAGAIN;
 }
-- 
2.30.2



* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-09-14 15:38 [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting Sagi Grimberg
@ 2021-09-14 22:01 ` Keith Busch
  2021-09-15  9:28   ` Sagi Grimberg
  2021-09-16 16:45 ` Keith Busch
  2021-09-28 20:40 ` Keith Busch
  2 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2021-09-14 22:01 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni

On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
> - Keith, can you ask the WD team to test this as well?

I'll send this their way, though I understand they've a bit of a backlog
to work through so it may be a while before I receive test results.


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-09-14 22:01 ` Keith Busch
@ 2021-09-15  9:28   ` Sagi Grimberg
  0 siblings, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2021-09-15  9:28 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni


> On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
>> - Keith, can you ask the WD team to test this as well?
> 
> I'll send this their way, though I understand they've a bit of a backlog
> to work through so it may be a while before I receive test results.

OK, but can I get a review? This is an issue reported by Dell and the
patch addresses that for them.


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-09-14 15:38 [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting Sagi Grimberg
  2021-09-14 22:01 ` Keith Busch
@ 2021-09-16 16:45 ` Keith Busch
  2021-09-19  7:12   ` Sagi Grimberg
  2021-09-28 20:40 ` Keith Busch
  2 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2021-09-16 16:45 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni

On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
> When the controller sends us multiple r2t PDUs in a single
> request we need to account for it correctly as our send/recv
> context run concurrently (i.e. we get a new r2t with r2t_offset
> before we updated our iterator and req->data_sent marker). This
> can cause wrong offsets to be sent to the controller.
> 
> To fix that, we will first know that this may happen only in
> the send sequence of the last page, hence we will take
> the r2t_offset to the h2c PDU data_offset, and in
> nvme_tcp_try_send_data loop, we make sure to increment
> the request markers also when we completed a PDU but
> we are expecting more r2t PDUs as we still did not send
> the entire data of the request.
> 
> Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
> Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> - Keith, can you ask the WD team to test this as well?
> 
>  drivers/nvme/host/tcp.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 645025620154..87b73eb94041 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -607,7 +607,7 @@ static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
>  		cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
>  	data->ttag = pdu->ttag;
>  	data->command_id = nvme_cid(rq);
> -	data->data_offset = cpu_to_le32(req->data_sent);
> +	data->data_offset = pdu->r2t_offset;

Shouldn't this be "le32_to_cpu(pdu->r2t_offset)"?

>  	data->data_length = cpu_to_le32(req->pdu_len);
>  	return 0;
>  }
> @@ -939,7 +939,15 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>  			nvme_tcp_ddgst_update(queue->snd_hash, page,
>  					offset, ret);
>  
> -		/* fully successful last write*/
> +		/*
> +		 * update the request iterator except for the last payload send
> +		 * in the request where we don't want to modify it as we may
> +		 * compete with the RX path completing the request.
> +		 */
> +		if (req->data_sent + ret < req->data_len)
> +			nvme_tcp_advance_req(req, ret);
> +
> +		/* fully successful last send in current PDU */
>  		if (last && ret == len) {
>  			if (queue->data_digest) {
>  				nvme_tcp_ddgst_final(queue->snd_hash,
> @@ -951,7 +959,6 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>  			}
>  			return 1;
>  		}
> -		nvme_tcp_advance_req(req, ret);
>  	}
>  	return -EAGAIN;
>  }
> -- 
> 2.30.2


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-09-16 16:45 ` Keith Busch
@ 2021-09-19  7:12   ` Sagi Grimberg
  2021-09-20 10:11     ` Sagi Grimberg
  2021-09-20 15:04     ` Keith Busch
  0 siblings, 2 replies; 14+ messages in thread
From: Sagi Grimberg @ 2021-09-19  7:12 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni



On 9/16/21 7:45 PM, Keith Busch wrote:
> On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
>> When the controller sends us multiple r2t PDUs in a single
>> request we need to account for it correctly as our send/recv
>> context run concurrently (i.e. we get a new r2t with r2t_offset
>> before we updated our iterator and req->data_sent marker). This
>> can cause wrong offsets to be sent to the controller.
>>
>> To fix that, we will first know that this may happen only in
>> the send sequence of the last page, hence we will take
>> the r2t_offset to the h2c PDU data_offset, and in
>> nvme_tcp_try_send_data loop, we make sure to increment
>> the request markers also when we completed a PDU but
>> we are expecting more r2t PDUs as we still did not send
>> the entire data of the request.
>>
>> Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
>> Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
>> Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>> - Keith, can you ask the WD team to test this as well?
>>
>>   drivers/nvme/host/tcp.c | 13 ++++++++++---
>>   1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 645025620154..87b73eb94041 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -607,7 +607,7 @@ static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
>>   		cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
>>   	data->ttag = pdu->ttag;
>>   	data->command_id = nvme_cid(rq);
>> -	data->data_offset = cpu_to_le32(req->data_sent);
>> +	data->data_offset = pdu->r2t_offset;
> 
> Shouldn't this be "le32_to_cpu(pdu->r2t_offset)"?

No, data is a wire payload, so it is le32
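
Both the r2t_offset we receive and the data_offset we send back are
declared as little-endian wire types, so the value can be copied
straight across with no byte swap. A trimmed sketch of the two PDU
definitions (roughly as in include/linux/nvme-tcp.h; shown for
illustration, not quoted verbatim):

	struct nvme_tcp_r2t_pdu {
		struct nvme_tcp_hdr	hdr;
		__u16			command_id;
		__u16			ttag;
		__le32			r2t_offset;	/* already __le32 on the wire... */
		__le32			r2t_length;
		__u8			rsvd[4];
	};

	struct nvme_tcp_data_pdu {
		struct nvme_tcp_hdr	hdr;
		__u16			command_id;
		__u16			ttag;
		__le32			data_offset;	/* ...copied as-is into this field */
		__le32			data_length;
		__u8			rsvd[4];
	};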


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-09-19  7:12   ` Sagi Grimberg
@ 2021-09-20 10:11     ` Sagi Grimberg
  2021-09-20 15:04     ` Keith Busch
  1 sibling, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2021-09-20 10:11 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni


>>> Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
>>> Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
>>> Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>> ---
>>> - Keith, can you ask the WD team to test this as well?
>>>
>>>   drivers/nvme/host/tcp.c | 13 ++++++++++---
>>>   1 file changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 645025620154..87b73eb94041 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -607,7 +607,7 @@ static int nvme_tcp_setup_h2c_data_pdu(struct 
>>> nvme_tcp_request *req,
>>>           cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
>>>       data->ttag = pdu->ttag;
>>>       data->command_id = nvme_cid(rq);
>>> -    data->data_offset = cpu_to_le32(req->data_sent);
>>> +    data->data_offset = pdu->r2t_offset;
>>
>> Shouldn't this be "le32_to_cpu(pdu->r2t_offset)"?
> 
> No, data is a wire payload, so it is le32

Keith, if/when you are OK with this, it should go to 5.15
and stable.


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-09-19  7:12   ` Sagi Grimberg
  2021-09-20 10:11     ` Sagi Grimberg
@ 2021-09-20 15:04     ` Keith Busch
  1 sibling, 0 replies; 14+ messages in thread
From: Keith Busch @ 2021-09-20 15:04 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni

On Sun, Sep 19, 2021 at 10:12:21AM +0300, Sagi Grimberg wrote:
> 
> 
> On 9/16/21 7:45 PM, Keith Busch wrote:
> > On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
> > > When the controller sends us multiple r2t PDUs in a single
> > > request we need to account for it correctly as our send/recv
> > > context run concurrently (i.e. we get a new r2t with r2t_offset
> > > before we updated our iterator and req->data_sent marker). This
> > > can cause wrong offsets to be sent to the controller.
> > > 
> > > To fix that, we will first know that this may happen only in
> > > the send sequence of the last page, hence we will take
> > > the r2t_offset to the h2c PDU data_offset, and in
> > > nvme_tcp_try_send_data loop, we make sure to increment
> > > the request markers also when we completed a PDU but
> > > we are expecting more r2t PDUs as we still did not send
> > > the entire data of the request.
> > > 
> > > Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
> > > Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> > > Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> > > Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> > > ---
> > > - Keith, can you ask the WD team to test this as well?
> > > 
> > >   drivers/nvme/host/tcp.c | 13 ++++++++++---
> > >   1 file changed, 10 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > > index 645025620154..87b73eb94041 100644
> > > --- a/drivers/nvme/host/tcp.c
> > > +++ b/drivers/nvme/host/tcp.c
> > > @@ -607,7 +607,7 @@ static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
> > >   		cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
> > >   	data->ttag = pdu->ttag;
> > >   	data->command_id = nvme_cid(rq);
> > > -	data->data_offset = cpu_to_le32(req->data_sent);
> > > +	data->data_offset = pdu->r2t_offset;
> > 
> > Shouldn't this be "le32_to_cpu(pdu->r2t_offset)"?
> 
> No, data is a wire payload, so it is le32

Oh, of course. Looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>

The WDC group has not gotten to test this yet, just fyi, but I think
this patch is fine anyway.


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-09-14 15:38 [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting Sagi Grimberg
  2021-09-14 22:01 ` Keith Busch
  2021-09-16 16:45 ` Keith Busch
@ 2021-09-28 20:40 ` Keith Busch
  2021-09-28 21:00   ` Sagi Grimberg
  2 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2021-09-28 20:40 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni

On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
> When the controller sends us multiple r2t PDUs in a single
> request we need to account for it correctly as our send/recv
> context run concurrently (i.e. we get a new r2t with r2t_offset
> before we updated our iterator and req->data_sent marker). This
> can cause wrong offsets to be sent to the controller.
> 
> To fix that, we will first know that this may happen only in
> the send sequence of the last page, hence we will take
> the r2t_offset to the h2c PDU data_offset, and in
> nvme_tcp_try_send_data loop, we make sure to increment
> the request markers also when we completed a PDU but
> we are expecting more r2t PDUs as we still did not send
> the entire data of the request.
> 
> Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
> Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> - Keith, can you ask the WD team to test this as well?

I finally have some feedback on testing, and it looks like this patch
has created a regression. The testers are observing:

  [Tue Sep 28 02:12:55 2021] nvme nvme6: req 3 r2t len 8192 exceeded data len 8192 (4096 sent)
  [Tue Sep 28 02:12:55 2021] nvme nvme6: receive failed:  -71
  [Tue Sep 28 02:12:55 2021] nvme nvme6: starting error recovery

This is the same failure we had seen before, which you fixed with commit:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=825619b09ad351894d2c6fb6705f5b3711d145c7

I only just recently heard about this, so I haven't looked into any
additional possible cause for the regression yet, but I wanted to let
you know as early as possible.


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-09-28 20:40 ` Keith Busch
@ 2021-09-28 21:00   ` Sagi Grimberg
  2021-09-29  0:24     ` Keith Busch
  2021-09-30 20:15     ` Keith Busch
  0 siblings, 2 replies; 14+ messages in thread
From: Sagi Grimberg @ 2021-09-28 21:00 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni



On 9/28/21 11:40 PM, Keith Busch wrote:
> On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
>> When the controller sends us multiple r2t PDUs in a single
>> request we need to account for it correctly as our send/recv
>> context run concurrently (i.e. we get a new r2t with r2t_offset
>> before we updated our iterator and req->data_sent marker). This
>> can cause wrong offsets to be sent to the controller.
>>
>> To fix that, we will first know that this may happen only in
>> the send sequence of the last page, hence we will take
>> the r2t_offset to the h2c PDU data_offset, and in
>> nvme_tcp_try_send_data loop, we make sure to increment
>> the request markers also when we completed a PDU but
>> we are expecting more r2t PDUs as we still did not send
>> the entire data of the request.
>>
>> Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
>> Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
>> Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>> - Keith, can you ask the WD team to test this as well?
> 
> I finally have some feedback on testing, and it looks like this patch
> has created a regression. The testers are observing:
> 
>    [Tue Sep 28 02:12:55 2021] nvme nvme6: req 3 r2t len 8192 exceeded data len 8192 (4096 sent)
>    [Tue Sep 28 02:12:55 2021] nvme nvme6: receive failed:  -71
>    [Tue Sep 28 02:12:55 2021] nvme nvme6: starting error recovery
> 
> Which we had seen before, and you fixed with commit:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=825619b09ad351894d2c6fb6705f5b3711d145c7
> 
> I only just recently heard about this, so I haven't looked into any
> additional possible cause for this regression yet, but just wanted to
> let you know earlier.

Yes, I definitely see the bug... damn it.

The issue here is that we peek at req->data_sent and req->data_len
after the last send, but by then the request may have completed and
these members may have been overwritten by a new execution of the
request.

We really should make sure that we record these before we perform
the network send to avoid this race.

Can you guys check that this restores the correct behavior?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 3c1c29dd3020..6d63129adccc 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -926,15 +926,17 @@ static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
 static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 {
 	struct nvme_tcp_queue *queue = req->queue;
+	int req_data_len = req->data_len;
 
 	while (true) {
 		struct page *page = nvme_tcp_req_cur_page(req);
 		size_t offset = nvme_tcp_req_cur_offset(req);
 		size_t len = nvme_tcp_req_cur_length(req);
-		bool last = nvme_tcp_pdu_last_send(req, len);
+		bool pdu_last = nvme_tcp_pdu_last_send(req, len);
+		bool req_data_sent = req->data_sent;
 		int ret, flags = MSG_DONTWAIT;
 
-		if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
+		if (pdu_last && !queue->data_digest && !nvme_tcp_queue_more(queue))
 			flags |= MSG_EOR;
 		else
 			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
@@ -958,7 +960,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 		 * in the request where we don't want to modify it as we may
 		 * compete with the RX path completing the request.
 		 */
-		if (req->data_sent + ret < req->data_len)
+		if (req_data_sent + ret < req_data_len)
			nvme_tcp_advance_req(req, ret);
 
 		/* fully successful last send in current PDU */
 				nvme_tcp_ddgst_final(queue->snd_hash,
 						&req->ddgst);
--


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-09-28 21:00   ` Sagi Grimberg
@ 2021-09-29  0:24     ` Keith Busch
  2021-09-30 20:15     ` Keith Busch
  1 sibling, 0 replies; 14+ messages in thread
From: Keith Busch @ 2021-09-29  0:24 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni

On Wed, Sep 29, 2021 at 12:00:20AM +0300, Sagi Grimberg wrote:
> 
> 
> On 9/28/21 11:40 PM, Keith Busch wrote:
> > On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
> > > When the controller sends us multiple r2t PDUs in a single
> > > request we need to account for it correctly as our send/recv
> > > context run concurrently (i.e. we get a new r2t with r2t_offset
> > > before we updated our iterator and req->data_sent marker). This
> > > can cause wrong offsets to be sent to the controller.
> > > 
> > > To fix that, we will first know that this may happen only in
> > > the send sequence of the last page, hence we will take
> > > the r2t_offset to the h2c PDU data_offset, and in
> > > nvme_tcp_try_send_data loop, we make sure to increment
> > > the request markers also when we completed a PDU but
> > > we are expecting more r2t PDUs as we still did not send
> > > the entire data of the request.
> > > 
> > > Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
> > > Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> > > Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> > > Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> > > ---
> > > - Keith, can you ask the WD team to test this as well?
> > 
> > I finally have some feedback on testing, and it looks like this patch
> > has created a regression. The testers are observing:
> > 
> >    [Tue Sep 28 02:12:55 2021] nvme nvme6: req 3 r2t len 8192 exceeded data len 8192 (4096 sent)
> >    [Tue Sep 28 02:12:55 2021] nvme nvme6: receive failed:  -71
> >    [Tue Sep 28 02:12:55 2021] nvme nvme6: starting error recovery
> > 
> > Which we had seen before, and you fixed with commit:
> > 
> >    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=825619b09ad351894d2c6fb6705f5b3711d145c7
> > 
> > I only just recently heard about this, so I haven't looked into any
> > additional possible cause for this regression yet, but just wanted to
> > let you know earlier.
> 
> Yes I definitely see the bug... Damn it..
> 
> The issue here is that we peek at req->data_sent and req->data_len
> after the last send, but by then the request may have completed
> and these members were overwritten by a new execution of the
> request.
> 
> We really should make sure that we record these before we perform
> the network send to avoid this race.
> 
> Can you guys check that this restores the correct behavior?

We'll give this a spin. (I had to fix it up so it applies and compiles;
no biggie.)

> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 3c1c29dd3020..6d63129adccc 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -926,15 +926,17 @@ static void nvme_tcp_fail_request(struct
> nvme_tcp_request *req)
>  static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>  {
>         struct nvme_tcp_queue *queue = req->queue;
> +       int req_data_len = req->data_len;
> 
>         while (true) {
>                 struct page *page = nvme_tcp_req_cur_page(req);
>                 size_t offset = nvme_tcp_req_cur_offset(req);
>                 size_t len = nvme_tcp_req_cur_length(req);
> -               bool last = nvme_tcp_pdu_last_send(req, len);
> +               bool pdu_last = nvme_tcp_pdu_last_send(req, len);
> +               bool req_data_sent = req->data_sent;
>                 int ret, flags = MSG_DONTWAIT;
> 
> -               if (last && !queue->data_digest &&
> !nvme_tcp_queue_more(queue))
> +               if (pdu_last && !queue->data_digest &&
> !nvme_tcp_queue_more(queue))
>                         flags |= MSG_EOR;
>                 else
>                         flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
> @@ -958,7 +960,7 @@ static int nvme_tcp_try_send_data(struct
> nvme_tcp_request *req)
>                  * in the request where we don't want to modify it as we may
>                  * compete with the RX path completing the request.
>                  */
> -               if (req->data_sent + ret < req->data_len)
> +               if (req_data_sent + ret < req_data_len)
>                         nvme_tcp_advance_req(req, ret);
> 
>                 /* fully successful last send in current PDU */
>                                 nvme_tcp_ddgst_final(queue->snd_hash,
>                                         &req->ddgst);
> --


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-09-28 21:00   ` Sagi Grimberg
  2021-09-29  0:24     ` Keith Busch
@ 2021-09-30 20:15     ` Keith Busch
  2021-10-02 22:19       ` Sagi Grimberg
  1 sibling, 1 reply; 14+ messages in thread
From: Keith Busch @ 2021-09-30 20:15 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni

On Wed, Sep 29, 2021 at 12:00:20AM +0300, Sagi Grimberg wrote:
> 
> 
> On 9/28/21 11:40 PM, Keith Busch wrote:
> > On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
> > > When the controller sends us multiple r2t PDUs in a single
> > > request we need to account for it correctly as our send/recv
> > > context run concurrently (i.e. we get a new r2t with r2t_offset
> > > before we updated our iterator and req->data_sent marker). This
> > > can cause wrong offsets to be sent to the controller.
> > > 
> > > To fix that, we will first know that this may happen only in
> > > the send sequence of the last page, hence we will take
> > > the r2t_offset to the h2c PDU data_offset, and in
> > > nvme_tcp_try_send_data loop, we make sure to increment
> > > the request markers also when we completed a PDU but
> > > we are expecting more r2t PDUs as we still did not send
> > > the entire data of the request.
> > > 
> > > Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
> > > Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> > > Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> > > Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> > > ---
> > > - Keith, can you ask the WD team to test this as well?
> > 
> > I finally have some feedback on testing, and it looks like this patch
> > has created a regression. The testers are observing:
> > 
> >    [Tue Sep 28 02:12:55 2021] nvme nvme6: req 3 r2t len 8192 exceeded data len 8192 (4096 sent)
> >    [Tue Sep 28 02:12:55 2021] nvme nvme6: receive failed:  -71
> >    [Tue Sep 28 02:12:55 2021] nvme nvme6: starting error recovery
> > 
> > Which we had seen before, and you fixed with commit:
> > 
> >    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=825619b09ad351894d2c6fb6705f5b3711d145c7
> > 
> > I only just recently heard about this, so I haven't looked into any
> > additional possible cause for this regression yet, but just wanted to
> > let you know earlier.
> 
> Yes I definitely see the bug... Damn it..
> 
> The issue here is that we peek at req->data_sent and req->data_len
> after the last send, but by then the request may have completed
> and these members were overwritten by a new execution of the
> request.
> 
> We really should make sure that we record these before we perform
> the network send to avoid this race.
> 
> Can you guys check that this restores the correct behavior?

Unfortunately this was unsuccessful; the same issue is still occurring.
Please let me know if you have another patch to try. I'll keep looking
for a solution as well.


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-09-30 20:15     ` Keith Busch
@ 2021-10-02 22:19       ` Sagi Grimberg
  2021-10-03  2:04         ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2021-10-02 22:19 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni


>>>> When the controller sends us multiple r2t PDUs in a single
>>>> request we need to account for it correctly as our send/recv
>>>> context run concurrently (i.e. we get a new r2t with r2t_offset
>>>> before we updated our iterator and req->data_sent marker). This
>>>> can cause wrong offsets to be sent to the controller.
>>>>
>>>> To fix that, we will first know that this may happen only in
>>>> the send sequence of the last page, hence we will take
>>>> the r2t_offset to the h2c PDU data_offset, and in
>>>> nvme_tcp_try_send_data loop, we make sure to increment
>>>> the request markers also when we completed a PDU but
>>>> we are expecting more r2t PDUs as we still did not send
>>>> the entire data of the request.
>>>>
>>>> Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
>>>> Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
>>>> Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
>>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>>> ---
>>>> - Keith, can you ask the WD team to test this as well?
>>>
>>> I finally have some feedback on testing, and it looks like this patch
>>> has created a regression. The testers are observing:
>>>
>>>     [Tue Sep 28 02:12:55 2021] nvme nvme6: req 3 r2t len 8192 exceeded data len 8192 (4096 sent)
>>>     [Tue Sep 28 02:12:55 2021] nvme nvme6: receive failed:  -71
>>>     [Tue Sep 28 02:12:55 2021] nvme nvme6: starting error recovery
>>>
>>> Which we had seen before, and you fixed with commit:
>>>
>>>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=825619b09ad351894d2c6fb6705f5b3711d145c7
>>>
>>> I only just recently heard about this, so I haven't looked into any
>>> additional possible cause for this regression yet, but just wanted to
>>> let you know earlier.
>>
>> Yes I definitely see the bug... Damn it..
>>
>> The issue here is that we peek at req->data_sent and req->data_len
>> after the last send, but by then the request may have completed
>> and these members were overwritten by a new execution of the
>> request.
>>
>> We really should make sure that we record these before we perform
>> the network send to avoid this race.
>>
>> Can you guys check that this restores the correct behavior?
> 
> Unfortunately this was unsuccessful. The same issue is still occuring. Please
> let me know if you have another patch to try. I'll also keep looking for a
> solution as well.

Really? That's unexpected; this patch should ensure that the request
is not advanced after the last payload send. req->data_sent and
req->data_len are recorded before we actually perform the send, so
the request should not be advanced when (i.e. on the last send):
	(req_data_sent + ret == req_data_len)

So I'm surprised that the same issue is still occurring.


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-10-02 22:19       ` Sagi Grimberg
@ 2021-10-03  2:04         ` Keith Busch
  2021-10-03  8:51           ` Sagi Grimberg
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2021-10-03  2:04 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni

On Sun, Oct 03, 2021 at 01:19:28AM +0300, Sagi Grimberg wrote:
> > Unfortunately this was unsuccessful. The same issue is still occuring. Please
> > let me know if you have another patch to try. I'll also keep looking for a
> > solution as well.
> 
> Really? That's unexpected, this patch should ensure that the request
> is not advanced after the last payload send. The req->data_sent and
> req->data_len are recorded before we actually perform the send so
> the request should not be advanced if (e.g. last send):
> 	(req_data_sent + ret == req_data_len)
> 
> So I'm surprised that the same issue is still occurring.

The only thing I have thought of so far is that req_data_len may not be
aligned with nvme_tcp_req_cur_length(). When this was working, whether
the request advanced was based on the nvme_tcp_req_cur_length() value;
now the criterion is based on req->data_len.
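
In other words, the decision moved from a per-PDU criterion to a
per-request one. A condensed paraphrase of the loop tail before and
after the patch (pieced together from the diffs in this thread;
illustration only, not verbatim kernel code):

	/* before this patch: never advance on a PDU's final send */
	if (last && ret == len) {
		...
		return 1;
	}
	nvme_tcp_advance_req(req, ret);

	/* with this patch: advance unless it is the request's final send */
	if (req->data_sent + ret < req->data_len)
		nvme_tcp_advance_req(req, ret);
	if (last && ret == len) {
		...
		return 1;
	}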


* Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
  2021-10-03  2:04         ` Keith Busch
@ 2021-10-03  8:51           ` Sagi Grimberg
  0 siblings, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2021-10-03  8:51 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni


>>> Unfortunately this was unsuccessful. The same issue is still occuring. Please
>>> let me know if you have another patch to try. I'll also keep looking for a
>>> solution as well.
>>
>> Really? That's unexpected, this patch should ensure that the request
>> is not advanced after the last payload send. The req->data_sent and
>> req->data_len are recorded before we actually perform the send so
>> the request should not be advanced if (e.g. last send):
>> 	(req_data_sent + ret == req_data_len)
>>
>> So I'm surprised that the same issue is still occurring.
> 
> Only thing I thought of so far is if req_data_len is not aligned with
> nvme_tcp_req_cur_length(). When this was working, the request would not
> advance based on the nvme_tcp_req_cur_length() value; now the criteria
> is based on 'req->data_len'.

That is exactly the bug fix that broke this. Before, a multi-pdu request
would incorrectly avoid advancing the iterator. nvme_tcp_req_cur_length()
is the length we need to send for the current pdu, not for the full
request, so in that case we shouldn't have an issue advancing the
request.

My recollection is that the original issue you reported was on the
request's last payload send: the controller sent a completion and the
request was re-executed as a new request before the send context had
advanced it, which led to a use-after-free. I expected that specific
issue to be addressed by this patch.

Think of an 8K write where the target sends a 0-4k r2t and then a
4k-8k r2t.

The host needs to send 0-4k, then on reception of the second r2t send
4k-8k, but because a completion may arrive concurrently it must avoid
advancing the request (as in the originally reported issue).
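
Spelled out against the new check, with the numbers from the example
above (an illustrative walk-through, not code from the tree):

	/*
	 * 8K write, req->data_len == 8192, two 4K r2ts:
	 *
	 * r2t #1 (r2t_offset=0, length=4096):
	 *	send 4096; data_sent(0) + ret(4096) < 8192 -> advance iterator
	 * r2t #2 (r2t_offset=4096, length=4096):
	 *	send 4096; data_sent(4096) + ret(4096) == 8192 -> do NOT touch
	 *	the request; the RX path may already be completing and reusing it
	 */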

I think we can also solve this a different way with refcounting the
request, but if the suggested patch doesn't resolve the issue then
we might still be missing something I don't yet understand.


