From: Keith Busch <kbusch@kernel.org>
To: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org, Christoph Hellwig <hch@lst.de>,
Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
Subject: Re: [PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
Date: Tue, 28 Sep 2021 17:24:07 -0700 [thread overview]
Message-ID: <20210929002407.GB390773@dhcp-10-100-145-180.wdc.com> (raw)
In-Reply-To: <d969a4b7-4d3c-9df7-5a7d-9700bff37d50@grimberg.me>
On Wed, Sep 29, 2021 at 12:00:20AM +0300, Sagi Grimberg wrote:
>
>
> On 9/28/21 11:40 PM, Keith Busch wrote:
> > On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
> > > When the controller sends us multiple r2t PDUs for a single
> > > request, we need to account for them correctly because our send
> > > and recv contexts run concurrently (i.e. a new r2t with an
> > > r2t_offset can arrive before we have updated our iterator and
> > > the req->data_sent marker). This can cause wrong offsets to be
> > > sent to the controller.
> > >
> > > To fix this, note that it can happen only in the send sequence
> > > of the last page, so we take the r2t_offset as the h2c PDU
> > > data_offset, and in the nvme_tcp_try_send_data loop we make sure
> > > to advance the request markers even when we have completed a PDU
> > > but still expect more r2t PDUs, since we have not yet sent the
> > > entire data of the request.
> > >
> > > Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
> > > Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> > > Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
> > > Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> > > ---
> > > - Keith, can you ask the WD team to test this as well?
> >
> > I finally have some feedback on testing, and it looks like this patch
> > has created a regression. The testers are observing:
> >
> > [Tue Sep 28 02:12:55 2021] nvme nvme6: req 3 r2t len 8192 exceeded data len 8192 (4096 sent)
> > [Tue Sep 28 02:12:55 2021] nvme nvme6: receive failed: -71
> > [Tue Sep 28 02:12:55 2021] nvme nvme6: starting error recovery
> >
> > Which we had seen before, and you fixed with commit:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=825619b09ad351894d2c6fb6705f5b3711d145c7
> >
> > I only just recently heard about this, so I haven't looked into any
> > additional possible cause for this regression yet, but just wanted to
> > let you know earlier.
>
> Yes I definitely see the bug... Damn it..
>
> The issue here is that we peek at req->data_sent and req->data_len
> after the last send, but by then the request may have completed
> and these members may have been overwritten by a new execution of
> the request.
>
> We really should make sure to record these values before we perform
> the network send to avoid this race.
>
> Can you guys check that this restores the correct behavior?
We'll give this a spin (I had to fix it up so it applies and compiles;
no biggie).
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 3c1c29dd3020..6d63129adccc 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -926,15 +926,17 @@ static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
>  static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>  {
>  	struct nvme_tcp_queue *queue = req->queue;
> +	int req_data_len = req->data_len;
>
>  	while (true) {
>  		struct page *page = nvme_tcp_req_cur_page(req);
>  		size_t offset = nvme_tcp_req_cur_offset(req);
>  		size_t len = nvme_tcp_req_cur_length(req);
> -		bool last = nvme_tcp_pdu_last_send(req, len);
> +		bool pdu_last = nvme_tcp_pdu_last_send(req, len);
> +		bool req_data_sent = req->data_sent;
>  		int ret, flags = MSG_DONTWAIT;
>
> -		if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
> +		if (pdu_last && !queue->data_digest && !nvme_tcp_queue_more(queue))
>  			flags |= MSG_EOR;
>  		else
>  			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
> @@ -958,7 +960,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>  		 * in the request where we don't want to modify it as we may
>  		 * compete with the RX path completing the request.
>  		 */
> -		if (req->data_sent + ret < req->data_len)
> +		if (req_data_sent + ret < req_data_len)
>  			nvme_tcp_advance_req(req, ret);
>
>  		/* fully successful last send in current PDU */
>  		nvme_tcp_ddgst_final(queue->snd_hash, &req->ddgst);
> --
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme