* [RFC PATCH] nvme-tcp: rerun io_work if req_list is not empty
@ 2021-05-17 22:36 Keith Busch
  2021-05-18  0:42 ` Sagi Grimberg
  2021-05-19  6:35 ` Christoph Hellwig
  0 siblings, 2 replies; 6+ messages in thread
From: Keith Busch @ 2021-05-17 22:36 UTC (permalink / raw)
  To: linux-nvme, sagi; +Cc: hch, Keith Busch

A possible race condition exists where a request to send data that is
enqueued from nvme_tcp_handle_r2t() will not be observed by
nvme_tcp_send_all() if the latter happens to be running concurrently.
The driver relies on io_work to send the enqueued request when it runs
again, but the concurrently running nvme_tcp_send_all() may not have
released the send_mutex by that time. If no future commands are
enqueued to re-kick the io_work, the request will time out in the
SEND_H2C state, resulting in a timeout error like:

  nvme nvme0: queue 1: timeout request 0x3 type 6

Ensure the io_work continues to run as long as the req_list is not
empty.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
Marking this RFC because the timeout is difficult to recreate, which
makes the patch difficult to verify. The patch was created purely from
code inspection, so I'm just hoping for feedback on my analysis right now.
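
For reviewers' context, the send_mutex in question is the one taken by
the direct-send path of nvme_tcp_queue_request(); an abridged sketch
from my reading of the current code (details trimmed, may not match the
tree exactly):

static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
		bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;
	bool empty;

	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	/* send directly from this context if we can, else punt to io_work */
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		nvme_tcp_send_all(queue);
		mutex_unlock(&queue->send_mutex);
	} else if (last) {
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}
}

The race described above needs a context sitting in that trylock-protected
section while io_work enqueues the H2C request.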

 drivers/nvme/host/tcp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0222e23f5936..d07eb13d8713 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1140,7 +1140,8 @@ static void nvme_tcp_io_work(struct work_struct *w)
 				pending = true;
 			else if (unlikely(result < 0))
 				break;
-		}
+		} else
+			pending = !llist_empty(&queue->req_list);
 
 		result = nvme_tcp_try_recv(queue);
 		if (result > 0)
-- 
2.25.4



* Re: [RFC PATCH] nvme-tcp: rerun io_work if req_list is not empty
  2021-05-17 22:36 [RFC PATCH] nvme-tcp: rerun io_work if req_list is not empty Keith Busch
@ 2021-05-18  0:42 ` Sagi Grimberg
  2021-05-18  1:38   ` Keith Busch
  2021-05-19  6:35 ` Christoph Hellwig
  1 sibling, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2021-05-18  0:42 UTC (permalink / raw)
  To: Keith Busch, linux-nvme; +Cc: hch


> A possible race condition exists where a request to send data that is
> enqueued from nvme_tcp_handle_r2t() will not be observed by
> nvme_tcp_send_all() if the latter happens to be running concurrently.
> The driver relies on io_work to send the enqueued request when it runs
> again, but the concurrently running nvme_tcp_send_all() may not have
> released the send_mutex by that time. If no future commands are
> enqueued to re-kick the io_work, the request will time out in the
> SEND_H2C state, resulting in a timeout error like:
> 
>    nvme nvme0: queue 1: timeout request 0x3 type 6
> 
> Ensure the io_work continues to run as long as the req_list is not
> empty.

There is a version of this patch that I personally suggested before;
however, I couldn't explain why that should happen...

nvme_tcp_send_all tries to send everything it has queued, which means
it should either be able to send everything, or it should see a full
socket buffer. But in case the socket buffer is full, there should be a
.write_space() sk callback triggering when the socket buffer frees up
space... Maybe there is a chance that write_space triggered, started
execution, and that the send_mutex is still taken?
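
For reference, my recollection is that our .write_space() hook does
nothing more than re-queue io_work, roughly along these lines (sketch
from memory, not verbatim):

static void nvme_tcp_write_space(struct sock *sk)
{
	struct nvme_tcp_queue *queue;

	read_lock_bh(&sk->sk_callback_lock);
	queue = sk->sk_user_data;
	if (likely(queue && sk_stream_is_writeable(sk))) {
		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
		/* only schedules io_work; it still has to win the
		 * send_mutex trylock before it can actually send */
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}
	read_unlock_bh(&sk->sk_callback_lock);
}

So even when write_space fires, the rescheduled io_work can still lose
the trylock, which is the window I'm asking about.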

Can we maybe try to catch if that is the case?

> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> Marking this RFC because the timeout is difficult to recreate, which
> makes the patch difficult to verify. The patch was created purely from
> code inspection, so I'm just hoping for feedback on my analysis right now.
> 
>   drivers/nvme/host/tcp.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0222e23f5936..d07eb13d8713 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1140,7 +1140,8 @@ static void nvme_tcp_io_work(struct work_struct *w)
>   				pending = true;
>   			else if (unlikely(result < 0))
>   				break;
> -		}
> +		} else
> +			pending = !llist_empty(&queue->req_list);

In my version, pending was unconditionally set to true...
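
For reference, with this hunk applied the io_work loop would look
roughly like the sketch below (abridged, my paraphrase of the current
code plus the change above); whatever leaves pending set ends up
rescheduling the work at the bottom:

static void nvme_tcp_io_work(struct work_struct *w)
{
	struct nvme_tcp_queue *queue =
		container_of(w, struct nvme_tcp_queue, io_work);
	unsigned long deadline = jiffies + msecs_to_jiffies(1);

	do {
		bool pending = false;
		int result;

		if (mutex_trylock(&queue->send_mutex)) {
			result = nvme_tcp_try_send(queue);
			mutex_unlock(&queue->send_mutex);
			if (result > 0)
				pending = true;
			else if (unlikely(result < 0))
				break;
		} else
			/* someone else holds send_mutex; remember that
			 * queued requests may still need us */
			pending = !llist_empty(&queue->req_list);

		result = nvme_tcp_try_recv(queue);
		if (result > 0)
			pending = true;
		else if (unlikely(result < 0))
			return;

		if (!pending)
			return;
	} while (!time_after(jiffies, deadline)); /* quota exhausted */

	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}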


* Re: [RFC PATCH] nvme-tcp: rerun io_work if req_list is not empty
  2021-05-18  0:42 ` Sagi Grimberg
@ 2021-05-18  1:38   ` Keith Busch
  2021-05-18  3:02     ` Sagi Grimberg
  0 siblings, 1 reply; 6+ messages in thread
From: Keith Busch @ 2021-05-18  1:38 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Mon, May 17, 2021 at 05:42:18PM -0700, Sagi Grimberg wrote:
> > A possible race condition exists where a request to send data that is
> > enqueued from nvme_tcp_handle_r2t() will not be observed by
> > nvme_tcp_send_all() if the latter happens to be running concurrently.
> > The driver relies on io_work to send the enqueued request when it runs
> > again, but the concurrently running nvme_tcp_send_all() may not have
> > released the send_mutex by that time. If no future commands are
> > enqueued to re-kick the io_work, the request will time out in the
> > SEND_H2C state, resulting in a timeout error like:
> > 
> >    nvme nvme0: queue 1: timeout request 0x3 type 6
> > 
> > Ensure the io_work continues to run as long as the req_list is not
> > empty.
> 
> There is a version of this patch that I personally suggested before;
> however, I couldn't explain why that should happen...
> 
> nvme_tcp_send_all tries to send everything it has queued, which means
> it should either be able to send everything, or it should see a full
> socket buffer. But in case the socket buffer is full, there should be a
> .write_space() sk callback triggering when the socket buffer frees up
> space... Maybe there is a chance that write_space triggered, started
> execution, and that the send_mutex is still taken?

nvme_tcp_send_all() breaks out of the loop if nvme_tcp_fetch_request()
returns NULL. If that happens just before io_work calls
nvme_tcp_handle_r2t() to enqueue the H2C request, nvme_tcp_send_all()
will not see that request, but send_mutex is still held. We're counting
on io_work to run again to handle sending the H2C data in that case.
Unlikely as it sounds, if the same nvme_tcp_send_all() context is still
holding the send_mutex when io_work gets back to trying to take it, how
will the data get sent?
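
To spell out the interleaving I have in mind (a hypothetical schedule;
the function names are the driver's, the ordering is my conjecture):

/*
 *   send_mutex holder                     io_work
 *   (e.g. queue_rq direct-send path)
 *   ---------------------------------     -------
 *   nvme_tcp_send_all()
 *     nvme_tcp_fetch_request() -> NULL,
 *     send loop exits
 *                                         nvme_tcp_try_recv()
 *                                           nvme_tcp_handle_r2t()
 *                                             adds H2C req to req_list,
 *                                             re-arms io_work
 *                                         io_work runs again:
 *                                           mutex_trylock(send_mutex)
 *                                             fails (still held)
 *                                           pending stays false,
 *                                           io_work returns
 *   mutex_unlock(send_mutex)
 *
 * Nothing re-kicks io_work afterwards, so the H2C request sits on
 * req_list until the command times out. With the patch, the failed
 * trylock sets pending when req_list is non-empty, so io_work
 * reschedules itself and picks the request up once the mutex is free.
 */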

> Can we maybe try to catch if that is the case?

Do you have a better idea of how we can catch this? I think there has
been only one sighting of this so far, and it looks like it took a long
time to encounter, but we will try again if you have a proposal.


* Re: [RFC PATCH] nvme-tcp: rerun io_work if req_list is not empty
  2021-05-18  1:38   ` Keith Busch
@ 2021-05-18  3:02     ` Sagi Grimberg
  2021-05-18 14:38       ` Keith Busch
  0 siblings, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2021-05-18  3:02 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>>> A possible race condition exists where a request to send data that is
>>> enqueued from nvme_tcp_handle_r2t() will not be observed by
>>> nvme_tcp_send_all() if the latter happens to be running concurrently.
>>> The driver relies on io_work to send the enqueued request when it runs
>>> again, but the concurrently running nvme_tcp_send_all() may not have
>>> released the send_mutex by that time. If no future commands are
>>> enqueued to re-kick the io_work, the request will time out in the
>>> SEND_H2C state, resulting in a timeout error like:
>>>
>>>     nvme nvme0: queue 1: timeout request 0x3 type 6
>>>
>>> Ensure the io_work continues to run as long as the req_list is not
>>> empty.
>>
>> There is a version of this patch that I personally suggested before;
>> however, I couldn't explain why that should happen...
>>
>> nvme_tcp_send_all tries to send everything it has queued, which means
>> it should either be able to send everything, or it should see a full
>> socket buffer. But in case the socket buffer is full, there should be a
>> .write_space() sk callback triggering when the socket buffer frees up
>> space... Maybe there is a chance that write_space triggered, started
>> execution, and that the send_mutex is still taken?
> 
> nvme_tcp_send_all() breaks out of the loop if nvme_tcp_fetch_request()
> returns NULL. If that happens just before io_work calls
> nvme_tcp_handle_r2t() to enqueue the H2C request, nvme_tcp_send_all()
> will not see that request, but send_mutex is still held. We're counting
> on io_work to run again to handle sending the H2C data in that case.
> Unlikely as it sounds, if the same nvme_tcp_send_all() context is still
> holding the send_mutex when io_work gets back to trying to take it, how
> will the data get sent?

Yes, you are correct, I overlooked this race. I guess this is all
coming from having fewer queues than cores (where rx really competes
with tx), which is less common since it is a non-default configuration.

This is enough to convince me that this is needed:
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


>> Can we maybe try to catch if that is the case?
> 
> Do you have a better idea of how we can catch this? I think there has
> been only one sighting of this so far, and it looks like it took a long
> time to encounter, but we will try again if you have a proposal.

We can continue to test with the patch and hunt for another occurrence;
given the argument above, this patch is needed regardless...


* Re: [RFC PATCH] nvme-tcp: rerun io_work if req_list is not empty
  2021-05-18  3:02     ` Sagi Grimberg
@ 2021-05-18 14:38       ` Keith Busch
  0 siblings, 0 replies; 6+ messages in thread
From: Keith Busch @ 2021-05-18 14:38 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Mon, May 17, 2021 at 08:02:41PM -0700, Sagi Grimberg wrote:
> > nvme_tcp_send_all() breaks out of the loop if nvme_tcp_fetch_request()
> > returns NULL. If that happens just before io_work calls
> > nvme_tcp_handle_r2t() to enqueue the H2C request, nvme_tcp_send_all()
> > will not see that request, but send_mutex is still held. We're counting
> > on io_work to run again to handle sending the H2C data in that case.
> > Unlikely as it sounds, if the same nvme_tcp_send_all() context is still
> > holding the send_mutex when io_work gets back to trying to take it, how
> > will the data get sent?
> 
> Yes, you are correct, I overlooked this race. I guess this is all
> coming from having fewer queues than cores (where rx really competes
> with tx), which is less common since it is a non-default configuration.
> 
> This is enough to convince me that this is needed:
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Great! I thought the scenario seemed possible, but wasn't completely
sure, so thank you for confirming.

Christoph, can we pick this up for the next rc? For stable, we can add

Fixes: db5ad6b7f8cdd ("nvme-tcp: try to send request in queue_rq context")
 
> > > Can we maybe try to catch if that is the case?
> > 
> > Do you have a better idea of how we can catch this? I think there has
> > been only one sighting of this so far, and it looks like it took a long
> > time to encounter, but we will try again if you have a proposal.
> 
> We can continue to test with the patch and hunt for another occurrence;
> given the argument above, this patch is needed regardless...

Sounds good, we'll run with the patch and see what happens. If the tests
are successful, I'm not sure if we can conclude this definitely fixes
the timeout or if we just got lucky. If a timeout is observed, though, I
will try to work in a debug patch.


* Re: [RFC PATCH] nvme-tcp: rerun io_work if req_list is not empty
  2021-05-17 22:36 [RFC PATCH] nvme-tcp: rerun io_work if req_list is not empty Keith Busch
  2021-05-18  0:42 ` Sagi Grimberg
@ 2021-05-19  6:35 ` Christoph Hellwig
  1 sibling, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2021-05-19  6:35 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, sagi, hch

Thanks,

applied to nvme-5.13 with the fixes tag added.

