* [PATCH 0/3] nvme-tcp: queue stalls under high load
@ 2022-05-19  6:26 Hannes Reinecke
  2022-05-19  6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-19  6:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Hi all,

one of our partners reported queue stalls and I/O timeouts under
high load. Analysis revealed that we see an extremely 'choppy' I/O
behaviour when running large transfers on systems with low-performance
links (eg 1GigE networks).
We had a system with 30 queues trying to transfer requests of 128MB
each; a simple calculation shows that transferring a _single_ request
on all queues will take up to 38 seconds, thereby timing out the last
request before it got sent.
As a solution I first fixed up the timeout handler to reset the timeout
if the request is still queued or in the process of being sent. The
second patch modifies the send path to only allow new requests if we
have enough space on the TX queue, and the third breaks up the send
loop to avoid system stalls when sending large requests.
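
For reference, the back-of-the-envelope version of that calculation
(assuming 128MB per request and the full 1Gbit/s line rate):

  30 queues * 128 MB  = 3840 MB ~= 31 Gbit
  31 Gbit / 1 Gbit/s  ~= 31 seconds on the wire alone

so with TCP and NVMe framing overhead this lands around the 38 seconds
mentioned above, well past the default 30 second io_timeout.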

As usual, comments and reviews are welcome.

Hannes Reinecke (3):
  nvme-tcp: spurious I/O timeout under high load
  nvme-tcp: Check for write space before queueing requests
  nvme-tcp: send quota for nvme_tcp_send_all()

 drivers/nvme/host/tcp.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

-- 
2.29.2




* [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-19  6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke
@ 2022-05-19  6:26 ` Hannes Reinecke
  2022-05-20  9:05   ` Sagi Grimberg
  2022-05-19  6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-19  6:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

When running on slow links requests might take some time
to be processed, and as we always allow requests to be queued
the timeout may trigger while the requests are still queued.
E.g. sending 128MB requests over 30 queues over a 1GigE link
will inevitably time out before the last request could be sent.
So reset the timeout if the request is still queued
or if it is in the process of being sent.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bb67538d241b..ede76a0719a0 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2332,6 +2332,13 @@ nvme_tcp_timeout(struct request *rq, bool reserved)
 		"queue %d: timeout request %#x type %d\n",
 		nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);
 
+	if (!list_empty(&req->entry) || req->queue->request == req) {
+		dev_warn(ctrl->device,
+			 "queue %d: queue stall, resetting timeout\n",
+			 nvme_tcp_queue_id(req->queue));
+		return BLK_EH_RESET_TIMER;
+	}
+
 	if (ctrl->state != NVME_CTRL_LIVE) {
 		/*
 		 * If we are resetting, connecting or deleting we should
-- 
2.29.2




* [PATCH 2/3] nvme-tcp: Check for write space before queueing requests
  2022-05-19  6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke
  2022-05-19  6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke
@ 2022-05-19  6:26 ` Hannes Reinecke
  2022-05-20  9:17   ` Sagi Grimberg
  2022-05-19  6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke
  2022-05-20  9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg
  3 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-19  6:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

The current model of always queueing incoming requests leads to
write stalls, as we easily overload the network device under
high I/O load.
To avoid unlimited queueing we should rather check whether write
space is available before accepting new requests.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index ede76a0719a0..606565a4c708 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2464,6 +2464,9 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (unlikely(ret))
 		return ret;
 
+	if (!sk_stream_is_writeable(queue->sock->sk))
+		return BLK_STS_RESOURCE;
+
 	blk_mq_start_request(rq);
 
 	nvme_tcp_queue_request(req, true, bd->last);
-- 
2.29.2




* [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all()
  2022-05-19  6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke
  2022-05-19  6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke
  2022-05-19  6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke
@ 2022-05-19  6:26 ` Hannes Reinecke
  2022-05-20  9:19   ` Sagi Grimberg
  2022-05-20  9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg
  3 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-19  6:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Daniel Wagner, Hannes Reinecke

From: Daniel Wagner <dwagner@suse.de>

Add a send quota in nvme_tcp_send_all() to avoid stalls when sending
large amounts of requests.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 606565a4c708..87d760dfa3a9 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
 static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
 {
 	int ret;
+	unsigned long deadline = jiffies + msecs_to_jiffies(1);
 
 	/* drain the send queue as much as we can... */
 	do {
 		ret = nvme_tcp_try_send(queue);
-	} while (ret > 0);
+	} while (ret > 0 || !time_after(jiffies, deadline));
 }
 
 static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
-- 
2.29.2




* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-19  6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke
@ 2022-05-20  9:05   ` Sagi Grimberg
  2022-05-23  8:42     ` Hannes Reinecke
  0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-20  9:05 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

The patch title does not explain what the patch does, or what it
fixes.

> When running on slow links requests might take some time
> to be processed, and as we always allow requests to be queued
> the timeout may trigger while the requests are still queued.
> E.g. sending 128MB requests over 30 queues over a 1GigE link
> will inevitably time out before the last request could be sent.
> So reset the timeout if the request is still queued
> or if it is in the process of being sent.

Maybe I'm missing something... But you are overloading so much that you
timeout even before a command is sent out. That still does not change
the fact that the timeout expired. Why is resetting the timer without
taking any action the acceptable action in this case?

Is this solving a bug? The fact that you get timeouts in your test
is somewhat expected isn't it?

> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index bb67538d241b..ede76a0719a0 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2332,6 +2332,13 @@ nvme_tcp_timeout(struct request *rq, bool reserved)
>   		"queue %d: timeout request %#x type %d\n",
>   		nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);
>   
> +	if (!list_empty(&req->entry) || req->queue->request == req) {
> +		dev_warn(ctrl->device,
> +			 "queue %d: queue stall, resetting timeout\n",
> +			 nvme_tcp_queue_id(req->queue));
> +		return BLK_EH_RESET_TIMER;
> +	}
> +
>   	if (ctrl->state != NVME_CTRL_LIVE) {
>   		/*
>   		 * If we are resetting, connecting or deleting we should




* Re: [PATCH 2/3] nvme-tcp: Check for write space before queueing requests
  2022-05-19  6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke
@ 2022-05-20  9:17   ` Sagi Grimberg
  2022-05-20 10:05     ` Hannes Reinecke
  0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-20  9:17 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


> The current model of always queueing incoming requests leads to
> write stalls, as we easily overload the network device under
> high I/O load.
> To avoid unlimited queueing we should rather check whether write
> space is available before accepting new requests.

I'm somewhat on the fence with this one... On one end, we
are checking the sock write space, but don't check the queued
requests. And, this is purely advisory and not really a check
we rely on.
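
(For reference, sk_stream_is_writeable() is just a snapshot of the
socket send buffer; it is essentially

	static inline bool sk_stream_is_writeable(const struct sock *sk)
	{
		return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk);
	}

i.e. "is at least roughly a third of sndbuf still free right now",
nothing more.)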

The merit of doing something like this is that we don't start
the request timer, but we can just as easily queue the request
and have it later queued for a long time due to the socket being overloaded.

Can you explain your thoughts to why this is a good solution?

> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index ede76a0719a0..606565a4c708 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2464,6 +2464,9 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
>   	if (unlikely(ret))
>   		return ret;
>   
> +	if (!sk_stream_is_writeable(queue->sock->sk))
> +		return BLK_STS_RESOURCE;
> +
>   	blk_mq_start_request(rq);
>   
>   	nvme_tcp_queue_request(req, true, bd->last);



* Re: [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all()
  2022-05-19  6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke
@ 2022-05-20  9:19   ` Sagi Grimberg
  2022-05-20  9:59     ` Hannes Reinecke
  0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-20  9:19 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Daniel Wagner


> Add a send quota in nvme_tcp_send_all() to avoid stalls when sending
> large amounts of requests.
> 
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 606565a4c708..87d760dfa3a9 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
>   static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>   {
>   	int ret;
> +	unsigned long deadline = jiffies + msecs_to_jiffies(1);
>   
>   	/* drain the send queue as much as we can... */
>   	do {
>   		ret = nvme_tcp_try_send(queue);
> -	} while (ret > 0);
> +	} while (ret > 0 || !time_after(jiffies, deadline));

Umm, this will stay here a deadline period even if we don't have
anything to send?



* Re: [PATCH 0/3] nvme-tcp: queue stalls under high load
  2022-05-19  6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke
                   ` (2 preceding siblings ...)
  2022-05-19  6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke
@ 2022-05-20  9:20 ` Sagi Grimberg
  2022-05-20 10:01   ` Hannes Reinecke
  3 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-20  9:20 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


> Hi all,
> 
> one of our partners reported queue stalls and I/O timeouts under
> high load. Analysis revealed that we see an extremely 'choppy' I/O
> behaviour when running large transfers on systems with low-performance
> links (eg 1GigE networks).
> We had a system with 30 queues trying to transfer requests of 128MB
> each; a simple calculation shows that transferring a _single_ request
> on all queues will take up to 38 seconds, thereby timing out the last
> request before it got sent.
> As a solution I first fixed up the timeout handler to reset the timeout
> if the request is still queued or in the process of being sent. The
> second patch modifies the send path to only allow new requests if we
> have enough space on the TX queue, and the third breaks up the send
> loop to avoid system stalls when sending large requests.

What is the average latency you are seeing with this test?
I'm guessing more than 30 seconds :)



* Re: [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all()
  2022-05-20  9:19   ` Sagi Grimberg
@ 2022-05-20  9:59     ` Hannes Reinecke
  2022-05-21 20:02       ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-20  9:59 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Daniel Wagner

On 5/20/22 11:19, Sagi Grimberg wrote:
> 
>> Add a send quota in nvme_tcp_send_all() to avoid stalls when sending
>> large amounts of requests.
>>
>> Signed-off-by: Daniel Wagner <dwagner@suse.de>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/host/tcp.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 606565a4c708..87d760dfa3a9 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct 
>> nvme_tcp_request *req,
>>   static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>   {
>>       int ret;
>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>       /* drain the send queue as much as we can... */
>>       do {
>>           ret = nvme_tcp_try_send(queue);
>> -    } while (ret > 0);
>> +    } while (ret > 0 || !time_after(jiffies, deadline));
> 
> Umm, this will stay here a deadline period even if we don't have
> anything to send?

Ah. Yeah, maybe. We can change it to '&&', which should solve it.
(I think)
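
I.e. something like

	} while (ret > 0 && !time_after(jiffies, deadline));

so we only keep looping while the last attempt actually made progress
and the 1ms budget isn't used up yet.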

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



* Re: [PATCH 0/3] nvme-tcp: queue stalls under high load
  2022-05-20  9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg
@ 2022-05-20 10:01   ` Hannes Reinecke
  2022-05-21 20:03     ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-20 10:01 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 5/20/22 11:20, Sagi Grimberg wrote:
> 
>> Hi all,
>>
>> one of our partners reported queue stalls and I/O timeouts under
>> high load. Analysis revealed that we see an extremely 'choppy' I/O
>> behaviour when running large transfers on systems with low-performance
>> links (eg 1GigE networks).
>> We had a system with 30 queues trying to transfer requests of 128MB
>> each; a simple calculation shows that transferring a _single_ request
>> on all queues will take up to 38 seconds, thereby timing out the last
>> request before it got sent.
>> As a solution I first fixed up the timeout handler to reset the timeout
>> if the request is still queued or in the process of being sent. The
>> second patch modifies the send path to only allow new requests if we
>> have enough space on the TX queue, and the third breaks up the send
>> loop to avoid system stalls when sending large requests.
> 
> What is the average latency you are seeing with this test?
> I'm guessing more than 30 seconds :)

Yes, of course. Simple maths, in the end.
(Actually it's more as we're always triggering a reconnect cycle...)
And telling the customer to change his testcase only helps _so_ much.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



* Re: [PATCH 2/3] nvme-tcp: Check for write space before queueing requests
  2022-05-20  9:17   ` Sagi Grimberg
@ 2022-05-20 10:05     ` Hannes Reinecke
  2022-05-21 20:01       ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-20 10:05 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 5/20/22 11:17, Sagi Grimberg wrote:
> 
>> The current model of always queueing incoming requests leads to
>> write stalls, as we easily overload the network device under
>> high I/O load.
>> To avoid unlimited queueing we should rather check whether write
>> space is available before accepting new requests.
> 
> I'm somewhat on the fence with this one... On one end, we
> are checking the sock write space, but don't check the queued
> requests. And, this is purely advisory and not really a check
> we rely on.
> 
> The merit of doing something like this is that we don't start
> the request timer, but we can just as easily queue the request
> and have it later queued for a long time due to the socket being overloaded.
> 
> Can you explain your thoughts to why this is a good solution?
> 
Request timeouts.
As soon as we call 'blk_mq_start_request()' the I/O timer is started, and 
given that we (currently) queue _every_ request irrespective of the 
underlying device status we might end up queueing for a _loooong_ time.

Timeouts while still in the queue are being handled by the first patch, 
but the underlying network might also be busy with retries and whatnot.
So again, queuing requests when we _know_ there'll be congestion is 
just asking for trouble (or, rather, spurious I/O timeouts).

If one is worried about performance one can always increase the wmem 
size :-), but really it means that either your testcase or your network 
is misdesigned.
And I'm perfectly fine with increasing the latency in these cases.
What I don't like is timeouts, as these will show up to the user and we 
get all the support calls telling us that the kernel is broken.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



* Re: [PATCH 2/3] nvme-tcp: Check for write space before queueing requests
  2022-05-20 10:05     ` Hannes Reinecke
@ 2022-05-21 20:01       ` Sagi Grimberg
  0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-21 20:01 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>> The current model of always queueing incoming requests leads to
>>> write stalls, as we easily overload the network device under
>>> high I/O load.
>>> To avoid unlimited queueing we should rather check whether write
>>> space is available before accepting new requests.
>>
>> I'm somewhat on the fence with this one... On one end, we
>> are checking the sock write space, but don't check the queued
>> requests. And, this is purely advisory and not really a check
>> we rely on.
>>
>> The merit of doing something like this is that we don't start
>> the request timer, but we can just as easily queue the request
>> and have it later queued for a long time due to the socket being overloaded.
>>
>> Can you explain your thoughts to why this is a good solution?
>>
> Request timeouts.
> As soon as we call 'blk_mq_start_request()' the I/O timer is started, and 
> given that we (currently) queue _every_ request irrespective of the 
> underlying device status we might end up queueing for a _loooong_ time.
> 
> Timeouts while still in the queue are being handled by the first patch, 
> but the underlying network might also be busy with retries and whatnot.
> So again, queuing requests when we _know_ there'll be congestion is 
> just asking for trouble (or, rather, spurious I/O timeouts).
> 
> If one is worried about performance one can always increase the wmem 
> size :-), but really it means that either your testcase or your network 
> is misdesigned.
> And I'm perfectly fine with increasing the latency in these cases.
> What I don't like is timeouts, as these will show up to the user and we 
> get all the support calls telling us that the kernel is broken.

Can you run some sanity perf tests to understand if anything unexpected
comes up from this?



* Re: [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all()
  2022-05-20  9:59     ` Hannes Reinecke
@ 2022-05-21 20:02       ` Sagi Grimberg
  0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-21 20:02 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Daniel Wagner


>>> Add a send quota in nvme_tcp_send_all() to avoid stalls when sending
>>> large amounts of requests.
>>>
>>> Signed-off-by: Daniel Wagner <dwagner@suse.de>
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>> ---
>>>   drivers/nvme/host/tcp.c | 3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 606565a4c708..87d760dfa3a9 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct 
>>> nvme_tcp_request *req,
>>>   static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>>   {
>>>       int ret;
>>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>       /* drain the send queue as much as we can... */
>>>       do {
>>>           ret = nvme_tcp_try_send(queue);
>>> -    } while (ret > 0);
>>> +    } while (ret > 0 || !time_after(jiffies, deadline));
>>
>> Umm, this will stay here a deadline period even if we don't have
>> anything to send?
> 
> Ah. Yeah, maybe. We can change it to '&&', which should solve it.
> (I think)

See what is done in io_work()
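
Roughly this pattern (paraphrasing nvme_tcp_io_work(); details differ
a bit between kernel versions):

	unsigned long deadline = jiffies + msecs_to_jiffies(1);

	do {
		bool pending = false;
		int result;

		result = nvme_tcp_try_send(queue);
		if (result > 0)
			pending = true;

		result = nvme_tcp_try_recv(queue);
		if (result > 0)
			pending = true;

		/* nothing made progress, no point in spinning */
		if (!pending)
			return;
	} while (!time_after(jiffies, deadline));

	/* time budget used up: reschedule instead of hogging the CPU */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);

i.e. it bails out as soon as neither send nor receive made progress,
and once the budget is exhausted it requeues itself rather than
looping on.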



* Re: [PATCH 0/3] nvme-tcp: queue stalls under high load
  2022-05-20 10:01   ` Hannes Reinecke
@ 2022-05-21 20:03     ` Sagi Grimberg
  0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-21 20:03 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>> Hi all,
>>>
>>> one of our partners reported queue stalls and I/O timeouts under
>>> high load. Analysis revealed that we see an extremely 'choppy' I/O
>>> behaviour when running large transfers on systems with low-performance
>>> links (eg 1GigE networks).
>>> We had a system with 30 queues trying to transfer requests of 128MB
>>> each; a simple calculation shows that transferring a _single_ request
>>> on all queues will take up to 38 seconds, thereby timing out the last
>>> request before it got sent.
>>> As a solution I first fixed up the timeout handler to reset the timeout
>>> if the request is still queued or in the process of being sent. The
>>> second patch modifies the send path to only allow new requests if we
>>> have enough space on the TX queue, and the third breaks up the send
>>> loop to avoid system stalls when sending large requests.
>>
>> What is the average latency you are seeing with this test?
>> I'm guessing more than 30 seconds :)
> 
> Yes, of course. Simple maths, in the end.
> (Actually it's more as we're always triggering a reconnect cycle...)
> And telling the customer to change his testcase only helps _so_ much.

Not to change the test case, simply making the io_timeout substantially
larger.



* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-20  9:05   ` Sagi Grimberg
@ 2022-05-23  8:42     ` Hannes Reinecke
  2022-05-23 13:36       ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-23  8:42 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 5/20/22 11:05, Sagi Grimberg wrote:
> The patch title does not explain what the patch does, or what it
> fixes.
> 
>> When running on slow links requests might take some time
>> to be processed, and as we always allow requests to be queued
>> the timeout may trigger while the requests are still queued.
>> E.g. sending 128MB requests over 30 queues over a 1GigE link
>> will inevitably time out before the last request could be sent.
>> So reset the timeout if the request is still queued
>> or if it is in the process of being sent.
> 
> Maybe I'm missing something... But you are overloading so much that you
> timeout even before a command is sent out. That still does not change
> the fact that the timeout expired. Why is resetting the timer without
> taking any action the acceptable action in this case?
> 
> Is this solving a bug? The fact that you get timeouts in your test
> is somewhat expected isn't it?
> 

Yes, and no.
We happily let requests sit in the (blk-layer) queue for basically any 
amount of time.
And it's a design decision within the driver _when_ to start the timer.
My point is that starting the timer and _then_ doing internal queuing is 
questionable; we might have returned BLK_STS_AGAIN (or something) when 
we found that we cannot send requests right now.
Or we might have started the timer only when the request is being sent 
to the HW.
So returning a timeout in one case but not the other is somewhat erratic.
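
To make that concrete, the ordering in nvme_tcp_queue_rq() is
currently roughly

	ret = nvme_tcp_setup_cmd_pdu(ns, rq);
	if (unlikely(ret))
		return ret;

	blk_mq_start_request(rq);	/* I/O timer starts here ... */

	nvme_tcp_queue_request(req, true, bd->last);
					/* ... but the request may now sit
					 * on the internal send list for an
					 * arbitrarily long time */

so the timer is already running long before the command has any chance
of reaching the wire.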

I would argue that we should only start the timer when requests have had 
a chance to be sent to the HW; when it's still within the driver one has 
a hard time arguing why timeouts do apply on one level but not on the 
other, especially as both levels do exactly the same (to wit: queue 
commands until they can be sent).

I'm open to discussion what we should be doing when the request is in 
the process of being sent. But when it didn't have a chance to be sent 
and we just overloaded our internal queuing we shouldn't be sending 
timeouts.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-23  8:42     ` Hannes Reinecke
@ 2022-05-23 13:36       ` Sagi Grimberg
  2022-05-23 14:01         ` Hannes Reinecke
  0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-23 13:36 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>> The patch title does not explain what the patch does, or what it
>> fixes.
>>
>>> When running on slow links requests might take some time
>>> to be processed, and as we always allow requests to be queued
>>> the timeout may trigger while the requests are still queued.
>>> E.g. sending 128MB requests over 30 queues over a 1GigE link
>>> will inevitably time out before the last request could be sent.
>>> So reset the timeout if the request is still queued
>>> or if it is in the process of being sent.
>>
>> Maybe I'm missing something... But you are overloading so much that you
>> timeout even before a command is sent out. That still does not change
>> the fact that the timeout expired. Why is resetting the timer without
>> taking any action the acceptable action in this case?
>>
>> Is this solving a bug? The fact that you get timeouts in your test
>> is somewhat expected isn't it?
>>
> 
> Yes, and no.
> We happily let requests sit in the (blk-layer) queue for basically any 
> amount of time.
> And it's a design decision within the driver _when_ to start the timer.

Is it? isn't it supposed to start when the request is queued?

> My point is that starting the timer and _then_ doing internal queuing is 
> questionable; we might have returned BLK_STS_AGAIN (or something) when 
> we found that we cannot send requests right now.
> Or we might have started the timer only when the request is being sent 
> to the HW.

It is not sent to the HW, it is sent down the TCP stack. But it is not
any different than posting the request to a hw queue on a pci/rdma/fc
device. The device has some context that process the queue and sends
it to the wire, in nvme-tcp that context is io_work.

> So returning a timeout in one case but not the other is somewhat erratic.

What is the difference compared to posting a work request to an rdma nic on
a congested network? an imaginary 1Gb rdma nic :)

Or maybe lets ask it differently, what happens if you run this test
on the same nic, but with soft-iwarp/soft-roce interface on top of it?

> I would argue that we should only start the timer when requests have had 
> a chance to be sent to the HW; when it's still within the driver one has 
> a hard time arguing why timeouts do apply on one level but not on the 
> other, especially as both levels do exactly the same (to wit: queue 
> commands until they can be sent).

I look at this differently, the way I see it, is that nvme-tcp is
exactly like nvme-rdma/nvme-fc but also implements context executing the
command, in software. So in my mind, it is mixing different layers.

> I'm open to discussion what we should be doing when the request is in 
> the process of being sent. But when it didn't have a chance to be sent 
> and we just overloaded our internal queuing we shouldn't be sending 
> timeouts.

As mentioned above, what happens if that same reporter opens another bug
that the same phenomenon happens with soft-iwarp? What would you tell
him/her?



* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-23 13:36       ` Sagi Grimberg
@ 2022-05-23 14:01         ` Hannes Reinecke
  2022-05-23 15:05           ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-23 14:01 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 5/23/22 15:36, Sagi Grimberg wrote:
> 
>>> The patch title does not explain what the patch does, or what it
>>> fixes.
>>>
>>>> When running on slow links requests might take some time
>>>> to be processed, and as we always allow requests to be queued
>>>> the timeout may trigger while the requests are still queued.
>>>> E.g. sending 128MB requests over 30 queues over a 1GigE link
>>>> will inevitably time out before the last request could be sent.
>>>> So reset the timeout if the request is still queued
>>>> or if it is in the process of being sent.
>>>
>>> Maybe I'm missing something... But you are overloading so much that you
>>> timeout even before a command is sent out. That still does not change
>>> the fact that the timeout expired. Why is resetting the timer without
>>> taking any action the acceptable action in this case?
>>>
>>> Is this solving a bug? The fact that you get timeouts in your test
>>> is somewhat expected isn't it?
>>>
>>
>> Yes, and no.
>> We happily let requests sit in the (blk-layer) queue for basically any 
>> amount of time.
>> And it's a design decision within the driver _when_ to start the timer.
> 
> Is it? isn't it supposed to start when the request is queued?
> 
Queued where?

>> My point is that starting the timer and _then_ doing internal queuing is 
>> questionable; we might have returned BLK_STS_AGAIN (or something) when 
>> we found that we cannot send requests right now.
>> Or we might have started the timer only when the request is being sent 
>> to the HW.
> 
> It is not sent to the HW, it is sent down the TCP stack. But it is not
> any different than posting the request to a hw queue on a pci/rdma/fc
> device. The device has some context that process the queue and sends
> it to the wire, in nvme-tcp that context is io_work.
> 
>> So returning a timeout in one case but not the other is somewhat erratic.
> 
> What is the difference compared to posting a work request to an rdma nic on
> a congested network? an imaginary 1Gb rdma nic :)
> 
> Or maybe lets ask it differently, what happens if you run this test
> on the same nic, but with soft-iwarp/soft-roce interface on top of it?
> 
I can't really tell, as I haven't tried.
Can give it a go, though.

>> I would argue that we should only start the timer when requests have 
>> had a chance to be sent to the HW; when it's still within the driver 
>> one has a hard time arguing why timeouts do apply on one level but not 
>> on the other, especially as both levels do exactly the same (to wit: 
>> queue commands until they can be sent).
> 
> I look at this differently, the way I see it, is that nvme-tcp is
> exactly like nvme-rdma/nvme-fc but also implements context executing the
> command, in software. So in my mind, it is mixing different layers.
> 
Hmm. Yes, of course one could take this stance.
Especially given the NVMe-oF notion of 'transport'.

Sadly it's hard to reproduce this with other transports, as they 
inevitably only run on HW fast enough to not directly exhibit this 
problem (FC is now on 8G min, and IB probably on at least 10G).

The issue arises when running a fio test with a variable size
(4M - 128M), which works on other transports like FC.
For TCP we're running into the said timeouts, but adding things like 
blk-cgroup or rq-qos make the issue go away.

So question naturally would be why we need a traffic shaper on TCP, but 
not on FC.

>> I'm open to discussion what we should be doing when the request is in 
>> the process of being sent. But when it didn't have a chance to be sent 
>> and we just overloaded our internal queuing we shouldn't be sending 
>> timeouts.
> 
> As mentioned above, what happens if that same reporter opens another bug
> that the same phenomenon happens with soft-iwarp? What would you tell
> him/her?

Nope. It's a HW appliance. Not a chance to change that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-23 14:01         ` Hannes Reinecke
@ 2022-05-23 15:05           ` Sagi Grimberg
  2022-05-23 16:07             ` Hannes Reinecke
  0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-23 15:05 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>>> The patch title does not explain what the patch does, or what it
>>>> fixes.
>>>>
>>>>> When running on slow links requests might take some time
>>>>> to be processed, and as we always allow requests to be queued
>>>>> the timeout may trigger while the requests are still queued.
>>>>> E.g. sending 128MB requests over 30 queues over a 1GigE link
>>>>> will inevitably time out before the last request could be sent.
>>>>> So reset the timeout if the request is still queued
>>>>> or if it is in the process of being sent.
>>>>
>>>> Maybe I'm missing something... But you are overloading so much that you
>>>> timeout even before a command is sent out. That still does not change
>>>> the fact that the timeout expired. Why is resetting the timer without
>>>> taking any action the acceptable action in this case?
>>>>
>>>> Is this solving a bug? The fact that you get timeouts in your test
>>>> is somewhat expected isn't it?
>>>>
>>>
>>> Yes, and no.
>>> We happily let requests sit in the (blk-layer) queue for basically 
>>> any amount of time.
>>> And it's a design decision within the driver _when_ to start the timer.
>>
>> Is it? isn't it supposed to start when the request is queued?
>>
> Queued where?

When .queue_rq() is called, when it returns the request should be
queued.

>>> My point is that starting the timer and _then_ doing internal queuing is 
>>> questionable; we might have returned BLK_STS_AGAIN (or something) 
>>> when we found that we cannot send requests right now.
>>> Or we might have started the timer only when the request is being 
>>> sent to the HW.
>>
>> It is not sent to the HW, it is sent down the TCP stack. But it is not
>> any different than posting the request to a hw queue on a pci/rdma/fc
>> device. The device has some context that process the queue and sends
>> it to the wire, in nvme-tcp that context is io_work.
>>
>>> So returning a timeout in one case but not the other is somewhat 
>>> erratic.
>>
>> What is the difference compared to posting a work request to an rdma nic on
>> a congested network? an imaginary 1Gb rdma nic :)
>>
>> Or maybe lets ask it differently, what happens if you run this test
>> on the same nic, but with soft-iwarp/soft-roce interface on top of it?
>>
> I can't really tell, as I haven't tried.
> Can give it a go, though.

You can, but the same thing will happen. the only difference is that
soft-iwarp is a TCP ULP that does not expose the state of the socket
to nvme transport (rdma), nor any congestion related attributes.

>>> I would argue that we should only start the timer when requests have 
>>> had a chance to be sent to the HW; when it's still within the driver 
>>> one has a hard time arguing why timeouts do apply on one level but 
>>> not on the other, especially as both levels do exactly the same (to 
>>> wit: queue commands until they can be sent).
>>
>> I look at this differently, the way I see it, is that nvme-tcp is
>> exactly like nvme-rdma/nvme-fc but also implements context executing the
>> command, in software. So in my mind, it is mixing different layers.
>>
> Hmm. Yes, of course one could take this stance.
> Especially given the NVMe-oF notion of 'transport'.
> 
> Sadly it's hard to reproduce this with other transports, as they 
> inevitably only run on HW fast enough to not directly exhibit this 
> problem (FC is now on 8G min, and IB probably on at least 10G).

use soft-iwarp/soft-roce and it should do the exact same thing
on a 1Gb nic.

> The issue arises when running a fio test with a variable size
> (4M - 128M), which works on other transports like FC.
> For TCP we're running into the said timeouts, but adding things like 
> blk-cgroup or rq-qos makes the issue go away.
> 
> So question naturally would be why we need a traffic shaper on TCP, but 
> not on FC.

If I were to take an old 10G rdma nic (or even new 25G), stick it into a
128 core machine, create 128 queues of depth 1024, run workload with
bs=4M jobs=128 qd=1024, you will also see timeouts.

In every setup exists a workload that will overwhelm it...

>>> I'm open to discussion what we should be doing when the request is in 
>>> the process of being sent. But when it didn't have a chance to be 
>>> sent and we just overloaded our internal queuing we shouldn't be 
>>> sending timeouts.
>>
>> As mentioned above, what happens if that same reporter opens another bug
>> that the same phenomenon happens with soft-iwarp? What would you tell
>> him/her?
> 
> Nope. It's a HW appliance. Not a chance to change that.

It was just a theoretical question.

Do note that I'm not against solving a problem for anyone, I'm just
questioning if increasing the io_timeout to be unbound in case the
network is congested, is the right solution for everyone instead of
a particular case that can easily be solved with udev to make the
io_timeout to be as high as needed.
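
E.g. an untested sketch of such a rule, assuming the block layer's
queue/io_timeout sysfs attribute (value in milliseconds):

	ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="nvme*n*", \
		ATTR{queue/io_timeout}="300000"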

One can argue that this patchset is making nvme-tcp to basically
ignore the device io_timeout in certain cases.



* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-23 15:05           ` Sagi Grimberg
@ 2022-05-23 16:07             ` Hannes Reinecke
  2022-05-24  7:57               ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-23 16:07 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 5/23/22 17:05, Sagi Grimberg wrote:
> 
[ .. ]
>>>> I'm open to discussion what we should be doing when the request is 
>>>> in the process of being sent. But when it didn't have a chance to be 
>>>> sent and we just overloaded our internal queuing we shouldn't be 
>>>> sending timeouts.
>>>
>>> As mentioned above, what happens if that same reporter opens another bug
>>> that the same phenomenon happens with soft-iwarp? What would you tell
>>> him/her?
>>
>> Nope. It's a HW appliance. Not a chance to change that.
> 
> It was just a theoretical question.
> 
> Do note that I'm not against solving a problem for anyone, I'm just
> questioning if increasing the io_timeout to be unbound in case the
> network is congested, is the right solution for everyone instead of
> a particular case that can easily be solved with udev to make the
> io_timeout to be as high as needed.
> 
> One can argue that this patchset is making nvme-tcp to basically
> ignore the device io_timeout in certain cases.

Oh, yes, sure, that will happen.
What I'm actually arguing is the imprecise difference between 
BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq()
and command timeouts in case of resource constraints on the driver 
implementing ->queue_rq().

If there is a resource constraint the driver is free to return 
BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept 
the request (in which case there will be a timeout).

I could live with a timeout if that would just result in the command 
being retried. But in the case of nvme it results in a connection reset 
to boot, making customers really nervous that their system is broken.

And having a workload which can generate connection resets feels like a 
DoS attack to me; applications shouldn't be able to do that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-23 16:07             ` Hannes Reinecke
@ 2022-05-24  7:57               ` Sagi Grimberg
  2022-05-24  8:08                 ` Hannes Reinecke
  0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-24  7:57 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>>>> I'm open to discussion what we should be doing when the request is 
>>>>> in the process of being sent. But when it didn't have a chance to 
>>>>> be sent and we just overloaded our internal queuing we shouldn't be 
>>>>> sending timeouts.
>>>>
>>>> As mentioned above, what happens if that same reporter opens another 
>>>> bug
>>>> that the same phenomenon happens with soft-iwarp? What would you tell
>>>> him/her?
>>>
>>> Nope. It's a HW appliance. Not a chance to change that.
>>
>> It was just a theoretical question.
>>
>> Do note that I'm not against solving a problem for anyone, I'm just
>> questioning if increasing the io_timeout to be unbound in case the
>> network is congested, is the right solution for everyone instead of
>> a particular case that can easily be solved with udev to make the
>> io_timeout to be as high as needed.
>>
>> One can argue that this patchset is making nvme-tcp to basically
>> ignore the device io_timeout in certain cases.
> 
> Oh, yes, sure, that will happen.
> What I'm actually arguing is the imprecise difference between 
> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq()
> and command timeouts in case of resource constraints on the driver 
> implementing ->queue_rq().
> 
>> If there is a resource constraint the driver is free to return 
> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept 
> the request (in which case there will be a timeout).

There is no resource constraint. The driver sizes up the resources
to be able to queue all the requests it is getting.

> I could live with a timeout if that would just result in the command 
> being retried. But in the case of nvme it results in a connection reset 
> to boot, making customers really nervous that their system is broken.

But how does the driver know that it is running in this environment that
is completely congested? What I'm saying is that this is a specific use
case that the solution can have negative side-effects for other common
use-cases, because it is beyond the scope of the driver to handle.

We can also trigger this condition with nvme-rdma.

We could stay with this patch, but I'd argue that this might be the
wrong thing to do in certain use-cases.

> And having a workload which can generate connection resets feels like a 
> DoS attack to me; applications shouldn't be able to do that.

Having a workload occupying all the tags all the time is also a DDOS
attack.



* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-24  7:57               ` Sagi Grimberg
@ 2022-05-24  8:08                 ` Hannes Reinecke
  2022-05-24  8:53                   ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-24  8:08 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 5/24/22 09:57, Sagi Grimberg wrote:
> 
>>>>>> I'm open to discussion what we should be doing when the request is 
>>>>>> in the process of being sent. But when it didn't have a chance to 
>>>>>> be sent and we just overloaded our internal queuing we shouldn't 
>>>>>> be sending timeouts.
>>>>>
>>>>> As mentioned above, what happens if that same reporter opens 
>>>>> another bug
>>>>> that the same phenomenon happens with soft-iwarp? What would you tell
>>>>> him/her?
>>>>
>>>> Nope. It's a HW appliance. Not a chance to change that.
>>>
>>> It was just a theoretical question.
>>>
>>> Do note that I'm not against solving a problem for anyone, I'm just
>>> questioning if increasing the io_timeout to be unbound in case the
>>> network is congested, is the right solution for everyone instead of
>>> a particular case that can easily be solved with udev to make the
>>> io_timeout to be as high as needed.
>>>
>>> One can argue that this patchset is making nvme-tcp to basically
>>> ignore the device io_timeout in certain cases.
>>
>> Oh, yes, sure, that will happen.
>> What I'm actually arguing is the imprecise difference between 
>> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq()
>> and command timeouts in case of resource constraints on the driver 
>> implementing ->queue_rq().
>>
>> If there is a resource constraint the driver is free to return 
>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept 
>> the request (in which case there will be a timeout).
> 
> There is no resource constraint. The driver sizes up the resources
> to be able to queue all the requests it is getting.
> 
>> I could live with a timeout if that would just result in the command 
>> being retried. But in the case of nvme it results in a connection 
>> reset to boot, making customers really nervous that their system is 
>> broken.
> 
> But how does the driver know that it is running in this environment that
> is completely congested? What I'm saying is that this is a specific use
> case that the solution can have negative side-effects for other common
> use-cases, because it is beyond the scope of the driver to handle.
> 
> We can also trigger this condition with nvme-rdma.
> 
> We could stay with this patch, but I'd argue that this might be the
> wrong thing to do in certain use-cases.
> 
Right, okay.

Arguably this is a workload corner case, and we might not want to fix 
this in the driver.

_However_: do we need to do a controller reset in this case?
Shouldn't it be sufficient to just complete the command w/ timeout error 
and be done with it?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-24  8:08                 ` Hannes Reinecke
@ 2022-05-24  8:53                   ` Sagi Grimberg
  2022-05-24  9:34                     ` Hannes Reinecke
  0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-24  8:53 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>>>>>> I'm open to discussion what we should be doing when the request 
>>>>>>> is in the process of being sent. But when it didn't have a chance 
>>>>>>> to be sent and we just overloaded our internal queuing we 
>>>>>>> shouldn't be sending timeouts.
>>>>>>
>>>>>> As mentioned above, what happens if that same reporter opens 
>>>>>> another bug
>>>>>> that the same phenomenon happens with soft-iwarp? What would you tell
>>>>>> him/her?
>>>>>
>>>>> Nope. It's a HW appliance. Not a chance to change that.
>>>>
>>>> It was just a theoretical question.
>>>>
>>>> Do note that I'm not against solving a problem for anyone, I'm just
>>>> questioning if increasing the io_timeout to be unbound in case the
>>>> network is congested, is the right solution for everyone instead of
>>>> a particular case that can easily be solved with udev to make the
>>>> io_timeout to be as high as needed.
>>>>
>>>> One can argue that this patchset is making nvme-tcp to basically
>>>> ignore the device io_timeout in certain cases.
>>>
>>> Oh, yes, sure, that will happen.
>>> What I'm actually arguing is the imprecise difference between 
>>> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq()
>>> and command timeouts in case of resource constraints on the driver 
>>> implementing ->queue_rq().
>>>
>>> If there is a resource constraint the driver is free to return 
>>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept 
>>> the request (in which case there will be a timeout).
>>
>> There is no resource constraint. The driver sizes up the resources
>> to be able to queue all the requests it is getting.
>>
>>> I could live with a timeout if that would just result in the command 
>>> being retried. But in the case of nvme it results in a connection 
>>> reset to boot, making customers really nervous that their system is 
>>> broken.
>>
>> But how does the driver know that it is running in this environment that
>> is completely congested? What I'm saying is that this is a specific use
>> case that the solution can have negative side-effects for other common
>> use-cases, because it is beyond the scope of the driver to handle.
>>
>> We can also trigger this condition with nvme-rdma.
>>
>> We could stay with this patch, but I'd argue that this might be the
>> wrong thing to do in certain use-cases.
>>
> Right, okay.
> 
> Arguably this is a workload corner case, and we might not want to fix 
> this in the driver.
> 
> _However_: do we need to do a controller reset in this case?
> Shouldn't it be sufficient to just complete the command w/ timeout error 
> and be done with it?

The question is what is special about this timeout vs. any other
timeout?

pci attempts to abort the command before triggering a controller
reset. Maybe we should also? Although abort is not really reliable
going on the admin queue...



* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-24  8:53                   ` Sagi Grimberg
@ 2022-05-24  9:34                     ` Hannes Reinecke
  2022-05-24  9:58                       ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2022-05-24  9:34 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 5/24/22 10:53, Sagi Grimberg wrote:
> 
>>>>>>>> I'm open to discussion what we should be doing when the request 
>>>>>>>> is in the process of being sent. But when it didn't have a 
>>>>>>>> chance to be sent and we just overloaded our internal queuing we 
>>>>>>>> shouldn't be sending timeouts.
>>>>>>>
>>>>>>> As mentioned above, what happens if that same reporter opens 
>>>>>>> another bug
>>>>>>> that the same phenomenon happens with soft-iwarp? What would you 
>>>>>>> tell
>>>>>>> him/her?
>>>>>>
>>>>>> Nope. It's a HW appliance. Not a chance to change that.
>>>>>
>>>>> It was just a theoretical question.
>>>>>
>>>>> Do note that I'm not against solving a problem for anyone, I'm just
>>>>> questioning if increasing the io_timeout to be unbound in case the
>>>>> network is congested, is the right solution for everyone instead of
>>>>> a particular case that can easily be solved with udev to make the
>>>>> io_timeout to be as high as needed.
>>>>>
>>>>> One can argue that this patchset is making nvme-tcp to basically
>>>>> ignore the device io_timeout in certain cases.
>>>>
>>>> Oh, yes, sure, that will happen.
>>>> What I'm actually arguing is the imprecise difference between 
>>>> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq()
>>>> and command timeouts in case of resource constraints on the driver 
>>>> implementing ->queue_rq().
>>>>
>>>> If there is a resource constraint the driver is free to return 
>>>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or 
>>>> accept the request (in which case there will be a timeout).
>>>
>>> There is no resource constraint. The driver sizes up the resources
>>> to be able to queue all the requests it is getting.
>>>
>>>> I could live with a timeout if that would just result in the command 
>>>> being retried. But in the case of nvme it results in a connection 
>>>> reset to boot, making customers really nervous that their system is 
>>>> broken.
>>>
>>> But how does the driver know that it is running in this environment that
>>> is completely congested? What I'm saying is that this is a specific use
>>> case that the solution can have negative side-effects for other common
>>> use-cases, because it is beyond the scope of the driver to handle.
>>>
>>> We can also trigger this condition with nvme-rdma.
>>>
>>> We could stay with this patch, but I'd argue that this might be the
>>> wrong thing to do in certain use-cases.
>>>
>> Right, okay.
>>
>> Arguably this is a workload corner case, and we might not want to fix 
>> this in the driver.
>>
>> _However_: do we need to do a controller reset in this case?
>> Shouldn't it be sufficient to just complete the command w/ timeout 
>> error and be done with it?
> 
> The question is what is special about this timeout vs. any other
> timeout?
> 
> pci attempts to abort the command before triggering a controller
> reset. Maybe we should also? Although abort is not really reliable
> going on the admin queue...

I am not talking about NVMe abort.
I'm talking about this:

@@ -2335,6 +2340,11 @@ nvme_tcp_timeout(struct request *rq, bool reserved)
                 "queue %d: timeout request %#x type %d\n",
                 nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);

+       if (!list_empty(&req->entry)) {
+               nvme_tcp_complete_timed_out(rq);
+               return BLK_EH_DONE;
+       }
+
         if (ctrl->state != NVME_CTRL_LIVE) {
                 /*
                  * If we are resetting, connecting or deleting we should


as the command is still in the queue and NVMe abort don't enter the 
picture at all.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
  2022-05-24  9:34                     ` Hannes Reinecke
@ 2022-05-24  9:58                       ` Sagi Grimberg
  0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2022-05-24  9:58 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme



On 5/24/22 12:34, Hannes Reinecke wrote:
> On 5/24/22 10:53, Sagi Grimberg wrote:
>>
>>>>>>>>> I'm open to discussion what we should be doing when the request 
>>>>>>>>> is in the process of being sent. But when it didn't have a 
>>>>>>>>> chance to be sent and we just overloaded our internal queuing 
>>>>>>>>> we shouldn't be sending timeouts.
>>>>>>>>
>>>>>>>> As mentioned above, what happens if that same reporter opens 
>>>>>>>> another bug
>>>>>>>> that the same phenomenon happens with soft-iwarp? What would you 
>>>>>>>> tell
>>>>>>>> him/her?
>>>>>>>
>>>>>>> Nope. It's a HW appliance. Not a chance to change that.
>>>>>>
>>>>>> It was just a theoretical question.
>>>>>>
>>>>>> Do note that I'm not against solving a problem for anyone, I'm just
>>>>>> questioning if increasing the io_timeout to be unbound in case the
>>>>>> network is congested, is the right solution for everyone instead of
>>>>>> a particular case that can easily be solved with udev to make the
>>>>>> io_timeout to be as high as needed.
>>>>>>
>>>>>> One can argue that this patchset is making nvme-tcp to basically
>>>>>> ignore the device io_timeout in certain cases.
>>>>>
>>>>> Oh, yes, sure, that will happen.
>>>>> What I'm actually arguing is the imprecise difference between 
>>>>> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq()
>>>>> and command timeouts in case of resource constraints on the driver 
>>>>> implementing ->queue_rq().
>>>>>
>>>>> If there is a resource constraint the driver is free to return 
>>>>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or 
>>>>> accept the request (in which case there will be a timeout).
>>>>
>>>> There is no resource constraint. The driver sizes up the resources
>>>> to be able to queue all the requests it is getting.
>>>>
>>>>> I could live with a timeout if that would just result in the 
>>>>> command being retried. But in the case of nvme it results in a 
>>>>> connection reset to boot, making customers really nervous that 
>>>>> their system is broken.
>>>>
>>>> But how does the driver know that it is running in this environment 
>>>> that
>>>> is completely congested? What I'm saying is that this is a specific use
>>>> case that the solution can have negative side-effects for other common
>>>> use-cases, because it is beyond the scope of the driver to handle.
>>>>
>>>> We can also trigger this condition with nvme-rdma.
>>>>
>>>> We could stay with this patch, but I'd argue that this might be the
>>>> wrong thing to do in certain use-cases.
>>>>
>>> Right, okay.
>>>
>>> Arguably this is a workload corner case, and we might not want to fix 
>>> this in the driver.
>>>
>>> _However_: do we need to do a controller reset in this case?
>>> Shouldn't it be sufficient to just complete the command w/ timeout 
>>> error and be done with it?
>>
>> The question is what is special about this timeout vs. any other
>> timeout?
>>
>> pci attempts to abort the command before triggering a controller
>> reset. Maybe we should also? Although abort is not really reliable
>> going on the admin queue...
> 
> I am not talking about NVMe abort.
> I'm talking about this:
> 
> @@ -2335,6 +2340,11 @@ nvme_tcp_timeout(struct request *rq, bool reserved)
>                  "queue %d: timeout request %#x type %d\n",
>                  nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);
> 
> +       if (!list_empty(&req->entry)) {
> +               nvme_tcp_complete_timed_out(rq);
> +               return BLK_EH_DONE;
> +       }
> +
>          if (ctrl->state != NVME_CTRL_LIVE) {
>                  /*
>                   * If we are resetting, connecting or deleting we should
> 
> 
> as the command is still in the queue and NVMe abort don't enter the 
> picture at all.

That for sure will not help because nvme_tcp_complete_timed_out stops
the queue. What you can do is maybe just remove it from the pending list
and complete it.
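
Something along these lines (just a sketch; nvme_tcp_unlink_pending()
is a made-up helper that would have to unlink the request from
req_list/send_list under the queue locking, racing against io_work):

	if (!list_empty(&req->entry)) {
		/* never hit the wire: drop it from the internal lists and
		 * fail just this command instead of resetting the queue.
		 * nvme_tcp_unlink_pending() is hypothetical. */
		nvme_tcp_unlink_pending(req);
		nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
		blk_mq_complete_request(rq);
		return BLK_EH_DONE;
	}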


