From: Hannes Reinecke <hare@suse.de>
To: Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>, linux-nvme@lists.infradead.org
Subject: Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
Date: Mon, 23 May 2022 10:42:45 +0200
Message-ID: <96a3315f-43a4-efe6-1f37-0552d66dbd85@suse.de>
In-Reply-To: <7827d599-7714-3947-ee24-e343e90eee6e@grimberg.me>

On 5/20/22 11:05, Sagi Grimberg wrote:
> The patch title does not explain what the patch does, or what it
> fixes.
> 
>> When running on slow links, requests might take some time to be
>> processed, and as we always allow requests to be queued, the
>> timeout may trigger while the requests are still queued.
>> E.g. sending 128M requests over 30 queues over a 1GigE link will
>> inevitably time out before the last request can be sent.
>> So reset the timeout if the request is still being queued
>> or if it's in the process of being sent.
> 
> Maybe I'm missing something... But you are overloading so much that
> you time out even before a command is sent out. That still does not
> change the fact that the timeout expired. Why is resetting the timer,
> without taking any action, acceptable in this case?
> 
> Is this solving a bug? The fact that you get timeouts in your test
> is somewhat expected, isn't it?
> 

Yes, and no.
We happily let requests sit in the (blk-layer) queue for basically any 
amount of time.
And it's a design decision within the driver _when_ to start the timer.
My point is that starting the timer and _then_ doing internal queueing
is questionable; we might as well have returned BLK_STS_AGAIN (or
something similar) when we found that we cannot send requests right now.
Or we might have started the timer only once the request is actually
being sent to the HW.
So returning a timeout in one case but not the other is somewhat erratic.
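
Something like the following sketch is what I have in mind for the
first option (purely illustrative: 'nr_queued' and NVME_TCP_MAX_QUEUED
are made-up names, and the body is heavily simplified compared to the
real nvme_tcp_queue_rq(); declarations come from <linux/blk-mq.h>):

static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct nvme_tcp_queue *queue = hctx->driver_data;
	struct request *rq = bd->rq;

	/*
	 * If the internal send queue is already backed up, tell the
	 * block layer to retry later (BLK_STS_RESOURCE causes a
	 * requeue) rather than arming the timer on a request we
	 * cannot send anytime soon.
	 */
	if (atomic_read(&queue->nr_queued) >= NVME_TCP_MAX_QUEUED)
		return BLK_STS_RESOURCE;

	blk_mq_start_request(rq);	/* timeout clock starts here */
	nvme_tcp_queue_request(blk_mq_rq_to_pdu(rq), true, bd->last);
	return BLK_STS_OK;
}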

I would argue that we should only start the timer once requests have had
a chance to be sent to the HW; while a request is still within the
driver one has a hard time arguing why timeouts apply on one level but
not on the other, especially as both levels do exactly the same thing
(to wit: queue commands until they can be sent).
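
For the second option the change could be as small as arming the timer
in the send path instead of in queue_rq(); simplified sketch (locking
and the req_list handling from the real nvme_tcp_fetch_request() are
omitted):

static struct nvme_tcp_request *
nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
{
	struct nvme_tcp_request *req;

	req = list_first_entry_or_null(&queue->send_list,
			struct nvme_tcp_request, entry);
	if (req) {
		list_del(&req->entry);
		/*
		 * Sketch: start the block-layer timeout only now,
		 * when the request leaves the internal queue and is
		 * about to hit the wire.
		 */
		blk_mq_start_request(blk_mq_rq_from_pdu(req));
	}
	return req;
}

Teardown would then have to fail back unstarted requests differently,
but a timeout would actually mean "sent, and no answer".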

I'm open to discussing what we should be doing when the request is in
the process of being sent. But when it never had a chance to be sent
and we merely overloaded our internal queueing, we shouldn't be
signalling timeouts.
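
FWIW, the shape of this patch in the timeout handler is roughly the
following (nvme_tcp_req_unsent() is a made-up predicate standing in
for "still on the send_list or currently being sent"; error recovery
is elided):

static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq,
		bool reserved)
{
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

	/*
	 * The request never made it onto the wire; rearm the timer
	 * instead of escalating to error recovery.
	 */
	if (nvme_tcp_req_unsent(req))
		return BLK_EH_RESET_TIMER;

	/* Simplified: the real handler kicks off error recovery. */
	return BLK_EH_DONE;
}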

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


Thread overview: 24+ messages
2022-05-19  6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke
2022-05-19  6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke
2022-05-20  9:05   ` Sagi Grimberg
2022-05-23  8:42     ` Hannes Reinecke [this message]
2022-05-23 13:36       ` Sagi Grimberg
2022-05-23 14:01         ` Hannes Reinecke
2022-05-23 15:05           ` Sagi Grimberg
2022-05-23 16:07             ` Hannes Reinecke
2022-05-24  7:57               ` Sagi Grimberg
2022-05-24  8:08                 ` Hannes Reinecke
2022-05-24  8:53                   ` Sagi Grimberg
2022-05-24  9:34                     ` Hannes Reinecke
2022-05-24  9:58                       ` Sagi Grimberg
2022-05-19  6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke
2022-05-20  9:17   ` Sagi Grimberg
2022-05-20 10:05     ` Hannes Reinecke
2022-05-21 20:01       ` Sagi Grimberg
2022-05-19  6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke
2022-05-20  9:19   ` Sagi Grimberg
2022-05-20  9:59     ` Hannes Reinecke
2022-05-21 20:02       ` Sagi Grimberg
2022-05-20  9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg
2022-05-20 10:01   ` Hannes Reinecke
2022-05-21 20:03     ` Sagi Grimberg
