Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load

From: Hannes Reinecke <hare@suse.de>
To: Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>, linux-nvme@lists.infradead.org
Subject: Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
Date: Tue, 24 May 2022 10:08:50 +0200	[thread overview]
Message-ID: <76475e4f-13c7-2e0c-8584-f46918f5cefa@suse.de> (raw)
In-Reply-To: <02805f44-6f2d-b12e-c224-d44616332d5a@grimberg.me>

On 5/24/22 09:57, Sagi Grimberg wrote:
> 
>>>>>> I'm open to discussion what we should be doing when the request is 
>>>>>> in the process of being sent. But when it didn't have a chance to 
>>>>>> be sent and we just overloaded our internal queuing we shouldn't 
>>>>>> be sending timeouts.
>>>>>
>>>>> As mentioned above, what happens if that same reporter opens 
>>>>> another bug
>>>>> that the same phenomenon happens with soft-iwarp? What would you tell
>>>>> him/her?
>>>>
>>>> Nope. It's a HW appliance. Not a chance to change that.
>>>
>>> It was just a theoretical question.
>>>
>>> Do note that I'm not against solving a problem for anyone, I'm just
>>> questioning if increasing the io_timeout to be unbound in case the
>>> network is congested, is the right solution for everyone instead of
>>> a particular case that can easily be solved with udev to make the
>>> io_timeout to be as high as needed.
>>>
>>> One can argue that this patchset is making nvme-tcp to basically
>>> ignore the device io_timeout in certain cases.
>>
>> Oh, yes, sure, that will happen.
>> What I'm actually arguing is the imprecise difference between 
>> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq()
>> and command timeouts in case of resource constraints on the driver 
>> implementing ->queue_rq().
>>
>> If there is a resource constrain driver is free to return 
>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept 
>> the request (in which case there will be a timeout).
> 
> There is no resource constraint. The driver sizes up the resources
> to be able to queue all the requests it is getting.
> 
>> I could live with a timeout if that would just result in the command 
>> being retried. But in the case of nvme it results in a connection 
>> reset to boot, making customers really nervous that their system is 
>> broken.
> 
> But how does the driver know that it is running in this environment that
> is completely congested? What I'm saying is that this is a specific use
> case that the solution can have negative side-effects for other common
> use-cases, because it is beyond the scope of the driver to handle.
> 
> We can also trigger this condition with nvme-rdma.
> 
> We could stay with this patch, but I'd argue that this might be the
> wrong thing to do in certain use-cases.
> 
Right, okay.

Arguably this is a workload corner case, and we might not want to fix 
this in the driver.

_However_: do we need to do a controller reset in this case?
Shouldn't it be sufficient to just complete the command w/ timeout error 
and be done with it?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman