From: Sagi Grimberg <sagi@grimberg.me>
To: Hannes Reinecke <hare@suse.de>, Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>, linux-nvme@lists.infradead.org
Subject: Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
Date: Mon, 23 May 2022 18:05:45 +0300
Message-ID: <7ec792e3-5110-2272-b6fe-1a976c8c054f@grimberg.me>
In-Reply-To: <e749bb3e-2ff7-23f5-9f5f-974c986845b6@suse.de>


>>>> The patch title does not explain what the patch does, or what it
>>>> fixes.
>>>>
>>>>> When running on slow links requests might take some time
>>>>> to be processed, and as we always allow requests to be queued,
>>>>> a timeout may trigger while the requests are still queued.
>>>>> E.g. sending 128M requests over 30 queues over a 1GigE link
>>>>> will inevitably time out before the last request could be sent.
>>>>> So reset the timeout if the request is still being queued
>>>>> or if it's in the process of being sent.
>>>>
>>>> Maybe I'm missing something... But you are overloading so much that you
>>>> time out even before a command is sent out. That still does not change
>>>> the fact that the timeout expired. Why is resetting the timer without
>>>> taking any action the acceptable response in this case?
>>>>
>>>> Is this solving a bug? The fact that you get timeouts in your test
>>>> is somewhat expected isn't it?
>>>>
>>>
>>> Yes, and no.
>>> We happily let requests sit in the (blk-layer) queue for basically 
>>> any amount of time.
>>> And it's a design decision within the driver _when_ to start the timer.
>>
>> Is it? Isn't it supposed to start when the request is queued?
>>
> Queued where?

When .queue_rq() is called; by the time it returns, the request should
be queued.
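
For reference, the shape of the path we're discussing is roughly this
(a simplified sketch, not the verbatim driver code):

static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
				       const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	/* ... command/PDU setup ... */

	blk_mq_start_request(rq);	/* the request timer is armed here */

	/*
	 * The request is only appended to an internal send list here;
	 * the actual transmission happens later, from the io_work
	 * context.
	 */
	nvme_tcp_queue_request(blk_mq_rq_to_pdu(rq), true, bd->last);

	return BLK_STS_OK;
}

So by the time .queue_rq() returns, the timer is already running,
regardless of whether io_work got a chance to push anything onto the
socket.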

>>> My point is that starting the timer and _then_ doing internal queuing is
>>> questionable; we might have returned BLK_STS_AGAIN (or something) 
>>> when we found that we cannot send requests right now.
>>> Or we might have started the timer only when the request is being 
>>> sent to the HW.
>>
>> It is not sent to the HW, it is sent down the TCP stack. But it is not
>> any different from posting the request to a hw queue on a pci/rdma/fc
>> device. The device has some context that processes the queue and sends
>> it to the wire; in nvme-tcp that context is io_work.
>>
>>> So returning a timeout in one case but not the other is somewhat 
>>> erratic.
>>
>> How is that different from posting a work request to an rdma nic on
>> a congested network? An imaginary 1Gb rdma nic :)
>>
>> Or maybe let's ask it differently: what happens if you run this test
>> on the same nic, but with a soft-iwarp/soft-roce interface on top of it?
>>
> I can't really tell, as I haven't tried.
> Can give it a go, though.

You can, but the same thing will happen. The only difference is that
soft-iwarp is a TCP ULP that does not expose the state of the socket
to the nvme transport (rdma), nor any congestion-related attributes.

>>> I would argue that we should only start the timer when requests have 
>>> had a chance to be sent to the HW; when it's still within the driver 
>>> one has a hard time arguing why timeouts do apply on one level but 
>>> not on the other, especially as both levels do exactly the same (to
>>> wit: queue commands until they can be sent).
>>
>> I look at this differently. The way I see it, nvme-tcp is exactly
>> like nvme-rdma/nvme-fc, but it also implements, in software, the
>> context that executes the command. So in my mind, this is mixing
>> different layers.
>>
> Hmm. Yes, of course one could take this stance.
> Especially given the NVMe-oF notion of 'transport'.
> 
> Sadly it's hard to reproduce this with other transports, as they 
> inevitably only run on HW fast enough to not directly exhibit this 
> problem (FC is now on 8G min, and IB probably on at least 10G).

Use soft-iwarp/soft-roce and it should do the exact same thing
on a 1Gb nic.
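
Something along these lines should be enough to try it (illustrative
commands from the iproute2 rdma tool; eth0 is just a placeholder for
the 1Gb interface):

  # soft-RoCE on top of the existing NIC
  rdma link add rxe0 type rxe netdev eth0
  # or soft-iWARP
  rdma link add siw0 type siw netdev eth0

and then connect with -t rdma as usual.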

> The issue arises when running a fio test with a variable size
> (4M - 128M), which works on other transports like FC.
> For TCP we're running into said timeouts, but adding things like
> blk-cgroup or rq-qos makes the issue go away.
>
> So the natural question would be why we need a traffic shaper on TCP,
> but not on FC.

If I were to take an old 10G rdma nic (or even a new 25G one), stick it
into a 128-core machine, create 128 queues of depth 1024, and run a
workload with bs=4M jobs=128 qd=1024, you would also see timeouts.

For every setup there exists a workload that will overwhelm it...
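
Spelled out as a fio job, the kind of thing I mean (a hypothetical
sketch; the numbers are illustrative and /dev/nvme0n1 is a placeholder):

  [overwhelm]
  ioengine=libaio
  direct=1
  rw=write
  bs=4M
  numjobs=128
  iodepth=1024
  filename=/dev/nvme0n1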

>>> I'm open to discussing what we should be doing when the request is in
>>> the process of being sent. But when it didn't have a chance to be
>>> sent and we just overloaded our internal queuing, we shouldn't be
>>> sending timeouts.
>>
>> As mentioned above, what happens if that same reporter opens another bug
>> that the same phenomenon happens with soft-iwarp? What would you tell
>> him/her?
> 
> Nope. It's a HW appliance. Not a chance to change that.

It was just a theoretical question.

Do note that I'm not against solving a problem for anyone; I'm just
questioning whether making the io_timeout effectively unbounded whenever
the network is congested is the right solution for everyone, rather than
a particular case that can easily be solved with a udev rule that sets
the io_timeout as high as needed.
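
For completeness, the kind of udev rule I have in mind (an illustrative
sketch; the rule file name and the 600s value are made up, and
queue/io_timeout is in milliseconds):

  # /etc/udev/rules.d/71-nvme-io-timeout.rules
  ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", \
    KERNEL=="nvme*", ATTR{queue/io_timeout}="600000"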

One can argue that this patchset makes nvme-tcp basically ignore the
device io_timeout in certain cases.


