From: Sagi Grimberg <sagi@grimberg.me>
To: Hannes Reinecke <hare@suse.de>, Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>, linux-nvme@lists.infradead.org
Subject: Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
Date: Mon, 23 May 2022 18:05:45 +0300
Message-ID: <7ec792e3-5110-2272-b6fe-1a976c8c054f@grimberg.me>
In-Reply-To: <e749bb3e-2ff7-23f5-9f5f-974c986845b6@suse.de>


>>>> The patch title does not explain what the patch does, or what it
>>>> fixes.
>>>>
>>>>> When running on slow links requests might take some time
>>>>> to be processed, and as we always allow requests to be queued,
>>>>> a timeout may trigger while the requests are still queued.
>>>>> E.g. sending 128M requests over 30 queues over a 1GigE link
>>>>> will inevitably time out before the last request could be sent.
>>>>> So reset the timeout if the request is still being queued
>>>>> or if it's in the process of being sent.
>>>>
>>>> Maybe I'm missing something... But you are overloading so much that you
>>>> time out even before a command is sent out. That still does not change
>>>> the fact that the timeout expired. Why is resetting the timer without
>>>> taking any action the acceptable response in this case?
>>>>
>>>> Is this solving a bug? The fact that you get timeouts in your test
>>>> is somewhat expected isn't it?
>>>>
>>>
>>> Yes, and no.
>>> We happily let requests sit in the (blk-layer) queue for basically 
>>> any amount of time.
>>> And it's a design decision within the driver _when_ to start the timer.
>>
>> Is it? Isn't it supposed to start when the request is queued?
>>
> Queued where?

When .queue_rq() is called; by the time it returns, the request should
be queued.
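
For reference, the shape of the path we're discussing is roughly this
(a simplified sketch, not the verbatim driver code):

static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
				       const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	/* ... command/PDU setup ... */

	blk_mq_start_request(rq);	/* the request timer is armed here */

	/*
	 * The request is only appended to an internal send list here;
	 * the actual transmission happens later, from the io_work
	 * context.
	 */
	nvme_tcp_queue_request(blk_mq_rq_to_pdu(rq), true, bd->last);

	return BLK_STS_OK;
}

So by the time .queue_rq() returns, the timer is already running,
regardless of whether io_work got a chance to push anything onto the
socket.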

>>> My point is that starting the timer and _then_ doing internal queuing is
>>> questionable; we might have returned BLK_STS_AGAIN (or something) 
>>> when we found that we cannot send requests right now.
>>> Or we might have started the timer only when the request is being 
>>> sent to the HW.
>>
>> It is not sent to the HW, it is sent down the TCP stack. But it is not
>> any different from posting the request to a hw queue on a pci/rdma/fc
>> device. The device has some context that processes the queue and sends
>> it to the wire; in nvme-tcp that context is io_work.
>>
>>> So returning a timeout in one case but not the other is somewhat 
>>> erratic.
>>
>> How is that different from posting a work request to an rdma nic on
>> a congested network? An imaginary 1Gb rdma nic :)
>>
>> Or maybe let's ask it differently: what happens if you run this test
>> on the same nic, but with a soft-iwarp/soft-roce interface on top of it?
>>
> I can't really tell, as I haven't tried.
> Can give it a go, though.

You can, but the same thing will happen. The only difference is that
soft-iwarp is a TCP ULP that does not expose the state of the socket
to the nvme transport (rdma), nor any congestion-related attributes.

>>> I would argue that we should only start the timer when requests have 
>>> had a chance to be sent to the HW; when it's still within the driver 
>>> one has a hard time arguing why timeouts do apply on one level but 
>>> not on the other, especially as both levels do exactly the same (to
>>> wit: queue commands until they can be sent).
>>
>> I look at this differently. The way I see it, nvme-tcp is exactly
>> like nvme-rdma/nvme-fc, but it also implements, in software, the
>> context that executes the command. So in my mind, this is mixing
>> different layers.
>>
> Hmm. Yes, of course one could take this stance.
> Especially given the NVMe-oF notion of 'transport'.
> 
> Sadly it's hard to reproduce this with other transports, as they 
> inevitably only run on HW fast enough to not directly exhibit this 
> problem (FC is now on 8G min, and IB probably on at least 10G).

Use soft-iwarp/soft-roce and it should do the exact same thing
on a 1Gb nic.
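
Something along these lines should be enough to try it (illustrative
commands from the iproute2 rdma tool; eth0 is just a placeholder for
the 1Gb interface):

  # soft-RoCE on top of the existing NIC
  rdma link add rxe0 type rxe netdev eth0
  # or soft-iWARP
  rdma link add siw0 type siw netdev eth0

and then connect with -t rdma as usual.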

> The issue arises when running a fio test with a variable size
> (4M - 128M), which works on other transports like FC.
> For TCP we're running into said timeouts, but adding things like
> blk-cgroup or rq-qos makes the issue go away.
>
> So the natural question would be why we need a traffic shaper on TCP,
> but not on FC.

If I were to take an old 10G rdma nic (or even a new 25G one), stick it
into a 128-core machine, create 128 queues of depth 1024, and run a
workload with bs=4M jobs=128 qd=1024, you would also see timeouts.

For every setup there exists a workload that will overwhelm it...
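
Spelled out as a fio job, the kind of thing I mean (a hypothetical
sketch; the numbers are illustrative and /dev/nvme0n1 is a placeholder):

  [overwhelm]
  ioengine=libaio
  direct=1
  rw=write
  bs=4M
  numjobs=128
  iodepth=1024
  filename=/dev/nvme0n1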

>>> I'm open to discussing what we should be doing when the request is in
>>> the process of being sent. But when it didn't have a chance to be
>>> sent and we just overloaded our internal queuing, we shouldn't be
>>> sending timeouts.
>>
>> As mentioned above, what happens if that same reporter opens another bug
>> that the same phenomenon happens with soft-iwarp? What would you tell
>> him/her?
> 
> Nope. It's a HW appliance. Not a chance to change that.

It was just a theoretical question.

Do note that I'm not against solving a problem for anyone; I'm just
questioning whether making the io_timeout effectively unbounded whenever
the network is congested is the right solution for everyone, rather than
a particular case that can easily be solved with a udev rule that sets
the io_timeout as high as needed.
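
For completeness, the kind of udev rule I have in mind (an illustrative
sketch; the rule file name and the 600s value are made up, and
queue/io_timeout is in milliseconds):

  # /etc/udev/rules.d/71-nvme-io-timeout.rules
  ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", \
    KERNEL=="nvme*", ATTR{queue/io_timeout}="600000"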

One can argue that this patchset makes nvme-tcp basically ignore the
device io_timeout in certain cases.


