* [PATCH 0/3] nvme-tcp: queue stalls under high load @ 2022-05-19 6:26 Hannes Reinecke 2022-05-19 6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke ` (3 more replies) 0 siblings, 4 replies; 24+ messages in thread From: Hannes Reinecke @ 2022-05-19 6:26 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke Hi all, one of our partners registered queue stalls and I/O timeouts under high load. Analysis revealed that we see extremely 'choppy' I/O behaviour when running large transfers on systems with low-performance links (eg 1GigE networks). We had a system with 30 queues trying to transfer 128M requests; a simple calculation shows that transferring a _single_ request on all queues will take up to 38 seconds, thereby timing out the last request before it got sent. As a solution I first fixed up the timeout handler to reset the timeout if the request is still queued or in the process of being sent. The second patch modifies the send path to only allow new requests if we have enough space on the TX queue, and the final patch breaks up the send loop to avoid system stalls when sending large requests. As usual, comments and reviews are welcome. Hannes Reinecke (3): nvme-tcp: spurious I/O timeout under high load nvme-tcp: Check for write space before queueing requests nvme-tcp: send quota for nvme_tcp_send_all() drivers/nvme/host/tcp.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) -- 2.29.2 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-19 6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke @ 2022-05-19 6:26 ` Hannes Reinecke 2022-05-20 9:05 ` Sagi Grimberg 2022-05-19 6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke ` (2 subsequent siblings) 3 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-19 6:26 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke When running on slow links requests might take some time to be processed, and as we always allow queueing requests a timeout may trigger while the requests are still queued. E.g. sending 128M requests over 30 queues over a 1GigE link will inevitably time out before the last request could be sent. So reset the timeout if the request is still being queued or if it's in the process of being sent. Signed-off-by: Hannes Reinecke <hare@suse.de> --- drivers/nvme/host/tcp.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index bb67538d241b..ede76a0719a0 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -2332,6 +2332,13 @@ nvme_tcp_timeout(struct request *rq, bool reserved) "queue %d: timeout request %#x type %d\n", nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type); + if (!list_empty(&req->entry) || req->queue->request == req) { + dev_warn(ctrl->device, + "queue %d: queue stall, resetting timeout\n", + nvme_tcp_queue_id(req->queue)); + return BLK_EH_RESET_TIMER; + } + if (ctrl->state != NVME_CTRL_LIVE) { /* * If we are resetting, connecting or deleting we should -- 2.29.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-19 6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke @ 2022-05-20 9:05 ` Sagi Grimberg 2022-05-23 8:42 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-20 9:05 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme The patch title does not explain what the patch does, or what it fixes. > When running on slow links requests might take some time > for be processed, and as we always allow to queue requests > timeout may trigger when the requests are still queued. > Eg sending 128M requests over 30 queues over a 1GigE link > will inevitably timeout before the last request could be sent. > So reset the timeout if the request is still being queued > or if it's in the process of being sent. Maybe I'm missing something... But you are overloading so much that you timeout even before a command is sent out. That still does not change the fact that the timeout expired. Why is resetting the timer without taking any action the acceptable action in this case? Is this solving a bug? The fact that you get timeouts in your test is somewhat expected isn't it? 
> > Signed-off-by: Hannes Reinecke <hare@suse.de> > --- > drivers/nvme/host/tcp.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index bb67538d241b..ede76a0719a0 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -2332,6 +2332,13 @@ nvme_tcp_timeout(struct request *rq, bool reserved) > "queue %d: timeout request %#x type %d\n", > nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type); > > + if (!list_empty(&req->entry) || req->queue->request == req) { > + dev_warn(ctrl->device, > + "queue %d: queue stall, resetting timeout\n", > + nvme_tcp_queue_id(req->queue)); > + return BLK_EH_RESET_TIMER; > + } > + > if (ctrl->state != NVME_CTRL_LIVE) { > /* > * If we are resetting, connecting or deleting we should ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-20 9:05 ` Sagi Grimberg @ 2022-05-23 8:42 ` Hannes Reinecke 2022-05-23 13:36 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-23 8:42 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/20/22 11:05, Sagi Grimberg wrote: > The patch title does not explain what the patch does, or what it > fixes. > >> When running on slow links requests might take some time >> for be processed, and as we always allow to queue requests >> timeout may trigger when the requests are still queued. >> Eg sending 128M requests over 30 queues over a 1GigE link >> will inevitably timeout before the last request could be sent. >> So reset the timeout if the request is still being queued >> or if it's in the process of being sent. > > Maybe I'm missing something... But you are overloading so much that you > timeout even before a command is sent out. That still does not change > the fact that the timeout expired. Why is resetting the timer without > taking any action the acceptable action in this case? > > Is this solving a bug? The fact that you get timeouts in your test > is somewhat expected isn't it? > > Yes, and no. We happily let requests sit in the (blk-layer) queue for basically any amount of time. And it's a design decision within the driver _when_ to start the timer. My point is that starting the timer and _then_ doing internal queuing is questionable; we might have returned BLK_STS_AGAIN (or something) when we found that we cannot send requests right now. Or we might have started the timer only when the request is being sent to the HW. So returning a timeout in one case but not the other is somewhat erratic. 
I would argue that we should only start the timer when requests have had a chance to be sent to the HW; when it's still within the driver one has a hard time arguing why timeouts do apply on one level but not on the other, especially as both levels do exactly the same (to wit: queue commands until they can be sent). I'm open to discussion what we should be doing when the request is in the process of being sent. But when it didn't have a chance to be sent and we just overloaded our internal queuing we shouldn't be sending timeouts. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-23 8:42 ` Hannes Reinecke @ 2022-05-23 13:36 ` Sagi Grimberg 2022-05-23 14:01 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-23 13:36 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >> The patch title does not explain what the patch does, or what it >> fixes. >> >>> When running on slow links requests might take some time >>> for be processed, and as we always allow to queue requests >>> timeout may trigger when the requests are still queued. >>> Eg sending 128M requests over 30 queues over a 1GigE link >>> will inevitably timeout before the last request could be sent. >>> So reset the timeout if the request is still being queued >>> or if it's in the process of being sent. >> >> Maybe I'm missing something... But you are overloading so much that you >> timeout even before a command is sent out. That still does not change >> the fact that the timeout expired. Why is resetting the timer without >> taking any action the acceptable action in this case? >> >> Is this solving a bug? The fact that you get timeouts in your test >> is somewhat expected isn't it? >> > > Yes, and no. > We happily let requests sit in the (blk-layer) queue for basically any > amount of time. > And it's a design decision within the driver _when_ to start the timer. Is it? isn't it supposed to start when the request is queued? > My point is that starting the timer and _then_ do internal queuing is > questionable; we might have returned BLK_STS_AGAIN (or something) when > we found that we cannot send requests right now. > Or we might have started the timer only when the request is being sent > to the HW. It is not sent to the HW, it is sent down the TCP stack. But it is not any different than posting the request to a hw queue on a pci/rdma/fc device. 
The device has some context that processes the queue and sends it to the wire, in nvme-tcp that context is io_work. > So returning a timeout in one case but not the other is somewhat erratic. What is the difference from posting a work request to an rdma nic on a congested network? an imaginary 1Gb rdma nic :) Or maybe let's ask it differently, what happens if you run this test on the same nic, but with soft-iwarp/soft-roce interface on top of it? > I would argue that we should only start the timer when requests have had > a chance to be sent to the HW; when it's still within the driver one has > a hard time arguing why timeouts do apply on one level but not on the > other, especially as both levels to exactly the same (to wit: queue > commands until they can be sent). I look at this differently, the way I see it, is that nvme-tcp is exactly like nvme-rdma/nvme-fc but also implements the context executing the command, in software. So in my mind, it is mixing different layers. > I'm open to discussion what we should be doing when the request is in > the process of being sent. But when it didn't have a chance to be sent > and we just overloaded our internal queuing we shouldn't be sending > timeouts. As mentioned above, what happens if that same reporter opens another bug reporting that the same phenomenon happens with soft-iwarp? What would you tell him/her? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-23 13:36 ` Sagi Grimberg @ 2022-05-23 14:01 ` Hannes Reinecke 2022-05-23 15:05 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-23 14:01 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/23/22 15:36, Sagi Grimberg wrote: > >>> The patch title does not explain what the patch does, or what it >>> fixes. >>> >>>> When running on slow links requests might take some time >>>> for be processed, and as we always allow to queue requests >>>> timeout may trigger when the requests are still queued. >>>> Eg sending 128M requests over 30 queues over a 1GigE link >>>> will inevitably timeout before the last request could be sent. >>>> So reset the timeout if the request is still being queued >>>> or if it's in the process of being sent. >>> >>> Maybe I'm missing something... But you are overloading so much that you >>> timeout even before a command is sent out. That still does not change >>> the fact that the timeout expired. Why is resetting the timer without >>> taking any action the acceptable action in this case? >>> >>> Is this solving a bug? The fact that you get timeouts in your test >>> is somewhat expected isn't it? >>> >> >> Yes, and no. >> We happily let requests sit in the (blk-layer) queue for basically any >> amount of time. >> And it's a design decision within the driver _when_ to start the timer. > > Is it? isn't it supposed to start when the request is queued? > Queued where? >> My point is that starting the timer and _then_ do internal queuing is >> questionable; we might have returned BLK_STS_AGAIN (or something) when >> we found that we cannot send requests right now. >> Or we might have started the timer only when the request is being sent >> to the HW. > > It is not sent to the HW, it is sent down the TCP stack. 
But it is not > any different than posting the request to a hw queue on a pci/rdma/fc > device. The device has some context that process the queue and sends > it to the wire, in nvme-tcp that context is io_work. > >> So returning a timeout in one case but not the other is somewhat erratic. > > What is the difference than posting a work request to an rdma nic on > a congested network? an imaginary 1Gb rdma nic :) > > Or maybe lets ask it differently, what happens if you run this test > on the same nic, but with soft-iwarp/soft-roce interface on top of it? > I can't really tell, as I haven't tried. > Can give it a go, though. >> I would argue that we should only start the timer when requests have >> had a chance to be sent to the HW; when it's still within the driver >> one has a hard time arguing why timeouts do apply on one level but not >> on the other, especially as both levels to exactly the same (to wit: >> queue commands until they can be sent). > > I look at this differently, the way I see it, is that nvme-tcp is > exactly like nvme-rdma/nvme-fc but also implements context executing the > command, in software. So in my mind, it is mixing different layers. > Hmm. Yes, of course one could take this stance. Especially given the NVMe-oF notion of 'transport'. Sadly it's hard to reproduce this with other transports, as they inevitably only run on HW fast enough to not directly exhibit this problem (FC is now on 8G min, and IB probably on at least 10G). The issue arises when running a fio test with a variable size (4M - 128M), which works on other transports like FC. For TCP we're running into the said timeouts, but adding things like blk-cgroup or rq-qos makes the issue go away. So the question naturally would be why we need a traffic shaper on TCP, but not on FC. >> I'm open to discussion what we should be doing when the request is in >> the process of being sent. 
But when it didn't have a chance to be sent >> and we just overloaded our internal queuing we shouldn't be sending >> timeouts. > > As mentioned above, what happens if that same reporter opens another bug > that the same phenomenon happens with soft-iwarp? What would you tell > him/her? Nope. It's a HW appliance. Not a chance to change that. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-23 14:01 ` Hannes Reinecke @ 2022-05-23 15:05 ` Sagi Grimberg 2022-05-23 16:07 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-23 15:05 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >>>> The patch title does not explain what the patch does, or what it >>>> fixes. >>>> >>>>> When running on slow links requests might take some time >>>>> for be processed, and as we always allow to queue requests >>>>> timeout may trigger when the requests are still queued. >>>>> Eg sending 128M requests over 30 queues over a 1GigE link >>>>> will inevitably timeout before the last request could be sent. >>>>> So reset the timeout if the request is still being queued >>>>> or if it's in the process of being sent. >>>> >>>> Maybe I'm missing something... But you are overloading so much that you >>>> timeout even before a command is sent out. That still does not change >>>> the fact that the timeout expired. Why is resetting the timer without >>>> taking any action the acceptable action in this case? >>>> >>>> Is this solving a bug? The fact that you get timeouts in your test >>>> is somewhat expected isn't it? >>>> >>> >>> Yes, and no. >>> We happily let requests sit in the (blk-layer) queue for basically >>> any amount of time. >>> And it's a design decision within the driver _when_ to start the timer. >> >> Is it? isn't it supposed to start when the request is queued? >> > Queued where? When .queue_rq() is called, when it returns the request should be queued. >>> My point is that starting the timer and _then_ do internal queuing is >>> questionable; we might have returned BLK_STS_AGAIN (or something) >>> when we found that we cannot send requests right now. >>> Or we might have started the timer only when the request is being >>> sent to the HW. >> >> It is not sent to the HW, it is sent down the TCP stack. 
But it is not >> any different than posting the request to a hw queue on a pci/rdma/fc >> device. The device has some context that process the queue and sends >> it to the wire, in nvme-tcp that context is io_work. >> >>> So returning a timeout in one case but not the other is somewhat >>> erratic. >> >> What is the difference than posting a work request to an rdma nic on >> a congested network? an imaginary 1Gb rdma nic :) >> >> Or maybe lets ask it differently, what happens if you run this test >> on the same nic, but with soft-iwarp/soft-roce interface on top of it? >> > I can't really tell, as I haven't tried. > Can give it a go, though. You can, but the same thing will happen. The only difference is that soft-iwarp is a TCP ULP that does not expose the state of the socket to the nvme transport (rdma), nor any congestion related attributes. >>> I would argue that we should only start the timer when requests have >>> had a chance to be sent to the HW; when it's still within the driver >>> one has a hard time arguing why timeouts do apply on one level but >>> not on the other, especially as both levels to exactly the same (to >>> wit: queue commands until they can be sent). >> >> I look at this differently, the way I see it, is that nvme-tcp is >> exactly like nvme-rdma/nvme-fc but also implements context executing the >> command, in software. So in my mind, it is mixing different layers. >> > Hmm. Yes, of course one could take this stance. > Especially given the NVMe-oF notion of 'transport'. > > Sadly it's hard to reproduce this with other transports, as they > inevitably only run on HW fast enough to not directly exhibit this > problem (FC is now on 8G min, and IB probably on at least 10G). Use soft-iwarp/soft-roce and it should do the exact same thing on a 1Gb nic. > The issue arises when running a fio test with a variable size > (4M - 128M), which works on other transports like FC. 
> For TCP we're running into the said timeouts, but adding things like > blk-cgroup or rq-qos make the issue go away. > > So question naturally would be why we need a traffic shaper on TCP, but > not on FC. If I were to take an old 10G rdma nic (or even new 25G), stick it into a 128 core machine, create 128 queues of depth 1024, run workload with bs=4M jobs=128 qd=1024, you would also see timeouts. In every setup there exists a workload that will overwhelm it... >>> I'm open to discussion what we should be doing when the request is in >>> the process of being sent. But when it didn't have a chance to be >>> sent and we just overloaded our internal queuing we shouldn't be >>> sending timeouts. >> >> As mentioned above, what happens if that same reporter opens another bug >> that the same phenomenon happens with soft-iwarp? What would you tell >> him/her? > > Nope. It's a HW appliance. Not a chance to change that. It was just a theoretical question. Do note that I'm not against solving a problem for anyone, I'm just questioning whether making the io_timeout effectively unbounded when the network is congested is the right solution for everyone instead of a particular case that can easily be solved with udev to make the io_timeout as high as needed. One can argue that this patchset is making nvme-tcp basically ignore the device io_timeout in certain cases. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-23 15:05 ` Sagi Grimberg @ 2022-05-23 16:07 ` Hannes Reinecke 2022-05-24 7:57 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-23 16:07 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/23/22 17:05, Sagi Grimberg wrote: > [ .. ] >>>> I'm open to discussion what we should be doing when the request is >>>> in the process of being sent. But when it didn't have a chance to be >>>> sent and we just overloaded our internal queuing we shouldn't be >>>> sending timeouts. >>> >>> As mentioned above, what happens if that same reporter opens another bug >>> that the same phenomenon happens with soft-iwarp? What would you tell >>> him/her? >> >> Nope. It's a HW appliance. Not a chance to change that. > > It was just a theoretical question. > > Do note that I'm not against solving a problem for anyone, I'm just > questioning if increasing the io_timeout to be unbound in case the > network is congested, is the right solution for everyone instead of > a particular case that can easily be solved with udev to make the > io_timeout to be as high as needed. > > One can argue that this patchset is making nvme-tcp to basically > ignore the device io_timeout in certain cases. Oh, yes, sure, that will happen. What I'm actually arguing is the imprecise difference between BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() and command timeouts in case of resource constraints on the driver implementing ->queue_rq(). If there is a resource constraint the driver is free to return BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept the request (in which case there will be a timeout). I could live with a timeout if that would just result in the command being retried. 
And having a workload which can generate connection resets feels like a DoS attack to me; applications shouldn't be able to do that. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-23 16:07 ` Hannes Reinecke @ 2022-05-24 7:57 ` Sagi Grimberg 2022-05-24 8:08 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-24 7:57 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >>>>> I'm open to discussion what we should be doing when the request is >>>>> in the process of being sent. But when it didn't have a chance to >>>>> be sent and we just overloaded our internal queuing we shouldn't be >>>>> sending timeouts. >>>> >>>> As mentioned above, what happens if that same reporter opens another >>>> bug >>>> that the same phenomenon happens with soft-iwarp? What would you tell >>>> him/her? >>> >>> Nope. It's a HW appliance. Not a chance to change that. >> >> It was just a theoretical question. >> >> Do note that I'm not against solving a problem for anyone, I'm just >> questioning if increasing the io_timeout to be unbound in case the >> network is congested, is the right solution for everyone instead of >> a particular case that can easily be solved with udev to make the >> io_timeout to be as high as needed. >> >> One can argue that this patchset is making nvme-tcp to basically >> ignore the device io_timeout in certain cases. > > Oh, yes, sure, that will happen. > What I'm actually arguing is the imprecise difference between > BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() > and command timeouts in case of resource constraints on the driver > implementing ->queue_rq(). > > If there is a resource constrain driver is free to return > BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept > the request (in which case there will be a timeout). There is no resource constraint. The driver sizes up the resources to be able to queue all the requests it is getting. > I could live with a timeout if that would just result in the command > being retried. 
But in the case of nvme it results in a connection reset > to boot, making customers really nervous that their system is broken. But how does the driver know that it is running in this environment that is completely congested? What I'm saying is that this is a specific use case that the solution can have negative side-effects for other common use-cases, because it is beyond the scope of the driver to handle. We can also trigger this condition with nvme-rdma. We could stay with this patch, but I'd argue that this might be the wrong thing to do in certain use-cases. > And having a workload which can generate connection resets feels like a > DoS attack to me; applications shouldn't be able to do that. Having a workload occupying all the tags all the time is also a DDOS attack. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-24 7:57 ` Sagi Grimberg @ 2022-05-24 8:08 ` Hannes Reinecke 2022-05-24 8:53 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-24 8:08 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/24/22 09:57, Sagi Grimberg wrote: > >>>>>> I'm open to discussion what we should be doing when the request is >>>>>> in the process of being sent. But when it didn't have a chance to >>>>>> be sent and we just overloaded our internal queuing we shouldn't >>>>>> be sending timeouts. >>>>> >>>>> As mentioned above, what happens if that same reporter opens >>>>> another bug >>>>> that the same phenomenon happens with soft-iwarp? What would you tell >>>>> him/her? >>>> >>>> Nope. It's a HW appliance. Not a chance to change that. >>> >>> It was just a theoretical question. >>> >>> Do note that I'm not against solving a problem for anyone, I'm just >>> questioning if increasing the io_timeout to be unbound in case the >>> network is congested, is the right solution for everyone instead of >>> a particular case that can easily be solved with udev to make the >>> io_timeout to be as high as needed. >>> >>> One can argue that this patchset is making nvme-tcp to basically >>> ignore the device io_timeout in certain cases. >> >> Oh, yes, sure, that will happen. >> What I'm actually arguing is the imprecise difference between >> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() >> and command timeouts in case of resource constraints on the driver >> implementing ->queue_rq(). >> >> If there is a resource constrain driver is free to return >> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept >> the request (in which case there will be a timeout). > > There is no resource constraint. The driver sizes up the resources > to be able to queue all the requests it is getting. 
> >> I could live with a timeout if that would just result in the command >> being retried. But in the case of nvme it results in a connection >> reset to boot, making customers really nervous that their system is >> broken. > > But how does the driver know that it is running in this environment that > is completely congested? What I'm saying is that this is a specific use > case that the solution can have negative side-effects for other common > use-cases, because it is beyond the scope of the driver to handle. > > We can also trigger this condition with nvme-rdma. > > We could stay with this patch, but I'd argue that this might be the > wrong thing to do in certain use-cases. > Right, okay. Arguably this is a workload corner case, and we might not want to fix this in the driver. _However_: do we need to do a controller reset in this case? Shouldn't it be sufficient to just complete the command w/ timeout error and be done with it? Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-24 8:08 ` Hannes Reinecke @ 2022-05-24 8:53 ` Sagi Grimberg 2022-05-24 9:34 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-24 8:53 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >>>>>>> I'm open to discussion what we should be doing when the request >>>>>>> is in the process of being sent. But when it didn't have a chance >>>>>>> to be sent and we just overloaded our internal queuing we >>>>>>> shouldn't be sending timeouts. >>>>>> >>>>>> As mentioned above, what happens if that same reporter opens >>>>>> another bug >>>>>> that the same phenomenon happens with soft-iwarp? What would you tell >>>>>> him/her? >>>>> >>>>> Nope. It's a HW appliance. Not a chance to change that. >>>> >>>> It was just a theoretical question. >>>> >>>> Do note that I'm not against solving a problem for anyone, I'm just >>>> questioning if increasing the io_timeout to be unbound in case the >>>> network is congested, is the right solution for everyone instead of >>>> a particular case that can easily be solved with udev to make the >>>> io_timeout to be as high as needed. >>>> >>>> One can argue that this patchset is making nvme-tcp to basically >>>> ignore the device io_timeout in certain cases. >>> >>> Oh, yes, sure, that will happen. >>> What I'm actually arguing is the imprecise difference between >>> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() >>> and command timeouts in case of resource constraints on the driver >>> implementing ->queue_rq(). >>> >>> If there is a resource constrain driver is free to return >>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept >>> the request (in which case there will be a timeout). >> >> There is no resource constraint. The driver sizes up the resources >> to be able to queue all the requests it is getting. 
>> >>> I could live with a timeout if that would just result in the command >>> being retried. But in the case of nvme it results in a connection >>> reset to boot, making customers really nervous that their system is >>> broken. >> >> But how does the driver know that it is running in this environment that >> is completely congested? What I'm saying is that this is a specific use >> case that the solution can have negative side-effects for other common >> use-cases, because it is beyond the scope of the driver to handle. >> >> We can also trigger this condition with nvme-rdma. >> >> We could stay with this patch, but I'd argue that this might be the >> wrong thing to do in certain use-cases. >> > Right, okay. > > Arguably this is a workload corner case, and we might not want to fix > this in the driver. > > _However_: do we need to do a controller reset in this case? > Shouldn't it be sufficient to just complete the command w/ timeout > error and be done with it? The question is what is special about this timeout vs. any other timeout? pci attempts to abort the command before triggering a controller reset. Maybe we should also? Although abort is not really reliable going on the admin queue... ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-24 8:53 ` Sagi Grimberg @ 2022-05-24 9:34 ` Hannes Reinecke 2022-05-24 9:58 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-24 9:34 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/24/22 10:53, Sagi Grimberg wrote: > >>>>>>>> I'm open to discussion what we should be doing when the request >>>>>>>> is in the process of being sent. But when it didn't have a >>>>>>>> chance to be sent and we just overloaded our internal queuing we >>>>>>>> shouldn't be sending timeouts. >>>>>>> >>>>>>> As mentioned above, what happens if that same reporter opens >>>>>>> another bug >>>>>>> that the same phenomenon happens with soft-iwarp? What would you >>>>>>> tell >>>>>>> him/her? >>>>>> >>>>>> Nope. It's a HW appliance. Not a chance to change that. >>>>> >>>>> It was just a theoretical question. >>>>> >>>>> Do note that I'm not against solving a problem for anyone, I'm just >>>>> questioning if increasing the io_timeout to be unbound in case the >>>>> network is congested, is the right solution for everyone instead of >>>>> a particular case that can easily be solved with udev to make the >>>>> io_timeout to be as high as needed. >>>>> >>>>> One can argue that this patchset is making nvme-tcp to basically >>>>> ignore the device io_timeout in certain cases. >>>> >>>> Oh, yes, sure, that will happen. >>>> What I'm actually arguing is the imprecise difference between >>>> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() >>>> and command timeouts in case of resource constraints on the driver >>>> implementing ->queue_rq(). >>>> >>>> If there is a resource constrain driver is free to return >>>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or >>>> accept the request (in which case there will be a timeout). >>> >>> There is no resource constraint. 
The driver sizes up the resources >>> to be able to queue all the requests it is getting. >>> >>>> I could live with a timeout if that would just result in the command >>>> being retried. But in the case of nvme it results in a connection >>>> reset to boot, making customers really nervous that their system is >>>> broken. >>> >>> But how does the driver know that it is running in this environment that >>> is completely congested? What I'm saying is that this is a specific use >>> case that the solution can have negative side-effects for other common >>> use-cases, because it is beyond the scope of the driver to handle. >>> >>> We can also trigger this condition with nvme-rdma. >>> >>> We could stay with this patch, but I'd argue that this might be the >>> wrong thing to do in certain use-cases. >>> >> Right, okay. >> >> Arguably this is a workload corner case, and we might not want to fix >> this in the driver. >> >> _However_: do we need to do a controller reset in this case? >> Shouldn't it be sufficient to just complete the command w/ timeout >> error and be done with it? > > The question is what is special about this timeout vs. any other > timeout? > > pci attempts to abort the command before triggering a controller > reset, Maybe we should also? although abort is not really reliable > going on the admin queue... I am not talking about NVMe abort. I'm talking about this: @@ -2335,6 +2340,11 @@ nvme_tcp_timeout(struct request *rq, bool reserved) "queue %d: timeout request %#x type %d\n", nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type); + if (!list_empty(&req->entry)) { + nvme_tcp_complete_timed_out(rq); + return BLK_EH_DONE; + } + if (ctrl->state != NVME_CTRL_LIVE) { /* * If we are resetting, connecting or deleting we should as the command is still in the queue and NVMe abort don't enter the picture at all. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 
5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-24 9:34 ` Hannes Reinecke @ 2022-05-24 9:58 ` Sagi Grimberg 0 siblings, 0 replies; 24+ messages in thread From: Sagi Grimberg @ 2022-05-24 9:58 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/24/22 12:34, Hannes Reinecke wrote: > On 5/24/22 10:53, Sagi Grimberg wrote: >> >>>>>>>>> I'm open to discussion what we should be doing when the request >>>>>>>>> is in the process of being sent. But when it didn't have a >>>>>>>>> chance to be sent and we just overloaded our internal queuing >>>>>>>>> we shouldn't be sending timeouts. >>>>>>>> >>>>>>>> As mentioned above, what happens if that same reporter opens >>>>>>>> another bug >>>>>>>> that the same phenomenon happens with soft-iwarp? What would you >>>>>>>> tell >>>>>>>> him/her? >>>>>>> >>>>>>> Nope. It's a HW appliance. Not a chance to change that. >>>>>> >>>>>> It was just a theoretical question. >>>>>> >>>>>> Do note that I'm not against solving a problem for anyone, I'm just >>>>>> questioning if increasing the io_timeout to be unbound in case the >>>>>> network is congested, is the right solution for everyone instead of >>>>>> a particular case that can easily be solved with udev to make the >>>>>> io_timeout to be as high as needed. >>>>>> >>>>>> One can argue that this patchset is making nvme-tcp to basically >>>>>> ignore the device io_timeout in certain cases. >>>>> >>>>> Oh, yes, sure, that will happen. >>>>> What I'm actually arguing is the imprecise difference between >>>>> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() >>>>> and command timeouts in case of resource constraints on the driver >>>>> implementing ->queue_rq(). >>>>> >>>>> If there is a resource constrain driver is free to return >>>>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or >>>>> accept the request (in which case there will be a timeout). 
>>>> >>>> There is no resource constraint. The driver sizes up the resources >>>> to be able to queue all the requests it is getting. >>>> >>>>> I could live with a timeout if that would just result in the >>>>> command being retried. But in the case of nvme it results in a >>>>> connection reset to boot, making customers really nervous that >>>>> their system is broken. >>>> >>>> But how does the driver know that it is running in this environment >>>> that >>>> is completely congested? What I'm saying is that this is a specific use >>>> case that the solution can have negative side-effects for other common >>>> use-cases, because it is beyond the scope of the driver to handle. >>>> >>>> We can also trigger this condition with nvme-rdma. >>>> >>>> We could stay with this patch, but I'd argue that this might be the >>>> wrong thing to do in certain use-cases. >>>> >>> Right, okay. >>> >>> Arguably this is a workload corner case, and we might not want to fix >>> this in the driver. >>> >>> _However_: do we need to do a controller reset in this case? >>> Shouldn't it be sufficient to just complete the command w/ timeout >>> error and be done with it? >> >> The question is what is special about this timeout vs. any other >> timeout? >> >> pci attempts to abort the command before triggering a controller >> reset, Maybe we should also? although abort is not really reliable >> going on the admin queue... > > I am not talking about NVMe abort. > I'm talking about this: > > @@ -2335,6 +2340,11 @@ nvme_tcp_timeout(struct request *rq, bool reserved) > "queue %d: timeout request %#x type %d\n", > nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type); > > + if (!list_empty(&req->entry)) { > + nvme_tcp_complete_timed_out(rq); > + return BLK_EH_DONE; > + } > + > if (ctrl->state != NVME_CTRL_LIVE) { > /* > * If we are resetting, connecting or deleting we should > > > as the command is still in the queue and NVMe abort don't enter the > picture at all. 
That for sure will not help because nvme_tcp_complete_timed_out stops the queue. What you can do is maybe just remove it from the pending list and complete it. ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 2/3] nvme-tcp: Check for write space before queueing requests 2022-05-19 6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke 2022-05-19 6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke @ 2022-05-19 6:26 ` Hannes Reinecke 2022-05-20 9:17 ` Sagi Grimberg 2022-05-19 6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke 2022-05-20 9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg 3 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-19 6:26 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke The current model of always queueing incoming requests leads to write stalls, as we can easily overload the network device under high I/O load. To avoid unlimited queueing we should instead check whether write space is available before accepting new requests. Signed-off-by: Hannes Reinecke <hare@suse.de> --- drivers/nvme/host/tcp.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index ede76a0719a0..606565a4c708 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -2464,6 +2464,9 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx, if (unlikely(ret)) return ret; + if (!sk_stream_is_writeable(queue->sock->sk)) + return BLK_STS_RESOURCE; + blk_mq_start_request(rq); nvme_tcp_queue_request(req, true, bd->last); -- 2.29.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] nvme-tcp: Check for write space before queueing requests 2022-05-19 6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke @ 2022-05-20 9:17 ` Sagi Grimberg 2022-05-20 10:05 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-20 9:17 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme > The current model of always queue incoming requests lead to > write stalls as we easily overload the network device under > high I/O load. > To avoid unlimited queueing we should rather check if write > space is available before accepting new requests. I'm somewhat on the fence with this one... On one end, we are checking the sock write space, but don't check the queued requests. And, this is purely advisory and not really a check we rely on. The merit of doing something like this is that we don't start the request timer, but we can just as easily queue the request and have it later queued for long due to sock being overloaded. Can you explain your thoughts to why this is a good solution? > Signed-off-by: Hannes Reinecke <hare@suse.de> > --- > drivers/nvme/host/tcp.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index ede76a0719a0..606565a4c708 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -2464,6 +2464,9 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx, > if (unlikely(ret)) > return ret; > > + if (!sk_stream_is_writeable(queue->sock->sk)) > + return BLK_STS_RESOURCE; > + > blk_mq_start_request(rq); > > nvme_tcp_queue_request(req, true, bd->last); ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] nvme-tcp: Check for write space before queueing requests 2022-05-20 9:17 ` Sagi Grimberg @ 2022-05-20 10:05 ` Hannes Reinecke 2022-05-21 20:01 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-20 10:05 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/20/22 11:17, Sagi Grimberg wrote: > >> The current model of always queue incoming requests lead to >> write stalls as we easily overload the network device under >> high I/O load. >> To avoid unlimited queueing we should rather check if write >> space is available before accepting new requests. > > I'm somewhat on the fence with this one... On one end, we > are checking the sock write space, but don't check the queued > requests. And, this is purely advisory and not really a check > we rely on. > > The merit of doing something like this is that we don't start > the request timer, but we can just as easily queue the request > and have it later queued for long due to sock being overloaded. > > Can you explain your thoughts to why this is a good solution? > Request timeouts. As soon as we call 'blk_mq_start_request()' the I/O timer is called, and given that we (currently) queue _every_ request irrespective of the underlying device status we might end up queueing for a _loooong_ time. Timeouts while still in the queue are being handled by the first patch, but the underlying network might also be busy with retries and whatnot. So again, queuing requests when we _know_ there'll be a congestion is just asking for trouble (or, rather, spurious I/O timeouts). If one is worried about performance one can always increase the wmem size :-), but really it means that either your testcase or your network is misdesigned. And I'm perfectly fine with increasing the latency in these cases. What I don't like is timeouts, as these will show up to the user and we get all the supportcalls telling us that the kernel is broken. 
Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] nvme-tcp: Check for write space before queueing requests 2022-05-20 10:05 ` Hannes Reinecke @ 2022-05-21 20:01 ` Sagi Grimberg 0 siblings, 0 replies; 24+ messages in thread From: Sagi Grimberg @ 2022-05-21 20:01 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >>> The current model of always queue incoming requests lead to >>> write stalls as we easily overload the network device under >>> high I/O load. >>> To avoid unlimited queueing we should rather check if write >>> space is available before accepting new requests. >> >> I'm somewhat on the fence with this one... On one end, we >> are checking the sock write space, but don't check the queued >> requests. And, this is purely advisory and not really a check >> we rely on. >> >> The merit of doing something like this is that we don't start >> the request timer, but we can just as easily queue the request >> and have it later queued for long due to sock being overloaded. >> >> Can you explain your thoughts to why this is a good solution? >> > Request timeouts. > As soon as we call 'blk_mq_start_request()' the I/O timer is called, and > given that we (currently) queue _every_ request irrespective of the > underlying device status we might end up queueing for a _loooong_ time. > > Timeouts while still in the queue are being handled by the first patch, > but the underlying network might also be busy with retries and whatnot. > So again, queuing requests when we _know_ there'll be a congestion is > just asking for trouble (or, rather, spurious I/O timeouts). > > If one is worried about performance one can always increase the wmem > size :-), but really it means that either your testcase or your network > is misdesigned. > And I'm perfectly fine with increasing the latency in these cases. > What I don't like is timeouts, as these will show up to the user and we > get all the supportcalls telling us that the kernel is broken. 
Can you run some sanity perf tests to understand if anything unexpected comes up from this? ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() 2022-05-19 6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke 2022-05-19 6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke 2022-05-19 6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke @ 2022-05-19 6:26 ` Hannes Reinecke 2022-05-20 9:19 ` Sagi Grimberg 2022-05-20 9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg 3 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-19 6:26 UTC (permalink / raw) To: Christoph Hellwig Cc: Sagi Grimberg, Keith Busch, linux-nvme, Daniel Wagner, Hannes Reinecke From: Daniel Wagner <dwagner@suse.de> Add a send quota in nvme_tcp_send_all() to avoid stalls when sending large amounts of requests. Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Hannes Reinecke <hare@suse.de> --- drivers/nvme/host/tcp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 606565a4c708..87d760dfa3a9 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req, static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue) { int ret; + unsigned long deadline = jiffies + msecs_to_jiffies(1); /* drain the send queue as much as we can... */ do { ret = nvme_tcp_try_send(queue); - } while (ret > 0); + } while (ret > 0 || !time_after(jiffies, deadline)); } static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue) -- 2.29.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() 2022-05-19 6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke @ 2022-05-20 9:19 ` Sagi Grimberg 2022-05-20 9:59 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-20 9:19 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Daniel Wagner > Add a send quota in nvme_tcp_send_all() to avoid stalls when sending > large amounts of requests. > > Signed-off-by: Daniel Wagner <dwagner@suse.de> > Signed-off-by: Hannes Reinecke <hare@suse.de> > --- > drivers/nvme/host/tcp.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index 606565a4c708..87d760dfa3a9 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req, > static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue) > { > int ret; > + unsigned long deadline = jiffies + msecs_to_jiffies(1); > > /* drain the send queue as much as we can... */ > do { > ret = nvme_tcp_try_send(queue); > - } while (ret > 0); > + } while (ret > 0 || !time_after(jiffies, deadline)); Umm, this will stay here a deadline period even if we don't have anything to send? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() 2022-05-20 9:19 ` Sagi Grimberg @ 2022-05-20 9:59 ` Hannes Reinecke 2022-05-21 20:02 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-20 9:59 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Daniel Wagner On 5/20/22 11:19, Sagi Grimberg wrote: > >> Add a send quota in nvme_tcp_send_all() to avoid stalls when sending >> large amounts of requests. >> >> Signed-off-by: Daniel Wagner <dwagner@suse.de> >> Signed-off-by: Hannes Reinecke <hare@suse.de> >> --- >> drivers/nvme/host/tcp.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c >> index 606565a4c708..87d760dfa3a9 100644 >> --- a/drivers/nvme/host/tcp.c >> +++ b/drivers/nvme/host/tcp.c >> @@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct >> nvme_tcp_request *req, >> static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue) >> { >> int ret; >> + unsigned long deadline = jiffies + msecs_to_jiffies(1); >> /* drain the send queue as much as we can... */ >> do { >> ret = nvme_tcp_try_send(queue); >> - } while (ret > 0); >> + } while (ret > 0 || !time_after(jiffies, deadline)); > > Umm, this will stay here a deadline period even if we don't have > anything to send? Ah. Yeah, maybe. We can change it to '&&', which should solve it. (I think) Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() 2022-05-20 9:59 ` Hannes Reinecke @ 2022-05-21 20:02 ` Sagi Grimberg 0 siblings, 0 replies; 24+ messages in thread From: Sagi Grimberg @ 2022-05-21 20:02 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Daniel Wagner >>> Add a send quota in nvme_tcp_send_all() to avoid stalls when sending >>> large amounts of requests. >>> >>> Signed-off-by: Daniel Wagner <dwagner@suse.de> >>> Signed-off-by: Hannes Reinecke <hare@suse.de> >>> --- >>> drivers/nvme/host/tcp.c | 3 ++- >>> 1 file changed, 2 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c >>> index 606565a4c708..87d760dfa3a9 100644 >>> --- a/drivers/nvme/host/tcp.c >>> +++ b/drivers/nvme/host/tcp.c >>> @@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct >>> nvme_tcp_request *req, >>> static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue) >>> { >>> int ret; >>> + unsigned long deadline = jiffies + msecs_to_jiffies(1); >>> /* drain the send queue as much as we can... */ >>> do { >>> ret = nvme_tcp_try_send(queue); >>> - } while (ret > 0); >>> + } while (ret > 0 || !time_after(jiffies, deadline)); >> >> Umm, this will stay here a deadline period even if we don't have >> anything to send? > > Ah. Yeah, maybe. We can change it to '&&', which should solve it. > (I think) See what is done in io_work() ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/3] nvme-tcp: queue stalls under high load 2022-05-19 6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke ` (2 preceding siblings ...) 2022-05-19 6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke @ 2022-05-20 9:20 ` Sagi Grimberg 2022-05-20 10:01 ` Hannes Reinecke 3 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-20 9:20 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme > Hi all, > > one of our partners registered queue stalls and I/O timeouts under > high load. Analysis revealed that we see an extremely 'choppy' I/O > behaviour when running large transfers on systems on low-performance > links (eg 1GigE networks). > We had a system with 30 queues trying to transfer 128M requests; simple > calculation shows that transferring a _single_ request on all queues > will take up to 38 seconds, thereby timing out the last request before > it got sent. > As a solution I first fixed up the timeout handler to reset the timeout > if the request is still queued or in the process of being send. The > second path modifies the send path to only allow for new requests if we > have enough space on the TX queue, and finally break up the send loop to > avoid system stalls when sending large request. What is the average latency you are seeing with this test? I'm guessing more than 30 seconds :) ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/3] nvme-tcp: queue stalls under high load 2022-05-20 9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg @ 2022-05-20 10:01 ` Hannes Reinecke 2022-05-21 20:03 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-20 10:01 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/20/22 11:20, Sagi Grimberg wrote: > >> Hi all, >> >> one of our partners registered queue stalls and I/O timeouts under >> high load. Analysis revealed that we see an extremely 'choppy' I/O >> behaviour when running large transfers on systems on low-performance >> links (eg 1GigE networks). >> We had a system with 30 queues trying to transfer 128M requests; simple >> calculation shows that transferring a _single_ request on all queues >> will take up to 38 seconds, thereby timing out the last request before >> it got sent. >> As a solution I first fixed up the timeout handler to reset the timeout >> if the request is still queued or in the process of being send. The >> second path modifies the send path to only allow for new requests if we >> have enough space on the TX queue, and finally break up the send loop to >> avoid system stalls when sending large request. > > What is the average latency you are seeing with this test? > I'm guessing more than 30 seconds :) Yes, of course. Simple maths, in the end. (Actually it's more as we're always triggering a reconnect cycle...) And telling the customer to change his testcase only helps _so_ much. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/3] nvme-tcp: queue stalls under high load 2022-05-20 10:01 ` Hannes Reinecke @ 2022-05-21 20:03 ` Sagi Grimberg 0 siblings, 0 replies; 24+ messages in thread From: Sagi Grimberg @ 2022-05-21 20:03 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >>> Hi all, >>> >>> one of our partners registered queue stalls and I/O timeouts under >>> high load. Analysis revealed that we see an extremely 'choppy' I/O >>> behaviour when running large transfers on systems on low-performance >>> links (eg 1GigE networks). >>> We had a system with 30 queues trying to transfer 128M requests; simple >>> calculation shows that transferring a _single_ request on all queues >>> will take up to 38 seconds, thereby timing out the last request before >>> it got sent. >>> As a solution I first fixed up the timeout handler to reset the timeout >>> if the request is still queued or in the process of being send. The >>> second path modifies the send path to only allow for new requests if we >>> have enough space on the TX queue, and finally break up the send loop to >>> avoid system stalls when sending large request. >> >> What is the average latency you are seeing with this test? >> I'm guessing more than 30 seconds :) > > Yes, of course. Simple maths, in the end. > (Actually it's more as we're always triggering a reconnect cycle...) > And telling the customer to change his testcase only helps _so_ much. Not to change the test case, simply making the io_timeout substantially larger. ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2022-05-24 9:59 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-05-19 6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke 2022-05-19 6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke 2022-05-20 9:05 ` Sagi Grimberg 2022-05-23 8:42 ` Hannes Reinecke 2022-05-23 13:36 ` Sagi Grimberg 2022-05-23 14:01 ` Hannes Reinecke 2022-05-23 15:05 ` Sagi Grimberg 2022-05-23 16:07 ` Hannes Reinecke 2022-05-24 7:57 ` Sagi Grimberg 2022-05-24 8:08 ` Hannes Reinecke 2022-05-24 8:53 ` Sagi Grimberg 2022-05-24 9:34 ` Hannes Reinecke 2022-05-24 9:58 ` Sagi Grimberg 2022-05-19 6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke 2022-05-20 9:17 ` Sagi Grimberg 2022-05-20 10:05 ` Hannes Reinecke 2022-05-21 20:01 ` Sagi Grimberg 2022-05-19 6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke 2022-05-20 9:19 ` Sagi Grimberg 2022-05-20 9:59 ` Hannes Reinecke 2022-05-21 20:02 ` Sagi Grimberg 2022-05-20 9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg 2022-05-20 10:01 ` Hannes Reinecke 2022-05-21 20:03 ` Sagi Grimberg