* [PATCH 0/3] nvme-tcp: queue stalls under high load @ 2022-05-19 6:26 Hannes Reinecke 2022-05-19 6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke ` (3 more replies) 0 siblings, 4 replies; 24+ messages in thread From: Hannes Reinecke @ 2022-05-19 6:26 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke Hi all, one of our partners registered queue stalls and I/O timeouts under high load. Analysis revealed that we see extremely 'choppy' I/O behaviour when running large transfers on systems with low-performance links (eg 1GigE networks). We had a system with 30 queues trying to transfer 128M requests; a simple calculation shows that transferring a _single_ request on all queues will take up to 38 seconds, thereby timing out the last request before it got sent. As a solution I first fixed up the timeout handler to reset the timeout if the request is still queued or in the process of being sent. The second patch modifies the send path to only allow new requests if we have enough space on the TX queue, and the final patch breaks up the send loop to avoid system stalls when sending large requests. As usual, comments and reviews are welcome. Hannes Reinecke (3): nvme-tcp: spurious I/O timeout under high load nvme-tcp: Check for write space before queueing requests nvme-tcp: send quota for nvme_tcp_send_all() drivers/nvme/host/tcp.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) -- 2.29.2 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-19 6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke @ 2022-05-19 6:26 ` Hannes Reinecke 2022-05-20 9:05 ` Sagi Grimberg 2022-05-19 6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke ` (2 subsequent siblings) 3 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-19 6:26 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke When running on slow links requests might take some time to be processed, and as we always allow queueing requests a timeout may trigger while the requests are still queued. E.g. sending 128M requests over 30 queues over a 1GigE link will inevitably time out before the last request could be sent. So reset the timeout if the request is still being queued or if it's in the process of being sent. Signed-off-by: Hannes Reinecke <hare@suse.de> --- drivers/nvme/host/tcp.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index bb67538d241b..ede76a0719a0 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -2332,6 +2332,13 @@ nvme_tcp_timeout(struct request *rq, bool reserved) "queue %d: timeout request %#x type %d\n", nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type); + if (!list_empty(&req->entry) || req->queue->request == req) { + dev_warn(ctrl->device, + "queue %d: queue stall, resetting timeout\n", + nvme_tcp_queue_id(req->queue)); + return BLK_EH_RESET_TIMER; + } + if (ctrl->state != NVME_CTRL_LIVE) { /* * If we are resetting, connecting or deleting we should -- 2.29.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-19 6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke @ 2022-05-20 9:05 ` Sagi Grimberg 2022-05-23 8:42 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-20 9:05 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme The patch title does not explain what the patch does, or what it fixes. > When running on slow links requests might take some time > for be processed, and as we always allow to queue requests > timeout may trigger when the requests are still queued. > Eg sending 128M requests over 30 queues over a 1GigE link > will inevitably timeout before the last request could be sent. > So reset the timeout if the request is still being queued > or if it's in the process of being sent. Maybe I'm missing something... But you are overloading so much that you timeout even before a command is sent out. That still does not change the fact that the timeout expired. Why is resetting the timer without taking any action the acceptable action in this case? Is this solving a bug? The fact that you get timeouts in your test is somewhat expected isn't it? 
> > Signed-off-by: Hannes Reinecke <hare@suse.de> > --- > drivers/nvme/host/tcp.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index bb67538d241b..ede76a0719a0 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -2332,6 +2332,13 @@ nvme_tcp_timeout(struct request *rq, bool reserved) > "queue %d: timeout request %#x type %d\n", > nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type); > > + if (!list_empty(&req->entry) || req->queue->request == req) { > + dev_warn(ctrl->device, > + "queue %d: queue stall, resetting timeout\n", > + nvme_tcp_queue_id(req->queue)); > + return BLK_EH_RESET_TIMER; > + } > + > if (ctrl->state != NVME_CTRL_LIVE) { > /* > * If we are resetting, connecting or deleting we should ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-20 9:05 ` Sagi Grimberg @ 2022-05-23 8:42 ` Hannes Reinecke 2022-05-23 13:36 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-23 8:42 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/20/22 11:05, Sagi Grimberg wrote: > The patch title does not explain what the patch does, or what it > fixes. > >> When running on slow links requests might take some time >> for be processed, and as we always allow to queue requests >> timeout may trigger when the requests are still queued. >> Eg sending 128M requests over 30 queues over a 1GigE link >> will inevitably timeout before the last request could be sent. >> So reset the timeout if the request is still being queued >> or if it's in the process of being sent. > > Maybe I'm missing something... But you are overloading so much that you > timeout even before a command is sent out. That still does not change > the fact that the timeout expired. Why is resetting the timer without > taking any action the acceptable action in this case? > > Is this solving a bug? The fact that you get timeouts in your test > is somewhat expected isn't it? > > Yes, and no. We happily let requests sit in the (blk-layer) queue for basically any amount of time. And it's a design decision within the driver _when_ to start the timer. My point is that starting the timer and _then_ doing internal queuing is questionable; we might have returned BLK_STS_AGAIN (or something) when we found that we cannot send requests right now. Or we might have started the timer only when the request is being sent to the HW. So returning a timeout in one case but not the other is somewhat erratic. 
I would argue that we should only start the timer when requests have had a chance to be sent to the HW; when it's still within the driver one has a hard time arguing why timeouts do apply on one level but not on the other, especially as both levels do exactly the same (to wit: queue commands until they can be sent). I'm open to discussion what we should be doing when the request is in the process of being sent. But when it didn't have a chance to be sent and we just overloaded our internal queuing we shouldn't be sending timeouts. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-23 8:42 ` Hannes Reinecke @ 2022-05-23 13:36 ` Sagi Grimberg 2022-05-23 14:01 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-23 13:36 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >> The patch title does not explain what the patch does, or what it >> fixes. >> >>> When running on slow links requests might take some time >>> for be processed, and as we always allow to queue requests >>> timeout may trigger when the requests are still queued. >>> Eg sending 128M requests over 30 queues over a 1GigE link >>> will inevitably timeout before the last request could be sent. >>> So reset the timeout if the request is still being queued >>> or if it's in the process of being sent. >> >> Maybe I'm missing something... But you are overloading so much that you >> timeout even before a command is sent out. That still does not change >> the fact that the timeout expired. Why is resetting the timer without >> taking any action the acceptable action in this case? >> >> Is this solving a bug? The fact that you get timeouts in your test >> is somewhat expected isn't it? >> > > Yes, and no. > We happily let requests sit in the (blk-layer) queue for basically any > amount of time. > And it's a design decision within the driver _when_ to start the timer. Is it? isn't it supposed to start when the request is queued? > My point is that starting the timer and _then_ do internal queuing is > questionable; we might have returned BLK_STS_AGAIN (or something) when > we found that we cannot send requests right now. > Or we might have started the timer only when the request is being sent > to the HW. It is not sent to the HW, it is sent down the TCP stack. But it is not any different than posting the request to a hw queue on a pci/rdma/fc device. 
The device has some context that processes the queue and sends it to the wire, in nvme-tcp that context is io_work. > So returning a timeout in one case but not the other is somewhat erratic. What is the difference from posting a work request to an rdma nic on a congested network? an imaginary 1Gb rdma nic :) Or maybe let's ask it differently, what happens if you run this test on the same nic, but with soft-iwarp/soft-roce interface on top of it? > I would argue that we should only start the timer when requests have had > a chance to be sent to the HW; when it's still within the driver one has > a hard time arguing why timeouts do apply on one level but not on the > other, especially as both levels to exactly the same (to wit: queue > commands until they can be sent). I look at this differently, the way I see it, is that nvme-tcp is exactly like nvme-rdma/nvme-fc but also implements the context executing the command, in software. So in my mind, it is mixing different layers. > I'm open to discussion what we should be doing when the request is in > the process of being sent. But when it didn't have a chance to be sent > and we just overloaded our internal queuing we shouldn't be sending > timeouts. As mentioned above, what happens if that same reporter opens another bug reporting that the same phenomenon happens with soft-iwarp? What would you tell him/her? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-23 13:36 ` Sagi Grimberg @ 2022-05-23 14:01 ` Hannes Reinecke 2022-05-23 15:05 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-23 14:01 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/23/22 15:36, Sagi Grimberg wrote: > >>> The patch title does not explain what the patch does, or what it >>> fixes. >>> >>>> When running on slow links requests might take some time >>>> for be processed, and as we always allow to queue requests >>>> timeout may trigger when the requests are still queued. >>>> Eg sending 128M requests over 30 queues over a 1GigE link >>>> will inevitably timeout before the last request could be sent. >>>> So reset the timeout if the request is still being queued >>>> or if it's in the process of being sent. >>> >>> Maybe I'm missing something... But you are overloading so much that you >>> timeout even before a command is sent out. That still does not change >>> the fact that the timeout expired. Why is resetting the timer without >>> taking any action the acceptable action in this case? >>> >>> Is this solving a bug? The fact that you get timeouts in your test >>> is somewhat expected isn't it? >>> >> >> Yes, and no. >> We happily let requests sit in the (blk-layer) queue for basically any >> amount of time. >> And it's a design decision within the driver _when_ to start the timer. > > Is it? isn't it supposed to start when the request is queued? > Queued where? >> My point is that starting the timer and _then_ do internal queuing is >> questionable; we might have returned BLK_STS_AGAIN (or something) when >> we found that we cannot send requests right now. >> Or we might have started the timer only when the request is being sent >> to the HW. > > It is not sent to the HW, it is sent down the TCP stack. 
But it is not > any different than posting the request to a hw queue on a pci/rdma/fc > device. The device has some context that process the queue and sends > it to the wire, in nvme-tcp that context is io_work. > >> So returning a timeout in one case but not the other is somewhat erratic. > > What is the difference than posting a work request to an rdma nic on > a congested network? an imaginary 1Gb rdma nic :) > > Or maybe lets ask it differently, what happens if you run this test > on the same nic, but with soft-iwarp/soft-roce interface on top of it? > I can't really tell, as I haven't tried. > Can give it a go, though. >> I would argue that we should only start the timer when requests have >> had a chance to be sent to the HW; when it's still within the driver >> one has a hard time arguing why timeouts do apply on one level but not >> on the other, especially as both levels to exactly the same (to wit: >> queue commands until they can be sent). > > I look at this differently, the way I see it, is that nvme-tcp is > exactly like nvme-rdma/nvme-fc but also implements context executing the > command, in software. So in my mind, it is mixing different layers. > Hmm. Yes, of course one could take this stance. Especially given the NVMe-oF notion of 'transport'. Sadly it's hard to reproduce this with other transports, as they inevitably only run on HW fast enough to not directly exhibit this problem (FC is now on 8G min, and IB probably on at least 10G). The issue arises when running a fio test with a variable size (4M - 128M), which works on other transports like FC. For TCP we're running into the said timeouts, but adding things like blk-cgroup or rq-qos makes the issue go away. So the question naturally would be why we need a traffic shaper on TCP, but not on FC. >> I'm open to discussion what we should be doing when the request is in >> the process of being sent. 
But when it didn't have a chance to be sent >> and we just overloaded our internal queuing we shouldn't be sending >> timeouts. > > As mentioned above, what happens if that same reporter opens another bug > that the same phenomenon happens with soft-iwarp? What would you tell > him/her? Nope. It's a HW appliance. Not a chance to change that. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-23 14:01 ` Hannes Reinecke @ 2022-05-23 15:05 ` Sagi Grimberg 2022-05-23 16:07 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-23 15:05 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >>>> The patch title does not explain what the patch does, or what it >>>> fixes. >>>> >>>>> When running on slow links requests might take some time >>>>> for be processed, and as we always allow to queue requests >>>>> timeout may trigger when the requests are still queued. >>>>> Eg sending 128M requests over 30 queues over a 1GigE link >>>>> will inevitably timeout before the last request could be sent. >>>>> So reset the timeout if the request is still being queued >>>>> or if it's in the process of being sent. >>>> >>>> Maybe I'm missing something... But you are overloading so much that you >>>> timeout even before a command is sent out. That still does not change >>>> the fact that the timeout expired. Why is resetting the timer without >>>> taking any action the acceptable action in this case? >>>> >>>> Is this solving a bug? The fact that you get timeouts in your test >>>> is somewhat expected isn't it? >>>> >>> >>> Yes, and no. >>> We happily let requests sit in the (blk-layer) queue for basically >>> any amount of time. >>> And it's a design decision within the driver _when_ to start the timer. >> >> Is it? isn't it supposed to start when the request is queued? >> > Queued where? When .queue_rq() is called, when it returns the request should be queued. >>> My point is that starting the timer and _then_ do internal queuing is >>> questionable; we might have returned BLK_STS_AGAIN (or something) >>> when we found that we cannot send requests right now. >>> Or we might have started the timer only when the request is being >>> sent to the HW. >> >> It is not sent to the HW, it is sent down the TCP stack. 
But it is not >> any different than posting the request to a hw queue on a pci/rdma/fc >> device. The device has some context that process the queue and sends >> it to the wire, in nvme-tcp that context is io_work. >> >>> So returning a timeout in one case but not the other is somewhat >>> erratic. >> >> What is the difference than posting a work request to an rdma nic on >> a congested network? an imaginary 1Gb rdma nic :) >> >> Or maybe lets ask it differently, what happens if you run this test >> on the same nic, but with soft-iwarp/soft-roce interface on top of it? >> > I can't really tell, as I haven't tried. > Can give it a go, though. You can, but the same thing will happen. The only difference is that soft-iwarp is a TCP ULP that does not expose the state of the socket to the nvme transport (rdma), nor any congestion related attributes. >>> I would argue that we should only start the timer when requests have >>> had a chance to be sent to the HW; when it's still within the driver >>> one has a hard time arguing why timeouts do apply on one level but >>> not on the other, especially as both levels to exactly the same (to >>> wit: queue commands until they can be sent). >> >> I look at this differently, the way I see it, is that nvme-tcp is >> exactly like nvme-rdma/nvme-fc but also implements context executing the >> command, in software. So in my mind, it is mixing different layers. >> > Hmm. Yes, of course one could take this stance. > Especially given the NVMe-oF notion of 'transport'. > > Sadly it's hard to reproduce this with other transports, as they > inevitably only run on HW fast enough to not directly exhibit this > problem (FC is now on 8G min, and IB probably on at least 10G). Use soft-iwarp/soft-roce and it should do the exact same thing on a 1Gb nic. > The issue arises when running a fio test with a variable size > (4M - 128M), which works on other transports like FC. 
> For TCP we're running into the said timeouts, but adding things like > blk-cgroup or rq-qos make the issue go away. > > So question naturally would be why we need a traffic shaper on TCP, but > not on FC. If I were to take an old 10G rdma nic (or even new 25G), stick it into a 128 core machine, create 128 queues of depth 1024, run workload with bs=4M jobs=128 qd=1024, you would also see timeouts. In every setup there exists a workload that will overwhelm it... >>> I'm open to discussion what we should be doing when the request is in >>> the process of being sent. But when it didn't have a chance to be >>> sent and we just overloaded our internal queuing we shouldn't be >>> sending timeouts. >> >> As mentioned above, what happens if that same reporter opens another bug >> that the same phenomenon happens with soft-iwarp? What would you tell >> him/her? > > Nope. It's a HW appliance. Not a chance to change that. It was just a theoretical question. Do note that I'm not against solving a problem for anyone, I'm just questioning whether making the io_timeout effectively unbounded when the network is congested is the right solution for everyone instead of a particular case that can easily be solved with udev to make the io_timeout as high as needed. One can argue that this patchset is making nvme-tcp basically ignore the device io_timeout in certain cases. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-23 15:05 ` Sagi Grimberg @ 2022-05-23 16:07 ` Hannes Reinecke 2022-05-24 7:57 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-23 16:07 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/23/22 17:05, Sagi Grimberg wrote: > [ .. ] >>>> I'm open to discussion what we should be doing when the request is >>>> in the process of being sent. But when it didn't have a chance to be >>>> sent and we just overloaded our internal queuing we shouldn't be >>>> sending timeouts. >>> >>> As mentioned above, what happens if that same reporter opens another bug >>> that the same phenomenon happens with soft-iwarp? What would you tell >>> him/her? >> >> Nope. It's a HW appliance. Not a chance to change that. > > It was just a theoretical question. > > Do note that I'm not against solving a problem for anyone, I'm just > questioning if increasing the io_timeout to be unbound in case the > network is congested, is the right solution for everyone instead of > a particular case that can easily be solved with udev to make the > io_timeout to be as high as needed. > > One can argue that this patchset is making nvme-tcp to basically > ignore the device io_timeout in certain cases. Oh, yes, sure, that will happen. What I'm actually arguing is the imprecise difference between BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() and command timeouts in case of resource constraints on the driver implementing ->queue_rq(). If there is a resource constraint the driver is free to return BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept the request (in which case there will be a timeout). I could live with a timeout if that would just result in the command being retried. 
And having a workload which can generate connection resets feels like a DoS attack to me; applications shouldn't be able to do that. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-23 16:07 ` Hannes Reinecke @ 2022-05-24 7:57 ` Sagi Grimberg 2022-05-24 8:08 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-24 7:57 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >>>>> I'm open to discussion what we should be doing when the request is >>>>> in the process of being sent. But when it didn't have a chance to >>>>> be sent and we just overloaded our internal queuing we shouldn't be >>>>> sending timeouts. >>>> >>>> As mentioned above, what happens if that same reporter opens another >>>> bug >>>> that the same phenomenon happens with soft-iwarp? What would you tell >>>> him/her? >>> >>> Nope. It's a HW appliance. Not a chance to change that. >> >> It was just a theoretical question. >> >> Do note that I'm not against solving a problem for anyone, I'm just >> questioning if increasing the io_timeout to be unbound in case the >> network is congested, is the right solution for everyone instead of >> a particular case that can easily be solved with udev to make the >> io_timeout to be as high as needed. >> >> One can argue that this patchset is making nvme-tcp to basically >> ignore the device io_timeout in certain cases. > > Oh, yes, sure, that will happen. > What I'm actually arguing is the imprecise difference between > BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() > and command timeouts in case of resource constraints on the driver > implementing ->queue_rq(). > > If there is a resource constrain driver is free to return > BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept > the request (in which case there will be a timeout). There is no resource constraint. The driver sizes up the resources to be able to queue all the requests it is getting. > I could live with a timeout if that would just result in the command > being retried. 
But in the case of nvme it results in a connection reset > to boot, making customers really nervous that their system is broken. But how does the driver know that it is running in this environment that is completely congested? What I'm saying is that this is a specific use case that the solution can have negative side-effects for other common use-cases, because it is beyond the scope of the driver to handle. We can also trigger this condition with nvme-rdma. We could stay with this patch, but I'd argue that this might be the wrong thing to do in certain use-cases. > And having a workload which can generate connection resets feels like a > DoS attack to me; applications shouldn't be able to do that. Having a workload occupying all the tags all the time is also a DDOS attack. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-24 7:57 ` Sagi Grimberg @ 2022-05-24 8:08 ` Hannes Reinecke 2022-05-24 8:53 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-24 8:08 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/24/22 09:57, Sagi Grimberg wrote: > >>>>>> I'm open to discussion what we should be doing when the request is >>>>>> in the process of being sent. But when it didn't have a chance to >>>>>> be sent and we just overloaded our internal queuing we shouldn't >>>>>> be sending timeouts. >>>>> >>>>> As mentioned above, what happens if that same reporter opens >>>>> another bug >>>>> that the same phenomenon happens with soft-iwarp? What would you tell >>>>> him/her? >>>> >>>> Nope. It's a HW appliance. Not a chance to change that. >>> >>> It was just a theoretical question. >>> >>> Do note that I'm not against solving a problem for anyone, I'm just >>> questioning if increasing the io_timeout to be unbound in case the >>> network is congested, is the right solution for everyone instead of >>> a particular case that can easily be solved with udev to make the >>> io_timeout to be as high as needed. >>> >>> One can argue that this patchset is making nvme-tcp to basically >>> ignore the device io_timeout in certain cases. >> >> Oh, yes, sure, that will happen. >> What I'm actually arguing is the imprecise difference between >> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() >> and command timeouts in case of resource constraints on the driver >> implementing ->queue_rq(). >> >> If there is a resource constrain driver is free to return >> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept >> the request (in which case there will be a timeout). > > There is no resource constraint. The driver sizes up the resources > to be able to queue all the requests it is getting. 
> >> I could live with a timeout if that would just result in the command >> being retried. But in the case of nvme it results in a connection >> reset to boot, making customers really nervous that their system is >> broken. > > But how does the driver know that it is running in this environment that > is completely congested? What I'm saying is that this is a specific use > case that the solution can have negative side-effects for other common > use-cases, because it is beyond the scope of the driver to handle. > > We can also trigger this condition with nvme-rdma. > > We could stay with this patch, but I'd argue that this might be the > wrong thing to do in certain use-cases. > Right, okay. Arguably this is a workload corner case, and we might not want to fix this in the driver. _However_: do we need to do a controller reset in this case? Shouldn't it be sufficient to just complete the command w/ timeout error and be done with it? Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-24 8:08 ` Hannes Reinecke @ 2022-05-24 8:53 ` Sagi Grimberg 2022-05-24 9:34 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-24 8:53 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >>>>>>> I'm open to discussion what we should be doing when the request >>>>>>> is in the process of being sent. But when it didn't have a chance >>>>>>> to be sent and we just overloaded our internal queuing we >>>>>>> shouldn't be sending timeouts. >>>>>> >>>>>> As mentioned above, what happens if that same reporter opens >>>>>> another bug >>>>>> that the same phenomenon happens with soft-iwarp? What would you tell >>>>>> him/her? >>>>> >>>>> Nope. It's a HW appliance. Not a chance to change that. >>>> >>>> It was just a theoretical question. >>>> >>>> Do note that I'm not against solving a problem for anyone, I'm just >>>> questioning if increasing the io_timeout to be unbound in case the >>>> network is congested, is the right solution for everyone instead of >>>> a particular case that can easily be solved with udev to make the >>>> io_timeout to be as high as needed. >>>> >>>> One can argue that this patchset is making nvme-tcp to basically >>>> ignore the device io_timeout in certain cases. >>> >>> Oh, yes, sure, that will happen. >>> What I'm actually arguing is the imprecise difference between >>> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() >>> and command timeouts in case of resource constraints on the driver >>> implementing ->queue_rq(). >>> >>> If there is a resource constrain driver is free to return >>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept >>> the request (in which case there will be a timeout). >> >> There is no resource constraint. The driver sizes up the resources >> to be able to queue all the requests it is getting. 
>> >>> I could live with a timeout if that would just result in the command >>> being retried. But in the case of nvme it results in a connection >>> reset to boot, making customers really nervous that their system is >>> broken. >> >> But how does the driver know that it is running in this environment that >> is completely congested? What I'm saying is that this is a specific use >> case that the solution can have negative side-effects for other common >> use-cases, because it is beyond the scope of the driver to handle. >> >> We can also trigger this condition with nvme-rdma. >> >> We could stay with this patch, but I'd argue that this might be the >> wrong thing to do in certain use-cases. >> > Right, okay. > > Arguably this is a workload corner case, and we might not want to fix > this in the driver. > > _However_: do we need to do a controller reset in this case? > Shouldn't it be sufficient to just complete the command w/ timeout > error and be done with it? The question is what is special about this timeout vs. any other timeout? pci attempts to abort the command before triggering a controller reset. Maybe we should also? Although abort is not really reliable going on the admin queue... ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-24 8:53 ` Sagi Grimberg @ 2022-05-24 9:34 ` Hannes Reinecke 2022-05-24 9:58 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-24 9:34 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/24/22 10:53, Sagi Grimberg wrote: > >>>>>>>> I'm open to discussion what we should be doing when the request >>>>>>>> is in the process of being sent. But when it didn't have a >>>>>>>> chance to be sent and we just overloaded our internal queuing we >>>>>>>> shouldn't be sending timeouts. >>>>>>> >>>>>>> As mentioned above, what happens if that same reporter opens >>>>>>> another bug >>>>>>> that the same phenomenon happens with soft-iwarp? What would you >>>>>>> tell >>>>>>> him/her? >>>>>> >>>>>> Nope. It's a HW appliance. Not a chance to change that. >>>>> >>>>> It was just a theoretical question. >>>>> >>>>> Do note that I'm not against solving a problem for anyone, I'm just >>>>> questioning if increasing the io_timeout to be unbound in case the >>>>> network is congested, is the right solution for everyone instead of >>>>> a particular case that can easily be solved with udev to make the >>>>> io_timeout to be as high as needed. >>>>> >>>>> One can argue that this patchset is making nvme-tcp to basically >>>>> ignore the device io_timeout in certain cases. >>>> >>>> Oh, yes, sure, that will happen. >>>> What I'm actually arguing is the imprecise difference between >>>> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() >>>> and command timeouts in case of resource constraints on the driver >>>> implementing ->queue_rq(). >>>> >>>> If there is a resource constrain driver is free to return >>>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or >>>> accept the request (in which case there will be a timeout). >>> >>> There is no resource constraint. 
The driver sizes up the resources >>> to be able to queue all the requests it is getting. >>> >>>> I could live with a timeout if that would just result in the command >>>> being retried. But in the case of nvme it results in a connection >>>> reset to boot, making customers really nervous that their system is >>>> broken. >>> >>> But how does the driver know that it is running in this environment that >>> is completely congested? What I'm saying is that this is a specific use >>> case that the solution can have negative side-effects for other common >>> use-cases, because it is beyond the scope of the driver to handle. >>> >>> We can also trigger this condition with nvme-rdma. >>> >>> We could stay with this patch, but I'd argue that this might be the >>> wrong thing to do in certain use-cases. >>> >> Right, okay. >> >> Arguably this is a workload corner case, and we might not want to fix >> this in the driver. >> >> _However_: do we need to do a controller reset in this case? >> Shouldn't it be sufficient to just complete the command w/ timeout >> error and be done with it? > > The question is what is special about this timeout vs. any other > timeout? > > pci attempts to abort the command before triggering a controller > reset, Maybe we should also? although abort is not really reliable > going on the admin queue... I am not talking about NVMe abort. I'm talking about this: @@ -2335,6 +2340,11 @@ nvme_tcp_timeout(struct request *rq, bool reserved) "queue %d: timeout request %#x type %d\n", nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type); + if (!list_empty(&req->entry)) { + nvme_tcp_complete_timed_out(rq); + return BLK_EH_DONE; + } + if (ctrl->state != NVME_CTRL_LIVE) { /* * If we are resetting, connecting or deleting we should as the command is still in the queue and NVMe abort don't enter the picture at all. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 
5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: spurious I/O timeout under high load 2022-05-24 9:34 ` Hannes Reinecke @ 2022-05-24 9:58 ` Sagi Grimberg 0 siblings, 0 replies; 24+ messages in thread From: Sagi Grimberg @ 2022-05-24 9:58 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/24/22 12:34, Hannes Reinecke wrote: > On 5/24/22 10:53, Sagi Grimberg wrote: >> >>>>>>>>> I'm open to discussion what we should be doing when the request >>>>>>>>> is in the process of being sent. But when it didn't have a >>>>>>>>> chance to be sent and we just overloaded our internal queuing >>>>>>>>> we shouldn't be sending timeouts. >>>>>>>> >>>>>>>> As mentioned above, what happens if that same reporter opens >>>>>>>> another bug >>>>>>>> that the same phenomenon happens with soft-iwarp? What would you >>>>>>>> tell >>>>>>>> him/her? >>>>>>> >>>>>>> Nope. It's a HW appliance. Not a chance to change that. >>>>>> >>>>>> It was just a theoretical question. >>>>>> >>>>>> Do note that I'm not against solving a problem for anyone, I'm just >>>>>> questioning if increasing the io_timeout to be unbound in case the >>>>>> network is congested, is the right solution for everyone instead of >>>>>> a particular case that can easily be solved with udev to make the >>>>>> io_timeout to be as high as needed. >>>>>> >>>>>> One can argue that this patchset is making nvme-tcp to basically >>>>>> ignore the device io_timeout in certain cases. >>>>> >>>>> Oh, yes, sure, that will happen. >>>>> What I'm actually arguing is the imprecise difference between >>>>> BLK_STS_AGAIN / BLK_STS_RESOURCE as a return value from ->queue_rq() >>>>> and command timeouts in case of resource constraints on the driver >>>>> implementing ->queue_rq(). >>>>> >>>>> If there is a resource constrain driver is free to return >>>>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or >>>>> accept the request (in which case there will be a timeout). 
>>>> >>>> There is no resource constraint. The driver sizes up the resources >>>> to be able to queue all the requests it is getting. >>>> >>>>> I could live with a timeout if that would just result in the >>>>> command being retried. But in the case of nvme it results in a >>>>> connection reset to boot, making customers really nervous that >>>>> their system is broken. >>>> >>>> But how does the driver know that it is running in this environment >>>> that >>>> is completely congested? What I'm saying is that this is a specific use >>>> case that the solution can have negative side-effects for other common >>>> use-cases, because it is beyond the scope of the driver to handle. >>>> >>>> We can also trigger this condition with nvme-rdma. >>>> >>>> We could stay with this patch, but I'd argue that this might be the >>>> wrong thing to do in certain use-cases. >>>> >>> Right, okay. >>> >>> Arguably this is a workload corner case, and we might not want to fix >>> this in the driver. >>> >>> _However_: do we need to do a controller reset in this case? >>> Shouldn't it be sufficient to just complete the command w/ timeout >>> error and be done with it? >> >> The question is what is special about this timeout vs. any other >> timeout? >> >> pci attempts to abort the command before triggering a controller >> reset, Maybe we should also? although abort is not really reliable >> going on the admin queue... > > I am not talking about NVMe abort. > I'm talking about this: > > @@ -2335,6 +2340,11 @@ nvme_tcp_timeout(struct request *rq, bool reserved) > "queue %d: timeout request %#x type %d\n", > nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type); > > + if (!list_empty(&req->entry)) { > + nvme_tcp_complete_timed_out(rq); > + return BLK_EH_DONE; > + } > + > if (ctrl->state != NVME_CTRL_LIVE) { > /* > * If we are resetting, connecting or deleting we should > > > as the command is still in the queue and NVMe abort don't enter the > picture at all. 
That for sure will not help because nvme_tcp_complete_timed_out stops the queue. What you can do is maybe just remove it from the pending list and complete it. ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 2/3] nvme-tcp: Check for write space before queueing requests 2022-05-19 6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke 2022-05-19 6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke @ 2022-05-19 6:26 ` Hannes Reinecke 2022-05-20 9:17 ` Sagi Grimberg 2022-05-19 6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke 2022-05-20 9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg 3 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-19 6:26 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke The current model of always queueing incoming requests leads to write stalls, as we can easily overload the network device under high I/O load. To avoid unlimited queueing we should instead check whether write space is available before accepting new requests. Signed-off-by: Hannes Reinecke <hare@suse.de> --- drivers/nvme/host/tcp.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index ede76a0719a0..606565a4c708 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -2464,6 +2464,9 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx, if (unlikely(ret)) return ret; + if (!sk_stream_is_writeable(queue->sock->sk)) + return BLK_STS_RESOURCE; + blk_mq_start_request(rq); nvme_tcp_queue_request(req, true, bd->last); -- 2.29.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] nvme-tcp: Check for write space before queueing requests 2022-05-19 6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke @ 2022-05-20 9:17 ` Sagi Grimberg 2022-05-20 10:05 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-20 9:17 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme > The current model of always queue incoming requests lead to > write stalls as we easily overload the network device under > high I/O load. > To avoid unlimited queueing we should rather check if write > space is available before accepting new requests. I'm somewhat on the fence with this one... On one end, we are checking the sock write space, but don't check the queued requests. And, this is purely advisory and not really a check we rely on. The merit of doing something like this is that we don't start the request timer, but we can just as easily queue the request and have it later queued for long due to sock being overloaded. Can you explain your thoughts to why this is a good solution? > Signed-off-by: Hannes Reinecke <hare@suse.de> > --- > drivers/nvme/host/tcp.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index ede76a0719a0..606565a4c708 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -2464,6 +2464,9 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx, > if (unlikely(ret)) > return ret; > > + if (!sk_stream_is_writeable(queue->sock->sk)) > + return BLK_STS_RESOURCE; > + > blk_mq_start_request(rq); > > nvme_tcp_queue_request(req, true, bd->last); ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] nvme-tcp: Check for write space before queueing requests 2022-05-20 9:17 ` Sagi Grimberg @ 2022-05-20 10:05 ` Hannes Reinecke 2022-05-21 20:01 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-20 10:05 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/20/22 11:17, Sagi Grimberg wrote: > >> The current model of always queue incoming requests lead to >> write stalls as we easily overload the network device under >> high I/O load. >> To avoid unlimited queueing we should rather check if write >> space is available before accepting new requests. > > I'm somewhat on the fence with this one... On one end, we > are checking the sock write space, but don't check the queued > requests. And, this is purely advisory and not really a check > we rely on. > > The merit of doing something like this is that we don't start > the request timer, but we can just as easily queue the request > and have it later queued for long due to sock being overloaded. > > Can you explain your thoughts to why this is a good solution? > Request timeouts. As soon as we call 'blk_mq_start_request()' the I/O timer is called, and given that we (currently) queue _every_ request irrespective of the underlying device status we might end up queueing for a _loooong_ time. Timeouts while still in the queue are being handled by the first patch, but the underlying network might also be busy with retries and whatnot. So again, queuing requests when we _know_ there'll be a congestion is just asking for trouble (or, rather, spurious I/O timeouts). If one is worried about performance one can always increase the wmem size :-), but really it means that either your testcase or your network is misdesigned. And I'm perfectly fine with increasing the latency in these cases. What I don't like is timeouts, as these will show up to the user and we get all the supportcalls telling us that the kernel is broken. 
Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] nvme-tcp: Check for write space before queueing requests 2022-05-20 10:05 ` Hannes Reinecke @ 2022-05-21 20:01 ` Sagi Grimberg 0 siblings, 0 replies; 24+ messages in thread From: Sagi Grimberg @ 2022-05-21 20:01 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >>> The current model of always queue incoming requests lead to >>> write stalls as we easily overload the network device under >>> high I/O load. >>> To avoid unlimited queueing we should rather check if write >>> space is available before accepting new requests. >> >> I'm somewhat on the fence with this one... On one end, we >> are checking the sock write space, but don't check the queued >> requests. And, this is purely advisory and not really a check >> we rely on. >> >> The merit of doing something like this is that we don't start >> the request timer, but we can just as easily queue the request >> and have it later queued for long due to sock being overloaded. >> >> Can you explain your thoughts to why this is a good solution? >> > Request timeouts. > As soon as we call 'blk_mq_start_request()' the I/O timer is called, and > given that we (currently) queue _every_ request irrespective of the > underlying device status we might end up queueing for a _loooong_ time. > > Timeouts while still in the queue are being handled by the first patch, > but the underlying network might also be busy with retries and whatnot. > So again, queuing requests when we _know_ there'll be a congestion is > just asking for trouble (or, rather, spurious I/O timeouts). > > If one is worried about performance one can always increase the wmem > size :-), but really it means that either your testcase or your network > is misdesigned. > And I'm perfectly fine with increasing the latency in these cases. > What I don't like is timeouts, as these will show up to the user and we > get all the supportcalls telling us that the kernel is broken. 
Can you run some sanity perf tests to understand if anything unexpected comes up from this? ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() 2022-05-19 6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke 2022-05-19 6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke 2022-05-19 6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke @ 2022-05-19 6:26 ` Hannes Reinecke 2022-05-20 9:19 ` Sagi Grimberg 2022-05-20 9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg 3 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-19 6:26 UTC (permalink / raw) To: Christoph Hellwig Cc: Sagi Grimberg, Keith Busch, linux-nvme, Daniel Wagner, Hannes Reinecke From: Daniel Wagner <dwagner@suse.de> Add a send quota in nvme_tcp_send_all() to avoid stalls when sending large amounts of requests. Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Hannes Reinecke <hare@suse.de> --- drivers/nvme/host/tcp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 606565a4c708..87d760dfa3a9 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req, static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue) { int ret; + unsigned long deadline = jiffies + msecs_to_jiffies(1); /* drain the send queue as much as we can... */ do { ret = nvme_tcp_try_send(queue); - } while (ret > 0); + } while (ret > 0 || !time_after(jiffies, deadline)); } static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue) -- 2.29.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() 2022-05-19 6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke @ 2022-05-20 9:19 ` Sagi Grimberg 2022-05-20 9:59 ` Hannes Reinecke 0 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-20 9:19 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Daniel Wagner > Add a send quota in nvme_tcp_send_all() to avoid stalls when sending > large amounts of requests. > > Signed-off-by: Daniel Wagner <dwagner@suse.de> > Signed-off-by: Hannes Reinecke <hare@suse.de> > --- > drivers/nvme/host/tcp.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index 606565a4c708..87d760dfa3a9 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req, > static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue) > { > int ret; > + unsigned long deadline = jiffies + msecs_to_jiffies(1); > > /* drain the send queue as much as we can... */ > do { > ret = nvme_tcp_try_send(queue); > - } while (ret > 0); > + } while (ret > 0 || !time_after(jiffies, deadline)); Umm, this will stay here a deadline period even if we don't have anything to send? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() 2022-05-20 9:19 ` Sagi Grimberg @ 2022-05-20 9:59 ` Hannes Reinecke 2022-05-21 20:02 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-20 9:59 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Daniel Wagner On 5/20/22 11:19, Sagi Grimberg wrote: > >> Add a send quota in nvme_tcp_send_all() to avoid stalls when sending >> large amounts of requests. >> >> Signed-off-by: Daniel Wagner <dwagner@suse.de> >> Signed-off-by: Hannes Reinecke <hare@suse.de> >> --- >> drivers/nvme/host/tcp.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c >> index 606565a4c708..87d760dfa3a9 100644 >> --- a/drivers/nvme/host/tcp.c >> +++ b/drivers/nvme/host/tcp.c >> @@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct >> nvme_tcp_request *req, >> static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue) >> { >> int ret; >> + unsigned long deadline = jiffies + msecs_to_jiffies(1); >> /* drain the send queue as much as we can... */ >> do { >> ret = nvme_tcp_try_send(queue); >> - } while (ret > 0); >> + } while (ret > 0 || !time_after(jiffies, deadline)); > > Umm, this will stay here a deadline period even if we don't have > anything to send? Ah. Yeah, maybe. We can change it to '&&', which should solve it. (I think) Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() 2022-05-20 9:59 ` Hannes Reinecke @ 2022-05-21 20:02 ` Sagi Grimberg 0 siblings, 0 replies; 24+ messages in thread From: Sagi Grimberg @ 2022-05-21 20:02 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Daniel Wagner >>> Add a send quota in nvme_tcp_send_all() to avoid stalls when sending >>> large amounts of requests. >>> >>> Signed-off-by: Daniel Wagner <dwagner@suse.de> >>> Signed-off-by: Hannes Reinecke <hare@suse.de> >>> --- >>> drivers/nvme/host/tcp.c | 3 ++- >>> 1 file changed, 2 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c >>> index 606565a4c708..87d760dfa3a9 100644 >>> --- a/drivers/nvme/host/tcp.c >>> +++ b/drivers/nvme/host/tcp.c >>> @@ -308,11 +308,12 @@ static inline void nvme_tcp_advance_req(struct >>> nvme_tcp_request *req, >>> static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue) >>> { >>> int ret; >>> + unsigned long deadline = jiffies + msecs_to_jiffies(1); >>> /* drain the send queue as much as we can... */ >>> do { >>> ret = nvme_tcp_try_send(queue); >>> - } while (ret > 0); >>> + } while (ret > 0 || !time_after(jiffies, deadline)); >> >> Umm, this will stay here a deadline period even if we don't have >> anything to send? > > Ah. Yeah, maybe. We can change it to '&&', which should solve it. > (I think) See what is done in io_work() ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/3] nvme-tcp: queue stalls under high load 2022-05-19 6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke ` (2 preceding siblings ...) 2022-05-19 6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke @ 2022-05-20 9:20 ` Sagi Grimberg 2022-05-20 10:01 ` Hannes Reinecke 3 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2022-05-20 9:20 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme > Hi all, > > one of our partners registered queue stalls and I/O timeouts under > high load. Analysis revealed that we see an extremely 'choppy' I/O > behaviour when running large transfers on systems on low-performance > links (eg 1GigE networks). > We had a system with 30 queues trying to transfer 128M requests; simple > calculation shows that transferring a _single_ request on all queues > will take up to 38 seconds, thereby timing out the last request before > it got sent. > As a solution I first fixed up the timeout handler to reset the timeout > if the request is still queued or in the process of being send. The > second path modifies the send path to only allow for new requests if we > have enough space on the TX queue, and finally break up the send loop to > avoid system stalls when sending large request. What is the average latency you are seeing with this test? I'm guessing more than 30 seconds :) ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/3] nvme-tcp: queue stalls under high load 2022-05-20 9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg @ 2022-05-20 10:01 ` Hannes Reinecke 2022-05-21 20:03 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Hannes Reinecke @ 2022-05-20 10:01 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 5/20/22 11:20, Sagi Grimberg wrote: > >> Hi all, >> >> one of our partners registered queue stalls and I/O timeouts under >> high load. Analysis revealed that we see an extremely 'choppy' I/O >> behaviour when running large transfers on systems on low-performance >> links (eg 1GigE networks). >> We had a system with 30 queues trying to transfer 128M requests; simple >> calculation shows that transferring a _single_ request on all queues >> will take up to 38 seconds, thereby timing out the last request before >> it got sent. >> As a solution I first fixed up the timeout handler to reset the timeout >> if the request is still queued or in the process of being send. The >> second path modifies the send path to only allow for new requests if we >> have enough space on the TX queue, and finally break up the send loop to >> avoid system stalls when sending large request. > > What is the average latency you are seeing with this test? > I'm guessing more than 30 seconds :) Yes, of course. Simple maths, in the end. (Actually it's more as we're always triggering a reconnect cycle...) And telling the customer to change his testcase only helps _so_ much. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/3] nvme-tcp: queue stalls under high load 2022-05-20 10:01 ` Hannes Reinecke @ 2022-05-21 20:03 ` Sagi Grimberg 0 siblings, 0 replies; 24+ messages in thread From: Sagi Grimberg @ 2022-05-21 20:03 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme >>> Hi all, >>> >>> one of our partners registered queue stalls and I/O timeouts under >>> high load. Analysis revealed that we see an extremely 'choppy' I/O >>> behaviour when running large transfers on systems on low-performance >>> links (eg 1GigE networks). >>> We had a system with 30 queues trying to transfer 128M requests; simple >>> calculation shows that transferring a _single_ request on all queues >>> will take up to 38 seconds, thereby timing out the last request before >>> it got sent. >>> As a solution I first fixed up the timeout handler to reset the timeout >>> if the request is still queued or in the process of being send. The >>> second path modifies the send path to only allow for new requests if we >>> have enough space on the TX queue, and finally break up the send loop to >>> avoid system stalls when sending large request. >> >> What is the average latency you are seeing with this test? >> I'm guessing more than 30 seconds :) > > Yes, of course. Simple maths, in the end. > (Actually it's more as we're always triggering a reconnect cycle...) > And telling the customer to change his testcase only helps _so_ much. Not to change the test case, simply making the io_timeout substantially larger. ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2022-05-24 9:59 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-05-19 6:26 [PATCH 0/3] nvme-tcp: queue stalls under high load Hannes Reinecke 2022-05-19 6:26 ` [PATCH 1/3] nvme-tcp: spurious I/O timeout " Hannes Reinecke 2022-05-20 9:05 ` Sagi Grimberg 2022-05-23 8:42 ` Hannes Reinecke 2022-05-23 13:36 ` Sagi Grimberg 2022-05-23 14:01 ` Hannes Reinecke 2022-05-23 15:05 ` Sagi Grimberg 2022-05-23 16:07 ` Hannes Reinecke 2022-05-24 7:57 ` Sagi Grimberg 2022-05-24 8:08 ` Hannes Reinecke 2022-05-24 8:53 ` Sagi Grimberg 2022-05-24 9:34 ` Hannes Reinecke 2022-05-24 9:58 ` Sagi Grimberg 2022-05-19 6:26 ` [PATCH 2/3] nvme-tcp: Check for write space before queueing requests Hannes Reinecke 2022-05-20 9:17 ` Sagi Grimberg 2022-05-20 10:05 ` Hannes Reinecke 2022-05-21 20:01 ` Sagi Grimberg 2022-05-19 6:26 ` [PATCH 3/3] nvme-tcp: send quota for nvme_tcp_send_all() Hannes Reinecke 2022-05-20 9:19 ` Sagi Grimberg 2022-05-20 9:59 ` Hannes Reinecke 2022-05-21 20:02 ` Sagi Grimberg 2022-05-20 9:20 ` [PATCH 0/3] nvme-tcp: queue stalls under high load Sagi Grimberg 2022-05-20 10:01 ` Hannes Reinecke 2022-05-21 20:03 ` Sagi Grimberg