* Command timeouts with NVMe TCP kernel driver
@ 2021-08-30 13:30 Samuel Jones
  2021-08-31  2:37 ` Keith Busch
  0 siblings, 1 reply; 19+ messages in thread
From: Samuel Jones @ 2021-08-30 13:30 UTC (permalink / raw)
  To: linux-nvme

Hi all, 

I'm experiencing command timeouts with recent versions of the nvme-tcp driver. I have a setup with a VM running 5.8 which is fine, but the same test load (100% read, 16K blocks) on a VM running 5.11 or later, including 5.14-rc7, shows the issue. The initiator complains as follows: 

Aug 30 14:58:05 centos7 kernel: nvme nvme0: queue 7: timeout request 0x10 type 4 
Aug 30 14:58:05 centos7 kernel: nvme nvme0: starting error recovery 

My target is a Kalray board, but I have no trace of any outstanding commands when this situation occurs. Quite the opposite: the Kalray board observes that the initiator stops sending new requests on the queue some time before this command times out. 

I don't have any similar issues using SPDK as an initiator. I made the following modification to the nvme-tcp code and the problem disappeared: 

--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -287,6 +287,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 	 * directly, otherwise queue io_work. Also, only do that if we
 	 * are on the same cpu, so we don't introduce contention.
 	 */
+#if 0
 	if (queue->io_cpu == raw_smp_processor_id() &&
 	    sync && empty && mutex_trylock(&queue->send_mutex)) {
 		queue->more_requests = !last;
@@ -296,6 +297,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 	} else if (last) {
 		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 	}
+#else
+	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+#endif
 }
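
For reference, the unmodified function looks roughly like this around v5.14 (paraphrased from drivers/nvme/host/tcp.c, so details may differ slightly):

static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
		bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;
	bool empty;

	/* empty is true only if we are the first entry on an otherwise idle queue */
	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	/*
	 * if we're the first on the send_list and we can try to send
	 * directly, otherwise queue io_work. Also, only do that if we
	 * are on the same cpu, so we don't introduce contention.
	 */
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		/* direct send from the submitting context */
		queue->more_requests = !last;
		nvme_tcp_send_all(queue);
		queue->more_requests = false;
		mutex_unlock(&queue->send_mutex);
	} else if (last) {
		/* otherwise defer to the per-queue io_work */
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}
}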

To be honest, this bit of code has always bothered me: I don't understand how we can guarantee that the thread doesn't change CPU between the call to raw_smp_processor_id() and the trylock. I assume that even if this does occur, the fact that we hold queue->send_mutex is supposed to make sure that there are no atomicity issues, but I'm concerned about the logic that tries to determine whether there is any more pending work and reschedules io_work if and only if there is something to do. I can't pinpoint an issue from just reading the code, though. 

Does anyone have any thoughts on this? Is my patch proof that there is an issue in this code, or am I changing the program flow so significantly that we can't conclude anything? Are there any open issues with respect to command timeouts at the moment? 

Thanks in advance for any help 
Sam


* Re: Command timeouts with NVMe TCP kernel driver
  2021-08-30 13:30 Command timeouts with NVMe TCP kernel driver Samuel Jones
@ 2021-08-31  2:37 ` Keith Busch
  2021-08-31  2:44   ` Keith Busch
  2021-08-31 22:46   ` Samuel Jones
  0 siblings, 2 replies; 19+ messages in thread
From: Keith Busch @ 2021-08-31  2:37 UTC (permalink / raw)
  To: Samuel Jones; +Cc: linux-nvme

On Mon, Aug 30, 2021 at 03:30:29PM +0200, Samuel Jones wrote:
> Hi all, 
> 
> I'm experiencing command timeouts with recent versions of nvme-tcp driver. I have set up with a VM running 5.8 which is ok, but the same test load (100%RD 16K blocks) on a VM 5.11 or later, including 5.14-rc7, shows the same issue. The initatior complains as follows: 
> 
> Aug 30 14:58:05 centos7 kernel: nvme nvme0: queue 7: timeout request 0x10 type 4 
> Aug 30 14:58:05 centos7 kernel: nvme nvme0: starting error recovery 
> 
> My target is a Kalray target, but I have no trace of any outstanding commands when this situation occurs. Quite the opposite: The Kalray board observes that the initiator stops sending new requests on the queue some time before this command times out. 
> 
> I don't have any similar issues using SPDK as an initiator. I made the following modification to nvme-tcp code and my problem has disappeared: 
> 
> --- a/drivers/nvme/host/tcp.c 
> +++ b/drivers/nvme/host/tcp.c 
> @@ -287,6 +287,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req, 
> * directly, otherwise queue io_work. Also, only do that if we 
> * are on the same cpu, so we don't introduce contention. 
> */ 
> +#if 0 
> if (queue->io_cpu == raw_smp_processor_id() && 
> sync && empty && mutex_trylock(&queue->send_mutex)) { 
> queue->more_requests = !last; 
> @@ -296,6 +297,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req, 
> } else if (last) { 
> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work); 
> } 
> +#else 
> + queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work); 
> +#endif 
> } 
> 
> To be honest, this bit of code has always bothered me: I don't understand how we can guarantee that the thread doesn't change CPU between the call to raw_smp_processor_id() and the trylock. I assume that even if this does occur, the fact that we hold queue->send_mutex() is supposed to make sure that there are no atomicity issues, but I'm concerned about the logic that tries to determine if there is any more pending work and reschedule io_work if and only if there is something to do. I can't pinpoint an issue from just reading the code though.. 
> 
> Does anyone have any thoughts on this? Is my patch proof that there is an issue in this code, or am I likely to be changing the program flow so significantly that we can't conclude? Are there any open issues wrt command timeouts at the moment? 

I've seen similar, but I thought we'd fixed all those issues. Your PoC
patch indicates to me that there is still at least one other condition
that's not being considered. I don't think the possibility of CPU
rescheduling is the culprit, though.

One thing that looks problematic is if 'bd.last' is true but
nvme_tcp_queue_rq() fails early after enqueuing previous commands in the
sequence. That could cause the io_work to never kick. There should have
been something else in dmesg indicating that error, though, so I'll look
for some other possibility.
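
For context, the tail end of nvme_tcp_queue_rq() is roughly the following (paraphrased, not the exact source), and it is where bd->last feeds into the queueing decision:

	ret = nvme_tcp_setup_cmd_pdu(ns, rq);
	if (unlikely(ret))
		return ret;	/* requests queued earlier in the batch stay pending */

	blk_mq_start_request(rq);

	nvme_tcp_queue_request(req, true, bd->last);

	return BLK_STS_OK;

If the final request of a batch bails out before nvme_tcp_queue_request() runs, no call with last == true ever kicks io_work for the earlier commands.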


* Re: Command timeouts with NVMe TCP kernel driver
  2021-08-31  2:37 ` Keith Busch
@ 2021-08-31  2:44   ` Keith Busch
  2021-09-01 12:56     ` Sagi Grimberg
  2021-08-31 22:46   ` Samuel Jones
  1 sibling, 1 reply; 19+ messages in thread
From: Keith Busch @ 2021-08-31  2:44 UTC (permalink / raw)
  To: Samuel Jones; +Cc: linux-nvme

On Mon, Aug 30, 2021 at 07:37:12PM -0700, Keith Busch wrote:
> One thing that looks problematic is if 'bd.last' is true but
> nvme_tcp_queue_rq() fails early after enqueuing previous commands in the
> sequence. That could cause the io_work to never kick. 

Eh, your observed problem should not be the above: the block layer will
call .commit_rqs() anytime an error occurs during .queue_rq(), which
should kick io_work. So yeah, I'll look for other possibilities.


* Re: Command timeouts with NVMe TCP kernel driver
  2021-08-31  2:37 ` Keith Busch
  2021-08-31  2:44   ` Keith Busch
@ 2021-08-31 22:46   ` Samuel Jones
  2021-09-01 13:21     ` Sagi Grimberg
  1 sibling, 1 reply; 19+ messages in thread
From: Samuel Jones @ 2021-08-31 22:46 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme

> On Mon, Aug 30, 2021 at 03:30:29PM +0200, Samuel Jones wrote:
>> Hi all,
>> 
>> I'm experiencing command timeouts with recent versions of nvme-tcp driver. I
>> have set up with a VM running 5.8 which is ok, but the same test load (100%RD
>> 16K blocks) on a VM 5.11 or later, including 5.14-rc7, shows the same issue.
>> The initatior complains as follows:
>> 
>> Aug 30 14:58:05 centos7 kernel: nvme nvme0: queue 7: timeout request 0x10 type 4
>> Aug 30 14:58:05 centos7 kernel: nvme nvme0: starting error recovery
>> 
>> My target is a Kalray target, but I have no trace of any outstanding commands
>> when this situation occurs. Quite the opposite: The Kalray board observes that
>> the initiator stops sending new requests on the queue some time before this
>> command times out.
>> 
>> I don't have any similar issues using SPDK as an initiator. I made the following
>> modification to nvme-tcp code and my problem has disappeared:
>> 
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -287,6 +287,7 @@ static inline void nvme_tcp_queue_request(struct
>> nvme_tcp_request *req,
>> * directly, otherwise queue io_work. Also, only do that if we
>> * are on the same cpu, so we don't introduce contention.
>> */
>> +#if 0
>> if (queue->io_cpu == raw_smp_processor_id() &&
>> sync && empty && mutex_trylock(&queue->send_mutex)) {
>> queue->more_requests = !last;
>> @@ -296,6 +297,9 @@ static inline void nvme_tcp_queue_request(struct
>> nvme_tcp_request *req,
>> } else if (last) {
>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>> }
>> +#else
>> + queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>> +#endif
>> }
>> 
>> To be honest, this bit of code has always bothered me: I don't understand how we
>> can guarantee that the thread doesn't change CPU between the call to
>> raw_smp_processor_id() and the trylock. I assume that even if this does occur,
>> the fact that we hold queue->send_mutex() is supposed to make sure that there
>> are no atomicity issues, but I'm concerned about the logic that tries to
>> determine if there is any more pending work and reschedule io_work if and only
>> if there is something to do. I can't pinpoint an issue from just reading the
>> code though..
>> 
>> Does anyone have any thoughts on this? Is my patch proof that there is an issue
>> in this code, or am I likely to be changing the program flow so significantly
>> that we can't conclude? Are there any open issues wrt command timeouts at the
>> moment?
> 
> I've seen similar, but I thought we'd fixed all those issues. Your poc
> patch indicates to me that there is still at least another condition
> that's not being considered. I don't think the possibility of CPU
> rescheduling is the culprit, though.
> 
> One thing that looks problematic is if 'bd.last' is true but
> nvme_tcp_queue_rq() fails early after enqueuing previous commands in the
> sequence. That could cause the io_work to never kick. There should have
> been something else in dmesg indicating that error, though, so I'll look
> for some other possibility.
> 

Thanks for your help on this, Keith. I think I have a better idea about what is happening now. First of all, I have a large number of devices exposed over NVMe-TCP: I'm connecting to 4 remote NVMe devices, each of which is multi-namespace, so I have a total of 37 block devices at once. Then I run a userspace benchmark with 8 threads per block device generating IO to each of them. I'm not sure how the multi-queue block devices are layered over the X NVMe queues per device, but however it works, my system is on its knees. I have been timestamping the time spent by nvme-tcp threads inside kernel_sendpage() within nvme_tcp_try_send_cmd_pdu() when called from nvme_tcp_queue_rq(), and I'm getting some ridiculous outliers: up to 60 seconds. I am confident this is what is responsible for my timeouts. If I increase nvme_io_timeout to 300 seconds, my test case passes.
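
The instrumentation itself is roughly along these lines, wrapped around the sendpage call in nvme_tcp_try_send_cmd_pdu(); this is only an illustration of the approach, not my exact patch:

	u64 start = ktime_get_ns(), delta;

	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
			offset_in_page(pdu) + req->offset, len, flags);

	delta = ktime_get_ns() - start;
	if (delta > 1000000ULL)		/* log anything over 1 ms */
		pr_warn("nvme-tcp: sendpage took %llu ns\n", delta);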

What's interesting is that my quick patch, which removes the direct call to kernel_sendpage via queue_rq(), makes the system behave much better: no more outliers and no timeouts. I don't know much about the kernel network stack or the block layer, but I imagine the difference is due to the number of calls into the network stack that can happen in parallel. Is queue_rq executed from the context of the userspace caller? In that case I will have close to 300 threads active on this code path, whereas when we go through the workqueue, I guess we have at most N calls into the network stack at once, where N is the number of CPUs on the system? 

What do you think? Does my analysis make any sense?
Thanks
Sam






* Re: Command timeouts with NVMe TCP kernel driver
  2021-08-31  2:44   ` Keith Busch
@ 2021-09-01 12:56     ` Sagi Grimberg
  0 siblings, 0 replies; 19+ messages in thread
From: Sagi Grimberg @ 2021-09-01 12:56 UTC (permalink / raw)
  To: Keith Busch, Samuel Jones; +Cc: linux-nvme


>> One thing that looks problematic is if 'bd.last' is true but
>> nvme_tcp_queue_rq() fails early after enqueuing previous commands in the
>> sequence. That could cause the io_work to never kick.
> 
> Eh, your observed problem should not be the above: the block layer will
> call .commit_rqs() anytime an error occurs during .queue_rq(), which
> should kick io_work. So yeah, I'll look for other possibilities.

Also, any temporary failure (EAGAIN) should trigger write_space
that will also schedule io_work.


* Re: Command timeouts with NVMe TCP kernel driver
  2021-08-31 22:46   ` Samuel Jones
@ 2021-09-01 13:21     ` Sagi Grimberg
  2021-09-07  9:05       ` Samuel Jones
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2021-09-01 13:21 UTC (permalink / raw)
  To: Samuel Jones, Keith Busch; +Cc: linux-nvme


>>> Hi all,
>>>
>>> I'm experiencing command timeouts with recent versions of nvme-tcp driver. I
>>> have set up with a VM running 5.8 which is ok, but the same test load (100%RD
>>> 16K blocks) on a VM 5.11 or later, including 5.14-rc7, shows the same issue.
>>> The initatior complains as follows:
>>>
>>> Aug 30 14:58:05 centos7 kernel: nvme nvme0: queue 7: timeout request 0x10 type 4
>>> Aug 30 14:58:05 centos7 kernel: nvme nvme0: starting error recovery
>>>
>>> My target is a Kalray target, but I have no trace of any outstanding commands
>>> when this situation occurs. Quite the opposite: The Kalray board observes that
>>> the initiator stops sending new requests on the queue some time before this
>>> command times out.
>>>
>>> I don't have any similar issues using SPDK as an initiator. I made the following
>>> modification to nvme-tcp code and my problem has disappeared:
>>>
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -287,6 +287,7 @@ static inline void nvme_tcp_queue_request(struct
>>> nvme_tcp_request *req,
>>> * directly, otherwise queue io_work. Also, only do that if we
>>> * are on the same cpu, so we don't introduce contention.
>>> */
>>> +#if 0
>>> if (queue->io_cpu == raw_smp_processor_id() &&
>>> sync && empty && mutex_trylock(&queue->send_mutex)) {
>>> queue->more_requests = !last;
>>> @@ -296,6 +297,9 @@ static inline void nvme_tcp_queue_request(struct
>>> nvme_tcp_request *req,
>>> } else if (last) {
>>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>> }
>>> +#else
>>> + queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>> +#endif
>>> }
>>>
>>> To be honest, this bit of code has always bothered me: I don't understand how we
>>> can guarantee that the thread doesn't change CPU between the call to
>>> raw_smp_processor_id() and the trylock. I assume that even if this does occur,
>>> the fact that we hold queue->send_mutex() is supposed to make sure that there
>>> are no atomicity issues, but I'm concerned about the logic that tries to
>>> determine if there is any more pending work and reschedule io_work if and only
>>> if there is something to do. I can't pinpoint an issue from just reading the
>>> code though..
>>>
>>> Does anyone have any thoughts on this? Is my patch proof that there is an issue
>>> in this code, or am I likely to be changing the program flow so significantly
>>> that we can't conclude? Are there any open issues wrt command timeouts at the
>>> moment?
>>
>> I've seen similar, but I thought we'd fixed all those issues. Your poc
>> patch indicates to me that there is still at least another condition
>> that's not being considered. I don't think the possibility of CPU
>> rescheduling is the culprit, though.
>>
>> One thing that looks problematic is if 'bd.last' is true but
>> nvme_tcp_queue_rq() fails early after enqueuing previous commands in the
>> sequence. That could cause the io_work to never kick. There should have
>> been something else in dmesg indicating that error, though, so I'll look
>> for some other possibility.
>>
> 
> Thanks for your help on this Keith. I think I have a better idea about what is happening now. First of all, I have a large number of devices exposed over NVMe-TCP: I'm connecting to 4 remote NVMe devices each of which is multi-namespace, so I have a total of 37 block devices at once.

That shouldn't really matter; what matters is the number of controllers 
that you have. You have 4 controllers, but how many queues are there 
per controller, and how many cores does the VM have?

> Then I run a userspace benchmark that's running 8 threads per block device and generating IO to each of them. I'm not sure how the multi queue block devices are layered over the X NVMe queues per device, but however it works my system is on its knees. I have been timestamping the time spent by NVMe-TCP threads inside kernel_sendpage() within nvme_tcp_try_send_cmd_pdu() when called from nvme_tcp_queue_rq() and I'm getting some ridiculous outliers : up to 60 seconds. I am confident this is what is responsible for my timeouts. If I increase the nvme_io_timeout to 300 seconds, my test case passes.

That is strange indeed. What are you taking the timing on? Just
kernel_sendpage?

> What's interesting is my quick patch that removed the direct call to kernel_sendpage via queue_rq() makes the system behave much better: no more outliers and no timeouts. I don't know much about the kernel network stack or the block device layer but I imagine that the difference is due to the number of calls into the network stack that can happen in parallel. Is queue_rq executed from the context of the userspace caller? In this case I will have close to 300 threads active on this code path. Whereas when we go through the workqueue, I guess we have only N max calls into the network stack at once, where N = the number of CPUs on the system?
> 
> What do you think? Does my analysis make any sense?

I don't think this is related to the number of contexts you have running 
against the namespaces.

However, the difference between the submission from queue_rq and the
workqueue is that queue_rq may happen concurrently and hence may be
blocked by the socket mutex as the io_work does RX. Now I guess that
if RX is intensive (and it sounds like it is in your test) that the
TX path would be potentially starved...

But if the RX is so intensive, then I'd expect you to have timeouts
regardless as the context is not performing sends anyways, but maybe
this needs a bit more thought...

If this theory is correct, then a possible reason for the big delay
inside kernel_sendpage is that io_work just ingests as much as the
stack gives it (which can be a lot). If we were to break the recvs
into fixed chunks, it would allow sends between recvs and avoid this
potential starvation.

Would it be possible to try out this untested patch?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 032e8d582e6d..ec05b26cee38 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -837,6 +837,7 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
                 }
         }

+       desc->count -= consumed;
         return consumed;
  }

@@ -1107,7 +1108,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
         int consumed;

         rd_desc.arg.data = queue;
-       rd_desc.count = 1;
+       rd_desc.count = 0x4000; /* 16k */
         lock_sock(sk);
         queue->nr_cqe = 0;
         consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
--


* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-01 13:21     ` Sagi Grimberg
@ 2021-09-07  9:05       ` Samuel Jones
  2021-09-07 16:42         ` Keith Busch
  0 siblings, 1 reply; 19+ messages in thread
From: Samuel Jones @ 2021-09-07  9:05 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, linux-nvme

>>>> Hi all,
>>>>
>>>> I'm experiencing command timeouts with recent versions of nvme-tcp driver. I
>>>> have set up with a VM running 5.8 which is ok, but the same test load (100%RD
>>>> 16K blocks) on a VM 5.11 or later, including 5.14-rc7, shows the same issue.
>>>> The initatior complains as follows:
>>>>
>>>> Aug 30 14:58:05 centos7 kernel: nvme nvme0: queue 7: timeout request 0x10 type 4
>>>> Aug 30 14:58:05 centos7 kernel: nvme nvme0: starting error recovery
>>>>
>>>> My target is a Kalray target, but I have no trace of any outstanding commands
>>>> when this situation occurs. Quite the opposite: The Kalray board observes that
>>>> the initiator stops sending new requests on the queue some time before this
>>>> command times out.
>>>>
>>>> I don't have any similar issues using SPDK as an initiator. I made the following
>>>> modification to nvme-tcp code and my problem has disappeared:
>>>>
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -287,6 +287,7 @@ static inline void nvme_tcp_queue_request(struct
>>>> nvme_tcp_request *req,
>>>> * directly, otherwise queue io_work. Also, only do that if we
>>>> * are on the same cpu, so we don't introduce contention.
>>>> */
>>>> +#if 0
>>>> if (queue->io_cpu == raw_smp_processor_id() &&
>>>> sync && empty && mutex_trylock(&queue->send_mutex)) {
>>>> queue->more_requests = !last;
>>>> @@ -296,6 +297,9 @@ static inline void nvme_tcp_queue_request(struct
>>>> nvme_tcp_request *req,
>>>> } else if (last) {
>>>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>> }
>>>> +#else
>>>> + queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>> +#endif
>>>> }
>>>>
>>>> To be honest, this bit of code has always bothered me: I don't understand how we
>>>> can guarantee that the thread doesn't change CPU between the call to
>>>> raw_smp_processor_id() and the trylock. I assume that even if this does occur,
>>>> the fact that we hold queue->send_mutex() is supposed to make sure that there
>>>> are no atomicity issues, but I'm concerned about the logic that tries to
>>>> determine if there is any more pending work and reschedule io_work if and only
>>>> if there is something to do. I can't pinpoint an issue from just reading the
>>>> code though..
>>>>
>>>> Does anyone have any thoughts on this? Is my patch proof that there is an issue
>>>> in this code, or am I likely to be changing the program flow so significantly
>>>> that we can't conclude? Are there any open issues wrt command timeouts at the
>>>> moment?
>>>
>>> I've seen similar, but I thought we'd fixed all those issues. Your poc
>>> patch indicates to me that there is still at least another condition
>>> that's not being considered. I don't think the possibility of CPU
>>> rescheduling is the culprit, though.
>>>
>>> One thing that looks problematic is if 'bd.last' is true but
>>> nvme_tcp_queue_rq() fails early after enqueuing previous commands in the
>>> sequence. That could cause the io_work to never kick. There should have
>>> been something else in dmesg indicating that error, though, so I'll look
>>> for some other possibility.
>>>
>> 
>> Thanks for your help on this Keith. I think I have a better idea about what is
>> happening now. First of all, I have a large number of devices exposed over
>> NVMe-TCP: I'm connecting to 4 remote NVMe devices each of which is
>> multi-namespace, so I have a total of 37 block devices at once.
> 
> That shouldn't matter really, what matters is the number of controllers
> that you have. you have 4 controllers, and how many queues are there
> per-controller? and how many cores does the VM have?
> 
>> Then I run a userspace benchmark that's running 8 threads per block device and
>> generating IO to each of them. I'm not sure how the multi queue block devices
>> are layered over the X NVMe queues per device, but however it works my system
>> is on its knees. I have been timestamping the time spent by NVMe-TCP threads
>> inside kernel_sendpage() within nvme_tcp_try_send_cmd_pdu() when called from
>> nvme_tcp_queue_rq() and I'm getting some ridiculous outliers : up to 60
>> seconds. I am confident this is what is responsible for my timeouts. If I
>> increase the nvme_io_timeout to 300 seconds, my test case passes.
> 
> That is strange indeed. what are you taking the timing on? just
> kernel_sendpage?
> 
>> What's interesting is my quick patch that removed the direct call to
>> kernel_sendpage via queue_rq() makes the system behave much better: no more
>> outliers and no timeouts. I don't know much about the kernel network stack or
>> the block device layer but I imagine that the difference is due to the number
>> of calls into the network stack that can happen in parallel. Is queue_rq
>> executed from the context of the userspace caller? In this case I will have
>> close to 300 threads active on this code path. Whereas when we go through the
>> workqueue, I guess we have only N max calls into the network stack at once,
>> where N = the number of CPUs on the system?
>> 
>> What do you think? Does my analysis make any sense?
> 
> I don't think this is related to the number of contexts you have running
> against the namespaces.
> 
> However, the difference between the submission from queue_rq and the
> workqueue is that queue_rq may happen concurrently and hence may be
> blocked by the socket mutex as the io_work does RX. Now I guess that
> if RX is intensive (and it sounds like it is in your test) that the
> TX path would be potentially starved...
> 
> But if the RX is so intensive, then I'd expect you to have timeouts
> regardless as the context is not performing sends anyways, but maybe
> this needs a bit more thought...
> 
> If this theory is correct, then a possible reason for the big delay
> inside kernel_sendpage is due to the fact that io_work just ingest as
> much as the stack gives it (which can be a lot). If we were to break
> the recvs into fixed chunks, it would allow to send between recvs and
> avoid this potential starvation.
> 
> Will it be possible to tryout this untested patch?
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 032e8d582e6d..ec05b26cee38 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -837,6 +837,7 @@ static int nvme_tcp_recv_skb(read_descriptor_t
> *desc, struct sk_buff *skb,
>                 }
>         }
> 
> +       desc->count -= consumed;
>         return consumed;
>  }
> 
> @@ -1107,7 +1108,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue
> *queue)
>         int consumed;
> 
>         rd_desc.arg.data = queue;
> -       rd_desc.count = 1;
> +       rd_desc.count = 0x4000; /* 16k */
>         lock_sock(sk);
>         queue->nr_cqe = 0;
>         consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
> --

Hi Sagi, Keith, all,

I think I have a better idea about what is happening now. First of all, answers to Sagi's questions: I have 4 controllers with 8 queues each, and the VM has 16 CPUs. I was timing kernel_sendpage itself. I tried your patch, Sagi, but it didn't make any difference. Indeed, given that the queue stops transmitting requests, I guess the RX path gets very quiet very fast.

After a lot of time spent timing various functions in kernel_sendpage, the only thing I was able to observe was that my thread is descheduled and not rescheduled again for a LONG time. I think what is happening is the following:

1. Userspace context grabs send_mutex via queue_rq and calls into kernel_sendpage. This context is pinned to a CPU X because that's the way my benchmark works.
2. Userspace context is descheduled.
3. nvme_tcp_io_work is scheduled on the same CPU X because it so happens that io_cpu == X. (I have hundreds of threads which are statically assigned to CPUs and spread over all the CPUs of the VM, so there are necessarily some userspace threads whose CPU coincides with io_cpu).
4. nvme_tcp_io_work obviously can't grab send_mutex because it's held by the userspace context. But because req_list is not empty, it doesn't yield; it keeps spinning in its loop until its time quota expires (see the sketch of the loop just after this list).
5. Since pending = true, nvme_tcp_io_work reschedules itself for immediate execution. Because its workqueue is flagged WQ_HIGHPRI, I guess this means it is rescheduled very soon/almost immediately, and my poor userspace context doesn't get enough CPU to make reasonable forward progress. We find ourselves in a kind of livelock.
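
For reference, here is the loop in question, lightly paraphrased from nvme_tcp_io_work() in 5.14 (exact details may differ):

	unsigned long deadline = jiffies + msecs_to_jiffies(1);

	do {
		bool pending = false;
		int result;

		if (mutex_trylock(&queue->send_mutex)) {
			result = nvme_tcp_try_send(queue);
			mutex_unlock(&queue->send_mutex);
			if (result > 0)
				pending = true;
			else if (unlikely(result < 0))
				break;
		} else
			/* can't send: someone else (e.g. queue_rq) holds send_mutex */
			pending = !llist_empty(&queue->req_list);

		result = nvme_tcp_try_recv(queue);
		if (result > 0)
			pending = true;
		else if (unlikely(result < 0))
			return;

		if (!pending)
			return;

	} while (!time_after(jiffies, deadline));	/* ~1ms time quota */

	/* quota exhausted but still pending: requeue ourselves immediately */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);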

This is consistent with the fact that my problem disappears if I do any one of the following three things:

1. Modify my userspace benchmark to avoid pinning threads to CPUs => direct send path can execute on a different CPU and make forward progress
2. Modify nvme-tcp to remove the "direct send" path in queue_rq and always post to the work queue => no contention between direct send path and the workqueue
3. Modify the tcp wq to remove the WQ_HIGHPRI flag. => I guess this makes the scheduler more friendly towards my userspace thread

Does this make sense? What do you think is the best way to fix this? I guess the WQ_HIGHPRI flag is there for a reason, and that the "direct send" path can provide lower latency in some cases. What about a heuristic in io_work that will prevent it from looping indefinitely after a certain number of failed attempts to claim the send mutex?
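
Purely to illustrate the idea (hypothetical and untested; trylock_failures and the retry limit are invented here), the else branch of the trylock in io_work could become something like:

	} else if (++trylock_failures < 3) {
		pending = !llist_empty(&queue->req_list);
	} else {
		/*
		 * Hypothetical: stop spinning after repeated failed
		 * trylocks so the send_mutex holder can run. This
		 * assumes something will re-kick io_work if requests
		 * remain queued, which is the part that would need
		 * to be proven safe.
		 */
		return;
	}

where trylock_failures is a local counter reset to 0 whenever the trylock succeeds.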

Thanks for your help
Sam






* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-07  9:05       ` Samuel Jones
@ 2021-09-07 16:42         ` Keith Busch
  2021-09-07 20:46           ` Keith Busch
  0 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2021-09-07 16:42 UTC (permalink / raw)
  To: Samuel Jones; +Cc: Sagi Grimberg, linux-nvme

On Tue, Sep 07, 2021 at 11:05:53AM +0200, Samuel Jones wrote:
> Hi Sagi, Keith, all,
> 
> I think I have a better idea about what is happening now. First of all, answers to Sagi's questions: I have 4 controllers with 8 queues each, the VM has 16 CPUs. I was timing on kernel_sendpage. I tried your patch Sagi but it didn't make any difference. Indeed, given that the queue stops transmitting requests, I guess the RX path gets very quiet very fast.
> 
> After a lot of time spent timing various functions in kernel_sendpage, the only thing I was able to observe was that my thread is descheduled and not rescheduled again for a LONG time. I think what is happening is the following.:
> 
> 1. Userspace context grabs send_mutex via queue_rq and calls into kernel_sendpage. This context is pinned to a CPU X because that's the way my benchmark works.
> 2. Userspace context is descheduled.
> 3. nvme_tcp_io_work is scheduled on the same CPU X because it so happens that io_cpu == X. (I have hundreds of threads which are statically assigned to CPUs and spread over all the CPUs of the VM, so there are necessarily some userspace threads whose CPU coincides with io_cpu).
> 4. nvme_tcp_io_work obviously can't grab send_mutex because it's held by the userspace. But because req_list is not empty, it doesn't yield but keeps on spinning in the loop until it expires.
> 5. Since pending = true nvme_tcp_io_work re-schedules itself for immediate execution. Because it's flagged as HIGHPRI, I guess this means it is rescheduled very soon/almost immediately, and my poor userspace context doesn't get enough CPU to make reasonable forward progress. We find ourselves in a kind of livelock.
> 
> This seems coherent with the fact that my problem disappears if I do any one of the 3 following things:
> 
> 1. Modify my userspace benchmark to avoid pinning threads to CPUs => direct send path can execute on a different CPU and make forward progress
> 2. Modify nvme-tcp to remove the "direct send" path in queue_rq and always post to the work queue => no contention between direct send path and the workqueue
> 3. Modify the tcp wq to remove the WQ_HIGHPRI flag. => I guess this makes the scheduler more friendly towards my userspace thread
> 
> Does this make sense? What do you think is the best way to fix this? I guess the WQ_HIGHPRI flag is there for a reason, and that the "direct send" path can provide lower latency in some cases. What about a heuristic in io_work that will prevent it from looping indefinitely after a certain number of failed attempts to claim the send mutex?

Sounds possible. The "pending" check on the req_list was to fix a race
condition where the io_work could miss handling a request, resulting in
a different command timeout. It sounds like that could make the io_work
spin without having anything to do, while starving the lower-priority
task. I'll try to think of another way to handle it.


* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-07 16:42         ` Keith Busch
@ 2021-09-07 20:46           ` Keith Busch
  2021-09-08 10:29             ` Samuel Jones
  2021-09-08 20:55             ` Sagi Grimberg
  0 siblings, 2 replies; 19+ messages in thread
From: Keith Busch @ 2021-09-07 20:46 UTC (permalink / raw)
  To: Samuel Jones; +Cc: Sagi Grimberg, linux-nvme

On Tue, Sep 07, 2021 at 09:42:36AM -0700, Keith Busch wrote:
> On Tue, Sep 07, 2021 at 11:05:53AM +0200, Samuel Jones wrote:
> > Hi Sagi, Keith, all,
> > 
> > I think I have a better idea about what is happening now. First of all, answers to Sagi's questions: I have 4 controllers with 8 queues each, the VM has 16 CPUs. I was timing on kernel_sendpage. I tried your patch Sagi but it didn't make any difference. Indeed, given that the queue stops transmitting requests, I guess the RX path gets very quiet very fast.
> > 
> > After a lot of time spent timing various functions in kernel_sendpage, the only thing I was able to observe was that my thread is descheduled and not rescheduled again for a LONG time. I think what is happening is the following.:
> > 
> > 1. Userspace context grabs send_mutex via queue_rq and calls into kernel_sendpage. This context is pinned to a CPU X because that's the way my benchmark works.
> > 2. Userspace context is descheduled.
> > 3. nvme_tcp_io_work is scheduled on the same CPU X because it so happens that io_cpu == X. (I have hundreds of threads which are statically assigned to CPUs and spread over all the CPUs of the VM, so there are necessarily some userspace threads whose CPU coincides with io_cpu).
> > 4. nvme_tcp_io_work obviously can't grab send_mutex because it's held by the userspace. But because req_list is not empty, it doesn't yield but keeps on spinning in the loop until it expires.
> > 5. Since pending = true nvme_tcp_io_work re-schedules itself for immediate execution. Because it's flagged as HIGHPRI, I guess this means it is rescheduled very soon/almost immediately, and my poor userspace context doesn't get enough CPU to make reasonable forward progress. We find ourselves in a kind of livelock.
> > 
> > This seems coherent with the fact that my problem disappears if I do any one of the 3 following things:
> > 
> > 1. Modify my userspace benchmark to avoid pinning threads to CPUs => direct send path can execute on a different CPU and make forward progress
> > 2. Modify nvme-tcp to remove the "direct send" path in queue_rq and always post to the work queue => no contention between direct send path and the workqueue
> > 3. Modify the tcp wq to remove the WQ_HIGHPRI flag. => I guess this makes the scheduler more friendly towards my userspace thread
> > 
> > Does this make sense? What do you think is the best way to fix this? I guess the WQ_HIGHPRI flag is there for a reason, and that the "direct send" path can provide lower latency in some cases. What about a heuristic in io_work that will prevent it from looping indefinitely after a certain number of failed attempts to claim the send mutex?
> 
> Sounds possible. The "pending" check on the req_list was to fix a race
> condition where the io_work could miss handling a request, resulting in
> a different command time out. It sounds like that could make the io_work
> spin without having anything to do, while starving the lower priority
> task. I'll try to think of another way to handle it.

Instead of relying on io_work to determine if there's pending requests,
the queuing action can try to re-schedule it only after the send_mutex
is released, and I think that should address the issue you've described
while still fixing the IO timeout the "pending" trick was addressing.
Can you try the below patch?

Sagi, does this look okay to you?

---
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index e2ab12f3f51c..e4249b7dc056 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -274,6 +274,12 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
 	} while (ret > 0);
 }
 
+static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
+{
+	return !list_empty(&queue->send_list) ||
+		!llist_empty(&queue->req_list) || queue->more_requests;
+}
+
 static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 		bool sync, bool last)
 {
@@ -294,9 +300,10 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 		nvme_tcp_send_all(queue);
 		queue->more_requests = false;
 		mutex_unlock(&queue->send_mutex);
-	} else if (last) {
-		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 	}
+
+	if (last && nvme_tcp_queue_more(queue))
+		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 }
 
 static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
@@ -906,12 +913,6 @@ static void nvme_tcp_state_change(struct sock *sk)
 	read_unlock_bh(&sk->sk_callback_lock);
 }
 
-static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
-{
-	return !list_empty(&queue->send_list) ||
-		!llist_empty(&queue->req_list) || queue->more_requests;
-}
-
 static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
 {
 	queue->request = NULL;
@@ -1145,8 +1146,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
 				pending = true;
 			else if (unlikely(result < 0))
 				break;
-		} else
-			pending = !llist_empty(&queue->req_list);
+		}
 
 		result = nvme_tcp_try_recv(queue);
 		if (result > 0)
--


* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-07 20:46           ` Keith Busch
@ 2021-09-08 10:29             ` Samuel Jones
  2021-09-08 20:55             ` Sagi Grimberg
  1 sibling, 0 replies; 19+ messages in thread
From: Samuel Jones @ 2021-09-08 10:29 UTC (permalink / raw)
  To: Keith Busch; +Cc: Sagi Grimberg, linux-nvme

On Tue, Sep 07, 2021 at 09:42:36AM -0700, Keith Busch wrote:
> On Tue, Sep 07, 2021 at 11:05:53AM +0200, Samuel Jones wrote:
> > Hi Sagi, Keith, all,
> > 
> > I think I have a better idea about what is happening now. First of all, answers to Sagi's questions: I have 4 controllers with 8 queues each, the VM has 16 CPUs. I was timing on kernel_sendpage. I tried your patch Sagi but it didn't make any difference. Indeed, given that the queue stops transmitting requests, I guess the RX path gets very quiet very fast.
> > 
> > After a lot of time spent timing various functions in kernel_sendpage, the only thing I was able to observe was that my thread is descheduled and not rescheduled again for a LONG time. I think what is happening is the following.:
> > 
> > 1. Userspace context grabs send_mutex via queue_rq and calls into kernel_sendpage. This context is pinned to a CPU X because that's the way my benchmark works.
> > 2. Userspace context is descheduled.
> > 3. nvme_tcp_io_work is scheduled on the same CPU X because it so happens that io_cpu == X. (I have hundreds of threads which are statically assigned to CPUs and spread over all the CPUs of the VM, so there are necessarily some userspace threads whose CPU coincides with io_cpu).
> > 4. nvme_tcp_io_work obviously can't grab send_mutex because it's held by the userspace. But because req_list is not empty, it doesn't yield but keeps on spinning in the loop until it expires.
> > 5. Since pending = true nvme_tcp_io_work re-schedules itself for immediate execution. Because it's flagged as HIGHPRI, I guess this means it is rescheduled very soon/almost immediately, and my poor userspace context doesn't get enough CPU to make reasonable forward progress. We find ourselves in a kind of livelock.
> > 
> > This seems coherent with the fact that my problem disappears if I do any one of the 3 following things:
> > 
> > 1. Modify my userspace benchmark to avoid pinning threads to CPUs => direct send path can execute on a different CPU and make forward progress
> > 2. Modify nvme-tcp to remove the "direct send" path in queue_rq and always post to the work queue => no contention between direct send path and the workqueue
> > 3. Modify the tcp wq to remove the WQ_HIGHPRI flag. => I guess this makes the scheduler more friendly towards my userspace thread
> > 
> > Does this make sense? What do you think is the best way to fix this? I guess the WQ_HIGHPRI flag is there for a reason, and that the "direct send" path can provide lower latency in some cases. What about a heuristic in io_work that will prevent it from looping indefinitely after a certain number of failed attempts to claim the send mutex?
> 
> Sounds possible. The "pending" check on the req_list was to fix a race
> condition where the io_work could miss handling a request, resulting in
> a different command time out. It sounds like that could make the io_work
> spin without having anything to do, while starving the lower priority
> task. I'll try to think of another way to handle it.

Instead of relying on io_work to determine if there's pending requests,
the queuing action can try to re-schedule it only after the send_mutex
is released, and I think that should address the issue you've described
while still fixing the IO timeout the "pending" trick was addressing.
Can you try the below patch?

Sagi, does this look okay to you?

---
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index e2ab12f3f51c..e4249b7dc056 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -274,6 +274,12 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
 	} while (ret > 0);
 }
 
+static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
+{
+	return !list_empty(&queue->send_list) ||
+		!llist_empty(&queue->req_list) || queue->more_requests;
+}
+
 static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 		bool sync, bool last)
 {
@@ -294,9 +300,10 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 		nvme_tcp_send_all(queue);
 		queue->more_requests = false;
 		mutex_unlock(&queue->send_mutex);
-	} else if (last) {
-		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 	}
+
+	if (last && nvme_tcp_queue_more(queue))
+		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 }
 
 static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
@@ -906,12 +913,6 @@ static void nvme_tcp_state_change(struct sock *sk)
 	read_unlock_bh(&sk->sk_callback_lock);
 }
 
-static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
-{
-	return !list_empty(&queue->send_list) ||
-		!llist_empty(&queue->req_list) || queue->more_requests;
-}
-
 static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
 {
 	queue->request = NULL;
@@ -1145,8 +1146,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
 				pending = true;
 			else if (unlikely(result < 0))
 				break;
-		} else
-			pending = !llist_empty(&queue->req_list);
+		}
 
 		result = nvme_tcp_try_recv(queue);
 		if (result > 0)
--

Hi Keith,

Your patch works correctly on my setup, no more timeout issues. Thanks!

Sam



* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-07 20:46           ` Keith Busch
  2021-09-08 10:29             ` Samuel Jones
@ 2021-09-08 20:55             ` Sagi Grimberg
  2021-09-08 21:50               ` Keith Busch
  1 sibling, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2021-09-08 20:55 UTC (permalink / raw)
  To: Keith Busch, Samuel Jones; +Cc: linux-nvme


>>> 1. Userspace context grabs send_mutex via queue_rq and calls into kernel_sendpage. This context is pinned to a CPU X because that's the way my benchmark works.
>>> 2. Userspace context is descheduled.
>>> 3. nvme_tcp_io_work is scheduled on the same CPU X because it so happens that io_cpu == X. (I have hundreds of threads which are statically assigned to CPUs and spread over all the CPUs of the VM, so there are necessarily some userspace threads whose CPU coincides with io_cpu).
>>> 4. nvme_tcp_io_work obviously can't grab send_mutex because it's held by the userspace. But because req_list is not empty, it doesn't yield but keeps on spinning in the loop until it expires.
>>> 5. Since pending = true nvme_tcp_io_work re-schedules itself for immediate execution. Because it's flagged as HIGHPRI, I guess this means it is rescheduled very soon/almost immediately, and my poor userspace context doesn't get enough CPU to make reasonable forward progress. We find ourselves in a kind of livelock.
>>>
>>> This seems coherent with the fact that my problem disappears if I do any one of the 3 following things:
>>>
>>> 1. Modify my userspace benchmark to avoid pinning threads to CPUs => direct send path can execute on a different CPU and make forward progress
>>> 2. Modify nvme-tcp to remove the "direct send" path in queue_rq and always post to the work queue => no contention between direct send path and the workqueue
>>> 3. Modify the tcp wq to remove the WQ_HIGHPRI flag. => I guess this makes the scheduler more friendly towards my userspace thread
>>>
>>> Does this make sense? What do you think is the best way to fix this? I guess the WQ_HIGHPRI flag is there for a reason, and that the "direct send" path can provide lower latency in some cases. What about a heuristic in io_work that will prevent it from looping indefinitely after a certain number of failed attempts to claim the send mutex?
>>
>> Sounds possible. The "pending" check on the req_list was to fix a race
>> condition where the io_work could miss handling a request, resulting in
>> a different command time out. It sounds like that could make the io_work
>> spin without having anything to do, while starving the lower priority
>> task. I'll try to think of another way to handle it.
> 
> Instead of relying on io_work to determine if there's pending requests,
> the queuing action can try to re-schedule it only after the send_mutex
> is released, and I think that should address the issue you've described
> while still fixing the IO timeout the "pending" trick was addressing.
> Can you try the below patch?
> 
> Sagi, does this look okay to you?

It looks OK to me, but on the other hand the former
one looked perfectly fine to me. I'm trying to understand
if we can have a race condition hiding here....

> ---
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index e2ab12f3f51c..e4249b7dc056 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -274,6 +274,12 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>   	} while (ret > 0);
>   }
>   
> +static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
> +{
> +	return !list_empty(&queue->send_list) ||
> +		!llist_empty(&queue->req_list) || queue->more_requests;
> +}
> +
>   static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>   		bool sync, bool last)
>   {
> @@ -294,9 +300,10 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>   		nvme_tcp_send_all(queue);
>   		queue->more_requests = false;
>   		mutex_unlock(&queue->send_mutex);
> -	} else if (last) {
> -		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>   	}
> +
> +	if (last && nvme_tcp_queue_more(queue))
> +		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);

Basically now the last request in the batch will not trigger
unconditionally (if the direct-send path was not taken), but only if it
sees queued requests, which in theory could be reaped by a running
io_work, that would result in an empty io_work invocation. On one hand
it is undesired, on the other hand should be harmless as it shouldn't
miss an invocation... So it looks reasonable to me.

Sam, can you run some more extensive I/O testing with this (different
patterns of r/w and block sizes)? Also Keith, can you ask the WD folks
to also take it for a spin?

All of these issues have a common theme, which is less I/O queues than
CPUs for each controller which make RX/TX and queue_rq/io_work compete
and usually controllers with smaller MDTS (that trigger multiple R2Ts
per request and uncovered some races in this area). So testing those
would be beneficial.


* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-08 20:55             ` Sagi Grimberg
@ 2021-09-08 21:50               ` Keith Busch
  2021-09-09  6:36                 ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2021-09-08 21:50 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Samuel Jones, linux-nvme

On Wed, Sep 08, 2021 at 11:55:59PM +0300, Sagi Grimberg wrote:
> 
> > > > 1. Userspace context grabs send_mutex via queue_rq and calls into kernel_sendpage. This context is pinned to a CPU X because that's the way my benchmark works.
> > > > 2. Userspace context is descheduled.
> > > > 3. nvme_tcp_io_work is scheduled on the same CPU X because it so happens that io_cpu == X. (I have hundreds of threads which are statically assigned to CPUs and spread over all the CPUs of the VM, so there are necessarily some userspace threads whose CPU coincides with io_cpu).
> > > > 4. nvme_tcp_io_work obviously can't grab send_mutex because it's held by the userspace. But because req_list is not empty, it doesn't yield but keeps on spinning in the loop until it expires.
> > > > 5. Since pending = true nvme_tcp_io_work re-schedules itself for immediate execution. Because it's flagged as HIGHPRI, I guess this means it is rescheduled very soon/almost immediately, and my poor userspace context doesn't get enough CPU to make reasonable forward progress. We find ourselves in a kind of livelock.
> > > > 
> > > > This seems coherent with the fact that my problem disappears if I do any one of the 3 following things:
> > > > 
> > > > 1. Modify my userspace benchmark to avoid pinning threads to CPUs => direct send path can execute on a different CPU and make forward progress
> > > > 2. Modify nvme-tcp to remove the "direct send" path in queue_rq and always post to the work queue => no contention between direct send path and the workqueue
> > > > 3. Modify the tcp wq to remove the WQ_HIGHPRI flag. => I guess this makes the scheduler more friendly towards my userspace thread
> > > > 
> > > > Does this make sense? What do you think is the best way to fix this? I guess the WQ_HIGHPRI flag is there for a reason, and that the "direct send" path can provide lower latency in some cases. What about a heuristic in io_work that will prevent it from looping indefinitely after a certain number of failed attempts to claim the send mutex?
> > > 
> > > Sounds possible. The "pending" check on the req_list was to fix a race
> > > condition where the io_work could miss handling a request, resulting in
> > > a different command time out. It sounds like that could make the io_work
> > > spin without having anything to do, while starving the lower priority
> > > task. I'll try to think of another way to handle it.
> > 
> > Instead of relying on io_work to determine if there's pending requests,
> > the queuing action can try to re-schedule it only after the send_mutex
> > is released, and I think that should address the issue you've described
> > while still fixing the IO timeout the "pending" trick was addressing.
> > Can you try the below patch?
> > 
> > Sagi, does this look okay to you?
> 
> It looks OK to me, but on the other hand the former
> one looked perfectly fine to me. 

It's a priority inversion. The .queue_rq() task is holding onto the
'req_list' resource via send_mutex, but it can't schedule in because
io_work is higher priority and will remain in TASK_RUNNING while
'req_list' is non-empty.

I think the user must have pinned the .queue_rq task to that specific
CPU, otherwise I believe it would have scheduled somewhere else to
unblock forward progress.

> I'm trying to understand
> if we can have a race condition hiding here....
> 
> > ---
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index e2ab12f3f51c..e4249b7dc056 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -274,6 +274,12 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
> >   	} while (ret > 0);
> >   }
> > +static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
> > +{
> > +	return !list_empty(&queue->send_list) ||
> > +		!llist_empty(&queue->req_list) || queue->more_requests;
> > +}
> > +
> >   static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
> >   		bool sync, bool last)
> >   {
> > @@ -294,9 +300,10 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
> >   		nvme_tcp_send_all(queue);
> >   		queue->more_requests = false;
> >   		mutex_unlock(&queue->send_mutex);
> > -	} else if (last) {
> > -		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> >   	}
> > +
> > +	if (last && nvme_tcp_queue_more(queue))
> > +		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> 
> Basically now the last request in the batch will not trigger io_work
> unconditionally (if the direct-send path was not taken), but only if it
> sees queued requests, which in theory could already have been reaped by
> a running io_work, resulting in an empty io_work invocation. On one
> hand that is undesirable; on the other hand it should be harmless, as it
> shouldn't miss an invocation... So it looks reasonable to me.

I can change the check to match .commit_rqs(), which just
considers 'llist_empty(&queue->req_list)'.

That should be just as safe and result in less unnecessary io_work
scheduling, for the same reason commit_rqs() uses it. In any case, I
don't think an occasional unneeded work check is really that harmful.
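
For comparison, .commit_rqs() in this driver is roughly:

static void nvme_tcp_commit_rqs(struct blk_mq_hw_ctx *hctx)
{
	struct nvme_tcp_queue *queue = hctx->driver_data;

	if (!llist_empty(&queue->req_list))
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}

so the tail of nvme_tcp_queue_request() would become something like this
(just a sketch of the alternative, not a tested patch):

	if (last && !llist_empty(&queue->req_list))
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);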

 
> Sam, can you run some more extensive I/O testing with this (different
> patterns of r/w and block sizes)? Also Keith, can you ask the WD folks
> to also take it for a spin?

Sure, I'll request this patch for their next test cycle.

> All of these issues have a common theme, which is less I/O queues than
> CPUs for each controller which make RX/TX and queue_rq/io_work compete
> and usually controllers with smaller MDTS (that trigger multiple R2Ts
> per request and uncovered some races in this area). So testing those
> would be beneficial.


* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-08 21:50               ` Keith Busch
@ 2021-09-09  6:36                 ` Sagi Grimberg
  2021-09-09  8:01                   ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2021-09-09  6:36 UTC (permalink / raw)
  To: Keith Busch; +Cc: Samuel Jones, linux-nvme



On 9/9/21 12:50 AM, Keith Busch wrote:
> On Wed, Sep 08, 2021 at 11:55:59PM +0300, Sagi Grimberg wrote:
>>
>>>>> 1. Userspace context grabs send_mutex via queue_rq and calls into kernel_sendpage. This context is pinned to a CPU X because that's the way my benchmark works.
>>>>> 2. Userspace context is descheduled.
>>>>> 3. nvme_tcp_io_work is scheduled on the same CPU X because it so happens that io_cpu == X. (I have hundreds of threads which are statically assigned to CPUs and spread over all the CPUs of the VM, so there are necessarily some userspace threads whose CPU coincides with io_cpu).
>>>>> 4. nvme_tcp_io_work obviously can't grab send_mutex because it's held by the userspace. But because req_list is not empty, it doesn't yield but keeps on spinning in the loop until it expires.
>>>>> 5. Since pending = true, nvme_tcp_io_work reschedules itself for immediate execution. Because it's flagged as HIGHPRI, I guess this means it is rescheduled very soon/almost immediately, and my poor userspace context doesn't get enough CPU to make reasonable forward progress. We find ourselves in a kind of livelock.
>>>>>
>>>>> This is consistent with the fact that my problem disappears if I do any one of the following 3 things:
>>>>>
>>>>> 1. Modify my userspace benchmark to avoid pinning threads to CPUs => direct send path can execute on a different CPU and make forward progress
>>>>> 2. Modify nvme-tcp to remove the "direct send" path in queue_rq and always post to the work queue => no contention between direct send path and the workqueue
>>>>> 3. Modify the tcp wq to remove the WQ_HIGHPRI flag. => I guess this makes the scheduler more friendly towards my userspace thread
>>>>>
>>>>> Does this make sense? What do you think is the best way to fix this? I guess the WQ_HIGHPRI flag is there for a reason, and that the "direct send" path can provide lower latency in some cases. What about a heuristic in io_work that will prevent it from looping indefinitely after a certain number of failed attempts to claim the send mutex?
>>>>
>>>> Sounds possible. The "pending" check on the req_list was to fix a race
>>>> condition where the io_work could miss handling a request, resulting in
>>>> a different command timing out. It sounds like that could make the io_work
>>>> spin without having anything to do, while starving the lower priority
>>>> task. I'll try to think of another way to handle it.
>>>
>>> Instead of relying on io_work to determine if there's pending requests,
>>> the queuing action can try to re-schedule it only after the send_mutex
>>> is released, and I think that should address the issue you've described
>>> while still fixing the IO timeout the "pending" trick was addressing.
>>> Can you try the below patch?
>>>
>>> Sagi, does this look okay to you?
>>
>> It looks OK to me, but on the other hand the former
>> one looked perfectly fine to me.
> 
> It's a priority inversion. The .queue_rq() task is holding onto the
> 'req_list' resource via send_mutex, but it can't schedule in because
> io_work is higher priority and will remain in TASK_RUNNING while
> 'req_list' is non-empty.

Yes, I understand the issue.

> I think the user must have pinned the .queue_rq task to that specific
> CPU, otherwise I believe it would have scheduled somewhere else to
> unblock forward progress.

Yes, I agree.

>> Basically now the last request in the batch will not trigger io_work
>> unconditionally (if the direct-send path was not taken), but only if it
>> sees queued requests, which in theory could already have been reaped by
>> a running io_work, resulting in an empty io_work invocation. On one
>> hand that is undesirable; on the other hand it should be harmless, as it
>> shouldn't miss an invocation... So it looks reasonable to me.
> 
> I can change the check to match .commit_rqs(), which just
> considers 'llist_empty(&queue->req_list)'.
> 
> That should be just as safe and result in less unnecessary io_work
> scheduling, for the same reason commit_rqs() uses it. In any case, I
> don't think an occasional unneeded work check is really that harmful.

What we need to think about is whether there is a case where we miss
scheduling the io_work.
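
To make that concrete, here is the patched queuing path with the intended
ordering annotated; the head of the function is reproduced from the
current code, and the comments only sketch the argument rather than prove
it:

static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
		bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;
	bool empty;

	/* (1) the request is published on the lockless list first */
	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	/* (2) opportunistic direct send, only if nobody holds send_mutex */
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		queue->more_requests = !last;
		nvme_tcp_send_all(queue);	/* drains req_list/send_list */
		queue->more_requests = false;
		mutex_unlock(&queue->send_mutex);
	}

	/*
	 * (3) the io_work kick now runs only after the unlock: a request
	 * published in (1) is either drained by whoever held send_mutex or
	 * still visible to this check, so the last request of a batch
	 * should no longer be left behind without io_work scheduled.
	 */
	if (last && nvme_tcp_queue_more(queue))
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}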

>> Sam, can you run some more extensive I/O testing with this (different
>> patterns of r/w and block sizes)? Also Keith, can you ask the WD folks
>> to also take it for a spin?
> 
> Sure, I'll request this patch for their next test cycle.

Great.


* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-09  6:36                 ` Sagi Grimberg
@ 2021-09-09  8:01                   ` Sagi Grimberg
  2021-09-09  8:09                     ` Samuel Jones
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2021-09-09  8:01 UTC (permalink / raw)
  To: Keith Busch; +Cc: Samuel Jones, linux-nvme


>>> Sam, can you run some more extensive I/O testing with this (different
>>> patterns of r/w and block sizes)? Also Keith, can you ask the WD folks
>>> to also take it for a spin?
>>
>> Sure, I'll request this patch for their next test cycle.
> 
> Great.

Keith, can you send a proper patch with a Fixes: tag and a descriptive
title so this can go to stable.

And let's wait for Sam to do some more extensive testing. Ideally we can
also hear from the WD folks that this doesn't cause any regressions.


* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-09  8:01                   ` Sagi Grimberg
@ 2021-09-09  8:09                     ` Samuel Jones
  2021-09-09  8:54                       ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Samuel Jones @ 2021-09-09  8:09 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, linux-nvme

>>>> Sam, can you run some more extensive I/O testing with this (different
>>>> patterns of r/w and block sizes)? Also Keith, can you ask the WD folks
>>>> to also take it for a spin?
>>>
>>> Sure, I'll request this patch for their next test cycle.
>> 
>> Great.
> 
> Keith, can you send a proper patch with a Fixes: tag and a descriptive
> title so this can go to stable.
> 
> And let's wait for Sam to do some more extensive testing. Ideally we can
> also hear from the WD folks that this doesn't cause any regressions.

OK, I'll do some more testing over the weekend and let you know next week.







* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-09  8:09                     ` Samuel Jones
@ 2021-09-09  8:54                       ` Sagi Grimberg
  2021-09-13  7:44                         ` Samuel Jones
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2021-09-09  8:54 UTC (permalink / raw)
  To: Samuel Jones; +Cc: Keith Busch, linux-nvme


>> Keith, can you send a proper patch with a Fixes: tag and a descriptive
>> title so this can go to stable.
>>
>> And let's wait for Sam to do some more extensive testing. Ideally we can
>> also hear from the WD folks that this doesn't cause any regressions.
> 
> OK, I'll do some more testing over the weekend and let you know next week.

Thanks.


* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-09  8:54                       ` Sagi Grimberg
@ 2021-09-13  7:44                         ` Samuel Jones
  2021-09-13  9:15                           ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Samuel Jones @ 2021-09-13  7:44 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, linux-nvme

> From: "Sagi Grimberg" <sagi@grimberg.me>
> To: "Samuel Jones" <sjones@kalrayinc.com>
> Cc: "Keith Busch" <kbusch@kernel.org>, "linux-nvme" <linux-nvme@lists.infradead.org>
> Sent: Thursday, 9 September, 2021 10:54:32
> Subject: Re: Command timeouts with NVMe TCP kernel driver

>>> Keith, can you send a proper patch with a Fixes: tag and a descriptive
>>> title so this can go to stable.
>>>
>>> And let's wait for Sam to do some more extensive testing. Ideally we can
>>> also hear from the WD folks that this doesn't cause any regressions.
>> 
>> OK, I'll do some more testing over the weekend and let you know next week.
> 
> Thanks.

Hey,

No further issues on my side with this patch.

Sam






* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-13  7:44                         ` Samuel Jones
@ 2021-09-13  9:15                           ` Sagi Grimberg
  2021-09-13 14:10                             ` Keith Busch
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2021-09-13  9:15 UTC (permalink / raw)
  To: Samuel Jones; +Cc: Keith Busch, linux-nvme


>>>> Keith, can you send a proper patch with a Fixes: tag and a descriptive
>>>> title so this can go to stable.
>>>>
>>>> And let's wait for Sam to do some more extensive testing. Ideally we can
>>>> also hear from the WD folks that this doesn't cause any regressions.
>>>
>>> OK, I'll do some more testing over the weekend and let you know next week.
>>
>> Thanks.
> 
> Hey,
> 
> No further issues on my side with this patch.

Thanks Sam,

Keith, any updates from the WD team?


* Re: Command timeouts with NVMe TCP kernel driver
  2021-09-13  9:15                           ` Sagi Grimberg
@ 2021-09-13 14:10                             ` Keith Busch
  0 siblings, 0 replies; 19+ messages in thread
From: Keith Busch @ 2021-09-13 14:10 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Samuel Jones, linux-nvme

On Mon, Sep 13, 2021 at 12:15:25PM +0300, Sagi Grimberg wrote:
> 
> > > > > Keith, can you send a proper patch with a Fixes: tag and a descriptive
> > > > > title so this can go to stable.
> > > > > 
> > > > > And let's wait for Sam to do some more extensive testing. Ideally we can
> > > > > also hear from the WD folks that this doesn't cause any regressions.
> > > > 
> > > > OK, I'll do some more testing over the weekend and let you know next week.
> > > 
> > > Thanks.
> > 
> > Hey,
> > 
> > No further issues on my side with this patch.
> 
> Thanks Sam,
> 
> Keith, any updates from the WD team?

This was queued up for their next test cycle late last week; I will send
an update as soon as I hear back from them. I believe it's a multi-day
test, so it may be later this week depending on when they started.


end of thread

Thread overview: 19+ messages
2021-08-30 13:30 Command timeouts with NVMe TCP kernel driver Samuel Jones
2021-08-31  2:37 ` Keith Busch
2021-08-31  2:44   ` Keith Busch
2021-09-01 12:56     ` Sagi Grimberg
2021-08-31 22:46   ` Samuel Jones
2021-09-01 13:21     ` Sagi Grimberg
2021-09-07  9:05       ` Samuel Jones
2021-09-07 16:42         ` Keith Busch
2021-09-07 20:46           ` Keith Busch
2021-09-08 10:29             ` Samuel Jones
2021-09-08 20:55             ` Sagi Grimberg
2021-09-08 21:50               ` Keith Busch
2021-09-09  6:36                 ` Sagi Grimberg
2021-09-09  8:01                   ` Sagi Grimberg
2021-09-09  8:09                     ` Samuel Jones
2021-09-09  8:54                       ` Sagi Grimberg
2021-09-13  7:44                         ` Samuel Jones
2021-09-13  9:15                           ` Sagi Grimberg
2021-09-13 14:10                             ` Keith Busch
