* [PATCH] blk-mq: avoid to hang in the cpuhp offline handler
@ 2022-09-20  2:17 Ming Lei
  2022-09-22  6:25 ` Christoph Hellwig
  2022-09-22  8:47 ` John Garry
  0 siblings, 2 replies; 5+ messages in thread
From: Ming Lei @ 2022-09-20  2:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Ming Lei, linux-nvme, Yi Zhang

To avoid triggering IO timeouts when an hctx becomes inactive, we
drain in-flight IO when all CPUs of that hctx are offline. However,
the driver's timeout handler may require cpus_read_lock; in nvme-pci,
for example, pci_alloc_irq_vectors_affinity() is called from the reset
context, and irq_build_affinity_masks() needs cpus_read_lock().

Meanwhile, when blk-mq's cpuhp offline handler is called,
cpus_write_lock is held, so a deadlock results.

Fix the issue by breaking out of the wait loop once enough time has
elapsed; any in-flight IO that was not drained can still be handled by
the timeout handler.

Cc: linux-nvme@lists.infradead.org
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c96c8c4f751b..4585985b8537 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3301,6 +3301,7 @@ static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
 	return true;
 }
 
+#define BLK_MQ_MAX_OFFLINE_WAIT_MSECS 3000
 static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 {
 	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
@@ -3326,8 +3327,13 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 	 * frozen and there are no requests.
 	 */
 	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
-		while (blk_mq_hctx_has_requests(hctx))
+		unsigned int wait_ms = 0;
+
+		while (blk_mq_hctx_has_requests(hctx) && wait_ms <
+				BLK_MQ_MAX_OFFLINE_WAIT_MSECS) {
 			msleep(5);
+			wait_ms += 5;
+		}
 		percpu_ref_put(&hctx->queue->q_usage_counter);
 	}
 
-- 
2.31.1
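
As background for the diff above: the change amounts to a bounded
polling loop, i.e. keep waiting for the hctx to drain, but give up
after BLK_MQ_MAX_OFFLINE_WAIT_MSECS so the hotplug path cannot block
forever. A minimal userspace sketch of the same pattern follows; it is
not kernel code, and hctx_has_requests() is a hypothetical stand-in
for blk_mq_hctx_has_requests():

/*
 * Minimal userspace sketch of the bounded-wait pattern in the patch;
 * hctx_has_requests() is a made-up stand-in for
 * blk_mq_hctx_has_requests().  Build with: gcc -o drain drain.c
 */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_OFFLINE_WAIT_MSECS	3000
#define POLL_INTERVAL_MSECS	5

/* Pretend the requests drain after ~50ms. */
static bool hctx_has_requests(unsigned int elapsed_ms)
{
	return elapsed_ms < 50;
}

static void wait_for_drain(void)
{
	unsigned int wait_ms = 0;

	/* Same shape as the patched loop: poll, but cap the total wait. */
	while (hctx_has_requests(wait_ms) &&
	       wait_ms < MAX_OFFLINE_WAIT_MSECS) {
		usleep(POLL_INTERVAL_MSECS * 1000);
		wait_ms += POLL_INTERVAL_MSECS;
	}

	if (hctx_has_requests(wait_ms))
		printf("gave up after %u ms; leftover IO is left to the timeout handler\n",
		       wait_ms);
	else
		printf("drained after %u ms\n", wait_ms);
}

int main(void)
{
	wait_for_drain();
	return 0;
}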



* Re: [PATCH] blk-mq: avoid to hang in the cpuhp offline handler
  2022-09-20  2:17 [PATCH] blk-mq: avoid to hang in the cpuhp offline handler Ming Lei
@ 2022-09-22  6:25 ` Christoph Hellwig
  2022-09-22  7:41   ` Ming Lei
  2022-09-22  8:47 ` John Garry
  1 sibling, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2022-09-22  6:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme, Yi Zhang,
	Thomas Gleixner

On Tue, Sep 20, 2022 at 10:17:24AM +0800, Ming Lei wrote:
> To avoid triggering IO timeouts when an hctx becomes inactive, we
> drain in-flight IO when all CPUs of that hctx are offline. However,
> the driver's timeout handler may require cpus_read_lock; in nvme-pci,
> for example, pci_alloc_irq_vectors_affinity() is called from the reset
> context, and irq_build_affinity_masks() needs cpus_read_lock().
> 
> Meanwhile, when blk-mq's cpuhp offline handler is called,
> cpus_write_lock is held, so a deadlock results.
> 
> Fix the issue by breaking out of the wait loop once enough time has
> elapsed; any in-flight IO that was not drained can still be handled by
> the timeout handler.

I'm not sure that this actually is a good idea on its own, and it kinda
defeats the cpu hotplug processing.

So if I understand your log above correctly, the problem is that
we have commands that would time out, and we escalate to a
controller reset that is racing with the CPU unplug; if not, can
you explain the scenario a little more?

> Cc: linux-nvme@lists.infradead.org
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c96c8c4f751b..4585985b8537 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3301,6 +3301,7 @@ static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
>  	return true;
>  }
>  
> +#define BLK_MQ_MAX_OFFLINE_WAIT_MSECS 3000
>  static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>  {
>  	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> @@ -3326,8 +3327,13 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>  	 * frozen and there are no requests.
>  	 */
>  	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
> -		while (blk_mq_hctx_has_requests(hctx))
> +		unsigned int wait_ms = 0;
> +
> +		while (blk_mq_hctx_has_requests(hctx) && wait_ms <
> +				BLK_MQ_MAX_OFFLINE_WAIT_MSECS) {
>  			msleep(5);
> +			wait_ms += 5;
> +		}
>  		percpu_ref_put(&hctx->queue->q_usage_counter);
>  	}
>  
> -- 
> 2.31.1
---end quoted text---


* Re: [PATCH] blk-mq: avoid to hang in the cpuhp offline handler
  2022-09-22  6:25 ` Christoph Hellwig
@ 2022-09-22  7:41   ` Ming Lei
  0 siblings, 0 replies; 5+ messages in thread
From: Ming Lei @ 2022-09-22  7:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, linux-nvme, Yi Zhang, Thomas Gleixner

On Thu, Sep 22, 2022 at 08:25:17AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 20, 2022 at 10:17:24AM +0800, Ming Lei wrote:
> > To avoid triggering IO timeouts when an hctx becomes inactive, we
> > drain in-flight IO when all CPUs of that hctx are offline. However,
> > the driver's timeout handler may require cpus_read_lock; in nvme-pci,
> > for example, pci_alloc_irq_vectors_affinity() is called from the reset
> > context, and irq_build_affinity_masks() needs cpus_read_lock().
> > 
> > Meanwhile, when blk-mq's cpuhp offline handler is called,
> > cpus_write_lock is held, so a deadlock results.
> > 
> > Fix the issue by breaking out of the wait loop once enough time has
> > elapsed; any in-flight IO that was not drained can still be handled by
> > the timeout handler.
> 
> I'm not sure that this actually is a good idea on its own, and it kinda
> defeats the cpu hotplug processing.
> 
> So if I understand your log above correctly, the problem is that
> we have commands that would time out, and we escalate to a
> controller reset that is racing with the CPU unplug.

Yes. 

blk_mq_hctx_notify_offline() is waiting for in-flight requests while
cpus_write_lock is held, since it runs in the cpuhp code path.

Meanwhile, nvme reset grabs dev->shutdown_lock, then calls
pci_alloc_irq_vectors_affinity() -> irq_build_affinity_masks(), which
waits for cpus_read_lock().

Meanwhile, nvme_dev_disable() can't make progress handling any IO
timeout because dev->shutdown_lock is held by nvme reset, so the
in-flight IO can't be drained by blk_mq_hctx_notify_offline().

It is a real IO deadlock between cpuhp and nvme reset.
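
To make the cycle concrete, here is a toy userspace model; a pthread
rwlock stands in for the CPU hotplug lock, a mutex for
dev->shutdown_lock, and the three threads mimic the three contexts
above (the names and flow are illustrative only, not the real kernel
call chains). Running it hangs by design; that hang is the deadlock:

/*
 * Toy userspace model of the dependency cycle described above.
 * Assumptions: a pthread rwlock models the CPU hotplug lock and a
 * mutex models dev->shutdown_lock.  Build with:
 *   gcc -pthread -o cycle cycle.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t cpus_lock = PTHREAD_RWLOCK_INITIALIZER;   /* cpu hotplug lock */
static pthread_mutex_t shutdown_lock = PTHREAD_MUTEX_INITIALIZER; /* "dev->shutdown_lock" */
static volatile int inflight = 1;                                  /* one undrained request */

/* cpuhp offline path: holds the write lock, spins waiting for the drain */
static void *cpuhp_offline(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&cpus_lock);         /* cpus_write_lock() */
	printf("cpuhp: draining in-flight IO\n");
	while (inflight)                           /* blk_mq_hctx_notify_offline() loop */
		usleep(5000);
	pthread_rwlock_unlock(&cpus_lock);
	return NULL;
}

/* reset path: holds shutdown_lock, then needs the read lock */
static void *nvme_reset(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&shutdown_lock);
	printf("reset: waiting for cpus_read_lock()\n");
	pthread_rwlock_rdlock(&cpus_lock);         /* irq_build_affinity_masks() */
	pthread_rwlock_unlock(&cpus_lock);
	pthread_mutex_unlock(&shutdown_lock);
	return NULL;
}

/* timeout path: needs shutdown_lock before it can complete the request */
static void *nvme_timeout(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&shutdown_lock);        /* nvme_dev_disable() */
	inflight = 0;                              /* would let the drain loop finish */
	pthread_mutex_unlock(&shutdown_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b, c;

	pthread_create(&a, NULL, cpuhp_offline, NULL);
	usleep(100000);                            /* let cpuhp take the write lock first */
	pthread_create(&b, NULL, nvme_reset, NULL);
	usleep(100000);                            /* let reset take shutdown_lock first */
	pthread_create(&c, NULL, nvme_timeout, NULL);

	pthread_join(a, NULL);                     /* never returns: three-way cycle */
	pthread_join(b, NULL);
	pthread_join(c, NULL);
	return 0;
}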


thanks,
Ming



* Re: [PATCH] blk-mq: avoid to hang in the cpuhp offline handler
  2022-09-20  2:17 [PATCH] blk-mq: avoid to hang in the cpuhp offline handler Ming Lei
  2022-09-22  6:25 ` Christoph Hellwig
@ 2022-09-22  8:47 ` John Garry
  2022-09-22  9:13   ` Ming Lei
  1 sibling, 1 reply; 5+ messages in thread
From: John Garry @ 2022-09-22  8:47 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Christoph Hellwig, linux-nvme, Yi Zhang

On 20/09/2022 03:17, Ming Lei wrote:
> For avoiding to trigger io timeout when one hctx becomes inactive, we
> drain IOs when all CPUs of one hctx are offline. However, driver's
> timeout handler may require cpus_read_lock, such as nvme-pci,
> pci_alloc_irq_vectors_affinity() is called in nvme-pci reset context,
> and irq_build_affinity_masks() needs cpus_read_lock().
> 
> Meantime when blk-mq's cpuhp offline handler is called, cpus_write_lock
> is held, so deadlock is caused.
> 
> Fixes the issue by breaking the wait loop if enough long time elapses,
> and these in-flight not drained IO still can be handled by timeout
> handler.

I don't think that this is a good idea, because drivers often cannot
safely handle the scenario of a timeout for an IO which has actually
completed. The NVMe timeout handler may poll for completion, but SCSI
does not.

Indeed, if we were going to let the timeout handler handle these
in-flight IOs, then there would be no point in having this hotplug
handler in the first place.

> 
> Cc: linux-nvme@lists.infradead.org
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c96c8c4f751b..4585985b8537 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3301,6 +3301,7 @@ static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
>   	return true;
>   }
>   
> +#define BLK_MQ_MAX_OFFLINE_WAIT_MSECS 3000

That's very low compared to the default SCSI sd timeout.

>   static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>   {
>   	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> @@ -3326,8 +3327,13 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>   	 * frozen and there are no requests.
>   	 */
>   	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
> -		while (blk_mq_hctx_has_requests(hctx))
> +		unsigned int wait_ms = 0;
> +
> +		while (blk_mq_hctx_has_requests(hctx) && wait_ms <
> +				BLK_MQ_MAX_OFFLINE_WAIT_MSECS) {
>   			msleep(5);
> +			wait_ms += 5;
> +		}
>   		percpu_ref_put(&hctx->queue->q_usage_counter);
>   	}
>   

Thanks,
John


* Re: [PATCH] blk-mq: avoid to hang in the cpuhp offline handler
  2022-09-22  8:47 ` John Garry
@ 2022-09-22  9:13   ` Ming Lei
  0 siblings, 0 replies; 5+ messages in thread
From: Ming Lei @ 2022-09-22  9:13 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme, Yi Zhang,
	ming.lei

On Thu, Sep 22, 2022 at 09:47:09AM +0100, John Garry wrote:
> On 20/09/2022 03:17, Ming Lei wrote:
> > To avoid triggering IO timeouts when an hctx becomes inactive, we
> > drain in-flight IO when all CPUs of that hctx are offline. However,
> > the driver's timeout handler may require cpus_read_lock; in nvme-pci,
> > for example, pci_alloc_irq_vectors_affinity() is called from the reset
> > context, and irq_build_affinity_masks() needs cpus_read_lock().
> > 
> > Meanwhile, when blk-mq's cpuhp offline handler is called,
> > cpus_write_lock is held, so a deadlock results.
> > 
> > Fix the issue by breaking out of the wait loop once enough time has
> > elapsed; any in-flight IO that was not drained can still be handled by
> > the timeout handler.
> 
> I don't think that this is a good idea, because drivers often cannot
> safely handle the scenario of a timeout for an IO which has actually
> completed. The NVMe timeout handler may poll for completion, but SCSI
> does not.
> 
> Indeed, if we were going to let the timeout handler handle these
> in-flight IOs, then there would be no point in having this hotplug
> handler in the first place.

That has been true from the beginning, and we knew the point; I
remember Hannes asking this question at LSF/MM, and there are many
drivers which don't implement a timeout handler.

This issue looks more nvme specific, since the nvme timeout handler
can't make progress during nvme reset. Let's see if it can be fixed in
the nvme driver.

BTW, nvme error handling is really fragile, and not only in this case;
for example, any timeout during reset will cause the device to be
removed.


Thanks.
Ming


