linux-nvme.lists.infradead.org archive mirror
* nvme: hung task in blk_mq_freeze_queue_wait
@ 2022-11-08  2:56 Francesco Ruggeri
  2022-11-08 11:28 ` Hillf Danton
  2022-11-08 16:51 ` Keith Busch
  0 siblings, 2 replies; 4+ messages in thread
From: Francesco Ruggeri @ 2022-11-08  2:56 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, sagi, hch, axboe, kbusch, fruggeri

We are running into hung tasks in blk_mq_freeze_queue_wait, sometimes
in nvme_reset_work and sometimes in nvme_scan_work.
In some cases we also see a warning in blk_mq_unfreeze_queue
(WARN_ON_ONCE(q->mq_freeze_depth < 0)):
WARNING: CPU: 10 PID: 152 at block/blk-mq.c:199 blk_mq_unfreeze_queue+0x26/0x5a
The hung tasks are preceded by two instances of nvme_timeout racing in
nvme_dev_disable after a device becomes unresponsive.
The first nvme_dev_disable blocks in wait_for_completion_io_timeout, and
a while later a second instance blocks in mutex_lock(&dev->shutdown_lock).
When the timeout expires and the first instance releases the mutex,
the two instances start racing.
We have seen the hung tasks in 4.19, but from a quick look at the
latest 6.1 code the issue is probably there as well.
There seem to be two different scenarios where unbalanced
blk_freeze_queue_start/blk_mq_unfreeze_queue calls could be the cause.
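
For reference, the bookkeeping those two calls are expected to balance
looks roughly like this in recent kernels (a simplified sketch, not the
verbatim block layer code; 4.19 uses an atomic counter rather than the
mutex, but the pairing requirement is the same):

void blk_freeze_queue_start(struct request_queue *q)
{
	mutex_lock(&q->mq_freeze_lock);
	if (++q->mq_freeze_depth == 1)
		/* first freezer kills q_usage_counter to block new I/O */
		percpu_ref_kill(&q->q_usage_counter);
	mutex_unlock(&q->mq_freeze_lock);
}

void blk_mq_unfreeze_queue(struct request_queue *q)
{
	mutex_lock(&q->mq_freeze_lock);
	q->mq_freeze_depth--;
	WARN_ON_ONCE(q->mq_freeze_depth < 0);	/* the warning we see */
	if (!q->mq_freeze_depth) {
		percpu_ref_resurrect(&q->q_usage_counter);
		wake_up_all(&q->mq_freeze_wq);
	}
	mutex_unlock(&q->mq_freeze_lock);
}

An unfreeze without a preceding freeze drives mq_freeze_depth negative
(the WARN_ON_ONCE above), while a freeze with no matching unfreeze
leaves the queue frozen for good.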

1) In this case we have an unfreeze without a corresponding freeze
preceding it (the state check this hinges on is sketched after the
timeline below).

TIMEOUT 1	TIMEOUT 2	RESET WORK 1	RESET WORK 2

state = NVME_CTRL_LIVE
nvme_timeout
nvme_dev_disable
mutex_lock
nvme_start_freeze
blk_freeze_queue_start
nvme_disable_io_queues
wait_for_completion_io_timeout
		nvme_timeout
		nvme_dev_disable
		mutex_lock
mutex_unlock
nvme_reset_ctrl
state = NVME_CTRL_RESETTING
queue_work(nvme_reset_work)
				nvme_reset_work
				state = NVME_CTRL_CONNECTING
		(state != NVME_CTRL_LIVE and
		state != NVME_CTRL_RESETTING)
		skips nvme_start_freeze
		mutex_unlock
		nvme_reset_ctrl
		state = NVME_CTRL_RESETTING
		queue_work(nvme_reset_work)
				nvme_unfreeze (matches
				nvme_start_freeze in
				TIMEOUT 1)
						nvme_reset_work
						nvme_unfreeze (no
						match in TIMEOUT 2)
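
The state check scenario 1 hinges on is the one in nvme_dev_disable();
roughly (a simplified sketch of drivers/nvme/host/pci.c, details differ
between 4.19 and current kernels):

static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
{
	struct pci_dev *pdev = to_pci_dev(dev->dev);

	mutex_lock(&dev->shutdown_lock);	/* serializes the two timeouts */
	if (pci_is_enabled(pdev)) {
		/*
		 * Freeze only if the controller still looks live.  In
		 * scenario 1 the second caller arrives after RESET WORK 1
		 * has already moved the state to NVME_CTRL_CONNECTING, so
		 * it skips nvme_start_freeze() here.
		 */
		if (dev->ctrl.state == NVME_CTRL_LIVE ||
		    dev->ctrl.state == NVME_CTRL_RESETTING)
			nvme_start_freeze(&dev->ctrl);
	}

	/* ... suspend queues, nvme_disable_io_queues(), cancel requests ... */

	mutex_unlock(&dev->shutdown_lock);
}

The nvme_unfreeze() near the end of nvme_reset_work(), on the other
hand, does not track whether the corresponding nvme_dev_disable()
actually froze, which is where the extra unfreeze comes from.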


2) In this case a freeze has no corresponding unfreeze following it.
TIMEOUT 2 cannot schedule nvme_reset_work because TIMEOUT 1's reset work
is already queued but not yet running (see the nvme_reset_ctrl() sketch
after the timeline below).

TIMEOUT 1	TIMEOUT 2	RESET WORK 1	RESET WORK 2

state = NVME_CTRL_LIVE
nvme_timeout
nvme_dev_disable
mutex_lock
nvme_start_freeze
blk_freeze_queue_start
nvme_disable_io_queues
wait_for_completion_io_timeout
		nvme_timeout
		nvme_dev_disable
		mutex_lock
mutex_unlock
nvme_reset_ctrl
state = NVME_CTRL_RESETTING
queue_work(nvme_reset_work)
		(state == NVME_CTRL_LIVE or
		state == NVME_CTRL_RESETTING)
		nvme_start_freeze
		blk_freeze_queue_start
		mutex_unlock
		nvme_reset_ctrl
		state = NVME_CTRL_RESETTING
		queue_work(nvme_reset_work)
		fails (nvme_reset_work already
		queued but not yet running)
				nvme_reset_work
				nvme_unfreeze (matches
				nvme_start_freeze in
				TIMEOUT 1)
						It never gets
						scheduled.
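
The double-queueing failure in scenario 2 comes from nvme_reset_ctrl(),
which looks roughly like this (simplified from drivers/nvme/host/core.c):

int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
{
	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
		return -EBUSY;
	/*
	 * queue_work() returns false while reset_work is still pending,
	 * i.e. queued by TIMEOUT 1 but not yet picked up by a worker.
	 * The second reset request is dropped, so the freeze taken by
	 * TIMEOUT 2 is never paired with an unfreeze.
	 */
	if (!queue_work(nvme_reset_wq, &ctrl->reset_work))
		return -EBUSY;
	return 0;
}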


Following is one such backtrace.

[ 4390.119745] INFO: task kworker/u80:2:8279 blocked for more than 300 seconds.
[ 4390.471456]       Tainted: P           O      4.19.142.Ar-29007847.buytenhb7335070 #1
[ 4390.832568] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4391.193663] kworker/u80:2   D    0  8279      2 0x80000000
[ 4391.526603] Workqueue: nvme-reset-wq nvme_reset_work
[ 4391.853272] Call Trace:
[ 4392.149696]  __schedule+0x75b/0x7c2
[ 4392.458627]  schedule+0x78/0x8c
[ 4392.763399]  blk_mq_freeze_queue_wait+0x8b/0xb6
[ 4393.084849]  ? wait_woken+0x92/0x92
[ 4393.393791]  nvme_wait_freeze+0x39/0x4e
[ 4393.706900]  nvme_reset_work+0x1464/0x1665
[ 4394.023142]  ? dequeue_entity+0x694/0x6a3
[ 4394.338340]  process_one_work+0x1c2/0x30f
[ 4394.511642]  worker_thread+0x1e9/0x2cc
[ 4394.556626]  ? rescuer_thread+0x2b7/0x2b7
[ 4394.604738]  kthread+0x15d/0x165
[ 4394.643460]  ? kthread_park+0x98/0x98
[ 4394.687396]  ret_from_fork+0x1f/0x30





* Re: nvme: hung task in blk_mq_freeze_queue_wait
  2022-11-08  2:56 nvme: hung task in blk_mq_freeze_queue_wait Francesco Ruggeri
@ 2022-11-08 11:28 ` Hillf Danton
  2022-11-08 16:51 ` Keith Busch
  1 sibling, 0 replies; 4+ messages in thread
From: Hillf Danton @ 2022-11-08 11:28 UTC (permalink / raw)
  To: fruggeri; +Cc: linux-kernel, linux-nvme, hch

On 07 Nov 2022 18:56:02 -0800 Francesco Ruggeri <fruggeri@arista.com> 
> We are running into hung tasks in blk_mq_freeze_queue_wait, sometimes
> in nvme_reset_work and sometimes in nvme_scan_work.
> In some cases we also see a warning in blk_mq_unfreeze_queue
> (WARN_ON_ONCE(q->mq_freeze_depth < 0)):

Evidence for hang after double wakeup.

> [...]
> Following is one such backtrace.
> 
> [ 4390.119745] INFO: task kworker/u80:2:8279 blocked for more than 300 seconds.
> [ 4390.471456]       Tainted: P           O      4.19.142.Ar-29007847.buytenhb7335070 #1
> [ 4390.832568] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 4391.193663] kworker/u80:2   D    0  8279      2 0x80000000
> [ 4391.526603] Workqueue: nvme-reset-wq nvme_reset_work
> [ 4391.853272] Call Trace:
> [ 4392.149696]  __schedule+0x75b/0x7c2
> [ 4392.458627]  schedule+0x78/0x8c
> [ 4392.763399]  blk_mq_freeze_queue_wait+0x8b/0xb6
> [ 4393.084849]  ? wait_woken+0x92/0x92
> [ 4393.393791]  nvme_wait_freeze+0x39/0x4e
> [ 4393.706900]  nvme_reset_work+0x1464/0x1665
> [ 4394.023142]  ? dequeue_entity+0x694/0x6a3
> [ 4394.338340]  process_one_work+0x1c2/0x30f
> [ 4394.511642]  worker_thread+0x1e9/0x2cc
> [ 4394.556626]  ? rescuer_thread+0x2b7/0x2b7
> [ 4394.604738]  kthread+0x15d/0x165
> [ 4394.643460]  ? kthread_park+0x98/0x98
> [ 4394.687396]  ret_from_fork+0x1f/0x30

Given blk_mq_freeze_queue_wait() in the call trace, another possible
explanation is the percpu refcount [1], which can be tested with the
diff attached.

Only for thoughts.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/block/blk-mq.c?h=next-20221104#n228

Hillf

+++ next-20221104/block/blk-mq.c
@@ -175,9 +175,20 @@ void blk_freeze_queue_start(struct reque
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
+static bool blk_mq_zero_freeze_depth(struct request_queue *q)
+{
+	bool ret;
+
+	mutex_lock(&q->mq_freeze_lock);
+	ret = !q->mq_freeze_depth;
+	mutex_unlock(&q->mq_freeze_lock);
+
+	return ret;
+}
+
 void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
+	wait_event(q->mq_freeze_wq, blk_mq_zero_freeze_depth(q));
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
 
@@ -185,7 +196,7 @@ int blk_mq_freeze_queue_wait_timeout(str
 				     unsigned long timeout)
 {
 	return wait_event_timeout(q->mq_freeze_wq,
-					percpu_ref_is_zero(&q->q_usage_counter),
+					blk_mq_zero_freeze_depth(q),
 					timeout);
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);



* Re: nvme: hung task in blk_mq_freeze_queue_wait
  2022-11-08  2:56 nvme: hung task in blk_mq_freeze_queue_wait Francesco Ruggeri
  2022-11-08 11:28 ` Hillf Danton
@ 2022-11-08 16:51 ` Keith Busch
  2022-11-08 17:30   ` Francesco Ruggeri
  1 sibling, 1 reply; 4+ messages in thread
From: Keith Busch @ 2022-11-08 16:51 UTC (permalink / raw)
  To: Francesco Ruggeri; +Cc: linux-kernel, linux-nvme, sagi, hch, axboe

On Mon, Nov 07, 2022 at 06:56:02PM -0800, Francesco Ruggeri wrote:
> [ 4390.119745] INFO: task kworker/u80:2:8279 blocked for more than 300 seconds.
> [ 4390.471456]       Tainted: P           O      4.19.142.Ar-29007847.buytenhb7335070 #1

That is an old kernel; I think it'd be worth trying something newer to
confirm whether this has already been fixed. Specifically, it looks
like you have multiple namespaces timing out IO nearly simultaneously,
causing mismatched handling. That kind of thing was fixed after the
following commit (plus some prior dependencies):

  commit d6135c3a1ec0cddda7b8b8e1b5b4abeeafd98289
  Author: Keith Busch <kbusch@kernel.org>
  Date:   Tue May 14 14:46:09 2019 -0600
  
      nvme-pci: Sync queues on reset
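
For context, the gist of that change is to flush any pending timeout
handling before the reset proceeds, along these lines (paraphrased from
the commit rather than quoted verbatim):

/* drivers/nvme/host/core.c */
void nvme_sync_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_sync_queue(ns->queue);	/* flush pending timeout work */
	up_read(&ctrl->namespaces_rwsem);
}

In that commit, nvme_reset_work() calls this right after
nvme_dev_disable(), so per-namespace timeout work that is still pending
is flushed before the reset continues.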



* Re: nvme: hung task in blk_mq_freeze_queue_wait
  2022-11-08 16:51 ` Keith Busch
@ 2022-11-08 17:30   ` Francesco Ruggeri
  0 siblings, 0 replies; 4+ messages in thread
From: Francesco Ruggeri @ 2022-11-08 17:30 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-kernel, linux-nvme, sagi, hch, axboe

On Tue, Nov 8, 2022 at 8:51 AM Keith Busch <kbusch@kernel.org> wrote:
> That is an old kernel; I think it'd be worth trying something newer to
> confirm whether this has already been fixed. Specifically, it looks
> like you have multiple namespaces timing out IO nearly simultaneously,
> causing mismatched handling. That kind of thing was fixed after the
> following commit (plus some prior dependencies):
>
>   commit d6135c3a1ec0cddda7b8b8e1b5b4abeeafd98289
>   Author: Keith Busch <kbusch@kernel.org>
>   Date:   Tue May 14 14:46:09 2019 -0600
>
>       nvme-pci: Sync queues on reset


On Tue, Nov 8, 2022 at 3:29 AM Hillf Danton <hdanton@sina.com> wrote:
>
> Given blk_mq_freeze_queue_wait() in the call trace, another explanation
> may be percpu refcount [1], which can be tested with the diff attached.
>
> Only for thoughts.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/block/blk-mq.c?h=next-20221104#n228
>
> Hillf

Thanks Keith, Hillf, I will look into that.

Francesco



