* lockdep WARNING at blktests block/011
@ 2022-09-30  0:19 Shinichiro Kawasaki
  2022-09-30 11:06 ` Tetsuo Handa
  0 siblings, 1 reply; 16+ messages in thread
From: Shinichiro Kawasaki @ 2022-09-30  0:19 UTC (permalink / raw)
  To: linux-block, linux-nvme
  Cc: Tetsuo Handa, Tejun Heo, Johannes Thumshirn, Damien Le Moal

Kernel v6.0-rc7 triggered the WARN "possible circular locking dependency
detected" at blktests test case block/011 for NVMe devices [1]. The trigger
commit was c0feea594e05 ("workqueue: don't skip lockdep work dependency in
cancel_work_sync()"). The commit was backported to the v5.15.71 stable kernel,
and the WARN is observed with that stable kernel as well. It looks like the
possible deadlock scenario has been hidden for a long time and the commit has
now revealed it.

As the commit title says, it restored the lockdep work dependency in
cancel_work_sync(), which is called from nvme_reset_work() via blk_sync_queue().
The test case repeatedly disables the PCI device during I/O on the PCI NVMe
device, so nvme_reset_work() is called repeatedly.

The WARN was observed with the QEMU NVMe emulation device. I also tried two of
my real NVMe devices: one of them reproduced the WARN, the other did not.

I tried to understand the deadlock scenario, but cannot tell whether it is real
or a false positive. Help understanding the scenario would be appreciated. The
lockdep splat mentions three locks (ctrl->namespaces_rwsem, dev->shutdown_lock
and q->timeout_work). In the related function call paths, ctrl->namespaces_rwsem
appears to be taken only for read, so the deadlock scenario does not look
possible to me. (Maybe I'm misunderstanding something...)

[1]

[ 3033.791353] run blktests block/011 at 2022-09-27 13:24:34
[ 3064.944407] nvme nvme5: controller is down; will reset: CSTS=0x2, PCI_STATUS=0x10

[ 3066.531825] ======================================================
[ 3066.533372] WARNING: possible circular locking dependency detected
[ 3066.534914] 6.0.0-rc7 #1 Not tainted
[ 3066.535891] ------------------------------------------------------
[ 3066.537492] kworker/u8:0/904 is trying to acquire lock:
[ 3066.538807] ffff888102f24788 ((work_completion)(&q->timeout_work)){+.+.}-{0:0}, at: __flush_work+0xc2/0x900
[ 3066.541161] 
               but task is already holding lock:
[ 3066.542793] ffff888116840528 (&ctrl->namespaces_rwsem){++++}-{3:3}, at: nvme_sync_queues+0x26/0x100 [nvme_core]
[ 3066.545294] 
               which lock already depends on the new lock.

[ 3066.547563] 
               the existing dependency chain (in reverse order) is:
[ 3066.549530] 
               -> #2 (&ctrl->namespaces_rwsem){++++}-{3:3}:
[ 3066.551431]        down_read+0x3e/0x50
[ 3066.552421]        nvme_start_freeze+0x24/0xb0 [nvme_core]
[ 3066.553862]        nvme_dev_disable+0x4ee/0x660 [nvme]
[ 3066.555208]        nvme_timeout.cold+0x3f2/0x699 [nvme]
[ 3066.556575]        blk_mq_check_expired+0x19a/0x270
[ 3066.557865]        bt_iter+0x21e/0x300
[ 3066.558935]        blk_mq_queue_tag_busy_iter+0x7fd/0x1240
[ 3066.560356]        blk_mq_timeout_work+0x143/0x380
[ 3066.562602]        process_one_work+0x816/0x1300
[ 3066.563847]        worker_thread+0xfc/0x1270
[ 3066.564942]        kthread+0x29b/0x340
[ 3066.565973]        ret_from_fork+0x1f/0x30
[ 3066.567083] 
               -> #1 (&dev->shutdown_lock){+.+.}-{3:3}:
[ 3066.568851]        __mutex_lock+0x14c/0x12b0
[ 3066.570010]        nvme_dev_disable+0x5d/0x660 [nvme]
[ 3066.571309]        nvme_timeout.cold+0x3f2/0x699 [nvme]
[ 3066.572678]        blk_mq_check_expired+0x19a/0x270
[ 3066.573972]        bt_iter+0x21e/0x300
[ 3066.575024]        blk_mq_queue_tag_busy_iter+0x7fd/0x1240
[ 3066.576431]        blk_mq_timeout_work+0x143/0x380
[ 3066.577688]        process_one_work+0x816/0x1300
[ 3066.578882]        worker_thread+0xfc/0x1270
[ 3066.580046]        kthread+0x29b/0x340
[ 3066.581081]        ret_from_fork+0x1f/0x30
[ 3066.582194] 
               -> #0 ((work_completion)(&q->timeout_work)){+.+.}-{0:0}:
[ 3066.584265]        __lock_acquire+0x2843/0x5520
[ 3066.585426]        lock_acquire+0x194/0x4f0
[ 3066.586458]        __flush_work+0xe2/0x900
[ 3066.587484]        __cancel_work_timer+0x27a/0x3a0
[ 3066.588568]        nvme_sync_queues+0x71/0x100 [nvme_core]
[ 3066.589782]        nvme_reset_work+0x137/0x39f0 [nvme]
[ 3066.590896]        process_one_work+0x816/0x1300
[ 3066.591895]        worker_thread+0xfc/0x1270
[ 3066.592795]        kthread+0x29b/0x340
[ 3066.593654]        ret_from_fork+0x1f/0x30
[ 3066.594569] 
               other info that might help us debug this:

[ 3066.596290] Chain exists of:
                 (work_completion)(&q->timeout_work) --> &dev->shutdown_lock --> &ctrl->namespaces_rwsem

[ 3066.598960]  Possible unsafe locking scenario:

[ 3066.600167]        CPU0                    CPU1
[ 3066.600985]        ----                    ----
[ 3066.601779]   lock(&ctrl->namespaces_rwsem);
[ 3066.602506]                                lock(&dev->shutdown_lock);
[ 3066.603597]                                lock(&ctrl->namespaces_rwsem);
[ 3066.604739]   lock((work_completion)(&q->timeout_work));
[ 3066.605653] 
                *** DEADLOCK ***

[ 3066.606728] 3 locks held by kworker/u8:0/904:
[ 3066.607436]  #0: ffff888111fe3138 ((wq_completion)nvme-reset-wq){+.+.}-{0:0}, at: process_one_work+0x73c/0x1300
[ 3066.608959]  #1: ffff88811ddcfdd0 ((work_completion)(&dev->ctrl.reset_work)){+.+.}-{0:0}, at: process_one_work+0x769/0x1300
[ 3066.610491]  #2: ffff888116840528 (&ctrl->namespaces_rwsem){++++}-{3:3}, at: nvme_sync_queues+0x26/0x100 [nvme_core]
[ 3066.612010] 
               stack backtrace:
[ 3066.612819] CPU: 1 PID: 904 Comm: kworker/u8:0 Not tainted 6.0.0-rc7 #1
[ 3066.613951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 3066.615613] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 3066.616532] Call Trace:
[ 3066.616985]  <TASK>
[ 3066.617378]  dump_stack_lvl+0x5b/0x77
[ 3066.618057]  check_noncircular+0x242/0x2f0
[ 3066.618734]  ? lock_chain_count+0x20/0x20
[ 3066.727613]  ? print_circular_bug+0x1e0/0x1e0
[ 3066.950865]  ? unwind_next_frame+0x21b/0x1e60
[ 3066.951388]  ? ret_from_fork+0x1f/0x30
[ 3066.951855]  ? lockdep_lock+0xbc/0x1a0
[ 3066.952361]  ? call_rcu_zapped+0xc0/0xc0
[ 3066.952835]  __lock_acquire+0x2843/0x5520
[ 3066.953315]  ? stack_trace_save+0x81/0xa0
[ 3066.953845]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 3066.954439]  lock_acquire+0x194/0x4f0
[ 3066.954931]  ? __flush_work+0xc2/0x900
[ 3066.955389]  ? lock_downgrade+0x6b0/0x6b0
[ 3066.955865]  ? lockdep_unlock+0xf2/0x250
[ 3066.956380]  ? __lock_acquire+0x2080/0x5520
[ 3066.956878]  __flush_work+0xe2/0x900
[ 3066.957316]  ? __flush_work+0xc2/0x900
[ 3066.957813]  ? mark_lock+0xee/0x1610
[ 3066.958249]  ? queue_delayed_work_on+0x90/0x90
[ 3066.958774]  ? lock_chain_count+0x20/0x20
[ 3066.959312]  ? lock_acquire+0x194/0x4f0
[ 3066.959776]  ? lock_acquire+0x1a4/0x4f0
[ 3066.960233]  ? lock_is_held_type+0xe3/0x140
[ 3066.960777]  ? find_held_lock+0x2d/0x110
[ 3066.961247]  ? mark_held_locks+0x9e/0xe0
[ 3066.961707]  ? lockdep_hardirqs_on_prepare+0x17b/0x410
[ 3066.962330]  __cancel_work_timer+0x27a/0x3a0
[ 3066.962832]  ? cancel_delayed_work+0x10/0x10
[ 3066.963381]  ? _raw_spin_unlock_irqrestore+0x40/0x60
[ 3066.963952]  ? try_to_del_timer_sync+0xd0/0xd0
[ 3066.964467]  nvme_sync_queues+0x71/0x100 [nvme_core]
[ 3066.965087]  nvme_reset_work+0x137/0x39f0 [nvme]
[ 3066.965627]  ? __lock_acquire+0x1593/0x5520
[ 3066.966173]  ? nvme_queue_rqs+0xa80/0xa80 [nvme]
[ 3066.966713]  ? lock_acquire+0x194/0x4f0
[ 3066.967180]  ? lock_acquire+0x1a4/0x4f0
[ 3066.967698]  ? lock_downgrade+0x6b0/0x6b0
[ 3066.968176]  ? reacquire_held_locks+0x4e0/0x4e0
[ 3066.968751]  ? lock_is_held_type+0xe3/0x140
[ 3066.969247]  process_one_work+0x816/0x1300
[ 3066.969746]  ? lock_downgrade+0x6b0/0x6b0
[ 3066.970279]  ? pwq_dec_nr_in_flight+0x230/0x230
[ 3066.970818]  ? rwlock_bug.part.0+0x90/0x90
[ 3066.971303]  worker_thread+0xfc/0x1270
[ 3066.971818]  ? process_one_work+0x1300/0x1300
[ 3066.972330]  kthread+0x29b/0x340
[ 3066.972746]  ? kthread_complete_and_exit+0x20/0x20
[ 3066.973345]  ret_from_fork+0x1f/0x30
[ 3066.973791]  </TASK>
[ 3067.963008] nvme nvme5: 4/0/0 default/read/poll queues
[ 3067.972454] nvme nvme5: Ignoring bogus Namespace Identifiers
[ 3098.224067] nvme nvme5: I/O 936 (I/O Cmd) QID 2 timeout, aborting
[ 3098.225903] nvme nvme5: I/O 598 (I/O Cmd) QID 4 timeout, aborting
[ 3098.226229] nvme nvme5: Abort status: 0x0
[ 3098.227499] nvme nvme5: I/O 599 (I/O Cmd) QID 4 timeout, aborting
[ 3098.227872] nvme nvme5: Abort status: 0x0
[ 3098.231401] nvme nvme5: Abort status: 0x0
[ 3128.432035] nvme nvme5: I/O 936 QID 2 timeout, reset controller
[ 3130.808541] nvme nvme5: 4/0/0 default/read/poll queues
[ 3130.816416] nvme nvme5: Ignoring bogus Namespace Identifiers
[ 3161.199887] nvme nvme5: I/O 936 (I/O Cmd) QID 2 timeout, aborting
[ 3161.201909] nvme nvme5: Abort status: 0x0
[ 3191.407811] nvme nvme5: I/O 936 QID 2 timeout, reset controller
[ 3254.895644] nvme nvme5: controller is down; will reset: CSTS=0x2, PCI_STATUS=0x10
[ 3257.299155] nvme nvme5: 4/0/0 default/read/poll queues
[ 3257.312480] nvme nvme5: Ignoring bogus Namespace Identifiers

-- 
Shin'ichiro Kawasaki


* Re: lockdep WARNING at blktests block/011
  2022-09-30  0:19 lockdep WARNING at blktests block/011 Shinichiro Kawasaki
@ 2022-09-30 11:06 ` Tetsuo Handa
  2022-10-03 13:32   ` Shinichiro Kawasaki
  0 siblings, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2022-09-30 11:06 UTC (permalink / raw)
  To: Shinichiro Kawasaki, linux-block, linux-nvme
  Cc: Tejun Heo, Johannes Thumshirn, Damien Le Moal

On 2022/09/30 9:19, Shinichiro Kawasaki wrote:
> Kernel v6.0-rc7 triggered the WARN "possible circular locking dependency
> detected" at blktests test case block/011 for NVME devices [1]. The trigger
> commit was c0feea594e05 ("workqueue: don't skip lockdep work dependency in
> cancel_work_sync()"). The commit was backported to v5.15.71 stable kernel and
> the WARN is observed with the stable kernel also. It looks that the possible
> deadlock scenario has been hidden for long time and the commit revealed it now.

Right.

> I tried to understand the deadlock scenario, but can not tell if it is for real,
> or false-positive. Help to understand the scenario will be appreciated. The
> lockdep splat mentions three locks (ctrl->namespaces_rwsem, dev->shutdown_lock
> and q->timeout_work).

Right.

This is a circular locking problem which involves three locks.

  (A) (work_completion)(&q->timeout_work) --> &dev->shutdown_lock

  (B) &dev->shutdown_lock --> &ctrl->namespaces_rwsem

  (C) &ctrl->namespaces_rwsem --> (work_completion)(&q->timeout_work)

(A) and (B) happen due to

  INIT_WORK(&q->timeout_work, blk_mq_timeout_work);

  blk_mq_timeout_work(work) { // Is blocking __flush_work() from cancel_work_sync().
    blk_mq_queue_tag_busy_iter(blk_mq_check_expired) {
      bt_for_each(blk_mq_check_expired) == blk_mq_check_expired() {
        blk_mq_rq_timed_out() {
          req->q->mq_ops->timeout(req) == nvme_timeout(req) {
            nvme_dev_disable() {
              mutex_lock(&dev->shutdown_lock); // Holds dev->shutdown_lock
              nvme_start_freeze(&dev->ctrl) {
                down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
                up_read(&ctrl->namespaces_rwsem);
              }
              mutex_unlock(&dev->shutdown_lock);
            }
          }
        }
      }
    }
  }

(C) happens due to

  nvme_sync_queues(ctrl) {
    nvme_sync_io_queues(ctrl) {
      down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
      /*** Question comes here. Please check the last block in this mail. ***/
      blk_sync_queue(ns->queue) {
        cancel_work_sync(&q->timeout_work) {
          __flush_work(&q->timeout_work, true) {
            // Might wait for completion of blk_mq_timeout_work() with ctrl->namespaces_rwsem held.
          }
        }
      }
      up_read(&ctrl->namespaces_rwsem);
    }
  }

>                       In the related function call paths, it looks that
> ctrl->namespaces_rwsem is locked only for read, so the deadlock scenario does
> not look possible for me.

Right.

>                           (Maybe I'm misunderstanding something...)

Although ctrl->namespaces_rwsem is a rw-sem and is held only for read in these
paths, there are paths which hold ctrl->namespaces_rwsem for write. If someone is
waiting to take that rw-sem for write, these down_read(&ctrl->namespaces_rwsem)
calls will be blocked.

That is, it also depends on whether
  down_read(&ctrl->namespaces_rwsem);
and
  down_write(&ctrl->namespaces_rwsem);
might run in parallel.
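
To illustrate with the locks from this splat (this interleaving is mine, not
taken from the report, and it assumes the usual rwsem fairness where a queued
writer blocks later readers):

  reset work                            scan/remove work
  ----------                            ----------------
  down_read(&ctrl->namespaces_rwsem);
                                        down_write(&ctrl->namespaces_rwsem); // queued, waits for the reader
  cancel_work_sync(&q->timeout_work);   // waits for timeout work to finish
      timeout work:
        nvme_dev_disable() {
          mutex_lock(&dev->shutdown_lock);
          nvme_start_freeze() {
            down_read(&ctrl->namespaces_rwsem); // blocked behind the queued writer => cycle
          }
        }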

Question:
  Does someone else call down_write(&ctrl->namespaces_rwsem) immediately after
  someone's down_read(&ctrl->namespaces_rwsem) succeeded?

nvme_alloc_ns()/nvme_ns_remove()/nvme_remove_invalid_namespaces()/nvme_remove_namespaces() call
down_write(&ctrl->namespaces_rwsem), and some of them are called from work callback functions
which might run in parallel with other work callback functions?




* Re: lockdep WARNING at blktests block/011
  2022-09-30 11:06 ` Tetsuo Handa
@ 2022-10-03 13:32   ` Shinichiro Kawasaki
  2022-10-03 15:28     ` Keith Busch
  0 siblings, 1 reply; 16+ messages in thread
From: Shinichiro Kawasaki @ 2022-10-03 13:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-block, linux-nvme, Tejun Heo, Johannes Thumshirn, Damien Le Moal

Tetsuo, thank you very much for your comments. They cleared up one of the points
I had misunderstood.

On Sep 30, 2022 / 20:06, Tetsuo Handa wrote:
> On 2022/09/30 9:19, Shinichiro Kawasaki wrote:
> > Kernel v6.0-rc7 triggered the WARN "possible circular locking dependency
> > detected" at blktests test case block/011 for NVME devices [1]. The trigger
> > commit was c0feea594e05 ("workqueue: don't skip lockdep work dependency in
> > cancel_work_sync()"). The commit was backported to v5.15.71 stable kernel and
> > the WARN is observed with the stable kernel also. It looks that the possible
> > deadlock scenario has been hidden for long time and the commit revealed it now.
> 
> Right.
> 
> > I tried to understand the deadlock scenario, but can not tell if it is for real,
> > or false-positive. Help to understand the scenario will be appreciated. The
> > lockdep splat mentions three locks (ctrl->namespaces_rwsem, dev->shutdown_lock
> > and q->timeout_work).
> 
> Right.
> 
> This is a circular locking problem which involves three locks.
> 
>   (A) (work_completion)(&q->timeout_work) --> &dev->shutdown_lock
> 
>   (B) &dev->shutdown_lock --> &ctrl->namespaces_rwsem
> 
>   (C) &ctrl->namespaces_rwsem --> (work_completion)(&q->timeout_work)
> 
> (A) and (B) happens due to
> 
>   INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
> 
>   blk_mq_timeout_work(work) { // Is blocking __flush_work() from cancel_work_sync().
>     blk_mq_queue_tag_busy_iter(blk_mq_check_expired) {
>       bt_for_each(blk_mq_check_expired) == blk_mq_check_expired() {
>         blk_mq_rq_timed_out() {
>           req->q->mq_ops->timeout(req) == nvme_timeout(req) {
>             nvme_dev_disable() {
>               mutex_lock(&dev->shutdown_lock); // Holds dev->shutdown_lock
>               nvme_start_freeze(&dev->ctrl) {
>                 down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
>                 up_read(&ctrl->namespaces_rwsem);
>               }
>               mutex_unlock(&dev->shutdown_lock);
>             }
>           }
>         }
>       }
>     }
>   }
> 
> (C) happens due to
> 
>   nvme_sync_queues(ctrl) {
>     nvme_sync_io_queues(ctrl) {
>       down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
>       /*** Question comes here. Please check the last block in this mail. ***/
>       blk_sync_queue(ns->queue) {
> 	cancel_work_sync(&q->timeout_work) {
>           __flush_work((&q->timeout_work, true) {
>             // Might wait for completion of blk_mq_timeout_work() with ctrl->namespaces_rwsem held.
>           }
>         }
>       }
>       up_read(&ctrl->namespaces_rwsem);
>     }
>   }
> 
> >                       In the related function call paths, it looks that
> > ctrl->namespaces_rwsem is locked only for read, so the deadlock scenario does
> > not look possible for me.
> 
> Right.
> 
> >                           (Maybe I'm misunderstanding something...)
> 
> Although ctrl->namespaces_rwsem is a rw-sem and is held for read in these paths,
> there are paths which hold ctrl->namespaces_rwsem for write. If someone is trying
> to hold that rw-sem for write, these down_read(&ctrl->namespaces_rwsem) will be blocked.

Ah, that's the point. If a write lock request arrives in between the two read
locks, the read locks can form the circular deadlock.

> 
> That is, it also depends on whether
>   down_read(&ctrl->namespaces_rwsem);
> and
>   down_write(&ctrl->namespaces_rwsem);
> might run in parallel.
> 
> Question:
>   Does someone else call down_write(&ctrl->namespaces_rwsem) immediately after
>   someone's down_read(&ctrl->namespaces_rwsem) succeeded?
> 
> nvme_alloc_ns()/nvme_ns_remove()/nvme_remove_invalid_namespaces()/nvme_remove_namespaces() calls
> down_write(&ctrl->namespaces_rwsem), and some of these are called from work callback functions
> which might run in parallel with other work callback functions?

Though I'm new to the NVMe code, let me try to answer your question based on my
understanding. (NVMe experts' comments are more valuable. Please chime in.)

The NVMe host manages dev->ctrl.state so that conflicting works do not run. Scan
work, delete work and reset work therefore do not run in parallel in most cases.
So, in most cases, the write lock taken by the scan/delete work does not happen
in parallel with the reset work, which calls nvme_sync_queues() and takes the
read lock. Having said that, I can still think of a rare scenario [1] in which
the write lock is taken after the read lock in the reset work context. So I
think the answer to your question is "Yes". From this point of view, the
deadlock scenario is very rare but still looks possible, in theory.

BTW, another question came up while reading the code. I found that
nvme_reset_work() calls nvme_dev_disable() before nvme_sync_queues(), so I think
the NVMe controller is already disabled when the reset work calls
nvme_sync_queues(). Then another nvme_dev_disable() call in the I/O timeout path
does not result in nvme_start_freeze() and the second read lock. From this point
of view, the deadlock scenario does not look possible. If this guess is correct,
why does the lockdep splat show the deadlock scenario?
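
For reference, the ordering I mean in drivers/nvme/host/pci.c looks roughly like
this (heavily trimmed from my reading of v6.0, just to show the order of the two
calls; not a patch):

  static void nvme_reset_work(struct work_struct *work)
  {
  	...
  	/* if we are called to reset a live controller, shut it down first */
  	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
  		nvme_dev_disable(dev, false);
  	nvme_sync_queues(&dev->ctrl);	/* takes ctrl->namespaces_rwsem, cancels timeout work */
  	...
  }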

In addition, I found that nvme_reset_prepare() and nvme_suspend() also call
nvme_sync_queues(). These functions are not in the lockdep splat, but they may
have a similar deadlock scenario. I will spend some more time on them tomorrow.

[1]

TASK A            scan work TASK       TASK B            reset work TASK
--------------------------------------------------------------------------------
nvme_sysfs_rescan()
nvme_queue_scan()
check status is LIVE
queue scan work
                  nvme_scan_work()
                  check status is LIVE
                                       nvme_sysfs_reset()
                                       nvme_reset_ctrl()
                                       change status to RESETTING
                                       queue reset work
                                                         nvme_reset_work()
                                                         check status is RESETTING
                                                         nvme_sync_queues()
                                                         read lock ctrl->namespaces_rwsem
                  write lock ctrl->namespaces_rwsem

-- 
Shin'ichiro Kawasaki


* Re: lockdep WARNING at blktests block/011
  2022-10-03 13:32   ` Shinichiro Kawasaki
@ 2022-10-03 15:28     ` Keith Busch
  2022-10-04 10:44       ` Shinichiro Kawasaki
  2022-10-07 20:34       ` Bart Van Assche
  0 siblings, 2 replies; 16+ messages in thread
From: Keith Busch @ 2022-10-03 15:28 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Tetsuo Handa, linux-block, linux-nvme, Tejun Heo,
	Johannes Thumshirn, Damien Le Moal

On Mon, Oct 03, 2022 at 01:32:41PM +0000, Shinichiro Kawasaki wrote:
> 
> BTW, I came up to another question during code read. I found nvme_reset_work()
> calls nvme_dev_disable() before nvme_sync_queues(). So, I think the NVME
> controller is already disabled when the reset work calls nvme_sync_queues().

Right, everything previously outstanding has been reclaimed, and the queues are
quiesced at this point. There's nothing for timeout work to wait for, and the
sync is just ensuring every timeout work has returned.

It looks like a timeout is required in order to hit this reported deadlock, but
the driver ensures there's nothing to timeout prior to syncing the queues. I
don't think lockdep could reasonably know that, though.



* Re: lockdep WARNING at blktests block/011
  2022-10-03 15:28     ` Keith Busch
@ 2022-10-04 10:44       ` Shinichiro Kawasaki
  2022-10-04 11:10         ` Tetsuo Handa
  2022-10-04 22:34         ` Damien Le Moal
  2022-10-07 20:34       ` Bart Van Assche
  1 sibling, 2 replies; 16+ messages in thread
From: Shinichiro Kawasaki @ 2022-10-04 10:44 UTC (permalink / raw)
  To: Keith Busch
  Cc: Tetsuo Handa, linux-block, linux-nvme, Tejun Heo,
	Johannes Thumshirn, Damien Le Moal

On Oct 03, 2022 / 09:28, Keith Busch wrote:
> On Mon, Oct 03, 2022 at 01:32:41PM +0000, Shinichiro Kawasaki wrote:
> > 
> > BTW, I came up to another question during code read. I found nvme_reset_work()
> > calls nvme_dev_disable() before nvme_sync_queues(). So, I think the NVME
> > controller is already disabled when the reset work calls nvme_sync_queues().
> 
> Right, everything previously outstanding has been reclaimed, and the queues are
> quiesced at this point. There's nothing for timeout work to wait for, and the
> sync is just ensuring every timeout work has returned.

Thank you Keith, good to know that we do not need to worry about the deadlock
scenario with nvme_reset_work() :)  I also checked nvme_reset_prepare() and
nvme_suspend(). They also call nvme_dev_disable() or nvme_start/wait_freeze()
before nvme_sync_queues(), so I think the queues are quiesced in those contexts
as well and we do not need to worry about them either.

> 
> It looks like a timeout is required in order to hit this reported deadlock, but
> the driver ensures there's nothing to timeout prior to syncing the queues. I
> don't think lockdep could reasonably know that, though.

I see... From the blktests point of view, I would like to suppress the lockdep
splat that triggers the block/011 failure. I created a quick patch using
lockdep_off() and lockdep_on() which skips the lockdep check for the read lock in
nvme_sync_io_queues() [1], and confirmed that it avoids the lockdep splat. I
think relaxing the lockdep check like this is fine because the callers of
nvme_sync_io_queues() do call nvme_dev_disable(), nvme_start/wait_freeze() or
nvme_stop_queues() first. The queues are quiesced by the time
nvme_sync_io_queues() runs, so the deadlock scenario does not happen.

Any comment on this patch will be appreciated. If this approach is OK, I'll post
it as a formal patch for review.

[1]

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 66446f1e06c..9d091327a88 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5121,10 +5121,14 @@ void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
 {
        struct nvme_ns *ns;

+       lockdep_off();
        down_read(&ctrl->namespaces_rwsem);
+       lockdep_on();
        list_for_each_entry(ns, &ctrl->namespaces, list)
                blk_sync_queue(ns->queue);
+       lockdep_off();
        up_read(&ctrl->namespaces_rwsem);
+       lockdep_on();
 }
 EXPORT_SYMBOL_GPL(nvme_sync_io_queues);

-- 
Shin'ichiro Kawasaki


* Re: lockdep WARNING at blktests block/011
  2022-10-04 10:44       ` Shinichiro Kawasaki
@ 2022-10-04 11:10         ` Tetsuo Handa
  2022-10-04 12:23           ` Shinichiro Kawasaki
  2022-10-04 22:34         ` Damien Le Moal
  1 sibling, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2022-10-04 11:10 UTC (permalink / raw)
  To: Shinichiro Kawasaki, Keith Busch
  Cc: linux-block, linux-nvme, Tejun Heo, Johannes Thumshirn, Damien Le Moal

On 2022/10/04 19:44, Shinichiro Kawasaki wrote:
> Any comment on this patch will be appreciated. If this action approach is ok,
> I'll post as a formal patch for review.

I don't want you to make such a change.

I saw a case where a real deadlock was hidden by lockdep_set_novalidate_class().

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=62ebaf2f9261cd2367ae928a39343fcdbfe9f877
  https://groups.google.com/g/syzkaller-bugs/c/Uj9LqEUCwac/m/BhdTjWhNAQAJ

In general, this kind of deadlock possibility is better addressed by bringing
the problematic locks out of the cancel{,_delayed}_work_sync() section.

  https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a91b750fd6629354460282bbf5146c01b05c4859
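
The general shape of such a fix (purely illustrative, not the actual patch in
the link above; "obj" and its members are made up here) is to stop the work from
doing anything useful while the lock is held, and only then wait for it with the
lock dropped:

  /* problematic: the work callback also takes obj->lock */
  mutex_lock(&obj->lock);
  cancel_work_sync(&obj->work);   /* may wait forever for a callback that wants obj->lock */
  mutex_unlock(&obj->lock);

  /* preferred: */
  mutex_lock(&obj->lock);
  obj->shutting_down = true;      /* make the callback bail out early */
  mutex_unlock(&obj->lock);
  cancel_work_sync(&obj->work);   /* now safe to wait with the lock dropped */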




* Re: lockdep WARNING at blktests block/011
  2022-10-04 11:10         ` Tetsuo Handa
@ 2022-10-04 12:23           ` Shinichiro Kawasaki
  2022-10-05  8:31             ` Shinichiro Kawasaki
  0 siblings, 1 reply; 16+ messages in thread
From: Shinichiro Kawasaki @ 2022-10-04 12:23 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Keith Busch, linux-block, linux-nvme, Tejun Heo,
	Johannes Thumshirn, Damien Le Moal

On Oct 04, 2022 / 20:10, Tetsuo Handa wrote:
> On 2022/10/04 19:44, Shinichiro Kawasaki wrote:
> > Any comment on this patch will be appreciated. If this action approach is ok,
> > I'll post as a formal patch for review.
> 
> I don't want you to make such change.
> 
> I saw a case where real deadlock was hidden by lockdep_set_novalidate_class().
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=62ebaf2f9261cd2367ae928a39343fcdbfe9f877
>   https://groups.google.com/g/syzkaller-bugs/c/Uj9LqEUCwac/m/BhdTjWhNAQAJ
> 
> In general, this kind of deadlock possibility had better be addressed by bringing
> problematic locks out of cancel{,_delayed}_work_sync() section.
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a91b750fd6629354460282bbf5146c01b05c4859

Thanks for the comment. Then, I think the question is how to move the
blk_sync_queue() call out of the ctrl->namespaces_rwsem critical section in
nvme_sync_io_queues():

void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_sync_queue(ns->queue);
	up_read(&ctrl->namespaces_rwsem);
}

I'm not yet sure how we can do it. I guess the ctrl->namespaces list needs to be
copied to a temporary array so that it can be referenced outside the critical
section. We also need to hold a kref on each ns so that it is not freed. I will
try to implement this tomorrow.
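
The rough shape I have in mind is something like this (completely untested
sketch; error handling is still to be worked out, and it assumes that taking
ns->kref and dropping it with nvme_put_ns() is enough to keep the namespaces
alive):

  void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
  {
  	struct nvme_ns *ns, **tbl;
  	unsigned int n = 0, i = 0;

  	down_read(&ctrl->namespaces_rwsem);
  	list_for_each_entry(ns, &ctrl->namespaces, list)
  		n++;
  	tbl = kcalloc(n, sizeof(*tbl), GFP_KERNEL);
  	if (tbl) {
  		list_for_each_entry(ns, &ctrl->namespaces, list) {
  			kref_get(&ns->kref);
  			tbl[i++] = ns;
  		}
  	}
  	up_read(&ctrl->namespaces_rwsem);

  	if (!tbl)
  		return;		/* fallback handling to be decided */

  	/* sync the queues without holding ctrl->namespaces_rwsem */
  	for (i = 0; i < n; i++) {
  		blk_sync_queue(tbl[i]->queue);
  		nvme_put_ns(tbl[i]);
  	}
  	kfree(tbl);
  }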

-- 
Shin'ichiro Kawasaki


* Re: lockdep WARNING at blktests block/011
  2022-10-04 10:44       ` Shinichiro Kawasaki
  2022-10-04 11:10         ` Tetsuo Handa
@ 2022-10-04 22:34         ` Damien Le Moal
  1 sibling, 0 replies; 16+ messages in thread
From: Damien Le Moal @ 2022-10-04 22:34 UTC (permalink / raw)
  To: Shinichiro Kawasaki, Keith Busch
  Cc: Tetsuo Handa, linux-block, linux-nvme, Tejun Heo, Johannes Thumshirn

On 10/4/22 19:44, Shinichiro Kawasaki wrote:
> On Oct 03, 2022 / 09:28, Keith Busch wrote:
>> On Mon, Oct 03, 2022 at 01:32:41PM +0000, Shinichiro Kawasaki wrote:
>>>
>>> BTW, I came up to another question during code read. I found nvme_reset_work()
>>> calls nvme_dev_disable() before nvme_sync_queues(). So, I think the NVME
>>> controller is already disabled when the reset work calls nvme_sync_queues().
>>
>> Right, everything previously outstanding has been reclaimed, and the queues are
>> quiesced at this point. There's nothing for timeout work to wait for, and the
>> sync is just ensuring every timeout work has returned.
> 
> Thank you Keith, good to know that we do not need to worry about the deadlock
> scenario with nvme_reset_work() :)  I also checked nvme_reset_prepare() and
> nvme_suspend(). They also call nvme_dev_disable() or nvme_start/wait_freeze()
> before nvme_sync_queues(). So I think the queues are quiesced in these context
> also, and do not worry them either.
> 
>>
>> It looks like a timeout is required in order to hit this reported deadlock, but
>> the driver ensures there's nothing to timeout prior to syncing the queues. I
>> don't think lockdep could reasonably know that, though.
> 
> I see... From blktests point of view, I would like to suppress the lockdep splat
> which triggers the block/011 failure. I created a quick patch using
> lockdep_off() and lockdep_on() which skips lockdep check for the read lock in
> nvme_sync_io_queues() [1] and confirmed it avoids the lockdep splat. I think
> this lockdep check relax is fine because nvme_sync_io_queues() callers do call
> nvme_dev_disable(), nvme_start/wait_freeze() or nvme_stop_queues(). The queues
> are quiesced at nvme_sync_io_queues() and the deadlock scenario does not happen.
> 
> Any comment on this patch will be appreciated. If this action approach is ok,
> I'll post as a formal patch for review.
> 
> [1]
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 66446f1e06c..9d091327a88 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -5121,10 +5121,14 @@ void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
>  {
>         struct nvme_ns *ns;
> 

I think there should be a comment here explaining why you turn off lockdep.

> +       lockdep_off();
>         down_read(&ctrl->namespaces_rwsem);
> +       lockdep_on();
>         list_for_each_entry(ns, &ctrl->namespaces, list)
>                 blk_sync_queue(ns->queue);
> +       lockdep_off();
>         up_read(&ctrl->namespaces_rwsem);
> +       lockdep_on();
>  }
>  EXPORT_SYMBOL_GPL(nvme_sync_io_queues);
> 

-- 
Damien Le Moal
Western Digital Research




* Re: lockdep WARNING at blktests block/011
  2022-10-04 12:23           ` Shinichiro Kawasaki
@ 2022-10-05  8:31             ` Shinichiro Kawasaki
  2022-10-05 10:00               ` Tetsuo Handa
  0 siblings, 1 reply; 16+ messages in thread
From: Shinichiro Kawasaki @ 2022-10-05  8:31 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Keith Busch, linux-block, linux-nvme, Tejun Heo,
	Johannes Thumshirn, Damien Le Moal

On Oct 04, 2022 / 21:23, Shin'ichiro Kawasaki wrote:
> On Oct 04, 2022 / 20:10, Tetsuo Handa wrote:
> > On 2022/10/04 19:44, Shinichiro Kawasaki wrote:
> > > Any comment on this patch will be appreciated. If this action approach is ok,
> > > I'll post as a formal patch for review.
> > 
> > I don't want you to make such change.
> > 
> > I saw a case where real deadlock was hidden by lockdep_set_novalidate_class().
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=62ebaf2f9261cd2367ae928a39343fcdbfe9f877
> >   https://groups.google.com/g/syzkaller-bugs/c/Uj9LqEUCwac/m/BhdTjWhNAQAJ
> > 
> > In general, this kind of deadlock possibility had better be addressed by bringing
> > problematic locks out of cancel{,_delayed}_work_sync() section.
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a91b750fd6629354460282bbf5146c01b05c4859
> 
> Thanks for the comment. Then, I think the question is how to move the
> blk_sync_queue() call out of the ctrl->namespaces_rwsem critical section in
> nvme_sync_io_queues():
> 
> void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
> {
> 	struct nvme_ns *ns;
> 
> 	down_read(&ctrl->namespaces_rwsem);
> 	list_for_each_entry(ns, &ctrl->namespaces, list)
> 		blk_sync_queue(ns->queue);
> 	up_read(&ctrl->namespaces_rwsem);
> }
> 
> I'm not yet sure how we can do it. I guess it is needed to copy the
> ctrl->namespaces list to a temporary array to refer out of the critical section.
> Also need to keep kref of each ns not to free. Will try to implement tomorrow.

Following a suggestion from Johannes, I created the patch below and confirmed
that it avoids the lockdep splat. It moves the elements of ctrl->namespaces to a
list head on the stack, so that blk_sync_queue() can be called outside the
ctrl->namespaces_rwsem critical section. Then the list elements are put back
onto ctrl->namespaces. I think this temporary list move is fine since the
controller is in the process of being reset, stopped or torn down, but this
point needs confirmation from the NVMe experts.

Keith, what do you think?

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 66446f1e06c..6ed68bb6008 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5120,11 +5120,27 @@ EXPORT_SYMBOL_GPL(nvme_start_admin_queue);
 void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
+	LIST_HEAD(splice);
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
+	/*
+	 * Calling blk_sync_queue() inside the ctrl->namespaces_rwsem critical
+	 * section triggers a deadlock warning from lockdep, since
+	 * cancel_work_sync() in blk_sync_queue() waits for nvme_timeout() work
+	 * completion, which may take ctrl->namespaces_rwsem. To avoid the
+	 * deadlock possibility, call blk_sync_queue() outside the critical
+	 * section by temporarily moving the ctrl->namespaces list to the stack.
+	 */
+
+	down_write(&ctrl->namespaces_rwsem);
+	list_splice_init(&ctrl->namespaces, &splice);
+	up_write(&ctrl->namespaces_rwsem);
+
+	list_for_each_entry(ns, &splice, list)
 		blk_sync_queue(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+
+	down_write(&ctrl->namespaces_rwsem);
+	list_splice(&splice, &ctrl->namespaces);
+	up_write(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_sync_io_queues);
 


-- 
Shin'ichiro Kawasaki


* Re: lockdep WARNING at blktests block/011
  2022-10-05  8:31             ` Shinichiro Kawasaki
@ 2022-10-05 10:00               ` Tetsuo Handa
  2022-10-05 14:20                 ` Keith Busch
  0 siblings, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2022-10-05 10:00 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Keith Busch, linux-block, linux-nvme, Tejun Heo,
	Johannes Thumshirn, Damien Le Moal

On 2022/10/05 17:31, Shinichiro Kawasaki wrote:
> @@ -5120,11 +5120,27 @@ EXPORT_SYMBOL_GPL(nvme_start_admin_queue);
>  void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
>  {
>  	struct nvme_ns *ns;
> +	LIST_HEAD(splice);
>  
> -	down_read(&ctrl->namespaces_rwsem);
> -	list_for_each_entry(ns, &ctrl->namespaces, list)
> +	/*
> +	 * blk_sync_queues() call in ctrl->snamespaces_rwsem critical section
> +	 * triggers deadlock warning by lockdep since cancel_work_sync() in
> +	 * blk_sync_queue() waits for nvme_timeout() work completion which may
> +	 * lock the ctrl->snamespaces_rwsem. To avoid the deadlock possibility,
> +	 * call blk_sync_queues() out of the critical section by moving the
> +         * ctrl->namespaces list elements to the stack list head temporally.
> +	 */
> +
> +	down_write(&ctrl->namespaces_rwsem);
> +	list_splice_init(&ctrl->namespaces, &splice);
> +	up_write(&ctrl->namespaces_rwsem);

Does this work?

ctrl->namespaces being empty when calling blk_sync_queue() means that
e.g. nvme_start_freeze() cannot find namespaces to freeze, doesn't it?

  blk_mq_timeout_work(work) { // Is blocking __flush_work() from cancel_work_sync().
    blk_mq_queue_tag_busy_iter(blk_mq_check_expired) {
      bt_for_each(blk_mq_check_expired) == blk_mq_check_expired() {
        blk_mq_rq_timed_out() {
          req->q->mq_ops->timeout(req) == nvme_timeout(req) {
            nvme_dev_disable() {
              mutex_lock(&dev->shutdown_lock); // Holds dev->shutdown_lock
              nvme_start_freeze(&dev->ctrl) {
                down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
                //blk_freeze_queue_start(ns->queue); // <= Never be called because ctrl->namespaces is empty.
                up_read(&ctrl->namespaces_rwsem);
              }
              mutex_unlock(&dev->shutdown_lock);
            }
          }
        }
      }
    }
  }

Are you sure that down_read(&ctrl->namespaces_rwsem) users won't run
when ctrl->namespaces is temporarily made empty? (And if you are sure
that down_read(&ctrl->namespaces_rwsem) users won't run when
ctrl->namespaces is temporarily made empty, why does ctrl->namespaces_rwsem
need to be a rw-sem rather than a plain mutex or spinlock?)

> +
> +	list_for_each_entry(ns, &splice, list)
>  		blk_sync_queue(ns->queue);
> -	up_read(&ctrl->namespaces_rwsem);
> +
> +	down_write(&ctrl->namespaces_rwsem);
> +	list_splice(&splice, &ctrl->namespaces);
> +	up_write(&ctrl->namespaces_rwsem);
>  }
>  EXPORT_SYMBOL_GPL(nvme_sync_io_queues);

I don't know the whole dependency chain, but you might be able to add a
"struct nvme_ctrl"->sync_io_queue_mutex which is held to serialize
nvme_sync_io_queues() against the down_write(&ctrl->namespaces_rwsem) users.

If we can guarantee that the ctrl->namespaces_rwsem => ctrl->sync_io_queue_mutex
ordering is impossible, nvme_sync_io_queues() can take ctrl->sync_io_queue_mutex
rather than ctrl->namespaces_rwsem, and the down_write(&ctrl->namespaces_rwsem)/
up_write(&ctrl->namespaces_rwsem) users are replaced with the

  mutex_lock(&ctrl->sync_io_queue_mutex);
  down_write(&ctrl->namespaces_rwsem);

and

  up_write(&ctrl->namespaces_rwsem);
  mutex_unlock(&ctrl->sync_io_queue_mutex);

sequences respectively.
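
In other words, something like this (completely untested; sync_io_queue_mutex
is a field I am making up here, it does not exist in struct nvme_ctrl today):

  void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
  {
  	struct nvme_ns *ns;

  	/* no down_write() user can modify the list while this mutex is held */
  	mutex_lock(&ctrl->sync_io_queue_mutex);
  	list_for_each_entry(ns, &ctrl->namespaces, list)
  		blk_sync_queue(ns->queue);
  	mutex_unlock(&ctrl->sync_io_queue_mutex);
  }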




* Re: lockdep WARNING at blktests block/011
  2022-10-05 10:00               ` Tetsuo Handa
@ 2022-10-05 14:20                 ` Keith Busch
  2022-10-06  2:30                   ` Shinichiro Kawasaki
  0 siblings, 1 reply; 16+ messages in thread
From: Keith Busch @ 2022-10-05 14:20 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Shinichiro Kawasaki, linux-block, linux-nvme, Tejun Heo,
	Johannes Thumshirn, Damien Le Moal

On Wed, Oct 05, 2022 at 07:00:30PM +0900, Tetsuo Handa wrote:
> On 2022/10/05 17:31, Shinichiro Kawasaki wrote:
> > @@ -5120,11 +5120,27 @@ EXPORT_SYMBOL_GPL(nvme_start_admin_queue);
> >  void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
> >  {
> >  	struct nvme_ns *ns;
> > +	LIST_HEAD(splice);
> >  
> > -	down_read(&ctrl->namespaces_rwsem);
> > -	list_for_each_entry(ns, &ctrl->namespaces, list)
> > +	/*
> > +	 * blk_sync_queues() call in ctrl->snamespaces_rwsem critical section
> > +	 * triggers deadlock warning by lockdep since cancel_work_sync() in
> > +	 * blk_sync_queue() waits for nvme_timeout() work completion which may
> > +	 * lock the ctrl->snamespaces_rwsem. To avoid the deadlock possibility,
> > +	 * call blk_sync_queues() out of the critical section by moving the
> > +         * ctrl->namespaces list elements to the stack list head temporally.
> > +	 */
> > +
> > +	down_write(&ctrl->namespaces_rwsem);
> > +	list_splice_init(&ctrl->namespaces, &splice);
> > +	up_write(&ctrl->namespaces_rwsem);
> 
> Does this work?
> 
> ctrl->namespaces being empty when calling blk_sync_queue() means that
> e.g. nvme_start_freeze() cannot find namespaces to freeze, doesn't it?

There can't be anything to timeout at this point. The controller is disabled
prior to syncing the queues. Not only is there no IO for timeout work to
operate on, the controller state is already disabled, so a subsequent freeze
would be skipped.
 
>   blk_mq_timeout_work(work) { // Is blocking __flush_work() from cancel_work_sync().
>     blk_mq_queue_tag_busy_iter(blk_mq_check_expired) {
>       bt_for_each(blk_mq_check_expired) == blk_mq_check_expired() {
>         blk_mq_rq_timed_out() {
>           req->q->mq_ops->timeout(req) == nvme_timeout(req) {
>             nvme_dev_disable() {
>               mutex_lock(&dev->shutdown_lock); // Holds dev->shutdown_lock
>               nvme_start_freeze(&dev->ctrl) {
>                 down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
>                 //blk_freeze_queue_start(ns->queue); // <= Never be called because ctrl->namespaces is empty.
>                 up_read(&ctrl->namespaces_rwsem);
>               }
>               mutex_unlock(&dev->shutdown_lock);
>             }
>           }
>         }
>       }
>     }
>   }
> 
> Are you sure that down_read(&ctrl->namespaces_rwsem) users won't run
> when ctrl->namespaces is temporarily made empty? (And if you are sure
> that down_read(&ctrl->namespaces_rwsem) users won't run when
> ctrl->namespaces is temporarily made empty, why ctrl->namespaces_rwsem
> needs to be a rw-sem rather than a plain mutex or spinlock ?)

We iterate the list in some target fast paths, so we don't want this to be an
exclusive section for readers.
 
> > +	list_for_each_entry(ns, &splice, list)
> >  		blk_sync_queue(ns->queue);
> > -	up_read(&ctrl->namespaces_rwsem);
> > +
> > +	down_write(&ctrl->namespaces_rwsem);
> > +	list_splice(&splice, &ctrl->namespaces);
> > +	up_write(&ctrl->namespaces_rwsem);
> >  }
> >  EXPORT_SYMBOL_GPL(nvme_sync_io_queues);
> 
> I don't know about dependency chain, but you might be able to add
> "struct nvme_ctrl"->sync_io_queue_mutex which is held for serializing
> nvme_sync_io_queues() and down_write(&ctrl->namespaces_rwsem) users?
> 
> If we can guarantee that ctrl->namespaces_rwsem => ctrl->sync_io_queue_mutex
> is impossible, nvme_sync_io_queues() can use ctrl->sync_io_queue_mutex
> rather than ctrl->namespaces_rwsem, and down_write(&ctrl->namespaces_rwsem)/
> up_write(&ctrl->namespaces_rwsem) users are replaced with
>   mutex_lock(&ctrl->sync_io_queue_mutex);
>   down_write(&ctrl->namespaces_rwsem);
> and
>   up_write(&ctrl->namespaces_rwsem);
>   mutex_unlock(&ctrl->sync_io_queue_mutex);
> sequences respectively.
> 



* Re: lockdep WARNING at blktests block/011
  2022-10-05 14:20                 ` Keith Busch
@ 2022-10-06  2:30                   ` Shinichiro Kawasaki
  2022-10-07  1:36                     ` Shinichiro Kawasaki
  0 siblings, 1 reply; 16+ messages in thread
From: Shinichiro Kawasaki @ 2022-10-06  2:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: Tetsuo Handa, linux-block, linux-nvme, Tejun Heo,
	Johannes Thumshirn, Damien Le Moal

On Oct 05, 2022 / 08:20, Keith Busch wrote:
> On Wed, Oct 05, 2022 at 07:00:30PM +0900, Tetsuo Handa wrote:
> > On 2022/10/05 17:31, Shinichiro Kawasaki wrote:
> > > @@ -5120,11 +5120,27 @@ EXPORT_SYMBOL_GPL(nvme_start_admin_queue);
> > >  void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
> > >  {
> > >  	struct nvme_ns *ns;
> > > +	LIST_HEAD(splice);
> > >  
> > > -	down_read(&ctrl->namespaces_rwsem);
> > > -	list_for_each_entry(ns, &ctrl->namespaces, list)
> > > +	/*
> > > +	 * blk_sync_queues() call in ctrl->snamespaces_rwsem critical section
> > > +	 * triggers deadlock warning by lockdep since cancel_work_sync() in
> > > +	 * blk_sync_queue() waits for nvme_timeout() work completion which may
> > > +	 * lock the ctrl->snamespaces_rwsem. To avoid the deadlock possibility,
> > > +	 * call blk_sync_queues() out of the critical section by moving the
> > > +         * ctrl->namespaces list elements to the stack list head temporally.
> > > +	 */
> > > +
> > > +	down_write(&ctrl->namespaces_rwsem);
> > > +	list_splice_init(&ctrl->namespaces, &splice);
> > > +	up_write(&ctrl->namespaces_rwsem);
> > 
> > Does this work?
> > 
> > ctrl->namespaces being empty when calling blk_sync_queue() means that
> > e.g. nvme_start_freeze() cannot find namespaces to freeze, doesn't it?
> 
> There can't be anything to timeout at this point. The controller is disabled
> prior to syncing the queues. Not only is there no IO for timeout work to
> operate on, the controller state is already disabled, so a subsequent freeze
> would be skipped.

Thank you. So, this temporary list move approach should be ok.

>  
> >   blk_mq_timeout_work(work) { // Is blocking __flush_work() from cancel_work_sync().
> >     blk_mq_queue_tag_busy_iter(blk_mq_check_expired) {
> >       bt_for_each(blk_mq_check_expired) == blk_mq_check_expired() {
> >         blk_mq_rq_timed_out() {
> >           req->q->mq_ops->timeout(req) == nvme_timeout(req) {
> >             nvme_dev_disable() {
> >               mutex_lock(&dev->shutdown_lock); // Holds dev->shutdown_lock
> >               nvme_start_freeze(&dev->ctrl) {
> >                 down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
> >                 //blk_freeze_queue_start(ns->queue); // <= Never be called because ctrl->namespaces is empty.
> >                 up_read(&ctrl->namespaces_rwsem);
> >               }
> >               mutex_unlock(&dev->shutdown_lock);
> >             }
> >           }
> >         }
> >       }
> >     }
> >   }
> > 
> > Are you sure that down_read(&ctrl->namespaces_rwsem) users won't run
> > when ctrl->namespaces is temporarily made empty? (And if you are sure
> > that down_read(&ctrl->namespaces_rwsem) users won't run when
> > ctrl->namespaces is temporarily made empty, why ctrl->namespaces_rwsem
> > needs to be a rw-sem rather than a plain mutex or spinlock ?)
> 
> We iterate the list in some target fast paths, so we don't want this to be an
> exclusive section for readers.
>  
> > > +	list_for_each_entry(ns, &splice, list)
> > >  		blk_sync_queue(ns->queue);
> > > -	up_read(&ctrl->namespaces_rwsem);
> > > +
> > > +	down_write(&ctrl->namespaces_rwsem);
> > > +	list_splice(&splice, &ctrl->namespaces);
> > > +	up_write(&ctrl->namespaces_rwsem);
> > >  }
> > >  EXPORT_SYMBOL_GPL(nvme_sync_io_queues);
> > 
> > I don't know about dependency chain, but you might be able to add
> > "struct nvme_ctrl"->sync_io_queue_mutex which is held for serializing
> > nvme_sync_io_queues() and down_write(&ctrl->namespaces_rwsem) users?
> > 
> > If we can guarantee that ctrl->namespaces_rwsem => ctrl->sync_io_queue_mutex
> > is impossible, nvme_sync_io_queues() can use ctrl->sync_io_queue_mutex
> > rather than ctrl->namespaces_rwsem, and down_write(&ctrl->namespaces_rwsem)/
> > up_write(&ctrl->namespaces_rwsem) users are replaced with
> >   mutex_lock(&ctrl->sync_io_queue_mutex);
> >   down_write(&ctrl->namespaces_rwsem);
> > and
> >   up_write(&ctrl->namespaces_rwsem);
> >   mutex_unlock(&ctrl->sync_io_queue_mutex);
> > sequences respectively.

Thanks for the idea (educational for me :). I read through the code and did not
find the dependency "ctrl->namespaces_rwsem => ctrl->sync_io_queue_mutex". I also
did a quick implementation and observed that the lockdep splat disappears, so
your idea looks like it works. That approach needs a little additional memory,
though, so I'm not sure it is the best way. The list move fix still looks good
enough to me.

I would like to wait for some more comments. Or, I can post a formal patch for
wider review.

-- 
Shin'ichiro Kawasaki


* Re: lockdep WARNING at blktests block/011
  2022-10-06  2:30                   ` Shinichiro Kawasaki
@ 2022-10-07  1:36                     ` Shinichiro Kawasaki
  0 siblings, 0 replies; 16+ messages in thread
From: Shinichiro Kawasaki @ 2022-10-07  1:36 UTC (permalink / raw)
  To: Keith Busch
  Cc: Tetsuo Handa, linux-block, linux-nvme, Tejun Heo,
	Johannes Thumshirn, Damien Le Moal

On Oct 06, 2022 / 02:30, Shinichiro Kawasaki wrote:
> On Oct 05, 2022 / 08:20, Keith Busch wrote:
> > On Wed, Oct 05, 2022 at 07:00:30PM +0900, Tetsuo Handa wrote:
> > > On 2022/10/05 17:31, Shinichiro Kawasaki wrote:
> > > > @@ -5120,11 +5120,27 @@ EXPORT_SYMBOL_GPL(nvme_start_admin_queue);
> > > >  void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
> > > >  {
> > > >  	struct nvme_ns *ns;
> > > > +	LIST_HEAD(splice);
> > > >  
> > > > -	down_read(&ctrl->namespaces_rwsem);
> > > > -	list_for_each_entry(ns, &ctrl->namespaces, list)
> > > > +	/*
> > > > +	 * blk_sync_queues() call in ctrl->snamespaces_rwsem critical section
> > > > +	 * triggers deadlock warning by lockdep since cancel_work_sync() in
> > > > +	 * blk_sync_queue() waits for nvme_timeout() work completion which may
> > > > +	 * lock the ctrl->snamespaces_rwsem. To avoid the deadlock possibility,
> > > > +	 * call blk_sync_queues() out of the critical section by moving the
> > > > +         * ctrl->namespaces list elements to the stack list head temporally.
> > > > +	 */
> > > > +
> > > > +	down_write(&ctrl->namespaces_rwsem);
> > > > +	list_splice_init(&ctrl->namespaces, &splice);
> > > > +	up_write(&ctrl->namespaces_rwsem);
> > > 
> > > Does this work?
> > > 
> > > ctrl->namespaces being empty when calling blk_sync_queue() means that
> > > e.g. nvme_start_freeze() cannot find namespaces to freeze, doesn't it?
> > 
> > There can't be anything to timeout at this point. The controller is disabled
> > prior to syncing the queues. Not only is there no IO for timeout work to
> > operate on, the controller state is already disabled, so a subsequent freeze
> > would be skipped.
> 
> Thank you. So, this temporary list move approach should be ok.

Keith, while I was preparing the formal patch, I noticed a path which may call
nvme_sync_io_queues() while the NVMe controller is not disabled. Quoting from
drivers/nvme/host/pci.c:

static int nvme_suspend(struct device *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	struct nvme_dev *ndev = pci_get_drvdata(pdev);
	struct nvme_ctrl *ctrl = &ndev->ctrl;

        /* ... */

	nvme_start_freeze(ctrl);
	nvme_wait_freeze(ctrl);
	nvme_sync_queues(ctrl);

	if (ctrl->state != NVME_CTRL_LIVE)
		goto unfreeze;

When nvme_sync_queues(ctrl) is called here, ctrl->state can still be
NVME_CTRL_LIVE. So I think namespace addition or removal can happen in parallel
with this nvme_suspend() context (this is super rare, though...). If this is
true, the patch that moves the namespace list to a list head on the stack may
cause a removed (or added) namespace to reappear (or disappear) after suspend &
resume. (I think the other callers of nvme_sync_io_queues() disable the
controller first, so they are fine.)

Comments on this guess will be appreciated. If this guess is correct, Tetsuo's
suggestion would be the better one, even though it adds a new mutex.

-- 
Shin'ichiro Kawasaki


* Re: lockdep WARNING at blktests block/011
  2022-10-03 15:28     ` Keith Busch
  2022-10-04 10:44       ` Shinichiro Kawasaki
@ 2022-10-07 20:34       ` Bart Van Assche
  2022-10-10 13:31         ` Keith Busch
  1 sibling, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2022-10-07 20:34 UTC (permalink / raw)
  To: Keith Busch, Shinichiro Kawasaki
  Cc: Tetsuo Handa, linux-block, linux-nvme, Tejun Heo,
	Johannes Thumshirn, Damien Le Moal

On 10/3/22 08:28, Keith Busch wrote:
> On Mon, Oct 03, 2022 at 01:32:41PM +0000, Shinichiro Kawasaki wrote:
>>
>> BTW, I came up to another question during code read. I found nvme_reset_work()
>> calls nvme_dev_disable() before nvme_sync_queues(). So, I think the NVME
>> controller is already disabled when the reset work calls nvme_sync_queues().
> 
> Right, everything previously outstanding has been reclaimed, and the queues are
> quiesced at this point. There's nothing for timeout work to wait for, and the
> sync is just ensuring every timeout work has returned.
> 
> It looks like a timeout is required in order to hit this reported deadlock, but
> the driver ensures there's nothing to timeout prior to syncing the queues. I
> don't think lockdep could reasonably know that, though.

Hi Keith,

Commit b2a0eb1a0ac7 ("nvme-pci: Remove watchdog timer") introduced the
nvme_dev_disable() and nvme_reset_ctrl() calls in the nvme_timeout()
function. Has it been considered to invoke these two calls asynchronously
instead of synchronously from the NVMe timeout handler (queue_work())?
Although it may require some work to make sure that this approach does not
trigger any race conditions, do you agree that this should be sufficient to
make lockdep happy?
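
Something along these lines is what I have in mind (illustrative only; the
disable_work member and its handler do not exist in the driver today, and the
choice of workqueue is just an example):

  /* hypothetical work item, queued from nvme_timeout() instead of calling
   * nvme_dev_disable()/nvme_reset_ctrl() synchronously */
  static void nvme_disable_work(struct work_struct *work)
  {
  	struct nvme_dev *dev = container_of(work, struct nvme_dev, disable_work);

  	nvme_dev_disable(dev, false);
  	nvme_reset_ctrl(&dev->ctrl);
  }

  /* ...and in the timeout handler: */
  	queue_work(nvme_reset_wq, &dev->disable_work);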

Thanks,

Bart.



* Re: lockdep WARNING at blktests block/011
  2022-10-07 20:34       ` Bart Van Assche
@ 2022-10-10 13:31         ` Keith Busch
  2022-10-11 17:11           ` Bart Van Assche
  0 siblings, 1 reply; 16+ messages in thread
From: Keith Busch @ 2022-10-10 13:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Shinichiro Kawasaki, Tetsuo Handa, linux-block,
	linux-nvme, Tejun Heo, Johannes Thumshirn, Damien Le Moal

On Fri, Oct 7, 2022 at 10:34 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 10/3/22 08:28, Keith Busch wrote:
> > On Mon, Oct 03, 2022 at 01:32:41PM +0000, Shinichiro Kawasaki wrote:
> >>
> >> BTW, I came up to another question during code read. I found nvme_reset_work()
> >> calls nvme_dev_disable() before nvme_sync_queues(). So, I think the NVME
> >> controller is already disabled when the reset work calls nvme_sync_queues().
> >
> > Right, everything previously outstanding has been reclaimed, and the queues are
> > quiesced at this point. There's nothing for timeout work to wait for, and the
> > sync is just ensuring every timeout work has returned.
> >
> > It looks like a timeout is required in order to hit this reported deadlock, but
> > the driver ensures there's nothing to timeout prior to syncing the queues. I
> > don't think lockdep could reasonably know that, though.
>
> Hi Keith,
>
> Commit b2a0eb1a0ac7 ("nvme-pci: Remove watchdog timer") introduced the
> nvme_dev_disable() and nvme_reset_ctrl() calls in the nvme_timeout()
> function. Has it been considered to invoke these two calls asynchronously
> instead of synchronously from the NVMe timeout handler (queue_work())?
> Although it may require some work to make sure that this approach does not
> trigger any race conditions, do you agree that this should be sufficient to
> make lockdep happy?

We still have to sync whatever work does the reset, so that would just
shift which work the lockdep splat indicates.



* Re: lockdep WARNING at blktests block/011
  2022-10-10 13:31         ` Keith Busch
@ 2022-10-11 17:11           ` Bart Van Assche
  0 siblings, 0 replies; 16+ messages in thread
From: Bart Van Assche @ 2022-10-11 17:11 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, Shinichiro Kawasaki, Tetsuo Handa, linux-block,
	linux-nvme, Tejun Heo, Johannes Thumshirn, Damien Le Moal

On 10/10/22 06:31, Keith Busch wrote:
> On Fri, Oct 7, 2022 at 10:34 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> On 10/3/22 08:28, Keith Busch wrote:
>>> On Mon, Oct 03, 2022 at 01:32:41PM +0000, Shinichiro Kawasaki wrote:
>>>>
>>>> BTW, I came up to another question during code read. I found nvme_reset_work()
>>>> calls nvme_dev_disable() before nvme_sync_queues(). So, I think the NVME
>>>> controller is already disabled when the reset work calls nvme_sync_queues().
>>>
>>> Right, everything previously outstanding has been reclaimed, and the queues are
>>> quiesced at this point. There's nothing for timeout work to wait for, and the
>>> sync is just ensuring every timeout work has returned.
>>>
>>> It looks like a timeout is required in order to hit this reported deadlock, but
>>> the driver ensures there's nothing to timeout prior to syncing the queues. I
>>> don't think lockdep could reasonably know that, though.
>>
>> Hi Keith,
>>
>> Commit b2a0eb1a0ac7 ("nvme-pci: Remove watchdog timer") introduced the
>> nvme_dev_disable() and nvme_reset_ctrl() calls in the nvme_timeout()
>> function. Has it been considered to invoke these two calls asynchronously
>> instead of synchronously from the NVMe timeout handler (queue_work())?
>> Although it may require some work to make sure that this approach does not
>> trigger any race conditions, do you agree that this should be sufficient to
>> make lockdep happy?
> 
> We still have to sync whatever work does the reset, so that would just
> shift which work the lockdep splat indicates.

Hi Keith,

It seems like my email was not clear enough? What I meant is to queue
asynchronous work from inside the timeout handler and to wait for the
completion of that work from *outside* the timeout handler. This is not a
new approach. As an example, the SCSI core queues abort work from inside the
timeout handler and only allows new SCSI commands to be queued after error
handling has finished. I'm not claiming that this approach should be followed
by the NVMe driver - I'm only mentioning this as an example.

Thanks,

Bart.


