[bug report] WARNING: possible circular locking at: rdma_destroy_id+0x17/0x20 [rdma_cm] triggered by blktests nvmeof-mp/002

* [bug report] WARNING: possible circular locking at: rdma_destroy_id+0x17/0x20 [rdma_cm] triggered by blktests nvmeof-mp/002
@ 2021-12-04  3:04 Yi Zhang
  2022-02-27 23:21 ` Bart Van Assche
  2022-05-25 11:01 ` Sagi Grimberg
  0 siblings, 2 replies; 19+ messages in thread
From: Yi Zhang @ 2021-12-04  3:04 UTC (permalink / raw)
  To: RDMA mailing list, open list:NVM EXPRESS DRIVER

Hello
Below WARNING triggered by blktests nvmeof-mp/002, it can be
reproduced with the latest 5.16.0-rc3 and also exists on 5.12, pls
check it.

# use_siw=1 ./check nvmeof-mp/002
nvmeof-mp/002 (File I/O on top of multipath concurrently with logout
and login (mq)) [passed]
    runtime  41.767s  ...  36.877s
# ./check nvmeof-mp/002
nvmeof-mp/002 (File I/O on top of multipath concurrently with logout
and login (mq)) [passed]
    runtime    ...  41.277s

[  609.166100] run blktests nvmeof-mp/002 at 2021-12-03 21:27:52
[  609.977883] null_blk: module loaded
[  610.538697] rdma_rxe: loaded
[  610.602763] infiniband eno1_rxe: set active
[  610.602782] infiniband eno1_rxe: added eno1
[  610.639520] eno2 speed is unknown, defaulting to 1000
[  610.644946] eno2 speed is unknown, defaulting to 1000
[  610.650318] eno2 speed is unknown, defaulting to 1000
[  610.690546] infiniband eno2_rxe: set down
[  610.690553] infiniband eno2_rxe: added eno2
[  610.691112] eno2 speed is unknown, defaulting to 1000
[  610.714293] eno2 speed is unknown, defaulting to 1000
[  610.729535] eno3 speed is unknown, defaulting to 1000
[  610.734666] eno3 speed is unknown, defaulting to 1000
[  610.740026] eno3 speed is unknown, defaulting to 1000
[  610.781372] infiniband eno3_rxe: set down
[  610.781379] infiniband eno3_rxe: added eno3
[  610.781906] eno3 speed is unknown, defaulting to 1000
[  610.806211] eno2 speed is unknown, defaulting to 1000
[  610.811508] eno3 speed is unknown, defaulting to 1000
[  610.826655] eno4 speed is unknown, defaulting to 1000
[  610.831810] eno4 speed is unknown, defaulting to 1000
[  610.837176] eno4 speed is unknown, defaulting to 1000
[  610.885844] infiniband eno4_rxe: set down
[  610.885850] infiniband eno4_rxe: added eno4
[  610.886426] eno4 speed is unknown, defaulting to 1000
[  611.266756] Loading iSCSI transport class v2.0-870.
[  611.335358] iscsi: registered transport (iser)
[  611.387969] nvmet: adding nsid 1 to subsystem nvme-test
[  611.403878] eno2 speed is unknown, defaulting to 1000
[  611.409276] eno3 speed is unknown, defaulting to 1000
[  611.414522] eno4 speed is unknown, defaulting to 1000
[  611.468712] Rounding down aligned max_sectors from 4294967295 to 4294967288
[  611.469203] db_root: cannot open: /etc/target
[  611.515816] nvmet_rdma: enabling port 1 (10.16.221.116:7777)
[  611.622158] eno2 speed is unknown, defaulting to 1000
[  611.628243] eno3 speed is unknown, defaulting to 1000
[  611.633422] eno4 speed is unknown, defaulting to 1000
[  611.703648] eno2 speed is unknown, defaulting to 1000
[  611.708949] eno3 speed is unknown, defaulting to 1000
[  611.714267] eno4 speed is unknown, defaulting to 1000
[  611.754520] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: address
resolved (0): status 0 id 00000000d0d64a6d
[  611.765454] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: route
resolved  (2): status 0 id 00000000d0d64a6d
[  611.767778] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: connect
request (4): status 0 id 000000007ccc375a
[  611.767935] nvmet_rdma:nvmet_rdma_find_get_device: nvmet_rdma:
added eno1_rxe.
[  611.777083] nvmet_rdma:nvmet_rdma_create_queue_ib: nvmet_rdma:
nvmet_rdma_create_queue_ib: max_cqe= 8191 max_sge= 32 sq_size = 289
cm_id= 000000007ccc375a
--snip--
[  612.132317] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: established
(9): status 0 id 00000000cede4b76
[  612.132452] nvme nvme0: mapped 32/0/0 default/read/poll queues.
[  612.132537] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
established (9): status 0 id 0000000070c2f22c
[  612.157849] nvmet:nvmet_execute_io_connect: nvmet: adding queue 1 to ctrl 1.
[  612.158194] nvmet:nvmet_execute_io_connect: nvmet: adding queue 2 to ctrl 1.
[  612.158548] nvmet:nvmet_execute_io_connect: nvmet: adding queue 3 to ctrl 1.
[  612.158854] nvmet:nvmet_execute_io_connect: nvmet: adding queue 4 to ctrl 1.
[  612.159173] nvmet:nvmet_execute_io_connect: nvmet: adding queue 5 to ctrl 1.
[  612.159510] nvmet:nvmet_execute_io_connect: nvmet: adding queue 6 to ctrl 1.
[  612.159853] nvmet:nvmet_execute_io_connect: nvmet: adding queue 7 to ctrl 1.
[  612.160225] nvmet:nvmet_execute_io_connect: nvmet: adding queue 8 to ctrl 1.
[  612.160572] nvmet:nvmet_execute_io_connect: nvmet: adding queue 9 to ctrl 1.
[  612.160885] nvmet:nvmet_execute_io_connect: nvmet: adding queue 10 to ctrl 1.
[  612.161216] nvmet:nvmet_execute_io_connect: nvmet: adding queue 11 to ctrl 1.
[  612.161759] nvmet:nvmet_execute_io_connect: nvmet: adding queue 12 to ctrl 1.
[  612.162123] nvmet:nvmet_execute_io_connect: nvmet: adding queue 13 to ctrl 1.
[  612.162473] nvmet:nvmet_execute_io_connect: nvmet: adding queue 14 to ctrl 1.
[  612.162793] nvmet:nvmet_execute_io_connect: nvmet: adding queue 15 to ctrl 1.
[  612.163135] nvmet:nvmet_execute_io_connect: nvmet: adding queue 16 to ctrl 1.
[  612.163458] nvmet:nvmet_execute_io_connect: nvmet: adding queue 17 to ctrl 1.
[  612.163776] nvmet:nvmet_execute_io_connect: nvmet: adding queue 18 to ctrl 1.
[  612.164103] nvmet:nvmet_execute_io_connect: nvmet: adding queue 19 to ctrl 1.
[  612.164417] nvmet:nvmet_execute_io_connect: nvmet: adding queue 20 to ctrl 1.
[  612.164700] nvmet:nvmet_execute_io_connect: nvmet: adding queue 21 to ctrl 1.
[  612.164971] nvmet:nvmet_execute_io_connect: nvmet: adding queue 22 to ctrl 1.
[  612.165275] nvmet:nvmet_execute_io_connect: nvmet: adding queue 23 to ctrl 1.
[  612.165689] nvmet:nvmet_execute_io_connect: nvmet: adding queue 24 to ctrl 1.
[  612.166082] nvmet:nvmet_execute_io_connect: nvmet: adding queue 25 to ctrl 1.
[  612.166383] nvmet:nvmet_execute_io_connect: nvmet: adding queue 26 to ctrl 1.
[  612.166663] nvmet:nvmet_execute_io_connect: nvmet: adding queue 27 to ctrl 1.
[  612.166963] nvmet:nvmet_execute_io_connect: nvmet: adding queue 28 to ctrl 1.
[  612.167273] nvmet:nvmet_execute_io_connect: nvmet: adding queue 29 to ctrl 1.
[  612.167529] nvmet:nvmet_execute_io_connect: nvmet: adding queue 30 to ctrl 1.
[  612.167804] nvmet:nvmet_execute_io_connect: nvmet: adding queue 31 to ctrl 1.
[  612.168114] nvmet:nvmet_execute_io_connect: nvmet: adding queue 32 to ctrl 1.
[  612.168523] nvme nvme0: new ctrl: NQN "nvme-test", addr 10.16.221.116:7777
[  612.293734] device-mapper: multipath service-time: version 0.3.0 loaded
[  612.296099] device-mapper: table: 253:3: multipath: error getting
device (-EBUSY)
[  612.303676] device-mapper: ioctl: error adding target to table
[  612.501014] device-mapper: table: 253:3: multipath: error getting
device (-EBUSY)
[  612.508751] device-mapper: ioctl: error adding target to table
[  612.514717] device-mapper: table: 253:3: multipath: error getting
device (-EBUSY)
[  612.522632] device-mapper: ioctl: error adding target to table
[  612.950887] device-mapper: table: 253:3: multipath: error getting
device (-EBUSY)
[  612.958616] device-mapper: ioctl: error adding target to table
[  613.869348] EXT4-fs (dm-3): mounted filesystem without journal.
Opts: (null). Quota mode: none.
[  613.869619] ext4 filesystem being mounted at
/root/kernel-tests/storage/blktests/nvme/nvmeof-mp/blktests/results/tmpdir.nvmeof-mp.002.ZYO/mnt0
supports timestamps until 2038 (0x7fffffff)
[  617.144570] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
traffic based keep-alive timer
[  617.272414] nvmet:nvmet_execute_keep_alive: nvmet: ctrl 1 update
keep-alive timer for 5 secs
[  622.775724] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
traffic based keep-alive timer
[  627.895810] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
traffic based keep-alive timer
[  633.015369] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
traffic based keep-alive timer
[  638.138209] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
traffic based keep-alive timer
[  643.254966] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
traffic based keep-alive timer
[  644.822178] eno2 speed is unknown, defaulting to 1000
[  644.827494] eno3 speed is unknown, defaulting to 1000
[  644.832797] eno4 speed is unknown, defaulting to 1000
[  648.377790] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
traffic based keep-alive timer
[  651.364570] nvme nvme0: Removing ctrl: NQN "nvme-test"
[  651.430891] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
disconnected (10): status 0 id 000000009ae8bc67
[  651.430914] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma:
cm_id= 000000009ae8bc67 queue->state= 1
[  651.430983] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
disconnected (10): status 0 id 0000000078827089
[  651.430994] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma:
cm_id= 0000000078827089 queue->state= 1
--snip--
[  651.438145] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0:
disconnected (10): status 0 id 00000000cede4b76
[  651.438150] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: disconnect
received - connection closed

[  651.448883] ======================================================
[  651.455061] WARNING: possible circular locking dependency detected
[  651.461242] 5.16.0-rc3+ #2 Tainted: G S        I
[  651.466380] ------------------------------------------------------
[  651.472559] kworker/14:4/4839 is trying to acquire lock:
[  651.477873] ffff8884c0e353e0 (&id_priv->handler_mutex){+.+.}-{3:3},
at: rdma_destroy_id+0x17/0x20 [rdma_cm]
[  651.487623]
               but task is already holding lock:
[  651.493453] ffff8883cc21fe00
((work_completion)(&queue->release_work)){+.+.}-{0:0}, at:
process_one_work+0x888/0x1740
[  651.504062]
               which lock already depends on the new lock.

[  651.512233]
               the existing dependency chain (in reverse order) is:
[  651.519715]
               -> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
[  651.527894]        lock_acquire+0x1d2/0x5a0
[  651.532079]        process_one_work+0x8da/0x1740
[  651.536700]        worker_thread+0x87/0xbf0
[  651.540886]        kthread+0x3aa/0x490
[  651.544637]        ret_from_fork+0x1f/0x30
[  651.548737]
               -> #2 ((wq_completion)events){+.+.}-{0:0}:
[  651.555349]        lock_acquire+0x1d2/0x5a0
[  651.559535]        flush_workqueue+0x109/0x15f0
[  651.564067]        nvmet_rdma_queue_connect+0x1c79/0x2610 [nvmet_rdma]
[  651.570594]        cma_cm_event_handler+0xef/0x530 [rdma_cm]
[  651.576261]        cma_ib_req_handler+0x1a76/0x4a50 [rdma_cm]
[  651.582016]        cm_process_work+0x44/0x190 [ib_cm]
[  651.587077]        cm_req_handler+0x2218/0x7040 [ib_cm]
[  651.592312]        cm_work_handler+0x1acf/0x4a30 [ib_cm]
[  651.597633]        process_one_work+0x978/0x1740
[  651.602253]        worker_thread+0x87/0xbf0
[  651.606437]        kthread+0x3aa/0x490
[  651.610190]        ret_from_fork+0x1f/0x30
[  651.614290]
               -> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}:
[  651.621249]        lock_acquire+0x1d2/0x5a0
[  651.625435]        __mutex_lock+0x155/0x15f0
[  651.629708]        cma_ib_req_handler+0x1709/0x4a50 [rdma_cm]
[  651.635462]        cm_process_work+0x44/0x190 [ib_cm]
[  651.640521]        cm_req_handler+0x2218/0x7040 [ib_cm]
[  651.645749]        cm_work_handler+0x1acf/0x4a30 [ib_cm]
[  651.651071]        process_one_work+0x978/0x1740
[  651.655689]        worker_thread+0x87/0xbf0
[  651.659874]        kthread+0x3aa/0x490
[  651.663628]        ret_from_fork+0x1f/0x30
[  651.667726]
               -> #0 (&id_priv->handler_mutex){+.+.}-{3:3}:
[  651.674511]        check_prevs_add+0x3fd/0x2480
[  651.679045]        __lock_acquire+0x23f7/0x2f80
[  651.683577]        lock_acquire+0x1d2/0x5a0
[  651.687762]        __mutex_lock+0x155/0x15f0
[  651.692036]        rdma_destroy_id+0x17/0x20 [rdma_cm]
[  651.697184]        nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
[  651.703189]        nvmet_rdma_release_queue_work+0x3b/0x90 [nvmet_rdma]
[  651.709802]        process_one_work+0x978/0x1740
[  651.714422]        worker_thread+0x87/0xbf0
[  651.718608]        kthread+0x3aa/0x490
[  651.722358]        ret_from_fork+0x1f/0x30
[  651.726459]
               other info that might help us debug this:

[  651.734457] Chain exists of:
                 &id_priv->handler_mutex --> (wq_completion)events -->
(work_completion)(&queue->release_work)

[  651.748531]  Possible unsafe locking scenario:

[  651.754451]        CPU0                    CPU1
[  651.758984]        ----                    ----
[  651.763515]   lock((work_completion)(&queue->release_work));
[  651.769174]                                lock((wq_completion)events);
[  651.775786]
lock((work_completion)(&queue->release_work));
[  651.783959]   lock(&id_priv->handler_mutex);
[  651.788233]
                *** DEADLOCK ***

[  651.794153] 2 locks held by kworker/14:4/4839:
[  651.798596]  #0: ffff88810005b948
((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x854/0x1740
[  651.808078]  #1: ffff8883cc21fe00
((work_completion)(&queue->release_work)){+.+.}-{0:0}, at:
process_one_work+0x888/0x1740
[  651.819121]
               stack backtrace:
[  651.823478] CPU: 14 PID: 4839 Comm: kworker/14:4 Tainted: G S
 I       5.16.0-rc3+ #2
[  651.831824] Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS
2.11.2 004/21/2021
[  651.839476] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
[  651.846175] Call Trace:
[  651.848628]  <TASK>
[  651.850735]  dump_stack_lvl+0x44/0x57
[  651.854401]  check_noncircular+0x280/0x320
[  651.858501]  ? print_circular_bug.isra.47+0x430/0x430
[  651.863553]  ? lock_chain_count+0x20/0x20
[  651.867566]  ? wait_for_completion+0x1e0/0x270
[  651.872010]  ? mark_lock.part.52+0xf7/0x1050
[  651.876283]  ? sched_clock_cpu+0x15/0x200
[  651.880295]  ? find_held_lock+0x3a/0x1c0
[  651.884223]  check_prevs_add+0x3fd/0x2480
[  651.888238]  ? mark_held_locks+0xb7/0x120
[  651.892248]  ? check_irq_usage+0xb50/0xb50
[  651.896346]  ? lockdep_lock+0xcb/0x1c0
[  651.900100]  ? static_obj+0xc0/0xc0
[  651.903591]  ? sched_clock_cpu+0x15/0x200
[  651.907606]  __lock_acquire+0x23f7/0x2f80
[  651.911620]  ? rcu_read_lock_bh_held+0xc0/0xc0
[  651.916063]  lock_acquire+0x1d2/0x5a0
[  651.919729]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  651.924529]  ? rcu_read_unlock+0x50/0x50
[  651.928456]  ? lock_is_held_type+0xd9/0x130
[  651.932644]  __mutex_lock+0x155/0x15f0
[  651.936393]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  651.941194]  ? kasan_quarantine_put+0xa4/0x180
[  651.945641]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  651.950442]  ? mutex_lock_io_nested+0x1460/0x1460
[  651.955146]  ? lock_is_held_type+0xd9/0x130
[  651.959334]  ? rcu_read_lock_sched_held+0xaf/0xe0
[  651.964039]  ? rcu_read_lock_bh_held+0xc0/0xc0
[  651.968487]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  651.973287]  rdma_destroy_id+0x17/0x20 [rdma_cm]
[  651.977915]  nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
[  651.983401]  nvmet_rdma_release_queue_work+0x3b/0x90 [nvmet_rdma]
[  651.989493]  process_one_work+0x978/0x1740
[  651.993595]  ? pwq_dec_nr_in_flight+0x270/0x270
[  651.998126]  worker_thread+0x87/0xbf0
[  652.001790]  ? __kthread_parkme+0xc3/0x1d0
[  652.005890]  ? process_one_work+0x1740/0x1740
[  652.010248]  kthread+0x3aa/0x490
[  652.013480]  ? set_kthread_struct+0x100/0x100
[  652.017841]  ret_from_fork+0x1f/0x30
[  652.021424]  </TASK>
[  652.050739] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
disconnected (10): status 0 id 000000007ccc375a
[  652.060318] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma:
cm_id= 000000007ccc375a queue->state= 1
[  652.069937] nvmet_rdma:nvmet_rdma_free_queue: nvmet_rdma: freeing queue 0
[  652.082650] nvmet:nvmet_stop_keep_alive_timer: nvmet: ctrl 1 stop keep-alive
[  652.488476] eno2 speed is unknown, defaulting to 1000
[  652.493657] eno3 speed is unknown, defaulting to 1000
[  652.498847] eno4 speed is unknown, defaulting to 1000
[  652.558751] rdma_rxe: rxe-ah pool destroyed with unfree'd elem
[  652.678415] rdma_rxe: unloaded

-- 
Best Regards,
  Yi Zhang

^ permalink raw reply	[flat|nested] 19+ messages in thread