* v4.14-rc5 NVMeOF regression?
From: Bart Van Assche @ 2017-10-16 22:23 UTC
Hello,
It has been a while since I ran any NVMeOF tests, but when I tried to test
the v4.14-rc5 NVMeOF drivers the output shown below appeared. Is this a
known issue? The following test triggered these call stacks:
# srp-test/run_tests -n -f xfs -e deadline -r 60
Thanks,
Bart.
======================================================
WARNING: possible circular locking dependency detected
4.14.0-rc5-dbg+ #3 Not tainted
------------------------------------------------------
modprobe/2272 is trying to acquire lock:
 ("events"){+.+.}, at: [<ffffffff81084185>] flush_workqueue+0x75/0x520
but task is already holding lock:
 (device_mutex){+.+.}, at: [<ffffffffa05d6bf7>] ib_unregister_client+0x27/0x200 [ib_core]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (device_mutex){+.+.}:
   lock_acquire+0xdc/0x1d0
   __mutex_lock+0x86/0x990
   mutex_lock_nested+0x1b/0x20
   ib_register_device+0xa3/0x650 [ib_core]
   mlx4_ib_add+0xcfd/0x1440 [mlx4_ib]
   mlx4_add_device+0x45/0xe0 [mlx4_core]
   mlx4_register_interface+0xa8/0x120 [mlx4_core]
   0xffffffffa05b2051
   do_one_initcall+0x43/0x166
   do_init_module+0x5f/0x206
   load_module+0x26fe/0x2db0
   SYSC_finit_module+0xbc/0xf0
   SyS_finit_module+0xe/0x10
   entry_SYSCALL_64_fastpath+0x18/0xad
-> #2 (intf_mutex){+.+.}:
   lock_acquire+0xdc/0x1d0
   __mutex_lock+0x86/0x990
   mutex_lock_nested+0x1b/0x20
   mlx4_register_device+0x30/0xc0 [mlx4_core]
   mlx4_load_one+0x15f4/0x16f0 [mlx4_core]
   mlx4_init_one+0x4b9/0x700 [mlx4_core]
   local_pci_probe+0x42/0xa0
   work_for_cpu_fn+0x14/0x20
   process_one_work+0x1fd/0x630
   worker_thread+0x1db/0x3b0
   kthread+0x11e/0x150
   ret_from_fork+0x27/0x40
-> #1 ((&wfc.work)){+.+.}:
   lock_acquire+0xdc/0x1d0
   process_one_work+0x1da/0x630
   worker_thread+0x4e/0x3b0
   kthread+0x11e/0x150
   ret_from_fork+0x27/0x40
-> #0 ("events"){+.+.}:
   __lock_acquire+0x13b5/0x13f0
   lock_acquire+0xdc/0x1d0
   flush_workqueue+0x98/0x520
   nvmet_rdma_remove_one+0x73/0xa0 [nvmet_rdma]
   ib_unregister_client+0x18f/0x200 [ib_core]
   nvmet_rdma_exit+0xb3/0x856 [nvmet_rdma]
   SyS_delete_module+0x18c/0x1e0
   entry_SYSCALL_64_fastpath+0x18/0xad
other info that might help us debug this:
Chain exists of:
  "events" --> intf_mutex --> device_mutex
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(device_mutex);
                               lock(intf_mutex);
                               lock(device_mutex);
  lock("events");
*** DEADLOCK ***
1 lock held by modprobe/2272:
 #0: (device_mutex){+.+.}, at: [<ffffffffa05d6bf7>] ib_unregister_client+0x27/0x200 [ib_core]
stack backtrace:
CPU: 9 PID: 2272 Comm: modprobe Not tainted 4.14.0-rc5-dbg+ #3
Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 1.3.6 09/11/2012
Call Trace:
 dump_stack+0x68/0x9f
 print_circular_bug.isra.38+0x1d8/0x1e6
 __lock_acquire+0x13b5/0x13f0
 lock_acquire+0xdc/0x1d0
 ? lock_acquire+0xdc/0x1d0
 ? flush_workqueue+0x75/0x520
 flush_workqueue+0x98/0x520
 ? flush_workqueue+0x75/0x520
 nvmet_rdma_remove_one+0x73/0xa0 [nvmet_rdma]
 ? nvmet_rdma_remove_one+0x73/0xa0 [nvmet_rdma]
 ib_unregister_client+0x18f/0x200 [ib_core]
 nvmet_rdma_exit+0xb3/0x856 [nvmet_rdma]
 SyS_delete_module+0x18c/0x1e0
 entry_SYSCALL_64_fastpath+0x18/0xad
* v4.14-rc5 NVMeOF regression?
From: Sagi Grimberg @ 2017-10-17 10:01 UTC
> Hello,
Hi Bart, thanks for reporting
> It has been a while since I ran any NVMeOF tests. But when I tried to test
> the v4.14-rc5 NVMeOF drivers the output shown below appeared. Is this a
> known issue? The following test triggered these call stacks:
Yea, we probably need to silence those with:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 92a03ff5fb4d..d7504f180d93 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1988,7 +1988,7 @@ static void nvme_rdma_remove_one(struct ib_device *ib_device, void *client_data)
 	struct nvme_rdma_ctrl *ctrl;
 
 	/* Delete all controllers using this device */
-	mutex_lock(&nvme_rdma_ctrl_mutex);
+	mutex_lock_nested(&nvme_rdma_ctrl_mutex, SINGLE_DEPTH_NESTING);
 	list_for_each_entry(ctrl, &nvme_rdma_ctrl_list, list) {
 		if (ctrl->device->dev != ib_device)
 			continue;
--
I'll compile with CONFIG_DEBUG_LOCK_ALLOC and check this, but if you
happen to beat me to it, I can use your Tested-by tag.
* v4.14-rc5 NVMeOF regression?
From: Sagi Grimberg @ 2017-10-18 5:26 UTC
> Hello Sagi,
>
> Had I mentioned that I ran into a real deadlock and not only into a
> lockdep complaint?
No, and the output is a lockdep complaint which I thought needed
silencing.
If you ran into a real deadlock, did you have any other output from
the hung_task watchdog? I do not yet understand the root cause from
the lockdep info provided.
Also, do you know at which test-case this happened?
> Additionally, can you explain to me why you think the
> above patch is necessary? As far as I can see nvme_rdma_ctrl_mutex is
> not held around the ib_register_client(&nvme_rdma_ib_client) call nor
> around the ib_unregister_client(&nvme_rdma_ib_client) calls.
It's held inside ib_unregister_client; my understanding is that the
_nested annotation is required for the inner mutex lock acquire.
* v4.14-rc5 NVMeOF regression?
From: Christoph Hellwig @ 2017-10-18 6:55 UTC
On Wed, Oct 18, 2017 at 08:26:05AM +0300, Sagi Grimberg wrote:
> > Additionally, can you explain to me why you think the above patch is
> > necessary? As far as I can see nvme_rdma_ctrl_mutex is not held around
> > the ib_register_client(&nvme_rdma_ib_client) call nor around the
> > ib_unregister_client(&nvme_rdma_ib_client) calls.
>
> Its held under ib_unregister_client, my understanding is that the
> _nested annotation is required for the inner mutex lock acquire.
No. Any time we take a single lock out of order it is a bug and a
potential deadlock. _nested only works if we have multiple instances
of a lock (say, one in each inode) and we have a well-defined ordering
between the instances (e.g. by the address of the inode structure).
* v4.14-rc5 NVMeOF regression?
From: Bart Van Assche @ 2017-10-18 16:24 UTC
On Wed, 2017-10-18 at 08:26 +0300, Sagi Grimberg wrote:
> If you ran into a real deadlock, did you have any other output from
> hung_task watchdog? I do not yet understand the root cause from
> lockdep info provided.
>
> Also, do you know at which test-case this happened?
Hello Sagi,
Running test case 1 should be sufficient to trigger the deadlock. SysRq-w
produced the following output:
sysrq: SysRq : Show Blocked State
task PC stack pid father
kworker/u66:2 D 0 440 2 0x80000000
Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
Call Trace:
__schedule+0x3e9/0xb00
schedule+0x40/0x90
schedule_timeout+0x221/0x580
io_schedule_timeout+0x1e/0x50
wait_for_completion_io_timeout+0x118/0x180
blk_execute_rq+0x86/0xc0
__nvme_submit_sync_cmd+0x89/0xf0
nvmf_reg_write32+0x4b/0x90 [nvme_fabrics]
nvme_shutdown_ctrl+0x41/0xe0
nvme_rdma_shutdown_ctrl+0xca/0xd0 [nvme_rdma]
nvme_rdma_remove_ctrl+0x2b/0x40 [nvme_rdma]
nvme_rdma_del_ctrl_work+0x25/0x30 [nvme_rdma]
process_one_work+0x1fd/0x630
worker_thread+0x1db/0x3b0
kthread+0x11e/0x150
ret_from_fork+0x27/0x40
01 D 0 2868 2862 0x00000000
Call Trace:
__schedule+0x3e9/0xb00
schedule+0x40/0x90
schedule_timeout+0x260/0x580
wait_for_completion+0x108/0x170
flush_work+0x1e0/0x270
nvme_rdma_del_ctrl+0x5a/0x80 [nvme_rdma]
nvme_sysfs_delete+0x2a/0x40
dev_attr_store+0x18/0x30
sysfs_kf_write+0x45/0x60
kernfs_fop_write+0x124/0x1c0
__vfs_write+0x28/0x150
vfs_write+0xc7/0x1b0
SyS_write+0x49/0xa0
entry_SYSCALL_64_fastpath+0x18/0xad
Bart.
* v4.14-rc5 NVMeOF regression?
From: Sagi Grimberg @ 2017-10-22 17:16 UTC
>> If you ran into a real deadlock, did you have any other output from
>> hung_task watchdog? I do not yet understand the root cause from
>> lockdep info provided.
>>
>> Also, do you know at which test-case this happened?
>
> Hello Sagi,
>
> Running test case 1 should be sufficient to trigger the deadlock. SysRq-w
> produced the following output:
>
> sysrq: SysRq : Show Blocked State
> task PC stack pid father
> kworker/u66:2 D 0 440 2 0x80000000
> Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
> Call Trace:
> __schedule+0x3e9/0xb00
> schedule+0x40/0x90
> schedule_timeout+0x221/0x580
> io_schedule_timeout+0x1e/0x50
> wait_for_completion_io_timeout+0x118/0x180
> blk_execute_rq+0x86/0xc0
> __nvme_submit_sync_cmd+0x89/0xf0
> nvmf_reg_write32+0x4b/0x90 [nvme_fabrics]
> nvme_shutdown_ctrl+0x41/0xe0
> nvme_rdma_shutdown_ctrl+0xca/0xd0 [nvme_rdma]
> nvme_rdma_remove_ctrl+0x2b/0x40 [nvme_rdma]
> nvme_rdma_del_ctrl_work+0x25/0x30 [nvme_rdma]
> process_one_work+0x1fd/0x630
> worker_thread+0x1db/0x3b0
> kthread+0x11e/0x150
> ret_from_fork+0x27/0x40
> 01 D 0 2868 2862 0x00000000
> Call Trace:
> __schedule+0x3e9/0xb00
> schedule+0x40/0x90
> schedule_timeout+0x260/0x580
> wait_for_completion+0x108/0x170
> flush_work+0x1e0/0x270
> nvme_rdma_del_ctrl+0x5a/0x80 [nvme_rdma]
> nvme_sysfs_delete+0x2a/0x40
> dev_attr_store+0x18/0x30
> sysfs_kf_write+0x45/0x60
> kernfs_fop_write+0x124/0x1c0
> __vfs_write+0x28/0x150
> vfs_write+0xc7/0x1b0
> SyS_write+0x49/0xa0
> entry_SYSCALL_64_fastpath+0x18/0xad
Hi Bart,
So I've looked into this, and I want to share my findings.
I'm able to reproduce this hang when trying to disconnect from
a controller which is already in reconnecting state.
The problem as I see it is that we are returning BLK_STS_RESOURCE
from nvme_rdma_queue_rq() before we start the request timer (we fail
before blk_mq_start_request), so the request timeout never
expires (and given that we are in a deleting sequence, the command is
never expected to complete).
But for some reason I don't see the request get re-issued again.
Should the driver take care of this by calling
blk_mq_delay_run_hw_queue()?
Thinking more on this: if we are disconnecting from a controller
and we are unable to issue admin/io commands (queue state is not
LIVE), we probably should not fail with BLK_STS_RESOURCE but rather
BLK_STS_IOERR.
This change makes the issue go away:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 5b5458012c2c..be77cd098182 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1393,6 +1393,12 @@ nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue, struct request *rq)
 	    cmd->common.opcode != nvme_fabrics_command ||
 	    cmd->fabrics.fctype != nvme_fabrics_type_connect) {
 		/*
+		 * deleting state means that the ctrl will never accept
+		 * commands again, fail it permanently.
+		 */
+		if (queue->ctrl->ctrl.state == NVME_CTRL_DELETING)
+			return BLK_STS_IOERR;
+		/*
 		 * reconnecting state means transport disruption, which
 		 * can take a long time and even might fail permanently,
 		 * so we can't let incoming I/O be requeued forever.
--
Does anyone have a better idea?