* [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
@ 2024-04-18 12:52 Nilay Shroff
2024-04-21 10:28 ` Sagi Grimberg
0 siblings, 1 reply; 10+ messages in thread
From: Nilay Shroff @ 2024-04-18 12:52 UTC (permalink / raw)
To: linux-nvme
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, axboe,
Gregory Joyce, Srimannarayana Murthy Maram
Hi,
We found that the NVMe driver hangs when disk IO is in progress and we inject a PCIe error and then hot-unplug (logically, not physically) the NVMe disk.
Notes and observations:
======================
This was observed on the latest Linus kernel tree (v6.9-rc4); however, we believe the issue is also present in older kernels.
Test details:
=============
Steps to reproduce this issue:
1. Run some disk IO using fio or any other tool
2. While disk IO is running, inject pci error
3. disable the slot where nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
Kernel Logs:
============
When we follow steps described in the test details we get the below logs:
[ 295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
[ 295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[ 490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
[ 490.381614] Not tainted 6.9.0-rc4+ #8
[ 490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 490.381623] task:bash state:D stack:0 pid:2510 tgid:2510 ppid:2509 flags:0x00042080
[ 490.381632] Call Trace:
[ 490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
[ 490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
[ 490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
[ 490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
[ 490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
[ 490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
[ 490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
[ 490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
[ 490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
[ 490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
[ 490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
[ 490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
[ 490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
[ 490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
[ 490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
[ 490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
[ 490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
[ 490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
[ 490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
[ 490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
[ 490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
[ 490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
[ 490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
[ 490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
NVMe controller state:
======================
# cat /sys/class/nvme/nvme1/state
deleting (no IO)
Process State:
==============
# ps -aex
[..]
2510 pts/2 Ds+ 0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
2549 ? Ds 0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based
[..]
Observation:
============
As is apparent from the above logs, "disable-slot" (pid 2510) is waiting (in uninterruptible sleep)
for the queue to be frozen because the in-flight IO(s) could not finish. Moreover, the IO(s) which were
in flight actually time out; however, nvme_timeout() does not cancel those IOs but merely logs the error
"Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
As those in-flight IOs were never cancelled, the NVMe driver code running in the context of
"disable-slot" could not make forward progress, and the NVMe controller state remains "deleting (no IO)"
indefinitely. The only way we found to get out of this state is to reboot the system.
Proposed fix:
============
static void nvme_remove(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
	pci_set_drvdata(pdev, NULL);

	if (!pci_device_is_present(pdev)) {
		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
		nvme_dev_disable(dev, true);
	}
	flush_work(&dev->ctrl.reset_work);
	nvme_stop_ctrl(&dev->ctrl);
	nvme_remove_namespaces(&dev->ctrl);  <== here ctrl state is set to "deleting (no IO)"
	[..]
}
As shown above, nvme_remove() invokes nvme_dev_disable(), but only if the device has been
physically removed. Since nvme_dev_disable() cancels pending IOs, does it make sense to
unconditionally cancel pending IOs before moving on? Or are there any side effects if we
were to invoke nvme_dev_disable() unconditionally in the first place?
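For discussion, the unconditional-disable variant being asked about might look like the following. This is an untested sketch, not a submitted patch; it is derived from the nvme_remove() listing above by moving nvme_dev_disable() out of the presence check:

```c
/* Hypothetical sketch only: always cancel outstanding IOs on remove so
 * del_gendisk() can freeze the queues even when pci_device_is_present()
 * still returns true (e.g. logical hot-unplug after a PCIe error). */
static void nvme_remove(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
	pci_set_drvdata(pdev, NULL);

	if (!pci_device_is_present(pdev))
		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);

	nvme_dev_disable(dev, true);	/* moved out of the if: always cancel */
	flush_work(&dev->ctrl.reset_work);
	nvme_stop_ctrl(&dev->ctrl);
	nvme_remove_namespaces(&dev->ctrl);
	[..]
}
```

Whether disabling a still-live controller this early has side effects (e.g. for an in-flight reset) is exactly the open question.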
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-18 12:52 [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang Nilay Shroff
@ 2024-04-21 10:28 ` Sagi Grimberg
2024-04-21 16:53 ` Nilay Shroff
2024-04-21 16:56 ` Nilay Shroff
0 siblings, 2 replies; 10+ messages in thread
From: Sagi Grimberg @ 2024-04-21 10:28 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme
Cc: Keith Busch, Christoph Hellwig, axboe, Gregory Joyce,
Srimannarayana Murthy Maram
On 18/04/2024 15:52, Nilay Shroff wrote:
> Hi,
>
> We found that the NVMe driver hangs when disk IO is in progress and we inject a PCIe error and then hot-unplug (logically, not physically) the NVMe disk.
>
> Notes and observations:
> ======================
> This was observed on the latest Linus kernel tree (v6.9-rc4); however, we believe the issue is also present in older kernels.
>
> Test details:
> =============
> Steps to reproduce this issue:
>
> 1. Run some disk IO using fio or any other tool
> 2. While disk IO is running, inject pci error
> 3. disable the slot where nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
>
> Kernel Logs:
> ============
> When we follow steps described in the test details we get the below logs:
>
> [ 295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> [ 295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
> [ 295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
> [ 490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
> [ 490.381614] Not tainted 6.9.0-rc4+ #8
> [ 490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 490.381623] task:bash state:D stack:0 pid:2510 tgid:2510 ppid:2509 flags:0x00042080
> [ 490.381632] Call Trace:
> [ 490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
> [ 490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
> [ 490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
> [ 490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
> [ 490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
> [ 490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
> [ 490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
> [ 490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
> [ 490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
> [ 490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
> [ 490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
> [ 490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
> [ 490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
> [ 490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
> [ 490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
> [ 490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
> [ 490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
> [ 490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
> [ 490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
> [ 490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
> [ 490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
> [ 490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
> [ 490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
> [ 490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>
> NVMe controller state:
> ======================
> # cat /sys/class/nvme/nvme1/state
> deleting (no IO)
>
> Process State:
> ==============
> # ps -aex
> [..]
> 2510 pts/2 Ds+ 0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
> 2549 ? Ds 0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based
> [..]
>
> Observation:
> ============
> As is apparent from the above logs, "disable-slot" (pid 2510) is waiting (in uninterruptible sleep)
> for the queue to be frozen because the in-flight IO(s) could not finish. Moreover, the IO(s) which were
> in flight actually time out; however, nvme_timeout() does not cancel those IOs but merely logs the error
> "Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
> As those in-flight IOs were never cancelled, the NVMe driver code running in the context of
> "disable-slot" could not make forward progress, and the NVMe controller state remains "deleting (no IO)"
> indefinitely. The only way we found to get out of this state is to reboot the system.
>
> Proposed fix:
> ============
> static void nvme_remove(struct pci_dev *pdev)
> {
> 	struct nvme_dev *dev = pci_get_drvdata(pdev);
>
> 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> 	pci_set_drvdata(pdev, NULL);
>
> 	if (!pci_device_is_present(pdev)) {
> 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
> 		nvme_dev_disable(dev, true);
> 	}
> 	flush_work(&dev->ctrl.reset_work);
> 	nvme_stop_ctrl(&dev->ctrl);
> 	nvme_remove_namespaces(&dev->ctrl);  <== here ctrl state is set to "deleting (no IO)"
> 	[..]
> }
>
> As shown above, nvme_remove() invokes nvme_dev_disable(), but only if the device has been
> physically removed. Since nvme_dev_disable() cancels pending IOs, does it make sense to
> unconditionally cancel pending IOs before moving on? Or are there any side effects if we
> were to invoke nvme_dev_disable() unconditionally in the first place?
Shouldn't the correct place to handle the cancellation be
nvme_error_detected(), given that the
PCI error is preventing the request from completing and the timeout
handler from addressing it?
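For context, a simplified sketch of the callback referred to here, modeled on nvme_error_detected() in drivers/nvme/host/pci.c (details vary by kernel version; treat this as an approximation, not the exact upstream source). On a frozen channel it disables the controller, which reaps outstanding requests, but it only runs on platforms where the PCI core actually invokes the error handlers:

```c
/* Approximate sketch of the existing PCI error-recovery entry point. */
static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
					    pci_channel_state_t state)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	switch (state) {
	case pci_channel_io_normal:
		return PCI_ERS_RESULT_CAN_RECOVER;
	case pci_channel_io_frozen:
		/* nvme_dev_disable() cancels in-flight IOs here */
		nvme_dev_disable(dev, false);
		return PCI_ERS_RESULT_NEED_RESET;
	case pci_channel_io_perm_failure:
		return PCI_ERS_RESULT_DISCONNECT;
	}
	return PCI_ERS_RESULT_NEED_RESET;
}
```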
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-21 10:28 ` Sagi Grimberg
@ 2024-04-21 16:53 ` Nilay Shroff
2024-04-21 16:56 ` Nilay Shroff
1 sibling, 0 replies; 10+ messages in thread
From: Nilay Shroff @ 2024-04-21 16:53 UTC (permalink / raw)
To: linux-nvme
On 4/21/24 15:58, Sagi Grimberg wrote:
>
>
> On 18/04/2024 15:52, Nilay Shroff wrote:
>> Hi,
>>
>> We found that the NVMe driver hangs when disk IO is in progress and we inject a PCIe error and then hot-unplug (logically, not physically) the NVMe disk.
>>
>> Notes and observations:
>> ======================
>> This was observed on the latest Linus kernel tree (v6.9-rc4); however, we believe the issue is also present in older kernels.
>>
>> Test details:
>> =============
>> Steps to reproduce this issue:
>>
>> 1. Run some disk IO using fio or any other tool
>> 2. While disk IO is running, inject pci error
>> 3. disable the slot where nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
>>
>> Kernel Logs:
>> ============
>> When we follow steps described in the test details we get the below logs:
>>
>> [ 295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
>> [ 295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
>> [ 295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
>> [ 490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
>> [ 490.381614] Not tainted 6.9.0-rc4+ #8
>> [ 490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 490.381623] task:bash state:D stack:0 pid:2510 tgid:2510 ppid:2509 flags:0x00042080
>> [ 490.381632] Call Trace:
>> [ 490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
>> [ 490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
>> [ 490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
>> [ 490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
>> [ 490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
>> [ 490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
>> [ 490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
>> [ 490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
>> [ 490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
>> [ 490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
>> [ 490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
>> [ 490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
>> [ 490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
>> [ 490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
>> [ 490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
>> [ 490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
>> [ 490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
>> [ 490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
>> [ 490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
>> [ 490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
>> [ 490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
>> [ 490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
>> [ 490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
>> [ 490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>>
>> NVMe controller state:
>> ======================
>> # cat /sys/class/nvme/nvme1/state
>> deleting (no IO)
>>
>> Process State:
>> ==============
>> # ps -aex
>> [..]
>> 2510 pts/2 Ds+ 0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
>> 2549 ? Ds 0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based
>> [..]
>>
>> Observation:
>> ============
>> As is apparent from the above logs, "disable-slot" (pid 2510) is waiting (in uninterruptible sleep)
>> for the queue to be frozen because the in-flight IO(s) could not finish. Moreover, the IO(s) which were
>> in flight actually time out; however, nvme_timeout() does not cancel those IOs but merely logs the error
>> "Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
>> As those in-flight IOs were never cancelled, the NVMe driver code running in the context of
>> "disable-slot" could not make forward progress, and the NVMe controller state remains "deleting (no IO)"
>> indefinitely. The only way we found to get out of this state is to reboot the system.
>>
>> Proposed fix:
>> ============
>> static void nvme_remove(struct pci_dev *pdev)
>> {
>> 	struct nvme_dev *dev = pci_get_drvdata(pdev);
>>
>> 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>> 	pci_set_drvdata(pdev, NULL);
>>
>> 	if (!pci_device_is_present(pdev)) {
>> 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>> 		nvme_dev_disable(dev, true);
>> 	}
>> 	flush_work(&dev->ctrl.reset_work);
>> 	nvme_stop_ctrl(&dev->ctrl);
>> 	nvme_remove_namespaces(&dev->ctrl);  <== here ctrl state is set to "deleting (no IO)"
>> 	[..]
>> }
>>
>> As shown above, nvme_remove() invokes nvme_dev_disable(), but only if the device has been
>> physically removed. Since nvme_dev_disable() cancels pending IOs, does it make sense to
>> unconditionally cancel pending IOs before moving on? Or are there any side effects if we
>> were to invoke nvme_dev_disable() unconditionally in the first place?
>
> Shouldn't the correct place to handle the cancellation be nvme_error_detected(), given that the
> PCI error is preventing the request from completing and the timeout handler from addressing it?
>
If a platform supports PCI error recovery then it may be possible to cancel pending IOs from
nvme_error_detected(); however, if the platform doesn't support PCI error recovery then
nvme_error_detected() would never be called. In fact, the issue reported above was
discovered on a platform which has PCI error recovery disabled.
I also tested this scenario on a platform supporting PCI error recovery. On that platform,
when I ran this test (PCIe error injection followed by NVMe hot-unplug), I found that the
PCI error-recovery thread races with the hot-unplug task. Please find below the dmesg logs
collected when this issue manifests:
INFO: task eehd:187 blocked for more than 122 seconds.
Not tainted 6.9.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:eehd state:D stack:0 pid:187 tgid:187 ppid:2 flags:0x00000000
Call Trace:
[c0000000089bf890] [c000000000fb027c] vsnprintf+0x3f8/0x578 (unreliable)
[c0000000089bfa40] [c00000000001f3fc] __switch_to+0x13c/0x220
[c0000000089bfaa0] [c000000000fb87e0] __schedule+0x268/0x7c4
[c0000000089bfb70] [c000000000fb8d7c] schedule+0x40/0x108
[c0000000089bfbe0] [c000000000fb93f8] schedule_preempt_disabled+0x20/0x30
[c0000000089bfc00] [c000000000fbbe84] __mutex_lock.constprop.0+0x5f4/0xc54
[c0000000089bfca0] [c000000000916380] pci_lock_rescan_remove+0x28/0x3c
[c0000000089bfcc0] [c00000000004fa4c] eeh_pe_report_edev+0x3c/0x52c
[c0000000089bfda0] [c00000000004ffdc] eeh_pe_report+0xa0/0x158
[c0000000089bfe40] [c000000000050490] eeh_handle_normal_event+0x310/0xa24
[c0000000089bff30] [c000000000051078] eeh_event_handler+0x118/0x19c
[c0000000089bff90] [c00000000018d29c] kthread+0x138/0x140
[c0000000089bffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
INFO: task bash:5420 blocked for more than 122 seconds.
Not tainted 6.9.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash state:D stack:0 pid:5420 tgid:5420 ppid:5419 flags:0x00042080
Call Trace:
[c000000054c67510] [c000000054c67550] 0xc000000054c67550 (unreliable)
[c000000054c676c0] [c00000000001f3fc] __switch_to+0x13c/0x220
[c000000054c67720] [c000000000fb87e0] __schedule+0x268/0x7c4
[c000000054c677f0] [c000000000fb8d7c] schedule+0x40/0x108
[c000000054c67860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
[c000000054c678c0] [c00000000081eba8] del_gendisk+0x284/0x464
[c000000054c67920] [c00800000b7174a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
[c000000054c67960] [c00800000b717704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
[c000000054c679d0] [c008000006294b70] nvme_remove+0x80/0x168 [nvme]
[c000000054c67a10] [c00000000092a10c] pci_device_remove+0x6c/0x110
[c000000054c67a50] [c000000000a4f504] device_remove+0x70/0xc4
[c000000054c67a80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
[c000000054c67ad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
[c000000054c67b10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
[c000000054c67b40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
[c000000054c67bd0] [c008000004380504] disable_slot+0x40/0x90 [rpaphp]
[c000000054c67c00] [c000000000956090] power_write_file+0xf8/0x19c
[c000000054c67c80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
[c000000054c67ca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
[c000000054c67cc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
[c000000054c67d10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
[c000000054c67dc0] [c0000000005e13c0] ksys_write+0x84/0x140
[c000000054c67e10] [c000000000030a84] system_call_exception+0x124/0x330
[c000000054c67e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
As we can see above, the task eehd:187 (PCI error-recovery thread) is blocked trying to
acquire the mutex (pci_rescan_remove_lock). The hot-unplug task (bash:5420) is blocked
waiting for the request queue to be frozen. The hot-unplug task first acquires
pci_rescan_remove_lock and then invokes the nvme_remove() method. So, in summary, in the
traces shown above, as the hot-unplug task is unable to proceed, it has also blocked the
PCI error-recovery thread.
On another note, if the PCI error-recovery thread could make progress by acquiring
pci_rescan_remove_lock first, it would be able to recover from the PCI error, and hence
the pending IOs could finish. Later, when the hot-unplug task starts, it could make
forward progress and clean up all the resources used by the NVMe disk.
So does it make sense to unconditionally cancel the pending IOs from nvme_remove()
before it proceeds to remove namespaces?
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-21 10:28 ` Sagi Grimberg
2024-04-21 16:53 ` Nilay Shroff
@ 2024-04-21 16:56 ` Nilay Shroff
2024-04-22 13:00 ` Sagi Grimberg
1 sibling, 1 reply; 10+ messages in thread
From: Nilay Shroff @ 2024-04-21 16:56 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme
Cc: Keith Busch, Christoph Hellwig, axboe, Gregory Joyce,
Srimannarayana Murthy Maram
On 4/21/24 15:58, Sagi Grimberg wrote:
>
>
> On 18/04/2024 15:52, Nilay Shroff wrote:
>> Hi,
>>
>> We found that the NVMe driver hangs when disk IO is in progress and we inject a PCIe error and then hot-unplug (logically, not physically) the NVMe disk.
>>
>> Notes and observations:
>> ======================
>> This was observed on the latest Linus kernel tree (v6.9-rc4); however, we believe the issue is also present in older kernels.
>>
>> Test details:
>> =============
>> Steps to reproduce this issue:
>>
>> 1. Run some disk IO using fio or any other tool
>> 2. While disk IO is running, inject pci error
>> 3. disable the slot where nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
>>
>> Kernel Logs:
>> ============
>> When we follow steps described in the test details we get the below logs:
>>
>> [ 295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
>> [ 295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
>> [ 295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
>> [ 490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
>> [ 490.381614] Not tainted 6.9.0-rc4+ #8
>> [ 490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 490.381623] task:bash state:D stack:0 pid:2510 tgid:2510 ppid:2509 flags:0x00042080
>> [ 490.381632] Call Trace:
>> [ 490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
>> [ 490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
>> [ 490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
>> [ 490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
>> [ 490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
>> [ 490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
>> [ 490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
>> [ 490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
>> [ 490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
>> [ 490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
>> [ 490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
>> [ 490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
>> [ 490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
>> [ 490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
>> [ 490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
>> [ 490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
>> [ 490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
>> [ 490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
>> [ 490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
>> [ 490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
>> [ 490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
>> [ 490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
>> [ 490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
>> [ 490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>>
>> NVMe controller state:
>> ======================
>> # cat /sys/class/nvme/nvme1/state
>> deleting (no IO)
>>
>> Process State:
>> ==============
>> # ps -aex
>> [..]
>> 2510 pts/2 Ds+ 0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
>> 2549 ? Ds 0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based
>> [..]
>>
>> Observation:
>> ============
>> As is apparent from the above logs, "disable-slot" (pid 2510) is waiting (in uninterruptible sleep)
>> for the queue to be frozen because the in-flight IO(s) could not finish. Moreover, the IO(s) which were
>> in flight actually time out; however, nvme_timeout() does not cancel those IOs but merely logs the error
>> "Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
>> As those in-flight IOs were never cancelled, the NVMe driver code running in the context of
>> "disable-slot" could not make forward progress, and the NVMe controller state remains "deleting (no IO)"
>> indefinitely. The only way we found to get out of this state is to reboot the system.
>>
>> Proposed fix:
>> ============
>> static void nvme_remove(struct pci_dev *pdev)
>> {
>> 	struct nvme_dev *dev = pci_get_drvdata(pdev);
>>
>> 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>> 	pci_set_drvdata(pdev, NULL);
>>
>> 	if (!pci_device_is_present(pdev)) {
>> 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>> 		nvme_dev_disable(dev, true);
>> 	}
>> 	flush_work(&dev->ctrl.reset_work);
>> 	nvme_stop_ctrl(&dev->ctrl);
>> 	nvme_remove_namespaces(&dev->ctrl);  <== here ctrl state is set to "deleting (no IO)"
>> 	[..]
>> }
>>
>> As shown above, nvme_remove() invokes nvme_dev_disable(), but only if the device has been
>> physically removed. Since nvme_dev_disable() cancels pending IOs, does it make sense to
>> unconditionally cancel pending IOs before moving on? Or are there any side effects if we
>> were to invoke nvme_dev_disable() unconditionally in the first place?
>
> Shouldn't the correct place to handle the cancellation be nvme_error_detected(), given that the
> PCI error is preventing the request from completing and the timeout handler from addressing it?
>
If a platform supports PCI error recovery then it may be possible to cancel pending IOs from
nvme_error_detected(); however, if the platform doesn't support PCI error recovery then
nvme_error_detected() would never be called. In fact, the issue reported above was
discovered on a platform which has PCI error recovery disabled.
I also tested this scenario on a platform supporting PCI error recovery. On that platform,
when I ran this test (PCIe error injection followed by NVMe hot-unplug), I found that the
PCI error-recovery thread races with the hot-unplug task. Please find below the dmesg logs
collected when this issue manifests:
INFO: task eehd:187 blocked for more than 122 seconds.
Not tainted 6.9.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:eehd state:D stack:0 pid:187 tgid:187 ppid:2 flags:0x00000000
Call Trace:
[c0000000089bf890] [c000000000fb027c] vsnprintf+0x3f8/0x578 (unreliable)
[c0000000089bfa40] [c00000000001f3fc] __switch_to+0x13c/0x220
[c0000000089bfaa0] [c000000000fb87e0] __schedule+0x268/0x7c4
[c0000000089bfb70] [c000000000fb8d7c] schedule+0x40/0x108
[c0000000089bfbe0] [c000000000fb93f8] schedule_preempt_disabled+0x20/0x30
[c0000000089bfc00] [c000000000fbbe84] __mutex_lock.constprop.0+0x5f4/0xc54
[c0000000089bfca0] [c000000000916380] pci_lock_rescan_remove+0x28/0x3c
[c0000000089bfcc0] [c00000000004fa4c] eeh_pe_report_edev+0x3c/0x52c
[c0000000089bfda0] [c00000000004ffdc] eeh_pe_report+0xa0/0x158
[c0000000089bfe40] [c000000000050490] eeh_handle_normal_event+0x310/0xa24
[c0000000089bff30] [c000000000051078] eeh_event_handler+0x118/0x19c
[c0000000089bff90] [c00000000018d29c] kthread+0x138/0x140
[c0000000089bffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
INFO: task bash:5420 blocked for more than 122 seconds.
Not tainted 6.9.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash state:D stack:0 pid:5420 tgid:5420 ppid:5419 flags:0x00042080
Call Trace:
[c000000054c67510] [c000000054c67550] 0xc000000054c67550 (unreliable)
[c000000054c676c0] [c00000000001f3fc] __switch_to+0x13c/0x220
[c000000054c67720] [c000000000fb87e0] __schedule+0x268/0x7c4
[c000000054c677f0] [c000000000fb8d7c] schedule+0x40/0x108
[c000000054c67860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
[c000000054c678c0] [c00000000081eba8] del_gendisk+0x284/0x464
[c000000054c67920] [c00800000b7174a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
[c000000054c67960] [c00800000b717704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
[c000000054c679d0] [c008000006294b70] nvme_remove+0x80/0x168 [nvme]
[c000000054c67a10] [c00000000092a10c] pci_device_remove+0x6c/0x110
[c000000054c67a50] [c000000000a4f504] device_remove+0x70/0xc4
[c000000054c67a80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
[c000000054c67ad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
[c000000054c67b10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
[c000000054c67b40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
[c000000054c67bd0] [c008000004380504] disable_slot+0x40/0x90 [rpaphp]
[c000000054c67c00] [c000000000956090] power_write_file+0xf8/0x19c
[c000000054c67c80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
[c000000054c67ca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
[c000000054c67cc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
[c000000054c67d10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
[c000000054c67dc0] [c0000000005e13c0] ksys_write+0x84/0x140
[c000000054c67e10] [c000000000030a84] system_call_exception+0x124/0x330
[c000000054c67e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
As we can see above, the task eehd:187 (the pci-error-recovery thread) is blocked
waiting to acquire the mutex (pci_rescan_remove_lock). The hot-unplug task (bash: 5420)
is blocked waiting for the request queue to be frozen. The hot-unplug task first acquires
pci_rescan_remove_lock and then invokes nvme_remove(). So in summary,
in the traces shown above, because the hot-unplug task is unable to proceed, it has also
blocked the pci-error-recovery thread.
On another note, if the pci-error-recovery thread could make progress and acquire
pci_rescan_remove_lock first, it would be able to recover the pci error and hence
the pending IOs could finish. Later, when the hot-unplug task starts, it could
make forward progress and clean up all resources used by the nvme disk.
So does it make sense to unconditionally cancel the pending IOs from
nvme_remove() before it proceeds to remove namespaces?
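The circular wait described here can be sketched as a waits-for graph. The following is an illustrative userspace sketch, not kernel code: the node names and edges come from the traces above, while the graph representation and the cycle check are invented for the illustration.

```c
#include <stdbool.h>

/* Tasks/events from the hang traces above. */
enum { HOT_UNPLUG, IO_COMPLETION, EEH_RECOVERY, NODES };

/* waits_for[a][b] != 0 means a cannot proceed until b does. */
static const int waits_for[NODES][NODES] = {
    /* hot-unplug holds pci_rescan_remove_lock and waits in
     * blk_mq_freeze_queue_wait() for in-flight I/O to finish */
    [HOT_UNPLUG]    = { [IO_COMPLETION] = 1 },
    /* the in-flight I/O can only finish once recovery resets the device */
    [IO_COMPLETION] = { [EEH_RECOVERY] = 1 },
    /* eehd blocks in pci_lock_rescan_remove(), held by hot-unplug */
    [EEH_RECOVERY]  = { [HOT_UNPLUG] = 1 },
};

/* Depth-limited DFS: returns true if we can walk from 'node' back to
 * 'start' along waits-for edges, i.e. a deadlock cycle exists. */
static bool in_cycle(int node, int start, int depth)
{
    if (depth > 0 && node == start)
        return true;
    if (depth >= NODES)
        return false;
    for (int next = 0; next < NODES; next++)
        if (waits_for[node][next] && in_cycle(next, start, depth + 1))
            return true;
    return false;
}
```

Starting the walk from any of the three participants comes back to itself, which is the deadlock: no participant can run until another one does.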
Thanks,
--Nilay
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-21 16:56 ` Nilay Shroff
@ 2024-04-22 13:00 ` Sagi Grimberg
2024-04-22 13:52 ` Keith Busch
0 siblings, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2024-04-22 13:00 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme
Cc: Keith Busch, Christoph Hellwig, axboe, Gregory Joyce,
Srimannarayana Murthy Maram
On 21/04/2024 19:56, Nilay Shroff wrote:
>
> On 4/21/24 15:58, Sagi Grimberg wrote:
>>
>> On 18/04/2024 15:52, Nilay Shroff wrote:
>>> Hi,
>>>
>>> We found nvme driver hangs when disk IO is ongoing and if we inject pcie error and hot-unplug (not physical but logical unplug) the nvme disk.
>>>
>>> Notes and observations:
>>> ======================
>>> This is observed on the latest linus kernel tree (v6.9-rc4) however we believe this issue shall also be present on the older kernels.
>>>
>>> Test details:
>>> =============
>>> Steps to reproduce this issue:
>>>
>>> 1. Run some disk IO using fio or any other tool
>>> 2. While disk IO is running, inject pci error
>>> 3. disable the slot where nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
>>>
>>> Kernel Logs:
>>> ============
>>> When we follow steps described in the test details we get the below logs:
>>>
>>> [ 295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
>>> [ 295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
>>> [ 295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
>>> [ 490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
>>> [ 490.381614] Not tainted 6.9.0-rc4+ #8
>>> [ 490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [ 490.381623] task:bash state:D stack:0 pid:2510 tgid:2510 ppid:2509 flags:0x00042080
>>> [ 490.381632] Call Trace:
>>> [ 490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
>>> [ 490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
>>> [ 490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
>>> [ 490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
>>> [ 490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
>>> [ 490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
>>> [ 490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
>>> [ 490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
>>> [ 490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
>>> [ 490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
>>> [ 490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
>>> [ 490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
>>> [ 490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
>>> [ 490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
>>> [ 490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
>>> [ 490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
>>> [ 490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
>>> [ 490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
>>> [ 490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
>>> [ 490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
>>> [ 490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
>>> [ 490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
>>> [ 490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
>>> [ 490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>>>
>>> NVMe controller state:
>>> ======================
>>> # cat /sys/class/nvme/nvme1/state
>>> deleting (no IO)
>>>
>>> Process State:
>>> ==============
>>> # ps -aex
>>> [..]
>>> 2510 pts/2 Ds+ 0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
>>> 2549 ? Ds 0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based
>>> [..]
>>>
>>> Observation:
>>> ============
>>> As is apparent from the above logs, "disable-slot" (pid 2510) is waiting (in uninterruptible sleep)
>>> for the queue to be frozen because the in-flight IO(s) couldn't finish. Moreover, the IO(s) which were
>>> in-flight actually time out; however, nvme_timeout() doesn't cancel those IOs but logs the error
>>> "Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
>>> As those in-flight IOs were not cancelled, the NVMe driver code which runs in the context of
>>> "disable-slot" couldn't make forward progress and the NVMe controller state remains "deleting (no IO)"
>>> indefinitely. The only way we found to come out of this state is to reboot the system.
>>>
>>> Proposed fix:
>>> ============
>>> static void nvme_remove(struct pci_dev *pdev)
>>> {
>>> struct nvme_dev *dev = pci_get_drvdata(pdev);
>>>
>>> nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>>> pci_set_drvdata(pdev, NULL);
>>>
>>> if (!pci_device_is_present(pdev)) {
>>> nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>>> nvme_dev_disable(dev, true);
>>> }
>>> flush_work(&dev->ctrl.reset_work);
>>> nvme_stop_ctrl(&dev->ctrl);
>>> nvme_remove_namespaces(&dev->ctrl); <== here cntrl state is set to "deleting (no IO)"
>>> [..]
>>> }
>>>
>>> As shown above, nvme_remove() invokes nvme_dev_disable(), however, it is only invoked if the
>>> device has been physically removed. As nvme_dev_disable() helps cancel pending IOs, does it make
>>> sense to unconditionally cancel pending IOs before moving on? Or are there any side effects if
>>> we were to unconditionally invoke nvme_dev_disable() in the first place?
>> Shouldn't the correct place to handle the cancellation is nvme_error_detected() given that the
>> pci error is preventing the request from completing and the timeout handler from addressing it?
>>
> If a platform supports pci-error-recovery then it may be possible to cancel pending IOs from
> nvme_error_detected(), however, if the platform doesn't support pci error recovery then
> nvme_error_detected() would never be called. In fact, the issue which I reported above was
> discovered on a platform which has pci-error-recovery disabled.
>
> I also tested this scenario on a platform supporting pci error recovery. On this platform,
> when I ran this test (PCI error injection followed by NVMe hot unplug), I found that the
> pci-error-recovery thread races with the hot-unplug task; please find below the dmesg logs
> collected when this issue manifests:
>
> INFO: task eehd:187 blocked for more than 122 seconds.
> Not tainted 6.9.0-rc4+ #8
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:eehd state:D stack:0 pid:187 tgid:187 ppid:2 flags:0x00000000
> Call Trace:
> [c0000000089bf890] [c000000000fb027c] vsnprintf+0x3f8/0x578 (unreliable)
> [c0000000089bfa40] [c00000000001f3fc] __switch_to+0x13c/0x220
> [c0000000089bfaa0] [c000000000fb87e0] __schedule+0x268/0x7c4
> [c0000000089bfb70] [c000000000fb8d7c] schedule+0x40/0x108
> [c0000000089bfbe0] [c000000000fb93f8] schedule_preempt_disabled+0x20/0x30
> [c0000000089bfc00] [c000000000fbbe84] __mutex_lock.constprop.0+0x5f4/0xc54
> [c0000000089bfca0] [c000000000916380] pci_lock_rescan_remove+0x28/0x3c
> [c0000000089bfcc0] [c00000000004fa4c] eeh_pe_report_edev+0x3c/0x52c
> [c0000000089bfda0] [c00000000004ffdc] eeh_pe_report+0xa0/0x158
> [c0000000089bfe40] [c000000000050490] eeh_handle_normal_event+0x310/0xa24
> [c0000000089bff30] [c000000000051078] eeh_event_handler+0x118/0x19c
> [c0000000089bff90] [c00000000018d29c] kthread+0x138/0x140
> [c0000000089bffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
>
> INFO: task bash:5420 blocked for more than 122 seconds.
> Not tainted 6.9.0-rc4+ #8
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:bash state:D stack:0 pid:5420 tgid:5420 ppid:5419 flags:0x00042080
> Call Trace:
> [c000000054c67510] [c000000054c67550] 0xc000000054c67550 (unreliable)
> [c000000054c676c0] [c00000000001f3fc] __switch_to+0x13c/0x220
> [c000000054c67720] [c000000000fb87e0] __schedule+0x268/0x7c4
> [c000000054c677f0] [c000000000fb8d7c] schedule+0x40/0x108
> [c000000054c67860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
> [c000000054c678c0] [c00000000081eba8] del_gendisk+0x284/0x464
> [c000000054c67920] [c00800000b7174a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
> [c000000054c67960] [c00800000b717704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
> [c000000054c679d0] [c008000006294b70] nvme_remove+0x80/0x168 [nvme]
> [c000000054c67a10] [c00000000092a10c] pci_device_remove+0x6c/0x110
> [c000000054c67a50] [c000000000a4f504] device_remove+0x70/0xc4
> [c000000054c67a80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
> [c000000054c67ad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
> [c000000054c67b10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
> [c000000054c67b40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
> [c000000054c67bd0] [c008000004380504] disable_slot+0x40/0x90 [rpaphp]
> [c000000054c67c00] [c000000000956090] power_write_file+0xf8/0x19c
> [c000000054c67c80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
> [c000000054c67ca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
> [c000000054c67cc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
> [c000000054c67d10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
> [c000000054c67dc0] [c0000000005e13c0] ksys_write+0x84/0x140
> [c000000054c67e10] [c000000000030a84] system_call_exception+0x124/0x330
> [c000000054c67e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>
>
> As we can see above, the task eehd:187 (the pci-error-recovery thread) is blocked
> waiting to acquire the mutex (pci_rescan_remove_lock). The hot-unplug task (bash: 5420)
> is blocked waiting for the request queue to be frozen. The hot-unplug task first acquires
> pci_rescan_remove_lock and then invokes nvme_remove(). So in summary,
> in the traces shown above, because the hot-unplug task is unable to proceed, it has also
> blocked the pci-error-recovery thread.
Yea that needs fixing.
>
> On another note, if the pci-error-recovery thread could make progress and acquire
> pci_rescan_remove_lock first, it would be able to recover the pci error and hence
> the pending IOs could finish. Later, when the hot-unplug task starts, it could
> make forward progress and clean up all resources used by the nvme disk.
>
> So does it make sense to unconditionally cancel the pending IOs from
> nvme_remove() before it proceeds to remove namespaces?
The driver attempts to allow in-flight I/O to complete successfully if the
device is still present at the remove stage. I am not sure we want to
unconditionally fail these I/Os. Keith?
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-22 13:00 ` Sagi Grimberg
@ 2024-04-22 13:52 ` Keith Busch
2024-04-22 14:35 ` Keith Busch
0 siblings, 1 reply; 10+ messages in thread
From: Keith Busch @ 2024-04-22 13:52 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Nilay Shroff, linux-nvme, Christoph Hellwig, axboe,
Gregory Joyce, Srimannarayana Murthy Maram
On Mon, Apr 22, 2024 at 04:00:54PM +0300, Sagi Grimberg wrote:
> > pci_rescan_remove_lock then it shall be able to recover the pci error and hence
> > pending IOs could be finished. Later when hot-unplug task starts, it could
> > forward progress and cleanup all resources used by the nvme disk.
> >
> > So does it make sense if we unconditionally cancel the pending IOs from
> > nvme_remove() before it forward progress to remove namespaces?
>
> The driver attempts to allow in-flight I/O to complete successfully if the
> device is still present at the remove stage. I am not sure we want to
> unconditionally fail these I/Os. Keith?
We have a timeout handler to clean this up, but I think it was another
PPC-specific patch that has the timeout handler do nothing if pcie error
recovery is in progress. That seems questionable; we should be able to
concurrently run error handling and timeouts, but I think the error
handling just needs to synchronize the request_queue's in the
"error_detected" path.
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-22 13:52 ` Keith Busch
@ 2024-04-22 14:35 ` Keith Busch
2024-04-23 9:52 ` Nilay Shroff
0 siblings, 1 reply; 10+ messages in thread
From: Keith Busch @ 2024-04-22 14:35 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Nilay Shroff, linux-nvme, Christoph Hellwig, axboe,
Gregory Joyce, Srimannarayana Murthy Maram
On Mon, Apr 22, 2024 at 07:52:25AM -0600, Keith Busch wrote:
> On Mon, Apr 22, 2024 at 04:00:54PM +0300, Sagi Grimberg wrote:
> > > pci_rescan_remove_lock then it shall be able to recover the pci error and hence
> > > pending IOs could be finished. Later when hot-unplug task starts, it could
> > > forward progress and cleanup all resources used by the nvme disk.
> > >
> > > So does it make sense if we unconditionally cancel the pending IOs from
> > > nvme_remove() before it forward progress to remove namespaces?
> >
> > The driver attempts to allow in-flight I/O to complete successfully if the
> > device is still present at the remove stage. I am not sure we want to
> > unconditionally fail these I/Os. Keith?
>
> We have a timeout handler to clean this up, but I think it was another
> PPC-specific patch that has the timeout handler do nothing if pcie error
> recovery is in progress. That seems questionable; we should be able to
> concurrently run error handling and timeouts, but I think the error
> handling just needs to synchronize the request_queue's in the
> "error_detected" path.
This:
---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685d..38d0215fe53fc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1286,13 +1286,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
u32 csts = readl(dev->bar + NVME_REG_CSTS);
u8 opcode;
- /* If PCI error recovery process is happening, we cannot reset or
- * the recovery mechanism will surely fail.
- */
- mb();
- if (pci_channel_offline(to_pci_dev(dev->dev)))
- return BLK_EH_RESET_TIMER;
-
/*
* Reset immediately if the controller is failed
*/
@@ -3300,6 +3293,7 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
return PCI_ERS_RESULT_DISCONNECT;
}
nvme_dev_disable(dev, false);
+ nvme_sync_queues(&dev->ctrl);
return PCI_ERS_RESULT_NEED_RESET;
case pci_channel_io_perm_failure:
dev_warn(dev->ctrl.device,
--
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-22 14:35 ` Keith Busch
@ 2024-04-23 9:52 ` Nilay Shroff
2024-04-24 17:36 ` Keith Busch
0 siblings, 1 reply; 10+ messages in thread
From: Nilay Shroff @ 2024-04-23 9:52 UTC (permalink / raw)
To: Keith Busch, Sagi Grimberg
Cc: linux-nvme, Christoph Hellwig, axboe, Gregory Joyce,
Srimannarayana Murthy Maram
On 4/22/24 20:05, Keith Busch wrote:
> On Mon, Apr 22, 2024 at 07:52:25AM -0600, Keith Busch wrote:
>> On Mon, Apr 22, 2024 at 04:00:54PM +0300, Sagi Grimberg wrote:
>>>> pci_rescan_remove_lock then it shall be able to recover the pci error and hence
>>>> pending IOs could be finished. Later when hot-unplug task starts, it could
>>>> forward progress and cleanup all resources used by the nvme disk.
>>>>
>>>> So does it make sense if we unconditionally cancel the pending IOs from
>>>> nvme_remove() before it forward progress to remove namespaces?
>>>
>>> The driver attempts to allow in-flight I/O to complete successfully if the
>>> device is still present at the remove stage. I am not sure we want to
>>> unconditionally fail these I/Os. Keith?
>>
>> We have a timeout handler to clean this up, but I think it was another
>> PPC-specific patch that has the timeout handler do nothing if pcie error
>> recovery is in progress. That seems questionable; we should be able to
>> concurrently run error handling and timeouts, but I think the error
>> handling just needs to synchronize the request_queue's in the
>> "error_detected" path.
>
> This:
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8e0bb9692685d..38d0215fe53fc 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1286,13 +1286,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
> u32 csts = readl(dev->bar + NVME_REG_CSTS);
> u8 opcode;
>
> - /* If PCI error recovery process is happening, we cannot reset or
> - * the recovery mechanism will surely fail.
> - */
> - mb();
> - if (pci_channel_offline(to_pci_dev(dev->dev)))
> - return BLK_EH_RESET_TIMER;
> -
> /*
> * Reset immediately if the controller is failed
> */
> @@ -3300,6 +3293,7 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
> return PCI_ERS_RESULT_DISCONNECT;
> }
> nvme_dev_disable(dev, false);
> + nvme_sync_queues(&dev->ctrl);
> return PCI_ERS_RESULT_NEED_RESET;
> case pci_channel_io_perm_failure:
> dev_warn(dev->ctrl.device,
> --
>
I tested the above patch; however, it doesn't solve the issue.
I tested it for two cases listed below:
1. Platform which doesn't support pci-error-recovery:
-----------------------------------------------------
On this platform, when nvme_timeout() is invoked, it falls through
nvme_should_reset()
-> nvme_warn_reset()
-> goto disable
When nvme_timeout() jumps to the disable label, it tries setting the
controller state to RESETTING but that can't succeed because the
(logical) hot-unplug/nvme_remove() of the disk has started on another
thread and hence the controller state has already changed to
DELETING/DELETING_NOIO. As nvme_timeout() couldn't set the controller
state to RESETTING, nvme_timeout() returns BLK_EH_DONE. In summary,
as nvme_timeout() couldn't cancel the pending IO, hot-unplug/nvme_remove()
can't make forward progress and keeps waiting for the request queue to be frozen.
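The failing transition can be sketched with a reduced model of the controller state machine. This is a simplified userspace illustration with a cut-down transition rule, not the driver's actual nvme_change_ctrl_state() table:

```c
#include <stdbool.h>

enum ctrl_state {
    NVME_CTRL_LIVE, NVME_CTRL_RESETTING,
    NVME_CTRL_DELETING, NVME_CTRL_DELETING_NOIO, NVME_CTRL_DEAD,
};

/* Reduced rule: in this sketch a reset may only start from a live
 * controller. The real driver permits a few more source states, but
 * DELETING/DELETING_NOIO are not among them. */
static bool change_ctrl_state(enum ctrl_state *cur, enum ctrl_state next)
{
    if (next == NVME_CTRL_RESETTING && *cur != NVME_CTRL_LIVE)
        return false;   /* e.g. removal already moved us to DELETING */
    *cur = next;
    return true;
}

static bool state_terminal(enum ctrl_state s)
{
    return s == NVME_CTRL_DELETING || s == NVME_CTRL_DELETING_NOIO ||
           s == NVME_CTRL_DEAD;
}
```

Once removal has moved the state to DELETING, the timeout handler's attempt to enter RESETTING fails, so it falls back to BLK_EH_DONE with the I/O still in flight.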
2. Platform supporting pci-error-recovery:
------------------------------------------
Similarly, on this platform, as explained for the above case, when
nvme_timeout() is invoked, it falls through nvme_should_reset()
-> nvme_warn_reset() -> goto disable. In this case as well,
nvme_timeout() returns BLK_EH_DONE. Please note that though this
platform supports pci-error-recovery, we couldn't get through
nvme_error_detected() because the pci-error-recovery thread is pending
on acquiring the mutex "pci_lock_rescan_remove". This mutex is acquired by
the hot-unplug thread before it invokes nvme_remove(), and nvme_remove()
is currently waiting for the request queue to be frozen. For reference,
I have already captured the task hang traces in a previous email of this
thread, where we can observe these hangs (for both the pci-error-recovery
thread and hot-unplug/nvme_remove()).
I understand that we don't want to cancel pending IO from nvme_remove()
unconditionally because, if the disk is not physically hot-unplugged, we still
want to wait for the in-flight IO to finish. Also, looking through
the above cases, I think that nvme_timeout() might be the code path
from which we want to cancel in-flight/pending IO if the controller is
in a terminal state (i.e. DELETING or DELETING_NOIO). Keeping this idea in
mind, I have worked out the below patch:
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685..e45a54d84649 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1286,6 +1286,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
u32 csts = readl(dev->bar + NVME_REG_CSTS);
u8 opcode;
+ if (nvme_state_terminal(&dev->ctrl))
+ goto disable;
+
/* If PCI error recovery process is happening, we cannot reset or
* the recovery mechanism will surely fail.
*/
@@ -1390,8 +1393,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
return BLK_EH_RESET_TIMER;
disable:
- if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
+ if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+ if (nvme_state_terminal(&dev->ctrl)) {
+ nvme_dev_disable(dev, false);
+ nvme_sync_queues(&dev->ctrl);
+ }
return BLK_EH_DONE;
+ }
nvme_dev_disable(dev, false);
if (nvme_try_sched_reset(&dev->ctrl))
I have tested the above patch against all possible cases. Please let me know
if this looks good or if there are any further comments.
Thanks,
--Nilay
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-23 9:52 ` Nilay Shroff
@ 2024-04-24 17:36 ` Keith Busch
2024-04-25 13:49 ` Nilay Shroff
0 siblings, 1 reply; 10+ messages in thread
From: Keith Busch @ 2024-04-24 17:36 UTC (permalink / raw)
To: Nilay Shroff
Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, axboe,
Gregory Joyce, Srimannarayana Murthy Maram
On Tue, Apr 23, 2024 at 03:22:46PM +0530, Nilay Shroff wrote:
> >
> I tested the above patch; however, it doesn't solve the issue.
> I tested it for two cases listed below:
>
> 1. Platform which doesn't support pci-error-recovery:
> -----------------------------------------------------
> On this platform, when nvme_timeout() is invoked, it falls through
> nvme_should_reset()
> -> nvme_warn_reset()
> -> goto disable
>
> When nvme_timeout() jumps to the disable label, it tries setting the
> controller state to RESETTING but that can't succeed because the
> (logical) hot-unplug/nvme_remove() of the disk has started on another
> thread and hence the controller state has already changed to
> DELETING/DELETING_NOIO. As nvme_timeout() couldn't set the controller
> state to RESETTING, nvme_timeout() returns BLK_EH_DONE. In summary,
> as nvme_timeout() couldn't cancel the pending IO, hot-unplug/nvme_remove()
> can't make forward progress and keeps waiting for the request queue to be frozen.
>
> 2. Platform supporting pci-error-recovery:
> ------------------------------------------
> Similarly, on this platform, as explained for the above case, when
> nvme_timeout() is invoked, it falls through nvme_should_reset()
> -> nvme_warn_reset() -> goto disable. In this case as well,
> nvme_timeout() returns BLK_EH_DONE. Please note that though this
> platform supports pci-error-recovery, we couldn't get through
> nvme_error_detected() because the pci-error-recovery thread is pending
> on acquiring the mutex "pci_lock_rescan_remove". This mutex is acquired by
> the hot-unplug thread before it invokes nvme_remove(), and nvme_remove()
> is currently waiting for the request queue to be frozen. For reference,
> I have already captured the task hang traces in a previous email of this
> thread, where we can observe these hangs (for both the pci-error-recovery
> thread and hot-unplug/nvme_remove()).
>
> I understand that we don't want to cancel pending IO from nvme_remove()
> unconditionally because, if the disk is not physically hot-unplugged, we still
> want to wait for the in-flight IO to finish. Also, looking through
> the above cases, I think that nvme_timeout() might be the code path
> from which we want to cancel in-flight/pending IO if the controller is
> in a terminal state (i.e. DELETING or DELETING_NOIO). Keeping this idea in
> mind, I have worked out the below patch:
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8e0bb9692685..e45a54d84649 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1286,6 +1286,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
> u32 csts = readl(dev->bar + NVME_REG_CSTS);
> u8 opcode;
>
> + if (nvme_state_terminal(&dev->ctrl))
> + goto disable;
> +
> /* If PCI error recovery process is happening, we cannot reset or
> * the recovery mechanism will surely fail.
> */
> @@ -1390,8 +1393,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
> return BLK_EH_RESET_TIMER;
>
> disable:
> - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
> + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> + if (nvme_state_terminal(&dev->ctrl)) {
> + nvme_dev_disable(dev, false);
> + nvme_sync_queues(&dev->ctrl);
> + }
> return BLK_EH_DONE;
> + }
>
> nvme_dev_disable(dev, false);
> if (nvme_try_sched_reset(&dev->ctrl))
>
> I have tested the above patch against all possible cases. Please let me know
> if this looks good or if there are any further comments.
This looks okay to me. Just a couple things:
Set nvme_dev_disable's "shutdown" parameter to "true" since we're
restarting the queues again from this state.
Remove "nvme_sync_queues()". I think that would deadlock: sync_queues
waits for the timeout work to complete, but you're calling it within the
timeout work, so this would have it wait for itself.
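That self-wait can be sketched in userspace. The helper names below are hypothetical stand-ins, not the kernel workqueue API; the point is only that flushing a work item from inside that same work item can never complete:

```c
#include <stdbool.h>
#include <stddef.h>

struct work_item { const char *name; };

/* The work item the current thread is executing (NULL outside work). */
static const struct work_item *current_work;

/* Stand-in for a flush: returns false instead of blocking forever when
 * the caller asks to wait on the very work item it is running in --
 * which is what nvme_sync_queues() called from nvme_timeout() would do. */
static bool flush_work_would_complete(const struct work_item *w)
{
    if (current_work == w)
        return false;   /* self-wait: waiting on ourselves, never finishes */
    return true;        /* real code would now wait for w to finish */
}
```

So the sync belongs outside the timeout work, which is why dropping nvme_sync_queues() from that path is the safer option.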
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-24 17:36 ` Keith Busch
@ 2024-04-25 13:49 ` Nilay Shroff
0 siblings, 0 replies; 10+ messages in thread
From: Nilay Shroff @ 2024-04-25 13:49 UTC (permalink / raw)
To: Keith Busch
Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, axboe,
Gregory Joyce, Srimannarayana Murthy Maram
On 4/24/24 23:06, Keith Busch wrote:
> On Tue, Apr 23, 2024 at 03:22:46PM +0530, Nilay Shroff wrote:
>> I understand that we don't want to cancel pending IO from the nvme_remove()
>> unconditionally as if the disk is not physically hot-unplug then we still
>> want to wait for the in-flight IO to be finished. Also looking through
>> the above cases, I think that the nvme_timeout() might be the code path
>> from where we want to cancel in-flight/pending IO if controller is
>> in the terminal state (i.e. DELETING or DELETING_NOIO). Keeping this idea in
>> mind, I have worked out the below patch:
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 8e0bb9692685..e45a54d84649 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -1286,6 +1286,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
>> u32 csts = readl(dev->bar + NVME_REG_CSTS);
>> u8 opcode;
>>
>> + if (nvme_state_terminal(&dev->ctrl))
>> + goto disable;
>> +
>> /* If PCI error recovery process is happening, we cannot reset or
>> * the recovery mechanism will surely fail.
>> */
>> @@ -1390,8 +1393,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
>> return BLK_EH_RESET_TIMER;
>>
>> disable:
>> - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
>> + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> + if (nvme_state_terminal(&dev->ctrl)) {
>> + nvme_dev_disable(dev, false);
>> + nvme_sync_queues(&dev->ctrl);
>> + }
>> return BLK_EH_DONE;
>> + }
>>
>> nvme_dev_disable(dev, false);
>> if (nvme_try_sched_reset(&dev->ctrl))
>>
>> I have tested the above patch against all possible cases. Please let me know
>> if this looks good or if there are any further comments.
>
> This looks okay to me. Just a couple things:
>
> Set nvme_dev_disable's "shutdown" parameter to "true" since we're
> restarting the queues again from this state.
>
> Remove "nvme_sync_queues()". I think that would deadlock: sync_queues
> waits for the timeout work to complete, but you're calling it within the
> timeout work, so this would have it wait for itself.
>
Thank you for reviewing the patch! And yes, I agree with your suggestions.
I will make those changes and send a formal patch in another email.
Thanks,
--Nilay
end of thread, other threads: [~2024-04-25 13:50 UTC | newest]
Thread overview: 10+ messages
2024-04-18 12:52 [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang Nilay Shroff
2024-04-21 10:28 ` Sagi Grimberg
2024-04-21 16:53 ` Nilay Shroff
2024-04-21 16:56 ` Nilay Shroff
2024-04-22 13:00 ` Sagi Grimberg
2024-04-22 13:52 ` Keith Busch
2024-04-22 14:35 ` Keith Busch
2024-04-23 9:52 ` Nilay Shroff
2024-04-24 17:36 ` Keith Busch
2024-04-25 13:49 ` Nilay Shroff