* [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
@ 2024-04-18 12:52 Nilay Shroff
2024-04-21 10:28 ` Sagi Grimberg
0 siblings, 1 reply; 10+ messages in thread
From: Nilay Shroff @ 2024-04-18 12:52 UTC (permalink / raw)
To: linux-nvme
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, axboe,
Gregory Joyce, Srimannarayana Murthy Maram
Hi,
We found that the NVMe driver hangs when disk IO is in progress and we inject a PCIe error and then hot-unplug (logically, not physically) the NVMe disk.
Notes and observations:
======================
This was observed on the latest Linus kernel tree (v6.9-rc4); however, we believe the issue is also present in older kernels.
Test details:
=============
Steps to reproduce this issue:
1. Run some disk IO using fio or any other tool
2. While disk IO is running, inject pci error
3. disable the slot where nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
Kernel Logs:
============
When we follow steps described in the test details we get the below logs:
[ 295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
[ 295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[ 490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
[ 490.381614] Not tainted 6.9.0-rc4+ #8
[ 490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 490.381623] task:bash state:D stack:0 pid:2510 tgid:2510 ppid:2509 flags:0x00042080
[ 490.381632] Call Trace:
[ 490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
[ 490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
[ 490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
[ 490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
[ 490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
[ 490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
[ 490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
[ 490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
[ 490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
[ 490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
[ 490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
[ 490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
[ 490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
[ 490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
[ 490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
[ 490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
[ 490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
[ 490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
[ 490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
[ 490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
[ 490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
[ 490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
[ 490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
[ 490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
NVMe controller state:
======================
# cat /sys/class/nvme/nvme1/state
deleting (no IO)
Process State:
==============
# ps -aex
[..]
2510 pts/2 Ds+ 0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
2549 ? Ds 0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based
[..]
Observation:
============
As is apparent from the above logs, "disable-slot" (pid 2510) is waiting (in uninterruptible sleep)
for the queue to be frozen because the in-flight IO(s) could not finish. Moreover, the IO(s) which were
in flight actually time out; however, nvme_timeout() does not cancel those IOs but merely logs the error
"Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
As those in-flight IOs were never cancelled, the NVMe driver code running in the context of
"disable-slot" could not make forward progress, and the NVMe controller state remains "deleting (no IO)"
indefinitely. The only way we found to get out of this state is to reboot the system.
Proposed fix:
============
static void nvme_remove(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
	pci_set_drvdata(pdev, NULL);

	if (!pci_device_is_present(pdev)) {
		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
		nvme_dev_disable(dev, true);
	}
	flush_work(&dev->ctrl.reset_work);
	nvme_stop_ctrl(&dev->ctrl);
	nvme_remove_namespaces(&dev->ctrl);  <== here ctrl state is set to "deleting (no IO)"
	[..]
}
As shown above, nvme_remove() invokes nvme_dev_disable(), but only if the device has been
physically removed. Since nvme_dev_disable() cancels pending IOs, does it make sense to
unconditionally cancel pending IOs before moving on? Or are there any side effects if we
were to invoke nvme_dev_disable() unconditionally in the first place?
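For discussion, the unconditional-disable variant being asked about might look like the following. This is an untested sketch, not a submitted patch; it is derived from the nvme_remove() listing above by moving nvme_dev_disable() out of the presence check:

```c
/* Hypothetical sketch only: always cancel outstanding IOs on remove so
 * del_gendisk() can freeze the queues even when pci_device_is_present()
 * still returns true (e.g. logical hot-unplug after a PCIe error). */
static void nvme_remove(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
	pci_set_drvdata(pdev, NULL);

	if (!pci_device_is_present(pdev))
		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);

	nvme_dev_disable(dev, true);	/* moved out of the if: always cancel */
	flush_work(&dev->ctrl.reset_work);
	nvme_stop_ctrl(&dev->ctrl);
	nvme_remove_namespaces(&dev->ctrl);
	[..]
}
```

Whether disabling a still-live controller this early has side effects (e.g. for an in-flight reset) is exactly the open question.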
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-18 12:52 [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang Nilay Shroff
@ 2024-04-21 10:28 ` Sagi Grimberg
2024-04-21 16:53 ` Nilay Shroff
2024-04-21 16:56 ` Nilay Shroff
0 siblings, 2 replies; 10+ messages in thread
From: Sagi Grimberg @ 2024-04-21 10:28 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme
Cc: Keith Busch, Christoph Hellwig, axboe, Gregory Joyce,
Srimannarayana Murthy Maram
On 18/04/2024 15:52, Nilay Shroff wrote:
> Hi,
>
> We found that the NVMe driver hangs when disk IO is in progress and we inject a PCIe error and then hot-unplug (logically, not physically) the NVMe disk.
>
> Notes and observations:
> ======================
> This was observed on the latest Linus kernel tree (v6.9-rc4); however, we believe the issue is also present in older kernels.
>
> Test details:
> =============
> Steps to reproduce this issue:
>
> 1. Run some disk IO using fio or any other tool
> 2. While disk IO is running, inject pci error
> 3. disable the slot where nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
>
> Kernel Logs:
> ============
> When we follow steps described in the test details we get the below logs:
>
> [ 295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> [ 295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
> [ 295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
> [ 490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
> [ 490.381614] Not tainted 6.9.0-rc4+ #8
> [ 490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 490.381623] task:bash state:D stack:0 pid:2510 tgid:2510 ppid:2509 flags:0x00042080
> [ 490.381632] Call Trace:
> [ 490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
> [ 490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
> [ 490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
> [ 490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
> [ 490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
> [ 490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
> [ 490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
> [ 490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
> [ 490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
> [ 490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
> [ 490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
> [ 490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
> [ 490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
> [ 490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
> [ 490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
> [ 490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
> [ 490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
> [ 490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
> [ 490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
> [ 490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
> [ 490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
> [ 490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
> [ 490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
> [ 490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>
> NVMe controller state:
> ======================
> # cat /sys/class/nvme/nvme1/state
> deleting (no IO)
>
> Process State:
> ==============
> # ps -aex
> [..]
> 2510 pts/2 Ds+ 0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
> 2549 ? Ds 0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based
> [..]
>
> Observation:
> ============
> As is apparent from the above logs, "disable-slot" (pid 2510) is waiting (in uninterruptible sleep)
> for the queue to be frozen because the in-flight IO(s) could not finish. Moreover, the IO(s) which were
> in flight actually time out; however, nvme_timeout() does not cancel those IOs but merely logs the error
> "Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
> As those in-flight IOs were never cancelled, the NVMe driver code running in the context of
> "disable-slot" could not make forward progress, and the NVMe controller state remains "deleting (no IO)"
> indefinitely. The only way we found to get out of this state is to reboot the system.
>
> Proposed fix:
> ============
> static void nvme_remove(struct pci_dev *pdev)
> {
> 	struct nvme_dev *dev = pci_get_drvdata(pdev);
>
> 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> 	pci_set_drvdata(pdev, NULL);
>
> 	if (!pci_device_is_present(pdev)) {
> 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
> 		nvme_dev_disable(dev, true);
> 	}
> 	flush_work(&dev->ctrl.reset_work);
> 	nvme_stop_ctrl(&dev->ctrl);
> 	nvme_remove_namespaces(&dev->ctrl);  <== here ctrl state is set to "deleting (no IO)"
> 	[..]
> }
>
> As shown above, nvme_remove() invokes nvme_dev_disable(), but only if the device has been
> physically removed. Since nvme_dev_disable() cancels pending IOs, does it make sense to
> unconditionally cancel pending IOs before moving on? Or are there any side effects if we
> were to invoke nvme_dev_disable() unconditionally in the first place?
Shouldn't the correct place to handle the cancellation be
nvme_error_detected(), given that the
PCI error is preventing the request from completing and the timeout
handler from addressing it?
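For context, a simplified sketch of the callback referred to here, modeled on nvme_error_detected() in drivers/nvme/host/pci.c (details vary by kernel version; treat this as an approximation, not the exact upstream source). On a frozen channel it disables the controller, which reaps outstanding requests, but it only runs on platforms where the PCI core actually invokes the error handlers:

```c
/* Approximate sketch of the existing PCI error-recovery entry point. */
static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
					    pci_channel_state_t state)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	switch (state) {
	case pci_channel_io_normal:
		return PCI_ERS_RESULT_CAN_RECOVER;
	case pci_channel_io_frozen:
		/* nvme_dev_disable() cancels in-flight IOs here */
		nvme_dev_disable(dev, false);
		return PCI_ERS_RESULT_NEED_RESET;
	case pci_channel_io_perm_failure:
		return PCI_ERS_RESULT_DISCONNECT;
	}
	return PCI_ERS_RESULT_NEED_RESET;
}
```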
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-21 10:28 ` Sagi Grimberg
@ 2024-04-21 16:53 ` Nilay Shroff
2024-04-21 16:56 ` Nilay Shroff
1 sibling, 0 replies; 10+ messages in thread
From: Nilay Shroff @ 2024-04-21 16:53 UTC (permalink / raw)
To: linux-nvme
On 4/21/24 15:58, Sagi Grimberg wrote:
>
>
> On 18/04/2024 15:52, Nilay Shroff wrote:
>> Hi,
>>
>> We found that the NVMe driver hangs when disk IO is in progress and we inject a PCIe error and then hot-unplug (logically, not physically) the NVMe disk.
>>
>> Notes and observations:
>> ======================
>> This was observed on the latest Linus kernel tree (v6.9-rc4); however, we believe the issue is also present in older kernels.
>>
>> Test details:
>> =============
>> Steps to reproduce this issue:
>>
>> 1. Run some disk IO using fio or any other tool
>> 2. While disk IO is running, inject pci error
>> 3. disable the slot where nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
>>
>> Kernel Logs:
>> ============
>> When we follow steps described in the test details we get the below logs:
>>
>> [ 295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
>> [ 295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
>> [ 295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
>> [ 490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
>> [ 490.381614] Not tainted 6.9.0-rc4+ #8
>> [ 490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 490.381623] task:bash state:D stack:0 pid:2510 tgid:2510 ppid:2509 flags:0x00042080
>> [ 490.381632] Call Trace:
>> [ 490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
>> [ 490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
>> [ 490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
>> [ 490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
>> [ 490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
>> [ 490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
>> [ 490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
>> [ 490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
>> [ 490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
>> [ 490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
>> [ 490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
>> [ 490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
>> [ 490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
>> [ 490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
>> [ 490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
>> [ 490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
>> [ 490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
>> [ 490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
>> [ 490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
>> [ 490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
>> [ 490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
>> [ 490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
>> [ 490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
>> [ 490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>>
>> NVMe controller state:
>> ======================
>> # cat /sys/class/nvme/nvme1/state
>> deleting (no IO)
>>
>> Process State:
>> ==============
>> # ps -aex
>> [..]
>> 2510 pts/2 Ds+ 0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
>> 2549 ? Ds 0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based
>> [..]
>>
>> Observation:
>> ============
>> As is apparent from the above logs, "disable-slot" (pid 2510) is waiting (in uninterruptible sleep)
>> for the queue to be frozen because the in-flight IO(s) could not finish. Moreover, the IO(s) which were
>> in flight actually time out; however, nvme_timeout() does not cancel those IOs but merely logs the error
>> "Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
>> As those in-flight IOs were never cancelled, the NVMe driver code running in the context of
>> "disable-slot" could not make forward progress, and the NVMe controller state remains "deleting (no IO)"
>> indefinitely. The only way we found to get out of this state is to reboot the system.
>>
>> Proposed fix:
>> ============
>> static void nvme_remove(struct pci_dev *pdev)
>> {
>> 	struct nvme_dev *dev = pci_get_drvdata(pdev);
>>
>> 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>> 	pci_set_drvdata(pdev, NULL);
>>
>> 	if (!pci_device_is_present(pdev)) {
>> 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>> 		nvme_dev_disable(dev, true);
>> 	}
>> 	flush_work(&dev->ctrl.reset_work);
>> 	nvme_stop_ctrl(&dev->ctrl);
>> 	nvme_remove_namespaces(&dev->ctrl);  <== here ctrl state is set to "deleting (no IO)"
>> 	[..]
>> }
>>
>> As shown above, nvme_remove() invokes nvme_dev_disable(), but only if the device has been
>> physically removed. Since nvme_dev_disable() cancels pending IOs, does it make sense to
>> unconditionally cancel pending IOs before moving on? Or are there any side effects if we
>> were to invoke nvme_dev_disable() unconditionally in the first place?
>
> Shouldn't the correct place to handle the cancellation be nvme_error_detected(), given that the
> PCI error is preventing the request from completing and the timeout handler from addressing it?
>
If a platform supports PCI error recovery then it may be possible to cancel pending IOs from
nvme_error_detected(); however, if the platform doesn't support PCI error recovery then
nvme_error_detected() would never be called. In fact, the issue reported above was
discovered on a platform which has PCI error recovery disabled.
I also tested this scenario on a platform supporting PCI error recovery. On that platform,
when I ran this test (PCIe error injection followed by NVMe hot-unplug), I found that the
PCI error-recovery thread races with the hot-unplug task. Please find below the dmesg logs
collected when this issue manifests:
INFO: task eehd:187 blocked for more than 122 seconds.
Not tainted 6.9.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:eehd state:D stack:0 pid:187 tgid:187 ppid:2 flags:0x00000000
Call Trace:
[c0000000089bf890] [c000000000fb027c] vsnprintf+0x3f8/0x578 (unreliable)
[c0000000089bfa40] [c00000000001f3fc] __switch_to+0x13c/0x220
[c0000000089bfaa0] [c000000000fb87e0] __schedule+0x268/0x7c4
[c0000000089bfb70] [c000000000fb8d7c] schedule+0x40/0x108
[c0000000089bfbe0] [c000000000fb93f8] schedule_preempt_disabled+0x20/0x30
[c0000000089bfc00] [c000000000fbbe84] __mutex_lock.constprop.0+0x5f4/0xc54
[c0000000089bfca0] [c000000000916380] pci_lock_rescan_remove+0x28/0x3c
[c0000000089bfcc0] [c00000000004fa4c] eeh_pe_report_edev+0x3c/0x52c
[c0000000089bfda0] [c00000000004ffdc] eeh_pe_report+0xa0/0x158
[c0000000089bfe40] [c000000000050490] eeh_handle_normal_event+0x310/0xa24
[c0000000089bff30] [c000000000051078] eeh_event_handler+0x118/0x19c
[c0000000089bff90] [c00000000018d29c] kthread+0x138/0x140
[c0000000089bffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
INFO: task bash:5420 blocked for more than 122 seconds.
Not tainted 6.9.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash state:D stack:0 pid:5420 tgid:5420 ppid:5419 flags:0x00042080
Call Trace:
[c000000054c67510] [c000000054c67550] 0xc000000054c67550 (unreliable)
[c000000054c676c0] [c00000000001f3fc] __switch_to+0x13c/0x220
[c000000054c67720] [c000000000fb87e0] __schedule+0x268/0x7c4
[c000000054c677f0] [c000000000fb8d7c] schedule+0x40/0x108
[c000000054c67860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
[c000000054c678c0] [c00000000081eba8] del_gendisk+0x284/0x464
[c000000054c67920] [c00800000b7174a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
[c000000054c67960] [c00800000b717704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
[c000000054c679d0] [c008000006294b70] nvme_remove+0x80/0x168 [nvme]
[c000000054c67a10] [c00000000092a10c] pci_device_remove+0x6c/0x110
[c000000054c67a50] [c000000000a4f504] device_remove+0x70/0xc4
[c000000054c67a80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
[c000000054c67ad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
[c000000054c67b10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
[c000000054c67b40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
[c000000054c67bd0] [c008000004380504] disable_slot+0x40/0x90 [rpaphp]
[c000000054c67c00] [c000000000956090] power_write_file+0xf8/0x19c
[c000000054c67c80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
[c000000054c67ca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
[c000000054c67cc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
[c000000054c67d10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
[c000000054c67dc0] [c0000000005e13c0] ksys_write+0x84/0x140
[c000000054c67e10] [c000000000030a84] system_call_exception+0x124/0x330
[c000000054c67e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
As we can see above, the task eehd:187 (PCI error-recovery thread) is blocked trying to
acquire the mutex (pci_rescan_remove_lock). The hot-unplug task (bash:5420) is blocked
waiting for the request queue to be frozen. The hot-unplug task first acquires
pci_rescan_remove_lock and then invokes the nvme_remove() method. So, in summary, in the
traces shown above, as the hot-unplug task is unable to proceed, it has also blocked the
PCI error-recovery thread.
On another note, if the PCI error-recovery thread could make progress by acquiring
pci_rescan_remove_lock first, it would be able to recover from the PCI error, and hence
the pending IOs could finish. Later, when the hot-unplug task starts, it could make
forward progress and clean up all the resources used by the NVMe disk.
So does it make sense to unconditionally cancel the pending IOs from nvme_remove()
before it proceeds to remove namespaces?
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-21 10:28 ` Sagi Grimberg
2024-04-21 16:53 ` Nilay Shroff
@ 2024-04-21 16:56 ` Nilay Shroff
2024-04-22 13:00 ` Sagi Grimberg
1 sibling, 1 reply; 10+ messages in thread
From: Nilay Shroff @ 2024-04-21 16:56 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme
Cc: Keith Busch, Christoph Hellwig, axboe, Gregory Joyce,
Srimannarayana Murthy Maram
On 4/21/24 15:58, Sagi Grimberg wrote:
>
>
> On 18/04/2024 15:52, Nilay Shroff wrote:
>> Hi,
>>
>> We found that the NVMe driver hangs when disk IO is in progress and we inject a PCIe error and then hot-unplug (logically, not physically) the NVMe disk.
>>
>> Notes and observations:
>> ======================
>> This was observed on the latest Linus kernel tree (v6.9-rc4); however, we believe the issue is also present in older kernels.
>>
>> Test details:
>> =============
>> Steps to reproduce this issue:
>>
>> 1. Run some disk IO using fio or any other tool
>> 2. While disk IO is running, inject pci error
>> 3. disable the slot where nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
>>
>> Kernel Logs:
>> ============
>> When we follow steps described in the test details we get the below logs:
>>
>> [ 295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
>> [ 295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
>> [ 295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
>> [ 490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
>> [ 490.381614] Not tainted 6.9.0-rc4+ #8
>> [ 490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 490.381623] task:bash state:D stack:0 pid:2510 tgid:2510 ppid:2509 flags:0x00042080
>> [ 490.381632] Call Trace:
>> [ 490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
>> [ 490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
>> [ 490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
>> [ 490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
>> [ 490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
>> [ 490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
>> [ 490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
>> [ 490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
>> [ 490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
>> [ 490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
>> [ 490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
>> [ 490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
>> [ 490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
>> [ 490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
>> [ 490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
>> [ 490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
>> [ 490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
>> [ 490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
>> [ 490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
>> [ 490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
>> [ 490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
>> [ 490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
>> [ 490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
>> [ 490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>>
>> NVMe controller state:
>> ======================
>> # cat /sys/class/nvme/nvme1/state
>> deleting (no IO)
>>
>> Process State:
>> ==============
>> # ps -aex
>> [..]
>> 2510 pts/2 Ds+ 0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
>> 2549 ? Ds 0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based
>> [..]
>>
>> Observation:
>> ============
>> As is apparent from the above logs, "disable-slot" (pid 2510) is waiting (in uninterruptible sleep)
>> for the queue to be frozen because the in-flight IO(s) could not finish. Moreover, the IO(s) which were
>> in flight actually time out; however, nvme_timeout() does not cancel those IOs but merely logs the error
>> "Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
>> As those in-flight IOs were never cancelled, the NVMe driver code running in the context of
>> "disable-slot" could not make forward progress, and the NVMe controller state remains "deleting (no IO)"
>> indefinitely. The only way we found to get out of this state is to reboot the system.
>>
>> Proposed fix:
>> ============
>> static void nvme_remove(struct pci_dev *pdev)
>> {
>> 	struct nvme_dev *dev = pci_get_drvdata(pdev);
>>
>> 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>> 	pci_set_drvdata(pdev, NULL);
>>
>> 	if (!pci_device_is_present(pdev)) {
>> 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>> 		nvme_dev_disable(dev, true);
>> 	}
>> 	flush_work(&dev->ctrl.reset_work);
>> 	nvme_stop_ctrl(&dev->ctrl);
>> 	nvme_remove_namespaces(&dev->ctrl);  <== here ctrl state is set to "deleting (no IO)"
>> 	[..]
>> }
>>
>> As shown above, nvme_remove() invokes nvme_dev_disable(), but only if the device has been
>> physically removed. Since nvme_dev_disable() cancels pending IOs, does it make sense to
>> unconditionally cancel pending IOs before moving on? Or are there any side effects if we
>> were to invoke nvme_dev_disable() unconditionally in the first place?
>
> Shouldn't the correct place to handle the cancellation be nvme_error_detected(), given that the
> PCI error is preventing the request from completing and the timeout handler from addressing it?
>
If a platform supports PCI error recovery then it may be possible to cancel pending IOs from
nvme_error_detected(); however, if the platform doesn't support PCI error recovery then
nvme_error_detected() would never be called. In fact, the issue reported above was
discovered on a platform which has PCI error recovery disabled.
I also tested this scenario on a platform supporting PCI error recovery. On that platform,
when I ran this test (PCIe error injection followed by NVMe hot-unplug), I found that the
PCI error-recovery thread races with the hot-unplug task. Please find below the dmesg logs
collected when this issue manifests:
INFO: task eehd:187 blocked for more than 122 seconds.
Not tainted 6.9.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:eehd state:D stack:0 pid:187 tgid:187 ppid:2 flags:0x00000000
Call Trace:
[c0000000089bf890] [c000000000fb027c] vsnprintf+0x3f8/0x578 (unreliable)
[c0000000089bfa40] [c00000000001f3fc] __switch_to+0x13c/0x220
[c0000000089bfaa0] [c000000000fb87e0] __schedule+0x268/0x7c4
[c0000000089bfb70] [c000000000fb8d7c] schedule+0x40/0x108
[c0000000089bfbe0] [c000000000fb93f8] schedule_preempt_disabled+0x20/0x30
[c0000000089bfc00] [c000000000fbbe84] __mutex_lock.constprop.0+0x5f4/0xc54
[c0000000089bfca0] [c000000000916380] pci_lock_rescan_remove+0x28/0x3c
[c0000000089bfcc0] [c00000000004fa4c] eeh_pe_report_edev+0x3c/0x52c
[c0000000089bfda0] [c00000000004ffdc] eeh_pe_report+0xa0/0x158
[c0000000089bfe40] [c000000000050490] eeh_handle_normal_event+0x310/0xa24
[c0000000089bff30] [c000000000051078] eeh_event_handler+0x118/0x19c
[c0000000089bff90] [c00000000018d29c] kthread+0x138/0x140
[c0000000089bffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
INFO: task bash:5420 blocked for more than 122 seconds.
Not tainted 6.9.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash state:D stack:0 pid:5420 tgid:5420 ppid:5419 flags:0x00042080
Call Trace:
[c000000054c67510] [c000000054c67550] 0xc000000054c67550 (unreliable)
[c000000054c676c0] [c00000000001f3fc] __switch_to+0x13c/0x220
[c000000054c67720] [c000000000fb87e0] __schedule+0x268/0x7c4
[c000000054c677f0] [c000000000fb8d7c] schedule+0x40/0x108
[c000000054c67860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
[c000000054c678c0] [c00000000081eba8] del_gendisk+0x284/0x464
[c000000054c67920] [c00800000b7174a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
[c000000054c67960] [c00800000b717704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
[c000000054c679d0] [c008000006294b70] nvme_remove+0x80/0x168 [nvme]
[c000000054c67a10] [c00000000092a10c] pci_device_remove+0x6c/0x110
[c000000054c67a50] [c000000000a4f504] device_remove+0x70/0xc4
[c000000054c67a80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
[c000000054c67ad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
[c000000054c67b10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
[c000000054c67b40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
[c000000054c67bd0] [c008000004380504] disable_slot+0x40/0x90 [rpaphp]
[c000000054c67c00] [c000000000956090] power_write_file+0xf8/0x19c
[c000000054c67c80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
[c000000054c67ca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
[c000000054c67cc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
[c000000054c67d10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
[c000000054c67dc0] [c0000000005e13c0] ksys_write+0x84/0x140
[c000000054c67e10] [c000000000030a84] system_call_exception+0x124/0x330
[c000000054c67e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
As we can see above, the task eehd:187 (the pci-error-recovery thread) is blocked
waiting to acquire the mutex (pci_rescan_remove_lock). The hot-unplug task (bash: 5420)
is blocked waiting for the request queue to be frozen. The hot-unplug task first acquires
pci_rescan_remove_lock and then invokes nvme_remove(). So in summary,
in the traces shown above, because the hot-unplug task is unable to proceed, it has also
blocked the pci-error-recovery thread.
On another note, if the pci-error-recovery thread could make progress and acquire
pci_rescan_remove_lock first, it would be able to recover the pci error and hence
the pending IOs could finish. Later, when the hot-unplug task starts, it could
make forward progress and clean up all resources used by the nvme disk.
So does it make sense to unconditionally cancel the pending IOs from
nvme_remove() before it proceeds to remove namespaces?
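The circular wait described here can be sketched as a waits-for graph. The following is an illustrative userspace sketch, not kernel code: the node names and edges come from the traces above, while the graph representation and the cycle check are invented for the illustration.

```c
#include <stdbool.h>

/* Tasks/events from the hang traces above. */
enum { HOT_UNPLUG, IO_COMPLETION, EEH_RECOVERY, NODES };

/* waits_for[a][b] != 0 means a cannot proceed until b does. */
static const int waits_for[NODES][NODES] = {
    /* hot-unplug holds pci_rescan_remove_lock and waits in
     * blk_mq_freeze_queue_wait() for in-flight I/O to finish */
    [HOT_UNPLUG]    = { [IO_COMPLETION] = 1 },
    /* the in-flight I/O can only finish once recovery resets the device */
    [IO_COMPLETION] = { [EEH_RECOVERY] = 1 },
    /* eehd blocks in pci_lock_rescan_remove(), held by hot-unplug */
    [EEH_RECOVERY]  = { [HOT_UNPLUG] = 1 },
};

/* Depth-limited DFS: returns true if we can walk from 'node' back to
 * 'start' along waits-for edges, i.e. a deadlock cycle exists. */
static bool in_cycle(int node, int start, int depth)
{
    if (depth > 0 && node == start)
        return true;
    if (depth >= NODES)
        return false;
    for (int next = 0; next < NODES; next++)
        if (waits_for[node][next] && in_cycle(next, start, depth + 1))
            return true;
    return false;
}
```

Starting the walk from any of the three participants comes back to itself, which is the deadlock: no participant can run until another one does.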
Thanks,
--Nilay
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-21 16:56 ` Nilay Shroff
@ 2024-04-22 13:00 ` Sagi Grimberg
2024-04-22 13:52 ` Keith Busch
0 siblings, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2024-04-22 13:00 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme
Cc: Keith Busch, Christoph Hellwig, axboe, Gregory Joyce,
Srimannarayana Murthy Maram
On 21/04/2024 19:56, Nilay Shroff wrote:
>
> On 4/21/24 15:58, Sagi Grimberg wrote:
>>
>> On 18/04/2024 15:52, Nilay Shroff wrote:
>>> Hi,
>>>
>>> We found nvme driver hangs when disk IO is ongoing and if we inject pcie error and hot-unplug (not physical but logical unplug) the nvme disk.
>>>
>>> Notes and observations:
>>> ======================
>>> This is observed on the latest linus kernel tree (v6.9-rc4) however we believe this issue shall also be present on the older kernels.
>>>
>>> Test details:
>>> =============
>>> Steps to reproduce this issue:
>>>
>>> 1. Run some disk IO using fio or any other tool
>>> 2. While disk IO is running, inject pci error
>>> 3. disable the slot where nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
>>>
>>> Kernel Logs:
>>> ============
>>> When we follow steps described in the test details we get the below logs:
>>>
>>> [ 295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
>>> [ 295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
>>> [ 295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
>>> [ 490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
>>> [ 490.381614] Not tainted 6.9.0-rc4+ #8
>>> [ 490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [ 490.381623] task:bash state:D stack:0 pid:2510 tgid:2510 ppid:2509 flags:0x00042080
>>> [ 490.381632] Call Trace:
>>> [ 490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
>>> [ 490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
>>> [ 490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
>>> [ 490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
>>> [ 490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
>>> [ 490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
>>> [ 490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
>>> [ 490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
>>> [ 490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
>>> [ 490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
>>> [ 490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
>>> [ 490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
>>> [ 490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
>>> [ 490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
>>> [ 490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
>>> [ 490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
>>> [ 490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
>>> [ 490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
>>> [ 490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
>>> [ 490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
>>> [ 490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
>>> [ 490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
>>> [ 490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
>>> [ 490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>>>
>>> NVMe controller state:
>>> ======================
>>> # cat /sys/class/nvme/nvme1/state
>>> deleting (no IO)
>>>
>>> Process State:
>>> ==============
>>> # ps -aex
>>> [..]
>>> 2510 pts/2 Ds+ 0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
>>> 2549 ? Ds 0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based
>>> [..]
>>>
>>> Observation:
>>> ============
>>> As is apparent from the above logs, "disable-slot" (pid 2510) is waiting (in uninterruptible sleep)
>>> for the queue to be frozen because the in-flight IO(s) couldn't finish. Moreover, the IO(s) which were
>>> in-flight actually time out; however, nvme_timeout() doesn't cancel those IOs but logs the error
>>> "Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
>>> As those in-flight IOs were not cancelled, the NVMe driver code which runs in the context of
>>> "disable-slot" couldn't make forward progress and the NVMe controller state remains "deleting (no IO)"
>>> indefinitely. The only way we found to come out of this state is to reboot the system.
>>>
>>> Proposed fix:
>>> ============
>>> static void nvme_remove(struct pci_dev *pdev)
>>> {
>>> struct nvme_dev *dev = pci_get_drvdata(pdev);
>>>
>>> nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>>> pci_set_drvdata(pdev, NULL);
>>>
>>> if (!pci_device_is_present(pdev)) {
>>> nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>>> nvme_dev_disable(dev, true);
>>> }
>>> flush_work(&dev->ctrl.reset_work);
>>> nvme_stop_ctrl(&dev->ctrl);
>>> nvme_remove_namespaces(&dev->ctrl); <== here cntrl state is set to "deleting (no IO)"
>>> [..]
>>> }
>>>
>>> As shown above, nvme_remove() invokes nvme_dev_disable(), however, it is only invoked if the
>>> device has been physically removed. As nvme_dev_disable() helps cancel pending IOs, does it make
>>> sense to unconditionally cancel pending IOs before moving on? Or are there any side effects if
>>> we were to unconditionally invoke nvme_dev_disable() in the first place?
>> Shouldn't the correct place to handle the cancellation is nvme_error_detected() given that the
>> pci error is preventing the request from completing and the timeout handler from addressing it?
>>
> If a platform supports pci-error-recovery then it may be possible to cancel pending IOs from
> nvme_error_detected(), however, if the platform doesn't support pci error recovery then
> nvme_error_detected() would never be called. In fact, the issue which I reported above was
> discovered on a platform which has pci-error-recovery disabled.
>
> I also tested this scenario on a platform supporting pci error recovery. On this platform,
> when I ran this test (PCI error injection followed by NVMe hot unplug), I found that the
> pci-error-recovery thread races with the hot-unplug task; please find below the dmesg logs
> collected when this issue manifests:
>
> INFO: task eehd:187 blocked for more than 122 seconds.
> Not tainted 6.9.0-rc4+ #8
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:eehd state:D stack:0 pid:187 tgid:187 ppid:2 flags:0x00000000
> Call Trace:
> [c0000000089bf890] [c000000000fb027c] vsnprintf+0x3f8/0x578 (unreliable)
> [c0000000089bfa40] [c00000000001f3fc] __switch_to+0x13c/0x220
> [c0000000089bfaa0] [c000000000fb87e0] __schedule+0x268/0x7c4
> [c0000000089bfb70] [c000000000fb8d7c] schedule+0x40/0x108
> [c0000000089bfbe0] [c000000000fb93f8] schedule_preempt_disabled+0x20/0x30
> [c0000000089bfc00] [c000000000fbbe84] __mutex_lock.constprop.0+0x5f4/0xc54
> [c0000000089bfca0] [c000000000916380] pci_lock_rescan_remove+0x28/0x3c
> [c0000000089bfcc0] [c00000000004fa4c] eeh_pe_report_edev+0x3c/0x52c
> [c0000000089bfda0] [c00000000004ffdc] eeh_pe_report+0xa0/0x158
> [c0000000089bfe40] [c000000000050490] eeh_handle_normal_event+0x310/0xa24
> [c0000000089bff30] [c000000000051078] eeh_event_handler+0x118/0x19c
> [c0000000089bff90] [c00000000018d29c] kthread+0x138/0x140
> [c0000000089bffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
>
> INFO: task bash:5420 blocked for more than 122 seconds.
> Not tainted 6.9.0-rc4+ #8
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:bash state:D stack:0 pid:5420 tgid:5420 ppid:5419 flags:0x00042080
> Call Trace:
> [c000000054c67510] [c000000054c67550] 0xc000000054c67550 (unreliable)
> [c000000054c676c0] [c00000000001f3fc] __switch_to+0x13c/0x220
> [c000000054c67720] [c000000000fb87e0] __schedule+0x268/0x7c4
> [c000000054c677f0] [c000000000fb8d7c] schedule+0x40/0x108
> [c000000054c67860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
> [c000000054c678c0] [c00000000081eba8] del_gendisk+0x284/0x464
> [c000000054c67920] [c00800000b7174a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
> [c000000054c67960] [c00800000b717704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
> [c000000054c679d0] [c008000006294b70] nvme_remove+0x80/0x168 [nvme]
> [c000000054c67a10] [c00000000092a10c] pci_device_remove+0x6c/0x110
> [c000000054c67a50] [c000000000a4f504] device_remove+0x70/0xc4
> [c000000054c67a80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
> [c000000054c67ad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
> [c000000054c67b10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
> [c000000054c67b40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
> [c000000054c67bd0] [c008000004380504] disable_slot+0x40/0x90 [rpaphp]
> [c000000054c67c00] [c000000000956090] power_write_file+0xf8/0x19c
> [c000000054c67c80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
> [c000000054c67ca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
> [c000000054c67cc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
> [c000000054c67d10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
> [c000000054c67dc0] [c0000000005e13c0] ksys_write+0x84/0x140
> [c000000054c67e10] [c000000000030a84] system_call_exception+0x124/0x330
> [c000000054c67e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>
>
> As we can see above, the task eehd:187 (the pci-error-recovery thread) is blocked
> waiting to acquire the mutex (pci_rescan_remove_lock). The hot-unplug task (bash: 5420)
> is blocked waiting for the request queue to be frozen. The hot-unplug task first acquires
> pci_rescan_remove_lock and then invokes nvme_remove(). So in summary,
> in the traces shown above, because the hot-unplug task is unable to proceed, it has also
> blocked the pci-error-recovery thread.
Yea that needs fixing.
>
> On another note, if the pci-error-recovery thread could make progress and acquire
> pci_rescan_remove_lock first, it would be able to recover the pci error and hence
> the pending IOs could finish. Later, when the hot-unplug task starts, it could
> make forward progress and clean up all resources used by the nvme disk.
>
> So does it make sense to unconditionally cancel the pending IOs from
> nvme_remove() before it proceeds to remove namespaces?
The driver attempts to allow in-flight I/O to complete successfully if the
device is still present at the remove stage. I am not sure we want to
unconditionally fail these I/Os. Keith?
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-22 13:00 ` Sagi Grimberg
@ 2024-04-22 13:52 ` Keith Busch
2024-04-22 14:35 ` Keith Busch
0 siblings, 1 reply; 10+ messages in thread
From: Keith Busch @ 2024-04-22 13:52 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Nilay Shroff, linux-nvme, Christoph Hellwig, axboe,
Gregory Joyce, Srimannarayana Murthy Maram
On Mon, Apr 22, 2024 at 04:00:54PM +0300, Sagi Grimberg wrote:
> > pci_rescan_remove_lock then it shall be able to recover the pci error and hence
> > pending IOs could be finished. Later when hot-unplug task starts, it could
> > forward progress and cleanup all resources used by the nvme disk.
> >
> > So does it make sense if we unconditionally cancel the pending IOs from
> > nvme_remove() before it forward progress to remove namespaces?
>
> The driver attempts to allow in-flight I/O to complete successfully if the
> device is still present at the remove stage. I am not sure we want to
> unconditionally fail these I/Os. Keith?
We have a timeout handler to clean this up, but I think it was another
PPC-specific patch that has the timeout handler do nothing if pcie error
recovery is in progress. That seems questionable; we should be able to
concurrently run error handling and timeouts, but I think the error
handling just needs to synchronize the request_queue's in the
"error_detected" path.
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-22 13:52 ` Keith Busch
@ 2024-04-22 14:35 ` Keith Busch
2024-04-23 9:52 ` Nilay Shroff
0 siblings, 1 reply; 10+ messages in thread
From: Keith Busch @ 2024-04-22 14:35 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Nilay Shroff, linux-nvme, Christoph Hellwig, axboe,
Gregory Joyce, Srimannarayana Murthy Maram
On Mon, Apr 22, 2024 at 07:52:25AM -0600, Keith Busch wrote:
> On Mon, Apr 22, 2024 at 04:00:54PM +0300, Sagi Grimberg wrote:
> > > pci_rescan_remove_lock then it shall be able to recover the pci error and hence
> > > pending IOs could be finished. Later when hot-unplug task starts, it could
> > > forward progress and cleanup all resources used by the nvme disk.
> > >
> > > So does it make sense if we unconditionally cancel the pending IOs from
> > > nvme_remove() before it forward progress to remove namespaces?
> >
> > The driver attempts to allow in-flight I/O to complete successfully if the
> > device is still present at the remove stage. I am not sure we want to
> > unconditionally fail these I/Os. Keith?
>
> We have a timeout handler to clean this up, but I think it was another
> PPC-specific patch that has the timeout handler do nothing if pcie error
> recovery is in progress. That seems questionable; we should be able to
> concurrently run error handling and timeouts, but I think the error
> handling just needs to synchronize the request_queue's in the
> "error_detected" path.
This:
---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685d..38d0215fe53fc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1286,13 +1286,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
u32 csts = readl(dev->bar + NVME_REG_CSTS);
u8 opcode;
- /* If PCI error recovery process is happening, we cannot reset or
- * the recovery mechanism will surely fail.
- */
- mb();
- if (pci_channel_offline(to_pci_dev(dev->dev)))
- return BLK_EH_RESET_TIMER;
-
/*
* Reset immediately if the controller is failed
*/
@@ -3300,6 +3293,7 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
return PCI_ERS_RESULT_DISCONNECT;
}
nvme_dev_disable(dev, false);
+ nvme_sync_queues(&dev->ctrl);
return PCI_ERS_RESULT_NEED_RESET;
case pci_channel_io_perm_failure:
dev_warn(dev->ctrl.device,
--
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-22 14:35 ` Keith Busch
@ 2024-04-23 9:52 ` Nilay Shroff
2024-04-24 17:36 ` Keith Busch
0 siblings, 1 reply; 10+ messages in thread
From: Nilay Shroff @ 2024-04-23 9:52 UTC (permalink / raw)
To: Keith Busch, Sagi Grimberg
Cc: linux-nvme, Christoph Hellwig, axboe, Gregory Joyce,
Srimannarayana Murthy Maram
On 4/22/24 20:05, Keith Busch wrote:
> On Mon, Apr 22, 2024 at 07:52:25AM -0600, Keith Busch wrote:
>> On Mon, Apr 22, 2024 at 04:00:54PM +0300, Sagi Grimberg wrote:
>>>> pci_rescan_remove_lock then it shall be able to recover the pci error and hence
>>>> pending IOs could be finished. Later when hot-unplug task starts, it could
>>>> forward progress and cleanup all resources used by the nvme disk.
>>>>
>>>> So does it make sense if we unconditionally cancel the pending IOs from
>>>> nvme_remove() before it forward progress to remove namespaces?
>>>
>>> The driver attempts to allow in-flight I/O to complete successfully if the
>>> device is still present at the remove stage. I am not sure we want to
>>> unconditionally fail these I/Os. Keith?
>>
>> We have a timeout handler to clean this up, but I think it was another
>> PPC-specific patch that has the timeout handler do nothing if pcie error
>> recovery is in progress. That seems questionable; we should be able to
>> concurrently run error handling and timeouts, but I think the error
>> handling just needs to synchronize the request_queue's in the
>> "error_detected" path.
>
> This:
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8e0bb9692685d..38d0215fe53fc 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1286,13 +1286,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
> u32 csts = readl(dev->bar + NVME_REG_CSTS);
> u8 opcode;
>
> - /* If PCI error recovery process is happening, we cannot reset or
> - * the recovery mechanism will surely fail.
> - */
> - mb();
> - if (pci_channel_offline(to_pci_dev(dev->dev)))
> - return BLK_EH_RESET_TIMER;
> -
> /*
> * Reset immediately if the controller is failed
> */
> @@ -3300,6 +3293,7 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
> return PCI_ERS_RESULT_DISCONNECT;
> }
> nvme_dev_disable(dev, false);
> + nvme_sync_queues(&dev->ctrl);
> return PCI_ERS_RESULT_NEED_RESET;
> case pci_channel_io_perm_failure:
> dev_warn(dev->ctrl.device,
> --
>
I tested the above patch; however, it doesn't solve the issue.
I tested it for two cases listed below:
1. Platform which doesn't support pci-error-recovery:
-----------------------------------------------------
On this platform, when nvme_timeout() is invoked, it falls through
nvme_should_reset()
-> nvme_warn_reset()
-> goto disable
When nvme_timeout() jumps to the disable label, it tries setting the
controller state to RESETTING but that can't succeed because the
(logical) hot-unplug/nvme_remove() of the disk has started on another
thread and hence the controller state has already changed to
DELETING/DELETING_NOIO. As nvme_timeout() couldn't set the controller
state to RESETTING, nvme_timeout() returns BLK_EH_DONE. In summary,
as nvme_timeout() couldn't cancel the pending IO, hot-unplug/nvme_remove()
can't make forward progress and keeps waiting for the request queue to be frozen.
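The failing transition can be sketched with a reduced model of the controller state machine. This is a simplified userspace illustration with a cut-down transition rule, not the driver's actual nvme_change_ctrl_state() table:

```c
#include <stdbool.h>

enum ctrl_state {
    NVME_CTRL_LIVE, NVME_CTRL_RESETTING,
    NVME_CTRL_DELETING, NVME_CTRL_DELETING_NOIO, NVME_CTRL_DEAD,
};

/* Reduced rule: in this sketch a reset may only start from a live
 * controller. The real driver permits a few more source states, but
 * DELETING/DELETING_NOIO are not among them. */
static bool change_ctrl_state(enum ctrl_state *cur, enum ctrl_state next)
{
    if (next == NVME_CTRL_RESETTING && *cur != NVME_CTRL_LIVE)
        return false;   /* e.g. removal already moved us to DELETING */
    *cur = next;
    return true;
}

static bool state_terminal(enum ctrl_state s)
{
    return s == NVME_CTRL_DELETING || s == NVME_CTRL_DELETING_NOIO ||
           s == NVME_CTRL_DEAD;
}
```

Once removal has moved the state to DELETING, the timeout handler's attempt to enter RESETTING fails, so it falls back to BLK_EH_DONE with the I/O still in flight.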
2. Platform supporting pci-error-recovery:
------------------------------------------
Similarly, on this platform, as explained for the above case, when
nvme_timeout() is invoked, it falls through nvme_should_reset()
-> nvme_warn_reset() -> goto disable. In this case as well,
nvme_timeout() returns BLK_EH_DONE. Please note that though this
platform supports pci-error-recovery, we couldn't get through
nvme_error_detected() because the pci-error-recovery thread is pending
on acquiring the mutex "pci_lock_rescan_remove". This mutex is acquired by
the hot-unplug thread before it invokes nvme_remove(), and nvme_remove()
is currently waiting for the request queue to be frozen. For reference,
I have already captured the task hang traces in a previous email of this
thread, where we can observe these hangs (for both the pci-error-recovery
thread and hot-unplug/nvme_remove()).
I understand that we don't want to cancel pending IO from nvme_remove()
unconditionally because, if the disk is not physically hot-unplugged, we still
want to wait for the in-flight IO to finish. Also, looking through
the above cases, I think that nvme_timeout() might be the code path
from which we want to cancel in-flight/pending IO if the controller is
in a terminal state (i.e. DELETING or DELETING_NOIO). Keeping this idea in
mind, I have worked out the below patch:
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685..e45a54d84649 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1286,6 +1286,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
u32 csts = readl(dev->bar + NVME_REG_CSTS);
u8 opcode;
+ if (nvme_state_terminal(&dev->ctrl))
+ goto disable;
+
/* If PCI error recovery process is happening, we cannot reset or
* the recovery mechanism will surely fail.
*/
@@ -1390,8 +1393,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
return BLK_EH_RESET_TIMER;
disable:
- if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
+ if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+ if (nvme_state_terminal(&dev->ctrl)) {
+ nvme_dev_disable(dev, false);
+ nvme_sync_queues(&dev->ctrl);
+ }
return BLK_EH_DONE;
+ }
nvme_dev_disable(dev, false);
if (nvme_try_sched_reset(&dev->ctrl))
I have tested the above patch against all possible cases. Please let me know
if this looks good or if there are any further comments.
Thanks,
--Nilay
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-23 9:52 ` Nilay Shroff
@ 2024-04-24 17:36 ` Keith Busch
2024-04-25 13:49 ` Nilay Shroff
0 siblings, 1 reply; 10+ messages in thread
From: Keith Busch @ 2024-04-24 17:36 UTC (permalink / raw)
To: Nilay Shroff
Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, axboe,
Gregory Joyce, Srimannarayana Murthy Maram
On Tue, Apr 23, 2024 at 03:22:46PM +0530, Nilay Shroff wrote:
> >
> I tested the above patch; however, it doesn't solve the issue.
> I tested it for two cases listed below:
>
> 1. Platform which doesn't support pci-error-recovery:
> -----------------------------------------------------
> On this platform, when nvme_timeout() is invoked, it falls through
> nvme_should_reset()
> -> nvme_warn_reset()
> -> goto disable
>
> When nvme_timeout() jumps to the disable label, it tries setting the
> controller state to RESETTING but that can't succeed because the
> (logical) hot-unplug/nvme_remove() of the disk has started on another
> thread and hence the controller state has already changed to
> DELETING/DELETING_NOIO. As nvme_timeout() couldn't set the controller
> state to RESETTING, nvme_timeout() returns BLK_EH_DONE. In summary,
> as nvme_timeout() couldn't cancel the pending IO, hot-unplug/nvme_remove()
> can't make forward progress and keeps waiting for the request queue to be frozen.
>
> 2. Platform supporting pci-error-recovery:
> ------------------------------------------
> Similarly, on this platform, as explained for the above case, when
> nvme_timeout() is invoked, it falls through nvme_should_reset()
> -> nvme_warn_reset() -> goto disable. In this case as well,
> nvme_timeout() returns BLK_EH_DONE. Please note that though this
> platform supports pci-error-recovery, we couldn't get through
> nvme_error_detected() because the pci-error-recovery thread is pending
> on acquiring the mutex "pci_lock_rescan_remove". This mutex is acquired by
> the hot-unplug thread before it invokes nvme_remove(), and nvme_remove()
> is currently waiting for the request queue to be frozen. For reference,
> I have already captured the task hang traces in a previous email of this
> thread, where we can observe these hangs (for both the pci-error-recovery
> thread and hot-unplug/nvme_remove()).
>
> I understand that we don't want to cancel pending IO from nvme_remove()
> unconditionally because, if the disk is not physically hot-unplugged, we still
> want to wait for the in-flight IO to finish. Also, looking through
> the above cases, I think that nvme_timeout() might be the code path
> from which we want to cancel in-flight/pending IO if the controller is
> in a terminal state (i.e. DELETING or DELETING_NOIO). Keeping this idea in
> mind, I have worked out the below patch:
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8e0bb9692685..e45a54d84649 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1286,6 +1286,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
> u32 csts = readl(dev->bar + NVME_REG_CSTS);
> u8 opcode;
>
> + if (nvme_state_terminal(&dev->ctrl))
> + goto disable;
> +
> /* If PCI error recovery process is happening, we cannot reset or
> * the recovery mechanism will surely fail.
> */
> @@ -1390,8 +1393,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
> return BLK_EH_RESET_TIMER;
>
> disable:
> - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
> + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> + if (nvme_state_terminal(&dev->ctrl)) {
> + nvme_dev_disable(dev, false);
> + nvme_sync_queues(&dev->ctrl);
> + }
> return BLK_EH_DONE;
> + }
>
> nvme_dev_disable(dev, false);
> if (nvme_try_sched_reset(&dev->ctrl))
>
> I have tested the above patch against all possible cases. Please let me know
> if this looks good or if there are any further comments.
This looks okay to me. Just a couple things:
Set nvme_dev_disable's "shutdown" parameter to "true" since we're
restarting the queues again from this state.
Remove "nvme_sync_queues()". I think that would deadlock: sync_queues
waits for the timeout work to complete, but you're calling it within the
timeout work, so this would have it wait for itself.
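That self-wait can be sketched in userspace. The helper names below are hypothetical stand-ins, not the kernel workqueue API; the point is only that flushing a work item from inside that same work item can never complete:

```c
#include <stdbool.h>
#include <stddef.h>

struct work_item { const char *name; };

/* The work item the current thread is executing (NULL outside work). */
static const struct work_item *current_work;

/* Stand-in for a flush: returns false instead of blocking forever when
 * the caller asks to wait on the very work item it is running in --
 * which is what nvme_sync_queues() called from nvme_timeout() would do. */
static bool flush_work_would_complete(const struct work_item *w)
{
    if (current_work == w)
        return false;   /* self-wait: waiting on ourselves, never finishes */
    return true;        /* real code would now wait for w to finish */
}
```

So the sync belongs outside the timeout work, which is why dropping nvme_sync_queues() from that path is the safer option.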
* Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
2024-04-24 17:36 ` Keith Busch
@ 2024-04-25 13:49 ` Nilay Shroff
0 siblings, 0 replies; 10+ messages in thread
From: Nilay Shroff @ 2024-04-25 13:49 UTC (permalink / raw)
To: Keith Busch
Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, axboe,
Gregory Joyce, Srimannarayana Murthy Maram
On 4/24/24 23:06, Keith Busch wrote:
> On Tue, Apr 23, 2024 at 03:22:46PM +0530, Nilay Shroff wrote:
>> I understand that we don't want to cancel pending IO from the nvme_remove()
>> unconditionally as if the disk is not physically hot-unplug then we still
>> want to wait for the in-flight IO to be finished. Also looking through
>> the above cases, I think that the nvme_timeout() might be the code path
>> from where we want to cancel in-flight/pending IO if controller is
>> in the terminal state (i.e. DELETING or DELETING_NOIO). Keeping this idea in
>> mind, I have worked out the below patch:
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 8e0bb9692685..e45a54d84649 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -1286,6 +1286,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
>> u32 csts = readl(dev->bar + NVME_REG_CSTS);
>> u8 opcode;
>>
>> + if (nvme_state_terminal(&dev->ctrl))
>> + goto disable;
>> +
>> /* If PCI error recovery process is happening, we cannot reset or
>> * the recovery mechanism will surely fail.
>> */
>> @@ -1390,8 +1393,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
>> return BLK_EH_RESET_TIMER;
>>
>> disable:
>> - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
>> + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> + if (nvme_state_terminal(&dev->ctrl)) {
>> + nvme_dev_disable(dev, false);
>> + nvme_sync_queues(&dev->ctrl);
>> + }
>> return BLK_EH_DONE;
>> + }
>>
>> nvme_dev_disable(dev, false);
>> if (nvme_try_sched_reset(&dev->ctrl))
>>
>> I have tested the above patch against all possible cases. Please let me know
>> if this looks good or if there are any further comments.
>
> This looks okay to me. Just a couple things:
>
> Set nvme_dev_disable's "shutdown" parameter to "true" since we're
> restarting the queues again from this state.
>
> Remove "nvme_sync_queues()". I think that would deadlock: sync_queues
> waits for the timeout work to complete, but you're calling it within the
> timeout work, so this would have it wait for itself.
>
Thank you for reviewing the patch! And yes, I agree with your suggestions.
I will make those changes and send a formal patch in another email.
Thanks,
--Nilay
end of thread, other threads: [~2024-04-25 13:50 UTC | newest]
Thread overview: 10+ messages
2024-04-18 12:52 [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang Nilay Shroff
2024-04-21 10:28 ` Sagi Grimberg
2024-04-21 16:53 ` Nilay Shroff
2024-04-21 16:56 ` Nilay Shroff
2024-04-22 13:00 ` Sagi Grimberg
2024-04-22 13:52 ` Keith Busch
2024-04-22 14:35 ` Keith Busch
2024-04-23 9:52 ` Nilay Shroff
2024-04-24 17:36 ` Keith Busch
2024-04-25 13:49 ` Nilay Shroff