From: <Alex_Gagniuc@Dellteam.com>
To: <keith.busch@intel.com>
Cc: <linux-nvme@lists.infradead.org>, <linux-pci@vger.kernel.org>,
	<Austin.Bolen@dell.com>
Subject: [REGRESSION] Deadlock in nvme drive removal (Was: Problems with commit "cb4bfda62afa nvme-pci: fix hot removal during error handling")
Date: Wed, 19 Dec 2018 19:23:57 +0000	[thread overview]
Message-ID: <240b79e0c3c442c88de378f522ba5813@ausx13mps321.AMER.DELL.COM> (raw)
In-Reply-To: <ab932da928c242ab976ff469be2fdc44@ausx13mps321.AMER.DELL.COM> (raw)

I've tracked down the problem to pciehp_ist() deadlocking on nvme 
removal. I blame this on nvme bugs.

# ls /sys/bus/pci/devices/0000\:b0\:04.0/msi_irqs/
44

# ps aux |grep pcie
root       529  0.0  0.0      0     0 ?        D    11:04   0:00 [irq/44-pciehp]

# cat /proc/529/stack
[<0>] __flush_work+0x11a/0x1d0
[<0>] __cancel_work_timer+0x10a/0x190
[<0>] nvme_remove+0x38/0x170 [nvme]
[<0>] pci_device_remove+0x3b/0xc0
[<0>] device_release_driver_internal+0x180/0x240
[<0>] pci_stop_bus_device+0x69/0x90
[<0>] pci_stop_and_remove_bus_device+0xe/0x20
[<0>] pciehp_unconfigure_device+0x84/0x140
[<0>] pciehp_disable_slot+0x67/0x110
[<0>] pciehp_handle_presence_or_link_change+0x25f/0x400
[<0>] pciehp_ist+0x22b/0x230
[<0>] irq_thread_fn+0x1f/0x50
[<0>] irq_thread+0xf3/0x190
[<0>] kthread+0x112/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff

This is about two hours after the drive was removed, give or take a 
timezone.
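
For anyone following along, here is a minimal sketch of the shape of this
deadlock (illustrative only -- made-up names, not the actual nvme code): a
->remove() path running in the hotplug IRQ thread that does
cancel_work_sync()/flush_work() on a work item which can itself never finish
will block forever, which is exactly the D-state stack above.

/*
 * Illustrative sketch only; names are invented, this is not the nvme driver.
 */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

static DECLARE_COMPLETION(work_started);
static DECLARE_COMPLETION(device_gone_ack);
static struct work_struct demo_work;

static void demo_work_fn(struct work_struct *work)
{
	complete(&work_started);
	/*
	 * Stand-in for reset/scan work stuck on a yanked device:
	 * nothing ever completes this, so the work never returns.
	 */
	wait_for_completion(&device_gone_ack);
}

static void demo_remove(void)
{
	/*
	 * Stand-in for ->remove() running in the hotplug IRQ thread.
	 * cancel_work_sync() has to wait for a running demo_work_fn()
	 * to return; since it never does, the caller is now stuck in
	 * D state as well -- the same shape as pciehp_ist() above.
	 */
	cancel_work_sync(&demo_work);
}

static int __init demo_init(void)
{
	INIT_WORK(&demo_work, demo_work_fn);
	schedule_work(&demo_work);
	wait_for_completion(&work_started);	/* make sure it is running */
	demo_remove();				/* deadlocks here */
	return 0;
}
module_init(demo_init);
MODULE_LICENSE("GPL");

If the nvme work item is waiting on a surprise-removed device that will never
answer, the hotplug thread trying to tear the device down goes down with it.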

<abridged_obligatory_frustrated_developer_rant>
This took about two damn weeks to track down. Where do I have to go to 
collect the coffee rations?
</abridged_obligatory_frustrated_developer_rant>

Alex

On 12/18/2018 04:28 PM, Gagniuc, Alexandru - Dell Team wrote:
> Hi,
> 
> I've been seeing two issues with this commit. One was a NULL dereference
> that appears to have been fixed in 4.20-rc7.
> 
> For the second issue, I've included a system log in Appendix B with an
> explanation of what events are being generated -- all done with a
> hardware yanker, not ACPI or fake injection techniques.
> 
> After the second or so removal, the kernel will think the device is
> gone, and fails to recognize hotplug interrupts that bring the device
> back online. The hardware continues to send hotplug interrupts, but the
> handler doesn't want to recognize them -- look for "pciehp_isr hit"
> message in the logs.
> 
> I'm seeing this issue starting with commit
> cb4bfda62afa "nvme-pci: fix hot removal during error handling"
> I can't reproduce it with the commit right before it.
> 
> I'm not sure if this is a PCI or NVME bug, and it is baffling as to what
> the heck is going on. Even if I revert the commit from v4.20-rc7, I
> continue to see the hotplug interrupt misfire. I suspect it's the
> combination of several changes that breaks things.
> 
> Who let the bugs out?
> 
> Alex
> 
> 
> Appendix B: System log with explanations
> 
> [  133.772812] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
> 
> **** Lanes 0-2 are disconnected here
> 
> [  599.231003] pcieport 0000:b0:04.0: pciehp_isr hit
> [  599.231067] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Down
> [  599.272506] pcieport 0000:b0:04.0: pciehp_isr hit
> [  599.299943] pciehp 0000:b0:04.0:pcie204: Slot(178): Card present
> [  599.299947] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
> [  599.605523] pci 0000:b1:00.0: [8086:0a55] type 00 class 0x010802
> [  599.605565] pci 0000:b1:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
> [  599.605612] pci 0000:b1:00.0: Max Payload Size set to 256 (was 128,
> max 512)
> [  599.605618] pci 0000:b1:00.0: enabling Extended Tags
> [  599.605845] acpi_run_hpx: Type 3 record not supported
> [  599.605984] pci 0000:b1:00.0: 7.876 Gb/s available PCIe bandwidth,
> limited by 8 GT/s x1 link at 0000:b0:04.0 (capable of 31.504 Gb/s with 8
> GT/s x4 link)
> [  599.606356] pcieport 0000:b0:04.0: bridge window [io  0x1000-0x0fff]
> to [bus b1] add_size 1000
> [  599.606364] pcieport 0000:b0:04.0: BAR 13: no space for [io  size 0x1000]
> [  599.606368] pcieport 0000:b0:04.0: BAR 13: failed to assign [io  size
> 0x1000]
> [  599.606371] pcieport 0000:b0:04.0: BAR 13: no space for [io  size 0x1000]
> [  599.606375] pcieport 0000:b0:04.0: BAR 13: failed to assign [io  size
> 0x1000]
> [  599.606382] pci 0000:b1:00.0: BAR 0: assigned [mem
> 0xe1500000-0xe1503fff 64bit]
> [  599.606426] pcieport 0000:b0:04.0: PCI bridge to [bus b1]
> [  599.606434] pcieport 0000:b0:04.0:   bridge window [mem
> 0xe1500000-0xe15fffff]
> [  599.606440] pcieport 0000:b0:04.0:   bridge window [mem
> 0x386000000000-0x3860001fffff 64bit pref]
> [  599.608379] nvme nvme4: pci function 0000:b1:00.0
> [  599.608680] nvme 0000:b1:00.0: enabling device (0000 -> 0002)
> [  603.386826]  nvme1n2: p1
> [  603.475185] md: super_written gets error=10
> [  603.475192] md/raid1:mdX: Disk failure on dm-13, disabling device.
>                  md/raid1:mdX: Operation continuing on 2 devices.
> [  603.519511] device-mapper: raid: Faulty raid1 device #1 has readable
> super block.  Attempting to revive it.
> 
> *** Lanes 0-2 are reconnected
> 
> [  603.548495] md: recovery of RAID array mdX
> 
> *** All PCIe lanes are disconnected here
> 
> [  623.162103] pcieport 0000:b0:04.0: pciehp_isr hit
> [  623.162169] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Down
> [  623.168534] pcieport 0000:b0:04.0: pciehp_isr hit
> [  623.185774] print_req_error: I/O error, dev nvme1n2, sector 7860864
> [  623.185812] print_req_error: I/O error, dev nvme1n2, sector 7861248
> [  623.185817] print_req_error: I/O error, dev nvme1n2, sector 7857920
> [  623.185822] print_req_error: I/O error, dev nvme1n2, sector 7858432
> [  623.185826] print_req_error: I/O error, dev nvme1n2, sector 7858560
> [  623.185830] print_req_error: I/O error, dev nvme1n2, sector 7862144
> [  623.185835] print_req_error: I/O error, dev nvme1n2, sector 7858944
> [  623.185870] md/raid1:mdX: Disk failure on dm-13, disabling device.
>                  md/raid1:mdX: Operation continuing on 2 devices.
> [  623.185873] md: mdX: recovery interrupted.
> [  623.185881] print_req_error: I/O error, dev nvme1n2, sector 7861376
> [  623.197723] print_req_error: I/O error, dev nvme1n2, sector 7859072
> [  623.197727] print_req_error: I/O error, dev nvme1n2, sector 7862528
> 
> *** All PCIe lanes are re-connected
> 
> [  628.842766] pcieport 0000:b0:04.0: pciehp_isr hit
> [  628.842829] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
> [  629.147089] pci 0000:b1:00.0: [8086:0a55] type 00 class 0x010802
> [  629.147130] pci 0000:b1:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
> [  629.147177] pci 0000:b1:00.0: Max Payload Size set to 256 (was 128,
> max 512)
> [  629.147182] pci 0000:b1:00.0: enabling Extended Tags
> [  629.147401] acpi_run_hpx: Type 3 record not supported
> [  629.147888] pcieport 0000:b0:04.0: bridge window [io  0x1000-0x0fff]
> to [bus b1] add_size 1000
> [  629.147896] pcieport 0000:b0:04.0: BAR 13: no space for [io  size 0x1000]
> [  629.147899] pcieport 0000:b0:04.0: BAR 13: failed to assign [io  size
> 0x1000]
> [  629.147903] pcieport 0000:b0:04.0: BAR 13: no space for [io  size 0x1000]
> [  629.147906] pcieport 0000:b0:04.0: BAR 13: failed to assign [io  size
> 0x1000]
> [  629.147913] pci 0000:b1:00.0: BAR 0: assigned [mem
> 0xe1500000-0xe1503fff 64bit]
> [  629.147926] pcieport 0000:b0:04.0: PCI bridge to [bus b1]
> [  629.147933] pcieport 0000:b0:04.0:   bridge window [mem
> 0xe1500000-0xe15fffff]
> [  629.147939] pcieport 0000:b0:04.0:   bridge window [mem
> 0x386000000000-0x3860001fffff 64bit pref]
> [  629.148891] nvme nvme2: pci function 0000:b1:00.0
> [  629.148932] nvme 0000:b1:00.0: enabling device (0000 -> 0002)
> [  629.239386] pcieport 0000:b0:04.0: pciehp_isr hit
> [  629.239420] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present
> 
> *** Lanes not disconnected again until here
> 
> [  635.951171] pcieport 0000:b0:04.0: pciehp_isr hit
> [  635.992718] pcieport 0000:b0:04.0: pciehp_isr hit
> [  659.866526] pcieport 0000:b0:04.0: pciehp_isr hit
> [  659.877189] pcieport 0000:b0:04.0: pciehp_isr hit
> [  665.555242] pcieport 0000:b0:04.0: pciehp_isr hit
> 

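This would also explain the ignored hotplug interrupts in the quoted log
above. pciehp splits the work between a hard handler and an IRQ thread,
roughly like the simplified sketch below (made-up names and helpers, not the
real driver code): the hard handler only latches event bits and wakes the
thread, so if the thread is parked in D state inside ->remove(), "pciehp_isr
hit" keeps printing while nothing ever acts on the latched events.

/*
 * Simplified illustration of a hardirq/threaded-handler split; not pciehp.
 */
#include <linux/interrupt.h>
#include <linux/atomic.h>
#include <linux/types.h>
#include <linux/printk.h>

static atomic_t pending_events;

/* Hypothetical stand-in for reading and clearing the Slot Status register. */
static u32 demo_read_and_clear_slot_events(void)
{
	return 0;	/* real code would read PCI_EXP_SLTSTA here */
}

/* Hard IRQ context: record what happened and kick the thread. */
static irqreturn_t demo_isr(int irq, void *data)
{
	u32 events = demo_read_and_clear_slot_events();

	if (!events)
		return IRQ_NONE;

	pr_info("demo: isr hit\n");	/* analogous to "pciehp_isr hit" */
	atomic_or(events, &pending_events);
	return IRQ_WAKE_THREAD;		/* wakes demo_ist() */
}

/* IRQ thread: the only place the latched events are acted upon. */
static irqreturn_t demo_ist(int irq, void *data)
{
	u32 events = atomic_xchg(&pending_events, 0);

	/*
	 * Presence/link changes are handled here, possibly ending in the
	 * driver's ->remove().  If ->remove() never returns (the nvme
	 * flush above), this thread never runs again: demo_isr() keeps
	 * latching and printing, but the slot never comes back up.
	 */
	if (events)
		pr_info("demo: handling events %#x\n", events);
	return IRQ_HANDLED;
}

/*
 * Wired up with request_threaded_irq(irq, demo_isr, demo_ist,
 *                                    IRQF_SHARED, "demo-hp", ctrl);
 */

Which is why I still point the finger at nvme: whatever that work item is
waiting on needs to give up once the device is gone.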
