linux-pci.vger.kernel.org archive mirror
* RE: Deadlock during PCIe hot remove
@ 2020-03-24 15:21 Haeuptle, Michael
  2020-03-24 15:35 ` Hoyer, David
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Haeuptle, Michael @ 2020-03-24 15:21 UTC (permalink / raw)
  To: linux-pci; +Cc: michaelhaeuptle



From: Haeuptle, Michael 
Sent: Tuesday, March 24, 2020 8:46 AM
To: 'linux-pci@vger.kernel.org' <linux-pci@vger.kernel.org>
Cc: 'michaelhaeuptle@gmail.com' <michaelhaeuptle@gmail.com>
Subject: Deadlock during PCIe hot remove

Dear PCI maintainers,

I'm running into a deadlock scenario between the hotplug (pciehp), PCIe and vfio_pci drivers when removing multiple devices in parallel.
This is happening on CentOS 8 (kernel 4.18) with SPDK (spdk.io). I'm using the latest pciehp code; the rest is stock 4.18.

The sequence that leads to the deadlock is as follows:

pciehp_ist() takes the reset_lock early in its processing. While pciehp_ist() is still running, vfio_pci calls pci_try_reset_function() as part of vfio_pci_release or open. pci_try_reset_function() takes the device lock.

Eventually, pci_try_reset_function() calls pci_reset_hotplug_slot(), which calls pciehp_reset_slot(). pciehp_reset_slot() tries to take the reset_lock but has to wait because it is already held by pciehp_ist().

Meanwhile, pciehp_ist() calls pcie_stop_device(), which calls device_release_driver_internal(). This function also tries to take the device lock, causing the deadlock.
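
To make the inverted lock ordering explicit, here is a minimal user-space analogue (pthread mutexes standing in for the kernel's reset_lock and device lock; this is only a sketch of the ordering, not the actual kernel code):

/*
 * Thread A models pciehp_ist(), thread B models the vfio_pci release path.
 * Both threads block forever, which is the AB/BA deadlock described above.
 * Build with: gcc -pthread deadlock.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t reset_lock  = PTHREAD_MUTEX_INITIALIZER; /* ctrl->reset_lock   */
static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER; /* struct device lock */

static void *hotplug_thread(void *arg)		/* models pciehp_ist() */
{
	pthread_mutex_lock(&reset_lock);	/* taken early in pciehp_ist() */
	sleep(1);				/* slot bring-down in progress */
	pthread_mutex_lock(&device_lock);	/* device_release_driver_internal() */
	printf("hotplug path: never reached\n");
	return NULL;
}

static void *vfio_thread(void *arg)		/* models vfio_pci_release() */
{
	pthread_mutex_lock(&device_lock);	/* pci_try_reset_function() */
	sleep(1);
	pthread_mutex_lock(&reset_lock);	/* pciehp_reset_slot() */
	printf("vfio path: never reached\n");
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, hotplug_thread, NULL);
	pthread_create(&b, NULL, vfio_thread, NULL);
	pthread_join(a, NULL);			/* hangs: classic AB/BA deadlock */
	pthread_join(b, NULL);
	return 0;
}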

Here's the kernel stack trace when the deadlock occurs: 

[root@localhost ~]# cat /proc/8594/task/8598/stack
[<0>] pciehp_reset_slot+0xa5/0x220
[<0>] pci_reset_hotplug_slot.cold.72+0x20/0x36
[<0>] pci_dev_reset_slot_function+0x72/0x9b
[<0>] __pci_reset_function_locked+0x15b/0x190
[<0>] pci_try_reset_function.cold.77+0x9b/0x108
[<0>] vfio_pci_disable+0x261/0x280
[<0>] vfio_pci_release+0xcb/0xf0
[<0>] vfio_device_fops_release+0x1e/0x40
[<0>] __fput+0xa5/0x1d0
[<0>] task_work_run+0x8a/0xb0
[<0>] exit_to_usermode_loop+0xd3/0xe0
[<0>] do_syscall_64+0xe1/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca
[<0>] 0xffffffffffffffff

I was wondering if there's a quick workaround. I think pci_try_reset_function() would need to take the reset_lock before the device lock, but there doesn't seem to be a good way of doing that.

I'm also trying to see if we can delay calling the vfio functions that are initiated by SPDK, but I think this inherent race should be addressed.

I'm also happy to submit a defect report if this mailing list is not the appropriate place.

* Michael

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Deadlock during PCIe hot remove
  2020-03-24 15:21 Deadlock during PCIe hot remove Haeuptle, Michael
@ 2020-03-24 15:35 ` Hoyer, David
  2020-03-24 15:37   ` Haeuptle, Michael
  2020-03-24 16:15 ` Lukas Wunner
  2020-03-29 15:43 ` Lukas Wunner
  2 siblings, 1 reply; 18+ messages in thread
From: Hoyer, David @ 2020-03-24 15:35 UTC (permalink / raw)
  To: Haeuptle, Michael, linux-pci; +Cc: michaelhaeuptle

You mentioned that you are using the latest pciehp code.   Does this code contain the recently introduced ist_running flag?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Deadlock during PCIe hot remove
  2020-03-24 15:35 ` Hoyer, David
@ 2020-03-24 15:37   ` Haeuptle, Michael
  2020-03-24 15:39     ` Hoyer, David
  0 siblings, 1 reply; 18+ messages in thread
From: Haeuptle, Michael @ 2020-03-24 15:37 UTC (permalink / raw)
  To: Hoyer, David, linux-pci; +Cc: michaelhaeuptle

Hi David, yes, it does.

-- Michael

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Deadlock during PCIe hot remove
  2020-03-24 15:37   ` Haeuptle, Michael
@ 2020-03-24 15:39     ` Hoyer, David
  2020-03-24 16:26       ` Haeuptle, Michael
  0 siblings, 1 reply; 18+ messages in thread
From: Hoyer, David @ 2020-03-24 15:39 UTC (permalink / raw)
  To: Haeuptle, Michael, linux-pci; +Cc: michaelhaeuptle

I hit the same problem and have worked with the developers on a patch.   You might give this a try if you are building your own kernel/driver.

https://www.spinics.net/lists/linux-pci/msg92395.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Deadlock during PCIe hot remove
  2020-03-24 15:21 Deadlock during PCIe hot remove Haeuptle, Michael
  2020-03-24 15:35 ` Hoyer, David
@ 2020-03-24 16:15 ` Lukas Wunner
  2020-03-25 10:40   ` Christoph Hellwig
  2020-03-29 15:43 ` Lukas Wunner
  2 siblings, 1 reply; 18+ messages in thread
From: Lukas Wunner @ 2020-03-24 16:15 UTC (permalink / raw)
  To: Haeuptle, Michael, Christoph Hellwig; +Cc: linux-pci, michaelhaeuptle

[cc += Christoph]

On Tue, Mar 24, 2020 at 03:21:52PM +0000, Haeuptle, Michael wrote:
> I'm running into a deadlock scenario between the hotplug, pcie and
> vfio_pci driver when removing multiple devices in parallel.
> This is happening on CentOS8 (4.18) with SPDK (spdk.io). I'm using the
> latest pciehp code, the rest is all 4.18.
> 
> The sequence that leads to the deadlock is as follows:
> 
> The pciehp_ist() takes the reset_lock early in its processing. While
> the pciehp_ist processing is progressing, vfio_pci calls
> pci_try_reset_function() as part of vfio_pci_release or open.
> The pci_try_reset_function() takes the device lock.
> 
> Eventually, pci_try_reset_function() calls pci_reset_hotplug_slot()
> which calls pciehp_reset_slot(). The pciehp_reset_slot() tries to take
> the reset_lock but has to wait since it is already taken by pciehp_ist().
> 
> Eventually pciehp_ist calls pcie_stop_device() which calls
> device_release_driver_internal(). This function also tries to take
> device_lock causing the dead lock.

The pci_dev_trylock() in pci_try_reset_function() looks questionable
to me.  It was added by commit b014e96d1abb ("PCI: Protect
pci_error_handlers->reset_notify() usage with device_lock()")
with the following rationale:

    Every method in struct device_driver or structures derived from it like
    struct pci_driver MUST provide exclusion vs the driver's ->remove()
    method, usually by using device_lock().
    [...]
    Without this, ->reset_notify() may race with ->remove() calls, which
    can be easily triggered in NVMe.

The intersection of drivers defining a ->reset_notify() hook and files
invoking pci_try_reset_function() appears to be empty.  So I don't quite
understand the problem the commit sought to address.  What am I missing?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Deadlock during PCIe hot remove
  2020-03-24 15:39     ` Hoyer, David
@ 2020-03-24 16:26       ` Haeuptle, Michael
  2020-03-24 16:32         ` Hoyer, David
  0 siblings, 1 reply; 18+ messages in thread
From: Haeuptle, Michael @ 2020-03-24 16:26 UTC (permalink / raw)
  To: Hoyer, David, linux-pci; +Cc: michaelhaeuptle

Hi David,

Unfortunately, this patch didn't help. Just to illustrate where the deadlock is happening in case it wasn't clear:

1) pciehp_ist takes ctrl->reset_lock

2) a user space process calls vfio_pci_release (implicitly via munmap())

3) vfio_pci_release calls pci_try_reset_function

4) pci_try_reset_function takes the device_lock

5) pci_try_reset_function eventually calls pciehp_reset_slot which waits on ctrl->reset_lock

6) pciehp eventually calls device_release_driver_internal which waits on the device_lock

It's that user space process calling into pci_try_reset_function that leads to the deadlock issue.

-- Michael

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Deadlock during PCIe hot remove
  2020-03-24 16:26       ` Haeuptle, Michael
@ 2020-03-24 16:32         ` Hoyer, David
  0 siblings, 0 replies; 18+ messages in thread
From: Hoyer, David @ 2020-03-24 16:32 UTC (permalink / raw)
  To: Haeuptle, Michael, linux-pci; +Cc: michaelhaeuptle

Ok - sounds like it is not the same issue then.  I just was trying to feed you info in case you hit the same problem.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Deadlock during PCIe hot remove
  2020-03-24 16:15 ` Lukas Wunner
@ 2020-03-25 10:40   ` Christoph Hellwig
  2020-03-26 19:30     ` Haeuptle, Michael
  2020-03-29 13:04     ` Lukas Wunner
  0 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-03-25 10:40 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Haeuptle, Michael, Christoph Hellwig, linux-pci, michaelhaeuptle

On Tue, Mar 24, 2020 at 05:15:34PM +0100, Lukas Wunner wrote:
> The pci_dev_trylock() in pci_try_reset_function() looks questionable
> to me.  It was added by commit b014e96d1abb ("PCI: Protect
> pci_error_handlers->reset_notify() usage with device_lock()")
> with the following rationale:
> 
>     Every method in struct device_driver or structures derived from it like
>     struct pci_driver MUST provide exclusion vs the driver's ->remove()
>     method, usually by using device_lock().
>     [...]
>     Without this, ->reset_notify() may race with ->remove() calls, which
>     can be easily triggered in NVMe.
> 
> The intersection of drivers defining a ->reset_notify() hook and files
> invoking pci_try_reset_function() appears to be empty.  So I don't quite
> understand the problem the commit sought to address.  What am I missing?

No driver defines ->reset_notify as that has been split into
->reset_prepare and ->reset_done a while ago, and plenty of drivers
define those.  And we can't call into drivers unless we know the driver
actually still is bound to the device, which is why we need the locking.
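
For reference, these are the hooks in question in struct pci_error_handlers; a minimal sketch of how a hypothetical driver ("foo", not a real driver) wires them up:

#include <linux/pci.h>

/* Quiesce the device before the reset, bring it back up afterwards. */
static void foo_reset_prepare(struct pci_dev *pdev)
{
	/* e.g. stop I/O queues, mask interrupts */
}

static void foo_reset_done(struct pci_dev *pdev)
{
	/* e.g. restore device state, restart I/O queues */
}

static const struct pci_error_handlers foo_err_handler = {
	.reset_prepare	= foo_reset_prepare,
	.reset_done	= foo_reset_done,
};

static struct pci_driver foo_driver = {
	.name		= "foo",
	/* .id_table, .probe, .remove, ... */
	.err_handler	= &foo_err_handler,
};

pci_dev_save_and_disable() and pci_dev_restore() invoke these hooks around the reset, which is why the reset paths need to know the driver is still bound.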


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Deadlock during PCIe hot remove
  2020-03-25 10:40   ` Christoph Hellwig
@ 2020-03-26 19:30     ` Haeuptle, Michael
  2020-03-29 13:04     ` Lukas Wunner
  1 sibling, 0 replies; 18+ messages in thread
From: Haeuptle, Michael @ 2020-03-26 19:30 UTC (permalink / raw)
  To: Christoph Hellwig, Lukas Wunner; +Cc: linux-pci, michaelhaeuptle

Folks, it seems that I can avoid the deadlock by replacing the down_write in pciehp_reset_slot with a down_write_trylock.

What would be the downside of doing that? As a reminder, vfio ultimately calls this function when it disables a vfio device.

int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe)
{
	struct controller *ctrl = to_ctrl(hotplug_slot);
	struct pci_dev *pdev = ctrl_dev(ctrl);
	u16 stat_mask = 0, ctrl_mask = 0;
	int rc;

	if (probe)
		return 0;

	// original:
	// down_write(&ctrl->reset_lock);
	// change:
	if (!down_write_trylock(&ctrl->reset_lock)) {
		return -EAGAIN; // ????? or just 0?
	}
    
	if (!ATTN_BUTTN(ctrl)) {
		ctrl_mask |= PCI_EXP_SLTCTL_PDCE;
		stat_mask |= PCI_EXP_SLTSTA_PDC;
	}
	ctrl_mask |= PCI_EXP_SLTCTL_DLLSCE;
	stat_mask |= PCI_EXP_SLTSTA_DLLSC;

	pcie_write_cmd(ctrl, 0, ctrl_mask);
	ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
		 pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, 0);

	rc = pci_bridge_secondary_bus_reset(ctrl->pcie->port);

	pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, stat_mask);
	pcie_write_cmd_nowait(ctrl, ctrl_mask, ctrl_mask);
	ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
		 pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask);

	up_write(&ctrl->reset_lock);

	return rc;
}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Deadlock during PCIe hot remove
  2020-03-25 10:40   ` Christoph Hellwig
  2020-03-26 19:30     ` Haeuptle, Michael
@ 2020-03-29 13:04     ` Lukas Wunner
  2020-03-31  8:14       ` Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Lukas Wunner @ 2020-03-29 13:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Haeuptle, Michael, linux-pci, michaelhaeuptle

On Wed, Mar 25, 2020 at 11:40:18AM +0100, Christoph Hellwig wrote:
> On Tue, Mar 24, 2020 at 05:15:34PM +0100, Lukas Wunner wrote:
> > The pci_dev_trylock() in pci_try_reset_function() looks questionable
> > to me.  It was added by commit b014e96d1abb ("PCI: Protect
> > pci_error_handlers->reset_notify() usage with device_lock()")
> > with the following rationale:
> > 
> >     Every method in struct device_driver or structures derived from it like
> >     struct pci_driver MUST provide exclusion vs the driver's ->remove()
> >     method, usually by using device_lock().
> >     [...]
> >     Without this, ->reset_notify() may race with ->remove() calls, which
> >     can be easily triggered in NVMe.
> > 
> > The intersection of drivers defining a ->reset_notify() hook and files
> > invoking pci_try_reset_function() appears to be empty.  So I don't quite
> > understand the problem the commit sought to address.  What am I missing?
> 
> No driver defines ->reset_notify as that has been split into
> ->reset_prepare and ->reset_done a while ago, and plenty of drivers
> define those.  And we can't call into drivers unless we know the driver
> actually still is bound to the device, which is why we need the locking.

Sure, you need to hold the driver in place while you're invoking one of
its callbacks.  But is it really necessary to hold the device lock while
performing the actual reset?  That locking seems awfully coarse-grained.

Do you see any potential problem in pushing down the pci_dev_lock() and
pci_dev_unlock() calls into pci_dev_save_and_disable() and
pci_dev_restore()?  I.e, acquire the lock for the invocation of
->reset_prepare() and ->reset_done() and release it immediately
afterwards?
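
Something along these lines perhaps (an untested sketch at the pci_try_reset_function() level, details elided, just to illustrate the narrower locking):

int pci_try_reset_function(struct pci_dev *dev)
{
	int rc;

	if (!dev->reset_fn)
		return -ENOTTY;

	/* device lock taken/dropped inside, around ->reset_prepare() */
	pci_dev_save_and_disable(dev);

	/* the actual reset runs without the device lock held */
	rc = __pci_reset_function_locked(dev);

	/* device lock taken/dropped inside, around ->reset_done() */
	pci_dev_restore(dev);

	return rc;
}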

That would seem to fix the deadlock Michael reported.

Of course that could result in ->reset_prepare() being invoked but
->reset_done() being not invoked if the driver is no longer bound.
Or in ->reset_done() being called for a different driver if the
device was rebound in the meantime.  Would this cause issues?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Deadlock during PCIe hot remove
  2020-03-24 15:21 Deadlock during PCIe hot remove Haeuptle, Michael
  2020-03-24 15:35 ` Hoyer, David
  2020-03-24 16:15 ` Lukas Wunner
@ 2020-03-29 15:43 ` Lukas Wunner
  2020-03-30 16:15   ` Haeuptle, Michael
  2 siblings, 1 reply; 18+ messages in thread
From: Lukas Wunner @ 2020-03-29 15:43 UTC (permalink / raw)
  To: Haeuptle, Michael; +Cc: linux-pci, michaelhaeuptle, Christoph Hellwig

On Tue, Mar 24, 2020 at 03:21:52PM +0000, Haeuptle, Michael wrote:
> I'm running into a deadlock scenario between the hotplug, pcie and
> vfio_pci driver when removing multiple devices in parallel.
> This is happening on CentOS8 (4.18) with SPDK (spdk.io). I'm using
> the latest pciehp code, the rest is all 4.18.
> 
> The sequence that leads to the deadlock is as follows:
> 
> The pciehp_ist() takes the reset_lock early in its processing.
> While the pciehp_ist processing is progressing, vfio_pci calls
> pci_try_reset_function() as part of vfio_pci_release or open.
> The pci_try_reset_function() takes the device lock.
> 
> Eventually, pci_try_reset_function() calls pci_reset_hotplug_slot()
> which calls pciehp_reset_slot(). The pciehp_reset_slot() tries to
> take the reset_lock but has to wait since it is already taken by
> pciehp_ist().
> 
> Eventually pciehp_ist calls pcie_stop_device() which calls
> device_release_driver_internal(). This function also tries to take
> device_lock causing the dead lock.
> 
> Here's the kernel stack trace when the deadlock occurs: 
> 
> [root@localhost ~]# cat /proc/8594/task/8598/stack
> [<0>] pciehp_reset_slot+0xa5/0x220
> [<0>] pci_reset_hotplug_slot.cold.72+0x20/0x36
> [<0>] pci_dev_reset_slot_function+0x72/0x9b
> [<0>] __pci_reset_function_locked+0x15b/0x190
> [<0>] pci_try_reset_function.cold.77+0x9b/0x108
> [<0>] vfio_pci_disable+0x261/0x280
> [<0>] vfio_pci_release+0xcb/0xf0
> [<0>] vfio_device_fops_release+0x1e/0x40
> [<0>] __fput+0xa5/0x1d0
> [<0>] task_work_run+0x8a/0xb0
> [<0>] exit_to_usermode_loop+0xd3/0xe0
> [<0>] do_syscall_64+0xe1/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca
> [<0>] 0xffffffffffffffff

There's something I don't understand here:

The device_lock exists per-device.
The reset_lock exists per hotplug-capable downstream port.

Now if I understand correctly, in the stacktrace above, the
device_lock of the *downstream port* is acquired and then an
attempt is made to acquire its reset_lock, which however is
already held by pciehp_ist().

You're saying that pciehp_ist() is handling removal of the
endpoint device *below* the downstream port, which means that
device_release_driver_internal() tries to acquire the device_lock
of the *endpoint device*.  That's a separate lock from the one
acquired by vfio_pci_disable() before calling
pci_try_reset_function()!

So I don't quite understand how there can be a deadlock.  Could
you instrument the code with a few printk()'s and dump_stack()'s
to show exactly which device's device_lock is acquired from where?

Note that device_release_driver_internal() also acquires the
parent's device_lock and this would indeed be the one of the
downstream port.  However commit 8c97a46af04b ("driver core:
hold dev's parent lock when needed") constrained that to
USB devices.  So the parent lock shouldn't be taken for PCI
devices.  That commit went into v4.18, please double-check
that you have it in your tree.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Deadlock during PCIe hot remove
  2020-03-29 15:43 ` Lukas Wunner
@ 2020-03-30 16:15   ` Haeuptle, Michael
  2020-03-31 13:01     ` Lukas Wunner
  0 siblings, 1 reply; 18+ messages in thread
From: Haeuptle, Michael @ 2020-03-30 16:15 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: linux-pci, michaelhaeuptle, Christoph Hellwig

Hi Lukas,

Thanks for your help on this issue. 

Here's the requested trace for the happy path (single device remove) and the instrumented code pieces for reference in case I'm doing something wrong with the instrumentation.

There are two places where vfio tries to take the device lock: once in pci_try_reset_function and then later in pci_reset_bus.
The number enclosed in {} is the thread id. The dev=<address> is the pointer to the struct device.

As mentioned, this is the happy path with a single device removal. When multiple devices are removed, execution piles up on the global remove/rescan lock and vfio most of the time gets the device lock first, resulting in a deadlock.

Let me know if you need more traces/details.

-- Michael

[  667.067997] pcieport 0000:b4:0d.0: Slot(14): Card not present
// pciehp_ist takes the device lock
[  667.079485] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] before device_lock
 [  667.137592] Call Trace:
[  667.142448]  dump_stack+0x5c/0x80
[  667.149029]  device_release_driver_internal.part.28+0x47/0x290
[  667.160625]  ? irq_thread_check_affinity+0xa0/0xa0
[  667.170145]  pci_stop_bus_device+0x69/0x90
[  667.178281]  pci_stop_and_remove_bus_device+0xe/0x20
[  667.188145]  pciehp_unconfigure_device+0x7c/0x130
[  667.197492]  __pciehp_disable_slot+0x48/0xd0
[  667.205974]  pciehp_handle_presence_or_link_change+0xdc/0x440
[  667.217395]  ? __synchronize_hardirq+0x43/0x50
[  667.226228]  pciehp_ist+0x1c9/0x1d0
[  667.233159]  ? irq_finalize_oneshot.part.45+0xf0/0xf0
[  667.243199]  irq_thread_fn+0x1f/0x60
[  667.250300]  ? irq_finalize_oneshot.part.45+0xf0/0xf0
[  667.257356] { 8903} vfio_bar_restore: 0000:c2:00.0 reset recovery - restoring bars
[  667.260336]  irq_thread+0x142/0x190
[  667.260338]  ? irq_forced_thread_fn+0x70/0x70
[  667.260343]  kthread+0x112/0x130

// vfio tries to take the lock as well
[  667.275760] vfio-pci 0000:c2:00.0: { 8903} @@@ pci_try_reset_function: dev=00000000bb2b1fe3 [0000:c2:00.0] before device_lock

[  667.282314]  ? kthread_flush_work_fn+0x10/0x10
[  667.282318]  ret_from_fork+0x1f/0x40

// but pciehp_ist got it first
[  667.282337] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] after device_lock

// This is the stack trace from the pci_try_reset_function call above
[  667.290979] Call Trace:
[  667.290983]  dump_stack+0x5c/0x80
[  667.290988]  pci_try_reset_function+0x52/0xf0
[  667.290992]  vfio_pci_disable+0x3cd/0x480
[  667.290994]  vfio_pci_release+0x42/0x50
[  667.290995]  vfio_device_fops_release+0x1e/0x40
[  667.290998]  __fput+0xa5/0x1d0
[  667.291001]  task_work_run+0x8a/0xb0
[  667.291004]  exit_to_usermode_loop+0xd3/0xe0
[  667.291005]  do_syscall_64+0xe1/0x100
[  667.291007]  entry_SYSCALL_64_after_hwframe+0x65/0xca

// vfio did not get the lock since pciehp_ist/device_release_driver_internal got it first
[  667.291017] vfio-pci 0000:c2:00.0: { 8903} @@@ pci_try_reset_function: dev=00000000bb2b1fe3 [0000:c2:00.0] DID NOT GET device_lock

// vfio does a pci_reset_bus which calls pci_slot_trylock that tries to take the device lock
[  667.351615] { 8903} !!! pci_slot_trylock: dev=00000000bb2b1fe3 [0000:c2:00.0] before device_lock, slot=000000000dbce595 [14]

[  667.355872] vfio-pci 0000:c2:00.0: { 1917} No device request channel registered, blocked until released by user

// Stack trace for right before pci_slot_trylock.
[  667.377124] Call Trace:
[  667.377126]  dump_stack+0x5c/0x80
[  667.723398]  pci_reset_bus+0x128/0x160
[  667.730843]  vfio_pci_disable+0x37b/0x480
[  667.738808]  ? vfio_pci_rw+0x80/0x80
[  667.745907]  vfio_pci_release+0x42/0x50
[  667.753525]  vfio_device_fops_release+0x1e/0x40
[  667.762524]  __fput+0xa5/0x1d0
[  667.768587]  task_work_run+0x8a/0xb0
[  667.775686]  exit_to_usermode_loop+0xd3/0xe0
[  667.784166]  do_syscall_64+0xe1/0x100
[  667.791438]  entry_SYSCALL_64_after_hwframe+0x65/0xca

// vfio: pci_slot_trylock did not get the lock
[  667.931879] { 8903} !!! pci_slot_trylock: dev=00000000bb2b1fe3 [0000:c2:00.0] DID NOT GET device_lock
[  667.962175] iommu: Removing device 0000:c2:00.0 from group 12
[  667.985035] vfio-pci 0000:c2:00.0: Refused to change power state, currently in D3

// pciehp_ist is done and releases the lock
[  667.999956] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] after device_unlock

// not sure why pciehp_ist is executed again...
[  668.020316] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] before device_lock
[  668.040381] CPU: 11 PID: 1917 Comm: irq/44-pciehp Kdump: loaded Tainted: G     U     O     --------- -  - 4.18.0 #104
[  668.061476] Hardware name: Hewlett Packard Enterprise FALCON/FALCON, BIOS 66.14.01 2020_01_10
[  668.078422] Call Trace:
[  668.083276]  dump_stack+0x5c/0x80
[  668.089857]  device_release_driver_internal.part.28+0x47/0x290
[  668.101450]  ? irq_thread_check_affinity+0xa0/0xa0
[  668.110969]  bus_remove_device+0xf7/0x170
[  668.118934]  device_del+0x13b/0x340
[  668.125861]  pci_remove_bus_device+0x77/0x100
[  668.134517]  pciehp_unconfigure_device+0x7c/0x130
[  668.143866]  __pciehp_disable_slot+0x48/0xd0
[  668.152348]  pciehp_handle_presence_or_link_change+0xdc/0x440
[  668.163766]  ? __synchronize_hardirq+0x43/0x50
[  668.172593]  pciehp_ist+0x1c9/0x1d0
[  668.179522]  ? irq_finalize_oneshot.part.45+0xf0/0xf0
[  668.189558]  irq_thread_fn+0x1f/0x60
[  668.196656]  ? irq_finalize_oneshot.part.45+0xf0/0xf0
[  668.206692]  irq_thread+0x142/0x190
[  668.213619]  ? irq_forced_thread_fn+0x70/0x70
[  668.222275]  kthread+0x112/0x130
[  668.228684]  ? kthread_flush_work_fn+0x10/0x10
[  668.237513]  ret_from_fork+0x1f/0x40
[  668.244627] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] after device_lock
[  668.264642] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] after device_unlock


Instrumented code:

int pci_try_reset_function(struct pci_dev *dev)
{
	int rc;

	if (!dev->reset_fn) {
		return -ENOTTY;
	}

	pci_info(dev, "{%5d} @@@ pci_try_reset_function: dev=%p [%s] before device_lock\n", task_pid_nr(current),&dev->dev, dev->dev.kobj.name);
	dump_stack();
	if (!pci_dev_trylock(dev)) {
	    pci_info(dev, "{%5d} @@@ pci_try_reset_function: dev=%p [%s] DID NOT GET device_lock\n", task_pid_nr(current),&dev->dev, dev->dev.kobj.name);
		return -EAGAIN;
    }
	pci_info(dev, "{%5d} @@@ pci_try_reset_function: dev=%p [%s] after device_lock\n", task_pid_nr(current),&dev->dev, dev->dev.kobj.name);

	pci_dev_save_and_disable(dev);
	rc = __pci_reset_function_locked(dev);
	pci_dev_restore(dev);	

	pci_dev_unlock(dev);
	pci_info(dev, "{%5d} @@@ pci_try_reset_function: dev=%p [%s] after device_unlock\n",task_pid_nr(current), &dev->dev, dev->dev.kobj.name);

	return rc;
}

void device_release_driver_internal(struct device *dev,
				    struct device_driver *drv,
				    struct device *parent)
{
	 
	if (parent && dev->bus->need_parent_lock) {
		device_lock(parent);
	}

	printk(KERN_ERR "{%5d} !!! device_release_driver_internal: dev=%p: [%s] before device_lock\n",task_pid_nr(current), dev, dev->kobj.name);
	dump_stack();
	device_lock(dev); 
	printk(KERN_ERR "{%5d} !!! device_release_driver_internal: dev=%p: [%s] after device_lock\n",task_pid_nr(current), dev, dev->kobj.name);
	if (!drv || drv == dev->driver) {
		__device_release_driver(dev, parent);
	}

	device_unlock(dev);
	printk(KERN_ERR "{%5d} !!! device_release_driver_internal: dev=%p: [%s] after device_unlock\n",task_pid_nr(current), dev, dev->kobj.name);
	if (parent && dev->bus->need_parent_lock)
		device_unlock(parent);
}

static int pci_slot_trylock(struct pci_slot *slot)
{
	struct pci_dev *dev;

	list_for_each_entry(dev, &slot->bus->devices, bus_list) {
		if (!dev->slot || dev->slot != slot)
			continue;
		printk(KERN_INFO "{%5d} !!! pci_slot_trylock: dev=%p [%s] before device_lock, slot=%p [%s]\n",task_pid_nr(current), &dev->dev, dev->dev.kobj.name, slot, slot->kobj.name);
		dump_stack();
		if (!pci_dev_trylock(dev)) {
			printk(KERN_INFO "{%5d} !!! pci_slot_trylock: dev=%p [%s] DID NOT GET device_lock\n",task_pid_nr(current), &dev->dev, dev->dev.kobj.name);
			goto unlock;
		}
		printk(KERN_INFO "{%5d} !!! pci_slot_trylock: dev=%p [%s] after device_lock\n", task_pid_nr(current),&dev->dev, dev->dev.kobj.name);
		if (dev->subordinate) {
			if (!pci_bus_trylock(dev->subordinate)) {
				pci_dev_unlock(dev);
				goto unlock;
			}
		}
	}
	printk(KERN_INFO "{%5d} !!! pci_slot_trylock: dev=%p [%s] success\n", task_pid_nr(current), &dev->dev, dev->dev.kobj.name);
	return 1;

unlock:
	list_for_each_entry_continue_reverse(dev,
					     &slot->bus->devices, bus_list) {
		if (!dev->slot || dev->slot != slot)
			continue;
		if (dev->subordinate)
			pci_bus_unlock(dev->subordinate);
		pci_dev_unlock(dev);
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Deadlock during PCIe hot remove
  2020-03-29 13:04     ` Lukas Wunner
@ 2020-03-31  8:14       ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-03-31  8:14 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Christoph Hellwig, Haeuptle, Michael, linux-pci, michaelhaeuptle

On Sun, Mar 29, 2020 at 03:04:20PM +0200, Lukas Wunner wrote:
> Sure, you need to hold the driver in place while you're invoking one of
> its callbacks.  But is it really necessary to hold the device lock while
> performing the actual reset?  That locking seems awfully coarse-grained.
> 
> Do you see any potential problem in pushing down the pci_dev_lock() and
> pci_dev_unlock() calls into pci_dev_save_and_disable() and
> pci_dev_restore()?  I.e, acquire the lock for the invocation of
> ->reset_prepare() and ->reset_done() and release it immediately
> afterwards?
> 
> That would seem to fix the deadlock Michael reported.
> 
> Of course that could result in ->reset_prepare() being invoked but
> ->reset_done() being not invoked if the driver is no longer bound.
> Or in ->reset_done() being called for a different driver if the
> device was rebound in the meantime.  Would this cause issues?

And at least the driver I'm familiar with (nvme) will be broken
by that, as its state machine expects them to pair.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Deadlock during PCIe hot remove
  2020-03-30 16:15   ` Haeuptle, Michael
@ 2020-03-31 13:01     ` Lukas Wunner
  2020-03-31 15:02       ` Haeuptle, Michael
  0 siblings, 1 reply; 18+ messages in thread
From: Lukas Wunner @ 2020-03-31 13:01 UTC (permalink / raw)
  To: Haeuptle, Michael; +Cc: linux-pci, michaelhaeuptle, Christoph Hellwig

On Mon, Mar 30, 2020 at 04:15:01PM +0000, Haeuptle, Michael wrote:
> There are 2 places where vfio tries to take the device lock. Once in
> pcie_try_reset_function and then later in pci_reset_bus.
> 
> As mentioned, this is the happy path with one device removal. When
> multiple devices are removed then execution piles up on the global
> remove_rescan lock and vfio most of the time gets the device lock
> first resulting in a dead lock.

So I'm not really familiar with vfio but my limited understanding is
that you've got NVMe drives attached to a hotplug slot of a host and
you're passing them through to VMs.  And when you hot-remove them
from the host, pciehp unbinds them from their driver and brings down
the slot, while simultaneously vfio tries to reset the hot-removed
device in order to put it in a consistent state before handing it
back to the host.

Resetting a hot-removed device is kind of pointless of course,
but it may be difficult to make vfio aware of the device's absence
because it's not predictable when pciehp will be finished with
bringing down the slot.  vfio would have to wait as long to know
that the device is gone and the reset can be skipped.

As for the deadlock, the reset_lock in pciehp's controller struct
seeks to prevent a reset while the slot is brought up or down.
The problem is that pciehp takes the reset lock first, then the
device lock, whereas the reset functions in drivers/pci/pci.c
essentially do it the other way round, provoking the AB/BA
deadlock.

The obvious solution is to push the reset_lock out of pciehp and
into the pci_slot struct such that it can be taken by the PCI core
before taking the device lock.  Below is a patch which does exactly
that; could you test whether it fixes the issue for you?  It is
compile-tested only.  It is meant to be applied to Bjorn's pci/next
branch.  Since you're using v4.18 plus a bunch of backported patches,
I'm not sure whether it will apply cleanly to your tree.

Unfortunately it is not sufficient to add the locking to
pci_slot_lock() et al because of pci_dev_reset_slot_function()
which is called from __pci_reset_function_locked(), which in turn
is called by vfio and xen, all of which require additional locking.

There's an invocation of __pci_reset_function_locked() in
drivers/xen/xen-pciback/pci_stub.c:pcistub_device_release() which
cannot be augmented with the required reset_lock locking because it
is apparently called on unbind, with the device lock already held.
I don't know how to fix this as I'm not familiar with xen.

And there's another mess:  When the PCI core invokes a hotplug_slot's
->reset_slot() hook, it currently doesn't take any precautions to
prevent the hotplug_slot's driver from unbinding.  We dereference
pci_dev->slot but that will become NULL when the hotplug driver
unbinds.  This can easily happen if the hotplug_slot's driver is
unbound via sysfs or if multiple cascaded hotplug slots are removed
at the same time (as is the case with Thunderbolt).  We've never done
this correctly.

Thanks,

Lukas

-- >8 --
diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index ae44f46d1bf3..978b8fadfab7 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -20,7 +20,6 @@
 #include <linux/pci_hotplug.h>
 #include <linux/delay.h>
 #include <linux/mutex.h>
-#include <linux/rwsem.h>
 #include <linux/workqueue.h>
 
 #include "../pcie/portdrv.h"
@@ -69,9 +68,6 @@ extern int pciehp_poll_time;
  * @button_work: work item to turn the slot on or off after 5 seconds
  *	in response to an Attention Button press
  * @hotplug_slot: structure registered with the PCI hotplug core
- * @reset_lock: prevents access to the Data Link Layer Link Active bit in the
- *	Link Status register and to the Presence Detect State bit in the Slot
- *	Status register during a slot reset which may cause them to flap
  * @ist_running: flag to keep user request waiting while IRQ thread is running
  * @request_result: result of last user request submitted to the IRQ thread
  * @requester: wait queue to wake up on completion of user request,
@@ -102,7 +98,6 @@ struct controller {
 	struct delayed_work button_work;
 
 	struct hotplug_slot hotplug_slot;	/* hotplug core interface */
-	struct rw_semaphore reset_lock;
 	unsigned int ist_running;
 	int request_result;
 	wait_queue_head_t requester;
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index 312cc45c44c7..7d2e372a3ac0 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -165,7 +165,7 @@ static void pciehp_check_presence(struct controller *ctrl)
 {
 	int occupied;
 
-	down_read(&ctrl->reset_lock);
+	down_read(&ctrl->hotplug_slot.pci_slot->reset_lock);
 	mutex_lock(&ctrl->state_lock);
 
 	occupied = pciehp_card_present_or_link_active(ctrl);
@@ -176,7 +176,7 @@ static void pciehp_check_presence(struct controller *ctrl)
 		pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
 
 	mutex_unlock(&ctrl->state_lock);
-	up_read(&ctrl->reset_lock);
+	up_read(&ctrl->hotplug_slot.pci_slot->reset_lock);
 }
 
 static int pciehp_probe(struct pcie_device *dev)
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 53433b37e181..a1c9072c3e80 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -706,13 +706,17 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
 	/*
 	 * Disable requests have higher priority than Presence Detect Changed
 	 * or Data Link Layer State Changed events.
+	 *
+	 * A slot reset may cause flaps of the Presence Detect State bit in the
+	 * Slot Status register and the Data Link Layer Link Active bit in the
+	 * Link Status register.  Prevent by holding the reset lock.
 	 */
-	down_read(&ctrl->reset_lock);
+	down_read(&ctrl->hotplug_slot.pci_slot->reset_lock);
 	if (events & DISABLE_SLOT)
 		pciehp_handle_disable_request(ctrl);
 	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
 		pciehp_handle_presence_or_link_change(ctrl, events);
-	up_read(&ctrl->reset_lock);
+	up_read(&ctrl->hotplug_slot.pci_slot->reset_lock);
 
 	ret = IRQ_HANDLED;
 out:
@@ -841,8 +845,6 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe)
 	if (probe)
 		return 0;
 
-	down_write(&ctrl->reset_lock);
-
 	if (!ATTN_BUTTN(ctrl)) {
 		ctrl_mask |= PCI_EXP_SLTCTL_PDCE;
 		stat_mask |= PCI_EXP_SLTSTA_PDC;
@@ -861,7 +863,6 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe)
 	ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
 		 pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask);
 
-	up_write(&ctrl->reset_lock);
 	return rc;
 }
 
@@ -925,7 +926,6 @@ struct controller *pcie_init(struct pcie_device *dev)
 	ctrl->slot_cap = slot_cap;
 	mutex_init(&ctrl->ctrl_lock);
 	mutex_init(&ctrl->state_lock);
-	init_rwsem(&ctrl->reset_lock);
 	init_waitqueue_head(&ctrl->requester);
 	init_waitqueue_head(&ctrl->queue);
 	INIT_DELAYED_WORK(&ctrl->button_work, pciehp_queue_pushbutton_work);
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 4aa46c7b0148..321980293c5e 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5063,6 +5063,8 @@ int pci_reset_function(struct pci_dev *dev)
 	if (!dev->reset_fn)
 		return -ENOTTY;
 
+	if (dev->slot)
+		down_write(&dev->slot->reset_lock);
 	pci_dev_lock(dev);
 	pci_dev_save_and_disable(dev);
 
@@ -5070,6 +5072,8 @@ int pci_reset_function(struct pci_dev *dev)
 
 	pci_dev_restore(dev);
 	pci_dev_unlock(dev);
+	if (dev->slot)
+		up_write(&dev->slot->reset_lock);
 
 	return rc;
 }
@@ -5122,6 +5126,10 @@ int pci_try_reset_function(struct pci_dev *dev)
 	if (!dev->reset_fn)
 		return -ENOTTY;
 
+	if (dev->slot)
+		if (!down_write_trylock(&dev->slot->reset_lock))
+			return -EAGAIN;
+
 	if (!pci_dev_trylock(dev))
 		return -EAGAIN;
 
@@ -5129,6 +5137,8 @@ int pci_try_reset_function(struct pci_dev *dev)
 	rc = __pci_reset_function_locked(dev);
 	pci_dev_restore(dev);
 	pci_dev_unlock(dev);
+	if (dev->slot)
+		up_write(&dev->slot->reset_lock);
 
 	return rc;
 }
@@ -5227,6 +5237,7 @@ static void pci_slot_lock(struct pci_slot *slot)
 {
 	struct pci_dev *dev;
 
+	down_write(&slot->reset_lock);
 	list_for_each_entry(dev, &slot->bus->devices, bus_list) {
 		if (!dev->slot || dev->slot != slot)
 			continue;
@@ -5248,6 +5259,7 @@ static void pci_slot_unlock(struct pci_slot *slot)
 			pci_bus_unlock(dev->subordinate);
 		pci_dev_unlock(dev);
 	}
+	up_write(&slot->reset_lock);
 }
 
 /* Return 1 on successful lock, 0 on contention */
@@ -5255,6 +5267,9 @@ static int pci_slot_trylock(struct pci_slot *slot)
 {
 	struct pci_dev *dev;
 
+	if (!down_write_trylock(&slot->reset_lock))
+		return 0;
+
 	list_for_each_entry(dev, &slot->bus->devices, bus_list) {
 		if (!dev->slot || dev->slot != slot)
 			continue;
@@ -5278,6 +5293,7 @@ static int pci_slot_trylock(struct pci_slot *slot)
 			pci_bus_unlock(dev->subordinate);
 		pci_dev_unlock(dev);
 	}
+	up_write(&slot->reset_lock);
 	return 0;
 }
 
diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
index cc386ef2fa12..e8e7d0975889 100644
--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -279,6 +279,8 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
 	INIT_LIST_HEAD(&slot->list);
 	list_add(&slot->list, &parent->slots);
 
+	init_rwsem(&slot->reset_lock);
+
 	down_read(&pci_bus_sem);
 	list_for_each_entry(dev, &parent->devices, bus_list)
 		if (PCI_SLOT(dev->devfn) == slot_nr)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 379a02c36e37..2a9cb7634e0e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -447,13 +447,20 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 	 * We can not use the "try" reset interface here, which will
 	 * overwrite the previously restored configuration information.
 	 */
-	if (vdev->reset_works && pci_cfg_access_trylock(pdev)) {
-		if (device_trylock(&pdev->dev)) {
-			if (!__pci_reset_function_locked(pdev))
-				vdev->needs_reset = false;
-			device_unlock(&pdev->dev);
+	if (vdev->reset_works) {
+		if (!pdev->slot ||
+		    down_write_trylock(&pdev->slot->reset_lock)) {
+			if (pci_cfg_access_trylock(pdev)) {
+				if (device_trylock(&pdev->dev)) {
+					if (!__pci_reset_function_locked(pdev))
+						vdev->needs_reset = false;
+					device_unlock(&pdev->dev);
+				}
+				pci_cfg_access_unlock(pdev);
+			}
+			if (pdev->slot)
+				up_write(&pdev->slot->reset_lock);
 		}
-		pci_cfg_access_unlock(pdev);
 	}
 
 	pci_restore_state(pdev);
diff --git a/drivers/xen/xen-pciback/passthrough.c b/drivers/xen/xen-pciback/passthrough.c
index 66e9b814cc86..98a9ec8accce 100644
--- a/drivers/xen/xen-pciback/passthrough.c
+++ b/drivers/xen/xen-pciback/passthrough.c
@@ -89,11 +89,17 @@ static void __xen_pcibk_release_pci_dev(struct xen_pcibk_device *pdev,
 	mutex_unlock(&dev_data->lock);
 
 	if (found_dev) {
-		if (lock)
+		if (lock) {
+			if (found_dev->slot)
+				down_write(&found_dev->slot->reset_lock);
 			device_lock(&found_dev->dev);
+		}
 		pcistub_put_pci_dev(found_dev);
-		if (lock)
+		if (lock) {
 			device_unlock(&found_dev->dev);
+			if (found_dev->slot)
+				up_write(&found_dev->slot->reset_lock);
+		}
 	}
 }
 
@@ -164,9 +170,13 @@ static void __xen_pcibk_release_devices(struct xen_pcibk_device *pdev)
 	list_for_each_entry_safe(dev_entry, t, &dev_data->dev_list, list) {
 		struct pci_dev *dev = dev_entry->dev;
 		list_del(&dev_entry->list);
+		if (dev->slot)
+			down_write(&dev->slot->reset_lock);
 		device_lock(&dev->dev);
 		pcistub_put_pci_dev(dev);
 		device_unlock(&dev->dev);
+		if (dev->slot)
+			up_write(&dev->slot->reset_lock);
 		kfree(dev_entry);
 	}
 
diff --git a/drivers/xen/xen-pciback/vpci.c b/drivers/xen/xen-pciback/vpci.c
index f6ba18191c0f..e11ed4764371 100644
--- a/drivers/xen/xen-pciback/vpci.c
+++ b/drivers/xen/xen-pciback/vpci.c
@@ -171,11 +171,17 @@ static void __xen_pcibk_release_pci_dev(struct xen_pcibk_device *pdev,
 	mutex_unlock(&vpci_dev->lock);
 
 	if (found_dev) {
-		if (lock)
+		if (lock) {
+			if (found_dev->slot)
+				down_write(&found_dev->slot->reset_lock);
 			device_lock(&found_dev->dev);
+		}
 		pcistub_put_pci_dev(found_dev);
-		if (lock)
+		if (lock) {
 			device_unlock(&found_dev->dev);
+			if (found_dev->slot)
+				up_write(&found_dev->slot->reset_lock);
+		}
 	}
 }
 
@@ -216,9 +222,13 @@ static void __xen_pcibk_release_devices(struct xen_pcibk_device *pdev)
 					 list) {
 			struct pci_dev *dev = e->dev;
 			list_del(&e->list);
+			if (dev->slot)
+				down_write(&dev->slot->reset_lock);
 			device_lock(&dev->dev);
 			pcistub_put_pci_dev(dev);
 			device_unlock(&dev->dev);
+			if (dev->slot)
+				up_write(&dev->slot->reset_lock);
 			kfree(e);
 		}
 	}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 71c92b88bbc6..e8d31e6d495a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -38,6 +38,7 @@
 #include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/resource_ext.h>
+#include <linux/rwsem.h>
 #include <uapi/linux/pci.h>
 
 #include <linux/pci_ids.h>
@@ -63,6 +64,7 @@ struct pci_slot {
 	struct pci_bus		*bus;		/* Bus this slot is on */
 	struct list_head	list;		/* Node in list of slots */
 	struct hotplug_slot	*hotplug;	/* Hotplug info (move here) */
+	struct rw_semaphore	reset_lock;	/* Held during slot reset */
 	unsigned char		number;		/* PCI_SLOT(pci_dev->devfn) */
 	struct kobject		kobj;
 };

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* RE: Deadlock during PCIe hot remove
  2020-03-31 13:01     ` Lukas Wunner
@ 2020-03-31 15:02       ` Haeuptle, Michael
  2020-04-02 19:24         ` Haeuptle, Michael
  0 siblings, 1 reply; 18+ messages in thread
From: Haeuptle, Michael @ 2020-03-31 15:02 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: linux-pci, michaelhaeuptle, Christoph Hellwig

Hi Lukas,

Thanks for providing a potential patch! I tried it this morning and it DOES work for our use case. Note, however, that I'm just testing hot remove/add of actual devices and there's no VM involved here.
I will do more stress testing in the coming weeks, but I did not run into any issues in a one-hour test removing up to 20 devices in parallel.
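
In case it's useful, here's a rough sketch of the sort of harness that could drive
such parallel removals from user space.  It assumes the slots are visible as
/sys/bus/pci/slots/<name>/power; the slot names on the command line are
placeholders, and this is not the SPDK-based setup I'm actually running:

/* parallel_remove.c: ask pciehp to bring several slots down at once.
 * Build: gcc -pthread parallel_remove.c -o parallel_remove
 * Run:   ./parallel_remove 5 6 7    (slot names under /sys/bus/pci/slots/)
 */
#include <pthread.h>
#include <stdio.h>

static void *power_off(void *arg)
{
	const char *slot = arg;
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/bus/pci/slots/%s/power", slot);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return NULL;
	}
	fputs("0\n", f);	/* request slot power-off / removal */
	fclose(f);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[argc];

	for (int i = 1; i < argc; i++)
		pthread_create(&tid[i], NULL, power_off, argv[i]);
	for (int i = 1; i < argc; i++)
		pthread_join(tid[i], NULL);
	return 0;
}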

Is this patch something you are trying to upstream?

-- Michael

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* RE: Deadlock during PCIe hot remove
  2020-03-31 15:02       ` Haeuptle, Michael
@ 2020-04-02 19:24         ` Haeuptle, Michael
  0 siblings, 0 replies; 18+ messages in thread
From: Haeuptle, Michael @ 2020-04-02 19:24 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: linux-pci, Christoph Hellwig, Stein, Dan

Hi Lukas et al

We were wondering if removing the reset_lock and using the state_lock instead would be sufficient (see below). Taking out the reset_lock obviously avoids the deadlock, but the question is whether the state_lock is strong enough.

-- Michael

diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index fc5366b50..cf4e3cac0 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -157,7 +157,9 @@ static void pciehp_check_presence(struct controller *ctrl)
 {
        bool occupied;

+#ifndef DEADLOCK_FIX
        down_read(&ctrl->reset_lock);
+#endif
        mutex_lock(&ctrl->state_lock);

        occupied = pciehp_card_present_or_link_active(ctrl);
@@ -168,7 +170,9 @@ static void pciehp_check_presence(struct controller *ctrl)
                pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);

        mutex_unlock(&ctrl->state_lock);
+#ifndef DEADLOCK_FIX
        up_read(&ctrl->reset_lock);
+#endif
 }

 static int pciehp_probe(struct pcie_device *dev)
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 8bfcb8cd0..faafd51e2 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -644,12 +644,16 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
         * Disable requests have higher priority than Presence Detect Changed
         * or Data Link Layer State Changed events.
         */
+#ifndef DEADLOCK_FIX
        down_read(&ctrl->reset_lock);
+#endif
        if (events & DISABLE_SLOT)
                pciehp_handle_disable_request(ctrl);
        else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
                pciehp_handle_presence_or_link_change(ctrl, events);
+#ifndef DEADLOCK_FIX
        up_read(&ctrl->reset_lock);
+#endif

        pci_config_pm_runtime_put(pdev);
        wake_up(&ctrl->requester);
@@ -775,7 +779,12 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe)
        if (probe)
                return 0;

+#ifdef DEADLOCK_FIX
+       /* use state_lock instead of reset_lock */
+       mutex_lock(&ctrl->state_lock);
+#else
        down_write(&ctrl->reset_lock);
+#endif

        if (!ATTN_BUTTN(ctrl)) {
                ctrl_mask |= PCI_EXP_SLTCTL_PDCE;
@@ -795,7 +804,11 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe)
        ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
                 pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask);

+#ifdef DEADLOCK_FIX
+       mutex_unlock(&ctrl->state_lock);
+#else
        up_write(&ctrl->reset_lock);
+#endif
        return rc;
 }

@@ -862,7 +875,9 @@ struct controller *pcie_init(struct pcie_device *dev)
        ctrl->slot_cap = slot_cap;
        mutex_init(&ctrl->ctrl_lock);
        mutex_init(&ctrl->state_lock);
+#ifndef DEADLOCK_FIX
        init_rwsem(&ctrl->reset_lock);
+#endif
        init_waitqueue_head(&ctrl->requester);
        init_waitqueue_head(&ctrl->queue);
        INIT_DELAYED_WORK(&ctrl->button_work, pciehp_queue_pushbutton_work);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: Deadlock during PCIe hot remove
  2022-10-30  8:28 ` Christoph Hellwig
@ 2022-10-30  8:39   ` Alex Michon
  0 siblings, 0 replies; 18+ messages in thread
From: Alex Michon @ 2022-10-30  8:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: michael haeuptle, CS1PR8401MB07283E5AFD950C136FFC565E95C60,
	dstein, linux-pci, lukas

Sorry, I wanted to reply to this thread: https://lore.kernel.org/linux-pci/CS1PR8401MB07283E5AFD950C136FFC565E95C60@CS1PR8401MB0728.NAMPRD84.PROD.OUTLOOK.COM/
but it looks like something went wrong with the email delivery.

----- Original Message -----
From: "Christoph Hellwig" <hch@lst.de>
To: "Alex Michon" <amichon@kalrayinc.com>
Cc: "michael haeuptle" <michael.haeuptle@hpe.com>, "CS1PR8401MB07283E5AFD950C136FFC565E95C60" <CS1PR8401MB07283E5AFD950C136FFC565E95C60@cs1pr8401mb0728.namprd84.prod.outlook.com>, "dstein" <dstein@hpe.com>, "hch" <hch@lst.de>, "linux-pci" <linux-pci@vger.kernel.org>, "lukas" <lukas@wunner.de>
Sent: Sunday, October 30, 2022 9:28:18 AM
Subject: Re: Deadlock during PCIe hot remove

On Wed, Oct 26, 2022 at 12:05:57PM +0200, Alex Michon wrote:
> Hello, 
> 
> Is there any update on this bug ? 

What is "this bug"?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Deadlock during PCIe hot remove
       [not found] <492110694.79456.1666778757292.JavaMail.zimbra@kalray.eu>
@ 2022-10-30  8:28 ` Christoph Hellwig
  2022-10-30  8:39   ` Alex Michon
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2022-10-30  8:28 UTC (permalink / raw)
  To: Alex Michon
  Cc: michael.haeuptle, CS1PR8401MB07283E5AFD950C136FFC565E95C60,
	dstein, hch, linux-pci, lukas

On Wed, Oct 26, 2022 at 12:05:57PM +0200, Alex Michon wrote:
> Hello, 
> 
> Is there any update on this bug ? 

What is "this bug"?

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-10-30  8:39 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-24 15:21 Deadlock during PCIe hot remove Haeuptle, Michael
2020-03-24 15:35 ` Hoyer, David
2020-03-24 15:37   ` Haeuptle, Michael
2020-03-24 15:39     ` Hoyer, David
2020-03-24 16:26       ` Haeuptle, Michael
2020-03-24 16:32         ` Hoyer, David
2020-03-24 16:15 ` Lukas Wunner
2020-03-25 10:40   ` Christoph Hellwig
2020-03-26 19:30     ` Haeuptle, Michael
2020-03-29 13:04     ` Lukas Wunner
2020-03-31  8:14       ` Christoph Hellwig
2020-03-29 15:43 ` Lukas Wunner
2020-03-30 16:15   ` Haeuptle, Michael
2020-03-31 13:01     ` Lukas Wunner
2020-03-31 15:02       ` Haeuptle, Michael
2020-04-02 19:24         ` Haeuptle, Michael
     [not found] <492110694.79456.1666778757292.JavaMail.zimbra@kalray.eu>
2022-10-30  8:28 ` Christoph Hellwig
2022-10-30  8:39   ` Alex Michon
