linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Hotplug interrupt and AER recovery deadlocks
@ 2020-06-15 14:32 Ian May
  2020-06-15 14:32 ` [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock Ian May
  2020-06-15 14:32 ` [PATCH 2/2] PCIe hotplug interrupt and AER pci_slot_mutex deadlock Ian May
  0 siblings, 2 replies; 8+ messages in thread
From: Ian May @ 2020-06-15 14:32 UTC (permalink / raw)
  To: linux-pci

We've encountered two deadlocks when the hotplug interrupt and AER
recovery trigger simultaneously on the same device.  The first
deadlock is with:

       Hotplug                                AER
----------------------------       ---------------------------
device_lock(&dev->dev)             down_write(&ctrl->reset_lock)
down_read(&ctrl->reset_lock)       device_lock(&dev->dev)

//Hotplug
[  637.084657] Call Trace:
[  637.084669]  __schedule+0x2a8/0x670
[  637.084670]  schedule+0x33/0xa0
[  637.084672]  schedule_preempt_disabled+0xe/0x10
[  637.084673]  __mutex_lock.isra.9+0x26d/0x4e0
[  637.084674]  __mutex_lock_slowpath+0x13/0x20
[  637.084675]  ? __mutex_lock_slowpath+0x13/0x20
[  637.084675]  mutex_lock+0x2f/0x40                 <---mutex_lock(&dev->mutex);
[  637.084678]  __device_driver_lock+0x2b/0x50
[  637.084679]  device_release_driver_internal+0x1f/0x1b0
[  637.084680]  device_release_driver+0x12/0x20
[  637.084680]  bus_remove_device+0xe1/0x150
[  637.084682]  device_del+0x16a/0x3a0
[  637.084687]  pci_remove_bus_device+0x7e/0x100
[  637.084690]  pci_stop_and_remove_bus_device+0x1a/0x20
[  637.084691]  pciehp_unconfigure_device+0x80/0x130
[  637.084692]  pciehp_disable_slot+0x6e/0x120
[  637.084694]  pciehp_handle_presence_or_link_change+0x76/0x470
[  637.084696]  pciehp_ist+0x14a/0x1e0               <---down_read(&ctrl->reset_lock);

//AER
[  637.110208] Call Trace:
[  637.110209]  __schedule+0x2a8/0x670
[  637.110213]  ? sched_clock+0x9/0x10
[  637.110222]  schedule+0x33/0xa0
[  637.110230]  rwsem_down_write_slowpath+0x233/0x4a0
[  637.110237]  ? __schedule+0x2b0/0x670
[  637.110239]  down_write+0x3d/0x40
[  637.110241]  ? down_write+0x3d/0x40               <---down_write(&ctrl->reset_lock)
[  637.110245]  pciehp_reset_slot+0x58/0x150
[  637.110250]  pci_reset_hotplug_slot+0x43/0x70
[  637.110256]  pci_slot_reset+0x12f/0x170           <---mutex_lock(&dev->mutex);
[  637.110264]  pci_bus_error_reset+0xac/0xd0
[  637.110266]  pcie_do_recovery+0x255/0x2c0
[  637.110267]  aer_recover_work_func+0xd2/0x100
[  637.110270]  process_one_work+0x1fd/0x3f0
[  637.110279]  worker_thread+0x34/0x410
[  637.110287]  kthread+0x121/0x140
[  637.110288]  ? process_one_work+0x3f0/0x3f0
[  637.110289]  ? kthread_park+0xb0/0xb0
[  637.110290]  ret_from_fork+0x22/0x40

Applying the first patch then exposes the following 2nd deadlock:

           Hotplug                                AER
----------------------------       ---------------------------
down_read(&ctrl->reset_lock)       mutex_lock(&pci_slot_mutex)
mutex_lock(&pci_slot_mutex)        down_write(&ctrl->reset_lock)

//Hotplug
[  391.250990] INFO: task irq/42-pciehp:1960 blocked for more than 120 seconds.
[  391.259086]       Tainted: P           OE     5.3.0-51-generic #44~18.04.2
[  391.266982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  391.275977] irq/42-pciehp   D    0  1960      2 0x80004000
[  391.275979] Call Trace:
[  391.275986]  __schedule+0x2b9/0x6c0
[  391.275987]  ? __switch_to_asm+0x34/0x70
[  391.275989]  schedule+0x42/0xb0
[  391.275992]  schedule_preempt_disabled+0xe/0x10
[  391.275997]  __mutex_lock.isra.0+0x182/0x4f0
[  391.276003]  ? raw_pci_read+0x29/0x40
[  391.276005]  ? pci_read+0x2c/0x30
[  391.276010]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
[  391.276015]  __mutex_lock_slowpath+0x13/0x20
[  391.276017]  mutex_lock+0x2e/0x40                    <---pci_slot_mutex
[  391.276021]  pci_dev_assign_slot+0x19/0x5f
[  391.276025]  pci_setup_device+0xc1/0x150
[  391.276029]  ? online_show+0x20/0x70
[  391.276034]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
[  391.276035]  pci_scan_single_device+0x95/0xd0
[  391.276041]  pci_scan_slot+0x4a/0x110
[  391.276042]  pciehp_configure_device+0x40/0x120
[  391.276044]  pciehp_handle_presence_or_link_change+0x184/0x290
[  391.276049]  pciehp_ist+0x141/0x150                  <---ctrl->reset_lock
[  391.276055]  irq_thread_fn+0x28/0x60
[  391.276056]  irq_thread+0xda/0x170
[  391.276057]  ? irq_forced_thread_fn+0x80/0x80
[  391.276059]  kthread+0x104/0x140
[  391.276060]  ? irq_thread_check_affinity+0xf0/0xf0
[  391.276061]  ? kthread_park+0x80/0x80
[  391.276062]  ret_from_fork+0x22/0x40

//AER
[  391.225984] INFO: task kworker/39:1:1593 blocked for more than 120 seconds.
[  391.233989]       Tainted: P           OE     5.3.0-51-generic #44~18.04.2
[  391.241887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  391.250882] kworker/39:1    D    0  1593      2 0x80004000
[  391.250890] Workqueue: events aer_recover_work_func
[  391.250893] Call Trace:
[  391.250899]  __schedule+0x2b9/0x6c0
[  391.250902]  ? sched_clock+0x9/0x10
[  391.250903]  schedule+0x42/0xb0
[  391.250906]  rwsem_down_write_slowpath+0x23a/0x4c0
[  391.250908]  down_write+0x41/0x50
[  391.250909]  pciehp_reset_lock+0x12/0x20
[  391.250911]  pci_slot_reset+0xa1/0x170              <---ctrl->reset_lock
[  391.250912]  pci_bus_error_reset+0xfb/0x130         <---pci_slot_mutex
[  391.250914]  pcie_do_recovery+0x238/0x264
[  391.250917]  aer_recover_work_func.cold+0x9a/0xa0
[  391.250919]  process_one_work+0x1da/0x390
[  391.250920]  worker_thread+0x4d/0x400
[  391.250923]  kthread+0x104/0x140
[  391.250924]  ? process_one_work+0x390/0x390
[  391.250925]  ? kthread_park+0x80/0x80
[  391.250927]  ret_from_fork+0x22/0x40

The 2nd patch is unlocking and locking the pci_slot mutex in the AER path around
the controller reset call.  This allows for both the Hotplug and AER execution 
to complete.

Ian May (2):
  PCIe hotplug interrupt and AER deadlock with reset_lock and
    device_lock
  PCIe hotplug interrupt and AER pci_slot_mutex deadlock

 drivers/pci/hotplug/pciehp.h      |  2 ++
 drivers/pci/hotplug/pciehp_core.c |  2 ++
 drivers/pci/hotplug/pciehp_hpc.c  | 21 ++++++++++++++++++---
 drivers/pci/pci.c                 |  8 ++++++++
 include/linux/pci_hotplug.h       |  2 ++
 5 files changed, 32 insertions(+), 3 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
  2020-06-15 14:32 [PATCH 0/2] Hotplug interrupt and AER recovery deadlocks Ian May
@ 2020-06-15 14:32 ` Ian May
  2020-06-15 18:56   ` Lukas Wunner
  2020-06-15 14:32 ` [PATCH 2/2] PCIe hotplug interrupt and AER pci_slot_mutex deadlock Ian May
  1 sibling, 1 reply; 8+ messages in thread
From: Ian May @ 2020-06-15 14:32 UTC (permalink / raw)
  To: linux-pci

Currently when a hotplug interrupt and AER recovery triggers simultaneously
the following deadlock can occur.

        Hotplug				       AER
----------------------------       ---------------------------
device_lock(&dev->dev)		   down_write(&ctrl->reset_lock)
down_read(&ctrl->reset_lock)       device_lock(&dev->dev)

This patch adds a reset_lock and reset_unlock hotplug_slot_op.  This would allow
the controller reset_lock/reset_unlock to be moved from pciehp_reset_slot to
pci_slot_reset function allowing the controller reset_lock to be acquired before
the device lock allowing for both hotplug and AER to grab the reset_lock
and device lock in proper order.

Signed-off-by: Ian May <ian.may@canonical.com>
---
 drivers/pci/hotplug/pciehp.h      |  2 ++
 drivers/pci/hotplug/pciehp_core.c |  2 ++
 drivers/pci/hotplug/pciehp_hpc.c  | 21 ++++++++++++++++++---
 drivers/pci/pci.c                 |  6 ++++++
 include/linux/pci_hotplug.h       |  2 ++
 5 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index e28b4fffd84d..1e98604cec83 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -183,6 +183,8 @@ void pciehp_release_ctrl(struct controller *ctrl);
 
 int pciehp_sysfs_enable_slot(struct hotplug_slot *hotplug_slot);
 int pciehp_sysfs_disable_slot(struct hotplug_slot *hotplug_slot);
+int pciehp_reset_lock(struct hotplug_slot *hotplug_slot);
+int pciehp_reset_unlock(struct hotplug_slot *hotplug_slot);
 int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe);
 int pciehp_get_attention_status(struct hotplug_slot *hotplug_slot, u8 *status);
 int pciehp_set_raw_indicator_status(struct hotplug_slot *h_slot, u8 status);
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index 4c032d75c874..a8da3e6a59b8 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -63,6 +63,8 @@ static int init_slot(struct controller *ctrl)
 	ops->get_power_status = get_power_status;
 	ops->get_adapter_status = get_adapter_status;
 	ops->reset_slot = pciehp_reset_slot;
+	ops->reset_lock = pciehp_reset_lock;
+	ops->reset_unlock = pciehp_reset_unlock;
 	if (MRL_SENS(ctrl))
 		ops->get_latch_status = get_latch_status;
 	if (ATTN_LED(ctrl)) {
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 81b23f07719e..185ec9d1b0d0 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -798,6 +798,24 @@ void pcie_disable_interrupt(struct controller *ctrl)
 	pcie_write_cmd(ctrl, 0, mask);
 }
 
+int pciehp_reset_lock(struct hotplug_slot *hotplug_slot)
+{
+	struct controller *ctrl = to_ctrl(hotplug_slot);
+
+	down_write(&ctrl->reset_lock);
+
+	return 0;
+}
+
+int pciehp_reset_unlock(struct hotplug_slot *hotplug_slot)
+{
+	struct controller *ctrl = to_ctrl(hotplug_slot);
+
+	up_write(&ctrl->reset_lock);
+
+	return 0;
+}
+
 /*
  * pciehp has a 1:1 bus:slot relationship so we ultimately want a secondary
  * bus reset of the bridge, but at the same time we want to ensure that it is
@@ -816,8 +834,6 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe)
 	if (probe)
 		return 0;
 
-	down_write(&ctrl->reset_lock);
-
 	if (!ATTN_BUTTN(ctrl)) {
 		ctrl_mask |= PCI_EXP_SLTCTL_PDCE;
 		stat_mask |= PCI_EXP_SLTSTA_PDC;
@@ -836,7 +852,6 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe)
 	ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
 		 pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask);
 
-	up_write(&ctrl->reset_lock);
 	return rc;
 }
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index da21293f1111..e32c5a1a706e 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5228,14 +5228,20 @@ static int pci_slot_reset(struct pci_slot *slot, int probe)
 		return -ENOTTY;
 
 	if (!probe)
+	{
+		slot->hotplug->ops->reset_lock(slot->hotplug);
 		pci_slot_lock(slot);
+	}
 
 	might_sleep();
 
 	rc = pci_reset_hotplug_slot(slot->hotplug, probe);
 
 	if (!probe)
+	{
 		pci_slot_unlock(slot);
+		slot->hotplug->ops->reset_unlock(slot->hotplug);
+	}
 
 	return rc;
 }
diff --git a/include/linux/pci_hotplug.h b/include/linux/pci_hotplug.h
index f694eb2ca978..fce5ad979346 100644
--- a/include/linux/pci_hotplug.h
+++ b/include/linux/pci_hotplug.h
@@ -45,6 +45,8 @@ struct hotplug_slot_ops {
 	int (*get_latch_status)		(struct hotplug_slot *slot, u8 *value);
 	int (*get_adapter_status)	(struct hotplug_slot *slot, u8 *value);
 	int (*reset_slot)		(struct hotplug_slot *slot, int probe);
+	int (*reset_lock)               (struct hotplug_slot *slot);
+	int (*reset_unlock)             (struct hotplug_slot *slot);
 };
 
 /**
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] PCIe hotplug interrupt and AER pci_slot_mutex deadlock
  2020-06-15 14:32 [PATCH 0/2] Hotplug interrupt and AER recovery deadlocks Ian May
  2020-06-15 14:32 ` [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock Ian May
@ 2020-06-15 14:32 ` Ian May
  1 sibling, 0 replies; 8+ messages in thread
From: Ian May @ 2020-06-15 14:32 UTC (permalink / raw)
  To: linux-pci

In the AER recovery code path, pci_bus_error_reset acquires the pci_slot_mutex,
unlock and relock in call to pci_slot_reset around controller reset to lock in
same order as the hotplug code path.

        Hotplug                                AER
----------------------------       ---------------------------
down_read(&ctrl->reset_lock)	   mutex_lock(&pci_slot_mutex)
mutex_lock(&pci_slot_mutex)	   down_write(&ctrl->reset_lock)

Signed-off-by: Ian May <ian.may@canonical.com>
---
 drivers/pci/pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e32c5a1a706e..4eeca8b18664 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5229,6 +5229,7 @@ static int pci_slot_reset(struct pci_slot *slot, int probe)
 
 	if (!probe)
 	{
+		mutex_unlock(&pci_slot_mutex);
 		slot->hotplug->ops->reset_lock(slot->hotplug);
 		pci_slot_lock(slot);
 	}
@@ -5241,6 +5242,7 @@ static int pci_slot_reset(struct pci_slot *slot, int probe)
 	{
 		pci_slot_unlock(slot);
 		slot->hotplug->ops->reset_unlock(slot->hotplug);
+		mutex_lock(&pci_slot_mutex);
 	}
 
 	return rc;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
  2020-06-15 14:32 ` [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock Ian May
@ 2020-06-15 18:56   ` Lukas Wunner
  2020-06-16 17:13     ` Ian May
  2020-07-14 23:58     ` Bjorn Helgaas
  0 siblings, 2 replies; 8+ messages in thread
From: Lukas Wunner @ 2020-06-15 18:56 UTC (permalink / raw)
  To: Ian May; +Cc: linux-pci

On Mon, Jun 15, 2020 at 09:32:49AM -0500, Ian May wrote:
> Currently when a hotplug interrupt and AER recovery triggers simultaneously
> the following deadlock can occur.
> 
>         Hotplug				       AER
> ----------------------------       ---------------------------
> device_lock(&dev->dev)		   down_write(&ctrl->reset_lock)
> down_read(&ctrl->reset_lock)       device_lock(&dev->dev)
> 
> This patch adds a reset_lock and reset_unlock hotplug_slot_op.
> This would allow the controller reset_lock/reset_unlock to be moved
> from pciehp_reset_slot to pci_slot_reset function allowing the controller
> reset_lock to be acquired before the device lock allowing for both hotplug
> and AER to grab the reset_lock and device lock in proper order.

I've posted a patch to address such issues on Mar 31, just haven't
gotten around to respin it with a proper commit message:

https://lore.kernel.org/linux-pci/20200331130139.46oxbade6rcbaicb@wunner.de/

I've solved it by moving the reset lock into struct pci_slot.
I think that's simpler than adding two callbacks.

Do you think the AER deadlock could be fixed based on my approach?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
  2020-06-15 18:56   ` Lukas Wunner
@ 2020-06-16 17:13     ` Ian May
  2020-07-17  5:20       ` Lukas Wunner
  2020-07-14 23:58     ` Bjorn Helgaas
  1 sibling, 1 reply; 8+ messages in thread
From: Ian May @ 2020-06-16 17:13 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: linux-pci

Hi Lukas,

Thanks for the quick reply!  I like your solution and have confirmed it
solves the first deadlock we see between the Hotplug interrupt and AER
recovery.

Thanks,
Ian

On 6/15/20 1:56 PM, Lukas Wunner wrote:
> On Mon, Jun 15, 2020 at 09:32:49AM -0500, Ian May wrote:
>> Currently when a hotplug interrupt and AER recovery triggers simultaneously
>> the following deadlock can occur.
>>
>>         Hotplug				       AER
>> ----------------------------       ---------------------------
>> device_lock(&dev->dev)		   down_write(&ctrl->reset_lock)
>> down_read(&ctrl->reset_lock)       device_lock(&dev->dev)
>>
>> This patch adds a reset_lock and reset_unlock hotplug_slot_op.
>> This would allow the controller reset_lock/reset_unlock to be moved
>> from pciehp_reset_slot to pci_slot_reset function allowing the controller
>> reset_lock to be acquired before the device lock allowing for both hotplug
>> and AER to grab the reset_lock and device lock in proper order.
> I've posted a patch to address such issues on Mar 31, just haven't
> gotten around to respin it with a proper commit message:
>
> https://lore.kernel.org/linux-pci/20200331130139.46oxbade6rcbaicb@wunner.de/
>
> I've solved it by moving the reset lock into struct pci_slot.
> I think that's simpler than adding two callbacks.
>
> Do you think the AER deadlock could be fixed based on my approach?
>
> Thanks,
>
> Lukas


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
  2020-06-15 18:56   ` Lukas Wunner
  2020-06-16 17:13     ` Ian May
@ 2020-07-14 23:58     ` Bjorn Helgaas
  1 sibling, 0 replies; 8+ messages in thread
From: Bjorn Helgaas @ 2020-07-14 23:58 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Ian May, linux-pci

On Mon, Jun 15, 2020 at 08:56:50PM +0200, Lukas Wunner wrote:
> On Mon, Jun 15, 2020 at 09:32:49AM -0500, Ian May wrote:
> > Currently when a hotplug interrupt and AER recovery triggers simultaneously
> > the following deadlock can occur.
> > 
> >         Hotplug				       AER
> > ----------------------------       ---------------------------
> > device_lock(&dev->dev)		   down_write(&ctrl->reset_lock)
> > down_read(&ctrl->reset_lock)       device_lock(&dev->dev)
> > 
> > This patch adds a reset_lock and reset_unlock hotplug_slot_op.
> > This would allow the controller reset_lock/reset_unlock to be moved
> > from pciehp_reset_slot to pci_slot_reset function allowing the controller
> > reset_lock to be acquired before the device lock allowing for both hotplug
> > and AER to grab the reset_lock and device lock in proper order.
> 
> I've posted a patch to address such issues on Mar 31, just haven't
> gotten around to respin it with a proper commit message:
> 
> https://lore.kernel.org/linux-pci/20200331130139.46oxbade6rcbaicb@wunner.de/
> 
> I've solved it by moving the reset lock into struct pci_slot.
> I think that's simpler than adding two callbacks.
> 
> Do you think the AER deadlock could be fixed based on my approach?

How should we proceed on this?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
  2020-06-16 17:13     ` Ian May
@ 2020-07-17  5:20       ` Lukas Wunner
       [not found]         ` <CAE1ug=a8PJpKh0Jx2ZZxo5kwQvvK883xs+mzyMWbW8T1oqyKDg@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Lukas Wunner @ 2020-07-17  5:20 UTC (permalink / raw)
  To: Ian May; +Cc: linux-pci

On Tue, Jun 16, 2020 at 12:13:23PM -0500, Ian May wrote:
> Thanks for the quick reply! I like your solution and have confirmed it
> solves the first deadlock we see between the Hotplug interrupt and AER
> recovery.

Thank you for the confirmation (and sorry for the delay).  I'm cooking
up a proper patch right now.

One question regarding your patch [2/2]:  If, instead of this patch,
you change pci_bus_error_reset() to call "device_lock(bridge)" rather
than "mutex_lock(&pci_slot_mutex)", do you still see deadlocks?

Taking the pci_slot_mutex in pci_bus_error_reset() was actually the
right thing to do because it holds the driver of the hotplug port
in place.  (The hotplug port above the bus being reset.)  Without
that, dereferencing slot->hotplug in pci_slot_reset() wouldn't be
safe.  My fear is that acquiring the device_lock() of the bridge
leading to the bus being reset may cause other deadlocks, in particular
in cascaded topologies such as Thunderbolt, which I suspect may be
what you're dealing with.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
       [not found]         ` <CAE1ug=a8PJpKh0Jx2ZZxo5kwQvvK883xs+mzyMWbW8T1oqyKDg@mail.gmail.com>
@ 2020-08-01 16:14           ` Lukas Wunner
  0 siblings, 0 replies; 8+ messages in thread
From: Lukas Wunner @ 2020-08-01 16:14 UTC (permalink / raw)
  To: Ian May; +Cc: linux-pci, Keith Busch

[cc += Keith]

On Fri, Jul 17, 2020 at 09:02:22AM -0500, Ian May wrote:
> I do now have a "better" patch that I was going to submit to the list
> where I converted the pci_slot_mutex to a rw_semaphore.  Do you see
> any potential problems with changing the lock type?  I attached the
> patch if you are interested in checking it over.

The question is, if pci_slot_mutex is an rw_semaphore, can it happen
that pciehp acquires it for writing, provoking a deadlock like this:

        Hotplug                                AER
	----------------------------       ---------------------------
      1 down_read(&ctrl->reset_lock)
	                                 2 down_read(&pci_slot_mutex)
      3 down_write(&pci_slot_mutex)
                                         4 down_write(&ctrl->reset_lock)
	** DEADLOCK **

I think this can happen if the device inserted into the hotplug slot
contains a PCIe switch which itself has hotplug ports.  That's the
case with Thunderbolt:  Every Thunderbolt device contains a PCIe
switch with hotplug ports to extend the Thunderbolt chain.  E.g.
the PCIe hierarchy looks like this for a Thunderbolt host controller
with a chain of two devices:

Root - Upstream - Downstream - Upstream - Downstream - Upstream - Downstream

(host ...)                     (1st device ...)        (2nd device ...)

When a Thunderbolt device is attached, pci_slot_mutex would be taken
for writing in pci_create_slot():

pciehp_configure_device()
  pci_scan_slot()
    pci_scan_single_device()
      pci_scan_device()
            pci_setup_device()
                pci_dev_assign_slot() # acquire pci_slot_mutex for reading
        pci_device_add() # match_driver = false; device_add()
    pci_bus_add_devices()
      pci_bus_add_device() # match_driver = true;  device_attach()
        device_attach()
          __device_attach()
            __device_attach_driver()
              driver_probe_device()
                pcie_portdrv_probe()
                  pcie_port_device_register()
                    pcie_device_init()
                      device_register()
                        device_add()
                          bus_probe_device()
                            device_initial_probe()
                              __device_attach()
                                __device_attach_driver()
                                  driver_probe_device()
                                    pciehp_probe()
                                      init_slot()
				        pci_hp_initialize()
					  pci_create_slot()
					    down_write(pci_slot_mutex)

(You may want to double-check that I got this right.)

In principle, Keith did the right thing to acquire pci_slot_mutex in
pci_bus_error_reset() for accessing the bus->slots list.

I need to think some more to come up with a solution for this particular
deadlock.  Maybe using a klist and traversing it with klist_iter_init()
(holds a ref on each slot, allowing concurrent list access) or something
along those lines...

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-08-01 16:14 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-15 14:32 [PATCH 0/2] Hotplug interrupt and AER recovery deadlocks Ian May
2020-06-15 14:32 ` [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock Ian May
2020-06-15 18:56   ` Lukas Wunner
2020-06-16 17:13     ` Ian May
2020-07-17  5:20       ` Lukas Wunner
     [not found]         ` <CAE1ug=a8PJpKh0Jx2ZZxo5kwQvvK883xs+mzyMWbW8T1oqyKDg@mail.gmail.com>
2020-08-01 16:14           ` Lukas Wunner
2020-07-14 23:58     ` Bjorn Helgaas
2020-06-15 14:32 ` [PATCH 2/2] PCIe hotplug interrupt and AER pci_slot_mutex deadlock Ian May

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).