All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Hotplug interrupt and AER recovery deadlocks
@ 2020-06-15 14:32 Ian May
  2020-06-15 14:32 ` [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock Ian May
  2020-06-15 14:32 ` [PATCH 2/2] PCIe hotplug interrupt and AER pci_slot_mutex deadlock Ian May
  0 siblings, 2 replies; 8+ messages in thread
From: Ian May @ 2020-06-15 14:32 UTC (permalink / raw)
  To: linux-pci

We've encountered two deadlocks when the hotplug interrupt and AER
recovery trigger simultaneously on the same device.  The first
deadlock is with:

       Hotplug                                AER
----------------------------       ---------------------------
device_lock(&dev->dev)             down_write(&ctrl->reset_lock)
down_read(&ctrl->reset_lock)       device_lock(&dev->dev)

//Hotplug
[  637.084657] Call Trace:
[  637.084669]  __schedule+0x2a8/0x670
[  637.084670]  schedule+0x33/0xa0
[  637.084672]  schedule_preempt_disabled+0xe/0x10
[  637.084673]  __mutex_lock.isra.9+0x26d/0x4e0
[  637.084674]  __mutex_lock_slowpath+0x13/0x20
[  637.084675]  ? __mutex_lock_slowpath+0x13/0x20
[  637.084675]  mutex_lock+0x2f/0x40                 <---mutex_lock(&dev->mutex);
[  637.084678]  __device_driver_lock+0x2b/0x50
[  637.084679]  device_release_driver_internal+0x1f/0x1b0
[  637.084680]  device_release_driver+0x12/0x20
[  637.084680]  bus_remove_device+0xe1/0x150
[  637.084682]  device_del+0x16a/0x3a0
[  637.084687]  pci_remove_bus_device+0x7e/0x100
[  637.084690]  pci_stop_and_remove_bus_device+0x1a/0x20
[  637.084691]  pciehp_unconfigure_device+0x80/0x130
[  637.084692]  pciehp_disable_slot+0x6e/0x120
[  637.084694]  pciehp_handle_presence_or_link_change+0x76/0x470
[  637.084696]  pciehp_ist+0x14a/0x1e0               <---down_read(&ctrl->reset_lock);

//AER
[  637.110208] Call Trace:
[  637.110209]  __schedule+0x2a8/0x670
[  637.110213]  ? sched_clock+0x9/0x10
[  637.110222]  schedule+0x33/0xa0
[  637.110230]  rwsem_down_write_slowpath+0x233/0x4a0
[  637.110237]  ? __schedule+0x2b0/0x670
[  637.110239]  down_write+0x3d/0x40
[  637.110241]  ? down_write+0x3d/0x40               <---down_write(&ctrl->reset_lock)
[  637.110245]  pciehp_reset_slot+0x58/0x150
[  637.110250]  pci_reset_hotplug_slot+0x43/0x70
[  637.110256]  pci_slot_reset+0x12f/0x170           <---mutex_lock(&dev->mutex);
[  637.110264]  pci_bus_error_reset+0xac/0xd0
[  637.110266]  pcie_do_recovery+0x255/0x2c0
[  637.110267]  aer_recover_work_func+0xd2/0x100
[  637.110270]  process_one_work+0x1fd/0x3f0
[  637.110279]  worker_thread+0x34/0x410
[  637.110287]  kthread+0x121/0x140
[  637.110288]  ? process_one_work+0x3f0/0x3f0
[  637.110289]  ? kthread_park+0xb0/0xb0
[  637.110290]  ret_from_fork+0x22/0x40

Applying the first patch then exposes the following 2nd deadlock:

           Hotplug                                AER
----------------------------       ---------------------------
down_read(&ctrl->reset_lock)       mutex_lock(&pci_slot_mutex)
mutex_lock(&pci_slot_mutex)        down_write(&ctrl->reset_lock)

//Hotplug
[  391.250990] INFO: task irq/42-pciehp:1960 blocked for more than 120 seconds.
[  391.259086]       Tainted: P           OE     5.3.0-51-generic #44~18.04.2
[  391.266982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  391.275977] irq/42-pciehp   D    0  1960      2 0x80004000
[  391.275979] Call Trace:
[  391.275986]  __schedule+0x2b9/0x6c0
[  391.275987]  ? __switch_to_asm+0x34/0x70
[  391.275989]  schedule+0x42/0xb0
[  391.275992]  schedule_preempt_disabled+0xe/0x10
[  391.275997]  __mutex_lock.isra.0+0x182/0x4f0
[  391.276003]  ? raw_pci_read+0x29/0x40
[  391.276005]  ? pci_read+0x2c/0x30
[  391.276010]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
[  391.276015]  __mutex_lock_slowpath+0x13/0x20
[  391.276017]  mutex_lock+0x2e/0x40                    <---pci_slot_mutex
[  391.276021]  pci_dev_assign_slot+0x19/0x5f
[  391.276025]  pci_setup_device+0xc1/0x150
[  391.276029]  ? online_show+0x20/0x70
[  391.276034]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
[  391.276035]  pci_scan_single_device+0x95/0xd0
[  391.276041]  pci_scan_slot+0x4a/0x110
[  391.276042]  pciehp_configure_device+0x40/0x120
[  391.276044]  pciehp_handle_presence_or_link_change+0x184/0x290
[  391.276049]  pciehp_ist+0x141/0x150                  <---ctrl->reset_lock
[  391.276055]  irq_thread_fn+0x28/0x60
[  391.276056]  irq_thread+0xda/0x170
[  391.276057]  ? irq_forced_thread_fn+0x80/0x80
[  391.276059]  kthread+0x104/0x140
[  391.276060]  ? irq_thread_check_affinity+0xf0/0xf0
[  391.276061]  ? kthread_park+0x80/0x80
[  391.276062]  ret_from_fork+0x22/0x40

//AER
[  391.225984] INFO: task kworker/39:1:1593 blocked for more than 120 seconds.
[  391.233989]       Tainted: P           OE     5.3.0-51-generic #44~18.04.2
[  391.241887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  391.250882] kworker/39:1    D    0  1593      2 0x80004000
[  391.250890] Workqueue: events aer_recover_work_func
[  391.250893] Call Trace:
[  391.250899]  __schedule+0x2b9/0x6c0
[  391.250902]  ? sched_clock+0x9/0x10
[  391.250903]  schedule+0x42/0xb0
[  391.250906]  rwsem_down_write_slowpath+0x23a/0x4c0
[  391.250908]  down_write+0x41/0x50
[  391.250909]  pciehp_reset_lock+0x12/0x20
[  391.250911]  pci_slot_reset+0xa1/0x170              <---ctrl->reset_lock
[  391.250912]  pci_bus_error_reset+0xfb/0x130         <---pci_slot_mutex
[  391.250914]  pcie_do_recovery+0x238/0x264
[  391.250917]  aer_recover_work_func.cold+0x9a/0xa0
[  391.250919]  process_one_work+0x1da/0x390
[  391.250920]  worker_thread+0x4d/0x400
[  391.250923]  kthread+0x104/0x140
[  391.250924]  ? process_one_work+0x390/0x390
[  391.250925]  ? kthread_park+0x80/0x80
[  391.250927]  ret_from_fork+0x22/0x40

The 2nd patch is unlocking and locking the pci_slot mutex in the AER path around
the controller reset call.  This allows for both the Hotplug and AER execution 
to complete.

Ian May (2):
  PCIe hotplug interrupt and AER deadlock with reset_lock and
    device_lock
  PCIe hotplug interrupt and AER pci_slot_mutex deadlock

 drivers/pci/hotplug/pciehp.h      |  2 ++
 drivers/pci/hotplug/pciehp_core.c |  2 ++
 drivers/pci/hotplug/pciehp_hpc.c  | 21 ++++++++++++++++++---
 drivers/pci/pci.c                 |  8 ++++++++
 include/linux/pci_hotplug.h       |  2 ++
 5 files changed, 32 insertions(+), 3 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
  2020-06-15 14:32 [PATCH 0/2] Hotplug interrupt and AER recovery deadlocks Ian May
@ 2020-06-15 14:32 ` Ian May
  2020-06-15 18:56   ` Lukas Wunner
  2020-06-15 14:32 ` [PATCH 2/2] PCIe hotplug interrupt and AER pci_slot_mutex deadlock Ian May
  1 sibling, 1 reply; 8+ messages in thread
From: Ian May @ 2020-06-15 14:32 UTC (permalink / raw)
  To: linux-pci

Currently when a hotplug interrupt and AER recovery triggers simultaneously
the following deadlock can occur.

        Hotplug				       AER
----------------------------       ---------------------------
device_lock(&dev->dev)		   down_write(&ctrl->reset_lock)
down_read(&ctrl->reset_lock)       device_lock(&dev->dev)

This patch adds a reset_lock and reset_unlock hotplug_slot_op.  This would allow
the controller reset_lock/reset_unlock to be moved from pciehp_reset_slot to
pci_slot_reset function allowing the controller reset_lock to be acquired before
the device lock allowing for both hotplug and AER to grab the reset_lock
and device lock in proper order.

Signed-off-by: Ian May <ian.may@canonical.com>
---
 drivers/pci/hotplug/pciehp.h      |  2 ++
 drivers/pci/hotplug/pciehp_core.c |  2 ++
 drivers/pci/hotplug/pciehp_hpc.c  | 21 ++++++++++++++++++---
 drivers/pci/pci.c                 |  6 ++++++
 include/linux/pci_hotplug.h       |  2 ++
 5 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index e28b4fffd84d..1e98604cec83 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -183,6 +183,8 @@ void pciehp_release_ctrl(struct controller *ctrl);
 
 int pciehp_sysfs_enable_slot(struct hotplug_slot *hotplug_slot);
 int pciehp_sysfs_disable_slot(struct hotplug_slot *hotplug_slot);
+int pciehp_reset_lock(struct hotplug_slot *hotplug_slot);
+int pciehp_reset_unlock(struct hotplug_slot *hotplug_slot);
 int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe);
 int pciehp_get_attention_status(struct hotplug_slot *hotplug_slot, u8 *status);
 int pciehp_set_raw_indicator_status(struct hotplug_slot *h_slot, u8 status);
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index 4c032d75c874..a8da3e6a59b8 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -63,6 +63,8 @@ static int init_slot(struct controller *ctrl)
 	ops->get_power_status = get_power_status;
 	ops->get_adapter_status = get_adapter_status;
 	ops->reset_slot = pciehp_reset_slot;
+	ops->reset_lock = pciehp_reset_lock;
+	ops->reset_unlock = pciehp_reset_unlock;
 	if (MRL_SENS(ctrl))
 		ops->get_latch_status = get_latch_status;
 	if (ATTN_LED(ctrl)) {
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 81b23f07719e..185ec9d1b0d0 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -798,6 +798,24 @@ void pcie_disable_interrupt(struct controller *ctrl)
 	pcie_write_cmd(ctrl, 0, mask);
 }
 
+int pciehp_reset_lock(struct hotplug_slot *hotplug_slot)
+{
+	struct controller *ctrl = to_ctrl(hotplug_slot);
+
+	down_write(&ctrl->reset_lock);
+
+	return 0;
+}
+
+int pciehp_reset_unlock(struct hotplug_slot *hotplug_slot)
+{
+	struct controller *ctrl = to_ctrl(hotplug_slot);
+
+	up_write(&ctrl->reset_lock);
+
+	return 0;
+}
+
 /*
  * pciehp has a 1:1 bus:slot relationship so we ultimately want a secondary
  * bus reset of the bridge, but at the same time we want to ensure that it is
@@ -816,8 +834,6 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe)
 	if (probe)
 		return 0;
 
-	down_write(&ctrl->reset_lock);
-
 	if (!ATTN_BUTTN(ctrl)) {
 		ctrl_mask |= PCI_EXP_SLTCTL_PDCE;
 		stat_mask |= PCI_EXP_SLTSTA_PDC;
@@ -836,7 +852,6 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe)
 	ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
 		 pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask);
 
-	up_write(&ctrl->reset_lock);
 	return rc;
 }
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index da21293f1111..e32c5a1a706e 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5228,14 +5228,20 @@ static int pci_slot_reset(struct pci_slot *slot, int probe)
 		return -ENOTTY;
 
 	if (!probe)
+	{
+		slot->hotplug->ops->reset_lock(slot->hotplug);
 		pci_slot_lock(slot);
+	}
 
 	might_sleep();
 
 	rc = pci_reset_hotplug_slot(slot->hotplug, probe);
 
 	if (!probe)
+	{
 		pci_slot_unlock(slot);
+		slot->hotplug->ops->reset_unlock(slot->hotplug);
+	}
 
 	return rc;
 }
diff --git a/include/linux/pci_hotplug.h b/include/linux/pci_hotplug.h
index f694eb2ca978..fce5ad979346 100644
--- a/include/linux/pci_hotplug.h
+++ b/include/linux/pci_hotplug.h
@@ -45,6 +45,8 @@ struct hotplug_slot_ops {
 	int (*get_latch_status)		(struct hotplug_slot *slot, u8 *value);
 	int (*get_adapter_status)	(struct hotplug_slot *slot, u8 *value);
 	int (*reset_slot)		(struct hotplug_slot *slot, int probe);
+	int (*reset_lock)               (struct hotplug_slot *slot);
+	int (*reset_unlock)             (struct hotplug_slot *slot);
 };
 
 /**
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] PCIe hotplug interrupt and AER pci_slot_mutex deadlock
  2020-06-15 14:32 [PATCH 0/2] Hotplug interrupt and AER recovery deadlocks Ian May
  2020-06-15 14:32 ` [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock Ian May
@ 2020-06-15 14:32 ` Ian May
  1 sibling, 0 replies; 8+ messages in thread
From: Ian May @ 2020-06-15 14:32 UTC (permalink / raw)
  To: linux-pci

In the AER recovery code path, pci_bus_error_reset acquires the pci_slot_mutex,
unlock and relock in call to pci_slot_reset around controller reset to lock in
same order as the hotplug code path.

        Hotplug                                AER
----------------------------       ---------------------------
down_read(&ctrl->reset_lock)	   mutex_lock(&pci_slot_mutex)
mutex_lock(&pci_slot_mutex)	   down_write(&ctrl->reset_lock)

Signed-off-by: Ian May <ian.may@canonical.com>
---
 drivers/pci/pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e32c5a1a706e..4eeca8b18664 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5229,6 +5229,7 @@ static int pci_slot_reset(struct pci_slot *slot, int probe)
 
 	if (!probe)
 	{
+		mutex_unlock(&pci_slot_mutex);
 		slot->hotplug->ops->reset_lock(slot->hotplug);
 		pci_slot_lock(slot);
 	}
@@ -5241,6 +5242,7 @@ static int pci_slot_reset(struct pci_slot *slot, int probe)
 	{
 		pci_slot_unlock(slot);
 		slot->hotplug->ops->reset_unlock(slot->hotplug);
+		mutex_lock(&pci_slot_mutex);
 	}
 
 	return rc;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
  2020-06-15 14:32 ` [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock Ian May
@ 2020-06-15 18:56   ` Lukas Wunner
  2020-06-16 17:13     ` Ian May
  2020-07-14 23:58     ` Bjorn Helgaas
  0 siblings, 2 replies; 8+ messages in thread
From: Lukas Wunner @ 2020-06-15 18:56 UTC (permalink / raw)
  To: Ian May; +Cc: linux-pci

On Mon, Jun 15, 2020 at 09:32:49AM -0500, Ian May wrote:
> Currently when a hotplug interrupt and AER recovery triggers simultaneously
> the following deadlock can occur.
> 
>         Hotplug				       AER
> ----------------------------       ---------------------------
> device_lock(&dev->dev)		   down_write(&ctrl->reset_lock)
> down_read(&ctrl->reset_lock)       device_lock(&dev->dev)
> 
> This patch adds a reset_lock and reset_unlock hotplug_slot_op.
> This would allow the controller reset_lock/reset_unlock to be moved
> from pciehp_reset_slot to pci_slot_reset function allowing the controller
> reset_lock to be acquired before the device lock allowing for both hotplug
> and AER to grab the reset_lock and device lock in proper order.

I've posted a patch to address such issues on Mar 31, just haven't
gotten around to respin it with a proper commit message:

https://lore.kernel.org/linux-pci/20200331130139.46oxbade6rcbaicb@wunner.de/

I've solved it by moving the reset lock into struct pci_slot.
I think that's simpler than adding two callbacks.

Do you think the AER deadlock could be fixed based on my approach?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
  2020-06-15 18:56   ` Lukas Wunner
@ 2020-06-16 17:13     ` Ian May
  2020-07-17  5:20       ` Lukas Wunner
  2020-07-14 23:58     ` Bjorn Helgaas
  1 sibling, 1 reply; 8+ messages in thread
From: Ian May @ 2020-06-16 17:13 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: linux-pci

Hi Lukas,

Thanks for the quick reply!  I like your solution and have confirmed it
solves the first deadlock we see between the Hotplug interrupt and AER
recovery.

Thanks,
Ian

On 6/15/20 1:56 PM, Lukas Wunner wrote:
> On Mon, Jun 15, 2020 at 09:32:49AM -0500, Ian May wrote:
>> Currently when a hotplug interrupt and AER recovery triggers simultaneously
>> the following deadlock can occur.
>>
>>         Hotplug				       AER
>> ----------------------------       ---------------------------
>> device_lock(&dev->dev)		   down_write(&ctrl->reset_lock)
>> down_read(&ctrl->reset_lock)       device_lock(&dev->dev)
>>
>> This patch adds a reset_lock and reset_unlock hotplug_slot_op.
>> This would allow the controller reset_lock/reset_unlock to be moved
>> from pciehp_reset_slot to pci_slot_reset function allowing the controller
>> reset_lock to be acquired before the device lock allowing for both hotplug
>> and AER to grab the reset_lock and device lock in proper order.
> I've posted a patch to address such issues on Mar 31, just haven't
> gotten around to respin it with a proper commit message:
>
> https://lore.kernel.org/linux-pci/20200331130139.46oxbade6rcbaicb@wunner.de/
>
> I've solved it by moving the reset lock into struct pci_slot.
> I think that's simpler than adding two callbacks.
>
> Do you think the AER deadlock could be fixed based on my approach?
>
> Thanks,
>
> Lukas


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
  2020-06-15 18:56   ` Lukas Wunner
  2020-06-16 17:13     ` Ian May
@ 2020-07-14 23:58     ` Bjorn Helgaas
  1 sibling, 0 replies; 8+ messages in thread
From: Bjorn Helgaas @ 2020-07-14 23:58 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Ian May, linux-pci

On Mon, Jun 15, 2020 at 08:56:50PM +0200, Lukas Wunner wrote:
> On Mon, Jun 15, 2020 at 09:32:49AM -0500, Ian May wrote:
> > Currently when a hotplug interrupt and AER recovery triggers simultaneously
> > the following deadlock can occur.
> > 
> >         Hotplug				       AER
> > ----------------------------       ---------------------------
> > device_lock(&dev->dev)		   down_write(&ctrl->reset_lock)
> > down_read(&ctrl->reset_lock)       device_lock(&dev->dev)
> > 
> > This patch adds a reset_lock and reset_unlock hotplug_slot_op.
> > This would allow the controller reset_lock/reset_unlock to be moved
> > from pciehp_reset_slot to pci_slot_reset function allowing the controller
> > reset_lock to be acquired before the device lock allowing for both hotplug
> > and AER to grab the reset_lock and device lock in proper order.
> 
> I've posted a patch to address such issues on Mar 31, just haven't
> gotten around to respin it with a proper commit message:
> 
> https://lore.kernel.org/linux-pci/20200331130139.46oxbade6rcbaicb@wunner.de/
> 
> I've solved it by moving the reset lock into struct pci_slot.
> I think that's simpler than adding two callbacks.
> 
> Do you think the AER deadlock could be fixed based on my approach?

How should we proceed on this?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
  2020-06-16 17:13     ` Ian May
@ 2020-07-17  5:20       ` Lukas Wunner
       [not found]         ` <CAE1ug=a8PJpKh0Jx2ZZxo5kwQvvK883xs+mzyMWbW8T1oqyKDg@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Lukas Wunner @ 2020-07-17  5:20 UTC (permalink / raw)
  To: Ian May; +Cc: linux-pci

On Tue, Jun 16, 2020 at 12:13:23PM -0500, Ian May wrote:
> Thanks for the quick reply! I like your solution and have confirmed it
> solves the first deadlock we see between the Hotplug interrupt and AER
> recovery.

Thank you for the confirmation (and sorry for the delay).  I'm cooking
up a proper patch right now.

One question regarding your patch [2/2]:  If, instead of this patch,
you change pci_bus_error_reset() to call "device_lock(bridge)" rather
than "mutex_lock(&pci_slot_mutex)", do you still see deadlocks?

Taking the pci_slot_mutex in pci_bus_error_reset() was actually the
right thing to do because it holds the driver of the hotplug port
in place.  (The hotplug port above the bus being reset.)  Without
that, dereferencing slot->hotplug in pci_slot_reset() wouldn't be
safe.  My fear is that acquiring the device_lock() of the bridge
leading to the bus being reset may cause other deadlocks, in particular
in cascaded topologies such as Thunderbolt, which I suspect may be
what you're dealing with.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock
       [not found]         ` <CAE1ug=a8PJpKh0Jx2ZZxo5kwQvvK883xs+mzyMWbW8T1oqyKDg@mail.gmail.com>
@ 2020-08-01 16:14           ` Lukas Wunner
  0 siblings, 0 replies; 8+ messages in thread
From: Lukas Wunner @ 2020-08-01 16:14 UTC (permalink / raw)
  To: Ian May; +Cc: linux-pci, Keith Busch

[cc += Keith]

On Fri, Jul 17, 2020 at 09:02:22AM -0500, Ian May wrote:
> I do now have a "better" patch that I was going to submit to the list
> where I converted the pci_slot_mutex to a rw_semaphore.  Do you see
> any potential problems with changing the lock type?  I attached the
> patch if you are interested in checking it over.

The question is, if pci_slot_mutex is an rw_semaphore, can it happen
that pciehp acquires it for writing, provoking a deadlock like this:

        Hotplug                                AER
	----------------------------       ---------------------------
      1 down_read(&ctrl->reset_lock)
	                                 2 down_read(&pci_slot_mutex)
      3 down_write(&pci_slot_mutex)
                                         4 down_write(&ctrl->reset_lock)
	** DEADLOCK **

I think this can happen if the device inserted into the hotplug slot
contains a PCIe switch which itself has hotplug ports.  That's the
case with Thunderbolt:  Every Thunderbolt device contains a PCIe
switch with hotplug ports to extend the Thunderbolt chain.  E.g.
the PCIe hierarchy looks like this for a Thunderbolt host controller
with a chain of two devices:

Root - Upstream - Downstream - Upstream - Downstream - Upstream - Downstream

(host ...)                     (1st device ...)        (2nd device ...)

When a Thunderbolt device is attached, pci_slot_mutex would be taken
for writing in pci_create_slot():

pciehp_configure_device()
  pci_scan_slot()
    pci_scan_single_device()
      pci_scan_device()
            pci_setup_device()
                pci_dev_assign_slot() # acquire pci_slot_mutex for reading
        pci_device_add() # match_driver = false; device_add()
    pci_bus_add_devices()
      pci_bus_add_device() # match_driver = true;  device_attach()
        device_attach()
          __device_attach()
            __device_attach_driver()
              driver_probe_device()
                pcie_portdrv_probe()
                  pcie_port_device_register()
                    pcie_device_init()
                      device_register()
                        device_add()
                          bus_probe_device()
                            device_initial_probe()
                              __device_attach()
                                __device_attach_driver()
                                  driver_probe_device()
                                    pciehp_probe()
                                      init_slot()
				        pci_hp_initialize()
					  pci_create_slot()
					    down_write(pci_slot_mutex)

(You may want to double-check that I got this right.)

In principle, Keith did the right thing to acquire pci_slot_mutex in
pci_bus_error_reset() for accessing the bus->slots list.

I need to think some more to come up with a solution for this particular
deadlock.  Maybe using a klist and traversing it with klist_iter_init()
(holds a ref on each slot, allowing concurrent list access) or something
along those lines...

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-08-01 16:14 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-15 14:32 [PATCH 0/2] Hotplug interrupt and AER recovery deadlocks Ian May
2020-06-15 14:32 ` [PATCH 1/2] PCIe hotplug interrupt and AER deadlock with reset_lock and device_lock Ian May
2020-06-15 18:56   ` Lukas Wunner
2020-06-16 17:13     ` Ian May
2020-07-17  5:20       ` Lukas Wunner
     [not found]         ` <CAE1ug=a8PJpKh0Jx2ZZxo5kwQvvK883xs+mzyMWbW8T1oqyKDg@mail.gmail.com>
2020-08-01 16:14           ` Lukas Wunner
2020-07-14 23:58     ` Bjorn Helgaas
2020-06-15 14:32 ` [PATCH 2/2] PCIe hotplug interrupt and AER pci_slot_mutex deadlock Ian May

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.