linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
@ 2023-01-05 18:43 Anatoli Antonovitch
  2023-01-05 19:18 ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Anatoli Antonovitch @ 2023-01-05 18:43 UTC (permalink / raw)
  To: linux-pci
  Cc: bhelgaas, lukas, Alexander.Deucher, christian.koenig,
	Anatoli Antonovitch

From: Anatoli Antonovitch <anatoli.antonovitch@amd.com>

It is to avoid any potential issues when S3 resume but at the same time we want to hot-unplug.
Need to do some more testing to see if it is still necessary.

To fix the race between pciehp and AER reported in https://bugzilla.kernel.org/show_bug.cgi?id=215590

INFO: task irq/26-aerdrv:104 blocked for more than 120 seconds.
Tainted: G        W          6.1.0-rc5-custom-master-nov14+ #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:irq/26-aerdrv   state:D stack:0     pid:104   ppid:2      flags:0x00004000
Call Trace:
<TASK>
__schedule+0x39c/0xe90
? rcu_read_lock_sched_held+0x25/0x80
schedule+0x6b/0xf0
rwsem_down_write_slowpath+0x3b2/0x9c0
down_write_nested+0x16b/0x210
pciehp_reset_slot+0x63/0x160
pci_reset_hotplug_slot+0x44/0x70
pci_slot_reset+0x10d/0x190
pci_bus_error_reset+0xb2/0xe0
aer_root_reset+0x144/0x190
pcie_do_recovery+0x15a/0x270
? aer_dev_correctable_show+0xc0/0xc0
aer_process_err_devices+0xcf/0xea
aer_isr.cold+0x52/0xa1
? __kmem_cache_free+0x36a/0x3b0
? irq_thread+0xb0/0x1e0
? irq_thread+0xb0/0x1e0
irq_thread_fn+0x28/0x70
irq_thread+0x106/0x1e0
? irq_forced_thread_fn+0xb0/0xb0
? wake_threads_waitq+0x40/0x40
? irq_thread_check_affinity+0xf0/0xf0
kthread+0x10a/0x130
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>

INFO: task irq/26-pciehp:105 blocked for more than 120 seconds.
Tainted: G        W          6.1.0-rc5-custom-master-nov14+ #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:irq/26-pciehp   state:D stack:0     pid:105   ppid:2      flags:0x00004000
Call Trace:
<TASK>
__schedule+0x39c/0xe90
schedule+0x6b/0xf0
schedule_preempt_disabled+0x18/0x30
__mutex_lock+0x685/0xf60
? rcu_read_lock_sched_held+0x25/0x80
? lock_release+0x24d/0x410
? __device_driver_lock+0x2d/0x50
mutex_lock_nested+0x1b/0x30
? mutex_lock_nested+0x1b/0x30
__device_driver_lock+0x2d/0x50
device_release_driver_internal+0x1f/0x170
device_release_driver+0x12/0x20
pci_stop_bus_device+0x74/0xa0
pci_stop_bus_device+0x30/0xa0
pci_stop_and_remove_bus_device+0x13/0x30
pciehp_unconfigure_device+0x80/0x140
pciehp_disable_slot+0x6e/0x110
pciehp_handle_presence_or_link_change+0xf1/0x310
pciehp_ist+0x1a0/0x1b0
? irq_thread+0xb0/0x1e0
irq_thread_fn+0x28/0x70
irq_thread+0x106/0x1e0
? irq_forced_thread_fn+0xb0/0xb0
? wake_threads_waitq+0x40/0x40
? irq_thread_check_affinity+0xf0/0xf0
kthread+0x10a/0x130
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>

Signed-off-by: Anatoli Antonovitch <anatoli.antonovitch@amd.com>
---
 drivers/pci/hotplug/pciehp_hpc.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 10e9670eea0b..b1084e67f798 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -27,6 +27,8 @@
 #include "../pci.h"
 #include "pciehp.h"
 
+static DECLARE_RWSEM(hotplug_slot_rwsem);
+
 static const struct dmi_system_id inband_presence_disabled_dmi_table[] = {
 	/*
 	 * Match all Dell systems, as some Dell systems have inband
@@ -911,7 +913,10 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, bool probe)
 	if (probe)
 		return 0;
 
-	down_write_nested(&ctrl->reset_lock, ctrl->depth);
+	if (ctrl->depth > 0)
+		down_write_nested(&ctrl->reset_lock, ctrl->depth);
+	else
+		down_write(&hotplug_slot_rwsem);
 
 	if (!ATTN_BUTTN(ctrl)) {
 		ctrl_mask |= PCI_EXP_SLTCTL_PDCE;
@@ -931,7 +936,11 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, bool probe)
 	ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
 		 pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask);
 
-	up_write(&ctrl->reset_lock);
+	if (ctrl->depth > 0)
+		up_write(&ctrl->reset_lock);
+	else
+		up_write(&hotplug_slot_rwsem);
+
 	return rc;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
  2023-01-05 18:43 [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock Anatoli Antonovitch
@ 2023-01-05 19:18 ` Bjorn Helgaas
       [not found]   ` <5b9a2dc7-54ab-82c5-81e7-1770a4ec891c@amd.com>
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2023-01-05 19:18 UTC (permalink / raw)
  To: Anatoli Antonovitch
  Cc: linux-pci, bhelgaas, lukas, Alexander.Deucher, christian.koenig,
	Anatoli Antonovitch

On Thu, Jan 05, 2023 at 01:43:55PM -0500, Anatoli Antonovitch wrote:
> From: Anatoli Antonovitch <anatoli.antonovitch@amd.com>
> 
> It is to avoid any potential issues when S3 resume but at the same time we want to hot-unplug.
> Need to do some more testing to see if it is still necessary.

I'm not sure how to interpret this.

I guess I'll wait for you to do the testing and repost this if it does
turn out to be necessary?

I don't want to merge a patch with a commit log that says "need to do
more testing."

Bjorn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
       [not found]   ` <5b9a2dc7-54ab-82c5-81e7-1770a4ec891c@amd.com>
@ 2023-01-16 18:37     ` Bjorn Helgaas
  0 siblings, 0 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2023-01-16 18:37 UTC (permalink / raw)
  To: Anatoli Antonovitch
  Cc: Bjorn Helgaas, Anatoli Antonovitch, linux-pci, bhelgaas, lukas,
	Alexander.Deucher, christian.koenig

On Mon, Jan 16, 2023 at 9:24 AM Anatoli Antonovitch
<anatoli.antonovitch@amd.com> wrote:
>
> Thanks Bjorn,
>
> The extensive testing has been done for hot-plug, hot-unplug the device and ACPI S3 suspend/resume cycles on system with AER enabled.
>
> The patch has bee resent.

Thanks.  Just FYI, the mailing lists generally reject HTML email, so
your response doesn't appear in the archives:
https://lore.kernel.org/linux-pci/20230105184355.9829-1-a.antonovitch@gmail.com/

See http://vger.kernel.org/majordomo-info.html,
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style,
https://people.kernel.org/tglx/notes-about-netiquette .

Not a big deal for this response, but it is important for more
substantive conversations.

> On 2023-01-05 14:18, Bjorn Helgaas wrote:
>
> On Thu, Jan 05, 2023 at 01:43:55PM -0500, Anatoli Antonovitch wrote:
>
> From: Anatoli Antonovitch <anatoli.antonovitch@amd.com>
>
> It is to avoid any potential issues when S3 resume but at the same time we want to hot-unplug.
> Need to do some more testing to see if it is still necessary.
>
> I'm not sure how to interpret this.
>
> I guess I'll wait for you to do the testing and repost this if it does
> turn out to be necessary?
>
> I don't want to merge a patch with a commit log that says "need to do
> more testing."
>
> Bjorn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
  2023-02-19 20:21               ` Lukas Wunner
@ 2023-04-10 20:36                 ` Anatoli Antonovitch
  0 siblings, 0 replies; 13+ messages in thread
From: Anatoli Antonovitch @ 2023-04-10 20:36 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Bjorn Helgaas, Anatoli Antonovitch, linux-pci, Koenig, Christian,
	Deucher, Alexander

Thanks Lukas.

The patch has been tested with current kernel 6.3.0-rc5 on the same setup.
The deadlock between reset_lock and device_lock has been fixed.

See details in the dmesg log: dmesg_6.3.0-rc5_fix.txt in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215590

Thanks,
Anatoli

On 2023-02-19 15:21, Lukas Wunner wrote:
> On Fri, Feb 17, 2023 at 06:37:54PM +0000, Deucher, Alexander wrote:
>>> From: Bjorn Helgaas <helgaas@kernel.org>
>>> On Mon, Feb 13, 2023 at 09:59:52AM -0500, Anatoli Antonovitch wrote:
>>>> On 2023-01-23 14:30, Anatoli Antonovitch wrote:
>>>>> I do not see a deadlock, when applying the following old patch:
>>>>> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
>>>> Can we revisit the patches again to get a fix?
>>>> The issue still reproduce and visible in the kernel 6.2.0-rc8.
>>> This old patch would need to be updated and reposted.  There was a 0-day
>>> bot issue and a question to be resolved.  Maybe this is all already resolved,
>>> but it needs to be posted and tested with a current kernel.
>> Lukas, can you resend that patch?  We can test it.
> I'm working on a patch which aims to solve these deadlocks differently,
> by reducing the critical sections for which the reset_lock is held.
> Please stand by.
>
> Thanks,
>
> Lukas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
  2023-02-17 18:37             ` Deucher, Alexander
@ 2023-02-19 20:21               ` Lukas Wunner
  2023-04-10 20:36                 ` Anatoli Antonovitch
  0 siblings, 1 reply; 13+ messages in thread
From: Lukas Wunner @ 2023-02-19 20:21 UTC (permalink / raw)
  To: Deucher, Alexander
  Cc: Bjorn Helgaas, Anatoli Antonovitch, linux-pci, Koenig, Christian

On Fri, Feb 17, 2023 at 06:37:54PM +0000, Deucher, Alexander wrote:
> > From: Bjorn Helgaas <helgaas@kernel.org>
> > On Mon, Feb 13, 2023 at 09:59:52AM -0500, Anatoli Antonovitch wrote:
> > > On 2023-01-23 14:30, Anatoli Antonovitch wrote:
> > > > I do not see a deadlock, when applying the following old patch:
> > > > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
> > > 
> > > Can we revisit the patches again to get a fix?
> > > The issue still reproduce and visible in the kernel 6.2.0-rc8.
> > 
> > This old patch would need to be updated and reposted.  There was a 0-day
> > bot issue and a question to be resolved.  Maybe this is all already resolved,
> > but it needs to be posted and tested with a current kernel.
> 
> Lukas, can you resend that patch?  We can test it.

I'm working on a patch which aims to solve these deadlocks differently,
by reducing the critical sections for which the reset_lock is held.
Please stand by.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
  2023-02-17 16:03           ` Bjorn Helgaas
@ 2023-02-17 18:37             ` Deucher, Alexander
  2023-02-19 20:21               ` Lukas Wunner
  0 siblings, 1 reply; 13+ messages in thread
From: Deucher, Alexander @ 2023-02-17 18:37 UTC (permalink / raw)
  To: Bjorn Helgaas, Antonovitch, Anatoli
  Cc: Lukas Wunner, Anatoli Antonovitch, linux-pci, bhelgaas, Koenig,
	Christian

[Public]

> -----Original Message-----
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Friday, February 17, 2023 11:04 AM
> To: Antonovitch, Anatoli <Anatoli.Antonovitch@amd.com>
> Cc: Lukas Wunner <lukas@wunner.de>; Anatoli Antonovitch
> <a.antonovitch@gmail.com>; linux-pci@vger.kernel.org;
> bhelgaas@google.com; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>
> Subject: Re: [PATCH] PCI/hotplug: Replaced down_write_nested with
> hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
> 
> On Mon, Feb 13, 2023 at 09:59:52AM -0500, Anatoli Antonovitch wrote:
> > Hi Lukas,
> >
> > Can we revisit the patches again to get a fix?
> > The issue still reproduce and visible in the kernel 6.2.0-rc8.
> 
> > On 2023-01-23 14:30, Anatoli Antonovitch wrote:
> > > I do not see a deadlock, when applying the following old patch:
> > > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73
> > > dab4b6.1595329748.git.lukas@wunner.de/
> 
> This old patch would need to be updated and reposted.  There was a 0-day
> bot issue and a question to be resolved.  Maybe this is all already resolved,
> but it needs to be posted and tested with a current kernel.

Lukas, can you resend that patch?  We can test it.

Alex

> 
> > > after merge for the kernel 6.2.0-rc5, and applied the alternative patch:
> > > https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e3
> > > 7d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/
> 
> This one is on track to appear in v6.3-rc1:
> https://git.kernel.org/cgit/linux/kernel/git/pci/pci.git/commit/?id=74ff8864c
> c84
> 
> > > I have uploaded the merged patch and the system log for the upstream
> > > kernel.
> > >
> > > Anatoli
> > >
> > >
> > > On 2023-01-21 02:21, Lukas Wunner wrote:
> > > > You're now getting a different deadlock. That one is addressed by
> > > > this old patch (it's already linked from the bugzilla):
> > > >
> > > > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d
> > > > 73dab4b6.1595329748.git.lukas@wunner.de/
> > > >
> > > >
> > > > If you apply that patch plus the new one, do you still see a deadlock?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
  2023-02-13 14:59         ` Anatoli Antonovitch
@ 2023-02-17 16:03           ` Bjorn Helgaas
  2023-02-17 18:37             ` Deucher, Alexander
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2023-02-17 16:03 UTC (permalink / raw)
  To: Anatoli Antonovitch
  Cc: Lukas Wunner, Anatoli Antonovitch, linux-pci, bhelgaas,
	Alexander.Deucher, Christian.Koenig

On Mon, Feb 13, 2023 at 09:59:52AM -0500, Anatoli Antonovitch wrote:
> Hi Lukas,
> 
> Can we revisit the patches again to get a fix?
> The issue still reproduce and visible in the kernel 6.2.0-rc8.

> On 2023-01-23 14:30, Anatoli Antonovitch wrote:
> > I do not see a deadlock, when applying the following old patch:
> > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/

This old patch would need to be updated and reposted.  There was a
0-day bot issue and a question to be resolved.  Maybe this is all
already resolved, but it needs to be posted and tested with a current
kernel.

> > after merge for the kernel 6.2.0-rc5, and applied the alternative patch:
> > https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/

This one is on track to appear in v6.3-rc1:
https://git.kernel.org/cgit/linux/kernel/git/pci/pci.git/commit/?id=74ff8864cc84

> > I have uploaded the merged patch and the system log for the upstream
> > kernel.
> > 
> > Anatoli
> > 
> > 
> > On 2023-01-21 02:21, Lukas Wunner wrote:
> > > You're now getting a different deadlock. That one is addressed by this
> > > old patch (it's already linked from the bugzilla):
> > > 
> > > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
> > > 
> > > 
> > > If you apply that patch plus the new one, do you still see a deadlock?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
  2023-01-23 19:30       ` Anatoli Antonovitch
@ 2023-02-13 14:59         ` Anatoli Antonovitch
  2023-02-17 16:03           ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Anatoli Antonovitch @ 2023-02-13 14:59 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Anatoli Antonovitch, linux-pci, bhelgaas, Alexander.Deucher,
	Christian.Koenig

Hi Lukas,

Can we revisit the patches again to get a fix?
The issue still reproduce and visible in the kernel 6.2.0-rc8.

Thanks,
Anatoli

On 2023-01-23 14:30, Anatoli Antonovitch wrote:
> I do not see a deadlock, when applying the following old patch:
> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/ 
>
> after merge for the kernel 6.2.0-rc5, and applied the alternative patch:
> https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/ 
>
>
> I have uploaded the merged patch and the system log for the upstream 
> kernel.
>
> Anatoli
>
>
> On 2023-01-21 02:21, Lukas Wunner wrote:
>> You're now getting a different deadlock. That one is addressed by this
>> old patch (it's already linked from the bugzilla):
>>
>> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/ 
>>
>>
>> If you apply that patch plus the new one, do you still see a deadlock?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
  2023-01-21  7:21     ` Lukas Wunner
@ 2023-01-23 19:30       ` Anatoli Antonovitch
  2023-02-13 14:59         ` Anatoli Antonovitch
  0 siblings, 1 reply; 13+ messages in thread
From: Anatoli Antonovitch @ 2023-01-23 19:30 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Anatoli Antonovitch, linux-pci, bhelgaas, Alexander.Deucher,
	christian.koenig

I do not see a deadlock, when applying the following old patch:
https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
after merge for the kernel 6.2.0-rc5, and applied the alternative patch:
https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/

I have uploaded the merged patch and the system log for the upstream kernel.

Anatoli


On 2023-01-21 02:21, Lukas Wunner wrote:
> You're now getting a different deadlock.  That one is addressed by this
> old patch (it's already linked from the bugzilla):
>
> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
>
> If you apply that patch plus the new one, do you still see a deadlock?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
  2023-01-20 21:35   ` Anatoli Antonovitch
@ 2023-01-21  7:21     ` Lukas Wunner
  2023-01-23 19:30       ` Anatoli Antonovitch
  0 siblings, 1 reply; 13+ messages in thread
From: Lukas Wunner @ 2023-01-21  7:21 UTC (permalink / raw)
  To: Anatoli Antonovitch
  Cc: Anatoli Antonovitch, linux-pci, bhelgaas, Alexander.Deucher,
	christian.koenig

On Fri, Jan 20, 2023 at 04:35:01PM -0500, Anatoli Antonovitch wrote:
> On 2023-01-20 04:28, Lukas Wunner wrote:
> > I've just submitted an alternative patch to fix this, could you give
> > it a spin and see if the issue goes away?
> > 
> > https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/
> 
> The patch has been tested on the same setup.
> Unfortunately, this alternative approach with optimization and simplified
> locking is not resolve the issue in
> https://bugzilla.kernel.org/show_bug.cgi?id=215590
> 
> I have uploaded the log dmesg_6_2_rc4_hotadd_aer_fix_a6bd101b8f84.txt into
> the bugzilla for your patch.

You're now getting a different deadlock.  That one is addressed by this
old patch (it's already linked from the bugzilla):

https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/

If you apply that patch plus the new one, do you still see a deadlock?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
  2023-01-20  9:28 ` Lukas Wunner
@ 2023-01-20 21:35   ` Anatoli Antonovitch
  2023-01-21  7:21     ` Lukas Wunner
  0 siblings, 1 reply; 13+ messages in thread
From: Anatoli Antonovitch @ 2023-01-20 21:35 UTC (permalink / raw)
  To: Lukas Wunner, Anatoli Antonovitch
  Cc: linux-pci, bhelgaas, Alexander.Deucher, christian.koenig

Hi Lukas,

The patch has been tested on the same setup.
Unfortunately, this alternative approach with optimization and simplified
locking is not resolve the issue in
https://bugzilla.kernel.org/show_bug.cgi?id=215590

I have uploaded the log dmesg_6_2_rc4_hotadd_aer_fix_a6bd101b8f84.txt into
the bugzilla for your patch.

Thanks,

Anatoli


On 2023-01-20 04:28, Lukas Wunner wrote:
> I've just submitted an alternative patch to fix this, could you give
> it a spin and see if the issue goes away?
>
> https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
  2023-01-13 17:01 Anatoli Antonovitch
@ 2023-01-20  9:28 ` Lukas Wunner
  2023-01-20 21:35   ` Anatoli Antonovitch
  0 siblings, 1 reply; 13+ messages in thread
From: Lukas Wunner @ 2023-01-20  9:28 UTC (permalink / raw)
  To: Anatoli Antonovitch
  Cc: linux-pci, bhelgaas, Alexander.Deucher, christian.koenig,
	Anatoli Antonovitch

Hi Anatoli,

On Fri, Jan 13, 2023 at 12:01:31PM -0500, Anatoli Antonovitch wrote:
> It is to avoid any potential issues when S3 resume but at the same
> time we want to hot-unplug.
> 
> To fix the race between pciehp and AER reported in
> https://bugzilla.kernel.org/show_bug.cgi?id=215590

I've just submitted an alternative patch to fix this, could you give
it a spin and see if the issue goes away?

https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/

That alternative approach is preferable IMO because it also solves the
problem that marking devices as permanently offline isn't possible
concurrently to driver bind/unbind at the moment.  Additionally,
the alternative patch simplifies locking and reduces code size.

Thanks and sorry for my belated response.

Lukas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock.
@ 2023-01-13 17:01 Anatoli Antonovitch
  2023-01-20  9:28 ` Lukas Wunner
  0 siblings, 1 reply; 13+ messages in thread
From: Anatoli Antonovitch @ 2023-01-13 17:01 UTC (permalink / raw)
  To: linux-pci
  Cc: bhelgaas, lukas, Alexander.Deucher, christian.koenig,
	Anatoli Antonovitch

From: Anatoli Antonovitch <anatoli.antonovitch@amd.com>

It is to avoid any potential issues when S3 resume but at the same time we want to hot-unplug.

To fix the race between pciehp and AER reported in https://bugzilla.kernel.org/show_bug.cgi?id=215590

INFO: task irq/26-aerdrv:95 blocked for more than 120 seconds.
Tainted: G        W          6.2.0-rc3-custom-norework-jan11+ #29
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:irq/26-aerdrv   state:D stack:0     pid:95    ppid:2      flags:0x00004000
Call Trace:
<TASK>
__schedule+0x3df/0xec0
? rcu_read_lock_held_common+0x12/0x60
schedule+0x6b/0x100
rwsem_down_write_slowpath+0x3b2/0x9c0
down_write_nested+0x16b/0x220
pciehp_reset_slot+0x63/0x160
pci_reset_hotplug_slot+0x44/0x80
pci_slot_reset+0x10d/0x1a0
pci_bus_error_reset+0xb2/0xe0
aer_root_reset+0x144/0x1a0
pcie_do_recovery+0x15a/0x280
? __pfx_aer_root_reset+0x20/0x20
aer_process_err_devices+0xfa/0x115
aer_isr.cold+0x52/0xa1
? __kmem_cache_free+0x36a/0x3c0
? irq_thread+0xb0/0x1e0
? irq_thread+0xb0/0x1e0
irq_thread_fn+0x28/0x80
irq_thread+0x106/0x1e0
? __pfx_irq_thread_fn+0x20/0x20
? __pfx_irq_thread_dtor+0x20/0x20
? __pfx_irq_thread+0x20/0x20
kthread+0x10a/0x140
? __pfx_kthread+0x20/0x20
ret_from_fork+0x35/0x60
</TASK>

INFO: task irq/26-pciehp:96 blocked for more than 120 seconds.
Tainted: G        W          6.2.0-rc3-custom-norework-jan11+ #29
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:irq/26-pciehp   state:D stack:0     pid:96    ppid:2      flags:0x00004000
Call Trace:
<TASK>
__schedule+0x3df/0xec0
? rcu_read_lock_sched_held+0x25/0x80
schedule+0x6b/0x100
schedule_preempt_disabled+0x18/0x40
__mutex_lock+0x685/0xf60
? rcu_read_lock_sched_held+0x25/0x80
? rcu_read_lock_held_common+0x12/0x60
? pci_dev_set_disconnected+0x1b/0x80
mutex_lock_nested+0x1b/0x40
? mutex_lock_nested+0x1b/0x40
pci_dev_set_disconnected+0x1b/0x80
? __pfx_pci_dev_set_disconnected+0x20/0x20
pci_walk_bus+0x48/0xa0
pciehp_unconfigure_device+0x129/0x140
pciehp_disable_slot+0x6e/0x120
pciehp_handle_presence_or_link_change+0xf1/0x320
pciehp_ist+0x1a0/0x1c0
? irq_thread+0xb0/0x1e0
irq_thread_fn+0x28/0x80
irq_thread+0x106/0x1e0
? __pfx_irq_thread_fn+0x20/0x20
? __pfx_irq_thread_dtor+0x20/0x20
? __pfx_irq_thread+0x20/0x20
kthread+0x10a/0x140
? __pfx_kthread+0x20/0x20
ret_from_fork+0x35/0x60
</TASK>

Signed-off-by: Anatoli Antonovitch <anatoli.antonovitch@amd.com>
---
 drivers/pci/hotplug/pciehp_hpc.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 10e9670eea0b..b1084e67f798 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -27,6 +27,8 @@
 #include "../pci.h"
 #include "pciehp.h"
 
+static DECLARE_RWSEM(hotplug_slot_rwsem);
+
 static const struct dmi_system_id inband_presence_disabled_dmi_table[] = {
 	/*
 	 * Match all Dell systems, as some Dell systems have inband
@@ -911,7 +913,10 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, bool probe)
 	if (probe)
 		return 0;
 
-	down_write_nested(&ctrl->reset_lock, ctrl->depth);
+	if (ctrl->depth > 0)
+		down_write_nested(&ctrl->reset_lock, ctrl->depth);
+	else
+		down_write(&hotplug_slot_rwsem);
 
 	if (!ATTN_BUTTN(ctrl)) {
 		ctrl_mask |= PCI_EXP_SLTCTL_PDCE;
@@ -931,7 +936,11 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, bool probe)
 	ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
 		 pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask);
 
-	up_write(&ctrl->reset_lock);
+	if (ctrl->depth > 0)
+		up_write(&ctrl->reset_lock);
+	else
+		up_write(&hotplug_slot_rwsem);
+
 	return rc;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-04-10 20:36 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-05 18:43 [PATCH] PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock Anatoli Antonovitch
2023-01-05 19:18 ` Bjorn Helgaas
     [not found]   ` <5b9a2dc7-54ab-82c5-81e7-1770a4ec891c@amd.com>
2023-01-16 18:37     ` Bjorn Helgaas
2023-01-13 17:01 Anatoli Antonovitch
2023-01-20  9:28 ` Lukas Wunner
2023-01-20 21:35   ` Anatoli Antonovitch
2023-01-21  7:21     ` Lukas Wunner
2023-01-23 19:30       ` Anatoli Antonovitch
2023-02-13 14:59         ` Anatoli Antonovitch
2023-02-17 16:03           ` Bjorn Helgaas
2023-02-17 18:37             ` Deucher, Alexander
2023-02-19 20:21               ` Lukas Wunner
2023-04-10 20:36                 ` Anatoli Antonovitch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).