linux-pci.vger.kernel.org archive mirror
* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
       [not found] <0fc31d9a-f414-a412-3765-5519cbb9b7ff@amd.com>
@ 2022-02-09 21:28 ` Andrey Grodzovsky
  2022-02-10  6:23 ` Lukas Wunner
  1 sibling, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-02-09 21:28 UTC (permalink / raw)
  To: linux-pci, helgaas, Lukas Wunner
  Cc: anatoli.antonovitch, Kumar1, Rahul, Alexander.Deucher

I got a 'Message too long (>100000 chars)' bounce, so I am
resending with only the essential part of the log inline here.

[   56.138636] ACPI: Waking up from system sleep state S3
[   56.140541] pcieport 0000:01:00.0: Refused to change power state, currently in D3
[   56.143542] pcieport 0000:02:00.0: Refused to change power state, currently in D3
[   56.146517] amdgpu 0000:03:00.0: Refused to change power state, currently in D3
[   56.209416] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[   56.209438] pcieport 0000:00:01.1: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[   56.209440] pcieport 0000:00:01.1: AER:   device [1022:15d3] error status/mask=00004000/04400000
[   56.209441] pcieport 0000:00:01.1: AER:    [14] CmpltTO                (First)
[   56.209817] sd 0:0:0:0: [sda] Starting disk
[   56.211483] [drm] PCIE GART of 1024M enabled.
[   56.211484] [drm] PTB located at 0x000000F400E10000
[   56.211508] [drm] PSP is resuming...
[   56.231386] [drm] reserve 0x400000 from 0xf41fc00000 for PSP TMR
[   56.312520] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
[   56.320623] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   56.326446] [drm] kiq ring mec 2 pipe 1 q 0
[   56.326919] amdgpu: restore the fine grain parameters
[   56.539633] [drm] VCN decode and encode initialized successfully(under SPG Mode).
[   56.539655] amdgpu 0000:05:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[   56.539656] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[   56.539657] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[   56.539658] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[   56.539660] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[   56.539661] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[   56.539662] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[   56.539663] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[   56.539664] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[   56.539665] amdgpu 0000:05:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[   56.539666] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[   56.539667] amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
[   56.539668] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
[   56.539669] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
[   56.539670] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
[   56.685926] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   56.686175] ata1.00: supports DRM functions and may not be fully accessible
[   56.686848] ata1.00: disabling queued TRIM support
[   56.688408] ata1.00: supports DRM functions and may not be fully accessible
[   56.688925] ata1.00: disabling queued TRIM support
[   56.690217] ata1.00: configured for UDMA/133
[   57.246588] pcieport 0000:00:01.1: AER: Root Port link has been reset
[   57.246635] pcieport 0000:00:01.1: AER: Device recovery failed
[   57.246668] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[   57.247019] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[   57.247198] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[   57.247212] pcieport 0000:00:01.1: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[   57.247214] pcieport 0000:00:01.1: AER:   device [1022:15d3] error status/mask=00004000/04400000
[   57.247217] pcieport 0000:00:01.1: AER:    [14] CmpltTO                (First)
[   59.038917] pci 0000:03:00.0: Removing from iommu group 21
[   59.039314] pci_bus 0000:03: busn_res: [bus 03] is released
[   59.039790] acpi LNXPOWER:08: Turning OFF
[   59.040014] acpi LNXPOWER:07: Turning OFF
[   59.040296] acpi LNXPOWER:04: Turning OFF
[   59.040500] acpi LNXPOWER:03: Turning OFF
[   59.040682] OOM killer enabled.
[   59.040682] Restarting tasks ...
[   59.041112] systemd-journald[342]: /dev/kmsg buffer overrun, some messages lost.
[   59.047174] done.
[   59.047182] PM: suspend exit
[   61.382560] show_signal_msg: 29 callbacks suppressed
[   61.382563] glmark2[1891]: segfault at 0 ip 00007fdebc1cbd85 sp 00007ffd56800870 error 4 in radeonsi_dri.so[7fdebb972000+a94000]
[   61.382574] Code: 00 4c 39 ed 74 6f 49 89 fc eb 1f 66 2e 0f 1f 84 00 00 00 00 00 48 89 ef e8 08 a2 7a ff 49 8b ac 24 e0 77 00 00 4c 39 ed 74 4b <48> 8b 55 00 48 8b 45 08 48 8b 5d 10 48 89 42 08 48 89 10 48 c7 45
[  243.354138] INFO: task irq/26-aerdrv:170 blocked for more than 120 seconds.
[  243.354145]       Not tainted 5.4.2-10-feb+ #51
[  243.354147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.354150] irq/26-aerdrv   D    0   170      2 0x80004000
[  243.354156] Call Trace:
[  243.354170]  ? __schedule+0x2e3/0x740
[  243.354173]  schedule+0x39/0xa0
[  243.354179]  rwsem_down_write_slowpath+0x244/0x4d0
[  243.354183]  ? schedule+0x39/0xa0
[  243.354186]  ? schedule_preempt_disabled+0xa/0x10
[  243.354192]  pciehp_reset_slot+0x51/0x150
[  243.354198]  pci_reset_hotplug_slot+0x3c/0x60
[  243.354202]  pci_slot_reset+0x107/0x130
[  243.354205]  pci_bus_error_reset+0xf3/0x120
[  243.354210]  aer_root_reset+0x5c/0xf0
[  243.354214]  pcie_do_recovery+0x13e/0x275
[  243.354217]  aer_process_err_devices+0xb2/0xc7
[  243.354220]  aer_isr.cold+0x50/0x9f
[  243.354223]  ? __schedule+0x2eb/0x740
[  243.354228]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
[  243.354230]  irq_thread_fn+0x20/0x60
[  243.354234]  irq_thread+0xdc/0x170
[  243.354237]  ? irq_forced_thread_fn+0x80/0x80
[  243.354241]  kthread+0xf9/0x130
[  243.354245]  ? irq_thread_check_affinity+0xf0/0xf0
[  243.354247]  ? kthread_park+0x90/0x90
[  243.354250]  ret_from_fork+0x22/0x40
[  243.354255] INFO: task irq/26-pciehp:171 blocked for more than 120 seconds.
[  243.354257]       Not tainted 5.4.2-10-feb+ #51
[  243.354259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.354261] irq/26-pciehp   D    0   171      2 0x80004000
[  243.354263] Call Trace:
[  243.354266]  ? __schedule+0x2e3/0x740
[  243.354269]  schedule+0x39/0xa0
[  243.354271]  schedule_preempt_disabled+0xa/0x10
[  243.354274]  __mutex_lock.isra.0+0x182/0x4f0
[  243.354279]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
[  243.354284]  device_del+0x35/0x370
[  243.354288]  pci_remove_bus_device+0x77/0x100
[  243.354292]  pci_remove_bus_device+0x2e/0x100
[  243.354296]  pciehp_unconfigure_device+0x7c/0x12f
[  243.354299]  pciehp_disable_slot+0x6b/0x100
[  243.354303]  pciehp_handle_presence_or_link_change+0xdc/0x140
[  243.354306]  pciehp_ist+0x10f/0x120
[  243.354309]  irq_thread_fn+0x20/0x60
[  243.354312]  irq_thread+0xdc/0x170
[  243.354316]  ? irq_forced_thread_fn+0x80/0x80
[  243.354318]  kthread+0xf9/0x130
[  243.354321]  ? irq_thread_check_affinity+0xf0/0xf0
[  243.354323]  ? kthread_park+0x90/0x90
[  243.354326]  ret_from_fork+0x22/0x40

Andrey



On 2022-02-09 14:54, Andrey Grodzovsky wrote:
> Hi, on kernel based on 5.4.2 we are observing a deadlock between
> reset_lock semaphore and device_lock (dev->mutex). The scenario
> we do is putting the system to sleep, disconnecting the eGPU
> from the PCIe bus (through a special SBIOS setting) or by simply
> removing power to external PCIe cage and waking the
> system up.
> 
> I attached the log. Please advise if you have any idea how
> to work around it ? Since the kernel is old, does anyone
> have an idea if this issue is known and already solved in later kernels ?
> We cannot try with latest since our kernel is custom for that platform.
> 
> Thanks,
> Andrey
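
For readers decoding the "error status/mask=00004000/04400000" line in the log above, here is a minimal Python sketch. It is not kernel code. The bit positions follow the AER Uncorrectable Error Status register layout in the PCIe specification, only a subset of bits is listed, and the short labels other than the CmpltTO label printed in the log are abbreviations chosen for this example. The function name decode_status_mask is hypothetical.

```python
# Subset of AER Uncorrectable Error Status bits (positions per the PCIe spec).
# "CmpltTO" matches the label in the log; the other labels are
# abbreviations picked for this illustration only.
UNCOR_BITS = {
    4: "DataLinkProtocol",
    12: "PoisonedTLP",
    14: "CmpltTO",
    15: "CompleterAbort",
    16: "UnexpectedCompletion",
    18: "MalformedTLP",
    20: "UnsupportedRequest",
    22: "UncorrectableInternal",
    26: "PoisonedTLPEgressBlocked",
}


def decode_status_mask(field: str) -> dict:
    """Decode a 'status/mask' pair of hex words as printed by the AER driver."""
    status_hex, mask_hex = field.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)

    def names(value: int) -> list:
        # Walk all 32 bits; unknown bits are reported as "bitN".
        return [UNCOR_BITS.get(bit, f"bit{bit}")
                for bit in range(32) if (value >> bit) & 1]

    return {
        "reported": names(status),
        "masked": names(mask),
        "reported_unmasked": names(status & ~mask),
    }


if __name__ == "__main__":
    print(decode_status_mask("00004000/04400000"))
```

For the value in the log, the only status bit set is bit 14 (Completion Timeout), and it is not covered by the mask.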

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
       [not found] <0fc31d9a-f414-a412-3765-5519cbb9b7ff@amd.com>
  2022-02-09 21:28 ` Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device Andrey Grodzovsky
@ 2022-02-10  6:23 ` Lukas Wunner
  2022-02-10 14:39   ` Andrey Grodzovsky
  2022-02-10 20:47   ` Andrey Grodzovsky
  1 sibling, 2 replies; 15+ messages in thread
From: Lukas Wunner @ 2022-02-10  6:23 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: linux-pci, helgaas, anatoli.antonovitch, Kumar1, Rahul,
	Alexander.Deucher

On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
> Hi, on kernel based on 5.4.2 we are observing a deadlock between
> reset_lock semaphore and device_lock (dev->mutex). The scenario
> we do is putting the system to sleep, disconnecting the eGPU
> from the PCIe bus (through a special SBIOS setting) or by simply
> removing power to external PCIe cage and waking the
> system up.
> 
> I attached the log. Please advise if you have any idea how
> to work around it ? Since the kernel is old, does anyone
> have an idea if this issue is known and already solved in later kernels ?
> We cannot try with latest since our kernel is custom for that platform.

It is a known issue.  Here's a fix I submitted during the v5.9 cycle:

https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/

The fix hasn't been applied yet.  I think I need to rework the patch,
just haven't found the time.

Since the trigger in your case are AER-handled errors during a
system sleep transition, you may also want to consider the
following 2-patch series by Kai-Heng Feng which is currently
under discussion:

https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/

That series disables AER during a system sleep transition and
should thus prevent the flood of AER-handled errors you're seeing.
Once AER is disabled, the reset-induced deadlocks should go away as well.

Thanks,

Lukas
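
The lock inversion behind the two hung tasks can be shown in miniature with a user-space Python sketch. This is only an illustration, not kernel code. It assumes, as inferred from the call traces and Andrey's description, that the AER thread holds device_lock while waiting for reset_lock, and that the pciehp thread holds reset_lock while waiting for device_lock. Plain mutexes stand in for both the mutex and the rwsem, and timeouts are used so the demonstration terminates instead of hanging.

```python
import threading


def demo_lock_inversion(timeout_s: float = 0.2) -> dict:
    """Two threads take two locks in opposite order; neither gets its second lock."""
    device_lock = threading.Lock()  # stands in for dev->mutex
    reset_lock = threading.Lock()   # stands in for pciehp's reset_lock rwsem
    both_hold_first = threading.Barrier(2)
    both_tried_second = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            # Make sure both threads own their first lock before going on.
            both_hold_first.wait()
            acquired = second.acquire(timeout=timeout_s)
            got_second[name] = acquired
            # Keep the first lock held until both attempts have finished.
            both_tried_second.wait()
            if acquired:
                second.release()

    threads = [
        threading.Thread(target=worker, args=("aer", device_lock, reset_lock)),
        threading.Thread(target=worker, args=("pciehp", reset_lock, device_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second


if __name__ == "__main__":
    print(demo_lock_inversion())
```

Both entries come back False: each thread times out on the lock the other one owns. Without the timeouts, both would block forever, which is what the hung-task detector reports after 120 seconds.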


* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-02-10  6:23 ` Lukas Wunner
@ 2022-02-10 14:39   ` Andrey Grodzovsky
  2022-06-10 21:25     ` Andrey Grodzovsky
  2022-02-10 20:47   ` Andrey Grodzovsky
  1 sibling, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-02-10 14:39 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-pci, helgaas, anatoli.antonovitch, Kumar1, Rahul,
	Alexander.Deucher

Thanks a lot for the quick response; we will give this a try.

Andrey

On 2022-02-10 01:23, Lukas Wunner wrote:
> On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
>> Hi, on kernel based on 5.4.2 we are observing a deadlock between
>> reset_lock semaphore and device_lock (dev->mutex). The scenario
>> we do is putting the system to sleep, disconnecting the eGPU
>> from the PCIe bus (through a special SBIOS setting) or by simply
>> removing power to external PCIe cage and waking the
>> system up.
>>
>> I attached the log. Please advise if you have any idea how
>> to work around it ? Since the kernel is old, does anyone
>> have an idea if this issue is known and already solved in later kernels ?
>> We cannot try with latest since our kernel is custom for that platform.
> 
> It is a known issue.  Here's a fix I submitted during the v5.9 cycle:
> 
> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
> 
> The fix hasn't been applied yet.  I think I need to rework the patch,
> just haven't found the time.
> 
> Since the trigger in your case are AER-handled errors during a
> system sleep transition, you may also want to consider the
> following 2-patch series by Kai-Heng Feng which is currently
> under discussion:
> 
> https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/
> 
> That series disables AER during a system sleep transition and
> should thus prevent the flood of AER-handled errors you're seeing.
> Once AER is disabled, the reset-induced deadlocks should go away as well.
> 
> Thanks,
> 
> Lukas


* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-02-10  6:23 ` Lukas Wunner
  2022-02-10 14:39   ` Andrey Grodzovsky
@ 2022-02-10 20:47   ` Andrey Grodzovsky
  2022-02-10 21:37     ` Lukas Wunner
  1 sibling, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-02-10 20:47 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-pci, helgaas, anatoli.antonovitch, Kumar1, Rahul,
	Alexander.Deucher

The patches did resolve the deadlock, but when we try to hotplug
the device back in, there is a link status failure:

pcieport 0000:00:01.1: pciehp: Slot(0): Card present
pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
pcieport 0000:00:01.1: pciehp: Failed to check link status

A more detailed log is below. We are still debugging, but you
might have a quick insight.

Feb 10 23:37:52 amd-BILBY kernel: [   67.885459] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
Feb 10 23:37:52 amd-BILBY kernel: [   67.901477] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
Feb 10 23:37:52 amd-BILBY kernel: [   67.915376] [drm] kiq ring mec 2 pipe 1 q 0
Feb 10 23:37:52 amd-BILBY kernel: [   67.920041] amdgpu: restore the fine grain parameters
Feb 10 23:37:52 amd-BILBY kernel: [   68.156714] [drm] VCN decode and encode initialized successfully(under SPG Mode).
Feb 10 23:37:52 amd-BILBY kernel: [   68.164222] amdgpu 0000:05:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.171275] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.178932] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.186589] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.194247] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.201906] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.209562] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.217216] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.224872] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.232616] amdgpu 0000:05:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.240272] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [   68.247497] amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [   68.249433] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 10 23:37:52 amd-BILBY kernel: [   68.254894] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [   68.261315] ata1.00: supports DRM functions and may not be fully accessible
Feb 10 23:37:52 amd-BILBY kernel: [   68.268558] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [   68.276173] ata1.00: disabling queued TRIM support
Feb 10 23:37:52 amd-BILBY kernel: [   68.283010] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [   68.289443] ata1.00: supports DRM functions and may not be fully accessible
Feb 10 23:37:52 amd-BILBY kernel: [   68.302782] ata1.00: disabling queued TRIM support
Feb 10 23:37:52 amd-BILBY kernel: [   68.308863] ata1.00: configured for UDMA/133
Feb 10 23:37:52 amd-BILBY kernel: [   68.597833] pci 0000:03:00.0: Removing from iommu group 21
Feb 10 23:37:52 amd-BILBY kernel: [   68.597991] acpi LNXPOWER:08: Turning OFF
Feb 10 23:37:52 amd-BILBY kernel: [   68.605244] pci_bus 0000:03: busn_res: [bus 03] is released
Feb 10 23:37:52 amd-BILBY kernel: [   68.611552] acpi LNXPOWER:07: Turning OFF
Feb 10 23:37:52 amd-BILBY kernel: [   68.619469] pci 0000:02:00.0: Removing from iommu group 20
Feb 10 23:37:52 amd-BILBY kernel: [   68.626121] acpi LNXPOWER:04: Turning OFF
Feb 10 23:37:52 amd-BILBY kernel: [   68.632720] pci_bus 0000:02: busn_res: [bus 02-03] is released
Feb 10 23:37:52 amd-BILBY kernel: [   68.638105] OOM killer enabled.
Feb 10 23:37:52 amd-BILBY kernel: [   68.645106] pci 0000:01:00.0: Removing from iommu group 19
Feb 10 23:37:52 amd-BILBY kernel: [   68.649418] Restarting tasks ... done.
Feb 10 23:37:52 amd-BILBY kernel: [   68.662516] PM: suspend exit
Feb 10 23:37:52 amd-BILBY kernel: [   68.669613] rfkill: input handler disabled
Feb 10 23:37:52 amd-BILBY kernel: [   68.695045] show_signal_msg: 28 callbacks suppressed
Feb 10 23:37:52 amd-BILBY kernel: [   68.695048] glmark2[1894]: segfault at 0 ip 00007f799dae6d85 sp 00007ffd34320bc0 error 4 in radeonsi_dri.so[7f799d28d000+a94000]
Feb 10 23:37:52 amd-BILBY kernel: [   68.711653] Code: 00 4c 39 ed 74 6f 49 89 fc eb 1f 66 2e 0f 1f 84 00 00 00 00 00 48 89 ef e8 08 a2 7a ff 49 8b ac 24 e0 77 00 00 4c 39 ed 74 4b <48> 8b 55 00 48 8b 45 08 48 8b 5d 10 48 89 42 08 48 89 10 48 c7 45
Feb 10 23:37:53 amd-BILBY kernel: [   69.684921] pcieport 0000:00:01.1: AER: Root Port link has been reset
Feb 10 23:37:53 amd-BILBY kernel: [   69.691438] pcieport 0000:00:01.1: AER: Device recovery failed
Feb 10 23:37:53 amd-BILBY kernel: [   69.697327] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
Feb 10 23:37:53 amd-BILBY kernel: [   69.706231] pcieport 0000:00:01.1: AER: can't find device of ID0008
Feb 10 23:40:33 amd-BILBY kernel: [  228.769973] sysrq: HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V) show-blocked-tasks(w) dump-ftrace-buffer(z)
Feb 10 23:41:47 amd-BILBY kernel: [  302.759503] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:41:49 amd-BILBY kernel: [  304.795473] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:41:49 amd-BILBY kernel: [  304.803146] pcieport 0000:00:01.1: pciehp: Failed to check link status
Feb 10 23:42:30 amd-BILBY kernel: [  345.767046] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:42:32 amd-BILBY kernel: [  347.811119] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:42:32 amd-BILBY kernel: [  347.818793] pcieport 0000:00:01.1: pciehp: Failed to check link status
Feb 10 23:45:13 amd-BILBY kernel: [  508.465497] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:45:15 amd-BILBY kernel: [  510.505681] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:45:15 amd-BILBY kernel: [  510.513355] pcieport 0000:00:01.1: pciehp: Failed to check link status

Andrey
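
The "Data Link Layer Link Active not set in 1000 msec" message is what a poll-with-timeout on the Link Status register produces when the link never trains. A minimal Python sketch of that pattern follows. It is not the kernel's implementation: read_lnksta is a hypothetical callback returning the 16-bit Link Status value, and the poll interval is an assumption. Only the DLLLA bit position (bit 13) comes from the PCIe specification.

```python
import time

PCI_EXP_LNKSTA_DLLLA = 0x2000  # Data Link Layer Link Active, bit 13 of Link Status


def wait_for_link_active(read_lnksta, timeout_ms=1000, poll_ms=10, sleep=time.sleep):
    """Poll Link Status until DLLLA is set or timeout_ms has elapsed.

    read_lnksta: hypothetical callable returning the 16-bit Link Status word.
    sleep: injectable so the loop can be exercised without real delays.
    """
    waited_ms = 0
    while True:
        if read_lnksta() & PCI_EXP_LNKSTA_DLLLA:
            return True
        if waited_ms >= timeout_ms:
            return False
        sleep(poll_ms / 1000)
        waited_ms += poll_ms


if __name__ == "__main__":
    # A link that never becomes active, as in the log above.
    if not wait_for_link_active(lambda: 0x0000, timeout_ms=30):
        print("Data Link Layer Link Active not set in 30 msec")
```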

On 2022-02-10 01:23, Lukas Wunner wrote:
> On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
>> Hi, on kernel based on 5.4.2 we are observing a deadlock between
>> reset_lock semaphore and device_lock (dev->mutex). The scenario
>> we do is putting the system to sleep, disconnecting the eGPU
>> from the PCIe bus (through a special SBIOS setting) or by simply
>> removing power to external PCIe cage and waking the
>> system up.
>>
>> I attached the log. Please advise if you have any idea how
>> to work around it ? Since the kernel is old, does anyone
>> have an idea if this issue is known and already solved in later kernels ?
>> We cannot try with latest since our kernel is custom for that platform.
> 
> It is a known issue.  Here's a fix I submitted during the v5.9 cycle:
> 
> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
> 
> The fix hasn't been applied yet.  I think I need to rework the patch,
> just haven't found the time.
> 
> Since the trigger in your case are AER-handled errors during a
> system sleep transition, you may also want to consider the
> following 2-patch series by Kai-Heng Feng which is currently
> under discussion:
> 
> https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/
> 
> That series disables AER during a system sleep transition and
> should thus prevent the flood of AER-handled errors you're seeing.
> Once AER is disabled, the reset-induced deadlocks should go away as well.
> 
> Thanks,
> 
> Lukas


* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-02-10 20:47   ` Andrey Grodzovsky
@ 2022-02-10 21:37     ` Lukas Wunner
  2022-02-10 23:12       ` Andrey Grodzovsky
  2022-02-11 14:42       ` Kumar1, Rahul
  0 siblings, 2 replies; 15+ messages in thread
From: Lukas Wunner @ 2022-02-10 21:37 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: linux-pci, helgaas, anatoli.antonovitch, Kumar1, Rahul,
	Alexander.Deucher

On Thu, Feb 10, 2022 at 03:47:10PM -0500, Andrey Grodzovsky wrote:
> The patches did resolve the deadlock, but when we try to hotplug
> the device back in, there is a link status failure:
> 
> pcieport 0000:00:01.1: pciehp: Slot(0): Card present
> pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
> pcieport 0000:00:01.1: pciehp: Failed to check link status
> 
> A more detailed log is below. We are still debugging, but you
> might have a quick insight.

Well, the link doesn't come up.  Is the Link Disable bit in the
Link Control Register set for some reason?  Perhaps some ACPI method
fiddled with it?

Compare the output of lspci -vv before and after the system sleep
transition, do you see anything suspicious?

If you reset the slot via sysfs, does the link come back up?

You may want to open a bug over at bugzilla.kernel.org and attach
the full dmesg output which didn't reach the list, as well as lspci
output.

Did you apply only my deadlock fix or also Kai-Heng Feng's AER disablement
patch?

Thanks,

Lukas
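
To check the Link Disable bit mentioned above, one option is to read the Link Control register as a hex word and test bit 4. A minimal Python sketch, assuming the word was obtained with something like "setpci -s 00:01.1 CAP_EXP+10.w" (Link Control sits at offset 0x10 of the PCI Express capability). The helper names and the sample values are made up for illustration.

```python
PCI_EXP_LNKCTL_LD = 0x0010  # Link Disable, bit 4 of the Link Control register


def parse_hex_word(text: str) -> int:
    """Parse a 16-bit hex word such as the '0040' that setpci prints."""
    return int(text.strip(), 16)


def link_disable_set(lnkctl: int) -> bool:
    """Return True if the Link Disable bit is set in a Link Control value."""
    return bool(lnkctl & PCI_EXP_LNKCTL_LD)


if __name__ == "__main__":
    # Sample values only; they are not taken from the affected machine.
    for sample in ("0040", "0050"):
        value = parse_hex_word(sample)
        print(sample, "Link Disable set" if link_disable_set(value) else "Link Disable clear")
```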


* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-02-10 21:37     ` Lukas Wunner
@ 2022-02-10 23:12       ` Andrey Grodzovsky
  2022-02-11 14:42       ` Kumar1, Rahul
  1 sibling, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-02-10 23:12 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-pci, helgaas, anatoli.antonovitch, Kumar1, Rahul,
	Alexander.Deucher



On 2022-02-10 16:37, Lukas Wunner wrote:
> On Thu, Feb 10, 2022 at 03:47:10PM -0500, Andrey Grodzovsky wrote:
>> The patches did resolve the deadlock, but when we try to hotplug
>> the device back in, there is a link status failure:
>>
>> pcieport 0000:00:01.1: pciehp: Slot(0): Card present
>> pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
>> pcieport 0000:00:01.1: pciehp: Failed to check link status
>>
>> A more detailed log is below. We are still debugging, but you
>> might have a quick insight.
> 
> Well, the link doesn't come up.  Is the Link Disable bit in the
> Link Control Register set for some reason?  Perhaps some ACPI method
> fiddled with it?
> 
> Compare the output of lspci -vv before and after the system sleep
> transition, do you see anything suspicious?
> 
> If you reset the slot via sysfs, does the link come back up?
> 
> You may want to open a bug over at bugzilla.kernel.org and attach
> the full dmesg output which didn't reach the list, as well as lspci
> output.


We will follow all of your advice and update you.

> 
> Did you apply only my deadlock fix or also Kai-Heng Feng's AER disablement
> patch?

Yes, we applied both.

Andrey

> 
> Thanks,
> 
> Lukas


* RE: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-02-10 21:37     ` Lukas Wunner
  2022-02-10 23:12       ` Andrey Grodzovsky
@ 2022-02-11 14:42       ` Kumar1, Rahul
  2022-02-15  7:02         ` Lukas Wunner
  1 sibling, 1 reply; 15+ messages in thread
From: Kumar1, Rahul @ 2022-02-11 14:42 UTC (permalink / raw)
  To: Lukas Wunner, Grodzovsky, Andrey
  Cc: linux-pci, helgaas, Antonovitch, Anatoli, Deucher, Alexander

[AMD Official Use Only]


Hi Lukas,

We can see some changes in the lspci output between the working and non-working cases. The fields we compared are below:
Link Speed: 8GT/s -> 2.5GT/s
DLActive+ -> DLActive-
BWMgmt+ -> BWMgmt+
PresDet+ -> PresDet+
EqualizationComplete+ -> EqualizationComplete+


Also, when we reset via sysfs, we don't see this issue.

I have created a bug here: https://bugzilla.kernel.org/show_bug.cgi?id=215590


Thanks
Rahul
-----Original Message-----
From: Lukas Wunner <lukas@wunner.de> 
Sent: Friday, February 11, 2022 3:08 AM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Cc: linux-pci@vger.kernel.org; helgaas@kernel.org; Antonovitch, Anatoli <Anatoli.Antonovitch@amd.com>; Kumar1, Rahul <Rahul.Kumar1@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Subject: Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device

On Thu, Feb 10, 2022 at 03:47:10PM -0500, Andrey Grodzovsky wrote:
> The patches did resolve the deadlock, but when we try to hotplug
> the device back in, there is a link status failure:
> 
> pcieport 0000:00:01.1: pciehp: Slot(0): Card present
> pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
> pcieport 0000:00:01.1: pciehp: Failed to check link status
> 
> A more detailed log is below. We are still debugging, but you
> might have a quick insight.

Well, the link doesn't come up.  Is the Link Disable bit in the Link Control Register set for some reason?  Perhaps some ACPI method fiddled with it?

Compare the output of lspci -vv before and after the system sleep transition, do you see anything suspicious?

If you reset the slot via sysfs, does the link come back up?

You may want to open a bug over at bugzilla.kernel.org and attach the full dmesg output which didn't reach the list, as well as lspci output.

Did you apply only my deadlock fix or also Kai-Heng Feng's AER disablement patch?

Thanks,

Lukas
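
The lspci comparison in this message can be automated. Below is a minimal Python sketch that extracts the "Name+" / "Name-" flags and the link speed from two lspci -vv excerpts and reports what changed. The sample excerpts are made up for illustration (only the speed values and the DLActive change come from the thread), and the helper names are hypothetical.

```python
import re

# Matches lspci-style flags such as "DLActive+" or "Train-".
FLAG_RE = re.compile(r"\b([A-Za-z][A-Za-z0-9]*)([+-])(?=\s|$)")
SPEED_RE = re.compile(r"Speed\s+([\d.]+GT/s)")


def flags(text: str) -> dict:
    """Map each flag name to '+' or '-'."""
    return {name: sign for name, sign in FLAG_RE.findall(text)}


def link_speed(text: str):
    """Return the first 'Speed N GT/s' value found, or None."""
    match = SPEED_RE.search(text)
    return match.group(1) if match else None


def changed_flags(before: str, after: str) -> dict:
    """Return {flag: (before_sign, after_sign)} for flags whose sign differs."""
    b, a = flags(before), flags(after)
    return {name: (b[name], a[name]) for name in b if name in a and b[name] != a[name]}


if __name__ == "__main__":
    # Illustrative excerpts, not the real output from the affected system.
    before = "LnkSta: Speed 8GT/s (ok), Width x8 (ok)\n\tTrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-"
    after = "LnkSta: Speed 2.5GT/s (downgraded), Width x8 (ok)\n\tTrErr- Train- SlotClk+ DLActive- BWMgmt+ ABWMgmt-"
    print(link_speed(before), "->", link_speed(after))
    print(changed_flags(before, after))
```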


* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-02-11 14:42       ` Kumar1, Rahul
@ 2022-02-15  7:02         ` Lukas Wunner
  2022-02-15  8:18           ` Kumar1, Rahul
  0 siblings, 1 reply; 15+ messages in thread
From: Lukas Wunner @ 2022-02-15  7:02 UTC (permalink / raw)
  To: Kumar1, Rahul
  Cc: Grodzovsky, Andrey, linux-pci, helgaas, Antonovitch, Anatoli,
	Deucher, Alexander

On Fri, Feb 11, 2022 at 02:42:21PM +0000, Kumar1, Rahul wrote:
> We can see some changes in the lspci output between the working and non-working cases. The fields we compared are below:
> Link Speed: 8GT/s -> 2.5GT/s
> DLActive+ -> DLActive-
> BWMgmt+ -> BWMgmt+
> PresDet+ -> PresDet+
> EqualizationComplete+ -> EqualizationComplete+
> 
> Also, when we reset via sysfs, we don't see this issue.
> 
> I have created a bug here: https://bugzilla.kernel.org/show_bug.cgi?id=215590

So with the patches applied, the link doesn't come up after resume,
but if you then reset via sysfs, it does come up, is that what you're
saying?

The dmesg excerpt Andrey posted shows an AER splat after resume (even
with the patches applied):

[   69.684921] pcieport 0000:00:01.1: AER: Root Port link has been reset
[   69.691438] pcieport 0000:00:01.1: AER: Device recovery failed
[   69.697327] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[   69.706231] pcieport 0000:00:01.1: AER: can't find device of ID0008

I suspect the Root Port refuses to train the link due to that fatal
error.  Perhaps Kai-Heng Feng's patch is incomplete and it needs to
clear stale AER errors?  Or maybe it re-enables AER too early?

Could you attach lspci -vv output before/after suspend to the bugzilla?
And also attach full dmesg output with the patches applied?

Thanks,

Lukas


* RE: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-02-15  7:02         ` Lukas Wunner
@ 2022-02-15  8:18           ` Kumar1, Rahul
  0 siblings, 0 replies; 15+ messages in thread
From: Kumar1, Rahul @ 2022-02-15  8:18 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Grodzovsky, Andrey, linux-pci, helgaas, Antonovitch, Anatoli,
	Deucher, Alexander

[AMD Official Use Only]

> So with the patches applied, the link doesn't come up after resume,
> but if you then reset via sysfs, it does come up, is that what you're
> saying?

Yes, correct: if we reset via sysfs, we do not see this issue. I have attached lspci and dmesg logs, taken with all three patches applied, to Bugzilla.

We confirmed that the PCI_BRIDGE_CTL_BUS_RESET bit is still set after resume; once PCI_BRIDGE_CTL_BUS_RESET is cleared to 0, we are able to access the link.

It looks like the reset does not complete properly due to some timing issue in pci_reset_secondary_bus. We will come back after analyzing this further.

Best Regards,
Rahul
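
A check for a stuck PCI_BRIDGE_CTL_BUS_RESET bit can be scripted against the bridge's config space. The Python sketch below is only an illustration: the register offset (0x3E) and bit value (0x40) match the standard Bridge Control register, while the helper names and the device address in the comment are assumptions, since the thread does not say which bridge was inspected.

```python
PCI_BRIDGE_CONTROL = 0x3E        # Bridge Control register offset (type 1 header)
PCI_BRIDGE_CTL_BUS_RESET = 0x40  # Secondary Bus Reset bit


def read_config(bdf: str) -> bytes:
    """Read the first 64 bytes of config space from sysfs.

    bdf is a full address such as '0000:00:01.1' (placeholder, not confirmed
    by the thread as the bridge that was checked).
    """
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        return f.read(64)


def bus_reset_asserted(config: bytes) -> bool:
    """Return True if Secondary Bus Reset is still set in Bridge Control."""
    ctl = int.from_bytes(config[PCI_BRIDGE_CONTROL:PCI_BRIDGE_CONTROL + 2], "little")
    return bool(ctl & PCI_BRIDGE_CTL_BUS_RESET)


if __name__ == "__main__":
    # Synthetic config space so the example runs without hardware access.
    fake = bytearray(64)
    print(bus_reset_asserted(bytes(fake)))
    fake[PCI_BRIDGE_CONTROL] |= PCI_BRIDGE_CTL_BUS_RESET
    print(bus_reset_asserted(bytes(fake)))
```

On a real system, bus_reset_asserted(read_config(...)) run after resume would show whether the bit was left set.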


-----Original Message----- 
From: Lukas Wunner <lukas@wunner.de> 
Sent: Tuesday, February 15, 2022 12:32 PM
To: Kumar1, Rahul <Rahul.Kumar1@amd.com>
Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; linux-pci@vger.kernel.org; helgaas@kernel.org; 
Antonovitch, Anatoli <Anatoli.Antonovitch@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Subject: Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device

On Fri, Feb 11, 2022 at 02:42:21PM +0000, Kumar1, Rahul wrote:
> We can some changes we can see in lspci from working to non-working 
> case. Below are changes Link Speed =  8GT/s  -> 2.5GT/s.
> DLActive+   ->     DLActive-
> BWMgmt+   -> BWMgmt+
> PresDet+ -> PresDet+
> EqualizationComplete+ -> EqualizationComplete+
> 
> Also when we do reset via sysfs, we don't see this issue.
> 
> I have created bug here 
> https://bugzilla.kernel.org/show_bug.cgi?id=215590

So with the patches applied, the link doesn't come up after resume, but if you then reset via sysfs, it does come up, is that what you're saying?

The dmesg excerpt Andrey posted shows an AER splat after resume (even with the patches applied):

[   69.684921] pcieport 0000:00:01.1: AER: Root Port link has been reset
[   69.691438] pcieport 0000:00:01.1: AER: Device recovery failed
[   69.697327] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[   69.706231] pcieport 0000:00:01.1: AER: can't find device of ID0008

I suspect the Root Port refuses to train the link due to that fatal error.  Perhaps Kai-Heng Feng's patch is incomplete and it needs to clear stale AER errors?  Or maybe it re-enables AER too early?

Could you attach lspci -vv output before/after suspend to the bugzilla?
And also attach full dmesg output with the patches applied?

Thanks,

Lukas


* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-02-10 14:39   ` Andrey Grodzovsky
@ 2022-06-10 21:25     ` Andrey Grodzovsky
  2022-06-14 18:07       ` Andrey Grodzovsky
  0 siblings, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-06-10 21:25 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-pci, helgaas, anatoli.antonovitch, Kumar1, Rahul,
	Alexander.Deucher



On 2022-02-10 09:39, Andrey Grodzovsky wrote:
> Thanks a lot for quick response, we will give this a try.
> 
> Andrey
> 
> On 2022-02-10 01:23, Lukas Wunner wrote:
>> On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
>>> Hi, on kernel based on 5.4.2 we are observing a deadlock between
>>> reset_lock semaphore and device_lock (dev->mutex). The scenario
>>> we do is putting the system to sleep, disconnecting the eGPU
>>> from the PCIe bus (through a special SBIOS setting) or by simply
>>> removing power to external PCIe cage and waking the
>>> system up.
>>>
>>> I attached the log. Please advise if you have any idea how
>>> to work around it ? Since the kernel is old, does anyone
>>> have an idea if this issue is known and already solved in later 
>>> kernels ?
>>> We cannot try with latest since our kernel is custom for that platform.
>>
>> It is a known issue.  Here's a fix I submitted during the v5.9 cycle:
>>
>> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
>>
>>
>> The fix hasn't been applied yet.  I think I need to rework the patch,
>> just haven't found the time.

Hey Lukas - just checking again whether you had a chance to push this
change through? It's essential to us in one of our customer projects, so
we wonder if you have any estimate of when it will be upstreamed and
whether we can help with this. We would also need it backported to the
5.11 and 5.4 kernels after it's upstreamed.

Another point I want to mention is that this patch has a negative
side effect on plug-back times: it causes a regression in the
delay to light up the display at resume time, related to back-ported AER

Anatoli is working on resolving this and so maybe he can add his
comment here and maybe you can help him with proper resolution for this.

Andrey

>>
>> Since the trigger in your case are AER-handled errors during a
>> system sleep transition, you may also want to consider the
>> following 2-patch series by Kai-Heng Feng which is currently
>> under discussion:
>>
>> https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/
>>
>>
>> That series disables AER during a system sleep transition and
>> should thus prevent the flood of AER-handled errors you're seeing.
>> Once AER is disabled, the reset-induced deadlocks should go away as well.
>>
>> Thanks,
>>
>> Lukas


* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-06-10 21:25     ` Andrey Grodzovsky
@ 2022-06-14 18:07       ` Andrey Grodzovsky
  2022-06-14 18:22         ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-06-14 18:07 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-pci, helgaas, anatoli.antonovitch, Kumar1, Rahul,
	Alexander.Deucher

Just a gentle ping. I also updated the ticket
https://bugzilla.kernel.org/show_bug.cgi?id=215590

with the workaround we did, in case it helps you advise us on
what a generic solution for this would be.

Andrey

On 2022-06-10 17:25, Andrey Grodzovsky wrote:
> 
> 
> On 2022-02-10 09:39, Andrey Grodzovsky wrote:
>> Thanks a lot for quick response, we will give this a try.
>>
>> Andrey
>>
>> On 2022-02-10 01:23, Lukas Wunner wrote:
>>> On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
>>>> Hi, on kernel based on 5.4.2 we are observing a deadlock between
>>>> reset_lock semaphore and device_lock (dev->mutex). The scenario
>>>> we do is putting the system to sleep, disconnecting the eGPU
>>>> from the PCIe bus (through a special SBIOS setting) or by simply
>>>> removing power to external PCIe cage and waking the
>>>> system up.
>>>>
>>>> I attached the log. Please advise if you have any idea how
>>>> to work around it ? Since the kernel is old, does anyone
>>>> have an idea if this issue is known and already solved in later 
>>>> kernels ?
>>>> We cannot try with latest since our kernel is custom for that platform.
>>>
>>> It is a known issue.  Here's a fix I submitted during the v5.9 cycle:
>>>
>>> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
>>>
>>>
>>> The fix hasn't been applied yet.  I think I need to rework the patch,
>>> just haven't found the time.
> 
> Hey Lucas - just checking again if you had a chance to push this change
> through ? It's essential to us in one of our costumer projects so we
> wonder if have any estimate when will it be up-streamed and if we can
> help with this. We would also need backporting this back to 5.11 and 5.4
> kernels after it's upstreamed.
> 
> Another point I want to mention is that this patch has a negative
> side effect on plug back times - it causes a regression point for the 
> delay to light-up display at resume time related to back-ported AER
> 
> Anatoli is working on resolving this and so maybe he can add his
> comment here and maybe you can help him with proper resolution for this.
> 
> Andrey
> 
>>>
>>> Since the trigger in your case are AER-handled errors during a
>>> system sleep transition, you may also want to consider the
>>> following 2-patch series by Kai-Heng Feng which is currently
>>> under discussion:
>>>
>>> https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/
>>>
>>>
>>> That series disables AER during a system sleep transition and
>>> should thus prevent the flood of AER-handled errors you're seeing.
>>> Once AER is disabled, the reset-induced deadlocks should go away as 
>>> well.
>>>
>>> Thanks,
>>>
>>> Lukas


* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-06-14 18:07       ` Andrey Grodzovsky
@ 2022-06-14 18:22         ` Sathyanarayanan Kuppuswamy
  2022-06-14 20:35           ` Andrey Grodzovsky
  0 siblings, 1 reply; 15+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-06-14 18:22 UTC (permalink / raw)
  To: Andrey Grodzovsky, Lukas Wunner
  Cc: linux-pci, helgaas, anatoli.antonovitch, Kumar1, Rahul,
	Alexander.Deucher

Hi,

On 6/14/22 11:07 AM, Andrey Grodzovsky wrote:
> Just a gentle ping, also - I updated the ticket https://bugzilla.kernel.org/show_bug.cgi?id=215590
> 
> with the workaround we did if this could help you to advise us
> what would be a generic solution for this ?
> 
> Andrey
Can you explain your workaround (WA)? It seems to be unrelated to the
deadlock issue discussed in this thread. Are they related?

> 
> On 2022-06-10 17:25, Andrey Grodzovsky wrote:
>>
>>
>> On 2022-02-10 09:39, Andrey Grodzovsky wrote:
>>> Thanks a lot for quick response, we will give this a try.
>>>
>>> Andrey
>>>
>>> On 2022-02-10 01:23, Lukas Wunner wrote:
>>>> On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
>>>>> Hi, on kernel based on 5.4.2 we are observing a deadlock between
>>>>> reset_lock semaphore and device_lock (dev->mutex). The scenario
>>>>> we do is putting the system to sleep, disconnecting the eGPU
>>>>> from the PCIe bus (through a special SBIOS setting) or by simply
>>>>> removing power to external PCIe cage and waking the
>>>>> system up.
>>>>>
>>>>> I attached the log. Please advise if you have any idea how
>>>>> to work around it ? Since the kernel is old, does anyone
>>>>> have an idea if this issue is known and already solved in later kernels ?
>>>>> We cannot try with latest since our kernel is custom for that platform.
>>>>
>>>> It is a known issue.  Here's a fix I submitted during the v5.9 cycle:
>>>>
>>>> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
>>>>
>>>> The fix hasn't been applied yet.  I think I need to rework the patch,
>>>> just haven't found the time.
>>
>> Hey Lucas - just checking again if you had a chance to push this change
>> through ? It's essential to us in one of our costumer projects so we
>> wonder if have any estimate when will it be up-streamed and if we can
>> help with this. We would also need backporting this back to 5.11 and 5.4
>> kernels after it's upstreamed.
>>
>> Another point I want to mention is that this patch has a negative
>> side effect on plug back times - it causes a regression point for the delay to light-up display at resume time related to back-ported AER
>>
>> Anatoli is working on resolving this and so maybe he can add his
>> comment here and maybe you can help him with proper resolution for this.
>>
>> Andrey
>>
>>>>
>>>> Since the trigger in your case are AER-handled errors during a
>>>> system sleep transition, you may also want to consider the
>>>> following 2-patch series by Kai-Heng Feng which is currently
>>>> under discussion:
>>>>
>>>> https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/
>>>>
>>>> That series disables AER during a system sleep transition and
>>>> should thus prevent the flood of AER-handled errors you're seeing.
>>>> Once AER is disabled, the reset-induced deadlocks should go away as well.
>>>>
>>>> Thanks,
>>>>
>>>> Lukas

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-06-14 18:22         ` Sathyanarayanan Kuppuswamy
@ 2022-06-14 20:35           ` Andrey Grodzovsky
  2022-06-15 15:14             ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-06-14 20:35 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Lukas Wunner
  Cc: linux-pci, helgaas, anatoli.antonovitch, Kumar1, Rahul,
	Alexander.Deucher



On 2022-06-14 14:22, Sathyanarayanan Kuppuswamy wrote:
> Hi,
> 
> On 6/14/22 11:07 AM, Andrey Grodzovsky wrote:
>> Just a gentle ping, also - I updated the ticket https://bugzilla.kernel.org/show_bug.cgi?id=215590
>>
>> with the workaround we did if this could help you to advise us
>> what would be a generic solution for this ?
>>
>> Andrey
> Can you explain your WA? It seems to be unrelated to deadlock issue
> discussed in this thread. Are they related?

So, from the start: we have an extension PCI board which is hot-pluggable
into our system board. On top of this extension board we have an AMD dGPU
card. Originally we observed a hang on resume from sleep (S3) on an
AER-enabled system because of a race between AER and pciehp on S3 resume,
and this was resolved by the patch
https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/T/

Now, after this, we are facing a second issue: after resume, and after
AER driver recovery has completed for the pcieport, the system won't
detect a new hotplug of the extension board into the system board.
Anatoli looked into it and found the workaround that I attached, which
makes it work by resetting the secondary bus and updating the link speed
on the upstream bridge after AER recovery completes (post S3 resume).
But this is just a workaround and not a generic solution, so we would
like to get advice on a generic fix for this problem.

To reiterate, the full scenario is as follows:

1) Boot the system.

2) The extension board is hot-plugged for the first time and the dGPU is
added to the PCI topology.

3) System suspend (S3).

4) We have a custom BIOS which 'shuts off' the extension board during
sleep, so on resume the system discovers that the extension board (and
dGPU) are gone and hot-removes them from the PCI topology. Together with
this hot remove, AER errors are generated and handled.

5) We again try to hot plug through a script we have, but the system
won't detect the new hot plug of the extension board.

5*) The given workaround patch fixes the issue in step 5): the hot plug
is detected and the system recognizes the extension board and adds it and
the dGPU to the PCI topology.

Andrey

> 
>>
>> On 2022-06-10 17:25, Andrey Grodzovsky wrote:
>>>
>>>
>>> On 2022-02-10 09:39, Andrey Grodzovsky wrote:
>>>> Thanks a lot for quick response, we will give this a try.
>>>>
>>>> Andrey
>>>>
>>>> On 2022-02-10 01:23, Lukas Wunner wrote:
>>>>> On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
>>>>>> Hi, on kernel based on 5.4.2 we are observing a deadlock between
>>>>>> reset_lock semaphore and device_lock (dev->mutex). The scenario
>>>>>> we do is putting the system to sleep, disconnecting the eGPU
>>>>>> from the PCIe bus (through a special SBIOS setting) or by simply
>>>>>> removing power to external PCIe cage and waking the
>>>>>> system up.
>>>>>>
>>>>>> I attached the log. Please advise if you have any idea how
>>>>>> to work around it ? Since the kernel is old, does anyone
>>>>>> have an idea if this issue is known and already solved in later kernels ?
>>>>>> We cannot try with latest since our kernel is custom for that platform.
>>>>>
>>>>> It is a known issue.  Here's a fix I submitted during the v5.9 cycle:
>>>>>
>>>>> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
>>>>>
>>>>> The fix hasn't been applied yet.  I think I need to rework the patch,
>>>>> just haven't found the time.
>>>
>>> Hey Lucas - just checking again if you had a chance to push this change
>>> through ? It's essential to us in one of our costumer projects so we
>>> wonder if have any estimate when will it be up-streamed and if we can
>>> help with this. We would also need backporting this back to 5.11 and 5.4
>>> kernels after it's upstreamed.
>>>
>>> Another point I want to mention is that this patch has a negative
>>> side effect on plug back times - it causes a regression point for the delay to light-up display at resume time related to back-ported AER
>>>
>>> Anatoli is working on resolving this and so maybe he can add his
>>> comment here and maybe you can help him with proper resolution for this.
>>>
>>> Andrey
>>>
>>>>>
>>>>> Since the trigger in your case are AER-handled errors during a
>>>>> system sleep transition, you may also want to consider the
>>>>> following 2-patch series by Kai-Heng Feng which is currently
>>>>> under discussion:
>>>>>
>>>>> https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/
>>>>>
>>>>> That series disables AER during a system sleep transition and
>>>>> should thus prevent the flood of AER-handled errors you're seeing.
>>>>> Once AER is disabled, the reset-induced deadlocks should go away as well.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Lukas
> 


* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-06-14 20:35           ` Andrey Grodzovsky
@ 2022-06-15 15:14             ` Sathyanarayanan Kuppuswamy
  2022-06-15 15:49               ` Andrey Grodzovsky
  0 siblings, 1 reply; 15+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-06-15 15:14 UTC (permalink / raw)
  To: Andrey Grodzovsky, Lukas Wunner
  Cc: linux-pci, helgaas, anatoli.antonovitch, Kumar1, Rahul,
	Alexander.Deucher



On 6/14/22 1:35 PM, Andrey Grodzovsky wrote:
> 
> 
> On 2022-06-14 14:22, Sathyanarayanan Kuppuswamy wrote:
>> Hi,
>>
>> On 6/14/22 11:07 AM, Andrey Grodzovsky wrote:
>>> Just a gentle ping, also - I updated the ticket https://bugzilla.kernel.org/show_bug.cgi?id=215590
>>>
>>> with the workaround we did if this could help you to advise us
>>> what would be a generic solution for this ?
>>>
>>> Andrey
>> Can you explain your WA? It seems to be unrelated to deadlock issue
>> discussed in this thread. Are they related?
> 
> So from start - originally we have an extension PCI board which is hot plug-able into our system board. On top of this extension board we have
> AMD dGPU card. Originally we observed hang on resume from sleep (S3) in
> AER enabled system because of race between AER and pciehp on S3 resume and so this
> was resolved by the patch https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/T/
> 

There is a patch to disable AER in the suspend/resume path (from Kai-Heng
Feng). Did you check with this patch?

> Now after this we are facing a second issue where after resume and after
> AER driver recovery completed for pcieport the system won't detect a new
> hotplug of the extention board into the system board. Anatoli looked

What about the hotplug events during this sequence? Did you get the
LINK DOWN/UP or Presence change events?

> into it and found the workaround that I attached that made it work by
> resetting secondary bus and updating link speed on the upstream bridge
> after AER recovery complete (post S3 resume).  But this is just a


> workaround and not a generic solution so we would like to get an advise for a generic fix for this problem.
> 
> To reiterate the full scenario is like this
> 
> 1) Boot system
> 
> 2) Extension board is first time hotplugged and dGPU is added to PCI topology
> 
> 3) System suspend S3
> 
> 4)  WE have costum BIOS which 'shuts off' the extension board during sleep so on resume the system discovers that the extension board (and dGPU) are gone and hot removes it from PCI topology. Together with this hot remove AER errors are generated and handled.
> 
> 5)We again try to hot plug though a script we have but the system won't
> detect the new hot plug of the extension board.
> 
> 5*) The given workaround patch fixes issue in bullet 5) and hot plug
> is detected and system recognizes the extension board and add it and dGPU to PCI topology.
> 
> Andrey
> 
>>
>>>
>>> On 2022-06-10 17:25, Andrey Grodzovsky wrote:
>>>>
>>>>
>>>> On 2022-02-10 09:39, Andrey Grodzovsky wrote:
>>>>> Thanks a lot for quick response, we will give this a try.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2022-02-10 01:23, Lukas Wunner wrote:
>>>>>> On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
>>>>>>> Hi, on kernel based on 5.4.2 we are observing a deadlock between
>>>>>>> reset_lock semaphore and device_lock (dev->mutex). The scenario
>>>>>>> we do is putting the system to sleep, disconnecting the eGPU
>>>>>>> from the PCIe bus (through a special SBIOS setting) or by simply
>>>>>>> removing power to external PCIe cage and waking the
>>>>>>> system up.
>>>>>>>
>>>>>>> I attached the log. Please advise if you have any idea how
>>>>>>> to work around it ? Since the kernel is old, does anyone
>>>>>>> have an idea if this issue is known and already solved in later kernels ?
>>>>>>> We cannot try with latest since our kernel is custom for that platform.
>>>>>>
>>>>>> It is a known issue.  Here's a fix I submitted during the v5.9 cycle:
>>>>>>
>>>>>> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
>>>>>>
>>>>>> The fix hasn't been applied yet.  I think I need to rework the patch,
>>>>>> just haven't found the time.
>>>>
>>>> Hey Lucas - just checking again if you had a chance to push this change
>>>> through ? It's essential to us in one of our costumer projects so we
>>>> wonder if have any estimate when will it be up-streamed and if we can
>>>> help with this. We would also need backporting this back to 5.11 and 5.4
>>>> kernels after it's upstreamed.
>>>>
>>>> Another point I want to mention is that this patch has a negative
>>>> side effect on plug back times - it causes a regression point for the delay to light-up display at resume time related to back-ported AER
>>>>
>>>> Anatoli is working on resolving this and so maybe he can add his
>>>> comment here and maybe you can help him with proper resolution for this.
>>>>
>>>> Andrey
>>>>
>>>>>>
>>>>>> Since the trigger in your case are AER-handled errors during a
>>>>>> system sleep transition, you may also want to consider the
>>>>>> following 2-patch series by Kai-Heng Feng which is currently
>>>>>> under discussion:
>>>>>>
>>>>>> https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/
>>>>>>
>>>>>> That series disables AER during a system sleep transition and
>>>>>> should thus prevent the flood of AER-handled errors you're seeing.
>>>>>> Once AER is disabled, the reset-induced deadlocks should go away as well.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Lukas
>>

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device
  2022-06-15 15:14             ` Sathyanarayanan Kuppuswamy
@ 2022-06-15 15:49               ` Andrey Grodzovsky
  0 siblings, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-06-15 15:49 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Lukas Wunner
  Cc: linux-pci, helgaas, anatoli.antonovitch, Kumar1, Rahul,
	Alexander.Deucher



On 2022-06-15 11:14, Sathyanarayanan Kuppuswamy wrote:
> 
> 
> On 6/14/22 1:35 PM, Andrey Grodzovsky wrote:
>>
>>
>> On 2022-06-14 14:22, Sathyanarayanan Kuppuswamy wrote:
>>> Hi,
>>>
>>> On 6/14/22 11:07 AM, Andrey Grodzovsky wrote:
>>>> Just a gentle ping, also - I updated the ticket https://bugzilla.kernel.org/show_bug.cgi?id=215590
>>>>
>>>> with the workaround we did if this could help you to advise us
>>>> what would be a generic solution for this ?
>>>>
>>>> Andrey
>>> Can you explain your WA? It seems to be unrelated to deadlock issue
>>> discussed in this thread. Are they related?
>>
>> So from start - originally we have an extension PCI board which is hot plug-able into our system board. On top of this extension board we have
>> AMD dGPU card. Originally we observed hang on resume from sleep (S3) in
>> AER enabled system because of race between AER and pciehp on S3 resume and so this
>> was resolved by the patch https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/T/
>>
> 
> There is patch to disable AER in suspend/resume path (from Kai-Heng Feng). Did
> you check with this patch?

Yes, this patchset [1] had no impact on the problem, and we only included
Lukas's AB-BA deadlock patch in our code, since it resolved the SW
deadlock for us.

[1] - 
https://patchwork.kernel.org/project/linux-pci/patch/20220126071853.1940111-1-kai.heng.feng@canonical.com/


> 
>> Now after this we are facing a second issue where after resume and after
>> AER driver recovery completed for pcieport the system won't detect a new
>> hotplug of the extention board into the system board. Anatoli looked
> 
> What about the hotplug events during this sequence? Did you get the
> LINK DOWN/UP or Presence change events?

I think we do get them, both in the first-time hot plug (step 2
below) and in the post-S3-resume hot plug (step 5 below). It's just
that we seem to get a timeout in pcie_wait_for_link in step 5.

Step 2) logs
Feb 10 23:36:59 amd-BILBY kernel: [   28.729523] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:36:59 amd-BILBY kernel: [   28.735552] pcieport 0000:00:01.1: pciehp: Slot(0): Link Up

Step 5) logs
Feb 10 23:41:47 amd-BILBY kernel: [  302.759503] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:41:49 amd-BILBY kernel: [  304.795473] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:41:49 amd-BILBY kernel: [  304.803146] pcieport 0000:00:01.1: pciehp: Failed to check link status
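
The "Data Link Layer Link Active not set in 1000 msec" message in the step 5 log is consistent with a bounded poll of the Link Status register running out of time. A minimal sketch of such a poll is below; it is not the kernel's pcie_wait_for_link() itself. The mask 0x2000 mirrors PCI_EXP_LNKSTA_DLLLA from pci_regs.h, while the reader callback, the 10 ms step and the two simulated ports are assumptions made purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define LNKSTA_DLLLA 0x2000 /* same mask as PCI_EXP_LNKSTA_DLLLA */

/* Hypothetical reader: returns the Link Status word as it would look
 * after elapsed_ms milliseconds of (simulated) waiting. */
typedef uint16_t (*lnksta_reader)(int elapsed_ms);

/* Poll until Data Link Layer Link Active is set or the timeout expires.
 * Returns true if the link became active in time. */
static bool wait_for_link_active(lnksta_reader read_lnksta, int timeout_ms)
{
	int elapsed = 0;

	for (;;) {
		if (read_lnksta(elapsed) & LNKSTA_DLLLA)
			return true;
		if (elapsed >= timeout_ms)
			return false;
		elapsed += 10; /* a real implementation would sleep here */
	}
}

/* Simulated port whose link trains after 30 ms (the step 2 case). */
static uint16_t link_up_after_30ms(int elapsed_ms)
{
	return elapsed_ms >= 30 ? LNKSTA_DLLLA : 0;
}

/* Simulated port whose link never trains (the step 5 case). */
static uint16_t link_never_up(int elapsed_ms)
{
	(void)elapsed_ms;
	return 0;
}
```

With these simulated ports, the first case returns true well within 1000 ms and the second returns false once the timeout is reached, which corresponds to the "Failed to check link status" path in the log.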





But maybe you meant something else, and if so, maybe you can
tell me what exactly you want me to look at?


Andrey

> 
>> into it and found the workaround that I attached that made it work by
>> resetting secondary bus and updating link speed on the upstream bridge
>> after AER recovery complete (post S3 resume).  But this is just a
> 
> 
>> workaround and not a generic solution so we would like to get an advise for a generic fix for this problem.
>>
>> To reiterate the full scenario is like this
>>
>> 1) Boot system
>>
>> 2) The extension board is hotplugged for the first time and the dGPU is added to the PCI topology
>>
>> 3) System suspend S3
>>
>> 4) We have a custom BIOS which 'shuts off' the extension board during sleep, so on resume the system discovers that the extension board (and dGPU) are gone and hot-removes them from the PCI topology. Together with this hot remove, AER errors are generated and handled.
>>
>> 5) We again try to hot plug through a script we have, but the system won't
>> detect the new hot plug of the extension board.
>>
>> 5*) The given workaround patch fixes the issue in step 5): the hot plug
>> is detected, and the system recognizes the extension board and adds it and the dGPU to the PCI topology.
>>
>> Andrey
>>
>>>
>>>>
>>>> On 2022-06-10 17:25, Andrey Grodzovsky wrote:
>>>>>
>>>>>
>>>>> On 2022-02-10 09:39, Andrey Grodzovsky wrote:
>>>>>> Thanks a lot for quick response, we will give this a try.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2022-02-10 01:23, Lukas Wunner wrote:
>>>>>>> On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
>>>>>>>> Hi, on kernel based on 5.4.2 we are observing a deadlock between
>>>>>>>> reset_lock semaphore and device_lock (dev->mutex). The scenario
>>>>>>>> we do is putting the system to sleep, disconnecting the eGPU
>>>>>>>> from the PCIe bus (through a special SBIOS setting) or by simply
>>>>>>>> removing power to external PCIe cage and waking the
>>>>>>>> system up.
>>>>>>>>
>>>>>>>> I attached the log. Please advise if you have any idea how
>>>>>>>> to work around it ? Since the kernel is old, does anyone
>>>>>>>> have an idea if this issue is known and already solved in later kernels ?
>>>>>>>> We cannot try with latest since our kernel is custom for that platform.
>>>>>>>
>>>>>>> It is a known issue.  Here's a fix I submitted during the v5.9 cycle:
>>>>>>>
>>>>>>> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
>>>>>>>
>>>>>>> The fix hasn't been applied yet.  I think I need to rework the patch,
>>>>>>> just haven't found the time.
>>>>>
>>>>> Hey Lukas - just checking again if you had a chance to push this change
>>>>> through? It's essential to us in one of our customer projects, so we
>>>>> wonder if you have any estimate of when it will be upstreamed and if we
>>>>> can help with this. We would also need to backport it to the 5.11 and
>>>>> 5.4 kernels after it's upstreamed.
>>>>>
>>>>> Another point I want to mention is that this patch has a negative
>>>>> side effect on plug-back times: it regresses the delay to light up the
>>>>> display at resume time, related to back-ported AER.
>>>>>
>>>>> Anatoli is working on resolving this and so maybe he can add his
>>>>> comment here and maybe you can help him with proper resolution for this.
>>>>>
>>>>> Andrey
>>>>>
>>>>>>>
>>>>>>> Since the trigger in your case are AER-handled errors during a
>>>>>>> system sleep transition, you may also want to consider the
>>>>>>> following 2-patch series by Kai-Heng Feng which is currently
>>>>>>> under discussion:
>>>>>>>
>>>>>>> https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/
>>>>>>>
>>>>>>> That series disables AER during a system sleep transition and
>>>>>>> should thus prevent the flood of AER-handled errors you're seeing.
>>>>>>> Once AER is disabled, the reset-induced deadlocks should go away as well.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Lukas
>>>
> 


end of thread, other threads:[~2022-06-15 15:49 UTC | newest]

Thread overview: 15+ messages
     [not found] <0fc31d9a-f414-a412-3765-5519cbb9b7ff@amd.com>
2022-02-09 21:28 ` Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device Andrey Grodzovsky
2022-02-10  6:23 ` Lukas Wunner
2022-02-10 14:39   ` Andrey Grodzovsky
2022-06-10 21:25     ` Andrey Grodzovsky
2022-06-14 18:07       ` Andrey Grodzovsky
2022-06-14 18:22         ` Sathyanarayanan Kuppuswamy
2022-06-14 20:35           ` Andrey Grodzovsky
2022-06-15 15:14             ` Sathyanarayanan Kuppuswamy
2022-06-15 15:49               ` Andrey Grodzovsky
2022-02-10 20:47   ` Andrey Grodzovsky
2022-02-10 21:37     ` Lukas Wunner
2022-02-10 23:12       ` Andrey Grodzovsky
2022-02-11 14:42       ` Kumar1, Rahul
2022-02-15  7:02         ` Lukas Wunner
2022-02-15  8:18           ` Kumar1, Rahul
