Re: Question about deadlock between AER and pciehp interrupts during resume from S3 with unplugged device

From: Andrey Grodzovsky @ 2022-02-09 21:28 UTC
  To: linux-pci, helgaas, Lukas Wunner
  Cc: anatoli.antonovitch, Kumar1, Rahul, Alexander.Deucher

I got a bounce-back reply, 'Message too long (>100000 chars)', so I am
resending with only the essential part of the log inline here:

[   56.138636] ACPI: Waking up from system sleep state S3
[   56.140541] pcieport 0000:01:00.0: Refused to change power state, currently in D3
[   56.143542] pcieport 0000:02:00.0: Refused to change power state, currently in D3
[   56.146517] amdgpu 0000:03:00.0: Refused to change power state, currently in D3
[   56.209416] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[   56.209438] pcieport 0000:00:01.1: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[   56.209440] pcieport 0000:00:01.1: AER:   device [1022:15d3] error status/mask=00004000/04400000
[   56.209441] pcieport 0000:00:01.1: AER:    [14] CmpltTO (First)
[   56.209817] sd 0:0:0:0: [sda] Starting disk
[   56.211483] [drm] PCIE GART of 1024M enabled.
[   56.211484] [drm] PTB located at 0x000000F400E10000
[   56.211508] [drm] PSP is resuming...
[   56.231386] [drm] reserve 0x400000 from 0xf41fc00000 for PSP TMR
[   56.312520] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
[   56.320623] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   56.326446] [drm] kiq ring mec 2 pipe 1 q 0
[   56.326919] amdgpu: restore the fine grain parameters
[   56.539633] [drm] VCN decode and encode initialized successfully(under SPG Mode).
[   56.539655] amdgpu 0000:05:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[   56.539656] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[   56.539657] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[   56.539658] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[   56.539660] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[   56.539661] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[   56.539662] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[   56.539663] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[   56.539664] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[   56.539665] amdgpu 0000:05:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[   56.539666] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[   56.539667] amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
[   56.539668] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
[   56.539669] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
[   56.539670] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
[   56.685926] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   56.686175] ata1.00: supports DRM functions and may not be fully accessible
[   56.686848] ata1.00: disabling queued TRIM support
[   56.688408] ata1.00: supports DRM functions and may not be fully accessible
[   56.688925] ata1.00: disabling queued TRIM support
[   56.690217] ata1.00: configured for UDMA/133
[   57.246588] pcieport 0000:00:01.1: AER: Root Port link has been reset
[   57.246635] pcieport 0000:00:01.1: AER: Device recovery failed
[   57.246668] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[   57.247019] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[   57.247198] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[   57.247212] pcieport 0000:00:01.1: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[   57.247214] pcieport 0000:00:01.1: AER:   device [1022:15d3] error status/mask=00004000/04400000
[   57.247217] pcieport 0000:00:01.1: AER:    [14] CmpltTO (First)
[   59.038917] pci 0000:03:00.0: Removing from iommu group 21
[   59.039314] pci_bus 0000:03: busn_res: [bus 03] is released
[   59.039790] acpi LNXPOWER:08: Turning OFF
[   59.040014] acpi LNXPOWER:07: Turning OFF
[   59.040296] acpi LNXPOWER:04: Turning OFF
[   59.040500] acpi LNXPOWER:03: Turning OFF
[   59.040682] OOM killer enabled.
[   59.040682] Restarting tasks ...
[   59.041112] systemd-journald[342]: /dev/kmsg buffer overrun, some messages lost.
[   59.047174] done.
[   59.047182] PM: suspend exit
[   61.382560] show_signal_msg: 29 callbacks suppressed
[   61.382563] glmark2[1891]: segfault at 0 ip 00007fdebc1cbd85 sp 00007ffd56800870 error 4 in radeonsi_dri.so[7fdebb972000+a94000]
[   61.382574] Code: 00 4c 39 ed 74 6f 49 89 fc eb 1f 66 2e 0f 1f 84 00 00 00 00 00 48 89 ef e8 08 a2 7a ff 49 8b ac 24 e0 77 00 00 4c 39 ed 74 4b <48> 8b 55 00 48 8b 45 08 48 8b 5d 10 48 89 42 08 48 89 10 48 c7 45
[  243.354138] INFO: task irq/26-aerdrv:170 blocked for more than 120 seconds.
[  243.354145]       Not tainted 5.4.2-10-feb+ #51
[  243.354147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.354150] irq/26-aerdrv   D    0   170      2 0x80004000
[  243.354156] Call Trace:
[  243.354170]  ? __schedule+0x2e3/0x740
[  243.354173]  schedule+0x39/0xa0
[  243.354179]  rwsem_down_write_slowpath+0x244/0x4d0
[  243.354183]  ? schedule+0x39/0xa0
[  243.354186]  ? schedule_preempt_disabled+0xa/0x10
[  243.354192]  pciehp_reset_slot+0x51/0x150
[  243.354198]  pci_reset_hotplug_slot+0x3c/0x60
[  243.354202]  pci_slot_reset+0x107/0x130
[  243.354205]  pci_bus_error_reset+0xf3/0x120
[  243.354210]  aer_root_reset+0x5c/0xf0
[  243.354214]  pcie_do_recovery+0x13e/0x275
[  243.354217]  aer_process_err_devices+0xb2/0xc7
[  243.354220]  aer_isr.cold+0x50/0x9f
[  243.354223]  ? __schedule+0x2eb/0x740
[  243.354228]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
[  243.354230]  irq_thread_fn+0x20/0x60
[  243.354234]  irq_thread+0xdc/0x170
[  243.354237]  ? irq_forced_thread_fn+0x80/0x80
[  243.354241]  kthread+0xf9/0x130
[  243.354245]  ? irq_thread_check_affinity+0xf0/0xf0
[  243.354247]  ? kthread_park+0x90/0x90
[  243.354250]  ret_from_fork+0x22/0x40
[  243.354255] INFO: task irq/26-pciehp:171 blocked for more than 120 seconds.
[  243.354257]       Not tainted 5.4.2-10-feb+ #51
[  243.354259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.354261] irq/26-pciehp   D    0   171      2 0x80004000
[  243.354263] Call Trace:
[  243.354266]  ? __schedule+0x2e3/0x740
[  243.354269]  schedule+0x39/0xa0
[  243.354271]  schedule_preempt_disabled+0xa/0x10
[  243.354274]  __mutex_lock.isra.0+0x182/0x4f0
[  243.354279]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
[  243.354284]  device_del+0x35/0x370
[  243.354288]  pci_remove_bus_device+0x77/0x100
[  243.354292]  pci_remove_bus_device+0x2e/0x100
[  243.354296]  pciehp_unconfigure_device+0x7c/0x12f
[  243.354299]  pciehp_disable_slot+0x6b/0x100
[  243.354303]  pciehp_handle_presence_or_link_change+0xdc/0x140
[  243.354306]  pciehp_ist+0x10f/0x120
[  243.354309]  irq_thread_fn+0x20/0x60
[  243.354312]  irq_thread+0xdc/0x170
[  243.354316]  ? irq_forced_thread_fn+0x80/0x80
[  243.354318]  kthread+0xf9/0x130
[  243.354321]  ? irq_thread_check_affinity+0xf0/0xf0
[  243.354323]  ? kthread_park+0x90/0x90
[  243.354326]  ret_from_fork+0x22/0x40
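
For readers who want to pull the blocked tasks and their call chains out of a log like the one above, here is a minimal sketch in Python. The function name blocked_tasks and the embedded SAMPLE excerpt are illustrative only; the regular expressions assume the dmesg layout shown in this message. Frames marked with "?" are skipped, since the kernel prints those as unreliable stack entries.

```python
import re

# Matches the hung-task header. The trailing "120 seconds." is deliberately
# not required, because mail clients may wrap it onto the next line.
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than")

# Matches one stack frame line: "[ time ]  [? ]symbol+0xoff/0xsize".
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)

# Short excerpt of the log above, used only to exercise the parser.
SAMPLE = """\
[  243.354138] INFO: task irq/26-aerdrv:170 blocked for more than 120 seconds.
[  243.354156] Call Trace:
[  243.354170]  ? __schedule+0x2e3/0x740
[  243.354173]  schedule+0x39/0xa0
[  243.354179]  rwsem_down_write_slowpath+0x244/0x4d0
[  243.354192]  pciehp_reset_slot+0x51/0x150
[  243.354255] INFO: task irq/26-pciehp:171 blocked for more than 120 seconds.
[  243.354263] Call Trace:
[  243.354266]  ? __schedule+0x2e3/0x740
[  243.354269]  schedule+0x39/0xa0
[  243.354274]  __mutex_lock.isra.0+0x182/0x4f0
[  243.354284]  device_del+0x35/0x370
"""


def blocked_tasks(log_text):
    """Map each hung task name to its list of reliable frame symbols."""
    tasks = {}
    current = None
    for line in log_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = header.group(1)
            tasks[current] = []
            continue
        frame = FRAME_RE.match(line)
        # group(1) is set only for "? symbol" frames, which are skipped.
        if frame and current is not None and frame.group(1) is None:
            tasks[current].append(frame.group(2))
    return tasks


if __name__ == "__main__":
    for name, frames in blocked_tasks(SAMPLE).items():
        print(name, "->", " <- ".join(frames))
```

Frames are listed innermost first, so the first lock-related symbol for each task (rwsem_down_write_slowpath for the AER thread, __mutex_lock for the pciehp thread) shows what that task is waiting on.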

Andrey


On 2022-02-09 14:54, Andrey Grodzovsky wrote:
> Hi, on a kernel based on 5.4.2 we are observing a deadlock between the
> reset_lock semaphore and device_lock (dev->mutex). The scenario is:
> put the system to sleep, disconnect the eGPU from the PCIe bus
> (either through a special SBIOS setting or by simply removing power
> from the external PCIe cage), then wake the system up.
> 
> I attached the log. Please advise if you have any idea how to work
> around it. Since the kernel is old, does anyone know whether this
> issue is known and already fixed in later kernels? We cannot try the
> latest kernel since ours is custom for that platform.
> 
> Thanks,
> Andrey
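
The deadlock described in the quoted message has the shape of a classic lock-order inversion between two threads. Below is a minimal Python sketch that models it as a lock-order graph and checks for a cycle. The per-thread orders are an assumption inferred from the two hung-task traces, which show only the lock each thread is waiting for: the AER thread blocks in pciehp_reset_slot() on reset_lock (presumably already holding device_lock), and the pciehp thread blocks in device_del() on device_lock (presumably already holding reset_lock). The names ACQUISITIONS, lock_order_edges and has_cycle are hypothetical and not part of any kernel API.

```python
from collections import defaultdict

# Lock order per thread. ASSUMPTION: the first lock in each list is already
# held when the thread blocks on the second; the traces show only the second.
ACQUISITIONS = {
    "irq/26-aerdrv": ["device_lock", "reset_lock"],
    "irq/26-pciehp": ["reset_lock", "device_lock"],
}


def lock_order_edges(acquisitions):
    """Return the set of (held, wanted) pairs implied by each thread's order."""
    edges = set()
    for order in acquisitions.values():
        for i, held in enumerate(order):
            for wanted in order[i + 1:]:
                edges.add((held, wanted))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed lock-order graph."""
    graph = defaultdict(set)
    for held, wanted in edges:
        graph[held].add(wanted)

    WHITE, GREY, BLACK = 0, 1, 2
    state = defaultdict(int)  # every node starts WHITE (unvisited)

    def visit(node):
        state[node] = GREY  # node is on the current DFS path
        for nxt in graph.get(node, ()):
            if state[nxt] == GREY:
                return True  # back edge: the lock order is circular
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[node] = BLACK  # fully explored, no cycle through this node
        return False

    return any(state[n] == WHITE and visit(n) for n in list(graph))


if __name__ == "__main__":
    print("inversion detected:", has_cycle(lock_order_edges(ACQUISITIONS)))
```

A cycle in this graph means each thread can end up holding the lock the other one needs. This is the same kind of check the kernel's lockdep validator performs at runtime, though lockdep is far more elaborate than this sketch.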
