linux-pci.vger.kernel.org archive mirror
* [PATCHv2 0/5] aer handling fixups
@ 2021-01-04 23:02 Keith Busch
  2021-01-04 23:02 ` [PATCHv2 1/5] PCI/ERR: Clear status of the reporting device Keith Busch
                   ` (9 more replies)
  0 siblings, 10 replies; 36+ messages in thread
From: Keith Busch @ 2021-01-04 23:02 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas; +Cc: Keith Busch

Changes from v1:

  Added received Acks

  Split the kernel print identifying the port type being reset.

  Added a patch for the portdrv to ensure the slot_reset happens without
  relying on a downstream device driver.

Keith Busch (5):
  PCI/ERR: Clear status of the reporting device
  PCI/AER: Actually get the root port
  PCI/ERR: Retain status from error notification
  PCI/AER: Specify the type of port that was reset
  PCI/portdrv: Report reset for frozen channel

 drivers/pci/pcie/aer.c         |  5 +++--
 drivers/pci/pcie/err.c         | 16 +++++++---------
 drivers/pci/pcie/portdrv_pci.c |  3 ++-
 3 files changed, 12 insertions(+), 12 deletions(-)

-- 
2.24.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCHv2 1/5] PCI/ERR: Clear status of the reporting device
  2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
@ 2021-01-04 23:02 ` Keith Busch
  2021-01-04 23:02 ` [PATCHv2 2/5] PCI/AER: Actually get the root port Keith Busch
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Keith Busch @ 2021-01-04 23:02 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas; +Cc: Keith Busch, Sean V Kelley

Error handling operates on the first downstream port above the detected
error, but the error may have been reported by a downstream device.
Clear the AER status of the device that reported the error rather than
the first downstream port.

Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/pcie/err.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 510f31f0ef6d..a84f0bf4c1e2 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -231,15 +231,14 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	pci_walk_bridge(bridge, report_resume, &status);
 
 	/*
-	 * If we have native control of AER, clear error status in the Root
-	 * Port or Downstream Port that signaled the error.  If the
-	 * platform retained control of AER, it is responsible for clearing
-	 * this status.  In that case, the signaling device may not even be
-	 * visible to the OS.
+	 * If we have native control of AER, clear error status in the device
+	 * that detected the error.  If the platform retained control of AER,
+	 * it is responsible for clearing this status.  In that case, the
+	 * signaling device may not even be visible to the OS.
 	 */
 	if (host->native_aer || pcie_ports_native) {
-		pcie_clear_device_status(bridge);
-		pci_aer_clear_nonfatal_status(bridge);
+		pcie_clear_device_status(dev);
+		pci_aer_clear_nonfatal_status(dev);
 	}
 	pci_info(bridge, "device recovery successful\n");
 	return status;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCHv2 2/5] PCI/AER: Actually get the root port
  2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
  2021-01-04 23:02 ` [PATCHv2 1/5] PCI/ERR: Clear status of the reporting device Keith Busch
@ 2021-01-04 23:02 ` Keith Busch
  2021-01-04 23:02 ` [PATCHv2 3/5] PCI/ERR: Retain status from error notification Keith Busch
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Keith Busch @ 2021-01-04 23:02 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas; +Cc: Keith Busch, Sean V Kelley

The pci_dev parameter given to aer_root_reset() may be a Downstream Port
rather than the Root Port. Get the Root Port from the provided device in
order to clear the Root Port's AER status.

Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/pcie/aer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 77b0f2c45bc0..3fd4aaaa627e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1388,7 +1388,7 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	if (type == PCI_EXP_TYPE_RC_END)
 		root = dev->rcec;
 	else
-		root = dev;
+		root = pcie_find_root_port(dev);
 
 	/*
 	 * If the platform retained control of AER, an RCiEP may not have
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCHv2 3/5] PCI/ERR: Retain status from error notification
  2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
  2021-01-04 23:02 ` [PATCHv2 1/5] PCI/ERR: Clear status of the reporting device Keith Busch
  2021-01-04 23:02 ` [PATCHv2 2/5] PCI/AER: Actually get the root port Keith Busch
@ 2021-01-04 23:02 ` Keith Busch
  2021-03-03  5:34   ` Williams, Dan J
  2021-01-04 23:02 ` [PATCHv2 4/5] PCI/AER: Specify the type of port that was reset Keith Busch
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Keith Busch @ 2021-01-04 23:02 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas; +Cc: Keith Busch, Hinko Kocevar, Sean V Kelley

Overwriting the frozen detected status with the result of the link reset
loses the NEED_RESET result that drivers depend on to have their
.slot_reset() callback invoked. Retain the detected status so that
subsequent error handling follows the correct flow.

Reported-by: Hinko Kocevar <hinko.kocevar@ess.eu>
Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/pcie/err.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index a84f0bf4c1e2..b576aa890c76 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -198,8 +198,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	pci_dbg(bridge, "broadcast error_detected message\n");
 	if (state == pci_channel_io_frozen) {
 		pci_walk_bridge(bridge, report_frozen_detected, &status);
-		status = reset_subordinates(bridge);
-		if (status != PCI_ERS_RESULT_RECOVERED) {
+		if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
 			pci_warn(bridge, "subordinate device reset failed\n");
 			goto failed;
 		}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCHv2 4/5] PCI/AER: Specify the type of port that was reset
  2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
                   ` (2 preceding siblings ...)
  2021-01-04 23:02 ` [PATCHv2 3/5] PCI/ERR: Retain status from error notification Keith Busch
@ 2021-01-04 23:02 ` Keith Busch
  2021-01-04 23:03 ` [PATCHv2 5/5] PCI/portdrv: Report reset for frozen channel Keith Busch
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Keith Busch @ 2021-01-04 23:02 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas; +Cc: Keith Busch

The AER driver may be called upon to reset either a Downstream Port or a
Root Port. Check which type it is so the port is properly identified when
logging that the reset occurred.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/pcie/aer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3fd4aaaa627e..ba22388342d1 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1414,7 +1414,8 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 		}
 	} else {
 		rc = pci_bus_error_reset(dev);
-		pci_info(dev, "Root Port link has been reset (%d)\n", rc);
+		pci_info(dev, "%s Port link has been reset (%d)\n",
+			pci_is_root_bus(dev->bus) ? "Root" : "Downstream", rc);
 	}
 
 	if ((host->native_aer || pcie_ports_native) && aer) {
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCHv2 5/5] PCI/portdrv: Report reset for frozen channel
  2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
                   ` (3 preceding siblings ...)
  2021-01-04 23:02 ` [PATCHv2 4/5] PCI/AER: Specify the type of port that was reset Keith Busch
@ 2021-01-04 23:03 ` Keith Busch
  2021-01-05 14:21 ` [PATCHv2 0/5] aer handling fixups Hinko Kocevar
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Keith Busch @ 2021-01-04 23:03 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas; +Cc: Keith Busch

PCI error recovery always resets the link for a frozen channel state, so
the port driver should return that a reset is required. This gets the
.slot_reset() callback invoked, which is necessary to restore the port's
config space. Without this, the port driver had been relying on
downstream device drivers to return this status.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/pcie/portdrv_pci.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 0b250bc5f405..de141bfb0bc2 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -153,7 +153,8 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
 static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
 					pci_channel_state_t error)
 {
-	/* Root Port has no impact. Always recovers. */
+	if (error == pci_channel_io_frozen)
+		return PCI_ERS_RESULT_NEED_RESET;
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
                   ` (4 preceding siblings ...)
  2021-01-04 23:03 ` [PATCHv2 5/5] PCI/portdrv: Report reset for frozen channel Keith Busch
@ 2021-01-05 14:21 ` Hinko Kocevar
  2021-01-05 15:06   ` Hinko Kocevar
  2021-02-03  0:03 ` Keith Busch
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Hinko Kocevar @ 2021-01-05 14:21 UTC (permalink / raw)
  To: Keith Busch, linux-pci, Bjorn Helgaas



On 1/5/21 12:02 AM, Keith Busch wrote:
> Changes from v1:
> 
>    Added received Acks
> 
>    Split the kernel print identifying the port type being reset.
> 
>    Added a patch for the portdrv to ensure the slot_reset happens without
>    relying on a downstream device driver..
> 
> Keith Busch (5):
>    PCI/ERR: Clear status of the reporting device
>    PCI/AER: Actually get the root port
>    PCI/ERR: Retain status from error notification
>    PCI/AER: Specify the type of port that was reset
>    PCI/portdrv: Report reset for frozen channel
> 
>   drivers/pci/pcie/aer.c         |  5 +++--
>   drivers/pci/pcie/err.c         | 16 +++++++---------
>   drivers/pci/pcie/portdrv_pci.c |  3 ++-
>   3 files changed, 12 insertions(+), 12 deletions(-)
> 

Dear Keith,

I've tested the v2 series on the same setup I tested the v1 series on a
bit earlier (the patches were applied to the same linux-pci git tree).

I'm seeing these messages in dmesg after an error is injected and the
recovery process starts:

[  397.434293] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! 
[irq/122-aerdrv:128]
[  397.441872] Modules linked in: xdma(O) aer_inject xt_CHECKSUM 
iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun 
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rfkill 
sunrpc dm_mirror dm_region_hash dm_log dm_mod x86_pkg_temp_thermal 
intel_powerclamp coretemp snd_hda_codec_hdmi kvm_intel kvm snd_hda_intel 
snd_intel_dspcfg irqbypass i2c_designware_platform iTCO_wdt 
snd_hda_codec i2c_designware_core crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel iTCO_vendor_support snd_hda_core snd_hwdep mei_wdt 
intel_wmi_thunderbolt snd_seq snd_seq_device aesni_intel crypto_simd 
snd_pcm cryptd glue_helper rapl i2c_i801 snd_timer snd intel_cstate 
soundcore i2c_smbus pcspkr intel_lpss_pci intel_lpss mei_me idma64 
virt_dma sg joydev input_leds mei mfd_core intel_pch_thermal plx_dma wmi 
pinctrl_sunrisepoint intel_pmc_core acpi_pad ie31200_edac ip_tables xfs 
libcrc32c sd_mod t10_pi i915 cec rc_core
[  397.441925]  drm_kms_helper igb ptp ahci crc32c_intel syscopyarea 
serio_raw sysfillrect pps_core libahci sysimgblt dca fb_sys_fops 
i2c_algo_bit libata drm video
[  397.542794] CPU: 2 PID: 128 Comm: irq/122-aerdrv Tainted: G S 
  O      5.10.0-1.keithv2.el7.x86_64 #1
[  397.552530] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 
02/19/2019
[  397.559663] RIP: 0010:pci_bus_write_config_dword+0x0/0x30
[  397.565069] Code: c2 01 75 1f 55 48 8b 87 c0 00 00 00 44 0f b7 c1 b9 
02 00 00 00 48 8b 40 20 48 89 e5 e8 a9 de 70 00 5d c3 b8 87 00 00 00 c3 
90 <0f> 1f 44 00 00 f6 c2 03 75 1e 55 48 8b 87 c0 00 00 00 41 89 c8 b9
[  397.583819] RSP: 0018:ffffc90000303b88 EFLAGS: 00000297
[  397.589044] RAX: 0000000000000000 RBX: ffff8881011671cc RCX: 
00000000071f1f1f
[  397.596182] RDX: 00000000000001a4 RSI: 0000000000000008 RDI: 
ffff8881013d9000
[  397.603319] RBP: ffffc90000303b90 R08: 0000000000000004 R09: 
00000000071f1f00
[  397.610454] R10: 0000000000000000 R11: ffffc90000303950 R12: 
ffff888101400000
[  397.617588] R13: 0000000000000000 R14: ffff8881011671dc R15: 
00000000fee98fdc
[  397.624722] FS:  0000000000000000(0000) GS:ffff88845dc80000(0000) 
knlGS:0000000000000000
[  397.632814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  397.638565] CR2: 00007fe94d450000 CR3: 000000000240a001 CR4: 
00000000003706e0
[  397.645702] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  397.652842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[  397.659978] Call Trace:
[  397.662433]  ? pci_write_config_dword+0x22/0x30
[  397.666974]  pci_vc_save_restore_dwords+0x59/0x70
[  397.671682]  pci_vc_do_save_buffer+0x3da/0x5b0
[  397.676131]  ? pci_read_config_dword+0x23/0x40
[  397.680583]  pci_restore_vc_state+0x4d/0x70
[  397.684776]  pci_restore_state.part.44+0x177/0x420
[  397.689573]  pci_restore_state+0x18/0x20
[  397.693507]  pcie_portdrv_slot_reset+0x12/0x30
[  397.697957]  report_slot_reset+0x46/0x80
[  397.701889]  ? merge_result.part.4+0x40/0x40
[  397.706163]  pci_walk_bus+0x48/0x90
[  397.709660]  pci_walk_bridge+0x1d/0x30
[  397.713415]  pcie_do_recovery+0x1bd/0x270
[  397.717434]  ? aer_dev_correctable_show+0xd0/0xd0
[  397.722147]  ? aer_dev_correctable_show+0xd0/0xd0
[  397.726854]  aer_process_err_devices+0xd0/0xe9
[  397.731308]  aer_isr.cold.25+0x54/0xa3
[  397.735067]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
[  397.740122]  irq_thread_fn+0x24/0x60
[  397.743713]  irq_thread+0xea/0x170
[  397.747121]  ? irq_forced_thread_fn+0x80/0x80
[  397.751481]  ? irq_thread_check_affinity+0xf0/0xf0
[  397.756280]  kthread+0x126/0x140
[  397.759521]  ? kthread_park+0x90/0x90
[  397.763191]  ret_from_fork+0x22/0x30
[  425.434159] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! 
[irq/122-aerdrv:128]
[  425.441727] Modules linked in: xdma(O) aer_inject xt_CHECKSUM 
iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun 
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rfkill 
sunrpc dm_mirror dm_region_hash dm_log dm_mod x86_pkg_temp_thermal 
intel_powerclamp coretemp snd_hda_codec_hdmi kvm_intel kvm snd_hda_intel 
snd_intel_dspcfg irqbypass i2c_designware_platform iTCO_wdt 
snd_hda_codec i2c_designware_core crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel iTCO_vendor_support snd_hda_core snd_hwdep mei_wdt 
intel_wmi_thunderbolt snd_seq snd_seq_device aesni_intel crypto_simd 
snd_pcm cryptd glue_helper rapl i2c_i801 snd_timer snd intel_cstate 
soundcore i2c_smbus pcspkr intel_lpss_pci intel_lpss mei_me idma64 
virt_dma sg joydev input_leds mei mfd_core intel_pch_thermal plx_dma wmi 
pinctrl_sunrisepoint intel_pmc_core acpi_pad ie31200_edac ip_tables xfs 
libcrc32c sd_mod t10_pi i915 cec rc_core
[  425.441781]  drm_kms_helper igb ptp ahci crc32c_intel syscopyarea 
serio_raw sysfillrect pps_core libahci sysimgblt dca fb_sys_fops 
i2c_algo_bit libata drm video
[  425.542649] CPU: 2 PID: 128 Comm: irq/122-aerdrv Tainted: G S 
  O L    5.10.0-1.keithv2.el7.x86_64 #1
[  425.552383] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 
02/19/2019
[  425.559517] RIP: 0010:clocksource_watchdog+0xb7/0x2b0
[  425.564581] Code: 44 00 00 48 89 df 48 8b 03 e8 05 ab ad 00 48 8b 3d 
e6 cc a6 01 49 89 c5 48 8b 07 e8 f3 aa ad 00 49 89 c7 fb 66 0f 1f 44 00 
00 <48> 8b 43 50 a8 10 0f 84 2b 01 00 00 8b 15 77 cc a6 01 85 d2 0f 85
[  425.583332] RSP: 0018:ffffc90000158e98 EFLAGS: 00000206
[  425.588563] RAX: 000000005b0c2b18 RBX: ffffffff82435c60 RCX: 
0000000000000046
[  425.595702] RDX: 5b0c2b1800000000 RSI: 403a197300000000 RDI: 
0000000000000046
[  425.602839] RBP: ffffc90000158ed8 R08: 0000000000000000 R09: 
000000000000001b
[  425.609978] R10: 0000000000000001 R11: ffff888102db4000 R12: 
0000000000000000
[  425.617120] R13: 000001343b891adb R14: 0000000000000000 R15: 
000000005b0c2b18
[  425.624261] FS:  0000000000000000(0000) GS:ffff88845dc80000(0000) 
knlGS:0000000000000000
[  425.632353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  425.638100] CR2: 00007fe94d450000 CR3: 000000000240a001 CR4: 
00000000003706e0
[  425.645232] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  425.652365] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[  425.659501] Call Trace:
[  425.661957]  <IRQ>
[  425.663981]  ? find_next_bit+0x14/0x20
[  425.667738]  ? __clocksource_unstable+0x70/0x70
[  425.672273]  call_timer_fn+0x2e/0x100
[  425.675945]  run_timer_softirq+0x1d4/0x400
[  425.680054]  ? execlists_submission_tasklet+0x93/0x180 [i915]
[  425.685856]  ? tasklet_action_common.isra.23+0x69/0x100
[  425.691085]  __do_softirq+0xdb/0x29e
[  425.694669]  asm_call_irq_on_stack+0x12/0x20
[  425.698950]  </IRQ>
[  425.701064]  do_softirq_own_stack+0x3d/0x50
[  425.705262]  irq_exit_rcu+0xd2/0xe0
[  425.708762]  common_interrupt+0x81/0x150
[  425.712701]  ? irq_exit_rcu+0x45/0xe0
[  425.716372]  asm_common_interrupt+0x1e/0x40
[  425.720563] RIP: 0010:pci_mmcfg_read+0x11/0xe0
[  425.725023] Code: 89 e0 5b 41 5c 41 5d 5d c3 45 31 e4 eb e2 cc cc cc 
cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 81 fa ff 00 00 00 48 89 e5 41 
57 <41> 56 41 89 d6 0f 97 c2 81 f9 ff 0f 00 00 0f 9f c0 41 55 08 c2 41
[  425.743776] RSP: 0018:ffffc90000303b20 EFLAGS: 00000297
[  425.749006] RAX: ffffffff81980df0 RBX: ffffc90000303b96 RCX: 
0000000000000162
[  425.756141] RDX: 0000000000000008 RSI: 0000000000000002 RDI: 
0000000000000000
[  425.763275] RBP: ffffc90000303b28 R08: 0000000000000002 R09: 
ffffc90000303b5c
[  425.770411] R10: 0000000000000008 R11: ffffc90000303950 R12: 
0000000000000001
[  425.777542] R13: 0000000000000162 R14: ffff888101400000 R15: 
0000000000000064
[  425.784678]  ? pcibios_retrieve_fw_addr+0x90/0x90
[  425.789386]  raw_pci_read+0x35/0x40
[  425.792885]  pci_read+0x2c/0x30
[  425.796036]  pci_bus_read_config_word+0x4a/0x70
[  425.800573]  pci_read_config_word+0x23/0x40
[  425.804760]  pci_wait_for_pending+0x49/0xb0
[  425.808946]  ? raw_pci_write+0x36/0x40
[  425.812700]  pci_vc_do_save_buffer+0x453/0x5b0
[  425.817152]  ? pci_read_config_dword+0x23/0x40
[  425.821605]  pci_restore_vc_state+0x4d/0x70
[  425.825799]  pci_restore_state.part.44+0x177/0x420
[  425.830601]  pci_restore_state+0x18/0x20
[  425.834532]  pcie_portdrv_slot_reset+0x12/0x30
[  425.838991]  report_slot_reset+0x46/0x80
[  425.842919]  ? merge_result.part.4+0x40/0x40
[  425.847197]  pci_walk_bus+0x48/0x90
[  425.850696]  pci_walk_bridge+0x1d/0x30
[  425.854455]  pcie_do_recovery+0x1bd/0x270
[  425.858472]  ? aer_dev_correctable_show+0xd0/0xd0
[  425.863181]  ? aer_dev_correctable_show+0xd0/0xd0
[  425.867889]  aer_process_err_devices+0xd0/0xe9
[  425.872334]  aer_isr.cold.25+0x54/0xa3
[  425.876090]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
[  425.881144]  irq_thread_fn+0x24/0x60
[  425.884722]  irq_thread+0xea/0x170
[  425.888132]  ? irq_forced_thread_fn+0x80/0x80
[  425.892498]  ? irq_thread_check_affinity+0xf0/0xf0
[  425.897296]  kthread+0x126/0x140
[  425.900532]  ? kthread_park+0x90/0x90
[  425.904204]  ret_from_fork+0x22/0x30
[  430.737050] rcu: INFO: rcu_sched self-detected stall on CPU
[  430.742631] rcu: 	2-....: (10176 ticks this GP) 
idle=01a/1/0x4000000000000000 softirq=8276/8276 fqs=13750
[  430.752283] 	(t=60000 jiffies g=12873 q=14518)
[  430.756732] NMI backtrace for cpu 2
[  430.760223] CPU: 2 PID: 128 Comm: irq/122-aerdrv Tainted: G S 
  O L    5.10.0-1.keithv2.el7.x86_64 #1
[  430.769954] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 
02/19/2019
[  430.777089] Call Trace:
[  430.779542]  <IRQ>
[  430.781562]  dump_stack+0x74/0x92
[  430.784896]  nmi_cpu_backtrace.cold.8+0x32/0x68
[  430.789435]  ? lapic_can_unplug_cpu+0x80/0x80
[  430.793798]  nmi_trigger_cpumask_backtrace+0xe6/0xf0
[  430.798770]  arch_trigger_cpumask_backtrace+0x19/0x20
[  430.803830]  rcu_dump_cpu_stacks+0xb6/0xe6
[  430.807937]  rcu_sched_clock_irq.cold.92+0x124/0x3a9
[  430.812908]  ? account_system_index_time+0x94/0xa0
[  430.817708]  ? tick_sched_do_timer+0x60/0x60
[  430.821982]  update_process_times+0x93/0xd0
[  430.826173]  tick_sched_handle.isra.23+0x25/0x60
[  430.830798]  tick_sched_timer+0x6b/0x80
[  430.834638]  __hrtimer_run_queues+0x105/0x240
[  430.839002]  hrtimer_interrupt+0x101/0x220
[  430.843112]  __sysvec_apic_timer_interrupt+0x62/0xe0
[  430.848089]  sysvec_apic_timer_interrupt+0x35/0x90
[  430.852885]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  430.858026] RIP: 0010:clocksource_watchdog+0xb7/0x2b0
[  430.863086] Code: 44 00 00 48 89 df 48 8b 03 e8 05 ab ad 00 48 8b 3d 
e6 cc a6 01 49 89 c5 48 8b 07 e8 f3 aa ad 00 49 89 c7 fb 66 0f 1f 44 00 
00 <48> 8b 43 50 a8 10 0f 84 2b 01 00 00 8b 15 77 cc a6 01 85 d2 0f 85
[  430.881838] RSP: 0018:ffffc90000158e98 EFLAGS: 00000206
[  430.887077] RAX: 000000005b0c2b18 RBX: ffffffff82435c60 RCX: 
0000000000000046
[  430.894216] RDX: 5b0c2b1800000000 RSI: 403a197300000000 RDI: 
0000000000000046
[  430.901355] RBP: ffffc90000158ed8 R08: 0000000000000000 R09: 
000000000000001b
[  430.908491] R10: 0000000000000001 R11: ffff888102db4000 R12: 
0000000000000000
[  430.915631] R13: 000001343b891adb R14: 0000000000000000 R15: 
000000005b0c2b18
[  430.922770]  ? clocksource_watchdog+0xad/0x2b0
[  430.927221]  ? find_next_bit+0x14/0x20
[  430.930984]  ? __clocksource_unstable+0x70/0x70
[  430.935526]  call_timer_fn+0x2e/0x100
[  430.939198]  run_timer_softirq+0x1d4/0x400
[  430.943305]  ? execlists_submission_tasklet+0x93/0x180 [i915]
[  430.949104]  ? tasklet_action_common.isra.23+0x69/0x100
[  430.954337]  __do_softirq+0xdb/0x29e
[  430.957921]  asm_call_irq_on_stack+0x12/0x20
[  430.962196]  </IRQ>
[  430.964307]  do_softirq_own_stack+0x3d/0x50
[  430.968500]  irq_exit_rcu+0xd2/0xe0
[  430.971998]  common_interrupt+0x81/0x150
[  430.975936]  ? irq_exit_rcu+0x45/0xe0
[  430.979606]  asm_common_interrupt+0x1e/0x40
[  430.983795] RIP: 0010:pci_mmcfg_read+0x11/0xe0
[  430.988257] Code: 89 e0 5b 41 5c 41 5d 5d c3 45 31 e4 eb e2 cc cc cc 
cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 81 fa ff 00 00 00 48 89 e5 41 
57 <41> 56 41 89 d6 0f 97 c2 81 f9 ff 0f 00 00 0f 9f c0 41 55 08 c2 41
[  431.007005] RSP: 0018:ffffc90000303b20 EFLAGS: 00000297
[  431.012232] RAX: ffffffff81980df0 RBX: ffffc90000303b96 RCX: 
0000000000000162
[  431.019366] RDX: 0000000000000008 RSI: 0000000000000002 RDI: 
0000000000000000
[  431.026499] RBP: ffffc90000303b28 R08: 0000000000000002 R09: 
ffffc90000303b5c
[  431.033630] R10: 0000000000000008 R11: ffffc90000303950 R12: 
0000000000000001
[  431.040764] R13: 0000000000000162 R14: ffff888101400000 R15: 
0000000000000064
[  431.047897]  ? pcibios_retrieve_fw_addr+0x90/0x90
[  431.052612]  raw_pci_read+0x35/0x40
[  431.056111]  pci_read+0x2c/0x30
[  431.059261]  pci_bus_read_config_word+0x4a/0x70
[  431.063800]  pci_read_config_word+0x23/0x40
[  431.067991]  pci_wait_for_pending+0x49/0xb0
[  431.072180]  ? raw_pci_write+0x36/0x40
[  431.075939]  pci_vc_do_save_buffer+0x453/0x5b0
[  431.080386]  ? pci_read_config_dword+0x23/0x40
[  431.084832]  pci_restore_vc_state+0x4d/0x70
[  431.089020]  pci_restore_state.part.44+0x177/0x420
[  431.093821]  pci_restore_state+0x18/0x20
[  431.097751]  pcie_portdrv_slot_reset+0x12/0x30
[  431.102205]  report_slot_reset+0x46/0x80
[  431.106138]  ? merge_result.part.4+0x40/0x40
[  431.110413]  pci_walk_bus+0x48/0x90
[  431.113912]  pci_walk_bridge+0x1d/0x30
[  431.117668]  pcie_do_recovery+0x1bd/0x270
[  431.121685]  ? aer_dev_correctable_show+0xd0/0xd0
[  431.126404]  ? aer_dev_correctable_show+0xd0/0xd0
[  431.131118]  aer_process_err_devices+0xd0/0xe9
[  431.135568]  aer_isr.cold.25+0x54/0xa3
[  431.139326]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
[  431.144388]  irq_thread_fn+0x24/0x60
[  431.147972]  irq_thread+0xea/0x170
[  431.151380]  ? irq_forced_thread_fn+0x80/0x80
[  431.155748]  ? irq_thread_check_affinity+0xf0/0xf0
[  431.160548]  kthread+0x126/0x140
[  431.163788]  ? kthread_park+0x90/0x90
[  431.167460]  ret_from_fork+0x22/0x30
[  444.669140] pcieport 0000:02:01.0: restoring config space at offset 
0x3c (was 0x100, writing 0x12010a)
[  462.143968] pcieport 0000:02:01.0: restoring config space at offset 
0x2c (was 0x0, writing 0x0)
[  462.152864] pcieport 0000:02:01.0: restoring config space at offset 
0x28 (was 0x0, writing 0x0)
[  462.161752] pcieport 0000:02:01.0: restoring config space at offset 
0x24 (was 0x1fff1, writing 0x90e19001)


This was not observed with the v1 series of patches.

From what I can tell, this v2 series introduced a change in patch 5/5
that v1 did not have.

BTW, just to be safe and aligned with you, which kernel tree should I be
applying this v2 series to?

Let me know if you need more input.


Thanks,
Hinko

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-05 14:21 ` [PATCHv2 0/5] aer handling fixups Hinko Kocevar
@ 2021-01-05 15:06   ` Hinko Kocevar
  2021-01-05 18:33     ` Keith Busch
  0 siblings, 1 reply; 36+ messages in thread
From: Hinko Kocevar @ 2021-01-05 15:06 UTC (permalink / raw)
  To: Keith Busch, linux-pci, Bjorn Helgaas



On 1/5/21 3:21 PM, Hinko Kocevar wrote:
> 
> 
> On 1/5/21 12:02 AM, Keith Busch wrote:
>> Changes from v1:
>>
>>    Added received Acks
>>
>>    Split the kernel print identifying the port type being reset.
>>
>>    Added a patch for the portdrv to ensure the slot_reset happens without
>>    relying on a downstream device driver..
>>
>> Keith Busch (5):
>>    PCI/ERR: Clear status of the reporting device
>>    PCI/AER: Actually get the root port
>>    PCI/ERR: Retain status from error notification
>>    PCI/AER: Specify the type of port that was reset
>>    PCI/portdrv: Report reset for frozen channel

I removed patch 5/5 from this series, and after testing again my setup
recovers from the injected error, the same as observed with the v1
series.

Hope this helps.
//hinko

>>
>>   drivers/pci/pcie/aer.c         |  5 +++--
>>   drivers/pci/pcie/err.c         | 16 +++++++---------
>>   drivers/pci/pcie/portdrv_pci.c |  3 ++-
>>   3 files changed, 12 insertions(+), 12 deletions(-)
>>
> [ full quote of the previous message (dmesg output included) trimmed;
> see above ]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-05 15:06   ` Hinko Kocevar
@ 2021-01-05 18:33     ` Keith Busch
  2021-01-05 23:07       ` Kelley, Sean V
  0 siblings, 1 reply; 36+ messages in thread
From: Keith Busch @ 2021-01-05 18:33 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: linux-pci, Bjorn Helgaas

On Tue, Jan 05, 2021 at 04:06:53PM +0100, Hinko Kocevar wrote:
> On 1/5/21 3:21 PM, Hinko Kocevar wrote:
> > On 1/5/21 12:02 AM, Keith Busch wrote:
> > > Changes from v1:
> > > 
> > >    Added received Acks
> > > 
> > >    Split the kernel print identifying the port type being reset.
> > > 
> > >    Added a patch for the portdrv to ensure the slot_reset happens without
> > >    relying on a downstream device driver..
> > > 
> > > Keith Busch (5):
> > >    PCI/ERR: Clear status of the reporting device
> > >    PCI/AER: Actually get the root port
> > >    PCI/ERR: Retain status from error notification
> > >    PCI/AER: Specify the type of port that was reset
> > >    PCI/portdrv: Report reset for frozen channel
> 
> I removed the patch 5/5 from this patch series, and after testing again, it
> makes my setup recover from the injected error; same as observed with v1
> series.

Thanks for the notice. Unfortunately that seems even more confusing to
me right now. That patch shouldn't do anything to the devices or the
driver's state; it just ensures a recovery path that was supposed to
happen anyway. The stack trace says restoring the config space completed
partially before getting stuck at the virtual channel capability, at
which point it appears to be in an infinite loop. I'll try to look into
it. The emulated devices I test with don't have the VC cap but I might
have real devices that do.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-05 18:33     ` Keith Busch
@ 2021-01-05 23:07       ` Kelley, Sean V
  2021-01-07 21:42         ` Keith Busch
  0 siblings, 1 reply; 36+ messages in thread
From: Kelley, Sean V @ 2021-01-05 23:07 UTC (permalink / raw)
  To: Keith Busch; +Cc: Hinko Kocevar, Linux PCI, Bjorn Helgaas



> On Jan 5, 2021, at 10:33 AM, Keith Busch <kbusch@kernel.org> wrote:
> 
> On Tue, Jan 05, 2021 at 04:06:53PM +0100, Hinko Kocevar wrote:
>> On 1/5/21 3:21 PM, Hinko Kocevar wrote:
>>> On 1/5/21 12:02 AM, Keith Busch wrote:
>>>> Changes from v1:
>>>> 
>>>>    Added received Acks
>>>> 
>>>>    Split the kernel print identifying the port type being reset.
>>>> 
>>>>    Added a patch for the portdrv to ensure the slot_reset happens without
>>>>    relying on a downstream device driver..
>>>> 
>>>> Keith Busch (5):
>>>>    PCI/ERR: Clear status of the reporting device
>>>>    PCI/AER: Actually get the root port
>>>>    PCI/ERR: Retain status from error notification
>>>>    PCI/AER: Specify the type of port that was reset
>>>>    PCI/portdrv: Report reset for frozen channel
>> 
>> I removed the patch 5/5 from this patch series, and after testing again, it
>> makes my setup recover from the injected error; same as observed with v1
>> series.
> 
> Thanks for the notice. Unfortunately that seems even more confusing to
> me right now. That patch shouldn't do anything to the devices or the
> driver's state; it just ensures a recovery path that was supposed to
> happen anyway. The stack trace says restoring the config space completed
> partially before getting stuck at the virtual channel capability, at
> which point it appears to be in an infinite loop. I'll try to look into
> it. The emulated devices I test with don't have the VC cap but I might
> have real devices that do.

I’m not seeing the error either with v2 when testing with aer-inject using RCECs and an associated RCiEP.

Sean

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-05 23:07       ` Kelley, Sean V
@ 2021-01-07 21:42         ` Keith Busch
  2021-01-08  9:38           ` Hinko Kocevar
  0 siblings, 1 reply; 36+ messages in thread
From: Keith Busch @ 2021-01-07 21:42 UTC (permalink / raw)
  To: Kelley, Sean V; +Cc: Hinko Kocevar, Linux PCI, Bjorn Helgaas

On Tue, Jan 05, 2021 at 11:07:23PM +0000, Kelley, Sean V wrote:
> > On Jan 5, 2021, at 10:33 AM, Keith Busch <kbusch@kernel.org> wrote:
> > On Tue, Jan 05, 2021 at 04:06:53PM +0100, Hinko Kocevar wrote:
> >> On 1/5/21 3:21 PM, Hinko Kocevar wrote:
> >>> On 1/5/21 12:02 AM, Keith Busch wrote:
> >>>> Changes from v1:
> >>>> 
> >>>>    Added received Acks
> >>>> 
> >>>>    Split the kernel print identifying the port type being reset.
> >>>> 
> >>>>    Added a patch for the portdrv to ensure the slot_reset happens without
> >>>>    relying on a downstream device driver..
> >>>> 
> >>>> Keith Busch (5):
> >>>>    PCI/ERR: Clear status of the reporting device
> >>>>    PCI/AER: Actually get the root port
> >>>>    PCI/ERR: Retain status from error notification
> >>>>    PCI/AER: Specify the type of port that was reset
> >>>>    PCI/portdrv: Report reset for frozen channel
> >> 
> >> I removed the patch 5/5 from this patch series, and after testing again, it
> >> makes my setup recover from the injected error; same as observed with v1
> >> series.
> > 
> > Thanks for the notice. Unfortunately that seems even more confusing to
> > me right now. That patch shouldn't do anything to the devices or the
> > driver's state; it just ensures a recovery path that was supposed to
> > happen anyway. The stack trace says restoring the config space completed
> > partially before getting stuck at the virtual channel capability, at
> > which point it appears to be in an infinite loop. I'll try to look into
> > it. The emulated devices I test with don't have the VC cap but I might
> > have real devices that do.
> 
> I’m not seeing the error either with v2 when testing with aer-inject using RCECs and an associated RCiEP.

Thank you, yes, I'm not seeing a problem on my end either. The sighting
is still concerning, though, so I'll keep looking. I may have to ask
Hinko to try a debug patch to help narrow down where things have gone
wrong, if that's okay.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-07 21:42         ` Keith Busch
@ 2021-01-08  9:38           ` Hinko Kocevar
  2021-01-11 13:39             ` Hinko Kocevar
  0 siblings, 1 reply; 36+ messages in thread
From: Hinko Kocevar @ 2021-01-08  9:38 UTC (permalink / raw)
  To: Keith Busch, Kelley, Sean V; +Cc: Linux PCI, Bjorn Helgaas



On 1/7/21 10:42 PM, Keith Busch wrote:
> On Tue, Jan 05, 2021 at 11:07:23PM +0000, Kelley, Sean V wrote:
>>> On Jan 5, 2021, at 10:33 AM, Keith Busch <kbusch@kernel.org> wrote:
>>> On Tue, Jan 05, 2021 at 04:06:53PM +0100, Hinko Kocevar wrote:
>>>> On 1/5/21 3:21 PM, Hinko Kocevar wrote:
>>>>> On 1/5/21 12:02 AM, Keith Busch wrote:
>>>>>> Changes from v1:
>>>>>>
>>>>>>     Added received Acks
>>>>>>
>>>>>>     Split the kernel print identifying the port type being reset.
>>>>>>
>>>>>>     Added a patch for the portdrv to ensure the slot_reset happens without
>>>>>>     relying on a downstream device driver..
>>>>>>
>>>>>> Keith Busch (5):
>>>>>>     PCI/ERR: Clear status of the reporting device
>>>>>>     PCI/AER: Actually get the root port
>>>>>>     PCI/ERR: Retain status from error notification
>>>>>>     PCI/AER: Specify the type of port that was reset
>>>>>>     PCI/portdrv: Report reset for frozen channel
>>>>
>>>> I removed the patch 5/5 from this patch series, and after testing again, it
>>>> makes my setup recover from the injected error; same as observed with v1
>>>> series.
>>>
>>> Thanks for the notice. Unfortunately that seems even more confusing to
>>> me right now. That patch shouldn't do anything to the devices or the
>>> driver's state; it just ensures a recovery path that was supposed to
>>> happen anyway. The stack trace says restoring the config space completed
>>> partially before getting stuck at the virtual channel capability, at
>>> which point it appears to be in an infinite loop. I'll try to look into
>>> it. The emulated devices I test with don't have the VC cap but I might
>>> have real devices that do.
>>
>> I’m not seeing the error either with v2 when testing with aer-inject using RCECs and an associated RCiEP.
> 
> Thank you, yes, I'm also not seeing a problem either on my end. The
> sighting is still concerning though, so I'll keep looking. I may have to
> request Hinko to try a debug patch to help narrow down where things have
> gone wrong if that's okay.
> 

Sure. I'm willing to help out and debug this on my side as well. Let me 
know what you need me to do!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-08  9:38           ` Hinko Kocevar
@ 2021-01-11 13:39             ` Hinko Kocevar
  2021-01-11 16:37               ` Keith Busch
  0 siblings, 1 reply; 36+ messages in thread
From: Hinko Kocevar @ 2021-01-11 13:39 UTC (permalink / raw)
  To: Keith Busch, Kelley, Sean V; +Cc: Linux PCI, Bjorn Helgaas



On 1/8/21 10:38 AM, Hinko Kocevar wrote:
> 
> 
> On 1/7/21 10:42 PM, Keith Busch wrote:
>> On Tue, Jan 05, 2021 at 11:07:23PM +0000, Kelley, Sean V wrote:
>>>> On Jan 5, 2021, at 10:33 AM, Keith Busch <kbusch@kernel.org> wrote:
>>>> On Tue, Jan 05, 2021 at 04:06:53PM +0100, Hinko Kocevar wrote:
>>>>> On 1/5/21 3:21 PM, Hinko Kocevar wrote:
>>>>>> On 1/5/21 12:02 AM, Keith Busch wrote:
>>>>>>> Changes from v1:
>>>>>>>
>>>>>>>     Added received Acks
>>>>>>>
>>>>>>>     Split the kernel print identifying the port type being reset.
>>>>>>>
>>>>>>>     Added a patch for the portdrv to ensure the slot_reset 
>>>>>>> happens without
>>>>>>>     relying on a downstream device driver..
>>>>>>>
>>>>>>> Keith Busch (5):
>>>>>>>     PCI/ERR: Clear status of the reporting device
>>>>>>>     PCI/AER: Actually get the root port
>>>>>>>     PCI/ERR: Retain status from error notification
>>>>>>>     PCI/AER: Specify the type of port that was reset
>>>>>>>     PCI/portdrv: Report reset for frozen channel
>>>>>
>>>>> I removed the patch 5/5 from this patch series, and after testing 
>>>>> again, it
>>>>> makes my setup recover from the injected error; same as observed 
>>>>> with v1
>>>>> series.
>>>>
>>>> Thanks for the notice. Unfortunately that seems even more confusing to
>>>> me right now. That patch shouldn't do anything to the devices or the
>>>> driver's state; it just ensures a recovery path that was supposed to
>>>> happen anyway. The stack trace says restoring the config space 
>>>> completed
>>>> partially before getting stuck at the virtual channel capability, at
>>>> which point it appears to be in an infinite loop. I'll try to look into
>>>> it. The emulated devices I test with don't have the VC cap but I might
>>>> have real devices that do.
>>>
>>> I’m not seeing the error either with v2 when testing with aer-inject
>>> using RCECs and an associated RCiEP.
>>
>> Thank you, yes, I'm also not seeing a problem either on my end. The
>> sighting is still concerning though, so I'll keep looking. I may have to
>> request Hinko to try a debug patch to help narrow down where things have
>> gone wrong if that's okay.
>>
> 
> Sure. I'm willing to help out and debug this on my side as well. Let me 
> know what you need me to do!

Testing this series a bit more (without patch 5/5) resulted in the same
CPU lockup:

watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [irq/122-aerdrv:128]

as I initially reported with patch 5/5 included.

It seems less frequent, though. For example, after a reboot it is not
observed and the recovery succeeds, whereas with patch 5/5 applied every
recovery resulted in a CPU lockup.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-11 13:39             ` Hinko Kocevar
@ 2021-01-11 16:37               ` Keith Busch
  2021-01-11 20:02                 ` Hinko Kocevar
  0 siblings, 1 reply; 36+ messages in thread
From: Keith Busch @ 2021-01-11 16:37 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: Kelley, Sean V, Linux PCI, Bjorn Helgaas

On Mon, Jan 11, 2021 at 02:39:20PM +0100, Hinko Kocevar wrote:
> Testing this series a bit more (without patch 5/5) resulted in the same
> CPU lockup:
> 
> watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [irq/122-aerdrv:128]
> 
> as I initially reported with patch 5/5 included.
> 
> It seems less frequent, though. For example, after a reboot it is not
> observed and the recovery succeeds, whereas with patch 5/5 applied every
> recovery resulted in a CPU lockup.

I am assuming this soft lockup still happens when restoring the downstream
port's virtual channel capability. Your initial sighting indicates that
it doesn't appear to be a deadlock, but the stack trace never exited
pci_restore_vc_state() either. I did not find any obvious issues here
just from code inspection, so if you could try applying the following
patch and send the kernel message output, that would help.

---
diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
index 5fc59ac31145..4834af7eb582 100644
--- a/drivers/pci/vc.c
+++ b/drivers/pci/vc.c
@@ -28,6 +28,7 @@ static void pci_vc_save_restore_dwords(struct pci_dev *dev, int pos,
 {
 	int i;
 
+	pci_warn(dev, "%s: pos:%d dwords:%d\n", __func__, pos, dwords);
 	for (i = 0; i < dwords; i++, buf++) {
 		if (save)
 			pci_read_config_dword(dev, pos + (i * 4), buf);
@@ -110,6 +111,8 @@ static void pci_vc_enable(struct pci_dev *dev, int pos, int res)
 	if (!pci_is_pcie(dev) || !pcie_downstream_port(dev))
 		return;
 
+	pci_warn(dev, "%s: pos:%d res:%d\n", __func__, pos, res);
+
 	ctrl_pos = pos + PCI_VC_RES_CTRL + (res * PCI_CAP_VC_PER_VC_SIZEOF);
 	status_pos = pos + PCI_VC_RES_STATUS + (res * PCI_CAP_VC_PER_VC_SIZEOF);
 
@@ -165,6 +168,8 @@ static void pci_vc_enable(struct pci_dev *dev, int pos, int res)
 	if (link && !pci_wait_for_pending(link, status_pos2,
 					  PCI_VC_RES_STATUS_NEGO))
 		pci_err(link, "VC%d negotiation stuck pending\n", id);
+
+	pci_warn(dev, "%s: pos:%d res:%d return\n", __func__, pos, res);
 }
 
 /**
@@ -190,6 +195,7 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
 	int i, len = 0;
 	u8 *buf = save_state ? (u8 *)save_state->cap.data : NULL;
 
+	pci_warn(dev, "%s: buf:%d pos:%d\n", __func__, buf != NULL, pos);
 	/* Sanity check buffer size for save/restore */
 	if (buf && save_state->cap.size !=
 	    pci_vc_do_save_buffer(dev, pos, NULL, save)) {
@@ -278,6 +284,8 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
 		pci_read_config_dword(dev, pos + PCI_VC_RES_CAP +
 				      (i * PCI_CAP_VC_PER_VC_SIZEOF), &cap);
 		parb_offset = ((cap & PCI_VC_RES_CAP_ARB_OFF) >> 24) * 16;
+		pci_warn(dev, "%s: i:%d evcc:%d parb_offset:%d\n", __func__, i,
+			 evcc, parb_offset);
 		if (parb_offset) {
 			int size, parb_phases = 0;
 
@@ -332,6 +340,7 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
 		len += 4;
 	}
 
+	pci_warn(dev, "%s: len:%d\n", __func__, len);
 	return buf ? 0 : len;
 }
 
@@ -399,6 +408,7 @@ void pci_restore_vc_state(struct pci_dev *dev)
 		if (!save_state || !pos)
 			continue;
 
+		pci_warn(dev, "%s: i:%d pos:%d\n", __func__, i, pos);
 		pci_vc_do_save_buffer(dev, pos, save_state, false);
 	}
 }
--

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-11 16:37               ` Keith Busch
@ 2021-01-11 20:02                 ` Hinko Kocevar
  2021-01-11 22:09                   ` Keith Busch
  0 siblings, 1 reply; 36+ messages in thread
From: Hinko Kocevar @ 2021-01-11 20:02 UTC (permalink / raw)
  To: Keith Busch; +Cc: Kelley, Sean V, Linux PCI, Bjorn Helgaas

[-- Attachment #1: Type: text/plain, Size: 3443 bytes --]



On 1/11/21 5:37 PM, Keith Busch wrote:
> On Mon, Jan 11, 2021 at 02:39:20PM +0100, Hinko Kocevar wrote:
>> Testing this patch a bit more (without the 5/5) resulted in the same CPU
>> lockup:
>>
>> watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [irq/122-aerdrv:128]
>>
>> as I initially reported with the 5/5 of this patch included.
>>
>> It seems more infrequent, though. For example, after reboot this is not
>> observed and the recovery process is successful, whereas when 5/5 is also
>> used every recovery resulted in CPU lockup.
> 
> I am assuming this soft lockup still occurs while restoring the downstream
> port's virtual channel capability. Your initial sighting indicates that
> it doesn't appear to be a deadlock, but the stack trace never exited
> pci_restore_vc_state() either. I did not find any obvious issues here
> just from code inspection, so if you could try applying the following
> patch and send the kernel message output, that would help.
> 
> ---
> diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
> index 5fc59ac31145..4834af7eb582 100644
> --- a/drivers/pci/vc.c
> +++ b/drivers/pci/vc.c
> @@ -28,6 +28,7 @@ static void pci_vc_save_restore_dwords(struct pci_dev *dev, int pos,
>   {
>   	int i;
>   
> +	pci_warn(dev, "%s: pos:%d dwords:%d\n", __func__, pos, dwords);
>   	for (i = 0; i < dwords; i++, buf++) {
>   		if (save)
>   			pci_read_config_dword(dev, pos + (i * 4), buf);
> @@ -110,6 +111,8 @@ static void pci_vc_enable(struct pci_dev *dev, int pos, int res)
>   	if (!pci_is_pcie(dev) || !pcie_downstream_port(dev))
>   		return;
>   
> +	pci_warn(dev, "%s: pos:%d res:%d\n", __func__, pos, res);
> +
>   	ctrl_pos = pos + PCI_VC_RES_CTRL + (res * PCI_CAP_VC_PER_VC_SIZEOF);
>   	status_pos = pos + PCI_VC_RES_STATUS + (res * PCI_CAP_VC_PER_VC_SIZEOF);
>   
> @@ -165,6 +168,8 @@ static void pci_vc_enable(struct pci_dev *dev, int pos, int res)
>   	if (link && !pci_wait_for_pending(link, status_pos2,
>   					  PCI_VC_RES_STATUS_NEGO))
>   		pci_err(link, "VC%d negotiation stuck pending\n", id);
> +
> +	pci_warn(dev, "%s: pos:%d res:%d return\n", __func__, pos, res);
>   }
>   
>   /**
> @@ -190,6 +195,7 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
>   	int i, len = 0;
>   	u8 *buf = save_state ? (u8 *)save_state->cap.data : NULL;
>   
> +	pci_warn(dev, "%s: buf:%d pos:%d\n", __func__, buf != NULL, pos);
>   	/* Sanity check buffer size for save/restore */
>   	if (buf && save_state->cap.size !=
>   	    pci_vc_do_save_buffer(dev, pos, NULL, save)) {
> @@ -278,6 +284,8 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
>   		pci_read_config_dword(dev, pos + PCI_VC_RES_CAP +
>   				      (i * PCI_CAP_VC_PER_VC_SIZEOF), &cap);
>   		parb_offset = ((cap & PCI_VC_RES_CAP_ARB_OFF) >> 24) * 16;
> +		pci_warn(dev, "%s: i:%d evcc:%d parb_offset:%d\n", __func__, i,
> +			 evcc, parb_offset);
>   		if (parb_offset) {
>   			int size, parb_phases = 0;
>   
> @@ -332,6 +340,7 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
>   		len += 4;
>   	}
>   
> +	pci_warn(dev, "%s: len:%d\n", __func__, len);
>   	return buf ? 0 : len;
>   }
>   
> @@ -399,6 +408,7 @@ void pci_restore_vc_state(struct pci_dev *dev)
>   		if (!save_state || !pos)
>   			continue;
>   
> +		pci_warn(dev, "%s: i:%d pos:%d\n", __func__, i, pos);
>   		pci_vc_do_save_buffer(dev, pos, save_state, false);
>   	}
>   }
> --
> 

Attached are the messages.

Thanks!

[-- Attachment #2: cpu-lockups2 --]
[-- Type: text/plain, Size: 35612 bytes --]

...
[   64.649399] xdma: loading out-of-tree module taints kernel.
[   64.656255] xdma:xdma_mod_init: Xilinx XDMA Reference Driver xdma v2020.1.8
[   64.663324] xdma:xdma_mod_init: desc_blen_max: 0xfffffff/268435455, timeout: h2c 10 c2h 10 sec.
[   64.672189] xdma:xdma_device_open: xdma device 0000:08:00.0, 0x00000000591acbe0.
[   64.680346] xdma:map_single_bar: BAR0 at 0xde000000 mapped at 0x000000000fdb12a5, length=16777216(/16777216)
[   64.690290] xdma:map_single_bar: BAR1 at 0xdf000000 mapped at 0x000000005fe214c9, length=65536(/65536)
[   64.699702] xdma:map_bars: config bar 1, pos 1.
[   64.704339] xdma:identify_bars: 2 BARs: config 1, user 0, bypass -1.
[   64.711245] xdma:pci_keep_intx_enabled: 0000:08:00.0: clear INTX_DISABLE, 0x407 -> 0x7.
[   64.719369] xdma:irq_msix_channel_setup: engine 0-H2C0-MM, irq#169.
[   64.725751] xdma:irq_msix_channel_setup: engine 0-C2H0-MM, irq#170.
[   64.732134] xdma:irq_msix_user_setup: 0-USR-0, IRQ#171 with 0x0000000016487e14
[   64.739479] xdma:irq_msix_user_setup: 0-USR-1, IRQ#172 with 0x0000000017b744b7
[   64.746813] xdma:irq_msix_user_setup: 0-USR-2, IRQ#173 with 0x000000007491af44
[   64.754149] xdma:irq_msix_user_setup: 0-USR-3, IRQ#174 with 0x000000009e92f5ff
[   64.761484] xdma:irq_msix_user_setup: 0-USR-4, IRQ#175 with 0x0000000056e09b58
[   64.768819] xdma:irq_msix_user_setup: 0-USR-5, IRQ#176 with 0x00000000af6d01f5
[   64.776159] xdma:irq_msix_user_setup: 0-USR-6, IRQ#177 with 0x000000002ef15b69
[   64.783493] xdma:irq_msix_user_setup: 0-USR-7, IRQ#178 with 0x0000000099fd5ab6
[   64.790832] xdma:irq_msix_user_setup: 0-USR-8, IRQ#179 with 0x0000000014654474
[   64.798167] xdma:irq_msix_user_setup: 0-USR-9, IRQ#180 with 0x00000000963ca534
[   64.805504] xdma:irq_msix_user_setup: 0-USR-10, IRQ#181 with 0x0000000086853299
[   64.812928] xdma:irq_msix_user_setup: 0-USR-11, IRQ#182 with 0x00000000d0d53381
[   64.820354] xdma:irq_msix_user_setup: 0-USR-12, IRQ#183 with 0x000000008ce02f44
[   64.827776] xdma:irq_msix_user_setup: 0-USR-13, IRQ#184 with 0x00000000bc4b16b8
[   64.835200] xdma:irq_msix_user_setup: 0-USR-14, IRQ#185 with 0x0000000088677e88
[   64.842623] xdma:irq_msix_user_setup: 0-USR-15, IRQ#186 with 0x000000000d9849f6
[   64.850044] xdma:probe_one: 0000:08:00.0 xdma0, pdev 0x00000000591acbe0, xdev 0x00000000c77a5002, 0x00000000947398f0, usr 16, ch 1,1.
[   64.864139] xdma:cdev_xvc_init: xcdev 0x000000004eda4704, bar 0, offset 0x40000.
[   68.523443] pcieport 0000:00:01.1: power state changed by ACPI to D0
[   68.541024] pcieport 0000:00:01.1: pci_restore_vc_state: i:1 pos:256
[   68.547489] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   68.554208] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   68.560929] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.568603] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.574632] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.582305] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.588338] pcieport 0000:00:01.1: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   68.597139] pcieport 0000:00:01.1: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   68.605946] pcieport 0000:00:01.1: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   68.615568] pcieport 0000:00:01.1: PME# disabled
[   68.739042] pcieport 0000:00:01.1: saving config space at offset 0x0 (reading 0x19058086)
[   68.747225] pcieport 0000:00:01.1: saving config space at offset 0x4 (reading 0x100407)
[   68.755222] pcieport 0000:00:01.1: saving config space at offset 0x8 (reading 0x6040005)
[   68.763310] pcieport 0000:00:01.1: saving config space at offset 0xc (reading 0x810010)
[   68.771310] pcieport 0000:00:01.1: saving config space at offset 0x10 (reading 0x0)
[   68.778974] pcieport 0000:00:01.1: saving config space at offset 0x14 (reading 0x0)
[   68.786628] pcieport 0000:00:01.1: saving config space at offset 0x18 (reading 0xe0e00)
[   68.794629] pcieport 0000:00:01.1: saving config space at offset 0x1c (reading 0xf0)
[   68.802379] pcieport 0000:00:01.1: saving config space at offset 0x20 (reading 0xfff0)
[   68.810298] pcieport 0000:00:01.1: saving config space at offset 0x24 (reading 0x1fff1)
[   68.818315] pcieport 0000:00:01.1: saving config space at offset 0x28 (reading 0x0)
[   68.825980] pcieport 0000:00:01.1: saving config space at offset 0x2c (reading 0x0)
[   68.833651] pcieport 0000:00:01.1: saving config space at offset 0x30 (reading 0x0)
[   68.841312] pcieport 0000:00:01.1: saving config space at offset 0x34 (reading 0x88)
[   68.849061] pcieport 0000:00:01.1: saving config space at offset 0x38 (reading 0x0)
[   68.856719] pcieport 0000:00:01.1: saving config space at offset 0x3c (reading 0x12010b)
[   68.864817] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   68.871433] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   68.878049] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.885620] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.891555] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.899128] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.905178] pcieport 0000:00:01.1: PME# enabled
[   68.921051] pcieport 0000:00:01.1: power state changed by ACPI to D3cold
[   68.927905] pcieport 0000:00:01.1: power state changed by ACPI to D0
[   68.946078] pcieport 0000:00:01.1: pci_restore_vc_state: i:1 pos:256
[   68.952545] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   68.959265] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   68.965984] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.973658] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.979682] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.987353] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.993381] pcieport 0000:00:01.1: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   69.002178] pcieport 0000:00:01.1: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   69.010980] pcieport 0000:00:01.1: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   69.020624] pcieport 0000:00:01.1: PME# disabled
[   69.084588] pcieport 0000:02:02.0: PME# disabled
[   69.101016] pcieport 0000:02:02.0: pci_restore_vc_state: i:1 pos:328
[   69.107376] pcieport 0000:02:02.0: pci_vc_do_save_buffer: buf:1 pos:328
[   69.113998] pcieport 0000:02:02.0: pci_vc_do_save_buffer: buf:0 pos:328
[   69.120646] pcieport 0000:02:02.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:0
[   69.128225] pcieport 0000:02:02.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.135795] pcieport 0000:02:02.0: pci_vc_do_save_buffer: len:12
[   69.141818] pcieport 0000:02:02.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:0
[   69.149407] pcieport 0000:02:02.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.149972] pcieport 0000:00:01.1: saving config space at offset 0x0 (reading 0x19058086)
[   69.156981] pcieport 0000:02:02.0: pci_vc_do_save_buffer: len:12
[   69.157058] pcieport 0000:02:02.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   69.157063] pcieport 0000:02:02.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   69.157067] pcieport 0000:02:02.0: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   69.157261] pcieport 0000:02:02.0: PME# disabled
[   69.165517] pcieport 0000:00:01.1: saving config space at offset 0x4 (reading 0x100407)
[   69.171639] pci_bus 0000:0a: busn_res: [bus 0a] is released
[   69.180336] pcieport 0000:00:01.1: saving config space at offset 0x8 (reading 0x6040005)
[   69.188880] pcieport 0000:02:08.0: PME# disabled
[   69.203089] pcieport 0000:02:08.0: pci_restore_vc_state: i:1 pos:328
[   69.203091] pcieport 0000:02:08.0: pci_vc_do_save_buffer: buf:1 pos:328
[   69.203093] pcieport 0000:02:08.0: pci_vc_do_save_buffer: buf:0 pos:328
[   69.203097] pcieport 0000:02:08.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   69.203100] pcieport 0000:02:08.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.203101] pcieport 0000:02:08.0: pci_vc_do_save_buffer: len:76
[   69.203108] pcieport 0000:02:08.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   69.211361] pcieport 0000:00:01.1: saving config space at offset 0xc (reading 0x810010)
[   69.216693] pcieport 0000:02:08.0: pci_vc_save_restore_dwords: pos:376 dwords:16
[   69.216758] pcieport 0000:02:08.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.216763] pcieport 0000:02:08.0: pci_vc_do_save_buffer: len:76
[   69.216804] pcieport 0000:02:08.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   69.225139] pcieport 0000:00:01.1: saving config space at offset 0x10 (reading 0x0)
[   69.229513] pcieport 0000:02:08.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   69.229520] pcieport 0000:02:08.0: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   69.229614] pcieport 0000:02:08.0: PME# disabled
[   69.229829] pci_bus 0000:0b: busn_res: [bus 0b] is released
[   69.236231] pcieport 0000:00:01.1: saving config space at offset 0x14 (reading 0x0)
[   69.242730] pcieport 0000:02:09.0: PME# disabled
[   69.249579] pcieport 0000:00:01.1: saving config space at offset 0x18 (reading 0xe0e00)
[   69.257048] pci_bus 0000:0c: dev 00, dec refcount to 0
[   69.257060] pci_bus 0000:0c: dev 00, released physical slot 9
[   69.257223] pci_bus 0000:0c: busn_res: [bus 0c] is released
[   69.264928] pcieport 0000:00:01.1: saving config space at offset 0x1c (reading 0xf0)
[   69.270998] pcieport 0000:02:0a.0: PME# disabled
[   69.278705] pcieport 0000:00:01.1: saving config space at offset 0x20 (reading 0xfff0)
[   69.302025] pcieport 0000:02:0a.0: pci_restore_vc_state: i:1 pos:328
[   69.307986] pcieport 0000:00:01.1: saving config space at offset 0x24 (reading 0x1fff1)
[   69.316417] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: buf:1 pos:328
[   69.316419] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: buf:0 pos:328
[   69.316428] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:0
[   69.316432] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.316433] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: len:12
[   69.316446] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:0
[   69.316453] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.316460] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: len:12
[   69.316514] pcieport 0000:02:0a.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   69.316521] pcieport 0000:02:0a.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   69.316525] pcieport 0000:02:0a.0: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   69.316660] pcieport 0000:02:0a.0: PME# disabled
[   69.324439] pcieport 0000:00:01.1: saving config space at offset 0x28 (reading 0x0)
[   69.333201] pci_bus 0000:0d: busn_res: [bus 0d] is released
[   69.342639] pcieport 0000:00:01.1: saving config space at offset 0x2c (reading 0x0)
[   69.347130] plx_dma 0000:01:00.1: PME# disabled
[   69.352951] pcieport 0000:00:01.1: saving config space at offset 0x30 (reading 0x0)
[   69.546955] pcieport 0000:00:01.1: saving config space at offset 0x34 (reading 0x88)
[   69.554803] pcieport 0000:00:01.1: saving config space at offset 0x38 (reading 0x0)
[   69.562563] pcieport 0000:00:01.1: saving config space at offset 0x3c (reading 0x12010b)
[   69.570767] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   69.577487] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   69.584208] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   69.591884] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   69.597910] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   69.605583] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   69.611806] pcieport 0000:00:01.1: PME# enabled
[   69.623179] plx_dma 0000:01:00.2: PME# disabled
[   69.628424] pcieport 0000:00:01.1: power state changed by ACPI to D3cold
[   69.640379] plx_dma 0000:01:00.3: PME# disabled
[   69.656336] plx_dma 0000:01:00.4: PME# disabled
[   72.886830] pcieport 0000:00:01.1: power state changed by ACPI to D0
[   72.904973] pcieport 0000:00:01.1: pci_restore_vc_state: i:1 pos:256
[   72.911333] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   72.917963] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   72.924583] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   72.932152] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   72.938089] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   72.945665] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   72.951596] pcieport 0000:00:01.1: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   72.960292] pcieport 0000:00:01.1: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   72.968989] pcieport 0000:00:01.1: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   72.978623] pcieport 0000:00:01.1: PME# disabled
[   73.099568] pcieport 0000:00:01.1: saving config space at offset 0x0 (reading 0x19058086)
[   73.107774] pcieport 0000:00:01.1: saving config space at offset 0x4 (reading 0x100407)
[   73.115786] pcieport 0000:00:01.1: saving config space at offset 0x8 (reading 0x6040005)
[   73.123878] pcieport 0000:00:01.1: saving config space at offset 0xc (reading 0x810010)
[   73.131892] pcieport 0000:00:01.1: saving config space at offset 0x10 (reading 0x0)
[   73.139559] pcieport 0000:00:01.1: saving config space at offset 0x14 (reading 0x0)
[   73.147222] pcieport 0000:00:01.1: saving config space at offset 0x18 (reading 0xe0e00)
[   73.155220] pcieport 0000:00:01.1: saving config space at offset 0x1c (reading 0xf0)
[   73.162968] pcieport 0000:00:01.1: saving config space at offset 0x20 (reading 0xfff0)
[   73.170892] pcieport 0000:00:01.1: saving config space at offset 0x24 (reading 0x1fff1)
[   73.178913] pcieport 0000:00:01.1: saving config space at offset 0x28 (reading 0x0)
[   73.186579] pcieport 0000:00:01.1: saving config space at offset 0x2c (reading 0x0)
[   73.194234] pcieport 0000:00:01.1: saving config space at offset 0x30 (reading 0x0)
[   73.201890] pcieport 0000:00:01.1: saving config space at offset 0x34 (reading 0x88)
[   73.209637] pcieport 0000:00:01.1: saving config space at offset 0x38 (reading 0x0)
[   73.217296] pcieport 0000:00:01.1: saving config space at offset 0x3c (reading 0x12010b)
[   73.225411] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   73.232025] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   73.238644] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.246217] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.252153] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.259736] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.265707] pcieport 0000:00:01.1: PME# enabled
[   73.282253] pcieport 0000:00:01.1: power state changed by ACPI to D3cold
[   73.289128] pcieport 0000:00:01.1: power state changed by ACPI to D0
[   73.306977] pcieport 0000:00:01.1: pci_restore_vc_state: i:1 pos:256
[   73.313335] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   73.319962] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   73.326583] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.334159] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.340092] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.347671] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.353617] pcieport 0000:00:01.1: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   73.362320] pcieport 0000:00:01.1: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   73.371018] pcieport 0000:00:01.1: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   73.380445] pcieport 0000:00:01.1: PME# disabled
[   73.430387] pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
[   73.440086] pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
[   73.448186] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[   73.458969] pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
[   73.467328] pcieport 0000:00:01.0:    [14] CmpltTO
[   73.473434] xdma_error_detected: >>>
[   73.477012] xdma:xdma_error_detected: dev 0x00000000591acbe0,0x00000000c77a5002, frozen state error, reset controller
[   73.487626] xdma:xdma_device_offline: pdev 0x00000000591acbe0, xdev 0x00000000947398f0.
[   73.495738] xdma:xdma_device_offline: xdev 0x00000000947398f0, done.
[   73.509969] pcieport 0000:00:01.1: saving config space at offset 0x0 (reading 0x19058086)
[   73.518157] pcieport 0000:00:01.1: saving config space at offset 0x4 (reading 0x100407)
[   73.526165] pcieport 0000:00:01.1: saving config space at offset 0x8 (reading 0x6040005)
[   73.534267] pcieport 0000:00:01.1: saving config space at offset 0xc (reading 0x810010)
[   73.542269] pcieport 0000:00:01.1: saving config space at offset 0x10 (reading 0x0)
[   73.549931] pcieport 0000:00:01.1: saving config space at offset 0x14 (reading 0x0)
[   73.557591] pcieport 0000:00:01.1: saving config space at offset 0x18 (reading 0xe0e00)
[   73.565605] pcieport 0000:00:01.1: saving config space at offset 0x1c (reading 0xf0)
[   73.573359] pcieport 0000:00:01.1: saving config space at offset 0x20 (reading 0xfff0)
[   73.581278] pcieport 0000:00:01.1: saving config space at offset 0x24 (reading 0x1fff1)
[   73.589286] pcieport 0000:00:01.1: saving config space at offset 0x28 (reading 0x0)
[   73.596945] pcieport 0000:00:01.1: saving config space at offset 0x2c (reading 0x0)
[   73.604609] pcieport 0000:00:01.1: saving config space at offset 0x30 (reading 0x0)
[   73.612272] pcieport 0000:00:01.1: saving config space at offset 0x34 (reading 0x88)
[   73.620024] pcieport 0000:00:01.1: saving config space at offset 0x38 (reading 0x0)
[   73.627691] pcieport 0000:00:01.1: saving config space at offset 0x3c (reading 0x12010b)
[   73.635804] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   73.642419] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   73.649037] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.656602] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.662525] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.670102] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.676118] pcieport 0000:00:01.1: PME# enabled
[   73.692066] pcieport 0000:00:01.1: power state changed by ACPI to D3cold
[   74.509998] pcieport 0000:00:01.0: AER: Root Port link has been reset (0)
[   74.516820] pcieport 0000:01:00.0: pci_restore_vc_state: i:1 pos:328
[   74.523169] pcieport 0000:01:00.0: pci_vc_do_save_buffer: buf:1 pos:328
[   74.529783] pcieport 0000:01:00.0: pci_vc_do_save_buffer: buf:0 pos:328
[   74.536405] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   74.544059] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   74.551630] pcieport 0000:01:00.0: pci_vc_do_save_buffer: len:76
[   74.557663] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   74.565327] pcieport 0000:01:00.0: pci_vc_save_restore_dwords: pos:376 dwords:16
[   74.572741] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   74.580318] pcieport 0000:01:00.0: pci_vc_do_save_buffer: len:76
[   74.586354] pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x12010b)
[   74.595676] pcieport 0000:01:00.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   74.604377] pcieport 0000:01:00.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   74.613076] pcieport 0000:01:00.0: restoring config space at offset 0x24 (was 0x1fff1, writing 0x91119001)
[   74.622730] pcieport 0000:01:00.0: restoring config space at offset 0x20 (was 0xfff0, writing 0xdf10de00)
[   74.632300] pcieport 0000:01:00.0: restoring config space at offset 0x1c (was 0x1f1, writing 0x8121)
[   74.641427] pcieport 0000:01:00.0: restoring config space at offset 0x18 (was 0x0, writing 0xd0201)
[   74.650479] pcieport 0000:01:00.0: restoring config space at offset 0x10 (was 0x0, writing 0xdf200000)
[   74.659787] pcieport 0000:01:00.0: restoring config space at offset 0xc (was 0x810000, writing 0x810010)
[   74.669275] pcieport 0000:01:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
[   74.678805] pcieport 0000:01:00.0: saving config space at offset 0x0 (reading 0x872510b5)
[   74.686984] pcieport 0000:01:00.0: saving config space at offset 0x4 (reading 0x100407)
[   74.694996] pcieport 0000:01:00.0: saving config space at offset 0x8 (reading 0x60400ca)
[   74.703094] pcieport 0000:01:00.0: saving config space at offset 0xc (reading 0x810010)
[   74.711114] pcieport 0000:01:00.0: saving config space at offset 0x10 (reading 0xdf200000)
[   74.719381] pcieport 0000:01:00.0: saving config space at offset 0x14 (reading 0x0)
[   74.727031] pcieport 0000:01:00.0: saving config space at offset 0x18 (reading 0xd0201)
[   74.735027] pcieport 0000:01:00.0: saving config space at offset 0x1c (reading 0x8121)
[   74.742939] pcieport 0000:01:00.0: saving config space at offset 0x20 (reading 0xdf10de00)
[   74.751201] pcieport 0000:01:00.0: saving config space at offset 0x24 (reading 0x91119001)
[   74.759472] pcieport 0000:01:00.0: saving config space at offset 0x28 (reading 0x0)
[   74.767134] pcieport 0000:01:00.0: saving config space at offset 0x2c (reading 0x0)
[   74.774788] pcieport 0000:01:00.0: saving config space at offset 0x30 (reading 0x0)
[   74.782441] pcieport 0000:01:00.0: saving config space at offset 0x34 (reading 0x40)
[   74.790188] pcieport 0000:01:00.0: saving config space at offset 0x38 (reading 0x0)
[   74.797844] pcieport 0000:01:00.0: saving config space at offset 0x3c (reading 0x12010b)
[   74.806027] pcieport 0000:01:00.0: pci_vc_do_save_buffer: buf:1 pos:328
[   74.812641] pcieport 0000:01:00.0: pci_vc_do_save_buffer: buf:0 pos:328
[   74.819255] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   74.826911] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   74.834483] pcieport 0000:01:00.0: pci_vc_do_save_buffer: len:76
[   74.840505] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   74.848160] pcieport 0000:01:00.0: pci_vc_save_restore_dwords: pos:376 dwords:16
[   74.855580] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   74.863150] pcieport 0000:01:00.0: pci_vc_do_save_buffer: len:76
[   76.390998] sched: RT throttling activated
[   78.616988] pcieport 0000:00:01.1: power state changed by ACPI to D0
[   78.636106] pcieport 0000:00:01.1: pci_restore_vc_state: i:1 pos:256
[   78.642483] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   78.649139] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   78.655790] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   78.663376] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   78.669325] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   78.676919] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   78.682969] pcieport 0000:00:01.1: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   78.691694] pcieport 0000:00:01.1: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   78.700420] pcieport 0000:00:01.1: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   78.710200] pcieport 0000:00:01.1: PME# disabled
[   78.838975] pcieport 0000:00:01.1: saving config space at offset 0x0 (reading 0x19058086)
[   78.847182] pcieport 0000:00:01.1: saving config space at offset 0x4 (reading 0x100407)
[   78.855215] pcieport 0000:00:01.1: saving config space at offset 0x8 (reading 0x6040005)
[   78.863331] pcieport 0000:00:01.1: saving config space at offset 0xc (reading 0x810010)
[   78.871363] pcieport 0000:00:01.1: saving config space at offset 0x10 (reading 0x0)
[   78.879043] pcieport 0000:00:01.1: saving config space at offset 0x14 (reading 0x0)
[   78.886728] pcieport 0000:00:01.1: saving config space at offset 0x18 (reading 0xe0e00)
[   78.894762] pcieport 0000:00:01.1: saving config space at offset 0x1c (reading 0xf0)
[   78.902529] pcieport 0000:00:01.1: saving config space at offset 0x20 (reading 0xfff0)
[   78.910469] pcieport 0000:00:01.1: saving config space at offset 0x24 (reading 0x1fff1)
[   78.918498] pcieport 0000:00:01.1: saving config space at offset 0x28 (reading 0x0)
[   78.926176] pcieport 0000:00:01.1: saving config space at offset 0x2c (reading 0x0)
[   78.933857] pcieport 0000:00:01.1: saving config space at offset 0x30 (reading 0x0)
[   78.941537] pcieport 0000:00:01.1: saving config space at offset 0x34 (reading 0x88)
[   78.949299] pcieport 0000:00:01.1: saving config space at offset 0x38 (reading 0x0)
[   78.956985] pcieport 0000:00:01.1: saving config space at offset 0x3c (reading 0x12010b)
[   78.965256] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   78.971890] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   78.978537] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   78.986126] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   78.992093] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   78.999683] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   79.005664] pcieport 0000:00:01.1: PME# enabled
[   79.023036] pcieport 0000:00:01.1: power state changed by ACPI to D3cold


[  101.457576] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [irq/122-aerdrv:129]
[  101.465153] Modules linked in: xdma(O) aer_inject xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6tai
[  101.465206]  drm_kms_helper syscopyarea sysfillrect igb sysimgblt ptp fb_sys_fops libahci crc32c_intel pps_core serio_raw dca i2c_algo_bit libata drm video
[  101.566072] CPU: 2 PID: 129 Comm: irq/122-aerdrv Tainted: G S         O      5.10.0-1.keithv3.el7.x86_64 #1
[  101.575814] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
[  101.582950] RIP: 0010:pci_mmcfg_read+0xb7/0xe0
[  101.587403] Code: 63 c5 48 01 d8 66 8b 00 0f b7 c0 41 89 04 24 eb dc 49 63 c5 48 01 d8 8a 00 0f b6 c0 41 89 04 24 eb cb 49 63 c5 48 01 d8 8b 00 <41> 89 04 24 eb bd e8 be f4 78 ff 5b 41 c7 04 24 ff ff ff ff b8 ea
[  101.606158] RSP: 0018:ffffc90000303b68 EFLAGS: 00000286
[  101.611391] RAX: 00000000b701000d RBX: ffffc90010208000 RCX: 0000000000000f24
[  101.618531] RDX: 00000000000000ff RSI: 0000000000000002 RDI: 0000000000000000
[  101.625673] RBP: ffffc90000303b90 R08: 0000000000000004 R09: ffffc90000303bc4
[  101.632811] R10: 0000000000000008 R11: ffffc900003038e0 R12: ffffc90000303bc4
[  101.639944] R13: 0000000000000f24 R14: 0000000000008000 R15: 0000000000000004
[  101.647082] FS:  0000000000000000(0000) GS:ffff88845dc80000(0000) knlGS:0000000000000000
[  101.655174] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.660920] CR2: 00007f9369637b80 CR3: 000000000240a005 CR4: 00000000003706e0
[  101.668057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  101.675190] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  101.682323] Call Trace:
[  101.684780]  raw_pci_read+0x35/0x40
[  101.688277]  pci_read+0x2c/0x30
[  101.691428]  pci_bus_read_config_dword+0x4a/0x70
[  101.696051]  pci_read_config_dword+0x23/0x40
[  101.700331]  pci_find_next_ext_capability.part.24+0x6b/0xe0
[  101.705912]  pci_find_ext_capability.part.25+0x12/0x20
[  101.711058]  pci_find_ext_capability+0x1a/0x20
[  101.715509]  pci_restore_vc_state+0x32/0xb0
[  101.719695]  ? pcie_capability_write_word+0x32/0x90
[  101.724579]  pci_restore_state.part.44+0x177/0x420
[  101.729377]  pci_restore_state+0x18/0x20
[  101.733310]  pcie_portdrv_slot_reset+0x12/0x30
[  101.737765]  report_slot_reset+0x46/0x80
[  101.741695]  ? merge_result.part.4+0x40/0x40
[  101.745971]  pci_walk_bus+0x48/0x90
[  101.749466]  pci_walk_bridge+0x1d/0x30
[  101.753223]  pcie_do_recovery+0x1bd/0x270
[  101.757241]  ? aer_dev_correctable_show+0xd0/0xd0
[  101.761948]  ? aer_dev_correctable_show+0xd0/0xd0
[  101.766657]  aer_process_err_devices+0xd0/0xe9
[  101.771104]  aer_isr.cold.25+0x54/0xa3
[  101.774862]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
[  101.779924]  irq_thread_fn+0x24/0x60
[  101.783516]  irq_thread+0xea/0x170
[  101.786926]  ? irq_forced_thread_fn+0x80/0x80
[  101.791292]  ? irq_thread_check_affinity+0xf0/0xf0
[  101.796088]  kthread+0x126/0x140
[  101.799330]  ? kthread_park+0x90/0x90
[  101.803004]  ret_from_fork+0x22/0x30
[  111.539063] pcieport 0000:02:01.0: pci_restore_vc_state: i:1 pos:328
[  111.545503] pcieport 0000:02:01.0: pci_vc_do_save_buffer: buf:1 pos:328
[  129.266656] pcieport 0000:02:01.0: pci_vc_do_save_buffer: buf:0 pos:328
[  134.720858] pcieport 0000:02:01.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[  134.728712] pcieport 0000:02:01.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[  136.397775] rcu: INFO: rcu_sched self-detected stall on CPU
[  136.410927] rcu:     2-....: (1 GPs behind) idle=7b2/1/0x4000000000000000 softirq=5916/5917 fqs=14413
[  136.419972]  (t=60000 jiffies g=12137 q=11828)
[  136.424422] NMI backtrace for cpu 2
[  136.427915] CPU: 2 PID: 129 Comm: irq/122-aerdrv Tainted: G S         O L    5.10.0-1.keithv3.el7.x86_64 #1
[  136.437651] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
[  136.444785] Call Trace:
[  136.447237]  <IRQ>
[  136.449263]  dump_stack+0x74/0x92
[  136.452595]  nmi_cpu_backtrace.cold.8+0x32/0x68
[  136.457134]  ? lapic_can_unplug_cpu+0x80/0x80
[  136.461494]  nmi_trigger_cpumask_backtrace+0xe6/0xf0
[  136.466467]  arch_trigger_cpumask_backtrace+0x19/0x20
[  136.471526]  rcu_dump_cpu_stacks+0xb6/0xe6
[  136.475635]  rcu_sched_clock_irq.cold.92+0x124/0x3a9
[  136.480610]  ? account_system_index_time+0x94/0xa0
[  136.485417]  ? tick_sched_do_timer+0x60/0x60
[  136.489696]  update_process_times+0x93/0xd0
[  136.493892]  tick_sched_handle.isra.23+0x25/0x60
[  136.498517]  tick_sched_timer+0x6b/0x80
[  136.502360]  __hrtimer_run_queues+0x105/0x240
[  136.506726]  hrtimer_interrupt+0x101/0x220
[  136.510836]  __sysvec_apic_timer_interrupt+0x62/0xe0
[  136.515815]  asm_call_irq_on_stack+0x12/0x20
[  136.520092]  </IRQ>
[  136.522202]  sysvec_apic_timer_interrupt+0x7d/0x90
[  136.527003]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  136.532152] RIP: 0010:vprintk_emit+0x10b/0x250
[  136.536601] Code: 1f 44 00 00 e8 e6 0b 00 00 8b 7d d4 48 8b 55 c8 89 de 4d 89 f8 4c 89 f1 e8 62 fa ff ff 89 c3 e8 eb 0b 00 00 48 8b 7d c0 57 9d <0f> 1f 44 00 00 45 84 e4 74 1f 80 3d 95 0a 6f 01 00 74 05 e8 3d d6
[  136.555356] RSP: 0018:ffffc90000303968 EFLAGS: 00000246
[  136.560583] RAX: 0000000000000046 RBX: 0000000000000046 RCX: 0000000000001d91
[  136.567718] RDX: ffffffff825d9628 RSI: 80000000ffffc9db RDI: 0000000000000246
[  136.574852] RBP: ffffc900003039b0 R08: ffffffff8246a940 R09: 0000000000000000
[  136.581992] R10: ffffffff82b21518 R11: ffffffff82b21516 R12: 0000000000000000
[  136.589126] R13: ffffffff82176221 R14: ffffffff8215574f R15: ffffc90000303a60
[  136.596267]  dev_vprintk_emit+0x13b/0x15f
[  136.600286]  dev_printk_emit+0x4e/0x65
[  136.604048]  ? pci_mmcfg_read+0xb7/0xe0
[  136.607890]  __dev_printk+0x58/0x6b
[  136.611383]  _dev_warn+0x6c/0x83
[  136.614622]  ? pci_bus_read_config_dword+0x4a/0x70
[  136.619418]  pci_vc_do_save_buffer+0x251/0x5ab
[  136.623869]  pci_vc_do_save_buffer+0x87/0x5ab
[  136.628238]  ? _dev_warn+0x6c/0x83
[  136.631642]  pci_restore_vc_state+0x7e/0xb0
[  136.635839]  pci_restore_state.part.44+0x177/0x420
[  136.640636]  pci_restore_state+0x18/0x20
[  136.644569]  pcie_portdrv_slot_reset+0x12/0x30
[  136.649017]  report_slot_reset+0x46/0x80
[  136.652945]  ? merge_result.part.4+0x40/0x40
[  136.657221]  pci_walk_bus+0x48/0x90
[  136.660715]  pci_walk_bridge+0x1d/0x30
[  136.664471]  pcie_do_recovery+0x1bd/0x270
[  136.668483]  ? aer_dev_correctable_show+0xd0/0xd0
[  136.673192]  ? aer_dev_correctable_show+0xd0/0xd0
[  136.677901]  aer_process_err_devices+0xd0/0xe9
[  136.682351]  aer_isr.cold.25+0x54/0xa3
[  136.686108]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
[  136.691167]  irq_thread_fn+0x24/0x60
[  136.694754]  irq_thread+0xea/0x170
[  136.698169]  ? irq_forced_thread_fn+0x80/0x80
[  136.702533]  ? irq_thread_check_affinity+0xf0/0xf0
[  136.707330]  kthread+0x126/0x140
[  136.710572]  ? kthread_park+0x90/0x90
[  136.714247]  ret_from_fork+0x22/0x30
[  136.717927] pcieport 0000:02:01.0: pci_vc_do_save_buffer: len:76
[  161.459414] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [irq/122-aerdrv:129]
[  161.459417] Modules linked in: xdma(O) aer_inject xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6tai
[  161.459460]  drm_kms_helper syscopyarea sysfillrect igb sysimgblt ptp fb_sys_fops libahci crc32c_intel pps_core serio_raw dca i2c_algo_bit libata drm video
[  161.459468] CPU: 2 PID: 129 Comm: irq/122-aerdrv Tainted: G S         O L    5.10.0-1.keithv3.el7.x86_64 #1
[  161.459470] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
[  161.459471] RIP: 0010:console_unlock+0x374/0x510
[  161.459475] Code: ff e8 10 f9 ff ff 85 c0 0f 85 4f ff ff ff e8 c3 f8 ff ff 85 c0 0f 85 18 fd ff ff e9 3d ff ff ff e8 71 2b 00 00 4c 89 ff 57 9d <0f> 1f 44 00 00 8b 85 44 ff ff ff 85 c0 0f 84 3a fd ff ff e8 74 03
[  161.459476] RSP: 0018:ffffc90000303898 EFLAGS: 00000247
[  161.459478] RAX: 0000000000000000 RBX: ffffffff82af03ac RCX: ffffffff815dba40
[  161.459479] RDX: 0000000000000000 RSI: ffffffff82af2420 RDI: 0000000000000247
[  161.459480] RBP: ffffc90000303958 R08: 0000019b07f91685 R09: 0000000000000002
[  161.459481] R10: 32303a3030303020 R11: 74726f7065696370 R12: 0000000000000000
[  161.459482] R13: 0000000000000000 R14: ffffffff827ecfa0 R15: 0000000000000247
[  161.459483] FS:  0000000000000000(0000) GS:ffff88845dc80000(0000) knlGS:0000000000000000
[  161.459485] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  161.459486] CR2: 00007f9369637b80 CR3: 000000000240a005 CR4: 00000000003706e0
[  161.459487] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  161.459488] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  161.459489] Call Trace:
[  161.459491]  vprintk_emit+0x14b/0x250
[  161.459493]  dev_vprintk_emit+0x13b/0x15f
[  161.459497]  dev_printk_emit+0x4e/0x65
[  161.459499]  ? pci_mmcfg_read+0xb7/0xe0
[  161.459503]  __dev_printk+0x58/0x6b
[  161.459505]  _dev_warn+0x6c/0x83
[  161.459508]  ? pci_bus_read_config_dword+0x4a/0x70
[  161.459510]  pci_vc_do_save_buffer+0x576/0x5ab
...


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-11 20:02                 ` Hinko Kocevar
@ 2021-01-11 22:09                   ` Keith Busch
       [not found]                     ` <ed8256dd-d70d-b8dc-fdc0-a78b9aa3bbd9@ess.eu>
  0 siblings, 1 reply; 36+ messages in thread
From: Keith Busch @ 2021-01-11 22:09 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: Kelley, Sean V, Linux PCI, Bjorn Helgaas

On Mon, Jan 11, 2021 at 09:02:39PM +0100, Hinko Kocevar wrote:
> Attached are the messages.
> 
> Thanks!

Thank you. It kind of looks like the frequent extended capability lookups
might just be really slow for some reason. Could you try the following
patch and let me know if it improves the CPU lockup?

---
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5c59365092fa..8a61a9365c28 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -106,6 +106,7 @@ void pci_pm_init(struct pci_dev *dev);
 void pci_ea_init(struct pci_dev *dev);
 void pci_msi_init(struct pci_dev *dev);
 void pci_msix_init(struct pci_dev *dev);
+void pci_vc_init(struct pci_dev *dev);
 void pci_allocate_cap_save_buffers(struct pci_dev *dev);
 void pci_free_cap_save_buffers(struct pci_dev *dev);
 bool pci_bridge_d3_possible(struct pci_dev *dev);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 953f15abc850..56992a42bac6 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2401,6 +2401,7 @@ static void pci_init_capabilities(struct pci_dev *dev)
 	pci_aer_init(dev);		/* Advanced Error Reporting */
 	pci_dpc_init(dev);		/* Downstream Port Containment */
 	pci_rcec_init(dev);		/* Root Complex Event Collector */
+	pci_vc_init(dev);		/* Virtual Channel */
 
 	pcie_report_downtraining(dev);
 
diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
index 5fc59ac31145..76ac118b9b5d 100644
--- a/drivers/pci/vc.c
+++ b/drivers/pci/vc.c
@@ -357,7 +357,7 @@ int pci_save_vc_state(struct pci_dev *dev)
 		int pos, ret;
 		struct pci_cap_saved_state *save_state;
 
-		pos = pci_find_ext_capability(dev, vc_caps[i].id);
+		pos = dev->vc_caps[i];
 		if (!pos)
 			continue;
 
@@ -394,9 +394,12 @@ void pci_restore_vc_state(struct pci_dev *dev)
 		int pos;
 		struct pci_cap_saved_state *save_state;
 
-		pos = pci_find_ext_capability(dev, vc_caps[i].id);
+		pos = dev->vc_caps[i];
+		if (!pos)
+			continue;
+
 		save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
-		if (!save_state || !pos)
+		if (!save_state)
 			continue;
 
 		pci_vc_do_save_buffer(dev, pos, save_state, false);
@@ -415,7 +418,7 @@ void pci_allocate_vc_save_buffers(struct pci_dev *dev)
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
-		int len, pos = pci_find_ext_capability(dev, vc_caps[i].id);
+		int len, pos = dev->vc_caps[i];
 
 		if (!pos)
 			continue;
@@ -426,3 +429,11 @@ void pci_allocate_vc_save_buffers(struct pci_dev *dev)
 				vc_caps[i].name);
 	}
 }
+
+void pci_vc_init(struct pci_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(vc_caps); i++)
+		dev->vc_caps[i] = pci_find_ext_capability(dev, vc_caps[i].id);
+}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b32126d26997..7a3aa7e4d6f8 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -501,6 +501,7 @@ struct pci_dev {
 	struct pci_p2pdma *p2pdma;
 #endif
 	u16		acs_cap;	/* ACS Capability offset */
+	u16		vc_caps[3];	/* Virtual Channel capability offsets */
 	phys_addr_t	rom;		/* Physical address if not from BAR */
 	size_t		romlen;		/* Length if not from BAR */
 	char		*driver_override; /* Driver name to force a match */
--

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
       [not found]                     ` <ed8256dd-d70d-b8dc-fdc0-a78b9aa3bbd9@ess.eu>
@ 2021-01-12 19:27                       ` Keith Busch
  2021-01-12 22:19                         ` Hinko Kocevar
  0 siblings, 1 reply; 36+ messages in thread
From: Keith Busch @ 2021-01-12 19:27 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: Kelley, Sean V, Linux PCI, Bjorn Helgaas

On Tue, Jan 12, 2021 at 06:20:55PM +0100, Hinko Kocevar wrote:
> On 1/11/21 11:09 PM, Keith Busch wrote:
>
> Here is the log after applying the patch.
> 
> What sticks out are the numerous "VC buffer not found in pci_save_vc_state"
> messages, AFAICT from vc.c pci_save_vc_state(), which I have not spotted
> before:
> 
> [dev@bd-cpu18 ~]$ dmesg | grep vc_
> [  336.960749] pcieport 0000:00:01.1: VC buffer not found in pci_save_vc_state
> [  338.125683] pcieport 0000:01:00.0: VC buffer not found in pci_save_vc_state
> [  338.342504] pcieport 0000:02:01.0: VC buffer not found in pci_save_vc_state
> [  338.569035] pcieport 0000:03:00.0: VC buffer not found in pci_save_vc_state
> [  338.775696] pcieport 0000:04:01.0: VC buffer not found in pci_save_vc_state
> [  338.982599] pcieport 0000:04:03.0: VC buffer not found in pci_save_vc_state
> [  339.189608] pcieport 0000:04:08.0: VC buffer not found in pci_save_vc_state
> [  339.406232] pcieport 0000:04:0a.0: VC buffer not found in pci_save_vc_state
> [  339.986434] pcieport 0000:04:12.0: VC buffer not found in pci_save_vc_state

Ah, that's happening because I added the cap caching after the cap
buffer allocation. The patch below on top of the previous should fix
those warnings.
 
> I do not see the lockup anymore, and the recovery seems to have successfully
> been performed.

Okay, that kind of indicates the frequent capability lookups are taking
a while. We've cached other capability offsets for similar reasons in the
past, but I don't recall them ever taking so long that they trigger the
CPU lockup watchdog.
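
For context, each pci_find_ext_capability() call is roughly the walk below
(a simplified sketch, not the kernel's exact code): every hop is another
config read, and the VC save/restore paths were redoing the whole walk each
time, which is what caching the offsets in the previous patch avoids.

static int find_ext_cap_sketch(struct pci_dev *dev, u16 want)
{
	u32 header;
	int pos = 0x100;		/* extended caps start after standard config space */
	int ttl = (4096 - 256) / 8;	/* bound the walk */

	if (pci_read_config_dword(dev, pos, &header) || !header)
		return 0;

	while (ttl-- > 0) {
		if (PCI_EXT_CAP_ID(header) == want)
			return pos;	/* found the capability */
		pos = PCI_EXT_CAP_NEXT(header);
		if (pos < 0x100)
			break;		/* end of the list */
		if (pci_read_config_dword(dev, pos, &header))
			break;
	}
	return 0;
}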

---
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 56992a42bac6..a12efa87c7e0 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2385,6 +2385,7 @@ static void pci_init_capabilities(struct pci_dev *dev)
 	pci_ea_init(dev);		/* Enhanced Allocation */
 	pci_msi_init(dev);		/* Disable MSI */
 	pci_msix_init(dev);		/* Disable MSI-X */
+	pci_vc_init(dev);		/* Virtual Channel */
 
 	/* Buffers for saving PCIe and PCI-X capabilities */
 	pci_allocate_cap_save_buffers(dev);
@@ -2401,7 +2402,6 @@ static void pci_init_capabilities(struct pci_dev *dev)
 	pci_aer_init(dev);		/* Advanced Error Reporting */
 	pci_dpc_init(dev);		/* Downstream Port Containment */
 	pci_rcec_init(dev);		/* Root Complex Event Collector */
-	pci_vc_init(dev);		/* Virtual Channel */
 
 	pcie_report_downtraining(dev);
 
--

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-12 19:27                       ` Keith Busch
@ 2021-01-12 22:19                         ` Hinko Kocevar
  2021-01-12 23:17                           ` Keith Busch
  0 siblings, 1 reply; 36+ messages in thread
From: Hinko Kocevar @ 2021-01-12 22:19 UTC (permalink / raw)
  To: Keith Busch; +Cc: Kelley, Sean V, Linux PCI, Bjorn Helgaas



On 1/12/21 8:27 PM, Keith Busch wrote:
> On Tue, Jan 12, 2021 at 06:20:55PM +0100, Hinko Kocevar wrote:
>> On 1/11/21 11:09 PM, Keith Busch wrote:
>>
>> Here is the log after applying the patch.
>>
>> What sticks out are the numerous "VC buffer not found in pci_save_vc_state"
>> messages, AFAICT from vc.c pci_save_vc_state(), which I have not spotted
>> before:
>>
>> [dev@bd-cpu18 ~]$ dmesg | grep vc_
>> [  336.960749] pcieport 0000:00:01.1: VC buffer not found in pci_save_vc_state
>> [  338.125683] pcieport 0000:01:00.0: VC buffer not found in pci_save_vc_state
>> [  338.342504] pcieport 0000:02:01.0: VC buffer not found in pci_save_vc_state
>> [  338.569035] pcieport 0000:03:00.0: VC buffer not found in pci_save_vc_state
>> [  338.775696] pcieport 0000:04:01.0: VC buffer not found in pci_save_vc_state
>> [  338.982599] pcieport 0000:04:03.0: VC buffer not found in pci_save_vc_state
>> [  339.189608] pcieport 0000:04:08.0: VC buffer not found in pci_save_vc_state
>> [  339.406232] pcieport 0000:04:0a.0: VC buffer not found in pci_save_vc_state
>> [  339.986434] pcieport 0000:04:12.0: VC buffer not found in pci_save_vc_state
> 
> Ah, that's happening because I added the cap caching after the cap
> buffer allocation. The patch below on top of the previous should fix
> those warnings.

Right. With this little change applied, the warnings are gone now, and I 
can see the VC related debug messages you introduced with the first patch.

>   
>> I do not see the lockup anymore, and the recovery seems to have successfully
>> been performed.
> 
> Okay, that kind of indicates the frequent capability lookups are taking
> a while. We've cached other capability offsets for similar reasons in the
> past, but I don't recall them ever taking so long that they trigger the
> CPU lockup watchdog.

Now that the VC restore is being invoked again, I see the lockups are back.

> 
> ---
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 56992a42bac6..a12efa87c7e0 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2385,6 +2385,7 @@ static void pci_init_capabilities(struct pci_dev *dev)
>   	pci_ea_init(dev);		/* Enhanced Allocation */
>   	pci_msi_init(dev);		/* Disable MSI */
>   	pci_msix_init(dev);		/* Disable MSI-X */
> +	pci_vc_init(dev);		/* Virtual Channel */
>   
>   	/* Buffers for saving PCIe and PCI-X capabilities */
>   	pci_allocate_cap_save_buffers(dev);
> @@ -2401,7 +2402,6 @@ static void pci_init_capabilities(struct pci_dev *dev)
>   	pci_aer_init(dev);		/* Advanced Error Reporting */
>   	pci_dpc_init(dev);		/* Downstream Port Containment */
>   	pci_rcec_init(dev);		/* Root Complex Event Collector */
> -	pci_vc_init(dev);		/* Virtual Channel */
>   
>   	pcie_report_downtraining(dev);
>   
> --
> 

I feel inclined to provide a little bit more info about the system I'm 
running this on as it is not a regular PC/server/laptop. It is a modular 
micro TCA system with a single CPU and MCH. MCH and CPU are separate 
cards, as are the other processing cards (AMCs) that link up to CPU 
through the MCH PEX8748 switch. I can power each card individually, or 
perform complete system power cycle. The normal power up sequence is: 
MCH, AMCs, CPU. The CPU is powered 30 sec after all other cards so that 
their PCIe links are up and ready for Linux.

All buses below CPU side 02:01.0 are on MCH PEX8748 switch:

[dev@bd-cpu18 ~]$ sudo /usr/local/bin/pcicrawler -t
00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
  ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
  │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
  │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
  │  │     ├─04:01.0 downstream_port, slot 4, power: Off
  │  │     ├─04:03.0 downstream_port, slot 3, power: Off
  │  │     ├─04:08.0 downstream_port, slot 5, power: Off
  │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
  │  │     │  └─08:00.0 endpoint, Xilinx Corporation (10ee), device 8034
  │  │     └─04:12.0 downstream_port, slot 1, power: Off
  │  ├─02:02.0 downstream_port, slot 2
  │  ├─02:08.0 downstream_port, slot 8
  │  ├─02:09.0 downstream_port, slot 9, power: Off
  │  └─02:0a.0 downstream_port, slot 10
  ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
  ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
  ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
  └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0


The lockups most frequently appear after the cold boot of the system. If 
I restart the CPU card only, and leave the MCH (where the PEX8748 switch 
resides) powered, the lockups do *not* happen. I'm injecting the same 
error into the root port and the system card 
configuration/location/count is always the same.

Nevertheless, on rare occasions while booting the same kernel image
after a complete system power cycle, no lockup is observed.

So far I have observed that the lockups seem to always happen when
recovery is dealing with the 02:01.0 device/bus.

If the system recovers from the first injected error, I can repeat the
injection and the system always recovers. If the first recovery fails I
have to either reboot the CPU or power cycle the complete system.

To me it looks like this behavior is somehow related to the system/setup
I have, and for some reason is triggered by VC restoration (VC is not in
use by my system at all, AFAIK).

Are you able to tell which part of the code the CPU is actually spinning
in when the lockup is detected? I added many printk()s in
pci_restore_vc_state(), in the AER IRQ handler, and in the surrounding
code, hoping to see something being printed continuously, but nothing
appeared.

I'll continue testing your patches and try to come up with more info.

Thank you for your time!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-12 22:19                         ` Hinko Kocevar
@ 2021-01-12 23:17                           ` Keith Busch
  2021-01-18  8:00                             ` Hinko Kocevar
  0 siblings, 1 reply; 36+ messages in thread
From: Keith Busch @ 2021-01-12 23:17 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: Kelley, Sean V, Linux PCI, Bjorn Helgaas

On Tue, Jan 12, 2021 at 11:19:37PM +0100, Hinko Kocevar wrote:
> I feel inclined to provide a little bit more info about the system I'm
> running this on as it is not a regular PC/server/laptop. It is a modular
> micro TCA system with a single CPU and MCH. MCH and CPU are separate cards,
> as are the other processing cards (AMCs) that link up to CPU through the MCH
> PEX8748 switch. I can power each card individually, or perform complete
> system power cycle. The normal power up sequence is: MCH, AMCs, CPU. The CPU
> is powered 30 sec after all other cards so that their PCIe links are up and
> ready for Linux.
> 
> All buses below CPU side 02:01.0 are on MCH PEX8748 switch:
> 
> [dev@bd-cpu18 ~]$ sudo /usr/local/bin/pcicrawler -t
> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>  ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>  │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>  │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>  │  │     ├─04:01.0 downstream_port, slot 4, power: Off
>  │  │     ├─04:03.0 downstream_port, slot 3, power: Off
>  │  │     ├─04:08.0 downstream_port, slot 5, power: Off
>  │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>  │  │     │  └─08:00.0 endpoint, Xilinx Corporation (10ee), device 8034
>  │  │     └─04:12.0 downstream_port, slot 1, power: Off
>  │  ├─02:02.0 downstream_port, slot 2
>  │  ├─02:08.0 downstream_port, slot 8
>  │  ├─02:09.0 downstream_port, slot 9, power: Off
>  │  └─02:0a.0 downstream_port, slot 10
>  ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
> 
> 
> The lockups most frequently appear after the cold boot of the system. If I
> restart the CPU card only, and leave the MCH (where the PEX8748 switch
> resides) powered, the lockups do *not* happen. I'm injecting the same error
> into the root port and the system card configuration/location/count is
> always the same.
> 
> Nevertheless, in rare occasions while booting the same kernel image after
> complete system power cycle, no lockup is observed.
> 
> So far I observed that the lockups seem to always happen when recovery is
> dealing with the 02:01.0 device/bus.
> 
> If the system recovers from a first injected error, I can repeat the
> injection and the system recovers always. If the first recovery fails I have
> to either reboot the CPU or power cycle the complete system.
> 
> To me it looks like this behavior is somehow related to the system/setup I
> have, and for some reason is triggered by VC restoration (VC is not in use
> by my system at all, AFAIK).
 
> Are you able to tell which part of the code the CPU is actually spinning in
> when the lockup is detected? I added many printk()s in the
> pci_restore_vc_state(), in the AER IRQ handler, and around to see something
> being continuously printed, but nothing appeared..

It sounds like your setup is having difficulty completing config
cycles in a timely fashion after a secondary bus reset. I don't see
right now how anything I've provided in this series is causing that.

All the stack traces you've provided so far are within virtual
channel restoration. Subsequent stack traces are never the same, though,
so it does not appear to be permanently stuck; it's just incredibly
slow. This particular capability restoration happens to require more
config cycles than most other capabilities, so I'm guessing it shows up
in your observation because of that rather than anything specific
about VC.

The long delays seem like a CTO should have kicked in, but maybe your
hardware isn't doing it right. Your lspci says Completion Timeout
configuration is not supported, so it should default to 50msec maximum,
but since it's taking long enough to trigger a stuck CPU watchdog, and
you appear to be getting valid data back, it doesn't look like CTO is
happening.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-12 23:17                           ` Keith Busch
@ 2021-01-18  8:00                             ` Hinko Kocevar
  2021-01-19 18:28                               ` Keith Busch
  0 siblings, 1 reply; 36+ messages in thread
From: Hinko Kocevar @ 2021-01-18  8:00 UTC (permalink / raw)
  To: Keith Busch; +Cc: Kelley, Sean V, Linux PCI, Bjorn Helgaas



On 1/13/21 12:17 AM, Keith Busch wrote:
> On Tue, Jan 12, 2021 at 11:19:37PM +0100, Hinko Kocevar wrote:
>> I feel inclined to provide a little bit more info about the system I'm
>> running this on as it is not a regular PC/server/laptop. It is a modular
>> micro TCA system with a single CPU and MCH. MCH and CPU are separate cards,
>> as are the other processing cards (AMCs) that link up to CPU through the MCH
>> PEX8748 switch. I can power each card individually, or perform complete
>> system power cycle. The normal power up sequence is: MCH, AMCs, CPU. The CPU
>> is powered 30 sec after all other cards so that their PCIe links are up and
>> ready for Linux.
>>
>> All buses below CPU side 02:01.0 are on MCH PEX8748 switch:
>>
>> [dev@bd-cpu18 ~]$ sudo /usr/local/bin/pcicrawler -t
>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>>   ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>>   │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>>   │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>>   │  │     ├─04:01.0 downstream_port, slot 4, power: Off
>>   │  │     ├─04:03.0 downstream_port, slot 3, power: Off
>>   │  │     ├─04:08.0 downstream_port, slot 5, power: Off
>>   │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>>   │  │     │  └─08:00.0 endpoint, Xilinx Corporation (10ee), device 8034
>>   │  │     └─04:12.0 downstream_port, slot 1, power: Off
>>   │  ├─02:02.0 downstream_port, slot 2
>>   │  ├─02:08.0 downstream_port, slot 8
>>   │  ├─02:09.0 downstream_port, slot 9, power: Off
>>   │  └─02:0a.0 downstream_port, slot 10
>>   ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>   ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>   ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>   └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>
>>
>> The lockups most frequently appear after the cold boot of the system. If I
>> restart the CPU card only, and leave the MCH (where the PEX8748 switch
>> resides) powered, the lockups do *not* happen. I'm injecting the same error
>> into the root port and the system card configuration/location/count is
>> always the same.
>>
>> Nevertheless, in rare occasions while booting the same kernel image after
>> complete system power cycle, no lockup is observed.
>>
>> So far I observed that the lockups seem to always happen when recovery is
>> dealing with the 02:01.0 device/bus.
>>
>> If the system recovers from a first injected error, I can repeat the
>> injection and the system recovers always. If the first recovery fails I have
>> to either reboot the CPU or power cycle the complete system.
>>
>> To me it looks like this behavior is somehow related to the system/setup I
>> have, and for some reason is triggered by VC restoration (VC is not in use
>> by my system at all, AFAIK).
>   
>> Are you able to tell which part of the code the CPU is actually spinning in
>> when the lockup is detected? I added many printk()s in the
>> pci_restore_vc_state(), in the AER IRQ handler, and around to see something
>> being continuously printed, but nothing appeared..
> 
> It sounds like your setup is having difficulty completing config
> cycles timely after a secondary bus reset. I don't see right now how
> anything I've provided in this series is causing that.
> 
> All the stack traces you've provided so far are all within virtual
> channel restoration. Subsequent stack traces are never the same though,
> so it does not appear to be permanently stuck; it's just incredibly
> slow. This particular capability restoration happens to require more
> config cycles than most other capabilities, so I'm guessing it happens
> to show up in your observation because of that rather than anything
> specific about VC.
> 
> The long delays seem like a CTO should have kicked in, but maybe your
> hardware isn't doing it right. Your lspci says Completion Timeout
> configuration is not supported, so it should default to 50msec maximum,
> but since it's taking long enough to trigger a stuck CPU watchdog, and
> you appear to be getting valid data back, it doesn't look like CTO is
> happening.
> 

Some very good news (and some a little less good).

The good news is that after recompiling the latest PCI git kernel 
source with a heavily stripped-down config (without any extras or 
modules), I do *not* see the lockups any more. Up to now I had been 
using the bloated CentOS 7 stock config from the stock kernel to 
compile the git sources.

The not-so-good news is that I'm not sure what exactly solved the issue 
for me; it seems to be config related. I've also been touching some 
other system settings outside of Linux. Otherwise my HW configuration 
has not changed. I performed the same test steps that previously 
resulted in lockups and none have been seen so far. I'm looking into 
what exactly might have been causing the issues.

With that being said, I believe that this patch series is working for my 
system and the issues reported were related to my system mis-configuration.


Thank you again!


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-18  8:00                             ` Hinko Kocevar
@ 2021-01-19 18:28                               ` Keith Busch
  0 siblings, 0 replies; 36+ messages in thread
From: Keith Busch @ 2021-01-19 18:28 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: Kelley, Sean V, Linux PCI, Bjorn Helgaas

On Mon, Jan 18, 2021 at 09:00:34AM +0100, Hinko Kocevar wrote:
> With that being said, I believe that this patch series is working for my
> system and the issues reported were related to my system mis-configuration.

Great, thank you for confirming! I couldn't think of any reason this
series would introduce the behavior you were describing.

Bjorn, do you have any additional concerns?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
                   ` (5 preceding siblings ...)
  2021-01-05 14:21 ` [PATCHv2 0/5] aer handling fixups Hinko Kocevar
@ 2021-02-03  0:03 ` Keith Busch
  2021-02-04  8:35   ` Hinko Kocevar
  2021-02-08 12:55 ` Hedi Berriche
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Keith Busch @ 2021-02-03  0:03 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas

Hi Bjorn,

Any further concerns? I believe Hinko identified his observations as
being solely due to his platform rather than this implementation, which
should fix some outstanding uncorrectable error handling. Please let me
know if there's anything additional you'd like to see with this series.

Thanks,
Keith

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-02-03  0:03 ` Keith Busch
@ 2021-02-04  8:35   ` Hinko Kocevar
  0 siblings, 0 replies; 36+ messages in thread
From: Hinko Kocevar @ 2021-02-04  8:35 UTC (permalink / raw)
  To: Keith Busch, linux-pci, Bjorn Helgaas

Hi,

On 2/3/21 1:03 AM, Keith Busch wrote:
> Hi Bjorn,
> 
> Any further concerns? I believe Hinko identified his observations as
> being solely due to his platform rather than this implementation, which

True.

Thanks,
//hinko

> should fix some outstanding uncorrectable error handling. Please let me
> know if there's anything additional you'd like to see with this series.
> 
> Thanks,
> Keith
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
                   ` (6 preceding siblings ...)
  2021-02-03  0:03 ` Keith Busch
@ 2021-02-08 12:55 ` Hedi Berriche
  2021-02-09 23:06 ` Bjorn Helgaas
  2021-02-10  9:36 ` Yicong Yang
  9 siblings, 0 replies; 36+ messages in thread
From: Hedi Berriche @ 2021-02-08 12:55 UTC (permalink / raw)
  To: kbusch; +Cc: helgaas, linux-pci, Hedi Berriche

Acked-by: Hedi Berriche <hedi.berriche@hpe.com>
Tested-by: Hedi Berriche <hedi.berriche@hpe.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
                   ` (7 preceding siblings ...)
  2021-02-08 12:55 ` Hedi Berriche
@ 2021-02-09 23:06 ` Bjorn Helgaas
  2021-02-10  4:05   ` Keith Busch
  2021-02-10  9:36 ` Yicong Yang
  9 siblings, 1 reply; 36+ messages in thread
From: Bjorn Helgaas @ 2021-02-09 23:06 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Hinko Kocevar

[+cc Hinko]

On Mon, Jan 04, 2021 at 03:02:55PM -0800, Keith Busch wrote:
> Changes from v1:
> 
>   Added received Acks
> 
>   Split the kernel print identifying the port type being reset.
> 
>   Added a patch for the portdrv to ensure the slot_reset happens without
>   relying on a downstream device driver..
> 
> Keith Busch (5):
>   PCI/ERR: Clear status of the reporting device
>   PCI/AER: Actually get the root port
>   PCI/ERR: Retain status from error notification
>   PCI/AER: Specify the type of port that was reset
>   PCI/portdrv: Report reset for frozen channel
> 
>  drivers/pci/pcie/aer.c         |  5 +++--
>  drivers/pci/pcie/err.c         | 16 +++++++---------
>  drivers/pci/pcie/portdrv_pci.c |  3 ++-
>  3 files changed, 12 insertions(+), 12 deletions(-)

I applied these to pci/error for v5.12, thanks!

I *am* a little concerned about the issues Hinko saw because it
doesn't look like we found a root cause.  I didn't spend any time looking
into it, but even if it only shows up on his specific platform or with
some weird config combination, it's a problem.  But I guess we'll see
if anybody else trips over it.

Bjorn

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-02-09 23:06 ` Bjorn Helgaas
@ 2021-02-10  4:05   ` Keith Busch
  2021-02-10 21:38     ` Bjorn Helgaas
  0 siblings, 1 reply; 36+ messages in thread
From: Keith Busch @ 2021-02-10  4:05 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Hinko Kocevar

On Tue, Feb 09, 2021 at 05:06:14PM -0600, Bjorn Helgaas wrote:
> [+cc Hinko]
> 
> On Mon, Jan 04, 2021 at 03:02:55PM -0800, Keith Busch wrote:
> > Changes from v1:
> > 
> >   Added received Acks
> > 
> >   Split the kernel print identifying the port type being reset.
> > 
> >   Added a patch for the portdrv to ensure the slot_reset happens without
> >   relying on a downstream device driver..
> > 
> > Keith Busch (5):
> >   PCI/ERR: Clear status of the reporting device
> >   PCI/AER: Actually get the root port
> >   PCI/ERR: Retain status from error notification
> >   PCI/AER: Specify the type of port that was reset
> >   PCI/portdrv: Report reset for frozen channel
> > 
> >  drivers/pci/pcie/aer.c         |  5 +++--
> >  drivers/pci/pcie/err.c         | 16 +++++++---------
> >  drivers/pci/pcie/portdrv_pci.c |  3 ++-
> >  3 files changed, 12 insertions(+), 12 deletions(-)
> 
> I applied these to pci/error for v5.12, thanks!

Thanks!
 
> I *am* a little concerned about the issues Hinko saw because it
> doesn't look like we found a root cause.  I didn't spend any time looking
> into it, but even if it only shows up on his specific platform or with
> some weird config combination, it's a problem.  But I guess we'll see
> if anybody else trips over it.

Yes, I'm also closely monitoring for AER issues. I think Hinko's
observation was seen without this series, and was just initially noticed
with it. I'm reasonably confident this is a safe improvement, but I want
to see this work well with everyone's hardware, too.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
                   ` (8 preceding siblings ...)
  2021-02-09 23:06 ` Bjorn Helgaas
@ 2021-02-10  9:36 ` Yicong Yang
  9 siblings, 0 replies; 36+ messages in thread
From: Yicong Yang @ 2021-02-10  9:36 UTC (permalink / raw)
  To: Keith Busch, linux-pci, Bjorn Helgaas

Hi,

I've tested the series on a Kunpeng 920 arm64 server and it works well.

Thanks.

On 2021/1/5 7:02, Keith Busch wrote:
> Changes from v1:
> 
>   Added received Acks
> 
>   Split the kernel print identifying the port type being reset.
> 
>   Added a patch for the portdrv to ensure the slot_reset happens without
>   relying on a downstream device driver..
> 
> Keith Busch (5):
>   PCI/ERR: Clear status of the reporting device
>   PCI/AER: Actually get the root port
>   PCI/ERR: Retain status from error notification
>   PCI/AER: Specify the type of port that was reset
>   PCI/portdrv: Report reset for frozen channel
> 
>  drivers/pci/pcie/aer.c         |  5 +++--
>  drivers/pci/pcie/err.c         | 16 +++++++---------
>  drivers/pci/pcie/portdrv_pci.c |  3 ++-
>  3 files changed, 12 insertions(+), 12 deletions(-)
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 0/5] aer handling fixups
  2021-02-10  4:05   ` Keith Busch
@ 2021-02-10 21:38     ` Bjorn Helgaas
  0 siblings, 0 replies; 36+ messages in thread
From: Bjorn Helgaas @ 2021-02-10 21:38 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Hinko Kocevar

On Wed, Feb 10, 2021 at 01:05:04PM +0900, Keith Busch wrote:
> On Tue, Feb 09, 2021 at 05:06:14PM -0600, Bjorn Helgaas wrote:
> > On Mon, Jan 04, 2021 at 03:02:55PM -0800, Keith Busch wrote:
> > > Changes from v1:
> > > 
> > >   Added received Acks
> > > 
> > >   Split the kernel print identifying the port type being reset.
> > > 
> > >   Added a patch for the portdrv to ensure the slot_reset happens without
> > >   relying on a downstream device driver..
> > > 
> > > Keith Busch (5):
> > >   PCI/ERR: Clear status of the reporting device
> > >   PCI/AER: Actually get the root port
> > >   PCI/ERR: Retain status from error notification
> > >   PCI/AER: Specify the type of port that was reset
> > >   PCI/portdrv: Report reset for frozen channel
> > > 
> > >  drivers/pci/pcie/aer.c         |  5 +++--
> > >  drivers/pci/pcie/err.c         | 16 +++++++---------
> > >  drivers/pci/pcie/portdrv_pci.c |  3 ++-
> > >  3 files changed, 12 insertions(+), 12 deletions(-)
> > 
> > I applied these to pci/error for v5.12, thanks!
> 
> Thanks!
>  
> > I *am* a little concerned about the issues Hinko saw because it
> > doesn't look like we found a root cause.  I didn't spend any time looking
> > into it, but even if it only shows up on his specific platform or with
> > some weird config combination, it's a problem.  But I guess we'll see
> > if anybody else trips over it.
> 
> Yes, I'm also closely monitoring for AER issues. I think Hinko's
> observation was seen without this series, and was just initially noticed
> with it. I'm reasonably confident this is a safe improvement, but I want
> to see this work well with everyone's hardware, too.

Oh, good.  I missed the fact that it was seen even *without* this
series.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 3/5] PCI/ERR: Retain status from error notification
  2021-01-04 23:02 ` [PATCHv2 3/5] PCI/ERR: Retain status from error notification Keith Busch
@ 2021-03-03  5:34   ` Williams, Dan J
  2021-03-03  5:46     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 36+ messages in thread
From: Williams, Dan J @ 2021-03-03  5:34 UTC (permalink / raw)
  To: kbusch, linux-pci, helgaas
  Cc: hinko.kocevar, Kelley, Sean V, Kuppuswamy, Sathyanarayanan

[ Add Sathya ]

On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote:
> Overwriting the frozen detected status with the result of the link reset
> loses the NEED_RESET result that drivers are depending on for error
> handling to report the .slot_reset() callback. Retain this status so
> that subsequent error handling has the correct flow.
> 
> Reported-by: Hinko Kocevar <hinko.kocevar@ess.eu>
> Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
> Signed-off-by: Keith Busch <kbusch@kernel.org>

Just want to report that this fix might be a candidate for -stable.
Recent DPC error recovery testing on v5.10 shows that losing this
status prevents NVME from restarting the queue after error recovery
with a failing signature like:

[  424.685179] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast error_detected message
[  424.685185] nvme nvme0: frozen state error detected, reset controller
[  425.078620] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast resume message
[  425.078674] pcieport 0000:97:02.0: AER: device recovery successful

...and then later:

[  751.146277] sysrq: Show Blocked State
[  751.146431] task:touch           state:D stack:    0 pid:11969 ppid: 11413 flags:0x00000000
[  751.146434] Call Trace:
[  751.146443]  __schedule+0x31b/0x890
[  751.146445]  schedule+0x3c/0xa0
[  751.146449]  blk_queue_enter+0x151/0x220
[  751.146454]  ? finish_wait+0x80/0x80
[  751.146455]  submit_bio_noacct+0x39a/0x410
[  751.146457]  submit_bio+0x42/0x150
[  751.146460]  ? bio_add_page+0x62/0x90
[  751.146461]  ? guard_bio_eod+0x25/0x70
[  751.146465]  submit_bh_wbc+0x16d/0x190
[  751.146469]  ext4_read_bh+0x47/0xa0
[  751.146472]  ext4_read_inode_bitmap+0x3cd/0x5f0

...because NVME was only told to disable, never to reset and resume the
block queue.

I have not tested it, but by code inspection I eventually found this
upstream fix.
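
For reference, the driver-side callbacks involved look roughly like the
sketch below (a simplified illustration with made-up demo_* names, not
the actual nvme code): when error_detected() returns
PCI_ERS_RESULT_NEED_RESET for a frozen channel, the core is expected to
follow up with .slot_reset() and .resume(); if that NEED_RESET vote gets
overwritten, the driver is left disabled with its queues frozen, which
matches the hang above.

#include <linux/pci.h>

static pci_ers_result_t demo_error_detected(struct pci_dev *pdev,
					    pci_channel_state_t state)
{
	if (state == pci_channel_io_perm_failure)
		return PCI_ERS_RESULT_DISCONNECT;
	if (state == pci_channel_io_frozen) {
		/* quiesce/disable the device and ask for a reset */
		return PCI_ERS_RESULT_NEED_RESET;
	}
	return PCI_ERS_RESULT_CAN_RECOVER;
}

static pci_ers_result_t demo_slot_reset(struct pci_dev *pdev)
{
	/* re-initialize the device after the bus/slot reset */
	return PCI_ERS_RESULT_RECOVERED;
}

static void demo_resume(struct pci_dev *pdev)
{
	/* restart I/O, e.g. unfreeze/unquiesce the request queues */
}

/* hooked up via the .err_handler field of struct pci_driver */
static const struct pci_error_handlers demo_err_handlers = {
	.error_detected	= demo_error_detected,
	.slot_reset	= demo_slot_reset,
	.resume		= demo_resume,
};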

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 3/5] PCI/ERR: Retain status from error notification
  2021-03-03  5:34   ` Williams, Dan J
@ 2021-03-03  5:46     ` Kuppuswamy, Sathyanarayanan
  2021-03-04 20:01       ` Keith Busch
  0 siblings, 1 reply; 36+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-03  5:46 UTC (permalink / raw)
  To: Williams, Dan J, kbusch, linux-pci, helgaas; +Cc: hinko.kocevar, Kelley, Sean V


On 3/2/21 9:34 PM, Williams, Dan J wrote:
> [ Add Sathya ]
>
> On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote:
>> Overwriting the frozen detected status with the result of the link reset
>> loses the NEED_RESET result that drivers are depending on for error
>> handling to report the .slot_reset() callback. Retain this status so
>> that subsequent error handling has the correct flow.
>>
>> Reported-by: Hinko Kocevar <hinko.kocevar@ess.eu>
>> Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
>> Signed-off-by: Keith Busch <kbusch@kernel.org>
> Just want to report that this fix might be a candidate for -stable.
Agree.

I think it can be merged in both stable and mainline kernels.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Recent DPC error recovery testing on v5.10 shows that losing this
> status prevents NVME from restarting the queue after error recovery
> with a failing signature like:
>
> [  424.685179] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast error_detected message
> [  424.685185] nvme nvme0: frozen state error detected, reset controller
> [  425.078620] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast resume message
> [  425.078674] pcieport 0000:97:02.0: AER: device recovery successful
>
> ...and then later:
>
> [  751.146277] sysrq: Show Blocked State
> [  751.146431] task:touch           state:D stack:    0 pid:11969 ppid: 11413 flags:0x00000000
> [  751.146434] Call Trace:
> [  751.146443]  __schedule+0x31b/0x890
> [  751.146445]  schedule+0x3c/0xa0
> [  751.146449]  blk_queue_enter+0x151/0x220
> [  751.146454]  ? finish_wait+0x80/0x80
> [  751.146455]  submit_bio_noacct+0x39a/0x410
> [  751.146457]  submit_bio+0x42/0x150
> [  751.146460]  ? bio_add_page+0x62/0x90
> [  751.146461]  ? guard_bio_eod+0x25/0x70
> [  751.146465]  submit_bh_wbc+0x16d/0x190
> [  751.146469]  ext4_read_bh+0x47/0xa0
> [  751.146472]  ext4_read_inode_bitmap+0x3cd/0x5f0
>
> ...because NVME was only told to disable, never to reset and resume the
> block queue.
>
> I have not tested it, but by code inspection I eventually found this
> upstream fix.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 3/5] PCI/ERR: Retain status from error notification
  2021-03-03  5:46     ` Kuppuswamy, Sathyanarayanan
@ 2021-03-04 20:01       ` Keith Busch
  2021-03-04 22:11         ` Dan Williams
  0 siblings, 1 reply; 36+ messages in thread
From: Keith Busch @ 2021-03-04 20:01 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Williams, Dan J, linux-pci, helgaas, hinko.kocevar, Kelley, Sean V

On Tue, Mar 02, 2021 at 09:46:40PM -0800, Kuppuswamy, Sathyanarayanan wrote:
> 
> On 3/2/21 9:34 PM, Williams, Dan J wrote:
> > [ Add Sathya ]
> > 
> > On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote:
> > > Overwriting the frozen detected status with the result of the link reset
> > > loses the NEED_RESET result that drivers are depending on for error
> > > handling to report the .slot_reset() callback. Retain this status so
> > > that subsequent error handling has the correct flow.
> > > 
> > > Reported-by: Hinko Kocevar <hinko.kocevar@ess.eu>
> > > Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
> > > Signed-off-by: Keith Busch <kbusch@kernel.org>
> > Just want to report that this fix might be a candidate for -stable.
> Agree.
> 
> I think it can be merged in both stable and mainline kernels.
> 
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Just FYI, this patch is practically a revert of this one:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=6d2c89441571ea534d6240f7724f518936c44f8d

so please let me know if that is still a problem for you.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 3/5] PCI/ERR: Retain status from error notification
  2021-03-04 20:01       ` Keith Busch
@ 2021-03-04 22:11         ` Dan Williams
       [not found]           ` <23551edc-965c-21dc-0da8-a492c27c362d@intel.com>
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Williams @ 2021-03-04 22:11 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kuppuswamy, Sathyanarayanan, linux-pci, helgaas, hinko.kocevar,
	Kelley, Sean V

On Thu, Mar 4, 2021 at 12:03 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Tue, Mar 02, 2021 at 09:46:40PM -0800, Kuppuswamy, Sathyanarayanan wrote:
> >
> > On 3/2/21 9:34 PM, Williams, Dan J wrote:
> > > [ Add Sathya ]
> > >
> > > On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote:
> > > > Overwriting the frozen detected status with the result of the link reset
> > > > loses the NEED_RESET result that drivers are depending on for error
> > > > handling to report the .slot_reset() callback. Retain this status so
> > > > that subsequent error handling has the correct flow.
> > > >
> > > > Reported-by: Hinko Kocevar <hinko.kocevar@ess.eu>
> > > > Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
> > > > Signed-off-by: Keith Busch <kbusch@kernel.org>
> > > Just want to report that this fix might be a candidate for -stable.
> > Agree.
> >
> > I think it can be merged in both stable and mainline kernels.
> >
> > Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>
> Just FYI, this patch is practically a revert of this one:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=6d2c89441571ea534d6240f7724f518936c44f8d
>
> so please let me know if that is still a problem for you.

For what it's worth, I think "6d2c89441571 PCI/ERR: Update error status
after reset_link()" is not justified. The link shouldn't recover if
the attached device is not prepared to handle DPC events. As far as I
can see it's just a cosmetic fix, right? Sathya, was there some
end-user-visible need to make the DPC recovery report "success" in this
case?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 3/5] PCI/ERR: Retain status from error notification
       [not found]           ` <23551edc-965c-21dc-0da8-a492c27c362d@intel.com>
@ 2021-03-04 22:59             ` Dan Williams
  2021-03-04 23:19               ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Williams @ 2021-03-04 22:59 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Keith Busch, linux-pci, helgaas, hinko.kocevar, Kelley, Sean V

On Thu, Mar 4, 2021 at 2:38 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@intel.com> wrote:
>
>
> On 3/4/21 2:11 PM, Dan Williams wrote:
>
> On Thu, Mar 4, 2021 at 12:03 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Tue, Mar 02, 2021 at 09:46:40PM -0800, Kuppuswamy, Sathyanarayanan wrote:
>
> On 3/2/21 9:34 PM, Williams, Dan J wrote:
>
> [ Add Sathya ]
>
> On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote:
>
> Overwriting the frozen detected status with the result of the link reset
> loses the NEED_RESET result that drivers are depending on for error
> handling to report the .slot_reset() callback. Retain this status so
> that subsequent error handling has the correct flow.
>
> Reported-by: Hinko Kocevar <hinko.kocevar@ess.eu>
> Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
>
> Just want to report that this fix might be a candidate for -stable.
>
> Agree.
>
> I think it can be merged in both stable and mainline kernels.
>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>
> Just FYI, this patch is practically a revert of this one:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=6d2c89441571ea534d6240f7724f518936c44f8d
>
> so please let me know if that is still a problem for you.
>
> For what it's worth I think "6d2c89441571 PCI/ERR: Update error status
> after reset_link()" is not justified. The link shouldn't recover if
> the attached device is not prepared to handle DPC events.
>
> I added that fix to address the recovery issue seen in a Dell server
> platform (for EDR test case). If I understand the history correctly,
> In EDR case, AER and DPC is owned by firmware, hence we get
> PCI_ERS_RESULT_NO_AER_DRIVER when executing error_detected() callbacks.
> So If we continue the pcie_do_recovery() with PCI_ERS_RESULT_NO_AER_DRIVER
> as error status, then even if we successfully reset the link we will report
> the recovery status as failure.

But that's the right response if there is no handler. If there is a
device attached that was not prepared for the link to go down, then it
does not matter if the link comes back OK; that device will be
thoroughly confused and should be disconnected.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 3/5] PCI/ERR: Retain status from error notification
  2021-03-04 22:59             ` Dan Williams
@ 2021-03-04 23:19               ` Kuppuswamy, Sathyanarayanan
  2021-03-05  0:23                 ` Dan Williams
  0 siblings, 1 reply; 36+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-04 23:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Keith Busch, linux-pci, helgaas, hinko.kocevar, Kelley, Sean V


On 3/4/21 2:59 PM, Dan Williams wrote:
> On Thu, Mar 4, 2021 at 2:38 PM Kuppuswamy, Sathyanarayanan
> <sathyanarayanan.kuppuswamy@intel.com> wrote:
>>
>> On 3/4/21 2:11 PM, Dan Williams wrote:
>>
>> On Thu, Mar 4, 2021 at 12:03 PM Keith Busch <kbusch@kernel.org> wrote:
>>
>> On Tue, Mar 02, 2021 at 09:46:40PM -0800, Kuppuswamy, Sathyanarayanan wrote:
>>
>> On 3/2/21 9:34 PM, Williams, Dan J wrote:
>>
>> [ Add Sathya ]
>>
>> On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote:
>>
>> Overwriting the frozen detected status with the result of the link reset
>> loses the NEED_RESET result that drivers are depending on for error
>> handling to report the .slot_reset() callback. Retain this status so
>> that subsequent error handling has the correct flow.
>>
>> Reported-by: Hinko Kocevar <hinko.kocevar@ess.eu>
>> Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
>> Signed-off-by: Keith Busch <kbusch@kernel.org>
>>
>> Just want to report that this fix might be a candidate for -stable.
>>
>> Agree.
>>
>> I think it can be merged in both stable and mainline kernels.
>>
>> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
>> Just FYI, this patch is practically a revert of this one:
>>
>>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=6d2c89441571ea534d6240f7724f518936c44f8d
>>
>> so please let me know if that is still a problem for you.
>>
>> For what it's worth I think "6d2c89441571 PCI/ERR: Update error status
>> after reset_link()" is not justified. The link shouldn't recover if
>> the attached device is not prepared to handle DPC events.
>>
>> I added that fix to address the recovery issue seen in a Dell server
>> platform (for EDR test case). If I understand the history correctly,
>> In EDR case, AER and DPC is owned by firmware, hence we get
>> PCI_ERS_RESULT_NO_AER_DRIVER when executing error_detected() callbacks.
>> So If we continue the pcie_do_recovery() with PCI_ERS_RESULT_NO_AER_DRIVER
>> as error status, then even if we successfully reset the link we will report
>> the recovery status as failure.
> But that's the right response if there is no handler.
If the handler is not available due to AER being owned by firmware,
then that needs to be fixed. In EDR mode, even if DPC/AER is owned
by firmware, the OS needs to own the recovery part. So I think it
needs further investigation to understand why it reports
PCI_ERS_RESULT_NO_AER_DRIVER.
> If there is a
> device attached that was not prepared for the link to go down then it
> does not matter if the link comes back ok that device will be
> thoroughly confused and should be disconnected.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 3/5] PCI/ERR: Retain status from error notification
  2021-03-04 23:19               ` Kuppuswamy, Sathyanarayanan
@ 2021-03-05  0:23                 ` Dan Williams
  2021-03-05  0:54                   ` Keith Busch
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Williams @ 2021-03-05  0:23 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Keith Busch, linux-pci, helgaas, hinko.kocevar, Kelley, Sean V

On Thu, Mar 4, 2021 at 3:19 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@intel.com> wrote:
>
>
> On 3/4/21 2:59 PM, Dan Williams wrote:
> > On Thu, Mar 4, 2021 at 2:38 PM Kuppuswamy, Sathyanarayanan
> > <sathyanarayanan.kuppuswamy@intel.com> wrote:
> >>
> >> On 3/4/21 2:11 PM, Dan Williams wrote:
> >>
> >> On Thu, Mar 4, 2021 at 12:03 PM Keith Busch <kbusch@kernel.org> wrote:
> >>
> >> On Tue, Mar 02, 2021 at 09:46:40PM -0800, Kuppuswamy, Sathyanarayanan wrote:
> >>
> >> On 3/2/21 9:34 PM, Williams, Dan J wrote:
> >>
> >> [ Add Sathya ]
> >>
> >> On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote:
> >>
> >> Overwriting the frozen detected status with the result of the link reset
> >> loses the NEED_RESET result that drivers are depending on for error
> >> handling to report the .slot_reset() callback. Retain this status so
> >> that subsequent error handling has the correct flow.
> >>
> >> Reported-by: Hinko Kocevar <hinko.kocevar@ess.eu>
> >> Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
> >> Signed-off-by: Keith Busch <kbusch@kernel.org>
> >>
> >> Just want to report that this fix might be a candidate for -stable.
> >>
> >> Agree.
> >>
> >> I think it can be merged in both stable and mainline kernels.
> >>
> >> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> >>
> >> Just FYI, this patch is practically a revert of this one:
> >>
> >>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=6d2c89441571ea534d6240f7724f518936c44f8d
> >>
> >> so please let me know if that is still a problem for you.
> >>
> >> For what it's worth I think "6d2c89441571 PCI/ERR: Update error status
> >> after reset_link()" is not justified. The link shouldn't recover if
> >> the attached device is not prepared to handle DPC events.
> >>
> >> I added that fix to address the recovery issue seen in a Dell server
> >> platform (for EDR test case). If I understand the history correctly,
> >> In EDR case, AER and DPC is owned by firmware, hence we get
> >> PCI_ERS_RESULT_NO_AER_DRIVER when executing error_detected() callbacks.
> >> So If we continue the pcie_do_recovery() with PCI_ERS_RESULT_NO_AER_DRIVER
> >> as error status, then even if we successfully reset the link we will report
> >> the recovery status as failure.
> > But that's the right response if there is no handler.
> If the handler is not available due to AER being owned by firmware,
> then it needs to be fixed. In EDR mode, even if DPC/AER is owned
> by firmware , OS need to own the recovery part. So I think it
> needs further investigation to understand why it reports,
> PCI_ERS_RESULT_NO_AER_DRIVER

As far as I can see, the only way to get PCI_ERS_RESULT_NO_AER_DRIVER
is when there actually is no handler, or when the device I/O state has
been set to failed. I notice the hotplug handler sets the device I/O
state to failed while processing a link down. If the device is actually
missing a handler definition, then a disconnect seems like the right
response.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 3/5] PCI/ERR: Retain status from error notification
  2021-03-05  0:23                 ` Dan Williams
@ 2021-03-05  0:54                   ` Keith Busch
  0 siblings, 0 replies; 36+ messages in thread
From: Keith Busch @ 2021-03-05  0:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kuppuswamy, Sathyanarayanan, linux-pci, helgaas, hinko.kocevar,
	Kelley, Sean V

On Thu, Mar 04, 2021 at 04:23:01PM -0800, Dan Williams wrote:
> On Thu, Mar 4, 2021 at 3:19 PM Kuppuswamy, Sathyanarayanan
> <sathyanarayanan.kuppuswamy@intel.com> wrote:
>
> > If the handler is not available due to AER being owned by firmware,
> > then it needs to be fixed. In EDR mode, even if DPC/AER is owned
> > by firmware , OS need to own the recovery part. So I think it
> > needs further investigation to understand why it reports,
> > PCI_ERS_RESULT_NO_AER_DRIVER
> 
> As far as I can see the only way to get PCI_ERS_RESULT_NO_AER_DRIVER
> is when there actually is no handler, or the device io state has set
> to failed. I notice the hotplug handler sets the device io state to
> failed while processing link down. If the device is actually missing a
> handler definition then disconnect seems the right response.

Hypothetically speaking, if a DPC event occurs at the Root Port above a
deep PCIe topology, it only takes one device with no handler to spoil
the recovery for the well-behaved drivers lower in the hierarchy. We can
probably handle that better if this scenario actually happens. Not sure
if it does happen, but the switchtec driver looks sus.
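
Roughly speaking, the per-device votes collected while walking the
hierarchy are merged along the lines of the sketch below (simplified and
only illustrative, not the exact drivers/pci/pcie/err.c logic), which is
why a single "no handler" vote can dominate the outcome for everything
below the port:

#include <linux/pci.h>

/*
 * Simplified illustration of vote merging: one NO_AER_DRIVER or
 * DISCONNECT vote overrides the recoverable results reported by the
 * other devices in the subtree.
 */
static pci_ers_result_t demo_merge_result(pci_ers_result_t orig,
					  pci_ers_result_t new_vote)
{
	if (new_vote == PCI_ERS_RESULT_NO_AER_DRIVER)
		return PCI_ERS_RESULT_NO_AER_DRIVER;
	if (new_vote == PCI_ERS_RESULT_DISCONNECT)
		return PCI_ERS_RESULT_DISCONNECT;
	if (new_vote == PCI_ERS_RESULT_NEED_RESET &&
	    orig != PCI_ERS_RESULT_DISCONNECT &&
	    orig != PCI_ERS_RESULT_NO_AER_DRIVER)
		return PCI_ERS_RESULT_NEED_RESET;
	return orig;
}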

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-03-05  0:54 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
2021-01-04 23:02 ` [PATCHv2 1/5] PCI/ERR: Clear status of the reporting device Keith Busch
2021-01-04 23:02 ` [PATCHv2 2/5] PCI/AER: Actually get the root port Keith Busch
2021-01-04 23:02 ` [PATCHv2 3/5] PCI/ERR: Retain status from error notification Keith Busch
2021-03-03  5:34   ` Williams, Dan J
2021-03-03  5:46     ` Kuppuswamy, Sathyanarayanan
2021-03-04 20:01       ` Keith Busch
2021-03-04 22:11         ` Dan Williams
     [not found]           ` <23551edc-965c-21dc-0da8-a492c27c362d@intel.com>
2021-03-04 22:59             ` Dan Williams
2021-03-04 23:19               ` Kuppuswamy, Sathyanarayanan
2021-03-05  0:23                 ` Dan Williams
2021-03-05  0:54                   ` Keith Busch
2021-01-04 23:02 ` [PATCHv2 4/5] PCI/AER: Specify the type of port that was reset Keith Busch
2021-01-04 23:03 ` [PATCHv2 5/5] PCI/portdrv: Report reset for frozen channel Keith Busch
2021-01-05 14:21 ` [PATCHv2 0/5] aer handling fixups Hinko Kocevar
2021-01-05 15:06   ` Hinko Kocevar
2021-01-05 18:33     ` Keith Busch
2021-01-05 23:07       ` Kelley, Sean V
2021-01-07 21:42         ` Keith Busch
2021-01-08  9:38           ` Hinko Kocevar
2021-01-11 13:39             ` Hinko Kocevar
2021-01-11 16:37               ` Keith Busch
2021-01-11 20:02                 ` Hinko Kocevar
2021-01-11 22:09                   ` Keith Busch
     [not found]                     ` <ed8256dd-d70d-b8dc-fdc0-a78b9aa3bbd9@ess.eu>
2021-01-12 19:27                       ` Keith Busch
2021-01-12 22:19                         ` Hinko Kocevar
2021-01-12 23:17                           ` Keith Busch
2021-01-18  8:00                             ` Hinko Kocevar
2021-01-19 18:28                               ` Keith Busch
2021-02-03  0:03 ` Keith Busch
2021-02-04  8:35   ` Hinko Kocevar
2021-02-08 12:55 ` Hedi Berriche
2021-02-09 23:06 ` Bjorn Helgaas
2021-02-10  4:05   ` Keith Busch
2021-02-10 21:38     ` Bjorn Helgaas
2021-02-10  9:36 ` Yicong Yang
