linux-pci.vger.kernel.org archive mirror
* [PATCH v4 0/8] Implement PCI Error Recovery on Navi12
@ 2020-09-02 18:42 Andrey Grodzovsky
  2020-09-02 18:42 ` [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-02 18:42 UTC (permalink / raw)
  To: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas, Andrey Grodzovsky

Many PCI bus controllers are able to detect a variety of hardware PCI errors
on the bus, such as parity errors on the data and address buses. A typical
action taken is to disconnect the affected device, halting all I/O to it.
Typically, a reconnection mechanism is also offered, so that the affected
PCI device(s) are reset and put back into working condition.
In our case the reconnection mechanism is facilitated by the kernel's
Downstream Port Containment (DPC) driver, which intercepts the PCIe error,
removes (isolates) the faulting device, and then calls into the PCIe
recovery code of the PCI core.
This code calls hooks which are implemented in this patchset: the error is
first reported, at which point we block the GPU scheduler; next, DPC resets
the PCI link, which generates a HW interrupt that is intercepted by the
SMU/PSP, which then executes a mode1 reset of the ASIC; next, the slot
reset hook is called, at which point we wait for the ASIC reset to
complete, restore the PCI config space, and run the HW suspend/resume
sequence to reinitialize the ASIC.
The last hook called is resume normal operation, at which point we restart
the GPU scheduler.

More info on PCIe error handling and DPC are here:
https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html
https://patchwork.kernel.org/patch/8945681/

v4: Rebase to the 5.9 kernel and revert a PCI error recovery core commit which breaks the feature.

Andrey Grodzovsky (8):
  drm/amdgpu: Avoid accessing HW when suspending SW state
  drm/amdgpu: Block all job scheduling activity during DPC recovery
  drm/amdgpu: Fix SMU error failure
  drm/amdgpu: Fix consecutive DPC recovery failures.
  drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
  drm/amdgpu: Disable DPC for XGMI for now.
  drm/amdgpu: Minor checkpatch fix
  Revert "PCI/ERR: Update error status after reset_link()"

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 247 +++++++++++++++++++++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |   6 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     |  18 ++-
 drivers/gpu/drm/amd/amdgpu/nv.c            |   4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c         |   4 +-
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |   3 +
 drivers/pci/pcie/err.c                     |   3 +-
 10 files changed, 222 insertions(+), 79 deletions(-)

-- 
2.7.4



* [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-09-02 18:42 [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
@ 2020-09-02 18:42 ` Andrey Grodzovsky
  2020-09-02 21:56   ` Bjorn Helgaas
  2020-09-03  1:32   ` Li, Dennis
  2020-09-02 18:42 ` [PATCH v4 2/8] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-02 18:42 UTC (permalink / raw)
  To: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas, Andrey Grodzovsky

At this point the ASIC has already been reset by the HW/PSP,
so the HW is not in a proper state to be configured for suspend;
some blocks might even be gated, so it is best to avoid touching the HW at all.

v2: Rename in_dpc to more meaningful name

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 ++++++++------
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
 6 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c311a3c..b20354f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,6 +992,7 @@ struct amdgpu_device {
 	atomic_t			throttling_logging_enabled;
 	struct ratelimit_state		throttling_logging_rs;
 	uint32_t			ras_features;
+	bool                            in_pci_err_recovery;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 74a1c03..1fbf8a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
 {
 	uint32_t ret;
 
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_rreg(adev, reg);
 
@@ -351,6 +354,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
  * Returns the 8 bit value from the offset specified.
  */
 uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (offset < adev->rmmio_size)
 		return (readb(adev->rmmio + offset));
 	BUG();
@@ -372,6 +378,9 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
  * Writes the value specified to the offset specified.
  */
 void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (offset < adev->rmmio_size)
 		writeb(value, adev->rmmio + offset);
 	else
@@ -382,6 +391,9 @@ static inline void amdgpu_mm_wreg_mmio(struct amdgpu_device *adev,
 				       uint32_t reg, uint32_t v,
 				       uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
 
 	if ((reg * 4) < adev->rmmio_size)
@@ -409,6 +421,9 @@ static inline void amdgpu_mm_wreg_mmio(struct amdgpu_device *adev,
 void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_wreg(adev, reg, v);
 
@@ -423,6 +438,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (amdgpu_sriov_fullaccess(adev) &&
 		adev->gfx.rlc.funcs &&
 		adev->gfx.rlc.funcs->is_rlcg_access_range) {
@@ -444,6 +462,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
  */
 u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
 {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if ((reg * 4) < adev->rio_mem_size)
 		return ioread32(adev->rio_mem + (reg * 4));
 	else {
@@ -463,6 +484,9 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
  */
 void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if ((reg * 4) < adev->rio_mem_size)
 		iowrite32(v, adev->rio_mem + (reg * 4));
 	else {
@@ -482,6 +506,9 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
  */
 u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
 {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (index < adev->doorbell.num_doorbells) {
 		return readl(adev->doorbell.ptr + index);
 	} else {
@@ -502,6 +529,9 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (index < adev->doorbell.num_doorbells) {
 		writel(v, adev->doorbell.ptr + index);
 	} else {
@@ -520,6 +550,9 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
  */
 u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
 {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (index < adev->doorbell.num_doorbells) {
 		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
 	} else {
@@ -540,6 +573,9 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (index < adev->doorbell.num_doorbells) {
 		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
 	} else {
@@ -4773,7 +4809,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 
 	pci_restore_state(pdev);
 
+	adev->in_pci_err_recovery = true;
 	r = amdgpu_device_ip_suspend(adev);
+	adev->in_pci_err_recovery = false;
 	if (r)
 		goto out;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index d698142..8c9bacf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -693,6 +693,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
 	struct amdgpu_ring *ring = &kiq->ring;
 
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	BUG_ON(!ring->funcs->emit_rreg);
 
 	spin_lock_irqsave(&kiq->ring_lock, flags);
@@ -757,6 +760,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
 
 	BUG_ON(!ring->funcs->emit_wreg);
 
+	if (adev->in_pci_err_recovery)
+		return;
+
 	spin_lock_irqsave(&kiq->ring_lock, flags);
 	amdgpu_ring_alloc(ring, 32);
 	amdgpu_ring_emit_wreg(ring, reg, v);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index d6c38e2..a7771aa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -219,6 +219,9 @@ int psp_wait_for(struct psp_context *psp, uint32_t reg_index,
 	int i;
 	struct amdgpu_device *adev = psp->adev;
 
+	if (psp->adev->in_pci_err_recovery)
+		return 0;
+
 	for (i = 0; i < adev->usec_timeout; i++) {
 		val = RREG32(reg_index);
 		if (check_changed) {
@@ -245,6 +248,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	bool ras_intr = false;
 	bool skip_unsupport = false;
 
+	if (psp->adev->in_pci_err_recovery)
+		return 0;
+
 	mutex_lock(&psp->mutex);
 
 	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 2db195e..ccf096c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -6980,15 +6980,19 @@ static int gfx_v10_0_hw_fini(void *handle)
 
 	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
 	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
+
+	if (!adev->in_pci_err_recovery) {
 #ifndef BRING_UP_DEBUG
-	if (amdgpu_async_gfx_ring) {
-		r = gfx_v10_0_kiq_disable_kgq(adev);
-		if (r)
-			DRM_ERROR("KGQ disable failed\n");
-	}
+		if (amdgpu_async_gfx_ring) {
+			r = gfx_v10_0_kiq_disable_kgq(adev);
+			if (r)
+				DRM_ERROR("KGQ disable failed\n");
+		}
 #endif
-	if (amdgpu_gfx_disable_kcq(adev))
-		DRM_ERROR("KCQ disable failed\n");
+		if (amdgpu_gfx_disable_kcq(adev))
+			DRM_ERROR("KCQ disable failed\n");
+	}
+
 	if (amdgpu_sriov_vf(adev)) {
 		gfx_v10_0_cp_gfx_enable(adev, false);
 		/* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index a58ea08..97aa72a 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -112,6 +112,9 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
 	struct amdgpu_device *adev = smu->adev;
 	int ret = 0, index = 0;
 
+	if (smu->adev->in_pci_err_recovery)
+		return 0;
+
 	index = smu_cmn_to_asic_specific_index(smu,
 					       CMN2ASIC_MAPPING_MSG,
 					       msg);
-- 
2.7.4



* [PATCH v4 2/8] drm/amdgpu: Block all job scheduling activity during DPC recovery
  2020-09-02 18:42 [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
  2020-09-02 18:42 ` [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
@ 2020-09-02 18:42 ` Andrey Grodzovsky
  2020-09-02 22:07   ` Bjorn Helgaas
  2020-09-02 18:42 ` [PATCH v4 3/8] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-02 18:42 UTC (permalink / raw)
  To: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas, Andrey Grodzovsky

DPC recovery involves an ASIC reset, just as normal GPU recovery does, so
block the SW GPU schedulers and wait for all concurrent GPU resets to finish.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1fbf8a1..e999f1f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4740,6 +4740,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
 	return 0;
 }
 
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+	int i;
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		cancel_delayed_work_sync(&ring->sched.work_tdr);
+	}
+}
+
 /**
  * amdgpu_pci_error_detected - Called when a PCI error is detected.
  * @pdev: PCI device struct
@@ -4753,15 +4767,37 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
+	int i;
 
 	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
 
 	switch (state) {
 	case pci_channel_io_normal:
 		return PCI_ERS_RESULT_CAN_RECOVER;
-	case pci_channel_io_frozen:
-		/* Fatal error, prepare for slot reset */
-		amdgpu_device_lock_adev(adev);
+	/* Fatal error, prepare for slot reset */
+	case pci_channel_io_frozen:
+		/*
+		 * Cancel and wait for all TDRs in progress if failing to
+		 * set adev->in_gpu_reset in amdgpu_device_lock_adev
+		 *
+		 * Locking adev->reset_sem will prevent any external access
+		 * to GPU during PCI error recovery
+		 */
+		while (!amdgpu_device_lock_adev(adev, NULL))
+			amdgpu_cancel_all_tdr(adev);
+
+		/*
+		 * Block any work scheduling as we do for regular GPU reset
+		 * for the duration of the recovery
+		 */
+		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+			struct amdgpu_ring *ring = adev->rings[i];
+
+			if (!ring || !ring->sched.thread)
+				continue;
+
+			drm_sched_stop(&ring->sched, NULL);
+		}
 		return PCI_ERS_RESULT_NEED_RESET;
 	case pci_channel_io_perm_failure:
 		/* Permanent error, prepare for device removal */
@@ -4894,8 +4930,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
+	int i;
 
-	amdgpu_device_unlock_adev(adev);
 
 	DRM_INFO("PCI error: resume callback!!\n");
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+
+		drm_sched_resubmit_jobs(&ring->sched);
+		drm_sched_start(&ring->sched, true);
+	}
+
+	amdgpu_device_unlock_adev(adev);
 }
-- 
2.7.4



* [PATCH v4 3/8] drm/amdgpu: Fix SMU error failure
  2020-09-02 18:42 [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
  2020-09-02 18:42 ` [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
  2020-09-02 18:42 ` [PATCH v4 2/8] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
@ 2020-09-02 18:42 ` Andrey Grodzovsky
  2020-09-02 22:05   ` Bjorn Helgaas
  2020-09-02 18:42 ` [PATCH v4 4/8] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-02 18:42 UTC (permalink / raw)
  To: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas, Andrey Grodzovsky

Wait for the HW/PSP-initiated ASIC reset to complete before
starting the recovery operations.

v2: Remove typo

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e999f1f..412d07e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4838,14 +4838,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
-	int r;
+	int r, i;
 	bool vram_lost;
+	u32 memsize;
 
 	DRM_INFO("PCI error: slot reset callback!!\n");
 
+	/* wait for asic to come out of reset */
+	msleep(500);
+
 	pci_restore_state(pdev);
 
-	adev->in_pci_err_recovery = true;
+	/* confirm ASIC came out of reset */
+	for (i = 0; i < adev->usec_timeout; i++) {
+		memsize = amdgpu_asic_get_config_memsize(adev);
+
+		if (memsize != 0xffffffff)
+			break;
+		udelay(1);
+	}
+	if (memsize == 0xffffffff) {
+		r = -ETIME;
+		goto out;
+	}
+
+	/* TODO Call amdgpu_pre_asic_reset instead */
+	adev->in_pci_err_recovery = true;
 	r = amdgpu_device_ip_suspend(adev);
 	adev->in_pci_err_recovery = false;
 	if (r)
-- 
2.7.4



* [PATCH v4 4/8] drm/amdgpu: Fix consecutive DPC recovery failures.
  2020-09-02 18:42 [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (2 preceding siblings ...)
  2020-09-02 18:42 ` [PATCH v4 3/8] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
@ 2020-09-02 18:42 ` Andrey Grodzovsky
  2020-09-02 22:23   ` Bjorn Helgaas
  2020-09-02 18:42 ` [PATCH v4 5/8] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-02 18:42 UTC (permalink / raw)
  To: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas, Andrey Grodzovsky

Cache the PCI state on boot and before each case where we might
lose it.

v2: Add pci_restore_state while caching the PCI state to avoid
breaking PCI core logic for stuff like suspend/resume.

v3: Extract pci_restore_state from amdgpu_device_cache_pci_state
to avoid superfluous restores during GPU resets and suspend/resumes.

v4: Style fixes.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  5 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 ++++++++++++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
 drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
 5 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index b20354f..13f92de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,7 +992,9 @@ struct amdgpu_device {
 	atomic_t			throttling_logging_enabled;
 	struct ratelimit_state		throttling_logging_rs;
 	uint32_t			ras_features;
+
 	bool                            in_pci_err_recovery;
+	struct pci_saved_state          *pci_state;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
@@ -1272,6 +1274,9 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
 pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
 void amdgpu_pci_resume(struct pci_dev *pdev);
 
+bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
+bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
+
 #include "amdgpu_object.h"
 
 /* used by df_v3_6.c and amdgpu_pmu.c */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 412d07e..174e09b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1283,7 +1283,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
 		dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 
 		pci_set_power_state(dev->pdev, PCI_D0);
-		pci_restore_state(dev->pdev);
+		amdgpu_device_load_pci_state(dev->pdev);
 		r = pci_enable_device(dev->pdev);
 		if (r)
 			DRM_WARN("pci_enable_device failed (%d)\n", r);
@@ -1296,7 +1296,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
 		drm_kms_helper_poll_disable(dev);
 		dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 		amdgpu_device_suspend(dev, true);
-		pci_save_state(dev->pdev);
+		amdgpu_device_cache_pci_state(dev->pdev);
 		/* Shut down the device */
 		pci_disable_device(dev->pdev);
 		pci_set_power_state(dev->pdev, PCI_D3cold);
@@ -3399,6 +3399,10 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		dev_err(adev->dev, "amdgpu_pmu_init failed\n");
 
+	/* Have stored pci confspace at hand for restore in sudden PCI error */
+	if (amdgpu_device_cache_pci_state(adev->pdev))
+		pci_restore_state(pdev);
+
 	return 0;
 
 failed:
@@ -3423,6 +3427,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 	flush_delayed_work(&adev->delayed_init_work);
 	adev->shutdown = true;
 
+	kfree(adev->pci_state);
+
 	/* make sure IB test finished before entering exclusive mode
 	 * to avoid preemption on IB test
 	 * */
@@ -4847,7 +4853,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	/* wait for asic to come out of reset */
 	msleep(500);
 
-	pci_restore_state(pdev);
+	amdgpu_device_load_pci_state(pdev);
 
 	/* confirm ASIC came out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
@@ -4927,6 +4933,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 out:
 
 	if (!r) {
+		if (amdgpu_device_cache_pci_state(adev->pdev))
+			pci_restore_state(adev->pdev);
+
 		DRM_INFO("PCIe error recovery succeeded\n");
 	} else {
 		DRM_ERROR("PCIe error recovery failed, err:%d", r);
@@ -4966,3 +4975,50 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 
 	amdgpu_device_unlock_adev(adev);
 }
+
+bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+
+	r = pci_save_state(pdev);
+	if (!r) {
+		kfree(adev->pci_state);
+
+		adev->pci_state = pci_store_saved_state(pdev);
+
+		if (!adev->pci_state) {
+			DRM_ERROR("Failed to store PCI saved state");
+			return false;
+		}
+	} else {
+		DRM_WARN("Failed to save PCI state, err:%d\n", r);
+		return false;
+	}
+
+	return true;
+}
+
+bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+
+	if (!adev->pci_state)
+		return false;
+
+	r = pci_load_saved_state(pdev, adev->pci_state);
+
+	if (!r) {
+		pci_restore_state(pdev);
+	} else {
+		DRM_WARN("Failed to load PCI state, err:%d\n", r);
+		return false;
+	}
+
+	return true;
+}
+
+
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index af99a5d..b653c72 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1322,7 +1322,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
 		if (amdgpu_is_atpx_hybrid()) {
 			pci_ignore_hotplug(pdev);
 		} else {
-			pci_save_state(pdev);
+			amdgpu_device_cache_pci_state(pdev);
 			pci_disable_device(pdev);
 			pci_ignore_hotplug(pdev);
 			pci_set_power_state(pdev, PCI_D3cold);
@@ -1355,7 +1355,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
 			pci_set_master(pdev);
 		} else {
 			pci_set_power_state(pdev, PCI_D0);
-			pci_restore_state(pdev);
+			amdgpu_device_load_pci_state(pdev);
 			ret = pci_enable_device(pdev);
 			if (ret)
 				return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 4d14023..0ec6603 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
 	/* disable BM */
 	pci_clear_master(adev->pdev);
 
-	pci_save_state(adev->pdev);
+	amdgpu_device_cache_pci_state(adev->pdev);
 
 	if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
 		dev_info(adev->dev, "GPU smu mode1 reset\n");
@@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
 
 	if (ret)
 		dev_err(adev->dev, "GPU mode1 reset failed\n");
-	pci_restore_state(adev->pdev);
+	amdgpu_device_load_pci_state(adev->pdev);
 
 	/* wait for asic to come out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 2f93c47..ddd55e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
 	/* disable BM */
 	pci_clear_master(adev->pdev);
 
-	pci_save_state(adev->pdev);
+	amdgpu_device_cache_pci_state(adev->pdev);
 
 	ret = psp_gpu_reset(adev);
 	if (ret)
 		dev_err(adev->dev, "GPU mode1 reset failed\n");
 
-	pci_restore_state(adev->pdev);
+	amdgpu_device_load_pci_state(adev->pdev);
 
 	/* wait for asic to come out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
-- 
2.7.4



* [PATCH v4 5/8] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
  2020-09-02 18:42 [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (3 preceding siblings ...)
  2020-09-02 18:42 ` [PATCH v4 4/8] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
@ 2020-09-02 18:42 ` Andrey Grodzovsky
  2020-09-02 18:42 ` [PATCH v4 6/8] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-02 18:42 UTC (permalink / raw)
  To: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas, Andrey Grodzovsky

Reuse existing functions from GPU recovery to avoid code
duplication.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 73 +++++-------------------------
 1 file changed, 12 insertions(+), 61 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 174e09b..c477cfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4111,7 +4111,8 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 
 static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 			       struct list_head *device_list_handle,
-			       bool *need_full_reset_arg)
+			       bool *need_full_reset_arg,
+			       bool skip_hw_reset)
 {
 	struct amdgpu_device *tmp_adev = NULL;
 	bool need_full_reset = *need_full_reset_arg, vram_lost = false;
@@ -4121,7 +4122,7 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 	 * ASIC reset has to be done on all HGMI hive nodes ASAP
 	 * to allow proper links negotiation in FW (within 1 sec)
 	 */
-	if (need_full_reset) {
+	if (!skip_hw_reset && need_full_reset) {
 		list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
 			/* For XGMI run all resets in parallel to speed up the process */
 			if (tmp_adev->gmc.xgmi.num_physical_nodes > 1) {
@@ -4517,7 +4518,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		if (r)
 			adev->asic_reset_res = r;
 	} else {
-		r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset);
+		r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset, false);
 		if (r && r == -EAGAIN)
 			goto retry;
 	}
@@ -4845,14 +4846,19 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
 	int r, i;
-	bool vram_lost;
+	bool need_full_reset = true;
 	u32 memsize;
+	struct list_head device_list;
 
 	DRM_INFO("PCI error: slot reset callback!!\n");
 
+	INIT_LIST_HEAD(&device_list);
+	list_add_tail(&adev->gmc.xgmi.head, &device_list);
+
 	/* wait for asic to come out of reset */
 	msleep(500);
 
+	/* Restore PCI confspace */
 	amdgpu_device_load_pci_state(pdev);
 
 	/* confirm  ASIC came out of reset */
@@ -4868,70 +4874,15 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 		goto out;
 	}
 
-	/* TODO Call amdgpu_pre_asic_reset instead */
 	adev->in_pci_err_recovery = true;
-	r = amdgpu_device_ip_suspend(adev);
+	r = amdgpu_device_pre_asic_reset(adev, NULL, &need_full_reset);
 	adev->in_pci_err_recovery = false;
 	if (r)
 		goto out;
 
-
-	/* post card */
-	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_ip_resume_phase1(adev);
-	if (r)
-		goto out;
-
-	vram_lost = amdgpu_device_check_vram_lost(adev);
-	if (vram_lost) {
-		DRM_INFO("VRAM is lost due to GPU reset!\n");
-		amdgpu_inc_vram_lost(adev);
-	}
-
-	r = amdgpu_gtt_mgr_recover(
-		&adev->mman.bdev.man[TTM_PL_TT]);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_fw_loading(adev);
-	if (r)
-		return r;
-
-	r = amdgpu_device_ip_resume_phase2(adev);
-	if (r)
-		goto out;
-
-	if (vram_lost)
-		amdgpu_device_fill_reset_magic(adev);
-
-	/*
-	 * Add this ASIC as tracked as reset was already
-	 * complete successfully.
-	 */
-	amdgpu_register_gpu_instance(adev);
-
-	r = amdgpu_device_ip_late_init(adev);
-	if (r)
-		goto out;
-
-	amdgpu_fbdev_set_suspend(adev, 0);
-
-	/* must succeed. */
-	amdgpu_ras_resume(adev);
-
-
-	amdgpu_irq_gpu_reset_resume_helper(adev);
-	r = amdgpu_ib_ring_tests(adev);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_recover_vram(adev);
+	r = amdgpu_do_asic_reset(NULL, &device_list, &need_full_reset, true);
 
 out:
-
 	if (!r) {
 		if (amdgpu_device_cache_pci_state(adev->pdev))
 			pci_restore_state(adev->pdev);
-- 
2.7.4



* [PATCH v4 6/8] drm/amdgpu: Disable DPC for XGMI for now.
  2020-09-02 18:42 [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (4 preceding siblings ...)
  2020-09-02 18:42 ` [PATCH v4 5/8] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
@ 2020-09-02 18:42 ` Andrey Grodzovsky
  2020-09-02 18:42 ` [PATCH v4 7/8] drm/amdgpu: Minor checkpatch fix Andrey Grodzovsky
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-02 18:42 UTC (permalink / raw)
  To: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas, Andrey Grodzovsky

XGMI support is more complicated than single-device support because
synchronization is required between the device recovering from the
PCI error and the other members of the hive.
Leaving this for the next round.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c477cfd..4d4fc67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4778,6 +4778,11 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 
 	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
 
+	if (adev->gmc.xgmi.num_physical_nodes > 1) {
+		DRM_WARN("No support for XGMI hive yet...");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
 	switch (state) {
 	case pci_channel_io_normal:
 		return PCI_ERS_RESULT_CAN_RECOVER;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 7/8] drm/amdgpu: Minor checkpatch fix
  2020-09-02 18:42 [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (5 preceding siblings ...)
  2020-09-02 18:42 ` [PATCH v4 6/8] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
@ 2020-09-02 18:42 ` Andrey Grodzovsky
  2020-09-02 18:42 ` [PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()" Andrey Grodzovsky
  2020-09-02 21:36 ` [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Bjorn Helgaas
  8 siblings, 0 replies; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-02 18:42 UTC (permalink / raw)
  To: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas, Andrey Grodzovsky

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4d4fc67..3748bef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -353,7 +353,8 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
  *
  * Returns the 8 bit value from the offset specified.
  */
-uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
+{
 	if (adev->in_pci_err_recovery)
 		return 0;
 
@@ -377,7 +378,8 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
  *
  * Writes the value specified to the offset specified.
  */
-void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
+void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
+{
 	if (adev->in_pci_err_recovery)
 		return;
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()"
  2020-09-02 18:42 [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (6 preceding siblings ...)
  2020-09-02 18:42 ` [PATCH v4 7/8] drm/amdgpu: Minor checkpatch fix Andrey Grodzovsky
@ 2020-09-02 18:42 ` Andrey Grodzovsky
  2020-09-02 19:00   ` Kuppuswamy, Sathyanarayanan
  2020-09-02 21:36 ` [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Bjorn Helgaas
  8 siblings, 1 reply; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-02 18:42 UTC (permalink / raw)
  To: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas, Andrey Grodzovsky

This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d.

In the code below

                pci_walk_bus(bus, report_frozen_detected, &status);
-               if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
+               status = reset_link(dev, service);

the status returned from report_frozen_detected is unconditionally masked
by the status returned from reset_link, which is wrong.

This breaks the error recovery implementation for the AMDGPU driver
by masking the PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected
and hence skipping the slot reset callback, which is necessary for proper
ASIC recovery. Effectively, no callback besides the resume callback will
be called after link reset the way it is implemented now, regardless of what
value the error_detected callback returns.

In general, step 6.1.4, which describes link reset, is not as well defined
as the other steps in terms of the expected return values and the
appropriate next steps.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/pci/pcie/err.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f41..81dd719 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,8 +165,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	pci_dbg(dev, "broadcast error_detected message\n");
 	if (state == pci_channel_io_frozen) {
 		pci_walk_bus(bus, report_frozen_detected, &status);
-		status = reset_link(dev);
-		if (status != PCI_ERS_RESULT_RECOVERED) {
+		if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
 			pci_warn(dev, "link reset failed\n");
 			goto failed;
 		}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()"
  2020-09-02 18:42 ` [PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()" Andrey Grodzovsky
@ 2020-09-02 19:00   ` Kuppuswamy, Sathyanarayanan
  2020-09-02 19:54     ` Andrey Grodzovsky
  0 siblings, 1 reply; 20+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2020-09-02 19:00 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas



On 9/2/20 11:42 AM, Andrey Grodzovsky wrote:
> This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d.
> 
> In the code bellow
> 
>                  pci_walk_bus(bus, report_frozen_detected, &status);
> -               if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
> +               status = reset_link(dev, service);
> 
> status returned from report_frozen_detected is unconditionally masked
> by status returned from reset_link which is wrong.
> 
> This breaks error recovery implementation for AMDGPU driver
> by masking PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected
> and hence skiping slot reset callback which is necessary for proper
> ASIC recovery. Effectively no other callback besides resume callback will
> be called after link reset the way it is implemented now regardless of what
> value error_detected callback returns.
> 
	}

Instead of reverting this change, can you try the following patch?
https://lore.kernel.org/linux-pci/56ad4901-725f-7b88-2117-b124b28b027f@linux.intel.com/T/#me8029c04f63c21f9d1cb3b1ba2aeffbca3a60df5

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()"
  2020-09-02 19:00   ` Kuppuswamy, Sathyanarayanan
@ 2020-09-02 19:54     ` Andrey Grodzovsky
  2020-09-02 20:27       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-02 19:54 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, amd-gfx, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas

Yes, works also.

Can you provide me a formal patch that I can commit into our local amd staging
tree with my patch set?

Alex - is that how we want to do it? Without this patch or reverting the
original patch, the feature is broken.

Andrey

On 9/2/20 3:00 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 9/2/20 11:42 AM, Andrey Grodzovsky wrote:
>> This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d.
>>
>> In the code bellow
>>
>>                  pci_walk_bus(bus, report_frozen_detected, &status);
>> -               if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
>> +               status = reset_link(dev, service);
>>
>> status returned from report_frozen_detected is unconditionally masked
>> by status returned from reset_link which is wrong.
>>
>> This breaks error recovery implementation for AMDGPU driver
>> by masking PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected
>> and hence skiping slot reset callback which is necessary for proper
>> ASIC recovery. Effectively no other callback besides resume callback will
>> be called after link reset the way it is implemented now regardless of what
>> value error_detected callback returns.
>>
>     }
>
> Instead of reverting this change, can you try following patch ?
> https://lore.kernel.org/linux-pci/56ad4901-725f-7b88-2117-b124b28b027f@linux.intel.com/T/#me8029c04f63c21f9d1cb3b1ba2aeffbca3a60df5
>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()"
  2020-09-02 19:54     ` Andrey Grodzovsky
@ 2020-09-02 20:27       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 20+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2020-09-02 20:27 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx, linux-pci
  Cc: alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas



On 9/2/20 12:54 PM, Andrey Grodzovsky wrote:
> Yes, works also.
> 
> Can you provide me a formal patch that i can commit into our local amd staging tree with my patch set ?
https://patchwork.kernel.org/patch/11684175/mbox/
> 
> Alex - is that how we want to do it, without this patch or reverting the original patch the feature 
> is broken.
> 
> Andrey
> 
> On 9/2/20 3:00 PM, Kuppuswamy, Sathyanarayanan wrote:
>>
>>
>> On 9/2/20 11:42 AM, Andrey Grodzovsky wrote:
>>> This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d.
>>>
>>> In the code bellow
>>>
>>>                  pci_walk_bus(bus, report_frozen_detected, &status);
>>> -               if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
>>> +               status = reset_link(dev, service);
>>>
>>> status returned from report_frozen_detected is unconditionally masked
>>> by status returned from reset_link which is wrong.
>>>
>>> This breaks error recovery implementation for AMDGPU driver
>>> by masking PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected
>>> and hence skiping slot reset callback which is necessary for proper
>>> ASIC recovery. Effectively no other callback besides resume callback will
>>> be called after link reset the way it is implemented now regardless of what
>>> value error_detected callback returns.
>>>
>>     }
>>
>> Instead of reverting this change, can you try following patch ?
>> https://lore.kernel.org/linux-pci/56ad4901-725f-7b88-2117-b124b28b027f@linux.intel.com/T/#me8029c04f63c21f9d1cb3b1ba2aeffbca3a60df5
>>
>>

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/8] Implement PCI Error Recovery on Navi12
  2020-09-02 18:42 [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (7 preceding siblings ...)
  2020-09-02 18:42 ` [PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()" Andrey Grodzovsky
@ 2020-09-02 21:36 ` Bjorn Helgaas
  8 siblings, 0 replies; 20+ messages in thread
From: Bjorn Helgaas @ 2020-09-02 21:36 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci,
	alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas

On Wed, Sep 02, 2020 at 02:42:02PM -0400, Andrey Grodzovsky wrote:
> Many PCI bus controllers are able to detect a variety of hardware PCI errors on the bus, 
> such as parity errors on the data and address buses,  A typical action taken is to disconnect 
> the affected device, halting all I/O to it. Typically, a reconnection mechanism is also offered, 
> so that the affected PCI device(s) are reset and put back into working condition. 
> In our case the reconnection mechanism is facilitated by kernel Downstream Port Containment (DPC) 
> driver which will intercept the PCIe error, remove (isolate) the faulting device after which it 
> will call into PCIe recovery code of the PCI core. 
> This code will call hooks which are implemented in this patchset where the error is 
> first reported at which point we block the GPU scheduler, next DPC resets the 
> PCI link which generates HW interrupt which is intercepted by SMU/PSP who 
> start executing mode1 reset of the ASIC, next step is slot reset hook is called 
> at which point we wait for ASIC reset to complete, restore PCI config space and run 
> HW suspend/resume sequence to resinit the ASIC. 
> Last hook called is resume normal operation at which point we will restart the GPU scheduler.
> 
> More info on PCIe error handling and DPC are here:
> https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html
> https://patchwork.kernel.org/patch/8945681/
> 
> v4:Rebase to 5.9 kernel and revert PCI error recovery core commit which breaks the feature.

What does this apply to?  I tried 

  - v5.9-rc1 (9123e3a74ec7 ("Linux 5.9-rc1")),
  - v5.9-rc2 (d012a7190fc1 ("Linux 5.9-rc2")),
  - v5.9-rc3 (f75aef392f86 ("Linux 5.9-rc3")),
  - drm-next (3393649977f9 ("Merge tag 'drm-intel-next-2020-08-24-1' of git://anongit.freedesktop.org/drm/drm-intel into drm-next")),
  - linux-next (4442749a2031 ("Add linux-next specific files for 20200902"))

but it doesn't apply cleanly to any.

> Andrey Grodzovsky (8):
>   drm/amdgpu: Avoid accessing HW when suspending SW state
>   drm/amdgpu: Block all job scheduling activity during DPC recovery
>   drm/amdgpu: Fix SMU error failure
>   drm/amdgpu: Fix consecutive DPC recovery failures.
>   drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
>   drm/amdgpu: Disable DPC for XGMI for now.
>   drm/amdgpu: Minor checkpatch fix
>   Revert "PCI/ERR: Update error status after reset_link()"
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 247 +++++++++++++++++++++--------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |   6 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     |  18 ++-
>  drivers/gpu/drm/amd/amdgpu/nv.c            |   4 +-
>  drivers/gpu/drm/amd/amdgpu/soc15.c         |   4 +-
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |   3 +
>  drivers/pci/pcie/err.c                     |   3 +-
>  10 files changed, 222 insertions(+), 79 deletions(-)
> 
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-09-02 18:42 ` [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
@ 2020-09-02 21:56   ` Bjorn Helgaas
  2020-09-03  1:32   ` Li, Dennis
  1 sibling, 0 replies; 20+ messages in thread
From: Bjorn Helgaas @ 2020-09-02 21:56 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci,
	alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas

On Wed, Sep 02, 2020 at 02:42:03PM -0400, Andrey Grodzovsky wrote:
> At this point the ASIC is already post reset by the HW/PSP
> so the HW not in proper state to be configured for suspension,
> some blocks might be even gated and so best is to avoid touching it.
> 
> v2: Rename in_dpc to more meaningful name
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 ++++++++------
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
>  6 files changed, 65 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index c311a3c..b20354f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,6 +992,7 @@ struct amdgpu_device {
>  	atomic_t			throttling_logging_enabled;
>  	struct ratelimit_state		throttling_logging_rs;
>  	uint32_t			ras_features;
> +	bool                            in_pci_err_recovery;
>  };
>  
>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 74a1c03..1fbf8a1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>  {
>  	uint32_t ret;
>  
> +	if (adev->in_pci_err_recovery)
> +		return 0;

I don't know the whole scheme of this, but this looks racy.

It looks like the normal path through this function is the readl(),
which I assume is an MMIO read from the PCI device.  If this is called
after a PCI error occurs, but before amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, the readl() will return 0xffffffff.

If this is called after amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, it will return 0.  Do you really want those
two different cases?

>  	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>  		return amdgpu_kiq_rreg(adev, reg);

> @@ -4773,7 +4809,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>  
>  	pci_restore_state(pdev);
>  
> +	adev->in_pci_err_recovery = true;
>  	r = amdgpu_device_ip_suspend(adev);
> +	adev->in_pci_err_recovery = false;
>  	if (r)
>  		goto out;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 3/8] drm/amdgpu: Fix SMU error failure
  2020-09-02 18:42 ` [PATCH v4 3/8] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
@ 2020-09-02 22:05   ` Bjorn Helgaas
  2020-09-03 15:29     ` Andrey Grodzovsky
  0 siblings, 1 reply; 20+ messages in thread
From: Bjorn Helgaas @ 2020-09-02 22:05 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci,
	alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas

On Wed, Sep 02, 2020 at 02:42:05PM -0400, Andrey Grodzovsky wrote:
> Wait for HW/PSP initiated ASIC reset to complete before
> starting the recovery operations.
> 
> v2: Remove typo
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e999f1f..412d07e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4838,14 +4838,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>  {
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  	struct amdgpu_device *adev = drm_to_adev(dev);
> -	int r;
> +	int r, i;
>  	bool vram_lost;
> +	u32 memsize;
>  
>  	DRM_INFO("PCI error: slot reset callback!!\n");
>  
> +	/* wait for asic to come out of reset */

I know it's totally OCD, but it is a minor speed bump to read "asic"
here and "ASIC" in the commit log above and the new comment below.

> +	msleep(500);
> +
>  	pci_restore_state(pdev);
>  
> -	adev->in_pci_err_recovery = true;
> +	/* confirm  ASIC came out of reset */
> +	for (i = 0; i < adev->usec_timeout; i++) {
> +		memsize = amdgpu_asic_get_config_memsize(adev);
> +
> +		if (memsize != 0xffffffff)

I guess this is a spot where you actually depend on an MMIO read
returning 0xffffffff because adev->in_pci_err_recovery is false at
this point, so amdgpu_mm_rreg() will actually *do* the MMIO read
instead of returning 0.  Right?

> +			break;
> +		udelay(1);
> +	}
> +	if (memsize == 0xffffffff) {
> +		r = -ETIME;
> +		goto out;
> +	}
> +
> +	/* TODO Call amdgpu_pre_asic_reset instead */
> +	adev->in_pci_err_recovery = true;	
>  	r = amdgpu_device_ip_suspend(adev);
>  	adev->in_pci_err_recovery = false;
>  	if (r)
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/8] drm/amdgpu: Block all job scheduling activity during DPC recovery
  2020-09-02 18:42 ` [PATCH v4 2/8] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
@ 2020-09-02 22:07   ` Bjorn Helgaas
  0 siblings, 0 replies; 20+ messages in thread
From: Bjorn Helgaas @ 2020-09-02 22:07 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci,
	alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas

On Wed, Sep 02, 2020 at 02:42:04PM -0400, Andrey Grodzovsky wrote:
> DPC recovery involves ASIC reset just as normal GPU recovery so block
> SW GPU schedulers and wait on all concurrent GPU resets.

> +		 * Cancel and wait for all TDRs in progress if failing to
> +		 * set  adev->in_gpu_reset in amdgpu_device_lock_adev

OCD typo, s/set  adev/set adev/ (two spaces)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 4/8] drm/amdgpu: Fix consecutive DPC recovery failures.
  2020-09-02 18:42 ` [PATCH v4 4/8] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
@ 2020-09-02 22:23   ` Bjorn Helgaas
  2020-09-03 15:45     ` Andrey Grodzovsky
  0 siblings, 1 reply; 20+ messages in thread
From: Bjorn Helgaas @ 2020-09-02 22:23 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci,
	alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas

On Wed, Sep 02, 2020 at 02:42:06PM -0400, Andrey Grodzovsky wrote:
> Cache the PCI state on boot and before each case where we might
> loose it.

s/loose/lose/

> v2: Add pci_restore_state while caching the PCI state to avoid
> breaking PCI core logic for stuff like suspend/resume.
> 
> v3: Extract pci_restore_state from amdgpu_device_cache_pci_state
> to avoid superflous restores during GPU resets and suspend/resumes.
> 
> v4: Style fixes.

Is the DRM convention to keep the v2/v3/v4 stuff in the commit log?  I
keep those below the "---" or manually remove them for PCI, but use
the local convention, of course.

> +	/* Have stored pci confspace at hand for restore in sudden PCI error */

I assume that at least from the perspective of this code, all PCI
errors are "sudden".  Or if they're not, I'm curious about which would
be sudden and which would not.

> +	if (amdgpu_device_cache_pci_state(adev->pdev))
> +		pci_restore_state(pdev);

> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int r;
> +
> +	r = pci_save_state(pdev);
> +	if (!r) {
> +		kfree(adev->pci_state);
> +
> +		adev->pci_state = pci_store_saved_state(pdev);
> +
> +		if (!adev->pci_state) {
> +			DRM_ERROR("Failed to store PCI saved state");
> +			return false;
> +		}
> +	} else {
> +		DRM_WARN("Failed to save PCI state, err:%d\n", r);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int r;
> +
> +	if (!adev->pci_state)
> +		return false;
> +
> +	r = pci_load_saved_state(pdev, adev->pci_state);

I'm a little bit hesitant to pci_load_saved_state() and
pci_store_saved_state() being used here, simply because they're
currently only used by VFIO, Xen, and nvme.  So I don't have a real
objection, but just pointing out that apparently you're doing
something really special that isn't commonly used and tested, so it's
more likely to be broken or incomplete.

There's lots of state that the PCI core *can't* save/restore, and
pci_load_saved_state() doesn't even handle all the architected PCI
state, i.e., we only call pci_add_cap_save_buffer() or
pci_add_ext_cap_save_buffer() for a few of the capabilities.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-09-02 18:42 ` [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
  2020-09-02 21:56   ` Bjorn Helgaas
@ 2020-09-03  1:32   ` Li, Dennis
  1 sibling, 0 replies; 20+ messages in thread
From: Li, Dennis @ 2020-09-03  1:32 UTC (permalink / raw)
  To: Grodzovsky, Andrey, amd-gfx, sathyanarayanan.kuppuswamy, linux-pci
  Cc: Deucher, Alexander, Das, Nirmoy, Koenig, Christian, Tuikov,
	Luben, bhelgaas, Grodzovsky, Andrey

[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey

Did you want to use adev->in_pci_err_recovery to prevent the hardware from being accessed by other threads while doing PCI recovery? If so, it is better to use a lock to protect these accesses. This patch can't solve your issue completely. 

Best Regards
Dennis Li
-----Original Message-----
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com> 
Sent: Thursday, September 3, 2020 2:42 AM
To: amd-gfx@lists.freedesktop.org; sathyanarayanan.kuppuswamy@linux.intel.com; linux-pci@vger.kernel.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Das, Nirmoy <Nirmoy.Das@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; bhelgaas@google.com; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Subject: [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state

At this point the ASIC is already post reset by the HW/PSP so the HW not in proper state to be configured for suspension, some blocks might be even gated and so best is to avoid touching it.

v2: Rename in_dpc to more meaningful name

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 ++++++++------
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
 6 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c311a3c..b20354f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,6 +992,7 @@ struct amdgpu_device {
 	atomic_t			throttling_logging_enabled;
 	struct ratelimit_state		throttling_logging_rs;
 	uint32_t			ras_features;
+	bool                            in_pci_err_recovery;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 74a1c03..1fbf8a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,  {
 	uint32_t ret;
 
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_rreg(adev, reg);
 
@@ -351,6 +354,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
  * Returns the 8 bit value from the offset specified.
  */
 uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (offset < adev->rmmio_size)
 		return (readb(adev->rmmio + offset));
 	BUG();
@@ -372,6 +378,9 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
  * Writes the value specified to the offset specified.
  */
 void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (offset < adev->rmmio_size)
 		writeb(value, adev->rmmio + offset);
 	else
@@ -382,6 +391,9 @@ static inline void amdgpu_mm_wreg_mmio(struct amdgpu_device *adev,
 				       uint32_t reg, uint32_t v,
 				       uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
 
 	if ((reg * 4) < adev->rmmio_size)
@@ -409,6 +421,9 @@ static inline void amdgpu_mm_wreg_mmio(struct amdgpu_device *adev,  void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_wreg(adev, reg, v);
 
@@ -423,6 +438,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,  void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (amdgpu_sriov_fullaccess(adev) &&
 		adev->gfx.rlc.funcs &&
 		adev->gfx.rlc.funcs->is_rlcg_access_range) { @@ -444,6 +462,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
  */
 u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)  {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if ((reg * 4) < adev->rio_mem_size)
 		return ioread32(adev->rio_mem + (reg * 4));
 	else {
@@ -463,6 +484,9 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
  */
 void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)  {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if ((reg * 4) < adev->rio_mem_size)
 		iowrite32(v, adev->rio_mem + (reg * 4));
 	else {
@@ -482,6 +506,9 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
  */
 u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)  {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (index < adev->doorbell.num_doorbells) {
 		return readl(adev->doorbell.ptr + index);
 	} else {
@@ -502,6 +529,9 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)  {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (index < adev->doorbell.num_doorbells) {
 		writel(v, adev->doorbell.ptr + index);
 	} else {
@@ -520,6 +550,9 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
  */
 u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
 {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (index < adev->doorbell.num_doorbells) {
 		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
 	} else {
@@ -540,6 +573,9 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (index < adev->doorbell.num_doorbells) {
 		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
 	} else {
@@ -4773,7 +4809,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 
 	pci_restore_state(pdev);
 
+	adev->in_pci_err_recovery = true;
 	r = amdgpu_device_ip_suspend(adev);
+	adev->in_pci_err_recovery = false;
 	if (r)
 		goto out;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index d698142..8c9bacf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -693,6 +693,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
 	struct amdgpu_ring *ring = &kiq->ring;
 
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	BUG_ON(!ring->funcs->emit_rreg);
 
 	spin_lock_irqsave(&kiq->ring_lock, flags);
@@ -757,6 +760,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
 
 	BUG_ON(!ring->funcs->emit_wreg);
 
+	if (adev->in_pci_err_recovery)
+		return;
+
 	spin_lock_irqsave(&kiq->ring_lock, flags);
 	amdgpu_ring_alloc(ring, 32);
 	amdgpu_ring_emit_wreg(ring, reg, v);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index d6c38e2..a7771aa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -219,6 +219,9 @@ int psp_wait_for(struct psp_context *psp, uint32_t reg_index,
 	int i;
 	struct amdgpu_device *adev = psp->adev;
 
+	if (psp->adev->in_pci_err_recovery)
+		return 0;
+
 	for (i = 0; i < adev->usec_timeout; i++) {
 		val = RREG32(reg_index);
 		if (check_changed) {
@@ -245,6 +248,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	bool ras_intr = false;
 	bool skip_unsupport = false;
 
+	if (psp->adev->in_pci_err_recovery)
+		return 0;
+
 	mutex_lock(&psp->mutex);
 
 	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 2db195e..ccf096c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -6980,15 +6980,19 @@ static int gfx_v10_0_hw_fini(void *handle)
 
 	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
 	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
+
+	if (!adev->in_pci_err_recovery) {
 #ifndef BRING_UP_DEBUG
-	if (amdgpu_async_gfx_ring) {
-		r = gfx_v10_0_kiq_disable_kgq(adev);
-		if (r)
-			DRM_ERROR("KGQ disable failed\n");
-	}
+		if (amdgpu_async_gfx_ring) {
+			r = gfx_v10_0_kiq_disable_kgq(adev);
+			if (r)
+				DRM_ERROR("KGQ disable failed\n");
+		}
 #endif
-	if (amdgpu_gfx_disable_kcq(adev))
-		DRM_ERROR("KCQ disable failed\n");
+		if (amdgpu_gfx_disable_kcq(adev))
+			DRM_ERROR("KCQ disable failed\n");
+	}
+
 	if (amdgpu_sriov_vf(adev)) {
 		gfx_v10_0_cp_gfx_enable(adev, false);
 		/* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index a58ea08..97aa72a 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -112,6 +112,9 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
 	struct amdgpu_device *adev = smu->adev;
 	int ret = 0, index = 0;
 
+	if (smu->adev->in_pci_err_recovery)
+		return 0;
+
 	index = smu_cmn_to_asic_specific_index(smu,
 					       CMN2ASIC_MAPPING_MSG,
 					       msg);
--
2.7.4


* Re: [PATCH v4 3/8] drm/amdgpu: Fix SMU error failure
  2020-09-02 22:05   ` Bjorn Helgaas
@ 2020-09-03 15:29     ` Andrey Grodzovsky
  0 siblings, 0 replies; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-03 15:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci,
	alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas


On 9/2/20 6:05 PM, Bjorn Helgaas wrote:
> On Wed, Sep 02, 2020 at 02:42:05PM -0400, Andrey Grodzovsky wrote:
>> Wait for HW/PSP initiated ASIC reset to complete before
>> starting the recovery operations.
>>
>> v2: Remove typo
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
>>   1 file changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index e999f1f..412d07e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4838,14 +4838,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>   {
>>   	struct drm_device *dev = pci_get_drvdata(pdev);
>>   	struct amdgpu_device *adev = drm_to_adev(dev);
>> -	int r;
>> +	int r, i;
>>   	bool vram_lost;
>> +	u32 memsize;
>>   
>>   	DRM_INFO("PCI error: slot reset callback!!\n");
>>   
>> +	/* wait for asic to come out of reset */
> I know it's totally OCD, but it is a minor speed bump to read "asic"
> here and "ASIC" in the commit log above and the new comment below.
>
>> +	msleep(500);
>> +
>>   	pci_restore_state(pdev);
>>   
>> -	adev->in_pci_err_recovery = true;
>> +	/* confirm  ASIC came out of reset */
>> +	for (i = 0; i < adev->usec_timeout; i++) {
>> +		memsize = amdgpu_asic_get_config_memsize(adev);
>> +
>> +		if (memsize != 0xffffffff)
> I guess this is a spot where you actually depend on an MMIO read
> returning 0xffffffff because adev->in_pci_err_recovery is false at
> this point, so amdgpu_mm_rreg() will actually *do* the MMIO read
> instead of returning 0.  Right?


Yes, I picked this up from another place, where it is used to confirm
that the PCI config space, and the BARs specifically, are restored and
active, so that sane MMIO register values can be read back.

Andrey


>
>> +			break;
>> +		udelay(1);
>> +	}
>> +	if (memsize == 0xffffffff) {
>> +		r = -ETIME;
>> +		goto out;
>> +	}
>> +
>> +	/* TODO Call amdgpu_pre_asic_reset instead */
>> +	adev->in_pci_err_recovery = true;	
>>   	r = amdgpu_device_ip_suspend(adev);
>>   	adev->in_pci_err_recovery = false;
>>   	if (r)
>> -- 
>> 2.7.4
>>


* Re: [PATCH v4 4/8] drm/amdgpu: Fix consecutive DPC recovery failures.
  2020-09-02 22:23   ` Bjorn Helgaas
@ 2020-09-03 15:45     ` Andrey Grodzovsky
  0 siblings, 0 replies; 20+ messages in thread
From: Andrey Grodzovsky @ 2020-09-03 15:45 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: amd-gfx, sathyanarayanan.kuppuswamy, linux-pci,
	alexander.deucher, nirmodas, Dennis.Li, christian.koenig,
	luben.tuikov, bhelgaas


On 9/2/20 6:23 PM, Bjorn Helgaas wrote:
> On Wed, Sep 02, 2020 at 02:42:06PM -0400, Andrey Grodzovsky wrote:
>> Cache the PCI state on boot and before each case where we might
>> loose it.
> s/loose/lose/
>
>> v2: Add pci_restore_state while caching the PCI state to avoid
>> breaking PCI core logic for stuff like suspend/resume.
>>
>> v3: Extract pci_restore_state from amdgpu_device_cache_pci_state
>> to avoid superflous restores during GPU resets and suspend/resumes.
>>
>> v4: Style fixes.
> Is the DRM convention to keep the v2/v3/v4 stuff in the commit log?  I
> keep those below the "---" or manually remove them for PCI, but use
> the local convention, of course.
>
>> +	/* Have stored pci confspace at hand for restore in sudden PCI error */
> I assume that at least from the perspective of this code, all PCI
> errors are "sudden".  Or if they're not, I'm curious about which would
> be sudden and which would not.


Yes, all PCI errors are indeed sudden from the driver's point of view;
I am just stressing the fact that I have to 'plan ahead' and store a
working PCI config, because to my understanding, by the time we are
notified of the PCI error in the err_detected callback, the device and
its PCI config space registers are already inaccessible, so it is too
late to try to save the config space at that point.


>
>> +	if (amdgpu_device_cache_pci_state(adev->pdev))
>> +		pci_restore_state(pdev);
>> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
>> +{
>> +	struct drm_device *dev = pci_get_drvdata(pdev);
>> +	struct amdgpu_device *adev = drm_to_adev(dev);
>> +	int r;
>> +
>> +	r = pci_save_state(pdev);
>> +	if (!r) {
>> +		kfree(adev->pci_state);
>> +
>> +		adev->pci_state = pci_store_saved_state(pdev);
>> +
>> +		if (!adev->pci_state) {
>> +			DRM_ERROR("Failed to store PCI saved state");
>> +			return false;
>> +		}
>> +	} else {
>> +		DRM_WARN("Failed to save PCI state, err:%d\n", r);
>> +		return false;
>> +	}
>> +
>> +	return true;
>> +}
>> +
>> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
>> +{
>> +	struct drm_device *dev = pci_get_drvdata(pdev);
>> +	struct amdgpu_device *adev = drm_to_adev(dev);
>> +	int r;
>> +
>> +	if (!adev->pci_state)
>> +		return false;
>> +
>> +	r = pci_load_saved_state(pdev, adev->pci_state);
> I'm a little bit hesitant to pci_load_saved_state() and
> pci_store_saved_state() being used here, simply because they're
> currently only used by VFIO, Xen, and nvme.  So I don't have a real
> objection, but just pointing out that apparently you're doing
> something really special that isn't commonly used and tested, so it's
> more likely to be broken or incomplete.


See my assumption above: I assume I have to explicitly save (cache) the
PCI state while the device is fully operational, since I assume that is
not possible once a PCI error occurs and DPC isolates the device. I do
spot the PCI core calling pci_dev_save_and_disable and pci_dev_restore
as part of pcie_do_recovery->pci_reset_bus, and those functions do
store and restore the PCI config space, but I am not sure that code
executes in all cases, and pcie_do_recovery->pci_reset_bus was only
just added by Sathyanarayanan's latest patch anyway...

Andrey


>
> There's lots of state that the PCI core *can't* save/restore, and
> pci_load_saved_state() doesn't even handle all the architected PCI
> state, i.e., we only call pci_add_cap_save_buffer() or
> pci_add_ext_cap_save_buffer() for a few of the capabilities.


end of thread, other threads:[~2020-09-03 15:45 UTC | newest]

Thread overview: 20+ messages
2020-09-02 18:42 [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
2020-09-02 18:42 ` [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
2020-09-02 21:56   ` Bjorn Helgaas
2020-09-03  1:32   ` Li, Dennis
2020-09-02 18:42 ` [PATCH v4 2/8] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
2020-09-02 22:07   ` Bjorn Helgaas
2020-09-02 18:42 ` [PATCH v4 3/8] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
2020-09-02 22:05   ` Bjorn Helgaas
2020-09-03 15:29     ` Andrey Grodzovsky
2020-09-02 18:42 ` [PATCH v4 4/8] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
2020-09-02 22:23   ` Bjorn Helgaas
2020-09-03 15:45     ` Andrey Grodzovsky
2020-09-02 18:42 ` [PATCH v4 5/8] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
2020-09-02 18:42 ` [PATCH v4 6/8] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
2020-09-02 18:42 ` [PATCH v4 7/8] drm/amdgpu: Minor checkpatch fix Andrey Grodzovsky
2020-09-02 18:42 ` [PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()" Andrey Grodzovsky
2020-09-02 19:00   ` Kuppuswamy, Sathyanarayanan
2020-09-02 19:54     ` Andrey Grodzovsky
2020-09-02 20:27       ` Kuppuswamy, Sathyanarayanan
2020-09-02 21:36 ` [PATCH v4 0/8] Implement PCI Error Recovery on Navi12 Bjorn Helgaas
