All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] Implement PCI Error Recovery on Navi12
@ 2020-08-28 16:05 Andrey Grodzovsky
  2020-08-28 16:05 ` [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Andrey Grodzovsky @ 2020-08-28 16:05 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

Many PCI bus controllers are able to detect a variety of hardware PCI errors on the bus, 
such as parity errors on the data and address buses,  A typical action taken is to disconnect 
the affected device, halting all I/O to it. Typically, a reconnection mechanism is also offered, 
so that the affected PCI device(s) are reset and put back into working condition. 
In our case the reconnection mechanism is facilitated by kernel Downstream Port Containment (DPC) 
driver which will intercept the PCIe error, remove (isolate) the faulting device and reset the PCI 
link of the device after which it will call into PCIe recovery code of the PCI core. 
This code will call hooks which are implemented in this patchset where the error is 
first reported at which point we block the GPU scheduler, next DPC resets the 
PCI link which generates HW interrupt which is intercepted by SMU/PSP who 
start executing mode1 reset of the ASIC, next step is slot reset hook is called 
at which point we wait for ASIC reset to complete, restore PCI config space and run 
HW suspend/resume sequence to resinit the ASIC. 
Last hook called is resume normal operation at which point we will restart the GPU scheduler.

Andrey Grodzovsky (7):
  drm/amdgpu: Implement DPC recovery
  drm/amdgpu: Avoid accessing HW when suspending SW state
  drm/amdgpu: Block all job scheduling activity during DPC recovery
  drm/amdgpu: Fix SMU error failure
  drm/amdgpu: Fix consecutive DPC recovery failures.
  drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
  drm/amdgpu: Disable DPC for XGMI for now.

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  16 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 296 ++++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |   6 +
 drivers/gpu/drm/amd/amdgpu/atom.c          |   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     |  18 +-
 drivers/gpu/drm/amd/amdgpu/nv.c            |   4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c         |   4 +-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |   3 +
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |   3 +
 11 files changed, 348 insertions(+), 22 deletions(-)

-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-28 16:05 [PATCH v2 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
@ 2020-08-28 16:05 ` Andrey Grodzovsky
  2020-08-28 19:23   ` Alex Deucher
                     ` (2 more replies)
  2020-08-28 16:05 ` [PATCH v2 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 23+ messages in thread
From: Andrey Grodzovsky @ 2020-08-28 16:05 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

Add DPC handlers with basic recovery functionality.

v2: remove pci_save_state to avoid breaking suspend/resume

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 169 ++++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
 3 files changed, 184 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 49ea9fa..3399242 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -49,6 +49,8 @@
 #include <linux/rbtree.h>
 #include <linux/hashtable.h>
 #include <linux/dma-fence.h>
+#include <linux/pci.h>
+#include <linux/aer.h>
 
 #include <drm/ttm/ttm_bo_api.h>
 #include <drm/ttm/ttm_bo_driver.h>
@@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
 void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
 void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
 
+pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
+					   pci_channel_state_t state);
+pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
+pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
+void amdgpu_pci_resume(struct pci_dev *pdev);
+
+
 #include "amdgpu_object.h"
 
 /* used by df_v3_6.c and amdgpu_pmu.c */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5a948ed..937f8b0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -350,7 +350,8 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
  *
  * Returns the 8 bit value from the offset specified.
  */
-uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
+{
 	if (offset < adev->rmmio_size)
 		return (readb(adev->rmmio + offset));
 	BUG();
@@ -371,7 +372,8 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
  *
  * Writes the value specified to the offset specified.
  */
-void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
+void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
+{
 	if (offset < adev->rmmio_size)
 		writeb(value, adev->rmmio + offset);
 	else
@@ -2989,6 +2991,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
 	NULL
 };
 
+
 /**
  * amdgpu_device_init - initialize the driver
  *
@@ -3207,6 +3210,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 		}
 	}
 
+	pci_enable_pcie_error_reporting(adev->ddev.pdev);
+
+
 	/* Post card if necessary */
 	if (amdgpu_device_need_post(adev)) {
 		if (!adev->bios) {
@@ -4701,3 +4707,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
 
 	return 0;
 }
+
+/**
+ * amdgpu_pci_error_detected - Called when a PCI error is detected.
+ * @pdev: PCI device struct
+ * @state: PCI channel state
+ *
+ * Description: Called when a PCI error is detected.
+ *
+ * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
+ */
+pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+
+	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
+
+	switch (state) {
+	case pci_channel_io_normal:
+		return PCI_ERS_RESULT_CAN_RECOVER;
+	case pci_channel_io_frozen: {
+		/* Fatal error, prepare for slot reset */
+
+		amdgpu_device_lock_adev(adev);
+		return PCI_ERS_RESULT_NEED_RESET;
+	}
+	case pci_channel_io_perm_failure:
+		/* Permanent error, prepare for device removal */
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
+ * @pdev: pointer to PCI device
+ */
+pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
+{
+
+	DRM_INFO("PCI error: mmio enabled callback!!\n");
+
+	/* TODO - dump whatever for debugging purposes */
+
+	/* This called only if amdgpu_pci_error_detected returns
+	 * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
+	 * works, no need to reset slot.
+	 */
+
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
+ * @pdev: PCI device struct
+ *
+ * Description: This routine is called by the pci error recovery
+ * code after the PCI slot has been reset, just before we
+ * should resume normal operations.
+ */
+pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+	bool vram_lost;
+
+	DRM_INFO("PCI error: slot reset callback!!\n");
+
+	pci_restore_state(pdev);
+
+	r = amdgpu_device_ip_suspend(adev);
+	if (r)
+		goto out;
+
+
+	/* post card */
+	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_ip_resume_phase1(adev);
+	if (r)
+		goto out;
+
+	vram_lost = amdgpu_device_check_vram_lost(adev);
+	if (vram_lost) {
+		DRM_INFO("VRAM is lost due to GPU reset!\n");
+		amdgpu_inc_vram_lost(adev);
+	}
+
+	r = amdgpu_gtt_mgr_recover(
+		&adev->mman.bdev.man[TTM_PL_TT]);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_fw_loading(adev);
+	if (r)
+		return r;
+
+	r = amdgpu_device_ip_resume_phase2(adev);
+	if (r)
+		goto out;
+
+	if (vram_lost)
+		amdgpu_device_fill_reset_magic(adev);
+
+	/*
+	 * Add this ASIC as tracked as reset was already
+	 * complete successfully.
+	 */
+	amdgpu_register_gpu_instance(adev);
+
+	r = amdgpu_device_ip_late_init(adev);
+	if (r)
+		goto out;
+
+	amdgpu_fbdev_set_suspend(adev, 0);
+
+	/* must succeed. */
+	amdgpu_ras_resume(adev);
+
+
+	amdgpu_irq_gpu_reset_resume_helper(adev);
+	r = amdgpu_ib_ring_tests(adev);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_recover_vram(adev);
+
+out:
+
+	if (!r)
+		DRM_INFO("PCIe error recovery succeeded\n");
+	else {
+		DRM_ERROR("PCIe error recovery failed, err:%d", r);
+		amdgpu_device_unlock_adev(adev);
+	}
+
+	return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * amdgpu_pci_resume() - resume normal ops after PCI reset
+ * @pdev: pointer to PCI device
+ *
+ * Called when the error recovery driver tells us that its
+ * OK to resume normal operation. Use completion to allow
+ * halted scsi ops to resume.
+ */
+void amdgpu_pci_resume(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+
+	amdgpu_device_unlock_adev(adev);
+
+	DRM_INFO("PCI error: resume callback!!\n");
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d984c6a..4bbcc70 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -31,7 +31,6 @@
 #include <drm/drm_pciids.h>
 #include <linux/console.h>
 #include <linux/module.h>
-#include <linux/pci.h>
 #include <linux/pm_runtime.h>
 #include <linux/vga_switcheroo.h>
 #include <drm/drm_probe_helper.h>
@@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
 	.patchlevel = KMS_DRIVER_PATCHLEVEL,
 };
 
+static struct pci_error_handlers amdgpu_pci_err_handler = {
+	.error_detected	= amdgpu_pci_error_detected,
+	.mmio_enabled	= amdgpu_pci_mmio_enabled,
+	.slot_reset	= amdgpu_pci_slot_reset,
+	.resume		= amdgpu_pci_resume,
+};
+
 static struct pci_driver amdgpu_kms_pci_driver = {
 	.name = DRIVER_NAME,
 	.id_table = pciidlist,
@@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
 	.remove = amdgpu_pci_remove,
 	.shutdown = amdgpu_pci_shutdown,
 	.driver.pm = &amdgpu_pm_ops,
+	.err_handler = &amdgpu_pci_err_handler,
 };
 
 static int __init amdgpu_init(void)
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-08-28 16:05 [PATCH v2 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
  2020-08-28 16:05 ` [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
@ 2020-08-28 16:05 ` Andrey Grodzovsky
  2020-08-28 19:26   ` Alex Deucher
  2020-08-28 16:05 ` [PATCH v2 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Andrey Grodzovsky @ 2020-08-28 16:05 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

At this point the ASIC is already post reset by the HW/PSP
so the HW not in proper state to be configured for suspension,
some bloks might be even gated and so best is to avoid touching it.

v2: Rename in_dpc to more meaningful name

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 41 +++++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
 drivers/gpu/drm/amd/amdgpu/atom.c          |  1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 ++++++++-----
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  3 +++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
 8 files changed, 71 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 3399242..cac51e8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,6 +992,7 @@ struct amdgpu_device {
 	atomic_t			throttling_logging_enabled;
 	struct ratelimit_state		throttling_logging_rs;
 	uint32_t			ras_features;
+	bool                            in_pci_err_recovery;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 937f8b0..e67cbf2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
 {
 	uint32_t ret;
 
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_rreg(adev, reg);
 
@@ -352,6 +355,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
  */
 uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
 {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (offset < adev->rmmio_size)
 		return (readb(adev->rmmio + offset));
 	BUG();
@@ -374,14 +380,21 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
  */
 void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (offset < adev->rmmio_size)
 		writeb(value, adev->rmmio + offset);
 	else
 		BUG();
 }
 
-void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
+void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
+				       uint32_t v, uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
 
 	if ((reg * 4) < adev->rmmio_size)
@@ -409,6 +422,9 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
 void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_wreg(adev, reg, v);
 
@@ -423,6 +439,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (amdgpu_sriov_fullaccess(adev) &&
 		adev->gfx.rlc.funcs &&
 		adev->gfx.rlc.funcs->is_rlcg_access_range) {
@@ -444,6 +463,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
  */
 u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
 {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if ((reg * 4) < adev->rio_mem_size)
 		return ioread32(adev->rio_mem + (reg * 4));
 	else {
@@ -463,6 +485,9 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
  */
 void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if ((reg * 4) < adev->rio_mem_size)
 		iowrite32(v, adev->rio_mem + (reg * 4));
 	else {
@@ -482,6 +507,9 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
  */
 u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
 {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (index < adev->doorbell.num_doorbells) {
 		return readl(adev->doorbell.ptr + index);
 	} else {
@@ -502,6 +530,9 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (index < adev->doorbell.num_doorbells) {
 		writel(v, adev->doorbell.ptr + index);
 	} else {
@@ -520,6 +551,9 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
  */
 u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
 {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (index < adev->doorbell.num_doorbells) {
 		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
 	} else {
@@ -540,6 +574,9 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (index < adev->doorbell.num_doorbells) {
 		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
 	} else {
@@ -4778,7 +4815,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 
 	pci_restore_state(pdev);
 
+	adev->in_pci_err_recovery = true;
 	r = amdgpu_device_ip_suspend(adev);
+	adev->in_pci_err_recovery = false;
 	if (r)
 		goto out;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index d698142..8c9bacf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -693,6 +693,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
 	struct amdgpu_ring *ring = &kiq->ring;
 
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	BUG_ON(!ring->funcs->emit_rreg);
 
 	spin_lock_irqsave(&kiq->ring_lock, flags);
@@ -757,6 +760,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
 
 	BUG_ON(!ring->funcs->emit_wreg);
 
+	if (adev->in_pci_err_recovery)
+		return;
+
 	spin_lock_irqsave(&kiq->ring_lock, flags);
 	amdgpu_ring_alloc(ring, 32);
 	amdgpu_ring_emit_wreg(ring, reg, v);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index d6c38e2..a7771aa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -219,6 +219,9 @@ int psp_wait_for(struct psp_context *psp, uint32_t reg_index,
 	int i;
 	struct amdgpu_device *adev = psp->adev;
 
+	if (psp->adev->in_pci_err_recovery)
+		return 0;
+
 	for (i = 0; i < adev->usec_timeout; i++) {
 		val = RREG32(reg_index);
 		if (check_changed) {
@@ -245,6 +248,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	bool ras_intr = false;
 	bool skip_unsupport = false;
 
+	if (psp->adev->in_pci_err_recovery)
+		return 0;
+
 	mutex_lock(&psp->mutex);
 
 	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c
index 4cfc786..613dac1 100644
--- a/drivers/gpu/drm/amd/amdgpu/atom.c
+++ b/drivers/gpu/drm/amd/amdgpu/atom.c
@@ -750,6 +750,7 @@ static void atom_op_jump(atom_exec_context *ctx, int *ptr, int arg)
 					DRM_ERROR("atombios stuck in loop for more than %dsecs aborting\n",
 						  ATOM_CMD_TIMEOUT_SEC);
 					ctx->abort = true;
+					dump_stack();
 				}
 			} else {
 				/* jiffies wrap around we will just wait a little longer */
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 2db195e..ccf096c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -6980,15 +6980,19 @@ static int gfx_v10_0_hw_fini(void *handle)
 
 	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
 	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
+
+	if (!adev->in_pci_err_recovery) {
 #ifndef BRING_UP_DEBUG
-	if (amdgpu_async_gfx_ring) {
-		r = gfx_v10_0_kiq_disable_kgq(adev);
-		if (r)
-			DRM_ERROR("KGQ disable failed\n");
-	}
+		if (amdgpu_async_gfx_ring) {
+			r = gfx_v10_0_kiq_disable_kgq(adev);
+			if (r)
+				DRM_ERROR("KGQ disable failed\n");
+		}
 #endif
-	if (amdgpu_gfx_disable_kcq(adev))
-		DRM_ERROR("KCQ disable failed\n");
+		if (amdgpu_gfx_disable_kcq(adev))
+			DRM_ERROR("KCQ disable failed\n");
+	}
+
 	if (amdgpu_sriov_vf(adev)) {
 		gfx_v10_0_cp_gfx_enable(adev, false);
 		/* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index 8462b30..306461d 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -224,9 +224,12 @@ int smu_dpm_set_power_gate(struct smu_context *smu, uint32_t block_type,
 {
 	int ret = 0;
 
+
 	if (!smu->pm_enabled || !smu->adev->pm.dpm_enabled)
 		return -EOPNOTSUPP;
 
+
+
 	switch (block_type) {
 	/*
 	 * Some legacy code of amdgpu_vcn.c and vcn_v2*.c still uses
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index a58ea08..97aa72a 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -112,6 +112,9 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
 	struct amdgpu_device *adev = smu->adev;
 	int ret = 0, index = 0;
 
+	if (smu->adev->in_pci_err_recovery)
+		return 0;
+
 	index = smu_cmn_to_asic_specific_index(smu,
 					       CMN2ASIC_MAPPING_MSG,
 					       msg);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery
  2020-08-28 16:05 [PATCH v2 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
  2020-08-28 16:05 ` [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
  2020-08-28 16:05 ` [PATCH v2 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
@ 2020-08-28 16:05 ` Andrey Grodzovsky
  2020-08-28 19:28   ` Alex Deucher
  2020-08-28 16:05 ` [PATCH v2 4/7] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Andrey Grodzovsky @ 2020-08-28 16:05 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

DPC recovery involves ASIC reset just as normal GPU recovery so blosk
SW GPU scedulers and wait on all concurent GPU resets.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e67cbf2..9a367a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4745,6 +4745,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
 	return 0;
 }
 
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+	int i;
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		cancel_delayed_work_sync(&ring->sched.work_tdr);
+	}
+}
+
 /**
  * amdgpu_pci_error_detected - Called when a PCI error is detected.
  * @pdev: PCI device struct
@@ -4758,16 +4772,38 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
+	int i;
 
 	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
 
 	switch (state) {
 	case pci_channel_io_normal:
 		return PCI_ERS_RESULT_CAN_RECOVER;
-	case pci_channel_io_frozen: {
-		/* Fatal error, prepare for slot reset */
+	case pci_channel_io_frozen: { /* Fatal error, prepare for slot reset */
+
+		/*
+		 * Cancel and wait for all TDRs in progress if failing to
+		 * set  adev->in_gpu_reset in amdgpu_device_lock_adev
+		 *
+		 * Locking adev->reset_sem will perevent any external access
+		 * to GPU during PCI error recovery
+		 */
+		while (!amdgpu_device_lock_adev(adev, NULL))
+			amdgpu_cancel_all_tdr(adev);
+
+		/*
+		 * Block any work scheduling as we do for regualr GPU reset
+		 * for the duration of the recoveryq
+		 */
+		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+			struct amdgpu_ring *ring = adev->rings[i];
+
+			if (!ring || !ring->sched.thread)
+				continue;
+
+			drm_sched_stop(&ring->sched, NULL);
+		}
 
-		amdgpu_device_lock_adev(adev);
 		return PCI_ERS_RESULT_NEED_RESET;
 	}
 	case pci_channel_io_perm_failure:
@@ -4900,8 +4936,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
+	int i;
 
-	amdgpu_device_unlock_adev(adev);
 
 	DRM_INFO("PCI error: resume callback!!\n");
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+
+		drm_sched_resubmit_jobs(&ring->sched);
+		drm_sched_start(&ring->sched, true);
+	}
+
+	amdgpu_device_unlock_adev(adev);
 }
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 4/7] drm/amdgpu: Fix SMU error failure
  2020-08-28 16:05 [PATCH v2 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (2 preceding siblings ...)
  2020-08-28 16:05 ` [PATCH v2 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
@ 2020-08-28 16:05 ` Andrey Grodzovsky
  2020-08-28 19:29   ` Alex Deucher
  2020-08-28 16:05 ` [PATCH v2 5/7] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Andrey Grodzovsky @ 2020-08-28 16:05 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

Wait for HW/PSP initiated ASIC reset to complete before
starting the recovery operations.

v2: Remove typo

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9a367a8..06664a9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4844,14 +4844,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
-	int r;
+	int r, i;
 	bool vram_lost;
+	u32 memsize;
 
 	DRM_INFO("PCI error: slot reset callback!!\n");
 
+	/* wait for asic to come out of reset */
+	msleep(500);
+
 	pci_restore_state(pdev);
 
-	adev->in_pci_err_recovery = true;
+	/* confirm  ASIC came out of reset */
+	for (i = 0; i < adev->usec_timeout; i++) {
+		memsize = amdgpu_asic_get_config_memsize(adev);
+
+		if (memsize != 0xffffffff)
+			break;
+		udelay(1);
+	}
+	if (memsize == 0xffffffff) {
+		r = -ETIME;
+		goto out;
+	}
+
+	/* TODO Call amdgpu_pre_asic_reset instead */
+	adev->in_pci_err_recovery = true;	
 	r = amdgpu_device_ip_suspend(adev);
 	adev->in_pci_err_recovery = false;
 	if (r)
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 5/7] drm/amdgpu: Fix consecutive DPC recovery failures.
  2020-08-28 16:05 [PATCH v2 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (3 preceding siblings ...)
  2020-08-28 16:05 ` [PATCH v2 4/7] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
@ 2020-08-28 16:05 ` Andrey Grodzovsky
  2020-08-28 19:19   ` Alex Deucher
  2020-08-28 16:05 ` [PATCH v2 6/7] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
  2020-08-28 16:05 ` [PATCH v2 7/7] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
  6 siblings, 1 reply; 23+ messages in thread
From: Andrey Grodzovsky @ 2020-08-28 16:05 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

Cache the PCI state on boot and before each case were we might
loose it.

v2: Add pci_restore_state while caching the PCI state to avoid
breaking PCI core logic for stuff like suspend/resume.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 59 ++++++++++++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
 drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
 5 files changed, 67 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index cac51e8..5e74db6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,7 +992,9 @@ struct amdgpu_device {
 	atomic_t			throttling_logging_enabled;
 	struct ratelimit_state		throttling_logging_rs;
 	uint32_t			ras_features;
+
 	bool                            in_pci_err_recovery;
+	struct pci_saved_state          *pci_state;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
@@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
 pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
 void amdgpu_pci_resume(struct pci_dev *pdev);
 
+bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
+bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
+
+
 
 #include "amdgpu_object.h"
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 06664a9..7f1b970 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1284,7 +1284,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
 		dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 
 		pci_set_power_state(dev->pdev, PCI_D0);
-		pci_restore_state(dev->pdev);
+		amdgpu_device_load_pci_state(dev->pdev);
 		r = pci_enable_device(dev->pdev);
 		if (r)
 			DRM_WARN("pci_enable_device failed (%d)\n", r);
@@ -1297,7 +1297,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
 		drm_kms_helper_poll_disable(dev);
 		dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 		amdgpu_device_suspend(dev, true);
-		pci_save_state(dev->pdev);
+		amdgpu_device_cache_pci_state(dev->pdev);
 		/* Shut down the device */
 		pci_disable_device(dev->pdev);
 		pci_set_power_state(dev->pdev, PCI_D3cold);
@@ -3402,6 +3402,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		dev_err(adev->dev, "amdgpu_pmu_init failed\n");
 
+	/* Have stored pci confspace at hand for restore in sudden PCI error */
+	if (!amdgpu_device_cache_pci_state(adev->pdev))
+		DRM_WARN("Failed to cache PCI state!");
 	return 0;
 
 failed:
@@ -3428,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 	flush_delayed_work(&adev->delayed_init_work);
 	adev->shutdown = true;
 
+	kfree(adev->pci_state);
+
 	/* make sure IB test finished before entering exclusive mode
 	 * to avoid preemption on IB test
 	 * */
@@ -4853,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	/* wait for asic to come out of reset */
 	msleep(500);
 
-	pci_restore_state(pdev);
+	amdgpu_device_load_pci_state(pdev);
 
 	/* confirm  ASIC came out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
@@ -4932,8 +4937,10 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 
 out:
 
-	if (!r)
+	if (!r) {
+		amdgpu_device_cache_pci_state(adev->pdev);
 		DRM_INFO("PCIe error recovery succeeded\n");
+	}
 	else {
 		DRM_ERROR("PCIe error recovery failed, err:%d", r);
 		amdgpu_device_unlock_adev(adev);
@@ -4972,3 +4979,47 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 
 	amdgpu_device_unlock_adev(adev);
 }
+
+bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+
+	r = pci_save_state(pdev);
+	if (!r) {
+		kfree(adev->pci_state);
+
+		adev->pci_state = pci_store_saved_state(pdev);
+		pci_restore_state(pdev);
+
+		if (!adev->pci_state) {
+			DRM_ERROR("Failed to store PCI saved state");
+			return false;
+		}
+	} else {
+		DRM_WARN("Failed to save PCI state, err:%d\n", r);
+		return false;
+	}
+
+	return true;
+}
+
+bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+
+	if (!adev->pci_state)
+		return false;
+
+	r = pci_load_saved_state(pdev, adev->pci_state);
+
+	if (!r) {
+		pci_restore_state(pdev);
+	} else {
+		DRM_WARN("Failed to load PCI state, err:%d\n", r);
+		return false;
+	}
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 4bbcc70..7a6482a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
 		if (amdgpu_is_atpx_hybrid()) {
 			pci_ignore_hotplug(pdev);
 		} else {
-			pci_save_state(pdev);
+			amdgpu_device_cache_pci_state(pdev);
 			pci_disable_device(pdev);
 			pci_ignore_hotplug(pdev);
 			pci_set_power_state(pdev, PCI_D3cold);
@@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
 			pci_set_master(pdev);
 		} else {
 			pci_set_power_state(pdev, PCI_D0);
-			pci_restore_state(pdev);
+			amdgpu_device_load_pci_state(pdev);
 			ret = pci_enable_device(pdev);
 			if (ret)
 				return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 4d14023..0ec6603 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
 	/* disable BM */
 	pci_clear_master(adev->pdev);
 
-	pci_save_state(adev->pdev);
+	amdgpu_device_cache_pci_state(adev->pdev);
 
 	if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
 		dev_info(adev->dev, "GPU smu mode1 reset\n");
@@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
 
 	if (ret)
 		dev_err(adev->dev, "GPU mode1 reset failed\n");
-	pci_restore_state(adev->pdev);
+	amdgpu_device_load_pci_state(adev->pdev);
 
 	/* wait for asic to come out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 2f93c47..ddd55e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
 	/* disable BM */
 	pci_clear_master(adev->pdev);
 
-	pci_save_state(adev->pdev);
+	amdgpu_device_cache_pci_state(adev->pdev);
 
 	ret = psp_gpu_reset(adev);
 	if (ret)
 		dev_err(adev->dev, "GPU mode1 reset failed\n");
 
-	pci_restore_state(adev->pdev);
+	amdgpu_device_load_pci_state(adev->pdev);
 
 	/* wait for asic to come out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 6/7] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
  2020-08-28 16:05 [PATCH v2 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (4 preceding siblings ...)
  2020-08-28 16:05 ` [PATCH v2 5/7] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
@ 2020-08-28 16:05 ` Andrey Grodzovsky
  2020-08-28 19:30   ` Alex Deucher
  2020-08-28 16:05 ` [PATCH v2 7/7] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
  6 siblings, 1 reply; 23+ messages in thread
From: Andrey Grodzovsky @ 2020-08-28 16:05 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

Reuse exsisting functions from GPU recovery to avoid code
duplications.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 75 ++++++------------------------
 1 file changed, 14 insertions(+), 61 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7f1b970..429167b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4115,7 +4115,8 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 
 static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 			       struct list_head *device_list_handle,
-			       bool *need_full_reset_arg)
+			       bool *need_full_reset_arg,
+			       bool skip_hw_reset)
 {
 	struct amdgpu_device *tmp_adev = NULL;
 	bool need_full_reset = *need_full_reset_arg, vram_lost = false;
@@ -4125,7 +4126,7 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 	 * ASIC reset has to be done on all HGMI hive nodes ASAP
 	 * to allow proper links negotiation in FW (within 1 sec)
 	 */
-	if (need_full_reset) {
+	if (!skip_hw_reset && need_full_reset) {
 		list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
 			/* For XGMI run all resets in parallel to speed up the process */
 			if (tmp_adev->gmc.xgmi.num_physical_nodes > 1) {
@@ -4521,7 +4522,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		if (r)
 			adev->asic_reset_res = r;
 	} else {
-		r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset);
+		r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset, false);
 		if (r && r == -EAGAIN)
 			goto retry;
 	}
@@ -4850,14 +4851,19 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
 	int r, i;
-	bool vram_lost;
+	bool need_full_reset = true;
 	u32 memsize;
+	struct list_head device_list;
 
 	DRM_INFO("PCI error: slot reset callback!!\n");
 
+	INIT_LIST_HEAD(&device_list);
+	list_add_tail(&adev->gmc.xgmi.head, &device_list);
+
 	/* wait for asic to come out of reset */
 	msleep(500);
 
+	/* Restore PCI confspace */
 	amdgpu_device_load_pci_state(pdev);
 
 	/* confirm  ASIC came out of reset */
@@ -4873,70 +4879,15 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 		goto out;
 	}
 
-	/* TODO Call amdgpu_pre_asic_reset instead */
 	adev->in_pci_err_recovery = true;	
-	r = amdgpu_device_ip_suspend(adev);
+	r = amdgpu_device_pre_asic_reset(adev, NULL, &need_full_reset);
 	adev->in_pci_err_recovery = false;
 	if (r)
 		goto out;
 
-
-	/* post card */
-	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_ip_resume_phase1(adev);
-	if (r)
-		goto out;
-
-	vram_lost = amdgpu_device_check_vram_lost(adev);
-	if (vram_lost) {
-		DRM_INFO("VRAM is lost due to GPU reset!\n");
-		amdgpu_inc_vram_lost(adev);
-	}
-
-	r = amdgpu_gtt_mgr_recover(
-		&adev->mman.bdev.man[TTM_PL_TT]);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_fw_loading(adev);
-	if (r)
-		return r;
-
-	r = amdgpu_device_ip_resume_phase2(adev);
-	if (r)
-		goto out;
-
-	if (vram_lost)
-		amdgpu_device_fill_reset_magic(adev);
-
-	/*
-	 * Add this ASIC as tracked as reset was already
-	 * complete successfully.
-	 */
-	amdgpu_register_gpu_instance(adev);
-
-	r = amdgpu_device_ip_late_init(adev);
-	if (r)
-		goto out;
-
-	amdgpu_fbdev_set_suspend(adev, 0);
-
-	/* must succeed. */
-	amdgpu_ras_resume(adev);
-
-
-	amdgpu_irq_gpu_reset_resume_helper(adev);
-	r = amdgpu_ib_ring_tests(adev);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_recover_vram(adev);
+	r = amdgpu_do_asic_reset(NULL, &device_list, &need_full_reset, true);
 
 out:
-
 	if (!r) {
 		amdgpu_device_cache_pci_state(adev->pdev);
 		DRM_INFO("PCIe error recovery succeeded\n");
@@ -5022,4 +4973,6 @@ bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
 		DRM_WARN("Failed to load PCI state, err:%d\n", r);
 		return false;
 	}
+
+	return true;
 }
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 7/7] drm/amdgpu: Disable DPC for XGMI for now.
  2020-08-28 16:05 [PATCH v2 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (5 preceding siblings ...)
  2020-08-28 16:05 ` [PATCH v2 6/7] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
@ 2020-08-28 16:05 ` Andrey Grodzovsky
  2020-08-28 19:30   ` Alex Deucher
  6 siblings, 1 reply; 23+ messages in thread
From: Andrey Grodzovsky @ 2020-08-28 16:05 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

XGMI support is more complicated then single device support as
questions synchronization between the device recovering from
PCI error and other memebers of the hive is required.
Leaving this for next round.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 429167b..47e16baf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4782,6 +4782,11 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 
 	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
 
+	if (adev->gmc.xgmi.num_physical_nodes > 1) {
+		DRM_WARN("No support for XGMI hive yet...");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
 	switch (state) {
 	case pci_channel_io_normal:
 		return PCI_ERS_RESULT_CAN_RECOVER;
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 5/7] drm/amdgpu: Fix consecutive DPC recovery failures.
  2020-08-28 16:05 ` [PATCH v2 5/7] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
@ 2020-08-28 19:19   ` Alex Deucher
  0 siblings, 0 replies; 23+ messages in thread
From: Alex Deucher @ 2020-08-28 19:19 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> Cache the PCI state on boot and before each case were we might
> loose it.
>
> v2: Add pci_restore_state while caching the PCI state to avoid
> breaking PCI core logic for stuff like suspend/resume.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 59 ++++++++++++++++++++++++++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
>  drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
>  drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
>  5 files changed, 67 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index cac51e8..5e74db6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,7 +992,9 @@ struct amdgpu_device {
>         atomic_t                        throttling_logging_enabled;
>         struct ratelimit_state          throttling_logging_rs;
>         uint32_t                        ras_features;
> +
>         bool                            in_pci_err_recovery;
> +       struct pci_saved_state          *pci_state;
>  };
>
>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> @@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
>  pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
>  void amdgpu_pci_resume(struct pci_dev *pdev);
>
> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
> +
> +
>
>  #include "amdgpu_object.h"
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 06664a9..7f1b970 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1284,7 +1284,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>                 dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>
>                 pci_set_power_state(dev->pdev, PCI_D0);
> -               pci_restore_state(dev->pdev);
> +               amdgpu_device_load_pci_state(dev->pdev);
>                 r = pci_enable_device(dev->pdev);
>                 if (r)
>                         DRM_WARN("pci_enable_device failed (%d)\n", r);
> @@ -1297,7 +1297,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>                 drm_kms_helper_poll_disable(dev);
>                 dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>                 amdgpu_device_suspend(dev, true);
> -               pci_save_state(dev->pdev);
> +               amdgpu_device_cache_pci_state(dev->pdev);
>                 /* Shut down the device */
>                 pci_disable_device(dev->pdev);
>                 pci_set_power_state(dev->pdev, PCI_D3cold);
> @@ -3402,6 +3402,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>         if (r)
>                 dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>
> +       /* Have stored pci confspace at hand for restore in sudden PCI error */
> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
> +               DRM_WARN("Failed to cache PCI state!");

We should call pci_restore_state(pdev) here rather than in the helpers
otherwise we incur the extra overhead in all cases and it's not
necessary.

>         return 0;
>
>  failed:
> @@ -3428,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>         flush_delayed_work(&adev->delayed_init_work);
>         adev->shutdown = true;
>
> +       kfree(adev->pci_state);
> +
>         /* make sure IB test finished before entering exclusive mode
>          * to avoid preemption on IB test
>          * */
> @@ -4853,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>         /* wait for asic to come out of reset */
>         msleep(500);
>
> -       pci_restore_state(pdev);
> +       amdgpu_device_load_pci_state(pdev);
>
>         /* confirm  ASIC came out of reset */
>         for (i = 0; i < adev->usec_timeout; i++) {
> @@ -4932,8 +4937,10 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>
>  out:
>
> -       if (!r)
> +       if (!r) {
> +               amdgpu_device_cache_pci_state(adev->pdev);
>                 DRM_INFO("PCIe error recovery succeeded\n");
> +       }
>         else {
>                 DRM_ERROR("PCIe error recovery failed, err:%d", r);
>                 amdgpu_device_unlock_adev(adev);
> @@ -4972,3 +4979,47 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>
>         amdgpu_device_unlock_adev(adev);
>  }
> +
> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +       int r;
> +
> +       r = pci_save_state(pdev);
> +       if (!r) {
> +               kfree(adev->pci_state);
> +
> +               adev->pci_state = pci_store_saved_state(pdev);
> +               pci_restore_state(pdev);

We don't want to restore this here.  See my comment above.

> +
> +               if (!adev->pci_state) {
> +                       DRM_ERROR("Failed to store PCI saved state");
> +                       return false;
> +               }
> +       } else {
> +               DRM_WARN("Failed to save PCI state, err:%d\n", r);
> +               return false;
> +       }
> +
> +       return true;
> +}
> +
> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +       int r;
> +
> +       if (!adev->pci_state)
> +               return false;
> +
> +       r = pci_load_saved_state(pdev, adev->pci_state);
> +
> +       if (!r) {
> +               pci_restore_state(pdev);
> +       } else {
> +               DRM_WARN("Failed to load PCI state, err:%d\n", r);
> +               return false;
> +       }
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 4bbcc70..7a6482a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
>                 if (amdgpu_is_atpx_hybrid()) {
>                         pci_ignore_hotplug(pdev);
>                 } else {
> -                       pci_save_state(pdev);
> +                       amdgpu_device_cache_pci_state(pdev);
>                         pci_disable_device(pdev);
>                         pci_ignore_hotplug(pdev);
>                         pci_set_power_state(pdev, PCI_D3cold);
> @@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
>                         pci_set_master(pdev);
>                 } else {
>                         pci_set_power_state(pdev, PCI_D0);
> -                       pci_restore_state(pdev);
> +                       amdgpu_device_load_pci_state(pdev);
>                         ret = pci_enable_device(pdev);
>                         if (ret)
>                                 return ret;
> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
> index 4d14023..0ec6603 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> @@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>         /* disable BM */
>         pci_clear_master(adev->pdev);
>
> -       pci_save_state(adev->pdev);
> +       amdgpu_device_cache_pci_state(adev->pdev);
>
>         if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
>                 dev_info(adev->dev, "GPU smu mode1 reset\n");
> @@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>
>         if (ret)
>                 dev_err(adev->dev, "GPU mode1 reset failed\n");
> -       pci_restore_state(adev->pdev);
> +       amdgpu_device_load_pci_state(adev->pdev);
>
>         /* wait for asic to come out of reset */
>         for (i = 0; i < adev->usec_timeout; i++) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 2f93c47..ddd55e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
>         /* disable BM */
>         pci_clear_master(adev->pdev);
>
> -       pci_save_state(adev->pdev);
> +       amdgpu_device_cache_pci_state(adev->pdev);
>
>         ret = psp_gpu_reset(adev);
>         if (ret)
>                 dev_err(adev->dev, "GPU mode1 reset failed\n");
>
> -       pci_restore_state(adev->pdev);
> +       amdgpu_device_load_pci_state(adev->pdev);
>
>         /* wait for asic to come out of reset */
>         for (i = 0; i < adev->usec_timeout; i++) {
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-28 16:05 ` [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
@ 2020-08-28 19:23   ` Alex Deucher
  2020-08-28 19:24     ` Alex Deucher
  2020-08-31 14:26     ` Andrey Grodzovsky
  2020-08-28 19:25   ` Alex Deucher
  2020-08-31 12:44   ` Christian König
  2 siblings, 2 replies; 23+ messages in thread
From: Alex Deucher @ 2020-08-28 19:23 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> Add DPC handlers with basic recovery functionality.
>
> v2: remove pci_save_state to avoid breaking suspend/resume
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 169 ++++++++++++++++++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
>  3 files changed, 184 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 49ea9fa..3399242 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -49,6 +49,8 @@
>  #include <linux/rbtree.h>
>  #include <linux/hashtable.h>
>  #include <linux/dma-fence.h>
> +#include <linux/pci.h>
> +#include <linux/aer.h>
>
>  #include <drm/ttm/ttm_bo_api.h>
>  #include <drm/ttm/ttm_bo_driver.h>
> @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
>  void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
>  void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
>
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
> +                                          pci_channel_state_t state);
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> +void amdgpu_pci_resume(struct pci_dev *pdev);
> +
> +
>  #include "amdgpu_object.h"
>
>  /* used by df_v3_6.c and amdgpu_pmu.c */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5a948ed..937f8b0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -350,7 +350,8 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>   *
>   * Returns the 8 bit value from the offset specified.
>   */
> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
> +{
>         if (offset < adev->rmmio_size)
>                 return (readb(adev->rmmio + offset));
>         BUG();
> @@ -371,7 +372,8 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
>   *
>   * Writes the value specified to the offset specified.
>   */
> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
> +{
>         if (offset < adev->rmmio_size)
>                 writeb(value, adev->rmmio + offset);
>         else
> @@ -2989,6 +2991,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>         NULL
>  };
>
> +
>  /**
>   * amdgpu_device_init - initialize the driver
>   *
> @@ -3207,6 +3210,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>                 }
>         }
>
> +       pci_enable_pcie_error_reporting(adev->ddev.pdev);
> +
> +
>         /* Post card if necessary */
>         if (amdgpu_device_need_post(adev)) {
>                 if (!adev->bios) {
> @@ -4701,3 +4707,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>
>         return 0;
>  }
> +
> +/**
> + * amdgpu_pci_error_detected - Called when a PCI error is detected.
> + * @pdev: PCI device struct
> + * @state: PCI channel state
> + *
> + * Description: Called when a PCI error is detected.
> + *
> + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
> + */
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +       DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
> +
> +       switch (state) {
> +       case pci_channel_io_normal:
> +               return PCI_ERS_RESULT_CAN_RECOVER;
> +       case pci_channel_io_frozen: {
> +               /* Fatal error, prepare for slot reset */
> +
> +               amdgpu_device_lock_adev(adev);
> +               return PCI_ERS_RESULT_NEED_RESET;
> +       }
> +       case pci_channel_io_perm_failure:
> +               /* Permanent error, prepare for device removal */
> +               return PCI_ERS_RESULT_DISCONNECT;
> +       }
> +       return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +/**
> + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
> + * @pdev: pointer to PCI device
> + */
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
> +{
> +
> +       DRM_INFO("PCI error: mmio enabled callback!!\n");
> +
> +       /* TODO - dump whatever for debugging purposes */
> +
> +       /* This called only if amdgpu_pci_error_detected returns
> +        * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
> +        * works, no need to reset slot.
> +        */
> +
> +       return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
> + * @pdev: PCI device struct
> + *
> + * Description: This routine is called by the pci error recovery
> + * code after the PCI slot has been reset, just before we
> + * should resume normal operations.
> + */
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +       int r;
> +       bool vram_lost;
> +
> +       DRM_INFO("PCI error: slot reset callback!!\n");
> +
> +       pci_restore_state(pdev);
> +
> +       r = amdgpu_device_ip_suspend(adev);
> +       if (r)
> +               goto out;
> +
> +
> +       /* post card */
> +       r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
> +       if (r)
> +               goto out;
> +
> +       r = amdgpu_device_ip_resume_phase1(adev);
> +       if (r)
> +               goto out;
> +
> +       vram_lost = amdgpu_device_check_vram_lost(adev);
> +       if (vram_lost) {
> +               DRM_INFO("VRAM is lost due to GPU reset!\n");
> +               amdgpu_inc_vram_lost(adev);
> +       }
> +
> +       r = amdgpu_gtt_mgr_recover(
> +               &adev->mman.bdev.man[TTM_PL_TT]);
> +       if (r)
> +               goto out;
> +
> +       r = amdgpu_device_fw_loading(adev);
> +       if (r)
> +               return r;
> +
> +       r = amdgpu_device_ip_resume_phase2(adev);
> +       if (r)
> +               goto out;
> +
> +       if (vram_lost)
> +               amdgpu_device_fill_reset_magic(adev);
> +
> +       /*
> +        * Add this ASIC as tracked as reset was already
> +        * complete successfully.
> +        */
> +       amdgpu_register_gpu_instance(adev);
> +
> +       r = amdgpu_device_ip_late_init(adev);
> +       if (r)
> +               goto out;
> +
> +       amdgpu_fbdev_set_suspend(adev, 0);
> +
> +       /* must succeed. */
> +       amdgpu_ras_resume(adev);
> +
> +
> +       amdgpu_irq_gpu_reset_resume_helper(adev);
> +       r = amdgpu_ib_ring_tests(adev);
> +       if (r)
> +               goto out;
> +
> +       r = amdgpu_device_recover_vram(adev);
> +
> +out:
> +
> +       if (!r)
> +               DRM_INFO("PCIe error recovery succeeded\n");
> +       else {
> +               DRM_ERROR("PCIe error recovery failed, err:%d", r);
> +               amdgpu_device_unlock_adev(adev);
> +       }
> +
> +       return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_resume() - resume normal ops after PCI reset
> + * @pdev: pointer to PCI device
> + *
> + * Called when the error recovery driver tells us that its
> + * OK to resume normal operation. Use completion to allow
> + * halted scsi ops to resume.
> + */
> +void amdgpu_pci_resume(struct pci_dev *pdev)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +       amdgpu_device_unlock_adev(adev);
> +
> +       DRM_INFO("PCI error: resume callback!!\n");
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index d984c6a..4bbcc70 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -31,7 +31,6 @@
>  #include <drm/drm_pciids.h>
>  #include <linux/console.h>
>  #include <linux/module.h>
> -#include <linux/pci.h>

Is this intended?  Seems unrelated.  I think this should be part of
the previous patch.

Alex


>  #include <linux/pm_runtime.h>
>  #include <linux/vga_switcheroo.h>
>  #include <drm/drm_probe_helper.h>
> @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
>         .patchlevel = KMS_DRIVER_PATCHLEVEL,
>  };
>
> +static struct pci_error_handlers amdgpu_pci_err_handler = {
> +       .error_detected = amdgpu_pci_error_detected,
> +       .mmio_enabled   = amdgpu_pci_mmio_enabled,
> +       .slot_reset     = amdgpu_pci_slot_reset,
> +       .resume         = amdgpu_pci_resume,
> +};
> +
>  static struct pci_driver amdgpu_kms_pci_driver = {
>         .name = DRIVER_NAME,
>         .id_table = pciidlist,
> @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>         .remove = amdgpu_pci_remove,
>         .shutdown = amdgpu_pci_shutdown,
>         .driver.pm = &amdgpu_pm_ops,
> +       .err_handler = &amdgpu_pci_err_handler,
>  };
>
>  static int __init amdgpu_init(void)
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-28 19:23   ` Alex Deucher
@ 2020-08-28 19:24     ` Alex Deucher
  2020-08-31 14:26     ` Andrey Grodzovsky
  1 sibling, 0 replies; 23+ messages in thread
From: Alex Deucher @ 2020-08-28 19:24 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On Fri, Aug 28, 2020 at 3:23 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
> >
> > Add DPC handlers with basic recovery functionality.
> >
> > v2: remove pci_save_state to avoid breaking suspend/resume
> >
> > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 169 ++++++++++++++++++++++++++++-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
> >  3 files changed, 184 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 49ea9fa..3399242 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -49,6 +49,8 @@
> >  #include <linux/rbtree.h>
> >  #include <linux/hashtable.h>
> >  #include <linux/dma-fence.h>
> > +#include <linux/pci.h>
> > +#include <linux/aer.h>
> >
> >  #include <drm/ttm/ttm_bo_api.h>
> >  #include <drm/ttm/ttm_bo_driver.h>
> > @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
> >  void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
> >  void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
> >
> > +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
> > +                                          pci_channel_state_t state);
> > +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> > +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> > +void amdgpu_pci_resume(struct pci_dev *pdev);
> > +
> > +
> >  #include "amdgpu_object.h"
> >
> >  /* used by df_v3_6.c and amdgpu_pmu.c */
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 5a948ed..937f8b0 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -350,7 +350,8 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
> >   *
> >   * Returns the 8 bit value from the offset specified.
> >   */
> > -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> > +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
> > +{
> >         if (offset < adev->rmmio_size)
> >                 return (readb(adev->rmmio + offset));
> >         BUG();
> > @@ -371,7 +372,8 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> >   *
> >   * Writes the value specified to the offset specified.
> >   */
> > -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
> > +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
> > +{
> >         if (offset < adev->rmmio_size)
> >                 writeb(value, adev->rmmio + offset);
> >         else
> > @@ -2989,6 +2991,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
> >         NULL
> >  };
> >
> > +
> >  /**
> >   * amdgpu_device_init - initialize the driver
> >   *
> > @@ -3207,6 +3210,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >                 }
> >         }
> >
> > +       pci_enable_pcie_error_reporting(adev->ddev.pdev);
> > +
> > +
> >         /* Post card if necessary */
> >         if (amdgpu_device_need_post(adev)) {
> >                 if (!adev->bios) {
> > @@ -4701,3 +4707,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
> >
> >         return 0;
> >  }
> > +
> > +/**
> > + * amdgpu_pci_error_detected - Called when a PCI error is detected.
> > + * @pdev: PCI device struct
> > + * @state: PCI channel state
> > + *
> > + * Description: Called when a PCI error is detected.
> > + *
> > + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
> > + */
> > +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> > +{
> > +       struct drm_device *dev = pci_get_drvdata(pdev);
> > +       struct amdgpu_device *adev = drm_to_adev(dev);
> > +
> > +       DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
> > +
> > +       switch (state) {
> > +       case pci_channel_io_normal:
> > +               return PCI_ERS_RESULT_CAN_RECOVER;
> > +       case pci_channel_io_frozen: {
> > +               /* Fatal error, prepare for slot reset */
> > +
> > +               amdgpu_device_lock_adev(adev);
> > +               return PCI_ERS_RESULT_NEED_RESET;
> > +       }
> > +       case pci_channel_io_perm_failure:
> > +               /* Permanent error, prepare for device removal */
> > +               return PCI_ERS_RESULT_DISCONNECT;
> > +       }
> > +       return PCI_ERS_RESULT_NEED_RESET;
> > +}
> > +
> > +/**
> > + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
> > + * @pdev: pointer to PCI device
> > + */
> > +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
> > +{
> > +
> > +       DRM_INFO("PCI error: mmio enabled callback!!\n");
> > +
> > +       /* TODO - dump whatever for debugging purposes */
> > +
> > +       /* This called only if amdgpu_pci_error_detected returns
> > +        * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
> > +        * works, no need to reset slot.
> > +        */
> > +
> > +       return PCI_ERS_RESULT_RECOVERED;
> > +}
> > +
> > +/**
> > + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
> > + * @pdev: PCI device struct
> > + *
> > + * Description: This routine is called by the pci error recovery
> > + * code after the PCI slot has been reset, just before we
> > + * should resume normal operations.
> > + */
> > +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> > +{
> > +       struct drm_device *dev = pci_get_drvdata(pdev);
> > +       struct amdgpu_device *adev = drm_to_adev(dev);
> > +       int r;
> > +       bool vram_lost;
> > +
> > +       DRM_INFO("PCI error: slot reset callback!!\n");
> > +
> > +       pci_restore_state(pdev);
> > +
> > +       r = amdgpu_device_ip_suspend(adev);
> > +       if (r)
> > +               goto out;
> > +
> > +
> > +       /* post card */
> > +       r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
> > +       if (r)
> > +               goto out;
> > +
> > +       r = amdgpu_device_ip_resume_phase1(adev);
> > +       if (r)
> > +               goto out;
> > +
> > +       vram_lost = amdgpu_device_check_vram_lost(adev);
> > +       if (vram_lost) {
> > +               DRM_INFO("VRAM is lost due to GPU reset!\n");
> > +               amdgpu_inc_vram_lost(adev);
> > +       }
> > +
> > +       r = amdgpu_gtt_mgr_recover(
> > +               &adev->mman.bdev.man[TTM_PL_TT]);
> > +       if (r)
> > +               goto out;
> > +
> > +       r = amdgpu_device_fw_loading(adev);
> > +       if (r)
> > +               return r;
> > +
> > +       r = amdgpu_device_ip_resume_phase2(adev);
> > +       if (r)
> > +               goto out;
> > +
> > +       if (vram_lost)
> > +               amdgpu_device_fill_reset_magic(adev);
> > +
> > +       /*
> > +        * Add this ASIC as tracked as reset was already
> > +        * complete successfully.
> > +        */
> > +       amdgpu_register_gpu_instance(adev);
> > +
> > +       r = amdgpu_device_ip_late_init(adev);
> > +       if (r)
> > +               goto out;
> > +
> > +       amdgpu_fbdev_set_suspend(adev, 0);
> > +
> > +       /* must succeed. */
> > +       amdgpu_ras_resume(adev);
> > +
> > +
> > +       amdgpu_irq_gpu_reset_resume_helper(adev);
> > +       r = amdgpu_ib_ring_tests(adev);
> > +       if (r)
> > +               goto out;
> > +
> > +       r = amdgpu_device_recover_vram(adev);
> > +
> > +out:
> > +
> > +       if (!r)
> > +               DRM_INFO("PCIe error recovery succeeded\n");
> > +       else {
> > +               DRM_ERROR("PCIe error recovery failed, err:%d", r);
> > +               amdgpu_device_unlock_adev(adev);
> > +       }
> > +
> > +       return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
> > +}
> > +
> > +/**
> > + * amdgpu_pci_resume() - resume normal ops after PCI reset
> > + * @pdev: pointer to PCI device
> > + *
> > + * Called when the error recovery driver tells us that its
> > + * OK to resume normal operation. Use completion to allow
> > + * halted scsi ops to resume.
> > + */
> > +void amdgpu_pci_resume(struct pci_dev *pdev)
> > +{
> > +       struct drm_device *dev = pci_get_drvdata(pdev);
> > +       struct amdgpu_device *adev = drm_to_adev(dev);
> > +
> > +       amdgpu_device_unlock_adev(adev);
> > +
> > +       DRM_INFO("PCI error: resume callback!!\n");
> > +}
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index d984c6a..4bbcc70 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -31,7 +31,6 @@
> >  #include <drm/drm_pciids.h>
> >  #include <linux/console.h>
> >  #include <linux/module.h>
> > -#include <linux/pci.h>
>
> Is this intended?  Seems unrelated.  I think this should be part of
> the previous patch.

Nevermind, I see it was added to amdgpu.h.  Ignore this comment.

Alex

>
> Alex
>
>
> >  #include <linux/pm_runtime.h>
> >  #include <linux/vga_switcheroo.h>
> >  #include <drm/drm_probe_helper.h>
> > @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
> >         .patchlevel = KMS_DRIVER_PATCHLEVEL,
> >  };
> >
> > +static struct pci_error_handlers amdgpu_pci_err_handler = {
> > +       .error_detected = amdgpu_pci_error_detected,
> > +       .mmio_enabled   = amdgpu_pci_mmio_enabled,
> > +       .slot_reset     = amdgpu_pci_slot_reset,
> > +       .resume         = amdgpu_pci_resume,
> > +};
> > +
> >  static struct pci_driver amdgpu_kms_pci_driver = {
> >         .name = DRIVER_NAME,
> >         .id_table = pciidlist,
> > @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
> >         .remove = amdgpu_pci_remove,
> >         .shutdown = amdgpu_pci_shutdown,
> >         .driver.pm = &amdgpu_pm_ops,
> > +       .err_handler = &amdgpu_pci_err_handler,
> >  };
> >
> >  static int __init amdgpu_init(void)
> > --
> > 2.7.4
> >
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-28 16:05 ` [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
  2020-08-28 19:23   ` Alex Deucher
@ 2020-08-28 19:25   ` Alex Deucher
  2020-08-31 12:44   ` Christian König
  2 siblings, 0 replies; 23+ messages in thread
From: Alex Deucher @ 2020-08-28 19:25 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> Add DPC handlers with basic recovery functionality.
>
> v2: remove pci_save_state to avoid breaking suspend/resume
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 169 ++++++++++++++++++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
>  3 files changed, 184 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 49ea9fa..3399242 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -49,6 +49,8 @@
>  #include <linux/rbtree.h>
>  #include <linux/hashtable.h>
>  #include <linux/dma-fence.h>
> +#include <linux/pci.h>
> +#include <linux/aer.h>
>
>  #include <drm/ttm/ttm_bo_api.h>
>  #include <drm/ttm/ttm_bo_driver.h>
> @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
>  void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
>  void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
>
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
> +                                          pci_channel_state_t state);
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> +void amdgpu_pci_resume(struct pci_dev *pdev);
> +
> +
>  #include "amdgpu_object.h"
>
>  /* used by df_v3_6.c and amdgpu_pmu.c */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5a948ed..937f8b0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -350,7 +350,8 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>   *
>   * Returns the 8 bit value from the offset specified.
>   */
> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
> +{
>         if (offset < adev->rmmio_size)
>                 return (readb(adev->rmmio + offset));
>         BUG();
> @@ -371,7 +372,8 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
>   *
>   * Writes the value specified to the offset specified.
>   */
> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
> +{
>         if (offset < adev->rmmio_size)
>                 writeb(value, adev->rmmio + offset);
>         else
> @@ -2989,6 +2991,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>         NULL
>  };
>
> +
>  /**
>   * amdgpu_device_init - initialize the driver
>   *
> @@ -3207,6 +3210,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>                 }
>         }
>
> +       pci_enable_pcie_error_reporting(adev->ddev.pdev);
> +
> +
>         /* Post card if necessary */
>         if (amdgpu_device_need_post(adev)) {
>                 if (!adev->bios) {
> @@ -4701,3 +4707,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>
>         return 0;
>  }
> +
> +/**
> + * amdgpu_pci_error_detected - Called when a PCI error is detected.
> + * @pdev: PCI device struct
> + * @state: PCI channel state
> + *
> + * Description: Called when a PCI error is detected.
> + *
> + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
> + */
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +       DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
> +
> +       switch (state) {
> +       case pci_channel_io_normal:
> +               return PCI_ERS_RESULT_CAN_RECOVER;
> +       case pci_channel_io_frozen: {
> +               /* Fatal error, prepare for slot reset */
> +
> +               amdgpu_device_lock_adev(adev);
> +               return PCI_ERS_RESULT_NEED_RESET;
> +       }
> +       case pci_channel_io_perm_failure:
> +               /* Permanent error, prepare for device removal */
> +               return PCI_ERS_RESULT_DISCONNECT;
> +       }
> +       return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +/**
> + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
> + * @pdev: pointer to PCI device
> + */
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
> +{
> +
> +       DRM_INFO("PCI error: mmio enabled callback!!\n");
> +
> +       /* TODO - dump whatever for debugging purposes */
> +
> +       /* This called only if amdgpu_pci_error_detected returns
> +        * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
> +        * works, no need to reset slot.
> +        */
> +
> +       return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
> + * @pdev: PCI device struct
> + *
> + * Description: This routine is called by the pci error recovery
> + * code after the PCI slot has been reset, just before we
> + * should resume normal operations.
> + */
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +       int r;
> +       bool vram_lost;
> +
> +       DRM_INFO("PCI error: slot reset callback!!\n");
> +
> +       pci_restore_state(pdev);
> +
> +       r = amdgpu_device_ip_suspend(adev);
> +       if (r)
> +               goto out;
> +
> +
> +       /* post card */
> +       r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
> +       if (r)
> +               goto out;
> +
> +       r = amdgpu_device_ip_resume_phase1(adev);
> +       if (r)
> +               goto out;
> +
> +       vram_lost = amdgpu_device_check_vram_lost(adev);
> +       if (vram_lost) {
> +               DRM_INFO("VRAM is lost due to GPU reset!\n");
> +               amdgpu_inc_vram_lost(adev);
> +       }
> +
> +       r = amdgpu_gtt_mgr_recover(
> +               &adev->mman.bdev.man[TTM_PL_TT]);
> +       if (r)
> +               goto out;
> +
> +       r = amdgpu_device_fw_loading(adev);
> +       if (r)
> +               return r;
> +
> +       r = amdgpu_device_ip_resume_phase2(adev);
> +       if (r)
> +               goto out;
> +
> +       if (vram_lost)
> +               amdgpu_device_fill_reset_magic(adev);
> +
> +       /*
> +        * Add this ASIC as tracked as reset was already
> +        * complete successfully.
> +        */
> +       amdgpu_register_gpu_instance(adev);
> +
> +       r = amdgpu_device_ip_late_init(adev);
> +       if (r)
> +               goto out;
> +
> +       amdgpu_fbdev_set_suspend(adev, 0);
> +
> +       /* must succeed. */
> +       amdgpu_ras_resume(adev);
> +
> +
> +       amdgpu_irq_gpu_reset_resume_helper(adev);
> +       r = amdgpu_ib_ring_tests(adev);
> +       if (r)
> +               goto out;
> +
> +       r = amdgpu_device_recover_vram(adev);
> +
> +out:
> +
> +       if (!r)
> +               DRM_INFO("PCIe error recovery succeeded\n");
> +       else {
> +               DRM_ERROR("PCIe error recovery failed, err:%d", r);
> +               amdgpu_device_unlock_adev(adev);
> +       }
> +
> +       return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_resume() - resume normal ops after PCI reset
> + * @pdev: pointer to PCI device
> + *
> + * Called when the error recovery driver tells us that its
> + * OK to resume normal operation. Use completion to allow
> + * halted scsi ops to resume.
> + */
> +void amdgpu_pci_resume(struct pci_dev *pdev)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +       amdgpu_device_unlock_adev(adev);
> +
> +       DRM_INFO("PCI error: resume callback!!\n");
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index d984c6a..4bbcc70 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -31,7 +31,6 @@
>  #include <drm/drm_pciids.h>
>  #include <linux/console.h>
>  #include <linux/module.h>
> -#include <linux/pci.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/vga_switcheroo.h>
>  #include <drm/drm_probe_helper.h>
> @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
>         .patchlevel = KMS_DRIVER_PATCHLEVEL,
>  };
>
> +static struct pci_error_handlers amdgpu_pci_err_handler = {
> +       .error_detected = amdgpu_pci_error_detected,
> +       .mmio_enabled   = amdgpu_pci_mmio_enabled,
> +       .slot_reset     = amdgpu_pci_slot_reset,
> +       .resume         = amdgpu_pci_resume,
> +};
> +
>  static struct pci_driver amdgpu_kms_pci_driver = {
>         .name = DRIVER_NAME,
>         .id_table = pciidlist,
> @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>         .remove = amdgpu_pci_remove,
>         .shutdown = amdgpu_pci_shutdown,
>         .driver.pm = &amdgpu_pm_ops,
> +       .err_handler = &amdgpu_pci_err_handler,
>  };
>
>  static int __init amdgpu_init(void)
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-08-28 16:05 ` [PATCH v2 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
@ 2020-08-28 19:26   ` Alex Deucher
  2020-08-31 20:19     ` Luben Tuikov
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Deucher @ 2020-08-28 19:26 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> At this point the ASIC is already post reset by the HW/PSP
> so the HW not in proper state to be configured for suspension,
> some bloks might be even gated and so best is to avoid touching it.

typo: "blocks"

>
> v2: Rename in_dpc to more meaningful name
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 41 +++++++++++++++++++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
>  drivers/gpu/drm/amd/amdgpu/atom.c          |  1 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 ++++++++-----
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  3 +++
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
>  8 files changed, 71 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 3399242..cac51e8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,6 +992,7 @@ struct amdgpu_device {
>         atomic_t                        throttling_logging_enabled;
>         struct ratelimit_state          throttling_logging_rs;
>         uint32_t                        ras_features;
> +       bool                            in_pci_err_recovery;
>  };
>
>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 937f8b0..e67cbf2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>  {
>         uint32_t ret;
>
> +       if (adev->in_pci_err_recovery)
> +               return 0;
> +
>         if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>                 return amdgpu_kiq_rreg(adev, reg);
>
> @@ -352,6 +355,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>   */
>  uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>  {
> +       if (adev->in_pci_err_recovery)
> +               return 0;
> +
>         if (offset < adev->rmmio_size)
>                 return (readb(adev->rmmio + offset));
>         BUG();
> @@ -374,14 +380,21 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>   */
>  void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>  {
> +       if (adev->in_pci_err_recovery)
> +               return;
> +
>         if (offset < adev->rmmio_size)
>                 writeb(value, adev->rmmio + offset);
>         else
>                 BUG();
>  }
>
> -void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
> +void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
> +                                      uint32_t v, uint32_t acc_flags)
>  {
> +       if (adev->in_pci_err_recovery)
> +               return;
> +
>         trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
>
>         if ((reg * 4) < adev->rmmio_size)
> @@ -409,6 +422,9 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
>  void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>                     uint32_t acc_flags)
>  {
> +       if (adev->in_pci_err_recovery)
> +               return;
> +
>         if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>                 return amdgpu_kiq_wreg(adev, reg, v);
>
> @@ -423,6 +439,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>  void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>                     uint32_t acc_flags)
>  {
> +       if (adev->in_pci_err_recovery)
> +               return;
> +
>         if (amdgpu_sriov_fullaccess(adev) &&
>                 adev->gfx.rlc.funcs &&
>                 adev->gfx.rlc.funcs->is_rlcg_access_range) {
> @@ -444,6 +463,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
>   */
>  u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>  {
> +       if (adev->in_pci_err_recovery)
> +               return 0;
> +
>         if ((reg * 4) < adev->rio_mem_size)
>                 return ioread32(adev->rio_mem + (reg * 4));
>         else {
> @@ -463,6 +485,9 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>   */
>  void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>  {
> +       if (adev->in_pci_err_recovery)
> +               return;
> +
>         if ((reg * 4) < adev->rio_mem_size)
>                 iowrite32(v, adev->rio_mem + (reg * 4));
>         else {
> @@ -482,6 +507,9 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>   */
>  u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>  {
> +       if (adev->in_pci_err_recovery)
> +               return 0;
> +
>         if (index < adev->doorbell.num_doorbells) {
>                 return readl(adev->doorbell.ptr + index);
>         } else {
> @@ -502,6 +530,9 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>   */
>  void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>  {
> +       if (adev->in_pci_err_recovery)
> +               return;
> +
>         if (index < adev->doorbell.num_doorbells) {
>                 writel(v, adev->doorbell.ptr + index);
>         } else {
> @@ -520,6 +551,9 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>   */
>  u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>  {
> +       if (adev->in_pci_err_recovery)
> +               return 0;
> +
>         if (index < adev->doorbell.num_doorbells) {
>                 return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
>         } else {
> @@ -540,6 +574,9 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>   */
>  void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
>  {
> +       if (adev->in_pci_err_recovery)
> +               return;
> +
>         if (index < adev->doorbell.num_doorbells) {
>                 atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
>         } else {
> @@ -4778,7 +4815,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>
>         pci_restore_state(pdev);
>
> +       adev->in_pci_err_recovery = true;
>         r = amdgpu_device_ip_suspend(adev);
> +       adev->in_pci_err_recovery = false;
>         if (r)
>                 goto out;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index d698142..8c9bacf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -693,6 +693,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
>         struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>         struct amdgpu_ring *ring = &kiq->ring;
>
> +       if (adev->in_pci_err_recovery)
> +               return 0;
> +
>         BUG_ON(!ring->funcs->emit_rreg);
>
>         spin_lock_irqsave(&kiq->ring_lock, flags);
> @@ -757,6 +760,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>
>         BUG_ON(!ring->funcs->emit_wreg);
>
> +       if (adev->in_pci_err_recovery)
> +               return;
> +
>         spin_lock_irqsave(&kiq->ring_lock, flags);
>         amdgpu_ring_alloc(ring, 32);
>         amdgpu_ring_emit_wreg(ring, reg, v);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index d6c38e2..a7771aa 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -219,6 +219,9 @@ int psp_wait_for(struct psp_context *psp, uint32_t reg_index,
>         int i;
>         struct amdgpu_device *adev = psp->adev;
>
> +       if (psp->adev->in_pci_err_recovery)
> +               return 0;
> +
>         for (i = 0; i < adev->usec_timeout; i++) {
>                 val = RREG32(reg_index);
>                 if (check_changed) {
> @@ -245,6 +248,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
>         bool ras_intr = false;
>         bool skip_unsupport = false;
>
> +       if (psp->adev->in_pci_err_recovery)
> +               return 0;
> +
>         mutex_lock(&psp->mutex);
>
>         memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
> diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c
> index 4cfc786..613dac1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/atom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/atom.c
> @@ -750,6 +750,7 @@ static void atom_op_jump(atom_exec_context *ctx, int *ptr, int arg)
>                                         DRM_ERROR("atombios stuck in loop for more than %dsecs aborting\n",
>                                                   ATOM_CMD_TIMEOUT_SEC);
>                                         ctx->abort = true;
> +                                       dump_stack();

Leftover from debugging?



>                                 }
>                         } else {
>                                 /* jiffies wrap around we will just wait a little longer */
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 2db195e..ccf096c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -6980,15 +6980,19 @@ static int gfx_v10_0_hw_fini(void *handle)
>
>         amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
>         amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
> +
> +       if (!adev->in_pci_err_recovery) {
>  #ifndef BRING_UP_DEBUG
> -       if (amdgpu_async_gfx_ring) {
> -               r = gfx_v10_0_kiq_disable_kgq(adev);
> -               if (r)
> -                       DRM_ERROR("KGQ disable failed\n");
> -       }
> +               if (amdgpu_async_gfx_ring) {
> +                       r = gfx_v10_0_kiq_disable_kgq(adev);
> +                       if (r)
> +                               DRM_ERROR("KGQ disable failed\n");
> +               }
>  #endif
> -       if (amdgpu_gfx_disable_kcq(adev))
> -               DRM_ERROR("KCQ disable failed\n");
> +               if (amdgpu_gfx_disable_kcq(adev))
> +                       DRM_ERROR("KCQ disable failed\n");
> +       }
> +

gfx9 probably needs something similar, but that can come later if we
ever validate this for older parts.

>         if (amdgpu_sriov_vf(adev)) {
>                 gfx_v10_0_cp_gfx_enable(adev, false);
>                 /* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index 8462b30..306461d 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -224,9 +224,12 @@ int smu_dpm_set_power_gate(struct smu_context *smu, uint32_t block_type,
>  {
>         int ret = 0;
>
> +
>         if (!smu->pm_enabled || !smu->adev->pm.dpm_enabled)
>                 return -EOPNOTSUPP;
>
> +
> +

Unrelated whitespaces changes.  Please drop.  With this things fixed:
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>


>         switch (block_type) {
>         /*
>          * Some legacy code of amdgpu_vcn.c and vcn_v2*.c still uses
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index a58ea08..97aa72a 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -112,6 +112,9 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>         struct amdgpu_device *adev = smu->adev;
>         int ret = 0, index = 0;
>
> +       if (smu->adev->in_pci_err_recovery)
> +               return 0;
> +
>         index = smu_cmn_to_asic_specific_index(smu,
>                                                CMN2ASIC_MAPPING_MSG,
>                                                msg);
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery
  2020-08-28 16:05 ` [PATCH v2 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
@ 2020-08-28 19:28   ` Alex Deucher
  0 siblings, 0 replies; 23+ messages in thread
From: Alex Deucher @ 2020-08-28 19:28 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> DPC recovery involves ASIC reset just as normal GPU recovery so blosk

Typo: "block"

> SW GPU scedulers and wait on all concurent GPU resets.

Typos: "schedulers" and "concurrent"

>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
>  1 file changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e67cbf2..9a367a8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4745,6 +4745,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>         return 0;
>  }
>
> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> +{
> +       int i;
> +
> +       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +               struct amdgpu_ring *ring = adev->rings[i];
> +
> +               if (!ring || !ring->sched.thread)
> +                       continue;
> +
> +               cancel_delayed_work_sync(&ring->sched.work_tdr);
> +       }
> +}
> +
>  /**
>   * amdgpu_pci_error_detected - Called when a PCI error is detected.
>   * @pdev: PCI device struct
> @@ -4758,16 +4772,38 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>  {
>         struct drm_device *dev = pci_get_drvdata(pdev);
>         struct amdgpu_device *adev = drm_to_adev(dev);
> +       int i;
>
>         DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>
>         switch (state) {
>         case pci_channel_io_normal:
>                 return PCI_ERS_RESULT_CAN_RECOVER;
> -       case pci_channel_io_frozen: {
> -               /* Fatal error, prepare for slot reset */
> +       case pci_channel_io_frozen: { /* Fatal error, prepare for slot reset */
> +
> +               /*
> +                * Cancel and wait for all TDRs in progress if failing to
> +                * set  adev->in_gpu_reset in amdgpu_device_lock_adev
> +                *
> +                * Locking adev->reset_sem will perevent any external access

Typo: "prevent"

> +                * to GPU during PCI error recovery
> +                */
> +               while (!amdgpu_device_lock_adev(adev, NULL))
> +                       amdgpu_cancel_all_tdr(adev);
> +
> +               /*
> +                * Block any work scheduling as we do for regualr GPU reset

Typo: "regular"

> +                * for the duration of the recoveryq

Typo: "recovery"

Overall looks good to me, but you might want to run the scheduling
changes by Christian as well.  With the typos fixed:
Acked-by: Alex Deucher <alexander.deucher@amd.com>


> +                */
> +               for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +                       struct amdgpu_ring *ring = adev->rings[i];
> +
> +                       if (!ring || !ring->sched.thread)
> +                               continue;
> +
> +                       drm_sched_stop(&ring->sched, NULL);
> +               }
>
> -               amdgpu_device_lock_adev(adev);
>                 return PCI_ERS_RESULT_NEED_RESET;
>         }
>         case pci_channel_io_perm_failure:
> @@ -4900,8 +4936,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>  {
>         struct drm_device *dev = pci_get_drvdata(pdev);
>         struct amdgpu_device *adev = drm_to_adev(dev);
> +       int i;
>
> -       amdgpu_device_unlock_adev(adev);
>
>         DRM_INFO("PCI error: resume callback!!\n");
> +
> +       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +               struct amdgpu_ring *ring = adev->rings[i];
> +
> +               if (!ring || !ring->sched.thread)
> +                       continue;
> +
> +
> +               drm_sched_resubmit_jobs(&ring->sched);
> +               drm_sched_start(&ring->sched, true);
> +       }
> +
> +       amdgpu_device_unlock_adev(adev);
>  }
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] drm/amdgpu: Fix SMU error failure
  2020-08-28 16:05 ` [PATCH v2 4/7] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
@ 2020-08-28 19:29   ` Alex Deucher
  2020-08-28 20:28     ` Andrey Grodzovsky
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Deucher @ 2020-08-28 19:29 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> Wait for HW/PSP initiated ASIC reset to complete before
> starting the recovery operations.
>
> v2: Remove typo
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 9a367a8..06664a9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4844,14 +4844,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>  {
>         struct drm_device *dev = pci_get_drvdata(pdev);
>         struct amdgpu_device *adev = drm_to_adev(dev);
> -       int r;
> +       int r, i;
>         bool vram_lost;
> +       u32 memsize;
>
>         DRM_INFO("PCI error: slot reset callback!!\n");
>
> +       /* wait for asic to come out of reset */
> +       msleep(500);
> +

I wonder if other reset paths need this wait as well?
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

>         pci_restore_state(pdev);
>
> -       adev->in_pci_err_recovery = true;
> +       /* confirm  ASIC came out of reset */
> +       for (i = 0; i < adev->usec_timeout; i++) {
> +               memsize = amdgpu_asic_get_config_memsize(adev);
> +
> +               if (memsize != 0xffffffff)
> +                       break;
> +               udelay(1);
> +       }
> +       if (memsize == 0xffffffff) {
> +               r = -ETIME;
> +               goto out;
> +       }
> +
> +       /* TODO Call amdgpu_pre_asic_reset instead */
> +       adev->in_pci_err_recovery = true;
>         r = amdgpu_device_ip_suspend(adev);
>         adev->in_pci_err_recovery = false;
>         if (r)
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 6/7] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
  2020-08-28 16:05 ` [PATCH v2 6/7] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
@ 2020-08-28 19:30   ` Alex Deucher
  0 siblings, 0 replies; 23+ messages in thread
From: Alex Deucher @ 2020-08-28 19:30 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> Reuse exsisting functions from GPU recovery to avoid code
> duplications.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 75 ++++++------------------------
>  1 file changed, 14 insertions(+), 61 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7f1b970..429167b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4115,7 +4115,8 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>
>  static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>                                struct list_head *device_list_handle,
> -                              bool *need_full_reset_arg)
> +                              bool *need_full_reset_arg,
> +                              bool skip_hw_reset)
>  {
>         struct amdgpu_device *tmp_adev = NULL;
>         bool need_full_reset = *need_full_reset_arg, vram_lost = false;
> @@ -4125,7 +4126,7 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>          * ASIC reset has to be done on all HGMI hive nodes ASAP
>          * to allow proper links negotiation in FW (within 1 sec)
>          */
> -       if (need_full_reset) {
> +       if (!skip_hw_reset && need_full_reset) {
>                 list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>                         /* For XGMI run all resets in parallel to speed up the process */
>                         if (tmp_adev->gmc.xgmi.num_physical_nodes > 1) {
> @@ -4521,7 +4522,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>                 if (r)
>                         adev->asic_reset_res = r;
>         } else {
> -               r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset);
> +               r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset, false);
>                 if (r && r == -EAGAIN)
>                         goto retry;
>         }
> @@ -4850,14 +4851,19 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>         struct drm_device *dev = pci_get_drvdata(pdev);
>         struct amdgpu_device *adev = drm_to_adev(dev);
>         int r, i;
> -       bool vram_lost;
> +       bool need_full_reset = true;
>         u32 memsize;
> +       struct list_head device_list;
>
>         DRM_INFO("PCI error: slot reset callback!!\n");
>
> +       INIT_LIST_HEAD(&device_list);
> +       list_add_tail(&adev->gmc.xgmi.head, &device_list);
> +
>         /* wait for asic to come out of reset */
>         msleep(500);
>
> +       /* Restore PCI confspace */
>         amdgpu_device_load_pci_state(pdev);
>
>         /* confirm  ASIC came out of reset */
> @@ -4873,70 +4879,15 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>                 goto out;
>         }
>
> -       /* TODO Call amdgpu_pre_asic_reset instead */
>         adev->in_pci_err_recovery = true;
> -       r = amdgpu_device_ip_suspend(adev);
> +       r = amdgpu_device_pre_asic_reset(adev, NULL, &need_full_reset);
>         adev->in_pci_err_recovery = false;
>         if (r)
>                 goto out;
>
> -
> -       /* post card */
> -       r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
> -       if (r)
> -               goto out;
> -
> -       r = amdgpu_device_ip_resume_phase1(adev);
> -       if (r)
> -               goto out;
> -
> -       vram_lost = amdgpu_device_check_vram_lost(adev);
> -       if (vram_lost) {
> -               DRM_INFO("VRAM is lost due to GPU reset!\n");
> -               amdgpu_inc_vram_lost(adev);
> -       }
> -
> -       r = amdgpu_gtt_mgr_recover(
> -               &adev->mman.bdev.man[TTM_PL_TT]);
> -       if (r)
> -               goto out;
> -
> -       r = amdgpu_device_fw_loading(adev);
> -       if (r)
> -               return r;
> -
> -       r = amdgpu_device_ip_resume_phase2(adev);
> -       if (r)
> -               goto out;
> -
> -       if (vram_lost)
> -               amdgpu_device_fill_reset_magic(adev);
> -
> -       /*
> -        * Add this ASIC as tracked as reset was already
> -        * complete successfully.
> -        */
> -       amdgpu_register_gpu_instance(adev);
> -
> -       r = amdgpu_device_ip_late_init(adev);
> -       if (r)
> -               goto out;
> -
> -       amdgpu_fbdev_set_suspend(adev, 0);
> -
> -       /* must succeed. */
> -       amdgpu_ras_resume(adev);
> -
> -
> -       amdgpu_irq_gpu_reset_resume_helper(adev);
> -       r = amdgpu_ib_ring_tests(adev);
> -       if (r)
> -               goto out;
> -
> -       r = amdgpu_device_recover_vram(adev);
> +       r = amdgpu_do_asic_reset(NULL, &device_list, &need_full_reset, true);
>
>  out:
> -
>         if (!r) {
>                 amdgpu_device_cache_pci_state(adev->pdev);
>                 DRM_INFO("PCIe error recovery succeeded\n");
> @@ -5022,4 +4973,6 @@ bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
>                 DRM_WARN("Failed to load PCI state, err:%d\n", r);
>                 return false;
>         }
> +
> +       return true;

Looks like this should have been part of a previous patch.  With that fixed:
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>


>  }
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 7/7] drm/amdgpu: Disable DPC for XGMI for now.
  2020-08-28 16:05 ` [PATCH v2 7/7] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
@ 2020-08-28 19:30   ` Alex Deucher
  2020-08-28 19:31     ` Alex Deucher
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Deucher @ 2020-08-28 19:30 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> XGMI support is more complicated then single device support as
> questions synchronization between the device recovering from
> PCI error and other memebers of the hive is required.
> Leaving this for next round.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 429167b..47e16baf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4782,6 +4782,11 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>
>         DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>
> +       if (adev->gmc.xgmi.num_physical_nodes > 1) {
> +               DRM_WARN("No support for XGMI hive yet...");
> +               return PCI_ERS_RESULT_DISCONNECT;
> +       }
> +
>         switch (state) {
>         case pci_channel_io_normal:
>                 return PCI_ERS_RESULT_CAN_RECOVER;
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 7/7] drm/amdgpu: Disable DPC for XGMI for now.
  2020-08-28 19:30   ` Alex Deucher
@ 2020-08-28 19:31     ` Alex Deucher
  0 siblings, 0 replies; 23+ messages in thread
From: Alex Deucher @ 2020-08-28 19:31 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On Fri, Aug 28, 2020 at 3:30 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
> >
> > XGMI support is more complicated then single device support as
> > questions synchronization between the device recovering from
> > PCI error and other memebers of the hive is required.

Typo: "members"
With that fixed:
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

> > Leaving this for next round.
> >
> > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>
> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 429167b..47e16baf 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4782,6 +4782,11 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
> >
> >         DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
> >
> > +       if (adev->gmc.xgmi.num_physical_nodes > 1) {
> > +               DRM_WARN("No support for XGMI hive yet...");
> > +               return PCI_ERS_RESULT_DISCONNECT;
> > +       }
> > +
> >         switch (state) {
> >         case pci_channel_io_normal:
> >                 return PCI_ERS_RESULT_CAN_RECOVER;
> > --
> > 2.7.4
> >
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] drm/amdgpu: Fix SMU error failure
  2020-08-28 19:29   ` Alex Deucher
@ 2020-08-28 20:28     ` Andrey Grodzovsky
  0 siblings, 0 replies; 23+ messages in thread
From: Andrey Grodzovsky @ 2020-08-28 20:28 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li


On 8/28/20 3:29 PM, Alex Deucher wrote:
> On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>> Wait for HW/PSP initiated ASIC reset to complete before
>> starting the recovery operations.
>>
>> v2: Remove typo
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
>>   1 file changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 9a367a8..06664a9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4844,14 +4844,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>   {
>>          struct drm_device *dev = pci_get_drvdata(pdev);
>>          struct amdgpu_device *adev = drm_to_adev(dev);
>> -       int r;
>> +       int r, i;
>>          bool vram_lost;
>> +       u32 memsize;
>>
>>          DRM_INFO("PCI error: slot reset callback!!\n");
>>
>> +       /* wait for asic to come out of reset */
>> +       msleep(500);
>> +
> I wonder if other reset paths need this wait as well?
> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>


They already have this (see psp_v11_0_mode1_reset)

Andrey


>
>>          pci_restore_state(pdev);
>>
>> -       adev->in_pci_err_recovery = true;
>> +       /* confirm  ASIC came out of reset */
>> +       for (i = 0; i < adev->usec_timeout; i++) {
>> +               memsize = amdgpu_asic_get_config_memsize(adev);
>> +
>> +               if (memsize != 0xffffffff)
>> +                       break;
>> +               udelay(1);
>> +       }
>> +       if (memsize == 0xffffffff) {
>> +               r = -ETIME;
>> +               goto out;
>> +       }
>> +
>> +       /* TODO Call amdgpu_pre_asic_reset instead */
>> +       adev->in_pci_err_recovery = true;
>>          r = amdgpu_device_ip_suspend(adev);
>>          adev->in_pci_err_recovery = false;
>>          if (r)
>> --
>> 2.7.4
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7Cbcd9cce89b49498ecc9808d84b88ac73%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637342397661929023&amp;sdata=imwQ%2F25z3XUPWlPtnP2UTCfRmv3Ejx04zWvOHn1Re7k%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-28 16:05 ` [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
  2020-08-28 19:23   ` Alex Deucher
  2020-08-28 19:25   ` Alex Deucher
@ 2020-08-31 12:44   ` Christian König
  2 siblings, 0 replies; 23+ messages in thread
From: Christian König @ 2020-08-31 12:44 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx
  Cc: alexander.deucher, nirmodas, christian.koenig, Dennis.Li

Am 28.08.20 um 18:05 schrieb Andrey Grodzovsky:
> Add DPC handlers with basic recovery functionality.
>
> v2: remove pci_save_state to avoid breaking suspend/resume
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 169 ++++++++++++++++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
>   3 files changed, 184 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 49ea9fa..3399242 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -49,6 +49,8 @@
>   #include <linux/rbtree.h>
>   #include <linux/hashtable.h>
>   #include <linux/dma-fence.h>
> +#include <linux/pci.h>
> +#include <linux/aer.h>
>   
>   #include <drm/ttm/ttm_bo_api.h>
>   #include <drm/ttm/ttm_bo_driver.h>
> @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
>   void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
>   void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
>   
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
> +					   pci_channel_state_t state);
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> +void amdgpu_pci_resume(struct pci_dev *pdev);
> +
> +
>   #include "amdgpu_object.h"
>   
>   /* used by df_v3_6.c and amdgpu_pmu.c */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5a948ed..937f8b0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -350,7 +350,8 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>    *
>    * Returns the 8 bit value from the offset specified.
>    */
> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
> +{

God fix, but unrelated to the patch. Maybe send a style fix up for those 
separately.

>   	if (offset < adev->rmmio_size)
>   		return (readb(adev->rmmio + offset));
>   	BUG();
> @@ -371,7 +372,8 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
>    *
>    * Writes the value specified to the offset specified.
>    */
> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
> +{
>   	if (offset < adev->rmmio_size)
>   		writeb(value, adev->rmmio + offset);
>   	else
> @@ -2989,6 +2991,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>   	NULL
>   };
>   
> +
>   /**
>    * amdgpu_device_init - initialize the driver
>    *
> @@ -3207,6 +3210,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   		}
>   	}
>   
> +	pci_enable_pcie_error_reporting(adev->ddev.pdev);
> +
> +
>   	/* Post card if necessary */
>   	if (amdgpu_device_need_post(adev)) {
>   		if (!adev->bios) {
> @@ -4701,3 +4707,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>   
>   	return 0;
>   }
> +
> +/**
> + * amdgpu_pci_error_detected - Called when a PCI error is detected.
> + * @pdev: PCI device struct
> + * @state: PCI channel state
> + *
> + * Description: Called when a PCI error is detected.
> + *
> + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
> + */
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
> +
> +	switch (state) {
> +	case pci_channel_io_normal:
> +		return PCI_ERS_RESULT_CAN_RECOVER;
> +	case pci_channel_io_frozen: {
> +		/* Fatal error, prepare for slot reset */
> +
> +		amdgpu_device_lock_adev(adev);
> +		return PCI_ERS_RESULT_NEED_RESET;
> +	}

Those extra {} here look superfluous.

Christian.

> +	case pci_channel_io_perm_failure:
> +		/* Permanent error, prepare for device removal */
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	}
> +	return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +/**
> + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
> + * @pdev: pointer to PCI device
> + */
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
> +{
> +
> +	DRM_INFO("PCI error: mmio enabled callback!!\n");
> +
> +	/* TODO - dump whatever for debugging purposes */
> +
> +	/* This called only if amdgpu_pci_error_detected returns
> +	 * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
> +	 * works, no need to reset slot.
> +	 */
> +
> +	return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
> + * @pdev: PCI device struct
> + *
> + * Description: This routine is called by the pci error recovery
> + * code after the PCI slot has been reset, just before we
> + * should resume normal operations.
> + */
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int r;
> +	bool vram_lost;
> +
> +	DRM_INFO("PCI error: slot reset callback!!\n");
> +
> +	pci_restore_state(pdev);
> +
> +	r = amdgpu_device_ip_suspend(adev);
> +	if (r)
> +		goto out;
> +
> +
> +	/* post card */
> +	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_ip_resume_phase1(adev);
> +	if (r)
> +		goto out;
> +
> +	vram_lost = amdgpu_device_check_vram_lost(adev);
> +	if (vram_lost) {
> +		DRM_INFO("VRAM is lost due to GPU reset!\n");
> +		amdgpu_inc_vram_lost(adev);
> +	}
> +
> +	r = amdgpu_gtt_mgr_recover(
> +		&adev->mman.bdev.man[TTM_PL_TT]);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_fw_loading(adev);
> +	if (r)
> +		return r;
> +
> +	r = amdgpu_device_ip_resume_phase2(adev);
> +	if (r)
> +		goto out;
> +
> +	if (vram_lost)
> +		amdgpu_device_fill_reset_magic(adev);
> +
> +	/*
> +	 * Add this ASIC as tracked as reset was already
> +	 * complete successfully.
> +	 */
> +	amdgpu_register_gpu_instance(adev);
> +
> +	r = amdgpu_device_ip_late_init(adev);
> +	if (r)
> +		goto out;
> +
> +	amdgpu_fbdev_set_suspend(adev, 0);
> +
> +	/* must succeed. */
> +	amdgpu_ras_resume(adev);
> +
> +
> +	amdgpu_irq_gpu_reset_resume_helper(adev);
> +	r = amdgpu_ib_ring_tests(adev);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_recover_vram(adev);
> +
> +out:
> +
> +	if (!r)
> +		DRM_INFO("PCIe error recovery succeeded\n");
> +	else {
> +		DRM_ERROR("PCIe error recovery failed, err:%d", r);
> +		amdgpu_device_unlock_adev(adev);
> +	}
> +
> +	return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_resume() - resume normal ops after PCI reset
> + * @pdev: pointer to PCI device
> + *
> + * Called when the error recovery driver tells us that its
> + * OK to resume normal operation. Use completion to allow
> + * halted scsi ops to resume.
> + */
> +void amdgpu_pci_resume(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +	amdgpu_device_unlock_adev(adev);
> +
> +	DRM_INFO("PCI error: resume callback!!\n");
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index d984c6a..4bbcc70 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -31,7 +31,6 @@
>   #include <drm/drm_pciids.h>
>   #include <linux/console.h>
>   #include <linux/module.h>
> -#include <linux/pci.h>
>   #include <linux/pm_runtime.h>
>   #include <linux/vga_switcheroo.h>
>   #include <drm/drm_probe_helper.h>
> @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
>   	.patchlevel = KMS_DRIVER_PATCHLEVEL,
>   };
>   
> +static struct pci_error_handlers amdgpu_pci_err_handler = {
> +	.error_detected	= amdgpu_pci_error_detected,
> +	.mmio_enabled	= amdgpu_pci_mmio_enabled,
> +	.slot_reset	= amdgpu_pci_slot_reset,
> +	.resume		= amdgpu_pci_resume,
> +};
> +
>   static struct pci_driver amdgpu_kms_pci_driver = {
>   	.name = DRIVER_NAME,
>   	.id_table = pciidlist,
> @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>   	.remove = amdgpu_pci_remove,
>   	.shutdown = amdgpu_pci_shutdown,
>   	.driver.pm = &amdgpu_pm_ops,
> +	.err_handler = &amdgpu_pci_err_handler,
>   };
>   
>   static int __init amdgpu_init(void)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-28 19:23   ` Alex Deucher
  2020-08-28 19:24     ` Alex Deucher
@ 2020-08-31 14:26     ` Andrey Grodzovsky
  2020-08-31 14:30       ` Alex Deucher
  1 sibling, 1 reply; 23+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 14:26 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li


On 8/28/20 3:23 PM, Alex Deucher wrote:
> On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>> Add DPC handlers with basic recovery functionality.
>>
>> v2: remove pci_save_state to avoid breaking suspend/resume
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 169 ++++++++++++++++++++++++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
>>   3 files changed, 184 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 49ea9fa..3399242 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -49,6 +49,8 @@
>>   #include <linux/rbtree.h>
>>   #include <linux/hashtable.h>
>>   #include <linux/dma-fence.h>
>> +#include <linux/pci.h>
>> +#include <linux/aer.h>
>>
>>   #include <drm/ttm/ttm_bo_api.h>
>>   #include <drm/ttm/ttm_bo_driver.h>
>> @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
>>   void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
>>   void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
>>
>> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
>> +                                          pci_channel_state_t state);
>> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
>> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
>> +void amdgpu_pci_resume(struct pci_dev *pdev);
>> +
>> +
>>   #include "amdgpu_object.h"
>>
>>   /* used by df_v3_6.c and amdgpu_pmu.c */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 5a948ed..937f8b0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -350,7 +350,8 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>>    *
>>    * Returns the 8 bit value from the offset specified.
>>    */
>> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
>> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>> +{
>>          if (offset < adev->rmmio_size)
>>                  return (readb(adev->rmmio + offset));
>>          BUG();
>> @@ -371,7 +372,8 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
>>    *
>>    * Writes the value specified to the offset specified.
>>    */
>> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
>> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>> +{
>>          if (offset < adev->rmmio_size)
>>                  writeb(value, adev->rmmio + offset);
>>          else
>> @@ -2989,6 +2991,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>>          NULL
>>   };
>>
>> +
>>   /**
>>    * amdgpu_device_init - initialize the driver
>>    *
>> @@ -3207,6 +3210,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>                  }
>>          }
>>
>> +       pci_enable_pcie_error_reporting(adev->ddev.pdev);
>> +
>> +
>>          /* Post card if necessary */
>>          if (amdgpu_device_need_post(adev)) {
>>                  if (!adev->bios) {
>> @@ -4701,3 +4707,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>>
>>          return 0;
>>   }
>> +
>> +/**
>> + * amdgpu_pci_error_detected - Called when a PCI error is detected.
>> + * @pdev: PCI device struct
>> + * @state: PCI channel state
>> + *
>> + * Description: Called when a PCI error is detected.
>> + *
>> + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
>> + */
>> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
>> +{
>> +       struct drm_device *dev = pci_get_drvdata(pdev);
>> +       struct amdgpu_device *adev = drm_to_adev(dev);
>> +
>> +       DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>> +
>> +       switch (state) {
>> +       case pci_channel_io_normal:
>> +               return PCI_ERS_RESULT_CAN_RECOVER;
>> +       case pci_channel_io_frozen: {
>> +               /* Fatal error, prepare for slot reset */
>> +
>> +               amdgpu_device_lock_adev(adev);
>> +               return PCI_ERS_RESULT_NEED_RESET;
>> +       }
>> +       case pci_channel_io_perm_failure:
>> +               /* Permanent error, prepare for device removal */
>> +               return PCI_ERS_RESULT_DISCONNECT;
>> +       }
>> +       return PCI_ERS_RESULT_NEED_RESET;
>> +}
>> +
>> +/**
>> + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
>> + * @pdev: pointer to PCI device
>> + */
>> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
>> +{
>> +
>> +       DRM_INFO("PCI error: mmio enabled callback!!\n");
>> +
>> +       /* TODO - dump whatever for debugging purposes */
>> +
>> +       /* This called only if amdgpu_pci_error_detected returns
>> +        * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
>> +        * works, no need to reset slot.
>> +        */
>> +
>> +       return PCI_ERS_RESULT_RECOVERED;
>> +}
>> +
>> +/**
>> + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
>> + * @pdev: PCI device struct
>> + *
>> + * Description: This routine is called by the pci error recovery
>> + * code after the PCI slot has been reset, just before we
>> + * should resume normal operations.
>> + */
>> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>> +{
>> +       struct drm_device *dev = pci_get_drvdata(pdev);
>> +       struct amdgpu_device *adev = drm_to_adev(dev);
>> +       int r;
>> +       bool vram_lost;
>> +
>> +       DRM_INFO("PCI error: slot reset callback!!\n");
>> +
>> +       pci_restore_state(pdev);
>> +
>> +       r = amdgpu_device_ip_suspend(adev);
>> +       if (r)
>> +               goto out;
>> +
>> +
>> +       /* post card */
>> +       r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
>> +       if (r)
>> +               goto out;
>> +
>> +       r = amdgpu_device_ip_resume_phase1(adev);
>> +       if (r)
>> +               goto out;
>> +
>> +       vram_lost = amdgpu_device_check_vram_lost(adev);
>> +       if (vram_lost) {
>> +               DRM_INFO("VRAM is lost due to GPU reset!\n");
>> +               amdgpu_inc_vram_lost(adev);
>> +       }
>> +
>> +       r = amdgpu_gtt_mgr_recover(
>> +               &adev->mman.bdev.man[TTM_PL_TT]);
>> +       if (r)
>> +               goto out;
>> +
>> +       r = amdgpu_device_fw_loading(adev);
>> +       if (r)
>> +               return r;
>> +
>> +       r = amdgpu_device_ip_resume_phase2(adev);
>> +       if (r)
>> +               goto out;
>> +
>> +       if (vram_lost)
>> +               amdgpu_device_fill_reset_magic(adev);
>> +
>> +       /*
>> +        * Add this ASIC as tracked as reset was already
>> +        * complete successfully.
>> +        */
>> +       amdgpu_register_gpu_instance(adev);
>> +
>> +       r = amdgpu_device_ip_late_init(adev);
>> +       if (r)
>> +               goto out;
>> +
>> +       amdgpu_fbdev_set_suspend(adev, 0);
>> +
>> +       /* must succeed. */
>> +       amdgpu_ras_resume(adev);
>> +
>> +
>> +       amdgpu_irq_gpu_reset_resume_helper(adev);
>> +       r = amdgpu_ib_ring_tests(adev);
>> +       if (r)
>> +               goto out;
>> +
>> +       r = amdgpu_device_recover_vram(adev);
>> +
>> +out:
>> +
>> +       if (!r)
>> +               DRM_INFO("PCIe error recovery succeeded\n");
>> +       else {
>> +               DRM_ERROR("PCIe error recovery failed, err:%d", r);
>> +               amdgpu_device_unlock_adev(adev);
>> +       }
>> +
>> +       return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>> +}
>> +
>> +/**
>> + * amdgpu_pci_resume() - resume normal ops after PCI reset
>> + * @pdev: pointer to PCI device
>> + *
>> + * Called when the error recovery driver tells us that its
>> + * OK to resume normal operation. Use completion to allow
>> + * halted scsi ops to resume.
>> + */
>> +void amdgpu_pci_resume(struct pci_dev *pdev)
>> +{
>> +       struct drm_device *dev = pci_get_drvdata(pdev);
>> +       struct amdgpu_device *adev = drm_to_adev(dev);
>> +
>> +       amdgpu_device_unlock_adev(adev);
>> +
>> +       DRM_INFO("PCI error: resume callback!!\n");
>> +}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index d984c6a..4bbcc70 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -31,7 +31,6 @@
>>   #include <drm/drm_pciids.h>
>>   #include <linux/console.h>
>>   #include <linux/module.h>
>> -#include <linux/pci.h>
> Is this intended?  Seems unrelated.  I think this should be part of
> the previous patch.
>
> Alex


Which previous patch (it's the first one) ?

Andrey


>
>
>>   #include <linux/pm_runtime.h>
>>   #include <linux/vga_switcheroo.h>
>>   #include <drm/drm_probe_helper.h>
>> @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
>>          .patchlevel = KMS_DRIVER_PATCHLEVEL,
>>   };
>>
>> +static struct pci_error_handlers amdgpu_pci_err_handler = {
>> +       .error_detected = amdgpu_pci_error_detected,
>> +       .mmio_enabled   = amdgpu_pci_mmio_enabled,
>> +       .slot_reset     = amdgpu_pci_slot_reset,
>> +       .resume         = amdgpu_pci_resume,
>> +};
>> +
>>   static struct pci_driver amdgpu_kms_pci_driver = {
>>          .name = DRIVER_NAME,
>>          .id_table = pciidlist,
>> @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>>          .remove = amdgpu_pci_remove,
>>          .shutdown = amdgpu_pci_shutdown,
>>          .driver.pm = &amdgpu_pm_ops,
>> +       .err_handler = &amdgpu_pci_err_handler,
>>   };
>>
>>   static int __init amdgpu_init(void)
>> --
>> 2.7.4
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7Cc25d1302358c420604b908d84b87e5a8%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637342394332573022&amp;sdata=MvfAEC9Vc7fwkmBvihaTxDswrRJwFhjLWoEmjQKIIPg%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-31 14:26     ` Andrey Grodzovsky
@ 2020-08-31 14:30       ` Alex Deucher
  0 siblings, 0 replies; 23+ messages in thread
From: Alex Deucher @ 2020-08-31 14:30 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On Mon, Aug 31, 2020 at 10:26 AM Andrey Grodzovsky
<Andrey.Grodzovsky@amd.com> wrote:
>
>
> On 8/28/20 3:23 PM, Alex Deucher wrote:
> > On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >> Add DPC handlers with basic recovery functionality.
> >>
> >> v2: remove pci_save_state to avoid breaking suspend/resume
> >>
> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 169 ++++++++++++++++++++++++++++-
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
> >>   3 files changed, 184 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> index 49ea9fa..3399242 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> @@ -49,6 +49,8 @@
> >>   #include <linux/rbtree.h>
> >>   #include <linux/hashtable.h>
> >>   #include <linux/dma-fence.h>
> >> +#include <linux/pci.h>
> >> +#include <linux/aer.h>
> >>
> >>   #include <drm/ttm/ttm_bo_api.h>
> >>   #include <drm/ttm/ttm_bo_driver.h>
> >> @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
> >>   void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
> >>   void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
> >>
> >> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
> >> +                                          pci_channel_state_t state);
> >> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> >> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> >> +void amdgpu_pci_resume(struct pci_dev *pdev);
> >> +
> >> +
> >>   #include "amdgpu_object.h"
> >>
> >>   /* used by df_v3_6.c and amdgpu_pmu.c */
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index 5a948ed..937f8b0 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -350,7 +350,8 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
> >>    *
> >>    * Returns the 8 bit value from the offset specified.
> >>    */
> >> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> >> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
> >> +{
> >>          if (offset < adev->rmmio_size)
> >>                  return (readb(adev->rmmio + offset));
> >>          BUG();
> >> @@ -371,7 +372,8 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> >>    *
> >>    * Writes the value specified to the offset specified.
> >>    */
> >> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
> >> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
> >> +{
> >>          if (offset < adev->rmmio_size)
> >>                  writeb(value, adev->rmmio + offset);
> >>          else
> >> @@ -2989,6 +2991,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
> >>          NULL
> >>   };
> >>
> >> +
> >>   /**
> >>    * amdgpu_device_init - initialize the driver
> >>    *
> >> @@ -3207,6 +3210,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >>                  }
> >>          }
> >>
> >> +       pci_enable_pcie_error_reporting(adev->ddev.pdev);
> >> +
> >> +
> >>          /* Post card if necessary */
> >>          if (amdgpu_device_need_post(adev)) {
> >>                  if (!adev->bios) {
> >> @@ -4701,3 +4707,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
> >>
> >>          return 0;
> >>   }
> >> +
> >> +/**
> >> + * amdgpu_pci_error_detected - Called when a PCI error is detected.
> >> + * @pdev: PCI device struct
> >> + * @state: PCI channel state
> >> + *
> >> + * Description: Called when a PCI error is detected.
> >> + *
> >> + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
> >> + */
> >> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> >> +{
> >> +       struct drm_device *dev = pci_get_drvdata(pdev);
> >> +       struct amdgpu_device *adev = drm_to_adev(dev);
> >> +
> >> +       DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
> >> +
> >> +       switch (state) {
> >> +       case pci_channel_io_normal:
> >> +               return PCI_ERS_RESULT_CAN_RECOVER;
> >> +       case pci_channel_io_frozen: {
> >> +               /* Fatal error, prepare for slot reset */
> >> +
> >> +               amdgpu_device_lock_adev(adev);
> >> +               return PCI_ERS_RESULT_NEED_RESET;
> >> +       }
> >> +       case pci_channel_io_perm_failure:
> >> +               /* Permanent error, prepare for device removal */
> >> +               return PCI_ERS_RESULT_DISCONNECT;
> >> +       }
> >> +       return PCI_ERS_RESULT_NEED_RESET;
> >> +}
> >> +
> >> +/**
> >> + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
> >> + * @pdev: pointer to PCI device
> >> + */
> >> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
> >> +{
> >> +
> >> +       DRM_INFO("PCI error: mmio enabled callback!!\n");
> >> +
> >> +       /* TODO - dump whatever for debugging purposes */
> >> +
> >> +       /* This called only if amdgpu_pci_error_detected returns
> >> +        * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
> >> +        * works, no need to reset slot.
> >> +        */
> >> +
> >> +       return PCI_ERS_RESULT_RECOVERED;
> >> +}
> >> +
> >> +/**
> >> + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
> >> + * @pdev: PCI device struct
> >> + *
> >> + * Description: This routine is called by the pci error recovery
> >> + * code after the PCI slot has been reset, just before we
> >> + * should resume normal operations.
> >> + */
> >> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> >> +{
> >> +       struct drm_device *dev = pci_get_drvdata(pdev);
> >> +       struct amdgpu_device *adev = drm_to_adev(dev);
> >> +       int r;
> >> +       bool vram_lost;
> >> +
> >> +       DRM_INFO("PCI error: slot reset callback!!\n");
> >> +
> >> +       pci_restore_state(pdev);
> >> +
> >> +       r = amdgpu_device_ip_suspend(adev);
> >> +       if (r)
> >> +               goto out;
> >> +
> >> +
> >> +       /* post card */
> >> +       r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
> >> +       if (r)
> >> +               goto out;
> >> +
> >> +       r = amdgpu_device_ip_resume_phase1(adev);
> >> +       if (r)
> >> +               goto out;
> >> +
> >> +       vram_lost = amdgpu_device_check_vram_lost(adev);
> >> +       if (vram_lost) {
> >> +               DRM_INFO("VRAM is lost due to GPU reset!\n");
> >> +               amdgpu_inc_vram_lost(adev);
> >> +       }
> >> +
> >> +       r = amdgpu_gtt_mgr_recover(
> >> +               &adev->mman.bdev.man[TTM_PL_TT]);
> >> +       if (r)
> >> +               goto out;
> >> +
> >> +       r = amdgpu_device_fw_loading(adev);
> >> +       if (r)
> >> +               return r;
> >> +
> >> +       r = amdgpu_device_ip_resume_phase2(adev);
> >> +       if (r)
> >> +               goto out;
> >> +
> >> +       if (vram_lost)
> >> +               amdgpu_device_fill_reset_magic(adev);
> >> +
> >> +       /*
> >> +        * Add this ASIC as tracked as reset was already
> >> +        * complete successfully.
> >> +        */
> >> +       amdgpu_register_gpu_instance(adev);
> >> +
> >> +       r = amdgpu_device_ip_late_init(adev);
> >> +       if (r)
> >> +               goto out;
> >> +
> >> +       amdgpu_fbdev_set_suspend(adev, 0);
> >> +
> >> +       /* must succeed. */
> >> +       amdgpu_ras_resume(adev);
> >> +
> >> +
> >> +       amdgpu_irq_gpu_reset_resume_helper(adev);
> >> +       r = amdgpu_ib_ring_tests(adev);
> >> +       if (r)
> >> +               goto out;
> >> +
> >> +       r = amdgpu_device_recover_vram(adev);
> >> +
> >> +out:
> >> +
> >> +       if (!r)
> >> +               DRM_INFO("PCIe error recovery succeeded\n");
> >> +       else {
> >> +               DRM_ERROR("PCIe error recovery failed, err:%d", r);
> >> +               amdgpu_device_unlock_adev(adev);
> >> +       }
> >> +
> >> +       return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
> >> +}
> >> +
> >> +/**
> >> + * amdgpu_pci_resume() - resume normal ops after PCI reset
> >> + * @pdev: pointer to PCI device
> >> + *
> >> + * Called when the error recovery driver tells us that its
> >> + * OK to resume normal operation. Use completion to allow
> >> + * halted scsi ops to resume.
> >> + */
> >> +void amdgpu_pci_resume(struct pci_dev *pdev)
> >> +{
> >> +       struct drm_device *dev = pci_get_drvdata(pdev);
> >> +       struct amdgpu_device *adev = drm_to_adev(dev);
> >> +
> >> +       amdgpu_device_unlock_adev(adev);
> >> +
> >> +       DRM_INFO("PCI error: resume callback!!\n");
> >> +}
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> index d984c6a..4bbcc70 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> @@ -31,7 +31,6 @@
> >>   #include <drm/drm_pciids.h>
> >>   #include <linux/console.h>
> >>   #include <linux/module.h>
> >> -#include <linux/pci.h>
> > Is this intended?  Seems unrelated.  I think this should be part of
> > the previous patch.
> >
> > Alex
>
>
> Which previous patch (it's the first one) ?

Ignore this comment.  I mixed up the patch ordering in my head when I
was reviewing this.  See my follow up reply.

Alex

>
> Andrey
>
>
> >
> >
> >>   #include <linux/pm_runtime.h>
> >>   #include <linux/vga_switcheroo.h>
> >>   #include <drm/drm_probe_helper.h>
> >> @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
> >>          .patchlevel = KMS_DRIVER_PATCHLEVEL,
> >>   };
> >>
> >> +static struct pci_error_handlers amdgpu_pci_err_handler = {
> >> +       .error_detected = amdgpu_pci_error_detected,
> >> +       .mmio_enabled   = amdgpu_pci_mmio_enabled,
> >> +       .slot_reset     = amdgpu_pci_slot_reset,
> >> +       .resume         = amdgpu_pci_resume,
> >> +};
> >> +
> >>   static struct pci_driver amdgpu_kms_pci_driver = {
> >>          .name = DRIVER_NAME,
> >>          .id_table = pciidlist,
> >> @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
> >>          .remove = amdgpu_pci_remove,
> >>          .shutdown = amdgpu_pci_shutdown,
> >>          .driver.pm = &amdgpu_pm_ops,
> >> +       .err_handler = &amdgpu_pci_err_handler,
> >>   };
> >>
> >>   static int __init amdgpu_init(void)
> >> --
> >> 2.7.4
> >>
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@lists.freedesktop.org
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7Cc25d1302358c420604b908d84b87e5a8%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637342394332573022&amp;sdata=MvfAEC9Vc7fwkmBvihaTxDswrRJwFhjLWoEmjQKIIPg%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-08-28 19:26   ` Alex Deucher
@ 2020-08-31 20:19     ` Luben Tuikov
  0 siblings, 0 replies; 23+ messages in thread
From: Luben Tuikov @ 2020-08-31 20:19 UTC (permalink / raw)
  To: Alex Deucher, Andrey Grodzovsky
  Cc: Deucher, Alexander, Nirmoy, Christian Koenig, amd-gfx list, Dennis Li

On 2020-08-28 3:26 p.m., Alex Deucher wrote:
> On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>> At this point the ASIC is already post reset by the HW/PSP
>> so the HW not in proper state to be configured for suspension,
>> some bloks might be even gated and so best is to avoid touching it.
> 
> typo: "blocks"

I already mentioned this in the 1st version of the patch--seems to have been ignored.

Regards,
Luben


> 
>>
>> v2: Rename in_dpc to more meaningful name
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 41 +++++++++++++++++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
>>  drivers/gpu/drm/amd/amdgpu/atom.c          |  1 +
>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 ++++++++-----
>>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  3 +++
>>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
>>  8 files changed, 71 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 3399242..cac51e8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -992,6 +992,7 @@ struct amdgpu_device {
>>         atomic_t                        throttling_logging_enabled;
>>         struct ratelimit_state          throttling_logging_rs;
>>         uint32_t                        ras_features;
>> +       bool                            in_pci_err_recovery;
>>  };
>>
>>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 937f8b0..e67cbf2 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>>  {
>>         uint32_t ret;
>>
>> +       if (adev->in_pci_err_recovery)
>> +               return 0;
>> +
>>         if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>>                 return amdgpu_kiq_rreg(adev, reg);
>>
>> @@ -352,6 +355,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>>   */
>>  uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>>  {
>> +       if (adev->in_pci_err_recovery)
>> +               return 0;
>> +
>>         if (offset < adev->rmmio_size)
>>                 return (readb(adev->rmmio + offset));
>>         BUG();
>> @@ -374,14 +380,21 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>>   */
>>  void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>>  {
>> +       if (adev->in_pci_err_recovery)
>> +               return;
>> +
>>         if (offset < adev->rmmio_size)
>>                 writeb(value, adev->rmmio + offset);
>>         else
>>                 BUG();
>>  }
>>
>> -void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
>> +void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
>> +                                      uint32_t v, uint32_t acc_flags)
>>  {
>> +       if (adev->in_pci_err_recovery)
>> +               return;
>> +
>>         trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
>>
>>         if ((reg * 4) < adev->rmmio_size)
>> @@ -409,6 +422,9 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
>>  void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>>                     uint32_t acc_flags)
>>  {
>> +       if (adev->in_pci_err_recovery)
>> +               return;
>> +
>>         if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>>                 return amdgpu_kiq_wreg(adev, reg, v);
>>
>> @@ -423,6 +439,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>>  void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>>                     uint32_t acc_flags)
>>  {
>> +       if (adev->in_pci_err_recovery)
>> +               return;
>> +
>>         if (amdgpu_sriov_fullaccess(adev) &&
>>                 adev->gfx.rlc.funcs &&
>>                 adev->gfx.rlc.funcs->is_rlcg_access_range) {
>> @@ -444,6 +463,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
>>   */
>>  u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>>  {
>> +       if (adev->in_pci_err_recovery)
>> +               return 0;
>> +
>>         if ((reg * 4) < adev->rio_mem_size)
>>                 return ioread32(adev->rio_mem + (reg * 4));
>>         else {
>> @@ -463,6 +485,9 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>>   */
>>  void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>>  {
>> +       if (adev->in_pci_err_recovery)
>> +               return;
>> +
>>         if ((reg * 4) < adev->rio_mem_size)
>>                 iowrite32(v, adev->rio_mem + (reg * 4));
>>         else {
>> @@ -482,6 +507,9 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>>   */
>>  u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>>  {
>> +       if (adev->in_pci_err_recovery)
>> +               return 0;
>> +
>>         if (index < adev->doorbell.num_doorbells) {
>>                 return readl(adev->doorbell.ptr + index);
>>         } else {
>> @@ -502,6 +530,9 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>>   */
>>  void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>>  {
>> +       if (adev->in_pci_err_recovery)
>> +               return;
>> +
>>         if (index < adev->doorbell.num_doorbells) {
>>                 writel(v, adev->doorbell.ptr + index);
>>         } else {
>> @@ -520,6 +551,9 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>>   */
>>  u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>>  {
>> +       if (adev->in_pci_err_recovery)
>> +               return 0;
>> +
>>         if (index < adev->doorbell.num_doorbells) {
>>                 return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
>>         } else {
>> @@ -540,6 +574,9 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>>   */
>>  void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
>>  {
>> +       if (adev->in_pci_err_recovery)
>> +               return;
>> +
>>         if (index < adev->doorbell.num_doorbells) {
>>                 atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
>>         } else {
>> @@ -4778,7 +4815,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>
>>         pci_restore_state(pdev);
>>
>> +       adev->in_pci_err_recovery = true;
>>         r = amdgpu_device_ip_suspend(adev);
>> +       adev->in_pci_err_recovery = false;
>>         if (r)
>>                 goto out;
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> index d698142..8c9bacf 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> @@ -693,6 +693,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
>>         struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>>         struct amdgpu_ring *ring = &kiq->ring;
>>
>> +       if (adev->in_pci_err_recovery)
>> +               return 0;
>> +
>>         BUG_ON(!ring->funcs->emit_rreg);
>>
>>         spin_lock_irqsave(&kiq->ring_lock, flags);
>> @@ -757,6 +760,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>>
>>         BUG_ON(!ring->funcs->emit_wreg);
>>
>> +       if (adev->in_pci_err_recovery)
>> +               return;
>> +
>>         spin_lock_irqsave(&kiq->ring_lock, flags);
>>         amdgpu_ring_alloc(ring, 32);
>>         amdgpu_ring_emit_wreg(ring, reg, v);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>> index d6c38e2..a7771aa 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>> @@ -219,6 +219,9 @@ int psp_wait_for(struct psp_context *psp, uint32_t reg_index,
>>         int i;
>>         struct amdgpu_device *adev = psp->adev;
>>
>> +       if (psp->adev->in_pci_err_recovery)
>> +               return 0;
>> +
>>         for (i = 0; i < adev->usec_timeout; i++) {
>>                 val = RREG32(reg_index);
>>                 if (check_changed) {
>> @@ -245,6 +248,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>         bool ras_intr = false;
>>         bool skip_unsupport = false;
>>
>> +       if (psp->adev->in_pci_err_recovery)
>> +               return 0;
>> +
>>         mutex_lock(&psp->mutex);
>>
>>         memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c
>> index 4cfc786..613dac1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/atom.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/atom.c
>> @@ -750,6 +750,7 @@ static void atom_op_jump(atom_exec_context *ctx, int *ptr, int arg)
>>                                         DRM_ERROR("atombios stuck in loop for more than %dsecs aborting\n",
>>                                                   ATOM_CMD_TIMEOUT_SEC);
>>                                         ctx->abort = true;
>> +                                       dump_stack();
> 
> Leftover from debugging?
> 
> 
> 
>>                                 }
>>                         } else {
>>                                 /* jiffies wrap around we will just wait a little longer */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> index 2db195e..ccf096c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> @@ -6980,15 +6980,19 @@ static int gfx_v10_0_hw_fini(void *handle)
>>
>>         amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
>>         amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
>> +
>> +       if (!adev->in_pci_err_recovery) {
>>  #ifndef BRING_UP_DEBUG
>> -       if (amdgpu_async_gfx_ring) {
>> -               r = gfx_v10_0_kiq_disable_kgq(adev);
>> -               if (r)
>> -                       DRM_ERROR("KGQ disable failed\n");
>> -       }
>> +               if (amdgpu_async_gfx_ring) {
>> +                       r = gfx_v10_0_kiq_disable_kgq(adev);
>> +                       if (r)
>> +                               DRM_ERROR("KGQ disable failed\n");
>> +               }
>>  #endif
>> -       if (amdgpu_gfx_disable_kcq(adev))
>> -               DRM_ERROR("KCQ disable failed\n");
>> +               if (amdgpu_gfx_disable_kcq(adev))
>> +                       DRM_ERROR("KCQ disable failed\n");
>> +       }
>> +
> 
> gfx9 probably needs something similar, but that can come later if we
> ever validate this for older parts.
> 
>>         if (amdgpu_sriov_vf(adev)) {
>>                 gfx_v10_0_cp_gfx_enable(adev, false);
>>                 /* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
>> index 8462b30..306461d 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
>> @@ -224,9 +224,12 @@ int smu_dpm_set_power_gate(struct smu_context *smu, uint32_t block_type,
>>  {
>>         int ret = 0;
>>
>> +
>>         if (!smu->pm_enabled || !smu->adev->pm.dpm_enabled)
>>                 return -EOPNOTSUPP;
>>
>> +
>> +
> 
> Unrelated whitespaces changes.  Please drop.  With this things fixed:
> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
> 
> 
>>         switch (block_type) {
>>         /*
>>          * Some legacy code of amdgpu_vcn.c and vcn_v2*.c still uses
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> index a58ea08..97aa72a 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> @@ -112,6 +112,9 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>>         struct amdgpu_device *adev = smu->adev;
>>         int ret = 0, index = 0;
>>
>> +       if (smu->adev->in_pci_err_recovery)
>> +               return 0;
>> +
>>         index = smu_cmn_to_asic_specific_index(smu,
>>                                                CMN2ASIC_MAPPING_MSG,
>>                                                msg);
>> --
>> 2.7.4
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cluben.tuikov%40amd.com%7Cb1b272ee77ba4d18e56e08d84b885545%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637342396521349293&amp;sdata=%2BLIG7VZ3lpaEUuNydr1XwodqRlAqkiweEcWtk2zppAU%3D&amp;reserved=0
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cluben.tuikov%40amd.com%7Cb1b272ee77ba4d18e56e08d84b885545%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637342396521349293&amp;sdata=%2BLIG7VZ3lpaEUuNydr1XwodqRlAqkiweEcWtk2zppAU%3D&amp;reserved=0
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-08-31 20:19 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-28 16:05 [PATCH v2 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
2020-08-28 16:05 ` [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
2020-08-28 19:23   ` Alex Deucher
2020-08-28 19:24     ` Alex Deucher
2020-08-31 14:26     ` Andrey Grodzovsky
2020-08-31 14:30       ` Alex Deucher
2020-08-28 19:25   ` Alex Deucher
2020-08-31 12:44   ` Christian König
2020-08-28 16:05 ` [PATCH v2 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
2020-08-28 19:26   ` Alex Deucher
2020-08-31 20:19     ` Luben Tuikov
2020-08-28 16:05 ` [PATCH v2 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
2020-08-28 19:28   ` Alex Deucher
2020-08-28 16:05 ` [PATCH v2 4/7] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
2020-08-28 19:29   ` Alex Deucher
2020-08-28 20:28     ` Andrey Grodzovsky
2020-08-28 16:05 ` [PATCH v2 5/7] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
2020-08-28 19:19   ` Alex Deucher
2020-08-28 16:05 ` [PATCH v2 6/7] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
2020-08-28 19:30   ` Alex Deucher
2020-08-28 16:05 ` [PATCH v2 7/7] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
2020-08-28 19:30   ` Alex Deucher
2020-08-28 19:31     ` Alex Deucher

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.