All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/8] Implement PCI Error Recovery on Navi12
@ 2020-08-31 15:50 Andrey Grodzovsky
  2020-08-31 15:50 ` [PATCH v3 1/8] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
                   ` (8 more replies)
  0 siblings, 9 replies; 16+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 15:50 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

Many PCI bus controllers are able to detect a variety of hardware PCI errors on the bus, 
such as parity errors on the data and address buses,  A typical action taken is to disconnect 
the affected device, halting all I/O to it. Typically, a reconnection mechanism is also offered, 
so that the affected PCI device(s) are reset and put back into working condition. 
In our case the reconnection mechanism is facilitated by kernel Downstream Port Containment (DPC) 
driver which will intercept the PCIe error, remove (isolate) the faulting device after which it 
will call into PCIe recovery code of the PCI core. 
This code will call hooks which are implemented in this patchset where the error is 
first reported at which point we block the GPU scheduler, next DPC resets the 
PCI link which generates HW interrupt which is intercepted by SMU/PSP who 
start executing mode1 reset of the ASIC, next step is slot reset hook is called 
at which point we wait for ASIC reset to complete, restore PCI config space and run 
HW suspend/resume sequence to resinit the ASIC. 
Last hook called is resume normal operation at which point we will restart the GPU scheduler.

Andrey Grodzovsky (8):
  drm/amdgpu: Implement DPC recovery
  drm/amdgpu: Avoid accessing HW when suspending SW state
  drm/amdgpu: Block all job scheduling activity during DPC recovery
  drm/amdgpu: Fix SMU error failure
  drm/amdgpu: Fix consecutive DPC recovery failures.
  drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
  drm/amdgpu: Disable DPC for XGMI for now.
  drm/amdgpu: Minor checkpatch fix

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  16 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 298 ++++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |   6 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     |  18 +-
 drivers/gpu/drm/amd/amdgpu/nv.c            |   4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c         |   4 +-
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |   3 +
 9 files changed, 346 insertions(+), 22 deletions(-)

-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 1/8] drm/amdgpu: Implement DPC recovery
  2020-08-31 15:50 [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
@ 2020-08-31 15:50 ` Andrey Grodzovsky
  2020-08-31 20:28   ` Luben Tuikov
  2020-08-31 20:47   ` Luben Tuikov
  2020-08-31 15:50 ` [PATCH v3 2/8] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 16+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 15:50 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

Add DPC handlers with basic recovery functionality.

v2: remove pci_save_state to avoid breaking suspend/resume
v3: Fix style comments

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 162 +++++++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
 3 files changed, 179 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 49ea9fa..3399242 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -49,6 +49,8 @@
 #include <linux/rbtree.h>
 #include <linux/hashtable.h>
 #include <linux/dma-fence.h>
+#include <linux/pci.h>
+#include <linux/aer.h>
 
 #include <drm/ttm/ttm_bo_api.h>
 #include <drm/ttm/ttm_bo_driver.h>
@@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
 void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
 void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
 
+pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
+					   pci_channel_state_t state);
+pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
+pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
+void amdgpu_pci_resume(struct pci_dev *pdev);
+
+
 #include "amdgpu_object.h"
 
 /* used by df_v3_6.c and amdgpu_pmu.c */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5a948ed..67d61a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2989,6 +2989,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
 	NULL
 };
 
+
 /**
  * amdgpu_device_init - initialize the driver
  *
@@ -3207,6 +3208,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 		}
 	}
 
+	pci_enable_pcie_error_reporting(adev->ddev.pdev);
+
+
 	/* Post card if necessary */
 	if (amdgpu_device_need_post(adev)) {
 		if (!adev->bios) {
@@ -4701,3 +4705,161 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
 
 	return 0;
 }
+
+/**
+ * amdgpu_pci_error_detected - Called when a PCI error is detected.
+ * @pdev: PCI device struct
+ * @state: PCI channel state
+ *
+ * Description: Called when a PCI error is detected.
+ *
+ * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
+ */
+pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+
+	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
+
+	switch (state) {
+	case pci_channel_io_normal:
+		return PCI_ERS_RESULT_CAN_RECOVER;
+	case pci_channel_io_frozen:
+		/* Fatal error, prepare for slot reset */
+		amdgpu_device_lock_adev(adev);
+		return PCI_ERS_RESULT_NEED_RESET;
+	case pci_channel_io_perm_failure:
+		/* Permanent error, prepare for device removal */
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
+ * @pdev: pointer to PCI device
+ */
+pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
+{
+
+	DRM_INFO("PCI error: mmio enabled callback!!\n");
+
+	/* TODO - dump whatever for debugging purposes */
+
+	/* This called only if amdgpu_pci_error_detected returns
+	 * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
+	 * works, no need to reset slot.
+	 */
+
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
+ * @pdev: PCI device struct
+ *
+ * Description: This routine is called by the pci error recovery
+ * code after the PCI slot has been reset, just before we
+ * should resume normal operations.
+ */
+pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+	bool vram_lost;
+
+	DRM_INFO("PCI error: slot reset callback!!\n");
+
+	pci_restore_state(pdev);
+
+	r = amdgpu_device_ip_suspend(adev);
+	if (r)
+		goto out;
+
+
+	/* post card */
+	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_ip_resume_phase1(adev);
+	if (r)
+		goto out;
+
+	vram_lost = amdgpu_device_check_vram_lost(adev);
+	if (vram_lost) {
+		DRM_INFO("VRAM is lost due to GPU reset!\n");
+		amdgpu_inc_vram_lost(adev);
+	}
+
+	r = amdgpu_gtt_mgr_recover(
+		&adev->mman.bdev.man[TTM_PL_TT]);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_fw_loading(adev);
+	if (r)
+		return r;
+
+	r = amdgpu_device_ip_resume_phase2(adev);
+	if (r)
+		goto out;
+
+	if (vram_lost)
+		amdgpu_device_fill_reset_magic(adev);
+
+	/*
+	 * Add this ASIC as tracked as reset was already
+	 * complete successfully.
+	 */
+	amdgpu_register_gpu_instance(adev);
+
+	r = amdgpu_device_ip_late_init(adev);
+	if (r)
+		goto out;
+
+	amdgpu_fbdev_set_suspend(adev, 0);
+
+	/* must succeed. */
+	amdgpu_ras_resume(adev);
+
+
+	amdgpu_irq_gpu_reset_resume_helper(adev);
+	r = amdgpu_ib_ring_tests(adev);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_recover_vram(adev);
+
+out:
+
+	if (!r) {
+		DRM_INFO("PCIe error recovery succeeded\n");
+	} else {
+		DRM_ERROR("PCIe error recovery failed, err:%d", r);
+		amdgpu_device_unlock_adev(adev);
+	}
+
+	return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * amdgpu_pci_resume() - resume normal ops after PCI reset
+ * @pdev: pointer to PCI device
+ *
+ * Called when the error recovery driver tells us that its
+ * OK to resume normal operation. Use completion to allow
+ * halted scsi ops to resume.
+ */
+void amdgpu_pci_resume(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+
+	amdgpu_device_unlock_adev(adev);
+
+	DRM_INFO("PCI error: resume callback!!\n");
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d984c6a..4bbcc70 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -31,7 +31,6 @@
 #include <drm/drm_pciids.h>
 #include <linux/console.h>
 #include <linux/module.h>
-#include <linux/pci.h>
 #include <linux/pm_runtime.h>
 #include <linux/vga_switcheroo.h>
 #include <drm/drm_probe_helper.h>
@@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
 	.patchlevel = KMS_DRIVER_PATCHLEVEL,
 };
 
+static struct pci_error_handlers amdgpu_pci_err_handler = {
+	.error_detected	= amdgpu_pci_error_detected,
+	.mmio_enabled	= amdgpu_pci_mmio_enabled,
+	.slot_reset	= amdgpu_pci_slot_reset,
+	.resume		= amdgpu_pci_resume,
+};
+
 static struct pci_driver amdgpu_kms_pci_driver = {
 	.name = DRIVER_NAME,
 	.id_table = pciidlist,
@@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
 	.remove = amdgpu_pci_remove,
 	.shutdown = amdgpu_pci_shutdown,
 	.driver.pm = &amdgpu_pm_ops,
+	.err_handler = &amdgpu_pci_err_handler,
 };
 
 static int __init amdgpu_init(void)
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 2/8] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-08-31 15:50 [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
  2020-08-31 15:50 ` [PATCH v3 1/8] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
@ 2020-08-31 15:50 ` Andrey Grodzovsky
  2020-08-31 15:50 ` [PATCH v3 3/8] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 15:50 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

At this point the ASIC is already post reset by the HW/PSP
so the HW not in proper state to be configured for suspension,
some blocks might be even gated and so best is to avoid touching it.

v2: Rename in_dpc to more meaningful name

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 41 +++++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 ++++++++-----
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
 6 files changed, 67 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 3399242..cac51e8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,6 +992,7 @@ struct amdgpu_device {
 	atomic_t			throttling_logging_enabled;
 	struct ratelimit_state		throttling_logging_rs;
 	uint32_t			ras_features;
+	bool                            in_pci_err_recovery;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 67d61a5..43ce473 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
 {
 	uint32_t ret;
 
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_rreg(adev, reg);
 
@@ -351,6 +354,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
  * Returns the 8 bit value from the offset specified.
  */
 uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (offset < adev->rmmio_size)
 		return (readb(adev->rmmio + offset));
 	BUG();
@@ -372,14 +378,21 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
  * Writes the value specified to the offset specified.
  */
 void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (offset < adev->rmmio_size)
 		writeb(value, adev->rmmio + offset);
 	else
 		BUG();
 }
 
-void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
+void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
+				       uint32_t v, uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
 
 	if ((reg * 4) < adev->rmmio_size)
@@ -407,6 +420,9 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
 void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_wreg(adev, reg, v);
 
@@ -421,6 +437,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (amdgpu_sriov_fullaccess(adev) &&
 		adev->gfx.rlc.funcs &&
 		adev->gfx.rlc.funcs->is_rlcg_access_range) {
@@ -442,6 +461,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
  */
 u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
 {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if ((reg * 4) < adev->rio_mem_size)
 		return ioread32(adev->rio_mem + (reg * 4));
 	else {
@@ -461,6 +483,9 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
  */
 void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if ((reg * 4) < adev->rio_mem_size)
 		iowrite32(v, adev->rio_mem + (reg * 4));
 	else {
@@ -480,6 +505,9 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
  */
 u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
 {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (index < adev->doorbell.num_doorbells) {
 		return readl(adev->doorbell.ptr + index);
 	} else {
@@ -500,6 +528,9 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (index < adev->doorbell.num_doorbells) {
 		writel(v, adev->doorbell.ptr + index);
 	} else {
@@ -518,6 +549,9 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
  */
 u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
 {
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	if (index < adev->doorbell.num_doorbells) {
 		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
 	} else {
@@ -538,6 +572,9 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
 {
+	if (adev->in_pci_err_recovery)
+		return;
+
 	if (index < adev->doorbell.num_doorbells) {
 		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
 	} else {
@@ -4775,7 +4812,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 
 	pci_restore_state(pdev);
 
+	adev->in_pci_err_recovery = true;
 	r = amdgpu_device_ip_suspend(adev);
+	adev->in_pci_err_recovery = false;
 	if (r)
 		goto out;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index d698142..8c9bacf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -693,6 +693,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
 	struct amdgpu_ring *ring = &kiq->ring;
 
+	if (adev->in_pci_err_recovery)
+		return 0;
+
 	BUG_ON(!ring->funcs->emit_rreg);
 
 	spin_lock_irqsave(&kiq->ring_lock, flags);
@@ -757,6 +760,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
 
 	BUG_ON(!ring->funcs->emit_wreg);
 
+	if (adev->in_pci_err_recovery)
+		return;
+
 	spin_lock_irqsave(&kiq->ring_lock, flags);
 	amdgpu_ring_alloc(ring, 32);
 	amdgpu_ring_emit_wreg(ring, reg, v);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index d6c38e2..a7771aa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -219,6 +219,9 @@ int psp_wait_for(struct psp_context *psp, uint32_t reg_index,
 	int i;
 	struct amdgpu_device *adev = psp->adev;
 
+	if (psp->adev->in_pci_err_recovery)
+		return 0;
+
 	for (i = 0; i < adev->usec_timeout; i++) {
 		val = RREG32(reg_index);
 		if (check_changed) {
@@ -245,6 +248,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	bool ras_intr = false;
 	bool skip_unsupport = false;
 
+	if (psp->adev->in_pci_err_recovery)
+		return 0;
+
 	mutex_lock(&psp->mutex);
 
 	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 2db195e..ccf096c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -6980,15 +6980,19 @@ static int gfx_v10_0_hw_fini(void *handle)
 
 	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
 	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
+
+	if (!adev->in_pci_err_recovery) {
 #ifndef BRING_UP_DEBUG
-	if (amdgpu_async_gfx_ring) {
-		r = gfx_v10_0_kiq_disable_kgq(adev);
-		if (r)
-			DRM_ERROR("KGQ disable failed\n");
-	}
+		if (amdgpu_async_gfx_ring) {
+			r = gfx_v10_0_kiq_disable_kgq(adev);
+			if (r)
+				DRM_ERROR("KGQ disable failed\n");
+		}
 #endif
-	if (amdgpu_gfx_disable_kcq(adev))
-		DRM_ERROR("KCQ disable failed\n");
+		if (amdgpu_gfx_disable_kcq(adev))
+			DRM_ERROR("KCQ disable failed\n");
+	}
+
 	if (amdgpu_sriov_vf(adev)) {
 		gfx_v10_0_cp_gfx_enable(adev, false);
 		/* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index a58ea08..97aa72a 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -112,6 +112,9 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
 	struct amdgpu_device *adev = smu->adev;
 	int ret = 0, index = 0;
 
+	if (smu->adev->in_pci_err_recovery)
+		return 0;
+
 	index = smu_cmn_to_asic_specific_index(smu,
 					       CMN2ASIC_MAPPING_MSG,
 					       msg);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 3/8] drm/amdgpu: Block all job scheduling activity during DPC recovery
  2020-08-31 15:50 [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
  2020-08-31 15:50 ` [PATCH v3 1/8] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
  2020-08-31 15:50 ` [PATCH v3 2/8] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
@ 2020-08-31 15:50 ` Andrey Grodzovsky
  2020-08-31 21:00   ` Luben Tuikov
  2020-08-31 15:50 ` [PATCH v3 4/8] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 16+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 15:50 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

DPC recovery involves ASIC reset just as normal GPU recovery so blosk
SW GPU schedulers and wait on all concurrent GPU resets.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 43ce473..c569523 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4743,6 +4743,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
 	return 0;
 }
 
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+	int i;
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		cancel_delayed_work_sync(&ring->sched.work_tdr);
+	}
+}
+
 /**
  * amdgpu_pci_error_detected - Called when a PCI error is detected.
  * @pdev: PCI device struct
@@ -4756,15 +4770,37 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
+	int i;
 
 	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
 
 	switch (state) {
 	case pci_channel_io_normal:
 		return PCI_ERS_RESULT_CAN_RECOVER;
-	case pci_channel_io_frozen:
-		/* Fatal error, prepare for slot reset */
-		amdgpu_device_lock_adev(adev);
+	/* Fatal error, prepare f
+	case pci_channel_io_frozen:		
+		/*		
+		 * Cancel and wait for all TDRs in progress if failing to
+		 * set  adev->in_gpu_reset in amdgpu_device_lock_adev
+		 *
+		 * Locking adev->reset_sem will prevent any external access
+		 * to GPU during PCI error recovery
+		 */
+		while (!amdgpu_device_lock_adev(adev, NULL))
+			amdgpu_cancel_all_tdr(adev);
+
+		/*
+		 * Block any work scheduling as we do for regular GPU reset
+		 * for the duration of the recovery
+		 */
+		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+			struct amdgpu_ring *ring = adev->rings[i];
+
+			if (!ring || !ring->sched.thread)
+				continue;
+
+			drm_sched_stop(&ring->sched, NULL);
+		}
 		return PCI_ERS_RESULT_NEED_RESET;
 	case pci_channel_io_perm_failure:
 		/* Permanent error, prepare for device removal */
@@ -4897,8 +4933,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
+	int i;
 
-	amdgpu_device_unlock_adev(adev);
 
 	DRM_INFO("PCI error: resume callback!!\n");
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+
+		drm_sched_resubmit_jobs(&ring->sched);
+		drm_sched_start(&ring->sched, true);
+	}
+
+	amdgpu_device_unlock_adev(adev);
 }
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 4/8] drm/amdgpu: Fix SMU error failure
  2020-08-31 15:50 [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (2 preceding siblings ...)
  2020-08-31 15:50 ` [PATCH v3 3/8] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
@ 2020-08-31 15:50 ` Andrey Grodzovsky
  2020-08-31 15:50 ` [PATCH v3 5/8] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 15:50 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

Wait for HW/PSP initiated ASIC reset to complete before
starting the recovery operations.

v2: Remove typo

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c569523..4a009ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4841,14 +4841,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
-	int r;
+	int r, i;
 	bool vram_lost;
+	u32 memsize;
 
 	DRM_INFO("PCI error: slot reset callback!!\n");
 
+	/* wait for asic to come out of reset */
+	msleep(500);
+
 	pci_restore_state(pdev);
 
-	adev->in_pci_err_recovery = true;
+	/* confirm  ASIC came out of reset */
+	for (i = 0; i < adev->usec_timeout; i++) {
+		memsize = amdgpu_asic_get_config_memsize(adev);
+
+		if (memsize != 0xffffffff)
+			break;
+		udelay(1);
+	}
+	if (memsize == 0xffffffff) {
+		r = -ETIME;
+		goto out;
+	}
+
+	/* TODO Call amdgpu_pre_asic_reset instead */
+	adev->in_pci_err_recovery = true;	
 	r = amdgpu_device_ip_suspend(adev);
 	adev->in_pci_err_recovery = false;
 	if (r)
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 5/8] drm/amdgpu: Fix consecutive DPC recovery failures.
  2020-08-31 15:50 [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (3 preceding siblings ...)
  2020-08-31 15:50 ` [PATCH v3 4/8] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
@ 2020-08-31 15:50 ` Andrey Grodzovsky
  2020-08-31 21:05   ` Luben Tuikov
  2020-08-31 15:50 ` [PATCH v3 6/8] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 16+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 15:50 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

Cache the PCI state on boot and before each case were we might
loose it.

v2: Add pci_restore_state while caching the PCI state to avoid
breaking PCI core logic for stuff like suspend/resume.

v3: Extract pci_restore_state from amdgpu_device_cache_pci_state
to avoid superflous restores during GPU resets and suspend/resumes.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 ++++++++++++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
 drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
 5 files changed, 71 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index cac51e8..5e74db6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,7 +992,9 @@ struct amdgpu_device {
 	atomic_t			throttling_logging_enabled;
 	struct ratelimit_state		throttling_logging_rs;
 	uint32_t			ras_features;
+
 	bool                            in_pci_err_recovery;
+	struct pci_saved_state          *pci_state;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
@@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
 pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
 void amdgpu_pci_resume(struct pci_dev *pdev);
 
+bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
+bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
+
+
 
 #include "amdgpu_object.h"
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4a009ca..0329b43 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1282,7 +1282,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
 		dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 
 		pci_set_power_state(dev->pdev, PCI_D0);
-		pci_restore_state(dev->pdev);
+		amdgpu_device_load_pci_state(dev->pdev);
 		r = pci_enable_device(dev->pdev);
 		if (r)
 			DRM_WARN("pci_enable_device failed (%d)\n", r);
@@ -1295,7 +1295,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
 		drm_kms_helper_poll_disable(dev);
 		dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 		amdgpu_device_suspend(dev, true);
-		pci_save_state(dev->pdev);
+		amdgpu_device_cache_pci_state(dev->pdev);
 		/* Shut down the device */
 		pci_disable_device(dev->pdev);
 		pci_set_power_state(dev->pdev, PCI_D3cold);
@@ -3400,6 +3400,10 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		dev_err(adev->dev, "amdgpu_pmu_init failed\n");
 
+	/* Have stored pci confspace at hand for restore in sudden PCI error */
+	if (amdgpu_device_cache_pci_state(adev->pdev))
+		pci_restore_state(pdev);
+
 	return 0;
 
 failed:
@@ -3426,6 +3430,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 	flush_delayed_work(&adev->delayed_init_work);
 	adev->shutdown = true;
 
+	kfree(adev->pci_state);
+
 	/* make sure IB test finished before entering exclusive mode
 	 * to avoid preemption on IB test
 	 * */
@@ -4850,7 +4856,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	/* wait for asic to come out of reset */
 	msleep(500);
 
-	pci_restore_state(pdev);
+	amdgpu_device_load_pci_state(pdev);
 
 	/* confirm  ASIC came out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
@@ -4930,6 +4936,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 out:
 
 	if (!r) {
+		if (amdgpu_device_cache_pci_state(adev->pdev))
+			pci_restore_state(adev->pdev);
+
 		DRM_INFO("PCIe error recovery succeeded\n");
 	} else {
 		DRM_ERROR("PCIe error recovery failed, err:%d", r);
@@ -4969,3 +4978,50 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 
 	amdgpu_device_unlock_adev(adev);
 }
+
+bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+
+	r = pci_save_state(pdev);
+	if (!r) {
+		kfree(adev->pci_state);
+
+		adev->pci_state = pci_store_saved_state(pdev);
+
+		if (!adev->pci_state) {
+			DRM_ERROR("Failed to store PCI saved state");
+			return false;
+		}
+	} else {
+		DRM_WARN("Failed to save PCI state, err:%d\n", r);
+		return false;
+	}
+
+	return true;
+}
+
+bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+
+	if (!adev->pci_state)
+		return false;
+
+	r = pci_load_saved_state(pdev, adev->pci_state);
+
+	if (!r) {
+		pci_restore_state(pdev);
+	} else {
+		DRM_WARN("Failed to load PCI state, err:%d\n", r);
+		return false;
+	}
+
+	return true;
+}
+
+
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 4bbcc70..7a6482a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
 		if (amdgpu_is_atpx_hybrid()) {
 			pci_ignore_hotplug(pdev);
 		} else {
-			pci_save_state(pdev);
+			amdgpu_device_cache_pci_state(pdev);
 			pci_disable_device(pdev);
 			pci_ignore_hotplug(pdev);
 			pci_set_power_state(pdev, PCI_D3cold);
@@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
 			pci_set_master(pdev);
 		} else {
 			pci_set_power_state(pdev, PCI_D0);
-			pci_restore_state(pdev);
+			amdgpu_device_load_pci_state(pdev);
 			ret = pci_enable_device(pdev);
 			if (ret)
 				return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 4d14023..0ec6603 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
 	/* disable BM */
 	pci_clear_master(adev->pdev);
 
-	pci_save_state(adev->pdev);
+	amdgpu_device_cache_pci_state(adev->pdev);
 
 	if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
 		dev_info(adev->dev, "GPU smu mode1 reset\n");
@@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
 
 	if (ret)
 		dev_err(adev->dev, "GPU mode1 reset failed\n");
-	pci_restore_state(adev->pdev);
+	amdgpu_device_load_pci_state(adev->pdev);
 
 	/* wait for asic to come out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 2f93c47..ddd55e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
 	/* disable BM */
 	pci_clear_master(adev->pdev);
 
-	pci_save_state(adev->pdev);
+	amdgpu_device_cache_pci_state(adev->pdev);
 
 	ret = psp_gpu_reset(adev);
 	if (ret)
 		dev_err(adev->dev, "GPU mode1 reset failed\n");
 
-	pci_restore_state(adev->pdev);
+	amdgpu_device_load_pci_state(adev->pdev);
 
 	/* wait for asic to come out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 6/8] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
  2020-08-31 15:50 [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (4 preceding siblings ...)
  2020-08-31 15:50 ` [PATCH v3 5/8] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
@ 2020-08-31 15:50 ` Andrey Grodzovsky
  2020-08-31 15:50 ` [PATCH v3 7/8] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 15:50 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

Reuse exsisting functions from GPU recovery to avoid code
duplications.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 73 +++++-------------------------
 1 file changed, 12 insertions(+), 61 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0329b43..b31868a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4114,7 +4114,8 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 
 static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 			       struct list_head *device_list_handle,
-			       bool *need_full_reset_arg)
+			       bool *need_full_reset_arg,
+			       bool skip_hw_reset)
 {
 	struct amdgpu_device *tmp_adev = NULL;
 	bool need_full_reset = *need_full_reset_arg, vram_lost = false;
@@ -4124,7 +4125,7 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 	 * ASIC reset has to be done on all HGMI hive nodes ASAP
 	 * to allow proper links negotiation in FW (within 1 sec)
 	 */
-	if (need_full_reset) {
+	if (!skip_hw_reset && need_full_reset) {
 		list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
 			/* For XGMI run all resets in parallel to speed up the process */
 			if (tmp_adev->gmc.xgmi.num_physical_nodes > 1) {
@@ -4520,7 +4521,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		if (r)
 			adev->asic_reset_res = r;
 	} else {
-		r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset);
+		r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset, false);
 		if (r && r == -EAGAIN)
 			goto retry;
 	}
@@ -4848,14 +4849,19 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
 	int r, i;
-	bool vram_lost;
+	bool need_full_reset = true;
 	u32 memsize;
+	struct list_head device_list;
 
 	DRM_INFO("PCI error: slot reset callback!!\n");
 
+	INIT_LIST_HEAD(&device_list);
+	list_add_tail(&adev->gmc.xgmi.head, &device_list);
+
 	/* wait for asic to come out of reset */
 	msleep(500);
 
+	/* Restore PCI confspace */
 	amdgpu_device_load_pci_state(pdev);
 
 	/* confirm  ASIC came out of reset */
@@ -4871,70 +4877,15 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 		goto out;
 	}
 
-	/* TODO Call amdgpu_pre_asic_reset instead */
 	adev->in_pci_err_recovery = true;	
-	r = amdgpu_device_ip_suspend(adev);
+	r = amdgpu_device_pre_asic_reset(adev, NULL, &need_full_reset);
 	adev->in_pci_err_recovery = false;
 	if (r)
 		goto out;
 
-
-	/* post card */
-	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_ip_resume_phase1(adev);
-	if (r)
-		goto out;
-
-	vram_lost = amdgpu_device_check_vram_lost(adev);
-	if (vram_lost) {
-		DRM_INFO("VRAM is lost due to GPU reset!\n");
-		amdgpu_inc_vram_lost(adev);
-	}
-
-	r = amdgpu_gtt_mgr_recover(
-		&adev->mman.bdev.man[TTM_PL_TT]);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_fw_loading(adev);
-	if (r)
-		return r;
-
-	r = amdgpu_device_ip_resume_phase2(adev);
-	if (r)
-		goto out;
-
-	if (vram_lost)
-		amdgpu_device_fill_reset_magic(adev);
-
-	/*
-	 * Add this ASIC as tracked as reset was already
-	 * complete successfully.
-	 */
-	amdgpu_register_gpu_instance(adev);
-
-	r = amdgpu_device_ip_late_init(adev);
-	if (r)
-		goto out;
-
-	amdgpu_fbdev_set_suspend(adev, 0);
-
-	/* must succeed. */
-	amdgpu_ras_resume(adev);
-
-
-	amdgpu_irq_gpu_reset_resume_helper(adev);
-	r = amdgpu_ib_ring_tests(adev);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_recover_vram(adev);
+	r = amdgpu_do_asic_reset(NULL, &device_list, &need_full_reset, true);
 
 out:
-
 	if (!r) {
 		if (amdgpu_device_cache_pci_state(adev->pdev))
 			pci_restore_state(adev->pdev);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 7/8] drm/amdgpu: Disable DPC for XGMI for now.
  2020-08-31 15:50 [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (5 preceding siblings ...)
  2020-08-31 15:50 ` [PATCH v3 6/8] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
@ 2020-08-31 15:50 ` Andrey Grodzovsky
  2020-08-31 21:09   ` Luben Tuikov
  2020-08-31 15:50 ` [PATCH v3 8/8] drm/amdgpu: Minor checkpatch fix Andrey Grodzovsky
  2020-08-31 16:47 ` [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Nirmoy
  8 siblings, 1 reply; 16+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 15:50 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

XGMI support is more complicated then single device support as
questions of synchronization between the device recovering from
PCI error and other members of the hive is required.
Leaving this for next round.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b31868a..fe720c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4781,6 +4781,11 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 
 	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
 
+	if (adev->gmc.xgmi.num_physical_nodes > 1) {
+		DRM_WARN("No support for XGMI hive yet...");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
 	switch (state) {
 	case pci_channel_io_normal:
 		return PCI_ERS_RESULT_CAN_RECOVER;
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 8/8] drm/amdgpu: Minor checkpatch fix
  2020-08-31 15:50 [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (6 preceding siblings ...)
  2020-08-31 15:50 ` [PATCH v3 7/8] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
@ 2020-08-31 15:50 ` Andrey Grodzovsky
  2020-08-31 16:47 ` [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Nirmoy
  8 siblings, 0 replies; 16+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 15:50 UTC (permalink / raw)
  To: amd-gfx
  Cc: alexander.deucher, Andrey Grodzovsky, nirmodas, christian.koenig,
	Dennis.Li

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fe720c2..16c7842 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -353,7 +353,8 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
  *
  * Returns the 8 bit value from the offset specified.
  */
-uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
+{
 	if (adev->in_pci_err_recovery)
 		return 0;
 
@@ -377,7 +378,8 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
  *
  * Writes the value specified to the offset specified.
  */
-void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
+void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
+{
 	if (adev->in_pci_err_recovery)
 		return;
 
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/8] Implement PCI Error Recovery on Navi12
  2020-08-31 15:50 [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (7 preceding siblings ...)
  2020-08-31 15:50 ` [PATCH v3 8/8] drm/amdgpu: Minor checkpatch fix Andrey Grodzovsky
@ 2020-08-31 16:47 ` Nirmoy
  8 siblings, 0 replies; 16+ messages in thread
From: Nirmoy @ 2020-08-31 16:47 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: alexander.deucher, christian.koenig, Dennis.Li

Hi Andrey,


I need to understand more about pci saved state. So excluding patch 5 
the series is Acked-by: Nirmoy Das <nirmoy.das@amd.com>.



Regards,

Nirmoy


On 8/31/20 5:50 PM, Andrey Grodzovsky wrote:
> Many PCI bus controllers are able to detect a variety of hardware PCI errors on the bus,
> such as parity errors on the data and address buses,  A typical action taken is to disconnect
> the affected device, halting all I/O to it. Typically, a reconnection mechanism is also offered,
> so that the affected PCI device(s) are reset and put back into working condition.
> In our case the reconnection mechanism is facilitated by kernel Downstream Port Containment (DPC)
> driver which will intercept the PCIe error, remove (isolate) the faulting device after which it
> will call into PCIe recovery code of the PCI core.
> This code will call hooks which are implemented in this patchset where the error is
> first reported at which point we block the GPU scheduler, next DPC resets the
> PCI link which generates HW interrupt which is intercepted by SMU/PSP who
> start executing mode1 reset of the ASIC, next step is slot reset hook is called
> at which point we wait for ASIC reset to complete, restore PCI config space and run
> HW suspend/resume sequence to resinit the ASIC.
> Last hook called is resume normal operation at which point we will restart the GPU scheduler.
>
> Andrey Grodzovsky (8):
>    drm/amdgpu: Implement DPC recovery
>    drm/amdgpu: Avoid accessing HW when suspending SW state
>    drm/amdgpu: Block all job scheduling activity during DPC recovery
>    drm/amdgpu: Fix SMU error failure
>    drm/amdgpu: Fix consecutive DPC recovery failures.
>    drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
>    drm/amdgpu: Disable DPC for XGMI for now.
>    drm/amdgpu: Minor checkpatch fix
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  16 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 298 ++++++++++++++++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  13 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |   6 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |   6 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     |  18 +-
>   drivers/gpu/drm/amd/amdgpu/nv.c            |   4 +-
>   drivers/gpu/drm/amd/amdgpu/soc15.c         |   4 +-
>   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |   3 +
>   9 files changed, 346 insertions(+), 22 deletions(-)
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/8] drm/amdgpu: Implement DPC recovery
  2020-08-31 15:50 ` [PATCH v3 1/8] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
@ 2020-08-31 20:28   ` Luben Tuikov
  2020-08-31 20:47   ` Luben Tuikov
  1 sibling, 0 replies; 16+ messages in thread
From: Luben Tuikov @ 2020-08-31 20:28 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx
  Cc: alexander.deucher, nirmodas, christian.koenig, Dennis.Li

On 2020-08-31 11:50 a.m., Andrey Grodzovsky wrote:
> Add DPC handlers with basic recovery functionality.
> 

When this goes into the kernel log, it would
be quite unclear what "DPC" stands for,
as it is used undefined in this patch and this
patchset.

I'd write,

	Add PCI Downstream Port Containment (DPC) with
	basic recovery functionality.

Regards,
Luben

> v2: remove pci_save_state to avoid breaking suspend/resume
> v3: Fix style comments
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 162 +++++++++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
>  3 files changed, 179 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 49ea9fa..3399242 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -49,6 +49,8 @@
>  #include <linux/rbtree.h>
>  #include <linux/hashtable.h>
>  #include <linux/dma-fence.h>
> +#include <linux/pci.h>
> +#include <linux/aer.h>
>  
>  #include <drm/ttm/ttm_bo_api.h>
>  #include <drm/ttm/ttm_bo_driver.h>
> @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
>  void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
>  void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
>  
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
> +					   pci_channel_state_t state);
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> +void amdgpu_pci_resume(struct pci_dev *pdev);
> +
> +
>  #include "amdgpu_object.h"
>  
>  /* used by df_v3_6.c and amdgpu_pmu.c */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5a948ed..67d61a5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2989,6 +2989,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>  	NULL
>  };
>  
> +
>  /**
>   * amdgpu_device_init - initialize the driver
>   *
> @@ -3207,6 +3208,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  		}
>  	}
>  
> +	pci_enable_pcie_error_reporting(adev->ddev.pdev);
> +
> +
>  	/* Post card if necessary */
>  	if (amdgpu_device_need_post(adev)) {
>  		if (!adev->bios) {
> @@ -4701,3 +4705,161 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>  
>  	return 0;
>  }
> +
> +/**
> + * amdgpu_pci_error_detected - Called when a PCI error is detected.
> + * @pdev: PCI device struct
> + * @state: PCI channel state
> + *
> + * Description: Called when a PCI error is detected.
> + *
> + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
> + */
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
> +
> +	switch (state) {
> +	case pci_channel_io_normal:
> +		return PCI_ERS_RESULT_CAN_RECOVER;
> +	case pci_channel_io_frozen:
> +		/* Fatal error, prepare for slot reset */
> +		amdgpu_device_lock_adev(adev);
> +		return PCI_ERS_RESULT_NEED_RESET;
> +	case pci_channel_io_perm_failure:
> +		/* Permanent error, prepare for device removal */
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	}
> +
> +	return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +/**
> + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
> + * @pdev: pointer to PCI device
> + */
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
> +{
> +
> +	DRM_INFO("PCI error: mmio enabled callback!!\n");
> +
> +	/* TODO - dump whatever for debugging purposes */
> +
> +	/* This called only if amdgpu_pci_error_detected returns
> +	 * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
> +	 * works, no need to reset slot.
> +	 */
> +
> +	return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
> + * @pdev: PCI device struct
> + *
> + * Description: This routine is called by the pci error recovery
> + * code after the PCI slot has been reset, just before we
> + * should resume normal operations.
> + */
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int r;
> +	bool vram_lost;
> +
> +	DRM_INFO("PCI error: slot reset callback!!\n");
> +
> +	pci_restore_state(pdev);
> +
> +	r = amdgpu_device_ip_suspend(adev);
> +	if (r)
> +		goto out;
> +
> +
> +	/* post card */
> +	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_ip_resume_phase1(adev);
> +	if (r)
> +		goto out;
> +
> +	vram_lost = amdgpu_device_check_vram_lost(adev);
> +	if (vram_lost) {
> +		DRM_INFO("VRAM is lost due to GPU reset!\n");
> +		amdgpu_inc_vram_lost(adev);
> +	}
> +
> +	r = amdgpu_gtt_mgr_recover(
> +		&adev->mman.bdev.man[TTM_PL_TT]);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_fw_loading(adev);
> +	if (r)
> +		return r;
> +
> +	r = amdgpu_device_ip_resume_phase2(adev);
> +	if (r)
> +		goto out;
> +
> +	if (vram_lost)
> +		amdgpu_device_fill_reset_magic(adev);
> +
> +	/*
> +	 * Add this ASIC as tracked as reset was already
> +	 * complete successfully.
> +	 */
> +	amdgpu_register_gpu_instance(adev);
> +
> +	r = amdgpu_device_ip_late_init(adev);
> +	if (r)
> +		goto out;
> +
> +	amdgpu_fbdev_set_suspend(adev, 0);
> +
> +	/* must succeed. */
> +	amdgpu_ras_resume(adev);
> +
> +
> +	amdgpu_irq_gpu_reset_resume_helper(adev);
> +	r = amdgpu_ib_ring_tests(adev);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_recover_vram(adev);
> +
> +out:
> +
> +	if (!r) {
> +		DRM_INFO("PCIe error recovery succeeded\n");
> +	} else {
> +		DRM_ERROR("PCIe error recovery failed, err:%d", r);
> +		amdgpu_device_unlock_adev(adev);
> +	}
> +
> +	return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_resume() - resume normal ops after PCI reset
> + * @pdev: pointer to PCI device
> + *
> + * Called when the error recovery driver tells us that its
> + * OK to resume normal operation. Use completion to allow
> + * halted scsi ops to resume.
> + */
> +void amdgpu_pci_resume(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +	amdgpu_device_unlock_adev(adev);
> +
> +	DRM_INFO("PCI error: resume callback!!\n");
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index d984c6a..4bbcc70 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -31,7 +31,6 @@
>  #include <drm/drm_pciids.h>
>  #include <linux/console.h>
>  #include <linux/module.h>
> -#include <linux/pci.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/vga_switcheroo.h>
>  #include <drm/drm_probe_helper.h>
> @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
>  	.patchlevel = KMS_DRIVER_PATCHLEVEL,
>  };
>  
> +static struct pci_error_handlers amdgpu_pci_err_handler = {
> +	.error_detected	= amdgpu_pci_error_detected,
> +	.mmio_enabled	= amdgpu_pci_mmio_enabled,
> +	.slot_reset	= amdgpu_pci_slot_reset,
> +	.resume		= amdgpu_pci_resume,
> +};
> +
>  static struct pci_driver amdgpu_kms_pci_driver = {
>  	.name = DRIVER_NAME,
>  	.id_table = pciidlist,
> @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>  	.remove = amdgpu_pci_remove,
>  	.shutdown = amdgpu_pci_shutdown,
>  	.driver.pm = &amdgpu_pm_ops,
> +	.err_handler = &amdgpu_pci_err_handler,
>  };
>  
>  static int __init amdgpu_init(void)
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/8] drm/amdgpu: Implement DPC recovery
  2020-08-31 15:50 ` [PATCH v3 1/8] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
  2020-08-31 20:28   ` Luben Tuikov
@ 2020-08-31 20:47   ` Luben Tuikov
  1 sibling, 0 replies; 16+ messages in thread
From: Luben Tuikov @ 2020-08-31 20:47 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx
  Cc: alexander.deucher, nirmodas, christian.koenig, Dennis.Li

On 2020-08-31 11:50 a.m., Andrey Grodzovsky wrote:
> Add DPC handlers with basic recovery functionality.
> 
> v2: remove pci_save_state to avoid breaking suspend/resume
> v3: Fix style comments
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 162 +++++++++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
>  3 files changed, 179 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 49ea9fa..3399242 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -49,6 +49,8 @@
>  #include <linux/rbtree.h>
>  #include <linux/hashtable.h>
>  #include <linux/dma-fence.h>
> +#include <linux/pci.h>
> +#include <linux/aer.h>
>  
>  #include <drm/ttm/ttm_bo_api.h>
>  #include <drm/ttm/ttm_bo_driver.h>
> @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
>  void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
>  void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
>  
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
> +					   pci_channel_state_t state);
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> +void amdgpu_pci_resume(struct pci_dev *pdev);
> +
> +
>  #include "amdgpu_object.h"

checkpatch.pl reports too many extra empty lines added,
like the one above and a few below.

Regards,
Luben

>  
>  /* used by df_v3_6.c and amdgpu_pmu.c */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5a948ed..67d61a5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2989,6 +2989,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>  	NULL
>  };
>  
> +
>  /**
>   * amdgpu_device_init - initialize the driver
>   *
> @@ -3207,6 +3208,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  		}
>  	}
>  
> +	pci_enable_pcie_error_reporting(adev->ddev.pdev);
> +
> +
>  	/* Post card if necessary */
>  	if (amdgpu_device_need_post(adev)) {
>  		if (!adev->bios) {
> @@ -4701,3 +4705,161 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>  
>  	return 0;
>  }
> +
> +/**
> + * amdgpu_pci_error_detected - Called when a PCI error is detected.
> + * @pdev: PCI device struct
> + * @state: PCI channel state
> + *
> + * Description: Called when a PCI error is detected.
> + *
> + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
> + */
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
> +
> +	switch (state) {
> +	case pci_channel_io_normal:
> +		return PCI_ERS_RESULT_CAN_RECOVER;
> +	case pci_channel_io_frozen:
> +		/* Fatal error, prepare for slot reset */
> +		amdgpu_device_lock_adev(adev);
> +		return PCI_ERS_RESULT_NEED_RESET;
> +	case pci_channel_io_perm_failure:
> +		/* Permanent error, prepare for device removal */
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	}
> +
> +	return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +/**
> + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
> + * @pdev: pointer to PCI device
> + */
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
> +{
> +
> +	DRM_INFO("PCI error: mmio enabled callback!!\n");
> +
> +	/* TODO - dump whatever for debugging purposes */
> +
> +	/* This called only if amdgpu_pci_error_detected returns
> +	 * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
> +	 * works, no need to reset slot.
> +	 */
> +
> +	return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
> + * @pdev: PCI device struct
> + *
> + * Description: This routine is called by the pci error recovery
> + * code after the PCI slot has been reset, just before we
> + * should resume normal operations.
> + */
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int r;
> +	bool vram_lost;
> +
> +	DRM_INFO("PCI error: slot reset callback!!\n");
> +
> +	pci_restore_state(pdev);
> +
> +	r = amdgpu_device_ip_suspend(adev);
> +	if (r)
> +		goto out;
> +
> +
> +	/* post card */
> +	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_ip_resume_phase1(adev);
> +	if (r)
> +		goto out;
> +
> +	vram_lost = amdgpu_device_check_vram_lost(adev);
> +	if (vram_lost) {
> +		DRM_INFO("VRAM is lost due to GPU reset!\n");
> +		amdgpu_inc_vram_lost(adev);
> +	}
> +
> +	r = amdgpu_gtt_mgr_recover(
> +		&adev->mman.bdev.man[TTM_PL_TT]);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_fw_loading(adev);
> +	if (r)
> +		return r;
> +
> +	r = amdgpu_device_ip_resume_phase2(adev);
> +	if (r)
> +		goto out;
> +
> +	if (vram_lost)
> +		amdgpu_device_fill_reset_magic(adev);
> +
> +	/*
> +	 * Add this ASIC as tracked as reset was already
> +	 * complete successfully.
> +	 */
> +	amdgpu_register_gpu_instance(adev);
> +
> +	r = amdgpu_device_ip_late_init(adev);
> +	if (r)
> +		goto out;
> +
> +	amdgpu_fbdev_set_suspend(adev, 0);
> +
> +	/* must succeed. */
> +	amdgpu_ras_resume(adev);
> +
> +
> +	amdgpu_irq_gpu_reset_resume_helper(adev);
> +	r = amdgpu_ib_ring_tests(adev);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_recover_vram(adev);
> +
> +out:
> +
> +	if (!r) {
> +		DRM_INFO("PCIe error recovery succeeded\n");
> +	} else {
> +		DRM_ERROR("PCIe error recovery failed, err:%d", r);
> +		amdgpu_device_unlock_adev(adev);
> +	}
> +
> +	return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_resume() - resume normal ops after PCI reset
> + * @pdev: pointer to PCI device
> + *
> + * Called when the error recovery driver tells us that its
> + * OK to resume normal operation. Use completion to allow
> + * halted scsi ops to resume.
> + */
> +void amdgpu_pci_resume(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +	amdgpu_device_unlock_adev(adev);
> +
> +	DRM_INFO("PCI error: resume callback!!\n");
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index d984c6a..4bbcc70 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -31,7 +31,6 @@
>  #include <drm/drm_pciids.h>
>  #include <linux/console.h>
>  #include <linux/module.h>
> -#include <linux/pci.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/vga_switcheroo.h>
>  #include <drm/drm_probe_helper.h>
> @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
>  	.patchlevel = KMS_DRIVER_PATCHLEVEL,
>  };
>  
> +static struct pci_error_handlers amdgpu_pci_err_handler = {
> +	.error_detected	= amdgpu_pci_error_detected,
> +	.mmio_enabled	= amdgpu_pci_mmio_enabled,
> +	.slot_reset	= amdgpu_pci_slot_reset,
> +	.resume		= amdgpu_pci_resume,
> +};
> +
>  static struct pci_driver amdgpu_kms_pci_driver = {
>  	.name = DRIVER_NAME,
>  	.id_table = pciidlist,
> @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>  	.remove = amdgpu_pci_remove,
>  	.shutdown = amdgpu_pci_shutdown,
>  	.driver.pm = &amdgpu_pm_ops,
> +	.err_handler = &amdgpu_pci_err_handler,
>  };
>  
>  static int __init amdgpu_init(void)
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/8] drm/amdgpu: Block all job scheduling activity during DPC recovery
  2020-08-31 15:50 ` [PATCH v3 3/8] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
@ 2020-08-31 21:00   ` Luben Tuikov
  2020-08-31 22:04     ` Andrey Grodzovsky
  0 siblings, 1 reply; 16+ messages in thread
From: Luben Tuikov @ 2020-08-31 21:00 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx
  Cc: alexander.deucher, nirmodas, christian.koenig, Dennis.Li

On 2020-08-31 11:50 a.m., Andrey Grodzovsky wrote:
> DPC recovery involves ASIC reset just as normal GPU recovery so blosk

Again, typo: "blosk" --> "blocks".

> SW GPU schedulers and wait on all concurrent GPU resets.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Acked-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
>  1 file changed, 53 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 43ce473..c569523 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4743,6 +4743,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>  	return 0;
>  }
>  
> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> +{
> +	int i;
> +
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		struct amdgpu_ring *ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +		cancel_delayed_work_sync(&ring->sched.work_tdr);
> +	}
> +}
> +
>  /**
>   * amdgpu_pci_error_detected - Called when a PCI error is detected.
>   * @pdev: PCI device struct
> @@ -4756,15 +4770,37 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>  {
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int i;
>  
>  	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>  
>  	switch (state) {
>  	case pci_channel_io_normal:
>  		return PCI_ERS_RESULT_CAN_RECOVER;
> -	case pci_channel_io_frozen:
> -		/* Fatal error, prepare for slot reset */
> -		amdgpu_device_lock_adev(adev);
> +	/* Fatal error, prepare f
> +	case pci_channel_io_frozen:		
> +		/*		
> +		 * Cancel and wait for all TDRs in progress if failing to
> +		 * set  adev->in_gpu_reset in amdgpu_device_lock_adev
> +		 *
> +		 * Locking adev->reset_sem will prevent any external access
> +		 * to GPU during PCI error recovery
> +		 */
> +		while (!amdgpu_device_lock_adev(adev, NULL))
> +			amdgpu_cancel_all_tdr(adev);

As in v1, I don't see how this is protected from
polling forever. "amdgpu_cancel_all_tdr()" defined above,
doesn't actively cancel anything--it just waits.

"amdgpu_device_lock_adev()" similarly only modifies
a few variables.

Neither of those two functions seem to escalate or otherwise
reset the hardware. If the reset/hw blocks for whatever
reason, this loop will hang forever.

Regards,
Luben

> +
> +		/*
> +		 * Block any work scheduling as we do for regular GPU reset
> +		 * for the duration of the recovery
> +		 */
> +		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +			struct amdgpu_ring *ring = adev->rings[i];
> +
> +			if (!ring || !ring->sched.thread)
> +				continue;
> +
> +			drm_sched_stop(&ring->sched, NULL);
> +		}
>  		return PCI_ERS_RESULT_NEED_RESET;
>  	case pci_channel_io_perm_failure:
>  		/* Permanent error, prepare for device removal */
> @@ -4897,8 +4933,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>  {
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int i;
>  
> -	amdgpu_device_unlock_adev(adev);
>  
>  	DRM_INFO("PCI error: resume callback!!\n");
> +
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		struct amdgpu_ring *ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +
> +		drm_sched_resubmit_jobs(&ring->sched);
> +		drm_sched_start(&ring->sched, true);
> +	}
> +
> +	amdgpu_device_unlock_adev(adev);
>  }
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 5/8] drm/amdgpu: Fix consecutive DPC recovery failures.
  2020-08-31 15:50 ` [PATCH v3 5/8] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
@ 2020-08-31 21:05   ` Luben Tuikov
  0 siblings, 0 replies; 16+ messages in thread
From: Luben Tuikov @ 2020-08-31 21:05 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx
  Cc: alexander.deucher, nirmodas, christian.koenig, Dennis.Li

On 2020-08-31 11:50 a.m., Andrey Grodzovsky wrote:
> Cache the PCI state on boot and before each case were we might
> loose it.

Typo: "were" --> "where".

> 
> v2: Add pci_restore_state while caching the PCI state to avoid
> breaking PCI core logic for stuff like suspend/resume.
> 
> v3: Extract pci_restore_state from amdgpu_device_cache_pci_state
> to avoid superflous restores during GPU resets and suspend/resumes.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 ++++++++++++++++++++++++++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
>  drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
>  drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
>  5 files changed, 71 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index cac51e8..5e74db6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,7 +992,9 @@ struct amdgpu_device {
>  	atomic_t			throttling_logging_enabled;
>  	struct ratelimit_state		throttling_logging_rs;
>  	uint32_t			ras_features;
> +
>  	bool                            in_pci_err_recovery;
> +	struct pci_saved_state          *pci_state;
>  };
>  
>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> @@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
>  pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
>  void amdgpu_pci_resume(struct pci_dev *pdev);
>  
> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
> +
> +

Unnecessary extra line added.

Regards,
Luben

>  
>  #include "amdgpu_object.h"
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4a009ca..0329b43 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1282,7 +1282,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>  		dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>  
>  		pci_set_power_state(dev->pdev, PCI_D0);
> -		pci_restore_state(dev->pdev);
> +		amdgpu_device_load_pci_state(dev->pdev);
>  		r = pci_enable_device(dev->pdev);
>  		if (r)
>  			DRM_WARN("pci_enable_device failed (%d)\n", r);
> @@ -1295,7 +1295,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>  		drm_kms_helper_poll_disable(dev);
>  		dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>  		amdgpu_device_suspend(dev, true);
> -		pci_save_state(dev->pdev);
> +		amdgpu_device_cache_pci_state(dev->pdev);
>  		/* Shut down the device */
>  		pci_disable_device(dev->pdev);
>  		pci_set_power_state(dev->pdev, PCI_D3cold);
> @@ -3400,6 +3400,10 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  	if (r)
>  		dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>  
> +	/* Have stored pci confspace at hand for restore in sudden PCI error */
> +	if (amdgpu_device_cache_pci_state(adev->pdev))
> +		pci_restore_state(pdev);
> +
>  	return 0;
>  
>  failed:
> @@ -3426,6 +3430,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>  	flush_delayed_work(&adev->delayed_init_work);
>  	adev->shutdown = true;
>  
> +	kfree(adev->pci_state);
> +
>  	/* make sure IB test finished before entering exclusive mode
>  	 * to avoid preemption on IB test
>  	 * */
> @@ -4850,7 +4856,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>  	/* wait for asic to come out of reset */
>  	msleep(500);
>  
> -	pci_restore_state(pdev);
> +	amdgpu_device_load_pci_state(pdev);
>  
>  	/* confirm  ASIC came out of reset */
>  	for (i = 0; i < adev->usec_timeout; i++) {
> @@ -4930,6 +4936,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>  out:
>  
>  	if (!r) {
> +		if (amdgpu_device_cache_pci_state(adev->pdev))
> +			pci_restore_state(adev->pdev);
> +
>  		DRM_INFO("PCIe error recovery succeeded\n");
>  	} else {
>  		DRM_ERROR("PCIe error recovery failed, err:%d", r);
> @@ -4969,3 +4978,50 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>  
>  	amdgpu_device_unlock_adev(adev);
>  }
> +
> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int r;
> +
> +	r = pci_save_state(pdev);
> +	if (!r) {
> +		kfree(adev->pci_state);
> +
> +		adev->pci_state = pci_store_saved_state(pdev);
> +
> +		if (!adev->pci_state) {
> +			DRM_ERROR("Failed to store PCI saved state");
> +			return false;
> +		}
> +	} else {
> +		DRM_WARN("Failed to save PCI state, err:%d\n", r);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int r;
> +
> +	if (!adev->pci_state)
> +		return false;
> +
> +	r = pci_load_saved_state(pdev, adev->pci_state);
> +
> +	if (!r) {
> +		pci_restore_state(pdev);
> +	} else {
> +		DRM_WARN("Failed to load PCI state, err:%d\n", r);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 4bbcc70..7a6482a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
>  		if (amdgpu_is_atpx_hybrid()) {
>  			pci_ignore_hotplug(pdev);
>  		} else {
> -			pci_save_state(pdev);
> +			amdgpu_device_cache_pci_state(pdev);
>  			pci_disable_device(pdev);
>  			pci_ignore_hotplug(pdev);
>  			pci_set_power_state(pdev, PCI_D3cold);
> @@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
>  			pci_set_master(pdev);
>  		} else {
>  			pci_set_power_state(pdev, PCI_D0);
> -			pci_restore_state(pdev);
> +			amdgpu_device_load_pci_state(pdev);
>  			ret = pci_enable_device(pdev);
>  			if (ret)
>  				return ret;
> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
> index 4d14023..0ec6603 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> @@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>  	/* disable BM */
>  	pci_clear_master(adev->pdev);
>  
> -	pci_save_state(adev->pdev);
> +	amdgpu_device_cache_pci_state(adev->pdev);
>  
>  	if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
>  		dev_info(adev->dev, "GPU smu mode1 reset\n");
> @@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>  
>  	if (ret)
>  		dev_err(adev->dev, "GPU mode1 reset failed\n");
> -	pci_restore_state(adev->pdev);
> +	amdgpu_device_load_pci_state(adev->pdev);
>  
>  	/* wait for asic to come out of reset */
>  	for (i = 0; i < adev->usec_timeout; i++) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 2f93c47..ddd55e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
>  	/* disable BM */
>  	pci_clear_master(adev->pdev);
>  
> -	pci_save_state(adev->pdev);
> +	amdgpu_device_cache_pci_state(adev->pdev);
>  
>  	ret = psp_gpu_reset(adev);
>  	if (ret)
>  		dev_err(adev->dev, "GPU mode1 reset failed\n");
>  
> -	pci_restore_state(adev->pdev);
> +	amdgpu_device_load_pci_state(adev->pdev);
>  
>  	/* wait for asic to come out of reset */
>  	for (i = 0; i < adev->usec_timeout; i++) {
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 7/8] drm/amdgpu: Disable DPC for XGMI for now.
  2020-08-31 15:50 ` [PATCH v3 7/8] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
@ 2020-08-31 21:09   ` Luben Tuikov
  0 siblings, 0 replies; 16+ messages in thread
From: Luben Tuikov @ 2020-08-31 21:09 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx
  Cc: alexander.deucher, nirmodas, christian.koenig, Dennis.Li

On 2020-08-31 11:50 a.m., Andrey Grodzovsky wrote:
> XGMI support is more complicated then single device support as

Typo: "then" --> "than".

> questions of synchronization between the device recovering from
> PCI error and other members of the hive is required.

Typo: "is" --> "are", as the subject of this long sentence changes
from "XGMI support" to "questions", which is plural: "questions ... _are_ required."

Regards,
Luben

> Leaving this for next round.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b31868a..fe720c2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4781,6 +4781,11 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>  
>  	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>  
> +	if (adev->gmc.xgmi.num_physical_nodes > 1) {
> +		DRM_WARN("No support for XGMI hive yet...");
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	}
> +
>  	switch (state) {
>  	case pci_channel_io_normal:
>  		return PCI_ERS_RESULT_CAN_RECOVER;
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/8] drm/amdgpu: Block all job scheduling activity during DPC recovery
  2020-08-31 21:00   ` Luben Tuikov
@ 2020-08-31 22:04     ` Andrey Grodzovsky
  0 siblings, 0 replies; 16+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 22:04 UTC (permalink / raw)
  To: Luben Tuikov, amd-gfx
  Cc: alexander.deucher, nirmodas, christian.koenig, Dennis.Li


On 8/31/20 5:00 PM, Luben Tuikov wrote:
> On 2020-08-31 11:50 a.m., Andrey Grodzovsky wrote:
>> DPC recovery involves ASIC reset just as normal GPU recovery so blosk
> Again, typo: "blosk" --> "blocks".
>
>> SW GPU schedulers and wait on all concurrent GPU resets.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Acked-by: Alex Deucher <alexander.deucher@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
>>   1 file changed, 53 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 43ce473..c569523 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4743,6 +4743,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>>   	return 0;
>>   }
>>   
>> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +		struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +		if (!ring || !ring->sched.thread)
>> +			continue;
>> +
>> +		cancel_delayed_work_sync(&ring->sched.work_tdr);
>> +	}
>> +}
>> +
>>   /**
>>    * amdgpu_pci_error_detected - Called when a PCI error is detected.
>>    * @pdev: PCI device struct
>> @@ -4756,15 +4770,37 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>>   {
>>   	struct drm_device *dev = pci_get_drvdata(pdev);
>>   	struct amdgpu_device *adev = drm_to_adev(dev);
>> +	int i;
>>   
>>   	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>>   
>>   	switch (state) {
>>   	case pci_channel_io_normal:
>>   		return PCI_ERS_RESULT_CAN_RECOVER;
>> -	case pci_channel_io_frozen:
>> -		/* Fatal error, prepare for slot reset */
>> -		amdgpu_device_lock_adev(adev);
>> +	/* Fatal error, prepare f
>> +	case pci_channel_io_frozen:		
>> +		/*		
>> +		 * Cancel and wait for all TDRs in progress if failing to
>> +		 * set  adev->in_gpu_reset in amdgpu_device_lock_adev
>> +		 *
>> +		 * Locking adev->reset_sem will prevent any external access
>> +		 * to GPU during PCI error recovery
>> +		 */
>> +		while (!amdgpu_device_lock_adev(adev, NULL))
>> +			amdgpu_cancel_all_tdr(adev);
> As in v1, I don't see how this is protected from
> polling forever. "amdgpu_cancel_all_tdr()" defined above,
> doesn't actively cancel anything--it just waits.
>
> "amdgpu_device_lock_adev()" similarly only modifies
> a few variables.
>
> Neither of those two functions seem to escalate or otherwise
> reset the hardware. If the reset/hw blocks for whatever
> reason, this loop will hang forever.
>
> Regards,
> Luben


But the fact that reset blocks for ever for some reason is not acceptable anyway 
- it's a bug
we must handle and so I don't think our code here should be robust for this bug 
case because if we
are in this scenario already we will have a lot of other issues anyway and the 
driver will stop being usable.
Your comment made me see there is a question of possible starvation when trying 
to amdgpu_device_lock_adev
because we could say that if the HW is dead and jobs keep coming you will have 
new job timeouts
all the time and they could sneak in and hijack adev->in_gpu_reset so that PCIe 
error recovery never
has a chance to run but, I looked into the code and that is not the case since 
drm_sched_select_entity->drm_sched_ready
will return NULL for any scheduler that already has his HW ring full of 
unfinished jobs and so no new timeouts will be
triggered so PCIe recovery code will have eventually his opportunity to lock the 
device and run.

Andrey


>
>> +
>> +		/*
>> +		 * Block any work scheduling as we do for regular GPU reset
>> +		 * for the duration of the recovery
>> +		 */
>> +		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +			struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +			if (!ring || !ring->sched.thread)
>> +				continue;
>> +
>> +			drm_sched_stop(&ring->sched, NULL);
>> +		}
>>   		return PCI_ERS_RESULT_NEED_RESET;
>>   	case pci_channel_io_perm_failure:
>>   		/* Permanent error, prepare for device removal */
>> @@ -4897,8 +4933,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>>   {
>>   	struct drm_device *dev = pci_get_drvdata(pdev);
>>   	struct amdgpu_device *adev = drm_to_adev(dev);
>> +	int i;
>>   
>> -	amdgpu_device_unlock_adev(adev);
>>   
>>   	DRM_INFO("PCI error: resume callback!!\n");
>> +
>> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +		struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +		if (!ring || !ring->sched.thread)
>> +			continue;
>> +
>> +
>> +		drm_sched_resubmit_jobs(&ring->sched);
>> +		drm_sched_start(&ring->sched, true);
>> +	}
>> +
>> +	amdgpu_device_unlock_adev(adev);
>>   }
>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-08-31 22:04 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-31 15:50 [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
2020-08-31 15:50 ` [PATCH v3 1/8] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
2020-08-31 20:28   ` Luben Tuikov
2020-08-31 20:47   ` Luben Tuikov
2020-08-31 15:50 ` [PATCH v3 2/8] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
2020-08-31 15:50 ` [PATCH v3 3/8] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
2020-08-31 21:00   ` Luben Tuikov
2020-08-31 22:04     ` Andrey Grodzovsky
2020-08-31 15:50 ` [PATCH v3 4/8] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
2020-08-31 15:50 ` [PATCH v3 5/8] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
2020-08-31 21:05   ` Luben Tuikov
2020-08-31 15:50 ` [PATCH v3 6/8] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
2020-08-31 15:50 ` [PATCH v3 7/8] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
2020-08-31 21:09   ` Luben Tuikov
2020-08-31 15:50 ` [PATCH v3 8/8] drm/amdgpu: Minor checkpatch fix Andrey Grodzovsky
2020-08-31 16:47 ` [PATCH v3 0/8] Implement PCI Error Recovery on Navi12 Nirmoy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.