All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] Implement PCI Error Recovery on Navi12
@ 2020-08-26 14:46 Andrey Grodzovsky
  2020-08-26 14:46 ` [PATCH 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-26 14:46 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, Andrey Grodzovsky, nirmodas

Many PCI bus controllers are able to detect a variety of hardware PCI errors on the bus, 
such as parity errors on the data and address buses,  A typical action taken is to disconnect 
the affected device, halting all I/O to it. Typically, a reconnection mechanism is also offered, 
so that the affected PCI device(s) are reset and put back into working condition. 
In our case the reconnection mechanism is facilitated by kernel Downstream Port Containment (DPC) 
driver which will intercept the PCIe error, remove (isolate) the faulting device and reset the PCI 
link of the device after which it will call into PCIe recovery code of the PCI core. 
This code will call hooks which are implemented in this patchset where the error is 
first reported at which point we block the GPU scheduler, next DPC resets the 
PCI link which generates HW interrupt which is intercepted by SMU/PSP who 
start executing mode1 reset of the ASIC, next step is slot reset hook is called 
at which point we wait for ASIC reset to complete, restore PCI config space and run 
HW suspend/resume sequence to resinit the ASIC. 
Last hook called is resume normal operation at which point we will restart the GPU scheduler.

More info on PCIe error handling and DPC are here:
https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html
https://patchwork.kernel.org/patch/8945681/

Andrey Grodzovsky (7):
  drm/amdgpu: Implement DPC recovery
  drm/amdgpu: Avoid accessing HW when suspending SW state
  drm/amdgpu: Block all job scheduling activity during DPC recovery
  drm/amdgpu: Fix SMU error failure
  drm/amdgpu: Fix consecutive DPC	recoveries failure.
  drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
  drm/amdgpu: Disable DPC for XGMI for now.

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  16 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 289 ++++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |   6 +
 drivers/gpu/drm/amd/amdgpu/atom.c          |   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     |  18 +-
 drivers/gpu/drm/amd/amdgpu/nv.c            |   4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c         |   4 +-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |   3 +
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |   3 +
 11 files changed, 344 insertions(+), 19 deletions(-)

-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-26 14:46 [PATCH 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
@ 2020-08-26 14:46 ` Andrey Grodzovsky
  2020-08-26 15:08   ` Alex Deucher
                     ` (2 more replies)
  2020-08-26 14:46 ` [PATCH 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-26 14:46 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, Andrey Grodzovsky, nirmodas

Add DPC handlers with basic recovery functionality.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 181 ++++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
 3 files changed, 196 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 49ea9fa..3399242 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -49,6 +49,8 @@
 #include <linux/rbtree.h>
 #include <linux/hashtable.h>
 #include <linux/dma-fence.h>
+#include <linux/pci.h>
+#include <linux/aer.h>
 
 #include <drm/ttm/ttm_bo_api.h>
 #include <drm/ttm/ttm_bo_driver.h>
@@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
 void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
 void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
 
+pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
+					   pci_channel_state_t state);
+pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
+pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
+void amdgpu_pci_resume(struct pci_dev *pdev);
+
+
 #include "amdgpu_object.h"
 
 /* used by df_v3_6.c and amdgpu_pmu.c */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5a948ed..84f8d14 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -350,7 +350,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
  *
  * Returns the 8 bit value from the offset specified.
  */
-uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
+{
+
 	if (offset < adev->rmmio_size)
 		return (readb(adev->rmmio + offset));
 	BUG();
@@ -371,7 +373,9 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
  *
  * Writes the value specified to the offset specified.
  */
-void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
+void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
+{
+
 	if (offset < adev->rmmio_size)
 		writeb(value, adev->rmmio + offset);
 	else
@@ -380,6 +384,7 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
 
 void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
 {
+
 	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
 
 	if ((reg * 4) < adev->rmmio_size)
@@ -407,6 +412,7 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
 void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_wreg(adev, reg, v);
 
@@ -461,6 +467,7 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
  */
 void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
 {
+
 	if ((reg * 4) < adev->rio_mem_size)
 		iowrite32(v, adev->rio_mem + (reg * 4));
 	else {
@@ -480,6 +487,7 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
  */
 u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
 {
+
 	if (index < adev->doorbell.num_doorbells) {
 		return readl(adev->doorbell.ptr + index);
 	} else {
@@ -500,6 +508,7 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
 {
+
 	if (index < adev->doorbell.num_doorbells) {
 		writel(v, adev->doorbell.ptr + index);
 	} else {
@@ -518,6 +527,7 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
  */
 u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
 {
+
 	if (index < adev->doorbell.num_doorbells) {
 		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
 	} else {
@@ -538,6 +548,7 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
 {
+
 	if (index < adev->doorbell.num_doorbells) {
 		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
 	} else {
@@ -2989,6 +3000,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
 	NULL
 };
 
+
 /**
  * amdgpu_device_init - initialize the driver
  *
@@ -3207,6 +3219,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 		}
 	}
 
+	pci_enable_pcie_error_reporting(adev->ddev.pdev);
+
+
 	/* Post card if necessary */
 	if (amdgpu_device_need_post(adev)) {
 		if (!adev->bios) {
@@ -3359,6 +3374,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		dev_err(adev->dev, "amdgpu_pmu_init failed\n");
 
+	if (pci_save_state(pdev))
+		DRM_ERROR("Failed to save PCI state!!\n");
+
 	return 0;
 
 failed:
@@ -4701,3 +4719,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
 
 	return 0;
 }
+
+/**
+ * amdgpu_pci_error_detected - Called when a PCI error is detected.
+ * @pdev: PCI device struct
+ * @state: PCI channel state
+ *
+ * Description: Called when a PCI error is detected.
+ *
+ * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
+ */
+pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+
+	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
+
+	switch (state) {
+	case pci_channel_io_normal:
+		return PCI_ERS_RESULT_CAN_RECOVER;
+	case pci_channel_io_frozen: {
+		/* Fatal error, prepare for slot reset */
+
+		amdgpu_device_lock_adev(adev);
+		return PCI_ERS_RESULT_NEED_RESET;
+	}
+	case pci_channel_io_perm_failure:
+		/* Permanent error, prepare for device removal */
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
+ * @pdev: pointer to PCI device
+ */
+pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
+{
+
+	DRM_INFO("PCI error: mmio enabled callback!!\n");
+
+	/* TODO - dump whatever for debugging purposes */
+
+	/* This called only if amdgpu_pci_error_detected returns
+	 * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
+	 * works, no need to reset slot.
+	 */
+
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
+ * @pdev: PCI device struct
+ *
+ * Description: This routine is called by the pci error recovery
+ * code after the PCI slot has been reset, just before we
+ * should resume normal operations.
+ */
+pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+	bool vram_lost;
+
+	DRM_INFO("PCI error: slot reset callback!!\n");
+
+	pci_restore_state(pdev);
+
+	r = amdgpu_device_ip_suspend(adev);
+	if (r)
+		goto out;
+
+
+	/* post card */
+	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_ip_resume_phase1(adev);
+	if (r)
+		goto out;
+
+	vram_lost = amdgpu_device_check_vram_lost(adev);
+	if (vram_lost) {
+		DRM_INFO("VRAM is lost due to GPU reset!\n");
+		amdgpu_inc_vram_lost(adev);
+	}
+
+	r = amdgpu_gtt_mgr_recover(
+		&adev->mman.bdev.man[TTM_PL_TT]);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_fw_loading(adev);
+	if (r)
+		return r;
+
+	r = amdgpu_device_ip_resume_phase2(adev);
+	if (r)
+		goto out;
+
+	if (vram_lost)
+		amdgpu_device_fill_reset_magic(adev);
+
+	/*
+	 * Add this ASIC as tracked as reset was already
+	 * complete successfully.
+	 */
+	amdgpu_register_gpu_instance(adev);
+
+	r = amdgpu_device_ip_late_init(adev);
+	if (r)
+		goto out;
+
+	amdgpu_fbdev_set_suspend(adev, 0);
+
+	/* must succeed. */
+	amdgpu_ras_resume(adev);
+
+
+	amdgpu_irq_gpu_reset_resume_helper(adev);
+	r = amdgpu_ib_ring_tests(adev);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_recover_vram(adev);
+
+out:
+
+	if (!r)
+		DRM_INFO("PCIe error recovery succeeded\n");
+	else {
+		DRM_ERROR("PCIe error recovery failed, err:%d", r);
+		amdgpu_device_unlock_adev(adev);
+	}
+
+	return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * amdgpu_pci_resume() - resume normal ops after PCI reset
+ * @pdev: pointer to PCI device
+ *
+ * Called when the error recovery driver tells us that its
+ * OK to resume normal operation. Use completion to allow
+ * halted scsi ops to resume.
+ */
+void amdgpu_pci_resume(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+
+	amdgpu_device_unlock_adev(adev);
+
+	DRM_INFO("PCI error: resume callback!!\n");
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d984c6a..4bbcc70 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -31,7 +31,6 @@
 #include <drm/drm_pciids.h>
 #include <linux/console.h>
 #include <linux/module.h>
-#include <linux/pci.h>
 #include <linux/pm_runtime.h>
 #include <linux/vga_switcheroo.h>
 #include <drm/drm_probe_helper.h>
@@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
 	.patchlevel = KMS_DRIVER_PATCHLEVEL,
 };
 
+static struct pci_error_handlers amdgpu_pci_err_handler = {
+	.error_detected	= amdgpu_pci_error_detected,
+	.mmio_enabled	= amdgpu_pci_mmio_enabled,
+	.slot_reset	= amdgpu_pci_slot_reset,
+	.resume		= amdgpu_pci_resume,
+};
+
 static struct pci_driver amdgpu_kms_pci_driver = {
 	.name = DRIVER_NAME,
 	.id_table = pciidlist,
@@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
 	.remove = amdgpu_pci_remove,
 	.shutdown = amdgpu_pci_shutdown,
 	.driver.pm = &amdgpu_pm_ops,
+	.err_handler = &amdgpu_pci_err_handler,
 };
 
 static int __init amdgpu_init(void)
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-08-26 14:46 [PATCH 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
  2020-08-26 14:46 ` [PATCH 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
@ 2020-08-26 14:46 ` Andrey Grodzovsky
  2020-08-26 15:09   ` Alex Deucher
                     ` (2 more replies)
  2020-08-26 14:46 ` [PATCH 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-26 14:46 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, Andrey Grodzovsky, nirmodas

At this point the ASIC is already post reset by the HW/PSP
so the HW not in proper state to be configured for suspension,
some bloks might be even gated and so best is to avoid touching it.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 37 ++++++++++++++++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
 drivers/gpu/drm/amd/amdgpu/atom.c          |  1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 +++++++++------
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  3 +++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
 8 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 3399242..3489622 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,6 +992,7 @@ struct amdgpu_device {
 	atomic_t			throttling_logging_enabled;
 	struct ratelimit_state		throttling_logging_rs;
 	uint32_t			ras_features;
+	bool                            in_dpc;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 84f8d14..e1bbefd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
 {
 	uint32_t ret;
 
+	if (adev->in_dpc)
+		return 0;
+
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_rreg(adev, reg);
 
@@ -350,8 +353,10 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
  *
  * Returns the 8 bit value from the offset specified.
  */
-uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
-{
+uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+
+	if (adev->in_dpc)
+		return 0;
 
 	if (offset < adev->rmmio_size)
 		return (readb(adev->rmmio + offset));
@@ -373,8 +378,10 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
  *
  * Writes the value specified to the offset specified.
  */
-void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
-{
+void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
+
+	if (adev->in_dpc)
+		return;
 
 	if (offset < adev->rmmio_size)
 		writeb(value, adev->rmmio + offset);
@@ -384,6 +391,8 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
 
 void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
 {
+	if (adev->in_dpc)
+		return;
 
 	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
 
@@ -412,6 +421,8 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
 void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+	if (adev->in_dpc)
+		return;
 
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_wreg(adev, reg, v);
@@ -427,6 +438,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+	if (adev->in_dpc)
+		return;
+
 	if (amdgpu_sriov_fullaccess(adev) &&
 		adev->gfx.rlc.funcs &&
 		adev->gfx.rlc.funcs->is_rlcg_access_range) {
@@ -448,6 +462,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
  */
 u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
 {
+	if (adev->in_dpc)
+		return 0;
+
 	if ((reg * 4) < adev->rio_mem_size)
 		return ioread32(adev->rio_mem + (reg * 4));
 	else {
@@ -467,6 +484,8 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
  */
 void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
 {
+	if (adev->in_dpc)
+		return;
 
 	if ((reg * 4) < adev->rio_mem_size)
 		iowrite32(v, adev->rio_mem + (reg * 4));
@@ -487,6 +506,8 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
  */
 u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
 {
+	if (adev->in_dpc)
+		return 0;
 
 	if (index < adev->doorbell.num_doorbells) {
 		return readl(adev->doorbell.ptr + index);
@@ -508,6 +529,8 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
 {
+	if (adev->in_dpc)
+		return;
 
 	if (index < adev->doorbell.num_doorbells) {
 		writel(v, adev->doorbell.ptr + index);
@@ -527,6 +550,8 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
  */
 u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
 {
+	if (adev->in_dpc)
+		return 0;
 
 	if (index < adev->doorbell.num_doorbells) {
 		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
@@ -548,6 +573,8 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
 {
+	if (adev->in_dpc)
+			return;
 
 	if (index < adev->doorbell.num_doorbells) {
 		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
@@ -4790,7 +4817,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 
 	pci_restore_state(pdev);
 
+	adev->in_dpc = true;
 	r = amdgpu_device_ip_suspend(adev);
+	adev->in_dpc = false;
 	if (r)
 		goto out;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index d698142..50c42c9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -693,6 +693,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
 	struct amdgpu_ring *ring = &kiq->ring;
 
+	if (adev->in_dpc)
+		return 0;
+
 	BUG_ON(!ring->funcs->emit_rreg);
 
 	spin_lock_irqsave(&kiq->ring_lock, flags);
@@ -757,6 +760,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
 
 	BUG_ON(!ring->funcs->emit_wreg);
 
+	if (adev->in_dpc)
+		return;
+
 	spin_lock_irqsave(&kiq->ring_lock, flags);
 	amdgpu_ring_alloc(ring, 32);
 	amdgpu_ring_emit_wreg(ring, reg, v);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index d6c38e2..555ef59 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -219,6 +219,9 @@ int psp_wait_for(struct psp_context *psp, uint32_t reg_index,
 	int i;
 	struct amdgpu_device *adev = psp->adev;
 
+	if (psp->adev->in_dpc)
+		return 0;
+
 	for (i = 0; i < adev->usec_timeout; i++) {
 		val = RREG32(reg_index);
 		if (check_changed) {
@@ -245,6 +248,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	bool ras_intr = false;
 	bool skip_unsupport = false;
 
+	if (psp->adev->in_dpc)
+		return 0;
+
 	mutex_lock(&psp->mutex);
 
 	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c
index 4cfc786..613dac1 100644
--- a/drivers/gpu/drm/amd/amdgpu/atom.c
+++ b/drivers/gpu/drm/amd/amdgpu/atom.c
@@ -750,6 +750,7 @@ static void atom_op_jump(atom_exec_context *ctx, int *ptr, int arg)
 					DRM_ERROR("atombios stuck in loop for more than %dsecs aborting\n",
 						  ATOM_CMD_TIMEOUT_SEC);
 					ctx->abort = true;
+					dump_stack();
 				}
 			} else {
 				/* jiffies wrap around we will just wait a little longer */
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 2db195e..86f268e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -6980,15 +6980,19 @@ static int gfx_v10_0_hw_fini(void *handle)
 
 	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
 	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
+
+	if (!adev->in_dpc) {
 #ifndef BRING_UP_DEBUG
-	if (amdgpu_async_gfx_ring) {
-		r = gfx_v10_0_kiq_disable_kgq(adev);
-		if (r)
-			DRM_ERROR("KGQ disable failed\n");
-	}
+		if (amdgpu_async_gfx_ring) {
+			r = gfx_v10_0_kiq_disable_kgq(adev);
+			if (r)
+				DRM_ERROR("KGQ disable failed\n");
+		}
 #endif
-	if (amdgpu_gfx_disable_kcq(adev))
-		DRM_ERROR("KCQ disable failed\n");
+		if (amdgpu_gfx_disable_kcq(adev))
+			DRM_ERROR("KCQ disable failed\n");
+	}
+
 	if (amdgpu_sriov_vf(adev)) {
 		gfx_v10_0_cp_gfx_enable(adev, false);
 		/* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index 8462b30..306461d 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -224,9 +224,12 @@ int smu_dpm_set_power_gate(struct smu_context *smu, uint32_t block_type,
 {
 	int ret = 0;
 
+
 	if (!smu->pm_enabled || !smu->adev->pm.dpm_enabled)
 		return -EOPNOTSUPP;
 
+
+
 	switch (block_type) {
 	/*
 	 * Some legacy code of amdgpu_vcn.c and vcn_v2*.c still uses
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index a58ea08..02cf55e 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -112,6 +112,9 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
 	struct amdgpu_device *adev = smu->adev;
 	int ret = 0, index = 0;
 
+	if (smu->adev->in_dpc)
+		return 0;
+
 	index = smu_cmn_to_asic_specific_index(smu,
 					       CMN2ASIC_MAPPING_MSG,
 					       msg);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery
  2020-08-26 14:46 [PATCH 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
  2020-08-26 14:46 ` [PATCH 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
  2020-08-26 14:46 ` [PATCH 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
@ 2020-08-26 14:46 ` Andrey Grodzovsky
  2020-08-29  2:16   ` Luben Tuikov
  2020-08-26 14:46 ` [PATCH 4/7] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-26 14:46 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, Andrey Grodzovsky, nirmodas

DPC recovery involves ASIC reset just as normal GPU recovery so block
SW GPU shcedulers and wait on all concurent GPU resets.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e1bbefd..0a2130b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4747,6 +4747,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
 	return 0;
 }
 
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+	int i;
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		cancel_delayed_work_sync(&ring->sched.work_tdr);
+	}
+}
+
 /**
  * amdgpu_pci_error_detected - Called when a PCI error is detected.
  * @pdev: PCI device struct
@@ -4760,16 +4774,38 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
+	int i;
 
 	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
 
 	switch (state) {
 	case pci_channel_io_normal:
 		return PCI_ERS_RESULT_CAN_RECOVER;
-	case pci_channel_io_frozen: {
-		/* Fatal error, prepare for slot reset */
+	case pci_channel_io_frozen: { /* Fatal error, prepare for slot reset */
+
+		/*
+		 * Cancel and wait for all TDRs in progress if failing to
+		 * set  adev->in_gpu_reset in amdgpu_device_lock_adev
+		 *
+		 * Locking adev->reset_sem will perevent any external access
+		 * to GPU during PCI error recovery
+		 */
+		while (!amdgpu_device_lock_adev(adev, NULL))
+			amdgpu_cancel_all_tdr(adev);
+
+		/*
+		 * Block any work scheduling as we do for regualr GPU reset
+		 * for the duration of the recoveryq
+		 */
+		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+			struct amdgpu_ring *ring = adev->rings[i];
+
+			if (!ring || !ring->sched.thread)
+				continue;
+
+			drm_sched_stop(&ring->sched, NULL);
+		}
 
-		amdgpu_device_lock_adev(adev);
 		return PCI_ERS_RESULT_NEED_RESET;
 	}
 	case pci_channel_io_perm_failure:
@@ -4902,8 +4938,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
+	int i;
 
-	amdgpu_device_unlock_adev(adev);
 
 	DRM_INFO("PCI error: resume callback!!\n");
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+
+		drm_sched_resubmit_jobs(&ring->sched);
+		drm_sched_start(&ring->sched, true);
+	}
+
+	amdgpu_device_unlock_adev(adev);
 }
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/7] drm/amdgpu: Fix SMU error failure
  2020-08-26 14:46 [PATCH 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (2 preceding siblings ...)
  2020-08-26 14:46 ` [PATCH 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
@ 2020-08-26 14:46 ` Andrey Grodzovsky
  2020-08-26 15:10   ` Alex Deucher
  2020-08-26 14:46 ` [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure Andrey Grodzovsky
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-26 14:46 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, Andrey Grodzovsky, nirmodas

Wait for HW/PSP initiated ASIC reset to complete before
starting the recovery operations.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0a2130b..d9e3994 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4846,13 +4846,31 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
-	int r;
+	int r, i;
 	bool vram_lost;
+	u32 memsize;
 
 	DRM_INFO("PCI error: slot reset callback!!\n");
 
+	/* wait for asic to come out of reset */
+	msleep(500);
+
 	pci_restore_state(pdev);
 
+	/* confirm  ASIC came out of reset */
+	for (i = 0; i < adev->usec_timeout; i++) {
+		memsize = amdgpu_asic_get_config_memsize(adev);
+
+		if (memsize != 0xffffffff)
+			break;
+		udelay(1);
+	}
+	if (memsize == 0xffffffff) {
+		r = -ETIME;
+		goto out;
+	}
+
+	/* TODO Call amdgpu_pre_asic_reset instead */
 	adev->in_dpc = true;
 	r = amdgpu_device_ip_suspend(adev);
 	adev->in_dpc = false;
@@ -4882,7 +4900,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 
 	r = amdgpu_device_fw_loading(adev);
 	if (r)
-		return r;
+		goto out;
 
 	r = amdgpu_device_ip_resume_phase2(adev);
 	if (r)
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.
  2020-08-26 14:46 [PATCH 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (3 preceding siblings ...)
  2020-08-26 14:46 ` [PATCH 4/7] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
@ 2020-08-26 14:46 ` Andrey Grodzovsky
  2020-08-26 15:20   ` Alex Deucher
  2020-08-26 14:46 ` [PATCH 6/7] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
  2020-08-26 14:46 ` [PATCH 7/7] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
  6 siblings, 1 reply; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-26 14:46 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, Andrey Grodzovsky, nirmodas

DPC recovery after prev. DPC recovery or after prev. MODE1 reset fails
unles you save the cashe the saved PCI confspace to load it after
each new reset.
Also use same cached state for other use case of restoring PCI confspace
such as GPU mode1 or VGA switheroo.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 +++++++++++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
 drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
 5 files changed, 66 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 3489622..42ee208 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,7 +992,9 @@ struct amdgpu_device {
 	atomic_t			throttling_logging_enabled;
 	struct ratelimit_state		throttling_logging_rs;
 	uint32_t			ras_features;
+
 	bool                            in_dpc;
+	struct pci_saved_state          *pci_state;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
@@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
 pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
 void amdgpu_pci_resume(struct pci_dev *pdev);
 
+bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
+bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
+
+
 
 #include "amdgpu_object.h"
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d9e3994..2c088df 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1283,7 +1283,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
 		dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 
 		pci_set_power_state(dev->pdev, PCI_D0);
-		pci_restore_state(dev->pdev);
+		amdgpu_device_load_pci_state(dev->pdev);
 		r = pci_enable_device(dev->pdev);
 		if (r)
 			DRM_WARN("pci_enable_device failed (%d)\n", r);
@@ -1296,7 +1296,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
 		drm_kms_helper_poll_disable(dev);
 		dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 		amdgpu_device_suspend(dev, true);
-		pci_save_state(dev->pdev);
+		amdgpu_device_cache_pci_state(dev->pdev);
 		/* Shut down the device */
 		pci_disable_device(dev->pdev);
 		pci_set_power_state(dev->pdev, PCI_D3cold);
@@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		dev_err(adev->dev, "amdgpu_pmu_init failed\n");
 
-	if (pci_save_state(pdev))
-		DRM_ERROR("Failed to save PCI state!!\n");
+	/* Have stored pci confspace at hand for restore in sudden PCI error */
+	if (!amdgpu_device_cache_pci_state(adev->pdev))
+		DRM_WARN("Failed to cache PCI state!");
 
 	return 0;
 
@@ -3430,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 	flush_delayed_work(&adev->delayed_init_work);
 	adev->shutdown = true;
 
+	kfree(adev->pci_state);
+
 	/* make sure IB test finished before entering exclusive mode
 	 * to avoid preemption on IB test
 	 * */
@@ -4855,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	/* wait for asic to come out of reset */
 	msleep(500);
 
-	pci_restore_state(pdev);
+	amdgpu_device_load_pci_state(pdev);
 
 	/* confirm  ASIC came out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
@@ -4934,8 +4937,10 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 
 out:
 
-	if (!r)
+	if (!r) {
+		amdgpu_device_cache_pci_state(adev->pdev);
 		DRM_INFO("PCIe error recovery succeeded\n");
+	}
 	else {
 		DRM_ERROR("PCIe error recovery failed, err:%d", r);
 		amdgpu_device_unlock_adev(adev);
@@ -4974,3 +4979,46 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 
 	amdgpu_device_unlock_adev(adev);
 }
+
+bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+
+	r = pci_save_state(pdev);
+	if (!r) {
+		kfree(adev->pci_state);
+
+		adev->pci_state = pci_store_saved_state(pdev);
+
+		if (!adev->pci_state) {
+			DRM_ERROR("Failed to store PCI saved state");
+			return false;
+		}
+	} else {
+		DRM_WARN("Failed to save PCI state, err:%d\n", r);
+		return false;
+	}
+
+	return true;
+}
+
+bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+
+	if (!adev->pci_state)
+		return false;
+
+	r = pci_load_saved_state(pdev, adev->pci_state);
+
+	if (!r) {
+		pci_restore_state(pdev);
+	} else {
+		DRM_WARN("Failed to load PCI state, err:%d\n", r);
+		return false;
+	}
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 4bbcc70..7a6482a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
 		if (amdgpu_is_atpx_hybrid()) {
 			pci_ignore_hotplug(pdev);
 		} else {
-			pci_save_state(pdev);
+			amdgpu_device_cache_pci_state(pdev);
 			pci_disable_device(pdev);
 			pci_ignore_hotplug(pdev);
 			pci_set_power_state(pdev, PCI_D3cold);
@@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
 			pci_set_master(pdev);
 		} else {
 			pci_set_power_state(pdev, PCI_D0);
-			pci_restore_state(pdev);
+			amdgpu_device_load_pci_state(pdev);
 			ret = pci_enable_device(pdev);
 			if (ret)
 				return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 4d14023..0ec6603 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
 	/* disable BM */
 	pci_clear_master(adev->pdev);
 
-	pci_save_state(adev->pdev);
+	amdgpu_device_cache_pci_state(adev->pdev);
 
 	if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
 		dev_info(adev->dev, "GPU smu mode1 reset\n");
@@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
 
 	if (ret)
 		dev_err(adev->dev, "GPU mode1 reset failed\n");
-	pci_restore_state(adev->pdev);
+	amdgpu_device_load_pci_state(adev->pdev);
 
 	/* wait for asic to come out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 2f93c47..ddd55e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
 	/* disable BM */
 	pci_clear_master(adev->pdev);
 
-	pci_save_state(adev->pdev);
+	amdgpu_device_cache_pci_state(adev->pdev);
 
 	ret = psp_gpu_reset(adev);
 	if (ret)
 		dev_err(adev->dev, "GPU mode1 reset failed\n");
 
-	pci_restore_state(adev->pdev);
+	amdgpu_device_load_pci_state(adev->pdev);
 
 	/* wait for asic to come out of reset */
 	for (i = 0; i < adev->usec_timeout; i++) {
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 6/7] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
  2020-08-26 14:46 [PATCH 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (4 preceding siblings ...)
  2020-08-26 14:46 ` [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure Andrey Grodzovsky
@ 2020-08-26 14:46 ` Andrey Grodzovsky
  2020-08-26 14:46 ` [PATCH 7/7] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
  6 siblings, 0 replies; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-26 14:46 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, Andrey Grodzovsky, nirmodas

Reuse exsisting functions from GPU recovery to avoid code
duplications.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 75 ++++++------------------------
 1 file changed, 14 insertions(+), 61 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2c088df..473553f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4115,7 +4115,8 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 
 static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 			       struct list_head *device_list_handle,
-			       bool *need_full_reset_arg)
+			       bool *need_full_reset_arg,
+			       bool skip_hw_reset)
 {
 	struct amdgpu_device *tmp_adev = NULL;
 	bool need_full_reset = *need_full_reset_arg, vram_lost = false;
@@ -4125,7 +4126,7 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 	 * ASIC reset has to be done on all HGMI hive nodes ASAP
 	 * to allow proper links negotiation in FW (within 1 sec)
 	 */
-	if (need_full_reset) {
+	if (!skip_hw_reset && need_full_reset) {
 		list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
 			/* For XGMI run all resets in parallel to speed up the process */
 			if (tmp_adev->gmc.xgmi.num_physical_nodes > 1) {
@@ -4521,7 +4522,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		if (r)
 			adev->asic_reset_res = r;
 	} else {
-		r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset);
+		r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset, false);
 		if (r && r == -EAGAIN)
 			goto retry;
 	}
@@ -4850,14 +4851,19 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
 	int r, i;
-	bool vram_lost;
+	bool need_full_reset = true;
 	u32 memsize;
+	struct list_head device_list;
 
 	DRM_INFO("PCI error: slot reset callback!!\n");
 
+	INIT_LIST_HEAD(&device_list);
+	list_add_tail(&adev->gmc.xgmi.head, &device_list);
+
 	/* wait for asic to come out of reset */
 	msleep(500);
 
+	/* Restore PCI confspace */
 	amdgpu_device_load_pci_state(pdev);
 
 	/* confirm  ASIC came out of reset */
@@ -4873,70 +4879,15 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 		goto out;
 	}
 
-	/* TODO Call amdgpu_pre_asic_reset instead */
 	adev->in_dpc = true;
-	r = amdgpu_device_ip_suspend(adev);
+	r = amdgpu_device_pre_asic_reset(adev, NULL, &need_full_reset);
 	adev->in_dpc = false;
 	if (r)
 		goto out;
 
-
-	/* post card */
-	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_ip_resume_phase1(adev);
-	if (r)
-		goto out;
-
-	vram_lost = amdgpu_device_check_vram_lost(adev);
-	if (vram_lost) {
-		DRM_INFO("VRAM is lost due to GPU reset!\n");
-		amdgpu_inc_vram_lost(adev);
-	}
-
-	r = amdgpu_gtt_mgr_recover(
-		&adev->mman.bdev.man[TTM_PL_TT]);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_fw_loading(adev);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_ip_resume_phase2(adev);
-	if (r)
-		goto out;
-
-	if (vram_lost)
-		amdgpu_device_fill_reset_magic(adev);
-
-	/*
-	 * Add this ASIC as tracked as reset was already
-	 * complete successfully.
-	 */
-	amdgpu_register_gpu_instance(adev);
-
-	r = amdgpu_device_ip_late_init(adev);
-	if (r)
-		goto out;
-
-	amdgpu_fbdev_set_suspend(adev, 0);
-
-	/* must succeed. */
-	amdgpu_ras_resume(adev);
-
-
-	amdgpu_irq_gpu_reset_resume_helper(adev);
-	r = amdgpu_ib_ring_tests(adev);
-	if (r)
-		goto out;
-
-	r = amdgpu_device_recover_vram(adev);
+	r = amdgpu_do_asic_reset(NULL, &device_list, &need_full_reset, true);
 
 out:
-
 	if (!r) {
 		amdgpu_device_cache_pci_state(adev->pdev);
 		DRM_INFO("PCIe error recovery succeeded\n");
@@ -5021,4 +4972,6 @@ bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
 		DRM_WARN("Failed to load PCI state, err:%d\n", r);
 		return false;
 	}
+
+	return true;
 }
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 7/7] drm/amdgpu: Disable DPC for XGMI for now.
  2020-08-26 14:46 [PATCH 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
                   ` (5 preceding siblings ...)
  2020-08-26 14:46 ` [PATCH 6/7] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
@ 2020-08-26 14:46 ` Andrey Grodzovsky
  6 siblings, 0 replies; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-26 14:46 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, Andrey Grodzovsky, nirmodas

XGMI support is more complicated then single device support as
questions of synchronization between the device recovering from
PCI error and other memebers of the hive is required.
Leaving this for next round.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 473553f..50a6d78 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4782,6 +4782,11 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 
 	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
 
+	if (adev->gmc.xgmi.num_physical_nodes > 1) {
+		DRM_WARN("No support for XGMI hive yet...");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
 	switch (state) {
 	case pci_channel_io_normal:
 		return PCI_ERS_RESULT_CAN_RECOVER;
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-26 14:46 ` [PATCH 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
@ 2020-08-26 15:08   ` Alex Deucher
  2020-08-27  1:23   ` Li, Dennis
  2020-08-29  2:07   ` Luben Tuikov
  2 siblings, 0 replies; 29+ messages in thread
From: Alex Deucher @ 2020-08-26 15:08 UTC (permalink / raw)
  To: Andrey Grodzovsky; +Cc: Deucher, Alexander, Nirmoy, amd-gfx list

On Wed, Aug 26, 2020 at 10:47 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> Add DPC handlers with basic recovery functionality.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 181 ++++++++++++++++++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
>  3 files changed, 196 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 49ea9fa..3399242 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -49,6 +49,8 @@
>  #include <linux/rbtree.h>
>  #include <linux/hashtable.h>
>  #include <linux/dma-fence.h>
> +#include <linux/pci.h>
> +#include <linux/aer.h>
>
>  #include <drm/ttm/ttm_bo_api.h>
>  #include <drm/ttm/ttm_bo_driver.h>
> @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
>  void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
>  void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
>
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
> +                                          pci_channel_state_t state);
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> +void amdgpu_pci_resume(struct pci_dev *pdev);
> +
> +
>  #include "amdgpu_object.h"
>
>  /* used by df_v3_6.c and amdgpu_pmu.c */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5a948ed..84f8d14 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -350,7 +350,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>   *
>   * Returns the 8 bit value from the offset specified.
>   */
> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
> +{
> +

Unrelated whitespace changes.

>         if (offset < adev->rmmio_size)
>                 return (readb(adev->rmmio + offset));
>         BUG();
> @@ -371,7 +373,9 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
>   *
>   * Writes the value specified to the offset specified.
>   */
> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
> +{
> +

Unrelated whitespace changes.

>         if (offset < adev->rmmio_size)
>                 writeb(value, adev->rmmio + offset);
>         else
> @@ -380,6 +384,7 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>
>  void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
>  {
> +

Unrelated whitespace changes.

>         trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
>
>         if ((reg * 4) < adev->rmmio_size)
> @@ -407,6 +412,7 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
>  void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>                     uint32_t acc_flags)
>  {
> +

Unrelated whitespace changes.

>         if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>                 return amdgpu_kiq_wreg(adev, reg, v);
>
> @@ -461,6 +467,7 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>   */
>  void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>  {
> +

Unrelated whitespace changes.

>         if ((reg * 4) < adev->rio_mem_size)
>                 iowrite32(v, adev->rio_mem + (reg * 4));
>         else {
> @@ -480,6 +487,7 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>   */
>  u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>  {
> +

Unrelated whitespace changes.

>         if (index < adev->doorbell.num_doorbells) {
>                 return readl(adev->doorbell.ptr + index);
>         } else {
> @@ -500,6 +508,7 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>   */
>  void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>  {
> +

Unrelated whitespace changes.

>         if (index < adev->doorbell.num_doorbells) {
>                 writel(v, adev->doorbell.ptr + index);
>         } else {
> @@ -518,6 +527,7 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>   */
>  u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>  {
> +

Unrelated whitespace changes.

>         if (index < adev->doorbell.num_doorbells) {
>                 return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
>         } else {
> @@ -538,6 +548,7 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>   */
>  void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
>  {
> +

Unrelated whitespace changes.

>         if (index < adev->doorbell.num_doorbells) {
>                 atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
>         } else {
> @@ -2989,6 +3000,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>         NULL
>  };
>
> +

Unrelated whitespace changes.

>  /**
>   * amdgpu_device_init - initialize the driver
>   *
> @@ -3207,6 +3219,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>                 }
>         }
>
> +       pci_enable_pcie_error_reporting(adev->ddev.pdev);
> +
> +
>         /* Post card if necessary */
>         if (amdgpu_device_need_post(adev)) {
>                 if (!adev->bios) {
> @@ -3359,6 +3374,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>         if (r)
>                 dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>
> +       if (pci_save_state(pdev))
> +               DRM_ERROR("Failed to save PCI state!!\n");
> +

This will break suspend.

>         return 0;
>
>  failed:
> @@ -4701,3 +4719,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>
>         return 0;
>  }
> +
> +/**
> + * amdgpu_pci_error_detected - Called when a PCI error is detected.
> + * @pdev: PCI device struct
> + * @state: PCI channel state
> + *
> + * Description: Called when a PCI error is detected.
> + *
> + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
> + */
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +       DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
> +
> +       switch (state) {
> +       case pci_channel_io_normal:
> +               return PCI_ERS_RESULT_CAN_RECOVER;
> +       case pci_channel_io_frozen: {
> +               /* Fatal error, prepare for slot reset */
> +
> +               amdgpu_device_lock_adev(adev);
> +               return PCI_ERS_RESULT_NEED_RESET;
> +       }
> +       case pci_channel_io_perm_failure:
> +               /* Permanent error, prepare for device removal */
> +               return PCI_ERS_RESULT_DISCONNECT;
> +       }
> +       return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +/**
> + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
> + * @pdev: pointer to PCI device
> + */
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
> +{
> +
> +       DRM_INFO("PCI error: mmio enabled callback!!\n");
> +
> +       /* TODO - dump whatever for debugging purposes */
> +
> +       /* This called only if amdgpu_pci_error_detected returns
> +        * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
> +        * works, no need to reset slot.
> +        */
> +
> +       return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
> + * @pdev: PCI device struct
> + *
> + * Description: This routine is called by the pci error recovery
> + * code after the PCI slot has been reset, just before we
> + * should resume normal operations.
> + */
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +       int r;
> +       bool vram_lost;
> +
> +       DRM_INFO("PCI error: slot reset callback!!\n");
> +
> +       pci_restore_state(pdev);
> +
> +       r = amdgpu_device_ip_suspend(adev);
> +       if (r)
> +               goto out;
> +
> +
> +       /* post card */
> +       r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
> +       if (r)
> +               goto out;
> +
> +       r = amdgpu_device_ip_resume_phase1(adev);
> +       if (r)
> +               goto out;
> +
> +       vram_lost = amdgpu_device_check_vram_lost(adev);
> +       if (vram_lost) {
> +               DRM_INFO("VRAM is lost due to GPU reset!\n");
> +               amdgpu_inc_vram_lost(adev);
> +       }
> +
> +       r = amdgpu_gtt_mgr_recover(
> +               &adev->mman.bdev.man[TTM_PL_TT]);
> +       if (r)
> +               goto out;
> +
> +       r = amdgpu_device_fw_loading(adev);
> +       if (r)
> +               return r;
> +
> +       r = amdgpu_device_ip_resume_phase2(adev);
> +       if (r)
> +               goto out;
> +
> +       if (vram_lost)
> +               amdgpu_device_fill_reset_magic(adev);
> +
> +       /*
> +        * Add this ASIC as tracked as reset was already
> +        * complete successfully.
> +        */
> +       amdgpu_register_gpu_instance(adev);
> +
> +       r = amdgpu_device_ip_late_init(adev);
> +       if (r)
> +               goto out;
> +
> +       amdgpu_fbdev_set_suspend(adev, 0);
> +
> +       /* must succeed. */
> +       amdgpu_ras_resume(adev);
> +
> +
> +       amdgpu_irq_gpu_reset_resume_helper(adev);
> +       r = amdgpu_ib_ring_tests(adev);
> +       if (r)
> +               goto out;
> +
> +       r = amdgpu_device_recover_vram(adev);
> +
> +out:
> +
> +       if (!r)
> +               DRM_INFO("PCIe error recovery succeeded\n");
> +       else {
> +               DRM_ERROR("PCIe error recovery failed, err:%d", r);
> +               amdgpu_device_unlock_adev(adev);
> +       }
> +
> +       return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_resume() - resume normal ops after PCI reset
> + * @pdev: pointer to PCI device
> + *
> + * Called when the error recovery driver tells us that its
> + * OK to resume normal operation. Use completion to allow
> + * halted scsi ops to resume.
> + */
> +void amdgpu_pci_resume(struct pci_dev *pdev)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +       amdgpu_device_unlock_adev(adev);
> +
> +       DRM_INFO("PCI error: resume callback!!\n");
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index d984c6a..4bbcc70 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -31,7 +31,6 @@
>  #include <drm/drm_pciids.h>
>  #include <linux/console.h>
>  #include <linux/module.h>
> -#include <linux/pci.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/vga_switcheroo.h>
>  #include <drm/drm_probe_helper.h>
> @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
>         .patchlevel = KMS_DRIVER_PATCHLEVEL,
>  };
>
> +static struct pci_error_handlers amdgpu_pci_err_handler = {
> +       .error_detected = amdgpu_pci_error_detected,
> +       .mmio_enabled   = amdgpu_pci_mmio_enabled,
> +       .slot_reset     = amdgpu_pci_slot_reset,
> +       .resume         = amdgpu_pci_resume,
> +};
> +
>  static struct pci_driver amdgpu_kms_pci_driver = {
>         .name = DRIVER_NAME,
>         .id_table = pciidlist,
> @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>         .remove = amdgpu_pci_remove,
>         .shutdown = amdgpu_pci_shutdown,
>         .driver.pm = &amdgpu_pm_ops,
> +       .err_handler = &amdgpu_pci_err_handler,
>  };
>
>  static int __init amdgpu_init(void)
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-08-26 14:46 ` [PATCH 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
@ 2020-08-26 15:09   ` Alex Deucher
  2020-08-26 16:53   ` Nirmoy
  2020-08-29  2:03   ` Luben Tuikov
  2 siblings, 0 replies; 29+ messages in thread
From: Alex Deucher @ 2020-08-26 15:09 UTC (permalink / raw)
  To: Andrey Grodzovsky; +Cc: Deucher, Alexander, Nirmoy, amd-gfx list

On Wed, Aug 26, 2020 at 10:46 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> At this point the ASIC is already post reset by the HW/PSP
> so the HW not in proper state to be configured for suspension,
> some bloks might be even gated and so best is to avoid touching it.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 37 ++++++++++++++++++++++++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
>  drivers/gpu/drm/amd/amdgpu/atom.c          |  1 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 +++++++++------
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  3 +++
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
>  8 files changed, 64 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 3399242..3489622 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,6 +992,7 @@ struct amdgpu_device {
>         atomic_t                        throttling_logging_enabled;
>         struct ratelimit_state          throttling_logging_rs;
>         uint32_t                        ras_features;
> +       bool                            in_dpc;
>  };
>
>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 84f8d14..e1bbefd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>  {
>         uint32_t ret;
>
> +       if (adev->in_dpc)
> +               return 0;
> +
>         if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>                 return amdgpu_kiq_rreg(adev, reg);
>
> @@ -350,8 +353,10 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>   *
>   * Returns the 8 bit value from the offset specified.
>   */
> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
> -{
> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {

Unrelated whitespace changes.

> +
> +       if (adev->in_dpc)
> +               return 0;
>
>         if (offset < adev->rmmio_size)
>                 return (readb(adev->rmmio + offset));
> @@ -373,8 +378,10 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>   *
>   * Writes the value specified to the offset specified.
>   */
> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
> -{
> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {

Unrelated whitespace changes.

> +
> +       if (adev->in_dpc)
> +               return;
>
>         if (offset < adev->rmmio_size)
>                 writeb(value, adev->rmmio + offset);
> @@ -384,6 +391,8 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>
>  void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
>  {
> +       if (adev->in_dpc)
> +               return;
>
>         trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
>
> @@ -412,6 +421,8 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
>  void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>                     uint32_t acc_flags)
>  {
> +       if (adev->in_dpc)
> +               return;
>
>         if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>                 return amdgpu_kiq_wreg(adev, reg, v);
> @@ -427,6 +438,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>  void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>                     uint32_t acc_flags)
>  {
> +       if (adev->in_dpc)
> +               return;
> +
>         if (amdgpu_sriov_fullaccess(adev) &&
>                 adev->gfx.rlc.funcs &&
>                 adev->gfx.rlc.funcs->is_rlcg_access_range) {
> @@ -448,6 +462,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
>   */
>  u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>  {
> +       if (adev->in_dpc)
> +               return 0;
> +
>         if ((reg * 4) < adev->rio_mem_size)
>                 return ioread32(adev->rio_mem + (reg * 4));
>         else {
> @@ -467,6 +484,8 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>   */
>  void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>  {
> +       if (adev->in_dpc)
> +               return;
>
>         if ((reg * 4) < adev->rio_mem_size)
>                 iowrite32(v, adev->rio_mem + (reg * 4));
> @@ -487,6 +506,8 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>   */
>  u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>  {
> +       if (adev->in_dpc)
> +               return 0;
>
>         if (index < adev->doorbell.num_doorbells) {
>                 return readl(adev->doorbell.ptr + index);
> @@ -508,6 +529,8 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>   */
>  void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>  {
> +       if (adev->in_dpc)
> +               return;
>
>         if (index < adev->doorbell.num_doorbells) {
>                 writel(v, adev->doorbell.ptr + index);
> @@ -527,6 +550,8 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>   */
>  u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>  {
> +       if (adev->in_dpc)
> +               return 0;
>
>         if (index < adev->doorbell.num_doorbells) {
>                 return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
> @@ -548,6 +573,8 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>   */
>  void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
>  {
> +       if (adev->in_dpc)
> +                       return;
>
>         if (index < adev->doorbell.num_doorbells) {
>                 atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
> @@ -4790,7 +4817,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>
>         pci_restore_state(pdev);
>
> +       adev->in_dpc = true;
>         r = amdgpu_device_ip_suspend(adev);
> +       adev->in_dpc = false;
>         if (r)
>                 goto out;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index d698142..50c42c9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -693,6 +693,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
>         struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>         struct amdgpu_ring *ring = &kiq->ring;
>
> +       if (adev->in_dpc)
> +               return 0;
> +
>         BUG_ON(!ring->funcs->emit_rreg);
>
>         spin_lock_irqsave(&kiq->ring_lock, flags);
> @@ -757,6 +760,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>
>         BUG_ON(!ring->funcs->emit_wreg);
>
> +       if (adev->in_dpc)
> +               return;
> +
>         spin_lock_irqsave(&kiq->ring_lock, flags);
>         amdgpu_ring_alloc(ring, 32);
>         amdgpu_ring_emit_wreg(ring, reg, v);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index d6c38e2..555ef59 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -219,6 +219,9 @@ int psp_wait_for(struct psp_context *psp, uint32_t reg_index,
>         int i;
>         struct amdgpu_device *adev = psp->adev;
>
> +       if (psp->adev->in_dpc)
> +               return 0;
> +
>         for (i = 0; i < adev->usec_timeout; i++) {
>                 val = RREG32(reg_index);
>                 if (check_changed) {
> @@ -245,6 +248,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
>         bool ras_intr = false;
>         bool skip_unsupport = false;
>
> +       if (psp->adev->in_dpc)
> +               return 0;
> +
>         mutex_lock(&psp->mutex);
>
>         memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
> diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c
> index 4cfc786..613dac1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/atom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/atom.c
> @@ -750,6 +750,7 @@ static void atom_op_jump(atom_exec_context *ctx, int *ptr, int arg)
>                                         DRM_ERROR("atombios stuck in loop for more than %dsecs aborting\n",
>                                                   ATOM_CMD_TIMEOUT_SEC);
>                                         ctx->abort = true;
> +                                       dump_stack();

Leftover from debug?

>                                 }
>                         } else {
>                                 /* jiffies wrap around we will just wait a little longer */
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 2db195e..86f268e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -6980,15 +6980,19 @@ static int gfx_v10_0_hw_fini(void *handle)
>
>         amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
>         amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
> +
> +       if (!adev->in_dpc) {
>  #ifndef BRING_UP_DEBUG
> -       if (amdgpu_async_gfx_ring) {
> -               r = gfx_v10_0_kiq_disable_kgq(adev);
> -               if (r)
> -                       DRM_ERROR("KGQ disable failed\n");
> -       }
> +               if (amdgpu_async_gfx_ring) {
> +                       r = gfx_v10_0_kiq_disable_kgq(adev);
> +                       if (r)
> +                               DRM_ERROR("KGQ disable failed\n");
> +               }
>  #endif
> -       if (amdgpu_gfx_disable_kcq(adev))
> -               DRM_ERROR("KCQ disable failed\n");
> +               if (amdgpu_gfx_disable_kcq(adev))
> +                       DRM_ERROR("KCQ disable failed\n");
> +       }
> +
>         if (amdgpu_sriov_vf(adev)) {
>                 gfx_v10_0_cp_gfx_enable(adev, false);
>                 /* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index 8462b30..306461d 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -224,9 +224,12 @@ int smu_dpm_set_power_gate(struct smu_context *smu, uint32_t block_type,
>  {
>         int ret = 0;
>
> +

Unrelated whitespace changes.

>         if (!smu->pm_enabled || !smu->adev->pm.dpm_enabled)
>                 return -EOPNOTSUPP;
>
> +
> +

Unrelated whitespace changes.

>         switch (block_type) {
>         /*
>          * Some legacy code of amdgpu_vcn.c and vcn_v2*.c still uses
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index a58ea08..02cf55e 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -112,6 +112,9 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>         struct amdgpu_device *adev = smu->adev;
>         int ret = 0, index = 0;
>
> +       if (smu->adev->in_dpc)
> +               return 0;
> +
>         index = smu_cmn_to_asic_specific_index(smu,
>                                                CMN2ASIC_MAPPING_MSG,
>                                                msg);
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/7] drm/amdgpu: Fix SMU error failure
  2020-08-26 14:46 ` [PATCH 4/7] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
@ 2020-08-26 15:10   ` Alex Deucher
  0 siblings, 0 replies; 29+ messages in thread
From: Alex Deucher @ 2020-08-26 15:10 UTC (permalink / raw)
  To: Andrey Grodzovsky; +Cc: Deucher, Alexander, Nirmoy, amd-gfx list

On Wed, Aug 26, 2020 at 10:46 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> Wait for HW/PSP initiated ASIC reset to complete before
> starting the recovery operations.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 0a2130b..d9e3994 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4846,13 +4846,31 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>  {
>         struct drm_device *dev = pci_get_drvdata(pdev);
>         struct amdgpu_device *adev = drm_to_adev(dev);
> -       int r;
> +       int r, i;
>         bool vram_lost;
> +       u32 memsize;
>
>         DRM_INFO("PCI error: slot reset callback!!\n");
>
> +       /* wait for asic to come out of reset */
> +       msleep(500);
> +
>         pci_restore_state(pdev);
>
> +       /* confirm  ASIC came out of reset */
> +       for (i = 0; i < adev->usec_timeout; i++) {
> +               memsize = amdgpu_asic_get_config_memsize(adev);
> +
> +               if (memsize != 0xffffffff)
> +                       break;
> +               udelay(1);
> +       }
> +       if (memsize == 0xffffffff) {
> +               r = -ETIME;
> +               goto out;
> +       }
> +
> +       /* TODO Call amdgpu_pre_asic_reset instead */
>         adev->in_dpc = true;
>         r = amdgpu_device_ip_suspend(adev);
>         adev->in_dpc = false;
> @@ -4882,7 +4900,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>
>         r = amdgpu_device_fw_loading(adev);
>         if (r)
> -               return r;
> +               goto out;

Is this a separate bug fix?

Alex

>
>         r = amdgpu_device_ip_resume_phase2(adev);
>         if (r)
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.
  2020-08-26 14:46 ` [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure Andrey Grodzovsky
@ 2020-08-26 15:20   ` Alex Deucher
  2020-08-26 15:31     ` Andrey Grodzovsky
  2020-08-27 14:54     ` Andrey Grodzovsky
  0 siblings, 2 replies; 29+ messages in thread
From: Alex Deucher @ 2020-08-26 15:20 UTC (permalink / raw)
  To: Andrey Grodzovsky; +Cc: Deucher, Alexander, Nirmoy, amd-gfx list

On Wed, Aug 26, 2020 at 10:46 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> DPC recovery after prev. DPC recovery or after prev. MODE1 reset fails
> unles you save the cashe the saved PCI confspace to load it after
> each new reset.
> Also use same cached state for other use case of restoring PCI confspace
> such as GPU mode1 or VGA switheroo.
>

We don't want to keep the saved state around in the pci core
otherwise, the pci core will assume we are managing the saved state
for suspend and resume.  I think we want logic like this:

At driver load time:
pci_save_state(pdev);
adev->pci_state = pci_store_saved_state(pdev);
pci_restore_state(adev->pdev);

then in the case of dpc, do:
pci_load_saved_state(pdev, adev->pci_state);

For all the other cases, just leave the code as is.

Alex


> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 +++++++++++++++++++++++++++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
>  drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
>  drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
>  5 files changed, 66 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 3489622..42ee208 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,7 +992,9 @@ struct amdgpu_device {
>         atomic_t                        throttling_logging_enabled;
>         struct ratelimit_state          throttling_logging_rs;
>         uint32_t                        ras_features;
> +

Unrelated whitespace changes.

>         bool                            in_dpc;
> +       struct pci_saved_state          *pci_state;
>  };
>
>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> @@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
>  pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
>  void amdgpu_pci_resume(struct pci_dev *pdev);
>
> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
> +
> +
>
>  #include "amdgpu_object.h"
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d9e3994..2c088df 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1283,7 +1283,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>                 dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>
>                 pci_set_power_state(dev->pdev, PCI_D0);
> -               pci_restore_state(dev->pdev);
> +               amdgpu_device_load_pci_state(dev->pdev);
>                 r = pci_enable_device(dev->pdev);
>                 if (r)
>                         DRM_WARN("pci_enable_device failed (%d)\n", r);
> @@ -1296,7 +1296,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>                 drm_kms_helper_poll_disable(dev);
>                 dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>                 amdgpu_device_suspend(dev, true);
> -               pci_save_state(dev->pdev);
> +               amdgpu_device_cache_pci_state(dev->pdev);
>                 /* Shut down the device */
>                 pci_disable_device(dev->pdev);
>                 pci_set_power_state(dev->pdev, PCI_D3cold);
> @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>         if (r)
>                 dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>
> -       if (pci_save_state(pdev))
> -               DRM_ERROR("Failed to save PCI state!!\n");
> +       /* Have stored pci confspace at hand for restore in sudden PCI error */
> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
> +               DRM_WARN("Failed to cache PCI state!");
>
>         return 0;
>
> @@ -3430,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>         flush_delayed_work(&adev->delayed_init_work);
>         adev->shutdown = true;
>
> +       kfree(adev->pci_state);
> +
>         /* make sure IB test finished before entering exclusive mode
>          * to avoid preemption on IB test
>          * */
> @@ -4855,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>         /* wait for asic to come out of reset */
>         msleep(500);
>
> -       pci_restore_state(pdev);
> +       amdgpu_device_load_pci_state(pdev);
>
>         /* confirm  ASIC came out of reset */
>         for (i = 0; i < adev->usec_timeout; i++) {
> @@ -4934,8 +4937,10 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>
>  out:
>
> -       if (!r)
> +       if (!r) {
> +               amdgpu_device_cache_pci_state(adev->pdev);
>                 DRM_INFO("PCIe error recovery succeeded\n");
> +       }
>         else {
>                 DRM_ERROR("PCIe error recovery failed, err:%d", r);
>                 amdgpu_device_unlock_adev(adev);
> @@ -4974,3 +4979,46 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>
>         amdgpu_device_unlock_adev(adev);
>  }
> +
> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +       int r;
> +
> +       r = pci_save_state(pdev);
> +       if (!r) {
> +               kfree(adev->pci_state);
> +
> +               adev->pci_state = pci_store_saved_state(pdev);
> +
> +               if (!adev->pci_state) {
> +                       DRM_ERROR("Failed to store PCI saved state");
> +                       return false;
> +               }
> +       } else {
> +               DRM_WARN("Failed to save PCI state, err:%d\n", r);
> +               return false;
> +       }
> +
> +       return true;
> +}
> +
> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
> +{
> +       struct drm_device *dev = pci_get_drvdata(pdev);
> +       struct amdgpu_device *adev = drm_to_adev(dev);
> +       int r;
> +
> +       if (!adev->pci_state)
> +               return false;
> +
> +       r = pci_load_saved_state(pdev, adev->pci_state);
> +
> +       if (!r) {
> +               pci_restore_state(pdev);
> +       } else {
> +               DRM_WARN("Failed to load PCI state, err:%d\n", r);
> +               return false;
> +       }
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 4bbcc70..7a6482a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
>                 if (amdgpu_is_atpx_hybrid()) {
>                         pci_ignore_hotplug(pdev);
>                 } else {
> -                       pci_save_state(pdev);
> +                       amdgpu_device_cache_pci_state(pdev);
>                         pci_disable_device(pdev);
>                         pci_ignore_hotplug(pdev);
>                         pci_set_power_state(pdev, PCI_D3cold);
> @@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
>                         pci_set_master(pdev);
>                 } else {
>                         pci_set_power_state(pdev, PCI_D0);
> -                       pci_restore_state(pdev);
> +                       amdgpu_device_load_pci_state(pdev);
>                         ret = pci_enable_device(pdev);
>                         if (ret)
>                                 return ret;
> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
> index 4d14023..0ec6603 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> @@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>         /* disable BM */
>         pci_clear_master(adev->pdev);
>
> -       pci_save_state(adev->pdev);
> +       amdgpu_device_cache_pci_state(adev->pdev);
>
>         if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
>                 dev_info(adev->dev, "GPU smu mode1 reset\n");
> @@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>
>         if (ret)
>                 dev_err(adev->dev, "GPU mode1 reset failed\n");
> -       pci_restore_state(adev->pdev);
> +       amdgpu_device_load_pci_state(adev->pdev);
>
>         /* wait for asic to come out of reset */
>         for (i = 0; i < adev->usec_timeout; i++) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 2f93c47..ddd55e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
>         /* disable BM */
>         pci_clear_master(adev->pdev);
>
> -       pci_save_state(adev->pdev);
> +       amdgpu_device_cache_pci_state(adev->pdev);
>
>         ret = psp_gpu_reset(adev);
>         if (ret)
>                 dev_err(adev->dev, "GPU mode1 reset failed\n");
>
> -       pci_restore_state(adev->pdev);
> +       amdgpu_device_load_pci_state(adev->pdev);
>
>         /* wait for asic to come out of reset */
>         for (i = 0; i < adev->usec_timeout; i++) {
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.
  2020-08-26 15:20   ` Alex Deucher
@ 2020-08-26 15:31     ` Andrey Grodzovsky
  2020-08-26 15:39       ` Alex Deucher
  2020-08-27 14:54     ` Andrey Grodzovsky
  1 sibling, 1 reply; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-26 15:31 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Deucher, Alexander, Nirmoy, amd-gfx list


On 8/26/20 11:20 AM, Alex Deucher wrote:
> On Wed, Aug 26, 2020 at 10:46 AM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>> DPC recovery after prev. DPC recovery or after prev. MODE1 reset fails
>> unles you save the cashe the saved PCI confspace to load it after
>> each new reset.
>> Also use same cached state for other use case of restoring PCI confspace
>> such as GPU mode1 or VGA switheroo.
>>
> We don't want to keep the saved state around in the pci core
> otherwise, the pci core will assume we are managing the saved state
> for suspend and resume.


Are you talking about all the references to pci_dev->state_saved in pci-driver.c ?

Andrey


> I think we want logic like this:
>
> At driver load time:
> pci_save_state(pdev);
> adev->pci_state = pci_store_saved_state(pdev);
> pci_restore_state(adev->pdev);
>
> then in the case of dpc, do:
> pci_load_saved_state(pdev, adev->pci_state);
>
> For all the other cases, just leave the code as is.
>
> Alex
>
>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 +++++++++++++++++++++++++++---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
>>   drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
>>   drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
>>   5 files changed, 66 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 3489622..42ee208 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -992,7 +992,9 @@ struct amdgpu_device {
>>          atomic_t                        throttling_logging_enabled;
>>          struct ratelimit_state          throttling_logging_rs;
>>          uint32_t                        ras_features;
>> +
> Unrelated whitespace changes.
>
>>          bool                            in_dpc;
>> +       struct pci_saved_state          *pci_state;
>>   };
>>
>>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>> @@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
>>   pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
>>   void amdgpu_pci_resume(struct pci_dev *pdev);
>>
>> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
>> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
>> +
>> +
>>
>>   #include "amdgpu_object.h"
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index d9e3994..2c088df 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -1283,7 +1283,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>>
>>                  pci_set_power_state(dev->pdev, PCI_D0);
>> -               pci_restore_state(dev->pdev);
>> +               amdgpu_device_load_pci_state(dev->pdev);
>>                  r = pci_enable_device(dev->pdev);
>>                  if (r)
>>                          DRM_WARN("pci_enable_device failed (%d)\n", r);
>> @@ -1296,7 +1296,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>>                  drm_kms_helper_poll_disable(dev);
>>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>>                  amdgpu_device_suspend(dev, true);
>> -               pci_save_state(dev->pdev);
>> +               amdgpu_device_cache_pci_state(dev->pdev);
>>                  /* Shut down the device */
>>                  pci_disable_device(dev->pdev);
>>                  pci_set_power_state(dev->pdev, PCI_D3cold);
>> @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>          if (r)
>>                  dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>>
>> -       if (pci_save_state(pdev))
>> -               DRM_ERROR("Failed to save PCI state!!\n");
>> +       /* Have stored pci confspace at hand for restore in sudden PCI error */
>> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
>> +               DRM_WARN("Failed to cache PCI state!");
>>
>>          return 0;
>>
>> @@ -3430,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>          flush_delayed_work(&adev->delayed_init_work);
>>          adev->shutdown = true;
>>
>> +       kfree(adev->pci_state);
>> +
>>          /* make sure IB test finished before entering exclusive mode
>>           * to avoid preemption on IB test
>>           * */
>> @@ -4855,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>          /* wait for asic to come out of reset */
>>          msleep(500);
>>
>> -       pci_restore_state(pdev);
>> +       amdgpu_device_load_pci_state(pdev);
>>
>>          /* confirm  ASIC came out of reset */
>>          for (i = 0; i < adev->usec_timeout; i++) {
>> @@ -4934,8 +4937,10 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>
>>   out:
>>
>> -       if (!r)
>> +       if (!r) {
>> +               amdgpu_device_cache_pci_state(adev->pdev);
>>                  DRM_INFO("PCIe error recovery succeeded\n");
>> +       }
>>          else {
>>                  DRM_ERROR("PCIe error recovery failed, err:%d", r);
>>                  amdgpu_device_unlock_adev(adev);
>> @@ -4974,3 +4979,46 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>>
>>          amdgpu_device_unlock_adev(adev);
>>   }
>> +
>> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
>> +{
>> +       struct drm_device *dev = pci_get_drvdata(pdev);
>> +       struct amdgpu_device *adev = drm_to_adev(dev);
>> +       int r;
>> +
>> +       r = pci_save_state(pdev);
>> +       if (!r) {
>> +               kfree(adev->pci_state);
>> +
>> +               adev->pci_state = pci_store_saved_state(pdev);
>> +
>> +               if (!adev->pci_state) {
>> +                       DRM_ERROR("Failed to store PCI saved state");
>> +                       return false;
>> +               }
>> +       } else {
>> +               DRM_WARN("Failed to save PCI state, err:%d\n", r);
>> +               return false;
>> +       }
>> +
>> +       return true;
>> +}
>> +
>> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
>> +{
>> +       struct drm_device *dev = pci_get_drvdata(pdev);
>> +       struct amdgpu_device *adev = drm_to_adev(dev);
>> +       int r;
>> +
>> +       if (!adev->pci_state)
>> +               return false;
>> +
>> +       r = pci_load_saved_state(pdev, adev->pci_state);
>> +
>> +       if (!r) {
>> +               pci_restore_state(pdev);
>> +       } else {
>> +               DRM_WARN("Failed to load PCI state, err:%d\n", r);
>> +               return false;
>> +       }
>> +}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 4bbcc70..7a6482a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
>>                  if (amdgpu_is_atpx_hybrid()) {
>>                          pci_ignore_hotplug(pdev);
>>                  } else {
>> -                       pci_save_state(pdev);
>> +                       amdgpu_device_cache_pci_state(pdev);
>>                          pci_disable_device(pdev);
>>                          pci_ignore_hotplug(pdev);
>>                          pci_set_power_state(pdev, PCI_D3cold);
>> @@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
>>                          pci_set_master(pdev);
>>                  } else {
>>                          pci_set_power_state(pdev, PCI_D0);
>> -                       pci_restore_state(pdev);
>> +                       amdgpu_device_load_pci_state(pdev);
>>                          ret = pci_enable_device(pdev);
>>                          if (ret)
>>                                  return ret;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
>> index 4d14023..0ec6603 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>> @@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>>          /* disable BM */
>>          pci_clear_master(adev->pdev);
>>
>> -       pci_save_state(adev->pdev);
>> +       amdgpu_device_cache_pci_state(adev->pdev);
>>
>>          if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
>>                  dev_info(adev->dev, "GPU smu mode1 reset\n");
>> @@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>>
>>          if (ret)
>>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
>> -       pci_restore_state(adev->pdev);
>> +       amdgpu_device_load_pci_state(adev->pdev);
>>
>>          /* wait for asic to come out of reset */
>>          for (i = 0; i < adev->usec_timeout; i++) {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
>> index 2f93c47..ddd55e3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
>> @@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
>>          /* disable BM */
>>          pci_clear_master(adev->pdev);
>>
>> -       pci_save_state(adev->pdev);
>> +       amdgpu_device_cache_pci_state(adev->pdev);
>>
>>          ret = psp_gpu_reset(adev);
>>          if (ret)
>>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
>>
>> -       pci_restore_state(adev->pdev);
>> +       amdgpu_device_load_pci_state(adev->pdev);
>>
>>          /* wait for asic to come out of reset */
>>          for (i = 0; i < adev->usec_timeout; i++) {
>> --
>> 2.7.4
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7C5faaaf11d6024193cb2508d849d38fae%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637340520282366646&amp;sdata=ziPLmwT95XAnKftzplY2QBHq%2BL1961d0SdaMPAlVMcE%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.
  2020-08-26 15:31     ` Andrey Grodzovsky
@ 2020-08-26 15:39       ` Alex Deucher
  0 siblings, 0 replies; 29+ messages in thread
From: Alex Deucher @ 2020-08-26 15:39 UTC (permalink / raw)
  To: Andrey Grodzovsky; +Cc: Deucher, Alexander, Nirmoy, amd-gfx list

On Wed, Aug 26, 2020 at 11:31 AM Andrey Grodzovsky
<Andrey.Grodzovsky@amd.com> wrote:
>
>
> On 8/26/20 11:20 AM, Alex Deucher wrote:
> > On Wed, Aug 26, 2020 at 10:46 AM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >> DPC recovery after prev. DPC recovery or after prev. MODE1 reset fails
> >> unles you save the cashe the saved PCI confspace to load it after
> >> each new reset.
> >> Also use same cached state for other use case of restoring PCI confspace
> >> such as GPU mode1 or VGA switheroo.
> >>
> > We don't want to keep the saved state around in the pci core
> > otherwise, the pci core will assume we are managing the saved state
> > for suspend and resume.
>
>
> Are you talking about all the references to pci_dev->state_saved in pci-driver.c ?

Right.

Alex

>
> Andrey
>
>
> > I think we want logic like this:
> >
> > At driver load time:
> > pci_save_state(pdev);
> > adev->pci_state = pci_store_saved_state(pdev);
> > pci_restore_state(adev->pdev);
> >
> > then in the case of dpc, do:
> > pci_load_saved_state(pdev, adev->pci_state);
> >
> > For all the other cases, just leave the code as is.
> >
> > Alex
> >
> >
> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 +++++++++++++++++++++++++++---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
> >>   drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
> >>   drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
> >>   5 files changed, 66 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> index 3489622..42ee208 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> @@ -992,7 +992,9 @@ struct amdgpu_device {
> >>          atomic_t                        throttling_logging_enabled;
> >>          struct ratelimit_state          throttling_logging_rs;
> >>          uint32_t                        ras_features;
> >> +
> > Unrelated whitespace changes.
> >
> >>          bool                            in_dpc;
> >> +       struct pci_saved_state          *pci_state;
> >>   };
> >>
> >>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> >> @@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> >>   pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> >>   void amdgpu_pci_resume(struct pci_dev *pdev);
> >>
> >> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
> >> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
> >> +
> >> +
> >>
> >>   #include "amdgpu_object.h"
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index d9e3994..2c088df 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -1283,7 +1283,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
> >>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
> >>
> >>                  pci_set_power_state(dev->pdev, PCI_D0);
> >> -               pci_restore_state(dev->pdev);
> >> +               amdgpu_device_load_pci_state(dev->pdev);
> >>                  r = pci_enable_device(dev->pdev);
> >>                  if (r)
> >>                          DRM_WARN("pci_enable_device failed (%d)\n", r);
> >> @@ -1296,7 +1296,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
> >>                  drm_kms_helper_poll_disable(dev);
> >>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
> >>                  amdgpu_device_suspend(dev, true);
> >> -               pci_save_state(dev->pdev);
> >> +               amdgpu_device_cache_pci_state(dev->pdev);
> >>                  /* Shut down the device */
> >>                  pci_disable_device(dev->pdev);
> >>                  pci_set_power_state(dev->pdev, PCI_D3cold);
> >> @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >>          if (r)
> >>                  dev_err(adev->dev, "amdgpu_pmu_init failed\n");
> >>
> >> -       if (pci_save_state(pdev))
> >> -               DRM_ERROR("Failed to save PCI state!!\n");
> >> +       /* Have stored pci confspace at hand for restore in sudden PCI error */
> >> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
> >> +               DRM_WARN("Failed to cache PCI state!");
> >>
> >>          return 0;
> >>
> >> @@ -3430,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
> >>          flush_delayed_work(&adev->delayed_init_work);
> >>          adev->shutdown = true;
> >>
> >> +       kfree(adev->pci_state);
> >> +
> >>          /* make sure IB test finished before entering exclusive mode
> >>           * to avoid preemption on IB test
> >>           * */
> >> @@ -4855,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> >>          /* wait for asic to come out of reset */
> >>          msleep(500);
> >>
> >> -       pci_restore_state(pdev);
> >> +       amdgpu_device_load_pci_state(pdev);
> >>
> >>          /* confirm  ASIC came out of reset */
> >>          for (i = 0; i < adev->usec_timeout; i++) {
> >> @@ -4934,8 +4937,10 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> >>
> >>   out:
> >>
> >> -       if (!r)
> >> +       if (!r) {
> >> +               amdgpu_device_cache_pci_state(adev->pdev);
> >>                  DRM_INFO("PCIe error recovery succeeded\n");
> >> +       }
> >>          else {
> >>                  DRM_ERROR("PCIe error recovery failed, err:%d", r);
> >>                  amdgpu_device_unlock_adev(adev);
> >> @@ -4974,3 +4979,46 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
> >>
> >>          amdgpu_device_unlock_adev(adev);
> >>   }
> >> +
> >> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
> >> +{
> >> +       struct drm_device *dev = pci_get_drvdata(pdev);
> >> +       struct amdgpu_device *adev = drm_to_adev(dev);
> >> +       int r;
> >> +
> >> +       r = pci_save_state(pdev);
> >> +       if (!r) {
> >> +               kfree(adev->pci_state);
> >> +
> >> +               adev->pci_state = pci_store_saved_state(pdev);
> >> +
> >> +               if (!adev->pci_state) {
> >> +                       DRM_ERROR("Failed to store PCI saved state");
> >> +                       return false;
> >> +               }
> >> +       } else {
> >> +               DRM_WARN("Failed to save PCI state, err:%d\n", r);
> >> +               return false;
> >> +       }
> >> +
> >> +       return true;
> >> +}
> >> +
> >> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
> >> +{
> >> +       struct drm_device *dev = pci_get_drvdata(pdev);
> >> +       struct amdgpu_device *adev = drm_to_adev(dev);
> >> +       int r;
> >> +
> >> +       if (!adev->pci_state)
> >> +               return false;
> >> +
> >> +       r = pci_load_saved_state(pdev, adev->pci_state);
> >> +
> >> +       if (!r) {
> >> +               pci_restore_state(pdev);
> >> +       } else {
> >> +               DRM_WARN("Failed to load PCI state, err:%d\n", r);
> >> +               return false;
> >> +       }
> >> +}
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> index 4bbcc70..7a6482a 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> @@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
> >>                  if (amdgpu_is_atpx_hybrid()) {
> >>                          pci_ignore_hotplug(pdev);
> >>                  } else {
> >> -                       pci_save_state(pdev);
> >> +                       amdgpu_device_cache_pci_state(pdev);
> >>                          pci_disable_device(pdev);
> >>                          pci_ignore_hotplug(pdev);
> >>                          pci_set_power_state(pdev, PCI_D3cold);
> >> @@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
> >>                          pci_set_master(pdev);
> >>                  } else {
> >>                          pci_set_power_state(pdev, PCI_D0);
> >> -                       pci_restore_state(pdev);
> >> +                       amdgpu_device_load_pci_state(pdev);
> >>                          ret = pci_enable_device(pdev);
> >>                          if (ret)
> >>                                  return ret;
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
> >> index 4d14023..0ec6603 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> >> @@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
> >>          /* disable BM */
> >>          pci_clear_master(adev->pdev);
> >>
> >> -       pci_save_state(adev->pdev);
> >> +       amdgpu_device_cache_pci_state(adev->pdev);
> >>
> >>          if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
> >>                  dev_info(adev->dev, "GPU smu mode1 reset\n");
> >> @@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
> >>
> >>          if (ret)
> >>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
> >> -       pci_restore_state(adev->pdev);
> >> +       amdgpu_device_load_pci_state(adev->pdev);
> >>
> >>          /* wait for asic to come out of reset */
> >>          for (i = 0; i < adev->usec_timeout; i++) {
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >> index 2f93c47..ddd55e3 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >> @@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
> >>          /* disable BM */
> >>          pci_clear_master(adev->pdev);
> >>
> >> -       pci_save_state(adev->pdev);
> >> +       amdgpu_device_cache_pci_state(adev->pdev);
> >>
> >>          ret = psp_gpu_reset(adev);
> >>          if (ret)
> >>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
> >>
> >> -       pci_restore_state(adev->pdev);
> >> +       amdgpu_device_load_pci_state(adev->pdev);
> >>
> >>          /* wait for asic to come out of reset */
> >>          for (i = 0; i < adev->usec_timeout; i++) {
> >> --
> >> 2.7.4
> >>
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@lists.freedesktop.org
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7C5faaaf11d6024193cb2508d849d38fae%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637340520282366646&amp;sdata=ziPLmwT95XAnKftzplY2QBHq%2BL1961d0SdaMPAlVMcE%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-08-26 14:46 ` [PATCH 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
  2020-08-26 15:09   ` Alex Deucher
@ 2020-08-26 16:53   ` Nirmoy
  2020-08-29  2:03   ` Luben Tuikov
  2 siblings, 0 replies; 29+ messages in thread
From: Nirmoy @ 2020-08-26 16:53 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: alexander.deucher


On 8/26/20 4:46 PM, Andrey Grodzovsky wrote:
> At this point the ASIC is already post reset by the HW/PSP
> so the HW not in proper state to be configured for suspension,
> some bloks might be even gated and so best is to avoid touching it.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 37 ++++++++++++++++++++++++++----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
>   drivers/gpu/drm/amd/amdgpu/atom.c          |  1 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 +++++++++------
>   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  3 +++
>   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
>   8 files changed, 64 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 3399242..3489622 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,6 +992,7 @@ struct amdgpu_device {
>   	atomic_t			throttling_logging_enabled;
>   	struct ratelimit_state		throttling_logging_rs;
>   	uint32_t			ras_features;
> +	bool                            in_dpc;


We should have more generic name instead of "in_dpc", may be 
"in_pci_error" ?


Nirmoy


>   };
>   
>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 84f8d14..e1bbefd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>   {
>   	uint32_t ret;
>   
> +	if (adev->in_dpc)
> +		return 0;
> +
>   	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>   		return amdgpu_kiq_rreg(adev, reg);
>   
> @@ -350,8 +353,10 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>    *
>    * Returns the 8 bit value from the offset specified.
>    */
> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
> -{
> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> +
> +	if (adev->in_dpc)
> +		return 0;
>   
>   	if (offset < adev->rmmio_size)
>   		return (readb(adev->rmmio + offset));
> @@ -373,8 +378,10 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>    *
>    * Writes the value specified to the offset specified.
>    */
> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
> -{
> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
> +
> +	if (adev->in_dpc)
> +		return;
>   
>   	if (offset < adev->rmmio_size)
>   		writeb(value, adev->rmmio + offset);
> @@ -384,6 +391,8 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>   
>   void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
>   {
> +	if (adev->in_dpc)
> +		return;
>   
>   	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
>   
> @@ -412,6 +421,8 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
>   void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>   		    uint32_t acc_flags)
>   {
> +	if (adev->in_dpc)
> +		return;
>   
>   	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>   		return amdgpu_kiq_wreg(adev, reg, v);
> @@ -427,6 +438,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>   void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>   		    uint32_t acc_flags)
>   {
> +	if (adev->in_dpc)
> +		return;
> +
>   	if (amdgpu_sriov_fullaccess(adev) &&
>   		adev->gfx.rlc.funcs &&
>   		adev->gfx.rlc.funcs->is_rlcg_access_range) {
> @@ -448,6 +462,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
>    */
>   u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>   {
> +	if (adev->in_dpc)
> +		return 0;
> +
>   	if ((reg * 4) < adev->rio_mem_size)
>   		return ioread32(adev->rio_mem + (reg * 4));
>   	else {
> @@ -467,6 +484,8 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>    */
>   void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>   {
> +	if (adev->in_dpc)
> +		return;
>   
>   	if ((reg * 4) < adev->rio_mem_size)
>   		iowrite32(v, adev->rio_mem + (reg * 4));
> @@ -487,6 +506,8 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>    */
>   u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>   {
> +	if (adev->in_dpc)
> +		return 0;
>   
>   	if (index < adev->doorbell.num_doorbells) {
>   		return readl(adev->doorbell.ptr + index);
> @@ -508,6 +529,8 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>    */
>   void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>   {
> +	if (adev->in_dpc)
> +		return;
>   
>   	if (index < adev->doorbell.num_doorbells) {
>   		writel(v, adev->doorbell.ptr + index);
> @@ -527,6 +550,8 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>    */
>   u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>   {
> +	if (adev->in_dpc)
> +		return 0;
>   
>   	if (index < adev->doorbell.num_doorbells) {
>   		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
> @@ -548,6 +573,8 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>    */
>   void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
>   {
> +	if (adev->in_dpc)
> +			return;
>   
>   	if (index < adev->doorbell.num_doorbells) {
>   		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
> @@ -4790,7 +4817,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>   
>   	pci_restore_state(pdev);
>   
> +	adev->in_dpc = true;
>   	r = amdgpu_device_ip_suspend(adev);
> +	adev->in_dpc = false;
>   	if (r)
>   		goto out;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index d698142..50c42c9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -693,6 +693,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
>   	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>   	struct amdgpu_ring *ring = &kiq->ring;
>   
> +	if (adev->in_dpc)
> +		return 0;
> +
>   	BUG_ON(!ring->funcs->emit_rreg);
>   
>   	spin_lock_irqsave(&kiq->ring_lock, flags);
> @@ -757,6 +760,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>   
>   	BUG_ON(!ring->funcs->emit_wreg);
>   
> +	if (adev->in_dpc)
> +		return;
> +
>   	spin_lock_irqsave(&kiq->ring_lock, flags);
>   	amdgpu_ring_alloc(ring, 32);
>   	amdgpu_ring_emit_wreg(ring, reg, v);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index d6c38e2..555ef59 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -219,6 +219,9 @@ int psp_wait_for(struct psp_context *psp, uint32_t reg_index,
>   	int i;
>   	struct amdgpu_device *adev = psp->adev;
>   
> +	if (psp->adev->in_dpc)
> +		return 0;
> +
>   	for (i = 0; i < adev->usec_timeout; i++) {
>   		val = RREG32(reg_index);
>   		if (check_changed) {
> @@ -245,6 +248,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
>   	bool ras_intr = false;
>   	bool skip_unsupport = false;
>   
> +	if (psp->adev->in_dpc)
> +		return 0;
> +
>   	mutex_lock(&psp->mutex);
>   
>   	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
> diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c
> index 4cfc786..613dac1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/atom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/atom.c
> @@ -750,6 +750,7 @@ static void atom_op_jump(atom_exec_context *ctx, int *ptr, int arg)
>   					DRM_ERROR("atombios stuck in loop for more than %dsecs aborting\n",
>   						  ATOM_CMD_TIMEOUT_SEC);
>   					ctx->abort = true;
> +					dump_stack();
>   				}
>   			} else {
>   				/* jiffies wrap around we will just wait a little longer */
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 2db195e..86f268e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -6980,15 +6980,19 @@ static int gfx_v10_0_hw_fini(void *handle)
>   
>   	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
>   	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
> +
> +	if (!adev->in_dpc) {
>   #ifndef BRING_UP_DEBUG
> -	if (amdgpu_async_gfx_ring) {
> -		r = gfx_v10_0_kiq_disable_kgq(adev);
> -		if (r)
> -			DRM_ERROR("KGQ disable failed\n");
> -	}
> +		if (amdgpu_async_gfx_ring) {
> +			r = gfx_v10_0_kiq_disable_kgq(adev);
> +			if (r)
> +				DRM_ERROR("KGQ disable failed\n");
> +		}
>   #endif
> -	if (amdgpu_gfx_disable_kcq(adev))
> -		DRM_ERROR("KCQ disable failed\n");
> +		if (amdgpu_gfx_disable_kcq(adev))
> +			DRM_ERROR("KCQ disable failed\n");
> +	}
> +
>   	if (amdgpu_sriov_vf(adev)) {
>   		gfx_v10_0_cp_gfx_enable(adev, false);
>   		/* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index 8462b30..306461d 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -224,9 +224,12 @@ int smu_dpm_set_power_gate(struct smu_context *smu, uint32_t block_type,
>   {
>   	int ret = 0;
>   
> +
>   	if (!smu->pm_enabled || !smu->adev->pm.dpm_enabled)
>   		return -EOPNOTSUPP;
>   
> +
> +
>   	switch (block_type) {
>   	/*
>   	 * Some legacy code of amdgpu_vcn.c and vcn_v2*.c still uses
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index a58ea08..02cf55e 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -112,6 +112,9 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>   	struct amdgpu_device *adev = smu->adev;
>   	int ret = 0, index = 0;
>   
> +	if (smu->adev->in_dpc)
> +		return 0;
> +
>   	index = smu_cmn_to_asic_specific_index(smu,
>   					       CMN2ASIC_MAPPING_MSG,
>   					       msg);
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-26 14:46 ` [PATCH 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
  2020-08-26 15:08   ` Alex Deucher
@ 2020-08-27  1:23   ` Li, Dennis
  2020-08-27 13:38     ` Andrey Grodzovsky
  2020-08-29  2:07   ` Luben Tuikov
  2 siblings, 1 reply; 29+ messages in thread
From: Li, Dennis @ 2020-08-27  1:23 UTC (permalink / raw)
  To: Grodzovsky, Andrey, amd-gfx
  Cc: Deucher, Alexander, Grodzovsky, Andrey, Das, Nirmoy

[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,
     I found that the sequences of amdgpu_pci_slot_reset is mostly similar to amdgpu_do_asic_reset. Could help us refactor them to reuse more codes? 

Best Regards
Dennis Li
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Andrey Grodzovsky
Sent: Wednesday, August 26, 2020 10:46 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Das, Nirmoy <Nirmoy.Das@amd.com>
Subject: [PATCH 1/7] drm/amdgpu: Implement DPC recovery

Add DPC handlers with basic recovery functionality.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 181 ++++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
 3 files changed, 196 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 49ea9fa..3399242 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -49,6 +49,8 @@
 #include <linux/rbtree.h>
 #include <linux/hashtable.h>
 #include <linux/dma-fence.h>
+#include <linux/pci.h>
+#include <linux/aer.h>
 
 #include <drm/ttm/ttm_bo_api.h>
 #include <drm/ttm/ttm_bo_driver.h>
@@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return  void amdgpu_register_gpu_instance(struct amdgpu_device *adev);  void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
 
+pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
+					   pci_channel_state_t state);
+pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev); 
+pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev); void 
+amdgpu_pci_resume(struct pci_dev *pdev);
+
+
 #include "amdgpu_object.h"
 
 /* used by df_v3_6.c and amdgpu_pmu.c */ diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5a948ed..84f8d14 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -350,7 +350,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
  *
  * Returns the 8 bit value from the offset specified.
  */
-uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+
 	if (offset < adev->rmmio_size)
 		return (readb(adev->rmmio + offset));
 	BUG();
@@ -371,7 +373,9 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
  *
  * Writes the value specified to the offset specified.
  */
-void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
+void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, 
+uint8_t value) {
+
 	if (offset < adev->rmmio_size)
 		writeb(value, adev->rmmio + offset);
 	else
@@ -380,6 +384,7 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
 
 void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)  {
+
 	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
 
 	if ((reg * 4) < adev->rmmio_size)
@@ -407,6 +412,7 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,  void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 		    uint32_t acc_flags)
 {
+
 	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
 		return amdgpu_kiq_wreg(adev, reg, v);
 
@@ -461,6 +467,7 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
  */
 void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)  {
+
 	if ((reg * 4) < adev->rio_mem_size)
 		iowrite32(v, adev->rio_mem + (reg * 4));
 	else {
@@ -480,6 +487,7 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
  */
 u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)  {
+
 	if (index < adev->doorbell.num_doorbells) {
 		return readl(adev->doorbell.ptr + index);
 	} else {
@@ -500,6 +508,7 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)  {
+
 	if (index < adev->doorbell.num_doorbells) {
 		writel(v, adev->doorbell.ptr + index);
 	} else {
@@ -518,6 +527,7 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
  */
 u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)  {
+
 	if (index < adev->doorbell.num_doorbells) {
 		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
 	} else {
@@ -538,6 +548,7 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)  {
+
 	if (index < adev->doorbell.num_doorbells) {
 		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
 	} else {
@@ -2989,6 +3000,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
 	NULL
 };
 
+
 /**
  * amdgpu_device_init - initialize the driver
  *
@@ -3207,6 +3219,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 		}
 	}
 
+	pci_enable_pcie_error_reporting(adev->ddev.pdev);
+
+
 	/* Post card if necessary */
 	if (amdgpu_device_need_post(adev)) {
 		if (!adev->bios) {
@@ -3359,6 +3374,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		dev_err(adev->dev, "amdgpu_pmu_init failed\n");
 
+	if (pci_save_state(pdev))
+		DRM_ERROR("Failed to save PCI state!!\n");
+
 	return 0;
 
 failed:
@@ -4701,3 +4719,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
 
 	return 0;
 }
+
+/**
+ * amdgpu_pci_error_detected - Called when a PCI error is detected.
+ * @pdev: PCI device struct
+ * @state: PCI channel state
+ *
+ * Description: Called when a PCI error is detected.
+ *
+ * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
+ */
+pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, 
+pci_channel_state_t state) {
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+
+	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
+
+	switch (state) {
+	case pci_channel_io_normal:
+		return PCI_ERS_RESULT_CAN_RECOVER;
+	case pci_channel_io_frozen: {
+		/* Fatal error, prepare for slot reset */
+
+		amdgpu_device_lock_adev(adev);
+		return PCI_ERS_RESULT_NEED_RESET;
+	}
+	case pci_channel_io_perm_failure:
+		/* Permanent error, prepare for device removal */
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
+ * @pdev: pointer to PCI device
+ */
+pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev) {
+
+	DRM_INFO("PCI error: mmio enabled callback!!\n");
+
+	/* TODO - dump whatever for debugging purposes */
+
+	/* This called only if amdgpu_pci_error_detected returns
+	 * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
+	 * works, no need to reset slot.
+	 */
+
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
+ * @pdev: PCI device struct
+ *
+ * Description: This routine is called by the pci error recovery
+ * code after the PCI slot has been reset, just before we
+ * should resume normal operations.
+ */
+pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev) {
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+	bool vram_lost;
+
+	DRM_INFO("PCI error: slot reset callback!!\n");
+
+	pci_restore_state(pdev);
+
+	r = amdgpu_device_ip_suspend(adev);
+	if (r)
+		goto out;
+
+
+	/* post card */
+	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_ip_resume_phase1(adev);
+	if (r)
+		goto out;
+
+	vram_lost = amdgpu_device_check_vram_lost(adev);
+	if (vram_lost) {
+		DRM_INFO("VRAM is lost due to GPU reset!\n");
+		amdgpu_inc_vram_lost(adev);
+	}
+
+	r = amdgpu_gtt_mgr_recover(
+		&adev->mman.bdev.man[TTM_PL_TT]);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_fw_loading(adev);
+	if (r)
+		return r;
+
+	r = amdgpu_device_ip_resume_phase2(adev);
+	if (r)
+		goto out;
+
+	if (vram_lost)
+		amdgpu_device_fill_reset_magic(adev);
+
+	/*
+	 * Add this ASIC as tracked as reset was already
+	 * complete successfully.
+	 */
+	amdgpu_register_gpu_instance(adev);
+
+	r = amdgpu_device_ip_late_init(adev);
+	if (r)
+		goto out;
+
+	amdgpu_fbdev_set_suspend(adev, 0);
+
+	/* must succeed. */
+	amdgpu_ras_resume(adev);
+
+
+	amdgpu_irq_gpu_reset_resume_helper(adev);
+	r = amdgpu_ib_ring_tests(adev);
+	if (r)
+		goto out;
+
+	r = amdgpu_device_recover_vram(adev);
+
+out:
+
+	if (!r)
+		DRM_INFO("PCIe error recovery succeeded\n");
+	else {
+		DRM_ERROR("PCIe error recovery failed, err:%d", r);
+		amdgpu_device_unlock_adev(adev);
+	}
+
+	return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED; }
+
+/**
+ * amdgpu_pci_resume() - resume normal ops after PCI reset
+ * @pdev: pointer to PCI device
+ *
+ * Called when the error recovery driver tells us that its
+ * OK to resume normal operation. Use completion to allow
+ * halted scsi ops to resume.
+ */
+void amdgpu_pci_resume(struct pci_dev *pdev) {
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+
+	amdgpu_device_unlock_adev(adev);
+
+	DRM_INFO("PCI error: resume callback!!\n"); }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d984c6a..4bbcc70 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -31,7 +31,6 @@
 #include <drm/drm_pciids.h>
 #include <linux/console.h>
 #include <linux/module.h>
-#include <linux/pci.h>
 #include <linux/pm_runtime.h>
 #include <linux/vga_switcheroo.h>
 #include <drm/drm_probe_helper.h>
@@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
 	.patchlevel = KMS_DRIVER_PATCHLEVEL,
 };
 
+static struct pci_error_handlers amdgpu_pci_err_handler = {
+	.error_detected	= amdgpu_pci_error_detected,
+	.mmio_enabled	= amdgpu_pci_mmio_enabled,
+	.slot_reset	= amdgpu_pci_slot_reset,
+	.resume		= amdgpu_pci_resume,
+};
+
 static struct pci_driver amdgpu_kms_pci_driver = {
 	.name = DRIVER_NAME,
 	.id_table = pciidlist,
@@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
 	.remove = amdgpu_pci_remove,
 	.shutdown = amdgpu_pci_shutdown,
 	.driver.pm = &amdgpu_pm_ops,
+	.err_handler = &amdgpu_pci_err_handler,
 };
 
 static int __init amdgpu_init(void)
--
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7CDennis.Li%40amd.com%7Cfbaac227eff74bcab03108d849cedd81%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637340500116686286&amp;sdata=lYjIk50kyGIHG7UaAzx%2FswghLD8gc2DdCXUmTNj4EkU%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-27  1:23   ` Li, Dennis
@ 2020-08-27 13:38     ` Andrey Grodzovsky
  0 siblings, 0 replies; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-27 13:38 UTC (permalink / raw)
  To: Li, Dennis, amd-gfx; +Cc: Deucher, Alexander, Das, Nirmoy

Indeed, I noticed it later and did it in patch 6 (merging 1 and 6 was very messy 
due to following changes so I just kept it as a separate patch).

Andrey

On 8/26/20 9:23 PM, Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Andrey,
>       I found that the sequences of amdgpu_pci_slot_reset is mostly similar to amdgpu_do_asic_reset. Could help us refactor them to reuse more codes?
>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Andrey Grodzovsky
> Sent: Wednesday, August 26, 2020 10:46 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Das, Nirmoy <Nirmoy.Das@amd.com>
> Subject: [PATCH 1/7] drm/amdgpu: Implement DPC recovery
>
> Add DPC handlers with basic recovery functionality.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 181 ++++++++++++++++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
>   3 files changed, 196 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 49ea9fa..3399242 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -49,6 +49,8 @@
>   #include <linux/rbtree.h>
>   #include <linux/hashtable.h>
>   #include <linux/dma-fence.h>
> +#include <linux/pci.h>
> +#include <linux/aer.h>
>   
>   #include <drm/ttm/ttm_bo_api.h>
>   #include <drm/ttm/ttm_bo_driver.h>
> @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return  void amdgpu_register_gpu_instance(struct amdgpu_device *adev);  void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
>   
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
> +					   pci_channel_state_t state);
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev); void
> +amdgpu_pci_resume(struct pci_dev *pdev);
> +
> +
>   #include "amdgpu_object.h"
>   
>   /* used by df_v3_6.c and amdgpu_pmu.c */ diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5a948ed..84f8d14 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -350,7 +350,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>    *
>    * Returns the 8 bit value from the offset specified.
>    */
> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> +
>   	if (offset < adev->rmmio_size)
>   		return (readb(adev->rmmio + offset));
>   	BUG();
> @@ -371,7 +373,9 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
>    *
>    * Writes the value specified to the offset specified.
>    */
> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset,
> +uint8_t value) {
> +
>   	if (offset < adev->rmmio_size)
>   		writeb(value, adev->rmmio + offset);
>   	else
> @@ -380,6 +384,7 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>   
>   void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)  {
> +
>   	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
>   
>   	if ((reg * 4) < adev->rmmio_size)
> @@ -407,6 +412,7 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,  void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>   		    uint32_t acc_flags)
>   {
> +
>   	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>   		return amdgpu_kiq_wreg(adev, reg, v);
>   
> @@ -461,6 +467,7 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>    */
>   void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)  {
> +
>   	if ((reg * 4) < adev->rio_mem_size)
>   		iowrite32(v, adev->rio_mem + (reg * 4));
>   	else {
> @@ -480,6 +487,7 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>    */
>   u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)  {
> +
>   	if (index < adev->doorbell.num_doorbells) {
>   		return readl(adev->doorbell.ptr + index);
>   	} else {
> @@ -500,6 +508,7 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>    */
>   void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)  {
> +
>   	if (index < adev->doorbell.num_doorbells) {
>   		writel(v, adev->doorbell.ptr + index);
>   	} else {
> @@ -518,6 +527,7 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>    */
>   u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)  {
> +
>   	if (index < adev->doorbell.num_doorbells) {
>   		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
>   	} else {
> @@ -538,6 +548,7 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>    */
>   void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)  {
> +
>   	if (index < adev->doorbell.num_doorbells) {
>   		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
>   	} else {
> @@ -2989,6 +3000,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>   	NULL
>   };
>   
> +
>   /**
>    * amdgpu_device_init - initialize the driver
>    *
> @@ -3207,6 +3219,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   		}
>   	}
>   
> +	pci_enable_pcie_error_reporting(adev->ddev.pdev);
> +
> +
>   	/* Post card if necessary */
>   	if (amdgpu_device_need_post(adev)) {
>   		if (!adev->bios) {
> @@ -3359,6 +3374,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	if (r)
>   		dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>   
> +	if (pci_save_state(pdev))
> +		DRM_ERROR("Failed to save PCI state!!\n");
> +
>   	return 0;
>   
>   failed:
> @@ -4701,3 +4719,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>   
>   	return 0;
>   }
> +
> +/**
> + * amdgpu_pci_error_detected - Called when a PCI error is detected.
> + * @pdev: PCI device struct
> + * @state: PCI channel state
> + *
> + * Description: Called when a PCI error is detected.
> + *
> + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
> + */
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
> +pci_channel_state_t state) {
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
> +
> +	switch (state) {
> +	case pci_channel_io_normal:
> +		return PCI_ERS_RESULT_CAN_RECOVER;
> +	case pci_channel_io_frozen: {
> +		/* Fatal error, prepare for slot reset */
> +
> +		amdgpu_device_lock_adev(adev);
> +		return PCI_ERS_RESULT_NEED_RESET;
> +	}
> +	case pci_channel_io_perm_failure:
> +		/* Permanent error, prepare for device removal */
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	}
> +	return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +/**
> + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
> + * @pdev: pointer to PCI device
> + */
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev) {
> +
> +	DRM_INFO("PCI error: mmio enabled callback!!\n");
> +
> +	/* TODO - dump whatever for debugging purposes */
> +
> +	/* This called only if amdgpu_pci_error_detected returns
> +	 * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
> +	 * works, no need to reset slot.
> +	 */
> +
> +	return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
> + * @pdev: PCI device struct
> + *
> + * Description: This routine is called by the pci error recovery
> + * code after the PCI slot has been reset, just before we
> + * should resume normal operations.
> + */
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev) {
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int r;
> +	bool vram_lost;
> +
> +	DRM_INFO("PCI error: slot reset callback!!\n");
> +
> +	pci_restore_state(pdev);
> +
> +	r = amdgpu_device_ip_suspend(adev);
> +	if (r)
> +		goto out;
> +
> +
> +	/* post card */
> +	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_ip_resume_phase1(adev);
> +	if (r)
> +		goto out;
> +
> +	vram_lost = amdgpu_device_check_vram_lost(adev);
> +	if (vram_lost) {
> +		DRM_INFO("VRAM is lost due to GPU reset!\n");
> +		amdgpu_inc_vram_lost(adev);
> +	}
> +
> +	r = amdgpu_gtt_mgr_recover(
> +		&adev->mman.bdev.man[TTM_PL_TT]);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_fw_loading(adev);
> +	if (r)
> +		return r;
> +
> +	r = amdgpu_device_ip_resume_phase2(adev);
> +	if (r)
> +		goto out;
> +
> +	if (vram_lost)
> +		amdgpu_device_fill_reset_magic(adev);
> +
> +	/*
> +	 * Add this ASIC as tracked as reset was already
> +	 * complete successfully.
> +	 */
> +	amdgpu_register_gpu_instance(adev);
> +
> +	r = amdgpu_device_ip_late_init(adev);
> +	if (r)
> +		goto out;
> +
> +	amdgpu_fbdev_set_suspend(adev, 0);
> +
> +	/* must succeed. */
> +	amdgpu_ras_resume(adev);
> +
> +
> +	amdgpu_irq_gpu_reset_resume_helper(adev);
> +	r = amdgpu_ib_ring_tests(adev);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_recover_vram(adev);
> +
> +out:
> +
> +	if (!r)
> +		DRM_INFO("PCIe error recovery succeeded\n");
> +	else {
> +		DRM_ERROR("PCIe error recovery failed, err:%d", r);
> +		amdgpu_device_unlock_adev(adev);
> +	}
> +
> +	return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED; }
> +
> +/**
> + * amdgpu_pci_resume() - resume normal ops after PCI reset
> + * @pdev: pointer to PCI device
> + *
> + * Called when the error recovery driver tells us that its
> + * OK to resume normal operation. Use completion to allow
> + * halted scsi ops to resume.
> + */
> +void amdgpu_pci_resume(struct pci_dev *pdev) {
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +	amdgpu_device_unlock_adev(adev);
> +
> +	DRM_INFO("PCI error: resume callback!!\n"); }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index d984c6a..4bbcc70 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -31,7 +31,6 @@
>   #include <drm/drm_pciids.h>
>   #include <linux/console.h>
>   #include <linux/module.h>
> -#include <linux/pci.h>
>   #include <linux/pm_runtime.h>
>   #include <linux/vga_switcheroo.h>
>   #include <drm/drm_probe_helper.h>
> @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
>   	.patchlevel = KMS_DRIVER_PATCHLEVEL,
>   };
>   
> +static struct pci_error_handlers amdgpu_pci_err_handler = {
> +	.error_detected	= amdgpu_pci_error_detected,
> +	.mmio_enabled	= amdgpu_pci_mmio_enabled,
> +	.slot_reset	= amdgpu_pci_slot_reset,
> +	.resume		= amdgpu_pci_resume,
> +};
> +
>   static struct pci_driver amdgpu_kms_pci_driver = {
>   	.name = DRIVER_NAME,
>   	.id_table = pciidlist,
> @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>   	.remove = amdgpu_pci_remove,
>   	.shutdown = amdgpu_pci_shutdown,
>   	.driver.pm = &amdgpu_pm_ops,
> +	.err_handler = &amdgpu_pci_err_handler,
>   };
>   
>   static int __init amdgpu_init(void)
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7CDennis.Li%40amd.com%7Cfbaac227eff74bcab03108d849cedd81%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637340500116686286&amp;sdata=lYjIk50kyGIHG7UaAzx%2FswghLD8gc2DdCXUmTNj4EkU%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.
  2020-08-26 15:20   ` Alex Deucher
  2020-08-26 15:31     ` Andrey Grodzovsky
@ 2020-08-27 14:54     ` Andrey Grodzovsky
  2020-08-28  0:00       ` Grodzovsky, Andrey
  2020-08-28 14:13       ` Alex Deucher
  1 sibling, 2 replies; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-27 14:54 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Deucher, Alexander, Nirmoy, amd-gfx list


On 8/26/20 11:20 AM, Alex Deucher wrote:
> On Wed, Aug 26, 2020 at 10:46 AM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>> DPC recovery after prev. DPC recovery or after prev. MODE1 reset fails
>> unles you save the cashe the saved PCI confspace to load it after
>> each new reset.
>> Also use same cached state for other use case of restoring PCI confspace
>> such as GPU mode1 or VGA switheroo.
>>
> We don't want to keep the saved state around in the pci core
> otherwise, the pci core will assume we are managing the saved state
> for suspend and resume.  I think we want logic like this:
>
> At driver load time:
> pci_save_state(pdev);
> adev->pci_state = pci_store_saved_state(pdev);
> pci_restore_state(adev->pdev);
>
> then in the case of dpc, do:
> pci_load_saved_state(pdev, adev->pci_state);
>
> For all the other cases, just leave the code as is.


Actually, as we already discussed - caching the PCI confspace only once on boot 
and not doing it again after each subsequent
controlled or spontaneous reset runs the risk of loading back outdated confspace 
settings. I am not sure if and when but, is it indeed
possible we make changes to PCI confspace registers during runtime and so the 
cached state from boot might be outdated
to load back ?

Andrey


>
> Alex
>
>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 +++++++++++++++++++++++++++---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
>>   drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
>>   drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
>>   5 files changed, 66 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 3489622..42ee208 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -992,7 +992,9 @@ struct amdgpu_device {
>>          atomic_t                        throttling_logging_enabled;
>>          struct ratelimit_state          throttling_logging_rs;
>>          uint32_t                        ras_features;
>> +
> Unrelated whitespace changes.
>
>>          bool                            in_dpc;
>> +       struct pci_saved_state          *pci_state;
>>   };
>>
>>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>> @@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
>>   pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
>>   void amdgpu_pci_resume(struct pci_dev *pdev);
>>
>> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
>> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
>> +
>> +
>>
>>   #include "amdgpu_object.h"
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index d9e3994..2c088df 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -1283,7 +1283,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>>
>>                  pci_set_power_state(dev->pdev, PCI_D0);
>> -               pci_restore_state(dev->pdev);
>> +               amdgpu_device_load_pci_state(dev->pdev);
>>                  r = pci_enable_device(dev->pdev);
>>                  if (r)
>>                          DRM_WARN("pci_enable_device failed (%d)\n", r);
>> @@ -1296,7 +1296,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>>                  drm_kms_helper_poll_disable(dev);
>>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>>                  amdgpu_device_suspend(dev, true);
>> -               pci_save_state(dev->pdev);
>> +               amdgpu_device_cache_pci_state(dev->pdev);
>>                  /* Shut down the device */
>>                  pci_disable_device(dev->pdev);
>>                  pci_set_power_state(dev->pdev, PCI_D3cold);
>> @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>          if (r)
>>                  dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>>
>> -       if (pci_save_state(pdev))
>> -               DRM_ERROR("Failed to save PCI state!!\n");
>> +       /* Have stored pci confspace at hand for restore in sudden PCI error */
>> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
>> +               DRM_WARN("Failed to cache PCI state!");
>>
>>          return 0;
>>
>> @@ -3430,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>          flush_delayed_work(&adev->delayed_init_work);
>>          adev->shutdown = true;
>>
>> +       kfree(adev->pci_state);
>> +
>>          /* make sure IB test finished before entering exclusive mode
>>           * to avoid preemption on IB test
>>           * */
>> @@ -4855,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>          /* wait for asic to come out of reset */
>>          msleep(500);
>>
>> -       pci_restore_state(pdev);
>> +       amdgpu_device_load_pci_state(pdev);
>>
>>          /* confirm  ASIC came out of reset */
>>          for (i = 0; i < adev->usec_timeout; i++) {
>> @@ -4934,8 +4937,10 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>
>>   out:
>>
>> -       if (!r)
>> +       if (!r) {
>> +               amdgpu_device_cache_pci_state(adev->pdev);
>>                  DRM_INFO("PCIe error recovery succeeded\n");
>> +       }
>>          else {
>>                  DRM_ERROR("PCIe error recovery failed, err:%d", r);
>>                  amdgpu_device_unlock_adev(adev);
>> @@ -4974,3 +4979,46 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>>
>>          amdgpu_device_unlock_adev(adev);
>>   }
>> +
>> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
>> +{
>> +       struct drm_device *dev = pci_get_drvdata(pdev);
>> +       struct amdgpu_device *adev = drm_to_adev(dev);
>> +       int r;
>> +
>> +       r = pci_save_state(pdev);
>> +       if (!r) {
>> +               kfree(adev->pci_state);
>> +
>> +               adev->pci_state = pci_store_saved_state(pdev);
>> +
>> +               if (!adev->pci_state) {
>> +                       DRM_ERROR("Failed to store PCI saved state");
>> +                       return false;
>> +               }
>> +       } else {
>> +               DRM_WARN("Failed to save PCI state, err:%d\n", r);
>> +               return false;
>> +       }
>> +
>> +       return true;
>> +}
>> +
>> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
>> +{
>> +       struct drm_device *dev = pci_get_drvdata(pdev);
>> +       struct amdgpu_device *adev = drm_to_adev(dev);
>> +       int r;
>> +
>> +       if (!adev->pci_state)
>> +               return false;
>> +
>> +       r = pci_load_saved_state(pdev, adev->pci_state);
>> +
>> +       if (!r) {
>> +               pci_restore_state(pdev);
>> +       } else {
>> +               DRM_WARN("Failed to load PCI state, err:%d\n", r);
>> +               return false;
>> +       }
>> +}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 4bbcc70..7a6482a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
>>                  if (amdgpu_is_atpx_hybrid()) {
>>                          pci_ignore_hotplug(pdev);
>>                  } else {
>> -                       pci_save_state(pdev);
>> +                       amdgpu_device_cache_pci_state(pdev);
>>                          pci_disable_device(pdev);
>>                          pci_ignore_hotplug(pdev);
>>                          pci_set_power_state(pdev, PCI_D3cold);
>> @@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
>>                          pci_set_master(pdev);
>>                  } else {
>>                          pci_set_power_state(pdev, PCI_D0);
>> -                       pci_restore_state(pdev);
>> +                       amdgpu_device_load_pci_state(pdev);
>>                          ret = pci_enable_device(pdev);
>>                          if (ret)
>>                                  return ret;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
>> index 4d14023..0ec6603 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>> @@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>>          /* disable BM */
>>          pci_clear_master(adev->pdev);
>>
>> -       pci_save_state(adev->pdev);
>> +       amdgpu_device_cache_pci_state(adev->pdev);
>>
>>          if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
>>                  dev_info(adev->dev, "GPU smu mode1 reset\n");
>> @@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>>
>>          if (ret)
>>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
>> -       pci_restore_state(adev->pdev);
>> +       amdgpu_device_load_pci_state(adev->pdev);
>>
>>          /* wait for asic to come out of reset */
>>          for (i = 0; i < adev->usec_timeout; i++) {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
>> index 2f93c47..ddd55e3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
>> @@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
>>          /* disable BM */
>>          pci_clear_master(adev->pdev);
>>
>> -       pci_save_state(adev->pdev);
>> +       amdgpu_device_cache_pci_state(adev->pdev);
>>
>>          ret = psp_gpu_reset(adev);
>>          if (ret)
>>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
>>
>> -       pci_restore_state(adev->pdev);
>> +       amdgpu_device_load_pci_state(adev->pdev);
>>
>>          /* wait for asic to come out of reset */
>>          for (i = 0; i < adev->usec_timeout; i++) {
>> --
>> 2.7.4
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7C5faaaf11d6024193cb2508d849d38fae%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637340520282366646&amp;sdata=ziPLmwT95XAnKftzplY2QBHq%2BL1961d0SdaMPAlVMcE%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.
  2020-08-27 14:54     ` Andrey Grodzovsky
@ 2020-08-28  0:00       ` Grodzovsky, Andrey
  2020-08-28  6:56         ` Christian König
  2020-08-28 14:13       ` Alex Deucher
  1 sibling, 1 reply; 29+ messages in thread
From: Grodzovsky, Andrey @ 2020-08-28  0:00 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Deucher, Alexander, Das, Nirmoy, amd-gfx list


[-- Attachment #1.1: Type: text/plain, Size: 11723 bytes --]

Ping

Andrey
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Sent: 27 August 2020 10:54
To: Alex Deucher <alexdeucher@gmail.com>
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Das, Nirmoy <Nirmoy.Das@amd.com>; amd-gfx list <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.


On 8/26/20 11:20 AM, Alex Deucher wrote:
> On Wed, Aug 26, 2020 at 10:46 AM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>> DPC recovery after prev. DPC recovery or after prev. MODE1 reset fails
>> unles you save the cashe the saved PCI confspace to load it after
>> each new reset.
>> Also use same cached state for other use case of restoring PCI confspace
>> such as GPU mode1 or VGA switheroo.
>>
> We don't want to keep the saved state around in the pci core
> otherwise, the pci core will assume we are managing the saved state
> for suspend and resume.  I think we want logic like this:
>
> At driver load time:
> pci_save_state(pdev);
> adev->pci_state = pci_store_saved_state(pdev);
> pci_restore_state(adev->pdev);
>
> then in the case of dpc, do:
> pci_load_saved_state(pdev, adev->pci_state);
>
> For all the other cases, just leave the code as is.


Actually, as we already discussed - caching the PCI confspace only once on boot
and not doing it again after each subsequent
controlled or spontaneous reset runs the risk of loading back outdated confspace
settings. I am not sure if and when but, is it indeed
possible we make changes to PCI confspace registers during runtime and so the
cached state from boot might be outdated
to load back ?

Andrey


>
> Alex
>
>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 +++++++++++++++++++++++++++---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
>>   drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
>>   drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
>>   5 files changed, 66 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 3489622..42ee208 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -992,7 +992,9 @@ struct amdgpu_device {
>>          atomic_t                        throttling_logging_enabled;
>>          struct ratelimit_state          throttling_logging_rs;
>>          uint32_t                        ras_features;
>> +
> Unrelated whitespace changes.
>
>>          bool                            in_dpc;
>> +       struct pci_saved_state          *pci_state;
>>   };
>>
>>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>> @@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
>>   pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
>>   void amdgpu_pci_resume(struct pci_dev *pdev);
>>
>> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
>> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
>> +
>> +
>>
>>   #include "amdgpu_object.h"
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index d9e3994..2c088df 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -1283,7 +1283,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>>
>>                  pci_set_power_state(dev->pdev, PCI_D0);
>> -               pci_restore_state(dev->pdev);
>> +               amdgpu_device_load_pci_state(dev->pdev);
>>                  r = pci_enable_device(dev->pdev);
>>                  if (r)
>>                          DRM_WARN("pci_enable_device failed (%d)\n", r);
>> @@ -1296,7 +1296,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>>                  drm_kms_helper_poll_disable(dev);
>>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>>                  amdgpu_device_suspend(dev, true);
>> -               pci_save_state(dev->pdev);
>> +               amdgpu_device_cache_pci_state(dev->pdev);
>>                  /* Shut down the device */
>>                  pci_disable_device(dev->pdev);
>>                  pci_set_power_state(dev->pdev, PCI_D3cold);
>> @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>          if (r)
>>                  dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>>
>> -       if (pci_save_state(pdev))
>> -               DRM_ERROR("Failed to save PCI state!!\n");
>> +       /* Have stored pci confspace at hand for restore in sudden PCI error */
>> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
>> +               DRM_WARN("Failed to cache PCI state!");
>>
>>          return 0;
>>
>> @@ -3430,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>          flush_delayed_work(&adev->delayed_init_work);
>>          adev->shutdown = true;
>>
>> +       kfree(adev->pci_state);
>> +
>>          /* make sure IB test finished before entering exclusive mode
>>           * to avoid preemption on IB test
>>           * */
>> @@ -4855,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>          /* wait for asic to come out of reset */
>>          msleep(500);
>>
>> -       pci_restore_state(pdev);
>> +       amdgpu_device_load_pci_state(pdev);
>>
>>          /* confirm  ASIC came out of reset */
>>          for (i = 0; i < adev->usec_timeout; i++) {
>> @@ -4934,8 +4937,10 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>
>>   out:
>>
>> -       if (!r)
>> +       if (!r) {
>> +               amdgpu_device_cache_pci_state(adev->pdev);
>>                  DRM_INFO("PCIe error recovery succeeded\n");
>> +       }
>>          else {
>>                  DRM_ERROR("PCIe error recovery failed, err:%d", r);
>>                  amdgpu_device_unlock_adev(adev);
>> @@ -4974,3 +4979,46 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>>
>>          amdgpu_device_unlock_adev(adev);
>>   }
>> +
>> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
>> +{
>> +       struct drm_device *dev = pci_get_drvdata(pdev);
>> +       struct amdgpu_device *adev = drm_to_adev(dev);
>> +       int r;
>> +
>> +       r = pci_save_state(pdev);
>> +       if (!r) {
>> +               kfree(adev->pci_state);
>> +
>> +               adev->pci_state = pci_store_saved_state(pdev);
>> +
>> +               if (!adev->pci_state) {
>> +                       DRM_ERROR("Failed to store PCI saved state");
>> +                       return false;
>> +               }
>> +       } else {
>> +               DRM_WARN("Failed to save PCI state, err:%d\n", r);
>> +               return false;
>> +       }
>> +
>> +       return true;
>> +}
>> +
>> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
>> +{
>> +       struct drm_device *dev = pci_get_drvdata(pdev);
>> +       struct amdgpu_device *adev = drm_to_adev(dev);
>> +       int r;
>> +
>> +       if (!adev->pci_state)
>> +               return false;
>> +
>> +       r = pci_load_saved_state(pdev, adev->pci_state);
>> +
>> +       if (!r) {
>> +               pci_restore_state(pdev);
>> +       } else {
>> +               DRM_WARN("Failed to load PCI state, err:%d\n", r);
>> +               return false;
>> +       }
>> +}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 4bbcc70..7a6482a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
>>                  if (amdgpu_is_atpx_hybrid()) {
>>                          pci_ignore_hotplug(pdev);
>>                  } else {
>> -                       pci_save_state(pdev);
>> +                       amdgpu_device_cache_pci_state(pdev);
>>                          pci_disable_device(pdev);
>>                          pci_ignore_hotplug(pdev);
>>                          pci_set_power_state(pdev, PCI_D3cold);
>> @@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
>>                          pci_set_master(pdev);
>>                  } else {
>>                          pci_set_power_state(pdev, PCI_D0);
>> -                       pci_restore_state(pdev);
>> +                       amdgpu_device_load_pci_state(pdev);
>>                          ret = pci_enable_device(pdev);
>>                          if (ret)
>>                                  return ret;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
>> index 4d14023..0ec6603 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>> @@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>>          /* disable BM */
>>          pci_clear_master(adev->pdev);
>>
>> -       pci_save_state(adev->pdev);
>> +       amdgpu_device_cache_pci_state(adev->pdev);
>>
>>          if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
>>                  dev_info(adev->dev, "GPU smu mode1 reset\n");
>> @@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>>
>>          if (ret)
>>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
>> -       pci_restore_state(adev->pdev);
>> +       amdgpu_device_load_pci_state(adev->pdev);
>>
>>          /* wait for asic to come out of reset */
>>          for (i = 0; i < adev->usec_timeout; i++) {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
>> index 2f93c47..ddd55e3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
>> @@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
>>          /* disable BM */
>>          pci_clear_master(adev->pdev);
>>
>> -       pci_save_state(adev->pdev);
>> +       amdgpu_device_cache_pci_state(adev->pdev);
>>
>>          ret = psp_gpu_reset(adev);
>>          if (ret)
>>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
>>
>> -       pci_restore_state(adev->pdev);
>> +       amdgpu_device_load_pci_state(adev->pdev);
>>
>>          /* wait for asic to come out of reset */
>>          for (i = 0; i < adev->usec_timeout; i++) {
>> --
>> 2.7.4
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7Cf7e5e09330074d176b5b08d84a99236e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637341368926287137&amp;sdata=HTTP5xt%2BFTc%2F%2BlcI37wkRQ0BqTHOXOQVpSlR%2FDEnFlM%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7Cf7e5e09330074d176b5b08d84a99236e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637341368926287137&amp;sdata=HTTP5xt%2BFTc%2F%2BlcI37wkRQ0BqTHOXOQVpSlR%2FDEnFlM%3D&amp;reserved=0

[-- Attachment #1.2: Type: text/html, Size: 23685 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.
  2020-08-28  0:00       ` Grodzovsky, Andrey
@ 2020-08-28  6:56         ` Christian König
  0 siblings, 0 replies; 29+ messages in thread
From: Christian König @ 2020-08-28  6:56 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Alex Deucher
  Cc: Deucher, Alexander, Das, Nirmoy, amd-gfx list


[-- Attachment #1.1: Type: text/plain, Size: 12070 bytes --]

Am 28.08.20 um 02:00 schrieb Grodzovsky, Andrey:
> Ping
>
> Andrey
> ------------------------------------------------------------------------
> *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of 
> Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
> *Sent:* 27 August 2020 10:54
> *To:* Alex Deucher <alexdeucher@gmail.com>
> *Cc:* Deucher, Alexander <Alexander.Deucher@amd.com>; Das, Nirmoy 
> <Nirmoy.Das@amd.com>; amd-gfx list <amd-gfx@lists.freedesktop.org>
> *Subject:* Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries 
> failure.
>
> On 8/26/20 11:20 AM, Alex Deucher wrote:
> > On Wed, Aug 26, 2020 at 10:46 AM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >> DPC recovery after prev. DPC recovery or after prev. MODE1 reset fails
> >> unles you save the cashe the saved PCI confspace to load it after
> >> each new reset.
> >> Also use same cached state for other use case of restoring PCI 
> confspace
> >> such as GPU mode1 or VGA switheroo.
> >>
> > We don't want to keep the saved state around in the pci core
> > otherwise, the pci core will assume we are managing the saved state
> > for suspend and resume.  I think we want logic like this:
> >
> > At driver load time:
> > pci_save_state(pdev);
> > adev->pci_state = pci_store_saved_state(pdev);
> > pci_restore_state(adev->pdev);
> >
> > then in the case of dpc, do:
> > pci_load_saved_state(pdev, adev->pci_state);
> >
> > For all the other cases, just leave the code as is.
>
>
> Actually, as we already discussed - caching the PCI confspace only 
> once on boot
> and not doing it again after each subsequent
> controlled or spontaneous reset runs the risk of loading back outdated 
> confspace
> settings. I am not sure if and when but, is it indeed
> possible we make changes to PCI confspace registers during runtime and 
> so the
> cached state from boot might be outdated
> to load back ?

Well we change the confspace settings for quite a bunch of things. For 
example during BAR resize or link speed adjustment.

Christian.

>
> Andrey
>
>
> >
> > Alex
> >
> >
> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 
> +++++++++++++++++++++++++++---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
> >>   drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
> >>   drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
> >>   5 files changed, 66 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> index 3489622..42ee208 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> @@ -992,7 +992,9 @@ struct amdgpu_device {
> >>          atomic_t throttling_logging_enabled;
> >>          struct ratelimit_state throttling_logging_rs;
> >>          uint32_t ras_features;
> >> +
> > Unrelated whitespace changes.
> >
> >>          bool                            in_dpc;
> >> +       struct pci_saved_state *pci_state;
> >>   };
> >>
> >>   static inline struct amdgpu_device *drm_to_adev(struct drm_device 
> *ddev)
> >> @@ -1272,6 +1274,10 @@ pci_ers_result_t 
> amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> >>   pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> >>   void amdgpu_pci_resume(struct pci_dev *pdev);
> >>
> >> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
> >> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
> >> +
> >> +
> >>
> >>   #include "amdgpu_object.h"
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index d9e3994..2c088df 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -1283,7 +1283,7 @@ static void 
> amdgpu_switcheroo_set_state(struct pci_dev *pdev,
> >>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
> >>
> >> pci_set_power_state(dev->pdev, PCI_D0);
> >> -               pci_restore_state(dev->pdev);
> >> + amdgpu_device_load_pci_state(dev->pdev);
> >>                  r = pci_enable_device(dev->pdev);
> >>                  if (r)
> >> DRM_WARN("pci_enable_device failed (%d)\n", r);
> >> @@ -1296,7 +1296,7 @@ static void 
> amdgpu_switcheroo_set_state(struct pci_dev *pdev,
> >> drm_kms_helper_poll_disable(dev);
> >>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
> >>                  amdgpu_device_suspend(dev, true);
> >> -               pci_save_state(dev->pdev);
> >> + amdgpu_device_cache_pci_state(dev->pdev);
> >>                  /* Shut down the device */
> >> pci_disable_device(dev->pdev);
> >> pci_set_power_state(dev->pdev, PCI_D3cold);
> >> @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device 
> *adev,
> >>          if (r)
> >>                  dev_err(adev->dev, "amdgpu_pmu_init failed\n");
> >>
> >> -       if (pci_save_state(pdev))
> >> -               DRM_ERROR("Failed to save PCI state!!\n");
> >> +       /* Have stored pci confspace at hand for restore in sudden 
> PCI error */
> >> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
> >> +               DRM_WARN("Failed to cache PCI state!");
> >>
> >>          return 0;
> >>
> >> @@ -3430,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device 
> *adev)
> >> flush_delayed_work(&adev->delayed_init_work);
> >>          adev->shutdown = true;
> >>
> >> +       kfree(adev->pci_state);
> >> +
> >>          /* make sure IB test finished before entering exclusive mode
> >>           * to avoid preemption on IB test
> >>           * */
> >> @@ -4855,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct 
> pci_dev *pdev)
> >>          /* wait for asic to come out of reset */
> >>          msleep(500);
> >>
> >> -       pci_restore_state(pdev);
> >> +       amdgpu_device_load_pci_state(pdev);
> >>
> >>          /* confirm  ASIC came out of reset */
> >>          for (i = 0; i < adev->usec_timeout; i++) {
> >> @@ -4934,8 +4937,10 @@ pci_ers_result_t 
> amdgpu_pci_slot_reset(struct pci_dev *pdev)
> >>
> >>   out:
> >>
> >> -       if (!r)
> >> +       if (!r) {
> >> + amdgpu_device_cache_pci_state(adev->pdev);
> >>                  DRM_INFO("PCIe error recovery succeeded\n");
> >> +       }
> >>          else {
> >>                  DRM_ERROR("PCIe error recovery failed, err:%d", r);
> >>                  amdgpu_device_unlock_adev(adev);
> >> @@ -4974,3 +4979,46 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
> >>
> >>          amdgpu_device_unlock_adev(adev);
> >>   }
> >> +
> >> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
> >> +{
> >> +       struct drm_device *dev = pci_get_drvdata(pdev);
> >> +       struct amdgpu_device *adev = drm_to_adev(dev);
> >> +       int r;
> >> +
> >> +       r = pci_save_state(pdev);
> >> +       if (!r) {
> >> +               kfree(adev->pci_state);
> >> +
> >> +               adev->pci_state = pci_store_saved_state(pdev);
> >> +
> >> +               if (!adev->pci_state) {
> >> +                       DRM_ERROR("Failed to store PCI saved state");
> >> +                       return false;
> >> +               }
> >> +       } else {
> >> +               DRM_WARN("Failed to save PCI state, err:%d\n", r);
> >> +               return false;
> >> +       }
> >> +
> >> +       return true;
> >> +}
> >> +
> >> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
> >> +{
> >> +       struct drm_device *dev = pci_get_drvdata(pdev);
> >> +       struct amdgpu_device *adev = drm_to_adev(dev);
> >> +       int r;
> >> +
> >> +       if (!adev->pci_state)
> >> +               return false;
> >> +
> >> +       r = pci_load_saved_state(pdev, adev->pci_state);
> >> +
> >> +       if (!r) {
> >> +               pci_restore_state(pdev);
> >> +       } else {
> >> +               DRM_WARN("Failed to load PCI state, err:%d\n", r);
> >> +               return false;
> >> +       }
> >> +}
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> index 4bbcc70..7a6482a 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> @@ -1320,7 +1320,7 @@ static int 
> amdgpu_pmops_runtime_suspend(struct device *dev)
> >>                  if (amdgpu_is_atpx_hybrid()) {
> >> pci_ignore_hotplug(pdev);
> >>                  } else {
> >> -                       pci_save_state(pdev);
> >> + amdgpu_device_cache_pci_state(pdev);
> >> pci_disable_device(pdev);
> >> pci_ignore_hotplug(pdev);
> >> pci_set_power_state(pdev, PCI_D3cold);
> >> @@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct 
> device *dev)
> >>                          pci_set_master(pdev);
> >>                  } else {
> >> pci_set_power_state(pdev, PCI_D0);
> >> -                       pci_restore_state(pdev);
> >> + amdgpu_device_load_pci_state(pdev);
> >>                          ret = pci_enable_device(pdev);
> >>                          if (ret)
> >>                                  return ret;
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c 
> b/drivers/gpu/drm/amd/amdgpu/nv.c
> >> index 4d14023..0ec6603 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> >> @@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct 
> amdgpu_device *adev)
> >>          /* disable BM */
> >>          pci_clear_master(adev->pdev);
> >>
> >> -       pci_save_state(adev->pdev);
> >> + amdgpu_device_cache_pci_state(adev->pdev);
> >>
> >>          if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
> >>                  dev_info(adev->dev, "GPU smu mode1 reset\n");
> >> @@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct 
> amdgpu_device *adev)
> >>
> >>          if (ret)
> >>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
> >> -       pci_restore_state(adev->pdev);
> >> + amdgpu_device_load_pci_state(adev->pdev);
> >>
> >>          /* wait for asic to come out of reset */
> >>          for (i = 0; i < adev->usec_timeout; i++) {
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c 
> b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >> index 2f93c47..ddd55e3 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >> @@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct 
> amdgpu_device *adev)
> >>          /* disable BM */
> >>          pci_clear_master(adev->pdev);
> >>
> >> -       pci_save_state(adev->pdev);
> >> + amdgpu_device_cache_pci_state(adev->pdev);
> >>
> >>          ret = psp_gpu_reset(adev);
> >>          if (ret)
> >>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
> >>
> >> -       pci_restore_state(adev->pdev);
> >> + amdgpu_device_load_pci_state(adev->pdev);
> >>
> >>          /* wait for asic to come out of reset */
> >>          for (i = 0; i < adev->usec_timeout; i++) {
> >> --
> >> 2.7.4
> >>
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@lists.freedesktop.org
> >> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7Cf7e5e09330074d176b5b08d84a99236e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637341368926287137&amp;sdata=HTTP5xt%2BFTc%2F%2BlcI37wkRQ0BqTHOXOQVpSlR%2FDEnFlM%3D&amp;reserved=0
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7Cf7e5e09330074d176b5b08d84a99236e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637341368926287137&amp;sdata=HTTP5xt%2BFTc%2F%2BlcI37wkRQ0BqTHOXOQVpSlR%2FDEnFlM%3D&amp;reserved=0
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[-- Attachment #1.2: Type: text/html, Size: 22700 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.
  2020-08-27 14:54     ` Andrey Grodzovsky
  2020-08-28  0:00       ` Grodzovsky, Andrey
@ 2020-08-28 14:13       ` Alex Deucher
  2020-08-28 14:21         ` Andrey Grodzovsky
  1 sibling, 1 reply; 29+ messages in thread
From: Alex Deucher @ 2020-08-28 14:13 UTC (permalink / raw)
  To: Andrey Grodzovsky; +Cc: Deucher, Alexander, Nirmoy, amd-gfx list

On Thu, Aug 27, 2020 at 10:54 AM Andrey Grodzovsky
<Andrey.Grodzovsky@amd.com> wrote:
>
>
> On 8/26/20 11:20 AM, Alex Deucher wrote:
> > On Wed, Aug 26, 2020 at 10:46 AM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >> DPC recovery after prev. DPC recovery or after prev. MODE1 reset fails
> >> unles you save the cashe the saved PCI confspace to load it after
> >> each new reset.
> >> Also use same cached state for other use case of restoring PCI confspace
> >> such as GPU mode1 or VGA switheroo.
> >>
> > We don't want to keep the saved state around in the pci core
> > otherwise, the pci core will assume we are managing the saved state
> > for suspend and resume.  I think we want logic like this:
> >
> > At driver load time:
> > pci_save_state(pdev);
> > adev->pci_state = pci_store_saved_state(pdev);
> > pci_restore_state(adev->pdev);
> >
> > then in the case of dpc, do:
> > pci_load_saved_state(pdev, adev->pci_state);
> >
> > For all the other cases, just leave the code as is.
>
>
> Actually, as we already discussed - caching the PCI confspace only once on boot
> and not doing it again after each subsequent
> controlled or spontaneous reset runs the risk of loading back outdated confspace
> settings. I am not sure if and when but, is it indeed
> possible we make changes to PCI confspace registers during runtime and so the
> cached state from boot might be outdated
> to load back ?

As Christain noted we may change config space at load time if we
resize the BAR.  So we should probably save config space at the end of
the driver load init sequence.  This saved state is just a backup in
case something kills pci config space without saving it.  I think it's
fine to update the cached copy whenever we call pci_save_state(), but
what we don't want to do is call pci_save_state() at init time without
calling pci_restore_state().  pci_save_state() caches a copy of the
save state in the pdev structure.  The pci core uses the presence of
this cached state to make decisions about whether the driver or the
core should handle a bunch of stuff at suspend/resume.  If the cached
state is present, it assumes the driver will be handling all of the
pci related state management for power management.  I'd rather keep
that in the core.  I think the patch is good, but this part needs to
be reworked:

@@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
        if (r)
                dev_err(adev->dev, "amdgpu_pmu_init failed\n");

-       if (pci_save_state(pdev))
-               DRM_ERROR("Failed to save PCI state!!\n");
+       /* Have stored pci confspace at hand for restore in sudden PCI error */
+       if (!amdgpu_device_cache_pci_state(adev->pdev))
+               DRM_WARN("Failed to cache PCI state!");

Something like this:

@@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
        if (r)
                dev_err(adev->dev, "amdgpu_pmu_init failed\n");

-       if (pci_save_state(pdev))
-               DRM_ERROR("Failed to save PCI state!!\n");
+       /* Have stored pci confspace at hand for restore in sudden PCI error */
+       if (!amdgpu_device_cache_pci_state(adev->pdev))
+               DRM_WARN("Failed to cache PCI state!");
+       if (!amdgpu_device_load_pci_state(adev->pdev))
+               DRM_WARN("Failed to restore PCI state!");

This way we have a cached copy in the driver but not in the pci core.


Alex


>
> Andrey
>
>
> >
> > Alex
> >
> >
> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 +++++++++++++++++++++++++++---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
> >>   drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
> >>   drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
> >>   5 files changed, 66 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> index 3489622..42ee208 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> @@ -992,7 +992,9 @@ struct amdgpu_device {
> >>          atomic_t                        throttling_logging_enabled;
> >>          struct ratelimit_state          throttling_logging_rs;
> >>          uint32_t                        ras_features;
> >> +
> > Unrelated whitespace changes.
> >
> >>          bool                            in_dpc;
> >> +       struct pci_saved_state          *pci_state;
> >>   };
> >>
> >>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> >> @@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> >>   pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> >>   void amdgpu_pci_resume(struct pci_dev *pdev);
> >>
> >> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
> >> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
> >> +
> >> +
> >>
> >>   #include "amdgpu_object.h"
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index d9e3994..2c088df 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -1283,7 +1283,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
> >>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
> >>
> >>                  pci_set_power_state(dev->pdev, PCI_D0);
> >> -               pci_restore_state(dev->pdev);
> >> +               amdgpu_device_load_pci_state(dev->pdev);
> >>                  r = pci_enable_device(dev->pdev);
> >>                  if (r)
> >>                          DRM_WARN("pci_enable_device failed (%d)\n", r);
> >> @@ -1296,7 +1296,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
> >>                  drm_kms_helper_poll_disable(dev);
> >>                  dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
> >>                  amdgpu_device_suspend(dev, true);
> >> -               pci_save_state(dev->pdev);
> >> +               amdgpu_device_cache_pci_state(dev->pdev);
> >>                  /* Shut down the device */
> >>                  pci_disable_device(dev->pdev);
> >>                  pci_set_power_state(dev->pdev, PCI_D3cold);
> >> @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >>          if (r)
> >>                  dev_err(adev->dev, "amdgpu_pmu_init failed\n");
> >>
> >> -       if (pci_save_state(pdev))
> >> -               DRM_ERROR("Failed to save PCI state!!\n");
> >> +       /* Have stored pci confspace at hand for restore in sudden PCI error */
> >> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
> >> +               DRM_WARN("Failed to cache PCI state!");
> >>
> >>          return 0;
> >>
> >> @@ -3430,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
> >>          flush_delayed_work(&adev->delayed_init_work);
> >>          adev->shutdown = true;
> >>
> >> +       kfree(adev->pci_state);
> >> +
> >>          /* make sure IB test finished before entering exclusive mode
> >>           * to avoid preemption on IB test
> >>           * */
> >> @@ -4855,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> >>          /* wait for asic to come out of reset */
> >>          msleep(500);
> >>
> >> -       pci_restore_state(pdev);
> >> +       amdgpu_device_load_pci_state(pdev);
> >>
> >>          /* confirm  ASIC came out of reset */
> >>          for (i = 0; i < adev->usec_timeout; i++) {
> >> @@ -4934,8 +4937,10 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> >>
> >>   out:
> >>
> >> -       if (!r)
> >> +       if (!r) {
> >> +               amdgpu_device_cache_pci_state(adev->pdev);
> >>                  DRM_INFO("PCIe error recovery succeeded\n");
> >> +       }
> >>          else {
> >>                  DRM_ERROR("PCIe error recovery failed, err:%d", r);
> >>                  amdgpu_device_unlock_adev(adev);
> >> @@ -4974,3 +4979,46 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
> >>
> >>          amdgpu_device_unlock_adev(adev);
> >>   }
> >> +
> >> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
> >> +{
> >> +       struct drm_device *dev = pci_get_drvdata(pdev);
> >> +       struct amdgpu_device *adev = drm_to_adev(dev);
> >> +       int r;
> >> +
> >> +       r = pci_save_state(pdev);
> >> +       if (!r) {
> >> +               kfree(adev->pci_state);
> >> +
> >> +               adev->pci_state = pci_store_saved_state(pdev);
> >> +
> >> +               if (!adev->pci_state) {
> >> +                       DRM_ERROR("Failed to store PCI saved state");
> >> +                       return false;
> >> +               }
> >> +       } else {
> >> +               DRM_WARN("Failed to save PCI state, err:%d\n", r);
> >> +               return false;
> >> +       }
> >> +
> >> +       return true;
> >> +}
> >> +
> >> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
> >> +{
> >> +       struct drm_device *dev = pci_get_drvdata(pdev);
> >> +       struct amdgpu_device *adev = drm_to_adev(dev);
> >> +       int r;
> >> +
> >> +       if (!adev->pci_state)
> >> +               return false;
> >> +
> >> +       r = pci_load_saved_state(pdev, adev->pci_state);
> >> +
> >> +       if (!r) {
> >> +               pci_restore_state(pdev);
> >> +       } else {
> >> +               DRM_WARN("Failed to load PCI state, err:%d\n", r);
> >> +               return false;
> >> +       }
> >> +}
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> index 4bbcc70..7a6482a 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> @@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
> >>                  if (amdgpu_is_atpx_hybrid()) {
> >>                          pci_ignore_hotplug(pdev);
> >>                  } else {
> >> -                       pci_save_state(pdev);
> >> +                       amdgpu_device_cache_pci_state(pdev);
> >>                          pci_disable_device(pdev);
> >>                          pci_ignore_hotplug(pdev);
> >>                          pci_set_power_state(pdev, PCI_D3cold);
> >> @@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
> >>                          pci_set_master(pdev);
> >>                  } else {
> >>                          pci_set_power_state(pdev, PCI_D0);
> >> -                       pci_restore_state(pdev);
> >> +                       amdgpu_device_load_pci_state(pdev);
> >>                          ret = pci_enable_device(pdev);
> >>                          if (ret)
> >>                                  return ret;
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
> >> index 4d14023..0ec6603 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> >> @@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
> >>          /* disable BM */
> >>          pci_clear_master(adev->pdev);
> >>
> >> -       pci_save_state(adev->pdev);
> >> +       amdgpu_device_cache_pci_state(adev->pdev);
> >>
> >>          if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
> >>                  dev_info(adev->dev, "GPU smu mode1 reset\n");
> >> @@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
> >>
> >>          if (ret)
> >>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
> >> -       pci_restore_state(adev->pdev);
> >> +       amdgpu_device_load_pci_state(adev->pdev);
> >>
> >>          /* wait for asic to come out of reset */
> >>          for (i = 0; i < adev->usec_timeout; i++) {
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >> index 2f93c47..ddd55e3 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >> @@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
> >>          /* disable BM */
> >>          pci_clear_master(adev->pdev);
> >>
> >> -       pci_save_state(adev->pdev);
> >> +       amdgpu_device_cache_pci_state(adev->pdev);
> >>
> >>          ret = psp_gpu_reset(adev);
> >>          if (ret)
> >>                  dev_err(adev->dev, "GPU mode1 reset failed\n");
> >>
> >> -       pci_restore_state(adev->pdev);
> >> +       amdgpu_device_load_pci_state(adev->pdev);
> >>
> >>          /* wait for asic to come out of reset */
> >>          for (i = 0; i < adev->usec_timeout; i++) {
> >> --
> >> 2.7.4
> >>
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@lists.freedesktop.org
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7C5faaaf11d6024193cb2508d849d38fae%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637340520282366646&amp;sdata=ziPLmwT95XAnKftzplY2QBHq%2BL1961d0SdaMPAlVMcE%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.
  2020-08-28 14:13       ` Alex Deucher
@ 2020-08-28 14:21         ` Andrey Grodzovsky
  2020-08-28 14:32           ` Alex Deucher
  0 siblings, 1 reply; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-28 14:21 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Deucher, Alexander, Nirmoy, amd-gfx list


On 8/28/20 10:13 AM, Alex Deucher wrote:
> On Thu, Aug 27, 2020 at 10:54 AM Andrey Grodzovsky
> <Andrey.Grodzovsky@amd.com> wrote:
>>
>> On 8/26/20 11:20 AM, Alex Deucher wrote:
>>> On Wed, Aug 26, 2020 at 10:46 AM Andrey Grodzovsky
>>> <andrey.grodzovsky@amd.com> wrote:
>>>> DPC recovery after prev. DPC recovery or after prev. MODE1 reset fails
>>>> unles you save the cashe the saved PCI confspace to load it after
>>>> each new reset.
>>>> Also use same cached state for other use case of restoring PCI confspace
>>>> such as GPU mode1 or VGA switheroo.
>>>>
>>> We don't want to keep the saved state around in the pci core
>>> otherwise, the pci core will assume we are managing the saved state
>>> for suspend and resume.  I think we want logic like this:
>>>
>>> At driver load time:
>>> pci_save_state(pdev);
>>> adev->pci_state = pci_store_saved_state(pdev);
>>> pci_restore_state(adev->pdev);
>>>
>>> then in the case of dpc, do:
>>> pci_load_saved_state(pdev, adev->pci_state);
>>>
>>> For all the other cases, just leave the code as is.
>>
>> Actually, as we already discussed - caching the PCI confspace only once on boot
>> and not doing it again after each subsequent
>> controlled or spontaneous reset runs the risk of loading back outdated confspace
>> settings. I am not sure if and when but, is it indeed
>> possible we make changes to PCI confspace registers during runtime and so the
>> cached state from boot might be outdated
>> to load back ?
> As Christain noted we may change config space at load time if we
> resize the BAR.  So we should probably save config space at the end of
> the driver load init sequence.  This saved state is just a backup in
> case something kills pci config space without saving it.  I think it's
> fine to update the cached copy whenever we call pci_save_state(), but
> what we don't want to do is call pci_save_state() at init time without
> calling pci_restore_state().  pci_save_state() caches a copy of the
> save state in the pdev structure.  The pci core uses the presence of
> this cached state to make decisions about whether the driver or the
> core should handle a bunch of stuff at suspend/resume.  If the cached
> state is present, it assumes the driver will be handling all of the
> pci related state management for power management.  I'd rather keep
> that in the core.  I think the patch is good, but this part needs to
> be reworked:


In this case I believe the patch should stay as is with the only change at
adding pci_restore_state(pdev) in amdgpu_device_cache_pci_state after the call
to pci_store_saved_state - right ?

Andrey


>
> @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>          if (r)
>                  dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>
> -       if (pci_save_state(pdev))
> -               DRM_ERROR("Failed to save PCI state!!\n");
> +       /* Have stored pci confspace at hand for restore in sudden PCI error */
> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
> +               DRM_WARN("Failed to cache PCI state!");
>
> Something like this:
>
> @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>          if (r)
>                  dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>
> -       if (pci_save_state(pdev))
> -               DRM_ERROR("Failed to save PCI state!!\n");
> +       /* Have stored pci confspace at hand for restore in sudden PCI error */
> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
> +               DRM_WARN("Failed to cache PCI state!");
> +       if (!amdgpu_device_load_pci_state(adev->pdev))
> +               DRM_WARN("Failed to restore PCI state!");
>
> This way we have a cached copy in the driver but not in the pci core.
>
>
> Alex
>
>
>> Andrey
>>
>>
>>> Alex
>>>
>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 +++++++++++++++++++++++++++---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
>>>>    drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
>>>>    drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
>>>>    5 files changed, 66 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index 3489622..42ee208 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -992,7 +992,9 @@ struct amdgpu_device {
>>>>           atomic_t                        throttling_logging_enabled;
>>>>           struct ratelimit_state          throttling_logging_rs;
>>>>           uint32_t                        ras_features;
>>>> +
>>> Unrelated whitespace changes.
>>>
>>>>           bool                            in_dpc;
>>>> +       struct pci_saved_state          *pci_state;
>>>>    };
>>>>
>>>>    static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>>>> @@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
>>>>    pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
>>>>    void amdgpu_pci_resume(struct pci_dev *pdev);
>>>>
>>>> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
>>>> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
>>>> +
>>>> +
>>>>
>>>>    #include "amdgpu_object.h"
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index d9e3994..2c088df 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -1283,7 +1283,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>>>>                   dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>>>>
>>>>                   pci_set_power_state(dev->pdev, PCI_D0);
>>>> -               pci_restore_state(dev->pdev);
>>>> +               amdgpu_device_load_pci_state(dev->pdev);
>>>>                   r = pci_enable_device(dev->pdev);
>>>>                   if (r)
>>>>                           DRM_WARN("pci_enable_device failed (%d)\n", r);
>>>> @@ -1296,7 +1296,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>>>>                   drm_kms_helper_poll_disable(dev);
>>>>                   dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>>>>                   amdgpu_device_suspend(dev, true);
>>>> -               pci_save_state(dev->pdev);
>>>> +               amdgpu_device_cache_pci_state(dev->pdev);
>>>>                   /* Shut down the device */
>>>>                   pci_disable_device(dev->pdev);
>>>>                   pci_set_power_state(dev->pdev, PCI_D3cold);
>>>> @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>>           if (r)
>>>>                   dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>>>>
>>>> -       if (pci_save_state(pdev))
>>>> -               DRM_ERROR("Failed to save PCI state!!\n");
>>>> +       /* Have stored pci confspace at hand for restore in sudden PCI error */
>>>> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
>>>> +               DRM_WARN("Failed to cache PCI state!");
>>>>
>>>>           return 0;
>>>>
>>>> @@ -3430,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>>>           flush_delayed_work(&adev->delayed_init_work);
>>>>           adev->shutdown = true;
>>>>
>>>> +       kfree(adev->pci_state);
>>>> +
>>>>           /* make sure IB test finished before entering exclusive mode
>>>>            * to avoid preemption on IB test
>>>>            * */
>>>> @@ -4855,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>           /* wait for asic to come out of reset */
>>>>           msleep(500);
>>>>
>>>> -       pci_restore_state(pdev);
>>>> +       amdgpu_device_load_pci_state(pdev);
>>>>
>>>>           /* confirm  ASIC came out of reset */
>>>>           for (i = 0; i < adev->usec_timeout; i++) {
>>>> @@ -4934,8 +4937,10 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>
>>>>    out:
>>>>
>>>> -       if (!r)
>>>> +       if (!r) {
>>>> +               amdgpu_device_cache_pci_state(adev->pdev);
>>>>                   DRM_INFO("PCIe error recovery succeeded\n");
>>>> +       }
>>>>           else {
>>>>                   DRM_ERROR("PCIe error recovery failed, err:%d", r);
>>>>                   amdgpu_device_unlock_adev(adev);
>>>> @@ -4974,3 +4979,46 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>>>>
>>>>           amdgpu_device_unlock_adev(adev);
>>>>    }
>>>> +
>>>> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
>>>> +{
>>>> +       struct drm_device *dev = pci_get_drvdata(pdev);
>>>> +       struct amdgpu_device *adev = drm_to_adev(dev);
>>>> +       int r;
>>>> +
>>>> +       r = pci_save_state(pdev);
>>>> +       if (!r) {
>>>> +               kfree(adev->pci_state);
>>>> +
>>>> +               adev->pci_state = pci_store_saved_state(pdev);
>>>> +
>>>> +               if (!adev->pci_state) {
>>>> +                       DRM_ERROR("Failed to store PCI saved state");
>>>> +                       return false;
>>>> +               }
>>>> +       } else {
>>>> +               DRM_WARN("Failed to save PCI state, err:%d\n", r);
>>>> +               return false;
>>>> +       }
>>>> +
>>>> +       return true;
>>>> +}
>>>> +
>>>> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
>>>> +{
>>>> +       struct drm_device *dev = pci_get_drvdata(pdev);
>>>> +       struct amdgpu_device *adev = drm_to_adev(dev);
>>>> +       int r;
>>>> +
>>>> +       if (!adev->pci_state)
>>>> +               return false;
>>>> +
>>>> +       r = pci_load_saved_state(pdev, adev->pci_state);
>>>> +
>>>> +       if (!r) {
>>>> +               pci_restore_state(pdev);
>>>> +       } else {
>>>> +               DRM_WARN("Failed to load PCI state, err:%d\n", r);
>>>> +               return false;
>>>> +       }
>>>> +}
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 4bbcc70..7a6482a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
>>>>                   if (amdgpu_is_atpx_hybrid()) {
>>>>                           pci_ignore_hotplug(pdev);
>>>>                   } else {
>>>> -                       pci_save_state(pdev);
>>>> +                       amdgpu_device_cache_pci_state(pdev);
>>>>                           pci_disable_device(pdev);
>>>>                           pci_ignore_hotplug(pdev);
>>>>                           pci_set_power_state(pdev, PCI_D3cold);
>>>> @@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
>>>>                           pci_set_master(pdev);
>>>>                   } else {
>>>>                           pci_set_power_state(pdev, PCI_D0);
>>>> -                       pci_restore_state(pdev);
>>>> +                       amdgpu_device_load_pci_state(pdev);
>>>>                           ret = pci_enable_device(pdev);
>>>>                           if (ret)
>>>>                                   return ret;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> index 4d14023..0ec6603 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> @@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>>>>           /* disable BM */
>>>>           pci_clear_master(adev->pdev);
>>>>
>>>> -       pci_save_state(adev->pdev);
>>>> +       amdgpu_device_cache_pci_state(adev->pdev);
>>>>
>>>>           if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
>>>>                   dev_info(adev->dev, "GPU smu mode1 reset\n");
>>>> @@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
>>>>
>>>>           if (ret)
>>>>                   dev_err(adev->dev, "GPU mode1 reset failed\n");
>>>> -       pci_restore_state(adev->pdev);
>>>> +       amdgpu_device_load_pci_state(adev->pdev);
>>>>
>>>>           /* wait for asic to come out of reset */
>>>>           for (i = 0; i < adev->usec_timeout; i++) {
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> index 2f93c47..ddd55e3 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> @@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
>>>>           /* disable BM */
>>>>           pci_clear_master(adev->pdev);
>>>>
>>>> -       pci_save_state(adev->pdev);
>>>> +       amdgpu_device_cache_pci_state(adev->pdev);
>>>>
>>>>           ret = psp_gpu_reset(adev);
>>>>           if (ret)
>>>>                   dev_err(adev->dev, "GPU mode1 reset failed\n");
>>>>
>>>> -       pci_restore_state(adev->pdev);
>>>> +       amdgpu_device_load_pci_state(adev->pdev);
>>>>
>>>>           /* wait for asic to come out of reset */
>>>>           for (i = 0; i < adev->usec_timeout; i++) {
>>>> --
>>>> 2.7.4
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7CAndrey.Grodzovsky%40amd.com%7C3d5f35b27c9c4055150908d84b5c81cb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637342207973407357&amp;sdata=qA1xTpuY1JtOTPf7m1WhpH%2BFqVv8GA5CDMu%2BJa3Ds4o%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.
  2020-08-28 14:21         ` Andrey Grodzovsky
@ 2020-08-28 14:32           ` Alex Deucher
  0 siblings, 0 replies; 29+ messages in thread
From: Alex Deucher @ 2020-08-28 14:32 UTC (permalink / raw)
  To: Andrey Grodzovsky; +Cc: Deucher, Alexander, Nirmoy, amd-gfx list

On Fri, Aug 28, 2020 at 10:21 AM Andrey Grodzovsky
<Andrey.Grodzovsky@amd.com> wrote:
>
>
> On 8/28/20 10:13 AM, Alex Deucher wrote:
> > On Thu, Aug 27, 2020 at 10:54 AM Andrey Grodzovsky
> > <Andrey.Grodzovsky@amd.com> wrote:
> >>
> >> On 8/26/20 11:20 AM, Alex Deucher wrote:
> >>> On Wed, Aug 26, 2020 at 10:46 AM Andrey Grodzovsky
> >>> <andrey.grodzovsky@amd.com> wrote:
> >>>> DPC recovery after prev. DPC recovery or after prev. MODE1 reset fails
> >>>> unles you save the cashe the saved PCI confspace to load it after
> >>>> each new reset.
> >>>> Also use same cached state for other use case of restoring PCI confspace
> >>>> such as GPU mode1 or VGA switheroo.
> >>>>
> >>> We don't want to keep the saved state around in the pci core
> >>> otherwise, the pci core will assume we are managing the saved state
> >>> for suspend and resume.  I think we want logic like this:
> >>>
> >>> At driver load time:
> >>> pci_save_state(pdev);
> >>> adev->pci_state = pci_store_saved_state(pdev);
> >>> pci_restore_state(adev->pdev);
> >>>
> >>> then in the case of dpc, do:
> >>> pci_load_saved_state(pdev, adev->pci_state);
> >>>
> >>> For all the other cases, just leave the code as is.
> >>
> >> Actually, as we already discussed - caching the PCI confspace only once on boot
> >> and not doing it again after each subsequent
> >> controlled or spontaneous reset runs the risk of loading back outdated confspace
> >> settings. I am not sure if and when but, is it indeed
> >> possible we make changes to PCI confspace registers during runtime and so the
> >> cached state from boot might be outdated
> >> to load back ?
> > As Christain noted we may change config space at load time if we
> > resize the BAR.  So we should probably save config space at the end of
> > the driver load init sequence.  This saved state is just a backup in
> > case something kills pci config space without saving it.  I think it's
> > fine to update the cached copy whenever we call pci_save_state(), but
> > what we don't want to do is call pci_save_state() at init time without
> > calling pci_restore_state().  pci_save_state() caches a copy of the
> > save state in the pdev structure.  The pci core uses the presence of
> > this cached state to make decisions about whether the driver or the
> > core should handle a bunch of stuff at suspend/resume.  If the cached
> > state is present, it assumes the driver will be handling all of the
> > pci related state management for power management.  I'd rather keep
> > that in the core.  I think the patch is good, but this part needs to
> > be reworked:
>
>
> In this case I believe the patch should stay as is with the only change at
> adding pci_restore_state(pdev) in amdgpu_device_cache_pci_state after the call
> to pci_store_saved_state - right ?

Correct.

Alex


>
> Andrey
>
>
> >
> > @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >          if (r)
> >                  dev_err(adev->dev, "amdgpu_pmu_init failed\n");
> >
> > -       if (pci_save_state(pdev))
> > -               DRM_ERROR("Failed to save PCI state!!\n");
> > +       /* Have stored pci confspace at hand for restore in sudden PCI error */
> > +       if (!amdgpu_device_cache_pci_state(adev->pdev))
> > +               DRM_WARN("Failed to cache PCI state!");
> >
> > Something like this:
> >
> > @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >          if (r)
> >                  dev_err(adev->dev, "amdgpu_pmu_init failed\n");
> >
> > -       if (pci_save_state(pdev))
> > -               DRM_ERROR("Failed to save PCI state!!\n");
> > +       /* Have stored pci confspace at hand for restore in sudden PCI error */
> > +       if (!amdgpu_device_cache_pci_state(adev->pdev))
> > +               DRM_WARN("Failed to cache PCI state!");
> > +       if (!amdgpu_device_load_pci_state(adev->pdev))
> > +               DRM_WARN("Failed to restore PCI state!");
> >
> > This way we have a cached copy in the driver but not in the pci core.
> >
> >
> > Alex
> >
> >
> >> Andrey
> >>
> >>
> >>> Alex
> >>>
> >>>
> >>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >>>> ---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 +++++++++++++++++++++++++++---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +-
> >>>>    drivers/gpu/drm/amd/amdgpu/nv.c            |  4 +-
> >>>>    drivers/gpu/drm/amd/amdgpu/soc15.c         |  4 +-
> >>>>    5 files changed, 66 insertions(+), 12 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>> index 3489622..42ee208 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>> @@ -992,7 +992,9 @@ struct amdgpu_device {
> >>>>           atomic_t                        throttling_logging_enabled;
> >>>>           struct ratelimit_state          throttling_logging_rs;
> >>>>           uint32_t                        ras_features;
> >>>> +
> >>> Unrelated whitespace changes.
> >>>
> >>>>           bool                            in_dpc;
> >>>> +       struct pci_saved_state          *pci_state;
> >>>>    };
> >>>>
> >>>>    static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> >>>> @@ -1272,6 +1274,10 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> >>>>    pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> >>>>    void amdgpu_pci_resume(struct pci_dev *pdev);
> >>>>
> >>>> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
> >>>> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
> >>>> +
> >>>> +
> >>>>
> >>>>    #include "amdgpu_object.h"
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> index d9e3994..2c088df 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> @@ -1283,7 +1283,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
> >>>>                   dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
> >>>>
> >>>>                   pci_set_power_state(dev->pdev, PCI_D0);
> >>>> -               pci_restore_state(dev->pdev);
> >>>> +               amdgpu_device_load_pci_state(dev->pdev);
> >>>>                   r = pci_enable_device(dev->pdev);
> >>>>                   if (r)
> >>>>                           DRM_WARN("pci_enable_device failed (%d)\n", r);
> >>>> @@ -1296,7 +1296,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
> >>>>                   drm_kms_helper_poll_disable(dev);
> >>>>                   dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
> >>>>                   amdgpu_device_suspend(dev, true);
> >>>> -               pci_save_state(dev->pdev);
> >>>> +               amdgpu_device_cache_pci_state(dev->pdev);
> >>>>                   /* Shut down the device */
> >>>>                   pci_disable_device(dev->pdev);
> >>>>                   pci_set_power_state(dev->pdev, PCI_D3cold);
> >>>> @@ -3401,8 +3401,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >>>>           if (r)
> >>>>                   dev_err(adev->dev, "amdgpu_pmu_init failed\n");
> >>>>
> >>>> -       if (pci_save_state(pdev))
> >>>> -               DRM_ERROR("Failed to save PCI state!!\n");
> >>>> +       /* Have stored pci confspace at hand for restore in sudden PCI error */
> >>>> +       if (!amdgpu_device_cache_pci_state(adev->pdev))
> >>>> +               DRM_WARN("Failed to cache PCI state!");
> >>>>
> >>>>           return 0;
> >>>>
> >>>> @@ -3430,6 +3431,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
> >>>>           flush_delayed_work(&adev->delayed_init_work);
> >>>>           adev->shutdown = true;
> >>>>
> >>>> +       kfree(adev->pci_state);
> >>>> +
> >>>>           /* make sure IB test finished before entering exclusive mode
> >>>>            * to avoid preemption on IB test
> >>>>            * */
> >>>> @@ -4855,7 +4858,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> >>>>           /* wait for asic to come out of reset */
> >>>>           msleep(500);
> >>>>
> >>>> -       pci_restore_state(pdev);
> >>>> +       amdgpu_device_load_pci_state(pdev);
> >>>>
> >>>>           /* confirm  ASIC came out of reset */
> >>>>           for (i = 0; i < adev->usec_timeout; i++) {
> >>>> @@ -4934,8 +4937,10 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> >>>>
> >>>>    out:
> >>>>
> >>>> -       if (!r)
> >>>> +       if (!r) {
> >>>> +               amdgpu_device_cache_pci_state(adev->pdev);
> >>>>                   DRM_INFO("PCIe error recovery succeeded\n");
> >>>> +       }
> >>>>           else {
> >>>>                   DRM_ERROR("PCIe error recovery failed, err:%d", r);
> >>>>                   amdgpu_device_unlock_adev(adev);
> >>>> @@ -4974,3 +4979,46 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
> >>>>
> >>>>           amdgpu_device_unlock_adev(adev);
> >>>>    }
> >>>> +
> >>>> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
> >>>> +{
> >>>> +       struct drm_device *dev = pci_get_drvdata(pdev);
> >>>> +       struct amdgpu_device *adev = drm_to_adev(dev);
> >>>> +       int r;
> >>>> +
> >>>> +       r = pci_save_state(pdev);
> >>>> +       if (!r) {
> >>>> +               kfree(adev->pci_state);
> >>>> +
> >>>> +               adev->pci_state = pci_store_saved_state(pdev);
> >>>> +
> >>>> +               if (!adev->pci_state) {
> >>>> +                       DRM_ERROR("Failed to store PCI saved state");
> >>>> +                       return false;
> >>>> +               }
> >>>> +       } else {
> >>>> +               DRM_WARN("Failed to save PCI state, err:%d\n", r);
> >>>> +               return false;
> >>>> +       }
> >>>> +
> >>>> +       return true;
> >>>> +}
> >>>> +
> >>>> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
> >>>> +{
> >>>> +       struct drm_device *dev = pci_get_drvdata(pdev);
> >>>> +       struct amdgpu_device *adev = drm_to_adev(dev);
> >>>> +       int r;
> >>>> +
> >>>> +       if (!adev->pci_state)
> >>>> +               return false;
> >>>> +
> >>>> +       r = pci_load_saved_state(pdev, adev->pci_state);
> >>>> +
> >>>> +       if (!r) {
> >>>> +               pci_restore_state(pdev);
> >>>> +       } else {
> >>>> +               DRM_WARN("Failed to load PCI state, err:%d\n", r);
> >>>> +               return false;
> >>>> +       }
> >>>> +}
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>> index 4bbcc70..7a6482a 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>> @@ -1320,7 +1320,7 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
> >>>>                   if (amdgpu_is_atpx_hybrid()) {
> >>>>                           pci_ignore_hotplug(pdev);
> >>>>                   } else {
> >>>> -                       pci_save_state(pdev);
> >>>> +                       amdgpu_device_cache_pci_state(pdev);
> >>>>                           pci_disable_device(pdev);
> >>>>                           pci_ignore_hotplug(pdev);
> >>>>                           pci_set_power_state(pdev, PCI_D3cold);
> >>>> @@ -1353,7 +1353,7 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
> >>>>                           pci_set_master(pdev);
> >>>>                   } else {
> >>>>                           pci_set_power_state(pdev, PCI_D0);
> >>>> -                       pci_restore_state(pdev);
> >>>> +                       amdgpu_device_load_pci_state(pdev);
> >>>>                           ret = pci_enable_device(pdev);
> >>>>                           if (ret)
> >>>>                                   return ret;
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
> >>>> index 4d14023..0ec6603 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> >>>> @@ -311,7 +311,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
> >>>>           /* disable BM */
> >>>>           pci_clear_master(adev->pdev);
> >>>>
> >>>> -       pci_save_state(adev->pdev);
> >>>> +       amdgpu_device_cache_pci_state(adev->pdev);
> >>>>
> >>>>           if (amdgpu_dpm_is_mode1_reset_supported(adev)) {
> >>>>                   dev_info(adev->dev, "GPU smu mode1 reset\n");
> >>>> @@ -323,7 +323,7 @@ static int nv_asic_mode1_reset(struct amdgpu_device *adev)
> >>>>
> >>>>           if (ret)
> >>>>                   dev_err(adev->dev, "GPU mode1 reset failed\n");
> >>>> -       pci_restore_state(adev->pdev);
> >>>> +       amdgpu_device_load_pci_state(adev->pdev);
> >>>>
> >>>>           /* wait for asic to come out of reset */
> >>>>           for (i = 0; i < adev->usec_timeout; i++) {
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >>>> index 2f93c47..ddd55e3 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >>>> @@ -484,13 +484,13 @@ static int soc15_asic_mode1_reset(struct amdgpu_device *adev)
> >>>>           /* disable BM */
> >>>>           pci_clear_master(adev->pdev);
> >>>>
> >>>> -       pci_save_state(adev->pdev);
> >>>> +       amdgpu_device_cache_pci_state(adev->pdev);
> >>>>
> >>>>           ret = psp_gpu_reset(adev);
> >>>>           if (ret)
> >>>>                   dev_err(adev->dev, "GPU mode1 reset failed\n");
> >>>>
> >>>> -       pci_restore_state(adev->pdev);
> >>>> +       amdgpu_device_load_pci_state(adev->pdev);
> >>>>
> >>>>           /* wait for asic to come out of reset */
> >>>>           for (i = 0; i < adev->usec_timeout; i++) {
> >>>> --
> >>>> 2.7.4
> >>>>
> >>>> _______________________________________________
> >>>> amd-gfx mailing list
> >>>> amd-gfx@lists.freedesktop.org
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7CAndrey.Grodzovsky%40amd.com%7C3d5f35b27c9c4055150908d84b5c81cb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637342207973407357&amp;sdata=qA1xTpuY1JtOTPf7m1WhpH%2BFqVv8GA5CDMu%2BJa3Ds4o%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-08-26 14:46 ` [PATCH 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
  2020-08-26 15:09   ` Alex Deucher
  2020-08-26 16:53   ` Nirmoy
@ 2020-08-29  2:03   ` Luben Tuikov
  2020-08-31 14:58     ` Andrey Grodzovsky
  2 siblings, 1 reply; 29+ messages in thread
From: Luben Tuikov @ 2020-08-29  2:03 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: alexander.deucher, nirmodas

On 2020-08-26 10:46, Andrey Grodzovsky wrote:
> At this point the ASIC is already post reset by the HW/PSP
> so the HW not in proper state to be configured for suspension,
> some bloks might be even gated and so best is to avoid touching it.

"blocks"

> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 37 ++++++++++++++++++++++++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
>  drivers/gpu/drm/amd/amdgpu/atom.c          |  1 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 +++++++++------
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  3 +++
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
>  8 files changed, 64 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 3399242..3489622 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,6 +992,7 @@ struct amdgpu_device {
>  	atomic_t			throttling_logging_enabled;
>  	struct ratelimit_state		throttling_logging_rs;
>  	uint32_t			ras_features;
> +	bool                            in_dpc;
>  };

This is very cryptic and if we do that, in 10 years,
the code will be full of cryptic variables and comments.

Better to add a comment

	bool				in_dpc; /* PCI Downstream Port Containment */

>  
>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 84f8d14..e1bbefd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>  {
>  	uint32_t ret;
>  
> +	if (adev->in_dpc)
> +		return 0;
> +
>  	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>  		return amdgpu_kiq_rreg(adev, reg);
>  
> @@ -350,8 +353,10 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>   *
>   * Returns the 8 bit value from the offset specified.
>   */
> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
> -{
> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> +
> +	if (adev->in_dpc)
> +		return 0;
>  
>  	if (offset < adev->rmmio_size)
>  		return (readb(adev->rmmio + offset));
> @@ -373,8 +378,10 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>   *
>   * Writes the value specified to the offset specified.
>   */
> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
> -{
> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
> +
> +	if (adev->in_dpc)
> +		return;
>  
>  	if (offset < adev->rmmio_size)
>  		writeb(value, adev->rmmio + offset);
> @@ -384,6 +391,8 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>  
>  void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
>  {
> +	if (adev->in_dpc)
> +		return;
>  
>  	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
>  
> @@ -412,6 +421,8 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
>  void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>  		    uint32_t acc_flags)
>  {
> +	if (adev->in_dpc)
> +		return;
>  
>  	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>  		return amdgpu_kiq_wreg(adev, reg, v);
> @@ -427,6 +438,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>  void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>  		    uint32_t acc_flags)
>  {
> +	if (adev->in_dpc)
> +		return;
> +
>  	if (amdgpu_sriov_fullaccess(adev) &&
>  		adev->gfx.rlc.funcs &&
>  		adev->gfx.rlc.funcs->is_rlcg_access_range) {
> @@ -448,6 +462,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
>   */
>  u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>  {
> +	if (adev->in_dpc)
> +		return 0;
> +
>  	if ((reg * 4) < adev->rio_mem_size)
>  		return ioread32(adev->rio_mem + (reg * 4));
>  	else {
> @@ -467,6 +484,8 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>   */
>  void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>  {
> +	if (adev->in_dpc)
> +		return;
>  
>  	if ((reg * 4) < adev->rio_mem_size)
>  		iowrite32(v, adev->rio_mem + (reg * 4));
> @@ -487,6 +506,8 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>   */
>  u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>  {
> +	if (adev->in_dpc)
> +		return 0;
>  
>  	if (index < adev->doorbell.num_doorbells) {
>  		return readl(adev->doorbell.ptr + index);
> @@ -508,6 +529,8 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>   */
>  void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>  {
> +	if (adev->in_dpc)
> +		return;
>  
>  	if (index < adev->doorbell.num_doorbells) {
>  		writel(v, adev->doorbell.ptr + index);
> @@ -527,6 +550,8 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>   */
>  u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>  {
> +	if (adev->in_dpc)
> +		return 0;
>  
>  	if (index < adev->doorbell.num_doorbells) {
>  		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
> @@ -548,6 +573,8 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>   */
>  void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
>  {
> +	if (adev->in_dpc)
> +			return;
>  
>  	if (index < adev->doorbell.num_doorbells) {
>  		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
> @@ -4790,7 +4817,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>  
>  	pci_restore_state(pdev);
>  
> +	adev->in_dpc = true;
>  	r = amdgpu_device_ip_suspend(adev);
> +	adev->in_dpc = false;
>  	if (r)
>  		goto out;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index d698142..50c42c9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -693,6 +693,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
>  	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>  	struct amdgpu_ring *ring = &kiq->ring;
>  
> +	if (adev->in_dpc)
> +		return 0;
> +
>  	BUG_ON(!ring->funcs->emit_rreg);
>  
>  	spin_lock_irqsave(&kiq->ring_lock, flags);
> @@ -757,6 +760,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>  
>  	BUG_ON(!ring->funcs->emit_wreg);
>  
> +	if (adev->in_dpc)
> +		return;
> +

Don't know how I feel about such "architecture",
where we introduce a flag-type variable and check it
in a slew of places. If we did this a few times over,
we'd have a few flags which we check almost everywhere,
and it would make the code difficult to analyze.
For instance, which flag to check first? Or, how
are flags related to each other.

There has to be a better way, where we check
that flag only once before proceeding. A more
unified approach.

>  	spin_lock_irqsave(&kiq->ring_lock, flags);
>  	amdgpu_ring_alloc(ring, 32);
>  	amdgpu_ring_emit_wreg(ring, reg, v);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index d6c38e2..555ef59 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -219,6 +219,9 @@ int psp_wait_for(struct psp_context *psp, uint32_t reg_index,
>  	int i;
>  	struct amdgpu_device *adev = psp->adev;
>  
> +	if (psp->adev->in_dpc)
> +		return 0;
> +
>  	for (i = 0; i < adev->usec_timeout; i++) {
>  		val = RREG32(reg_index);
>  		if (check_changed) {
> @@ -245,6 +248,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
>  	bool ras_intr = false;
>  	bool skip_unsupport = false;
>  
> +	if (psp->adev->in_dpc)
> +		return 0;
> +
>  	mutex_lock(&psp->mutex);
>  
>  	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
> diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c
> index 4cfc786..613dac1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/atom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/atom.c
> @@ -750,6 +750,7 @@ static void atom_op_jump(atom_exec_context *ctx, int *ptr, int arg)
>  					DRM_ERROR("atombios stuck in loop for more than %dsecs aborting\n",
>  						  ATOM_CMD_TIMEOUT_SEC);
>  					ctx->abort = true;
> +					dump_stack();
>  				}
>  			} else {
>  				/* jiffies wrap around we will just wait a little longer */
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 2db195e..86f268e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -6980,15 +6980,19 @@ static int gfx_v10_0_hw_fini(void *handle)
>  
>  	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
>  	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
> +
> +	if (!adev->in_dpc) {
>  #ifndef BRING_UP_DEBUG
> -	if (amdgpu_async_gfx_ring) {
> -		r = gfx_v10_0_kiq_disable_kgq(adev);
> -		if (r)
> -			DRM_ERROR("KGQ disable failed\n");
> -	}
> +		if (amdgpu_async_gfx_ring) {
> +			r = gfx_v10_0_kiq_disable_kgq(adev);
> +			if (r)
> +				DRM_ERROR("KGQ disable failed\n");
> +		}
>  #endif
> -	if (amdgpu_gfx_disable_kcq(adev))
> -		DRM_ERROR("KCQ disable failed\n");
> +		if (amdgpu_gfx_disable_kcq(adev))
> +			DRM_ERROR("KCQ disable failed\n");
> +	}
> +
>  	if (amdgpu_sriov_vf(adev)) {
>  		gfx_v10_0_cp_gfx_enable(adev, false);
>  		/* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index 8462b30..306461d 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -224,9 +224,12 @@ int smu_dpm_set_power_gate(struct smu_context *smu, uint32_t block_type,
>  {
>  	int ret = 0;
>  
> +
>  	if (!smu->pm_enabled || !smu->adev->pm.dpm_enabled)
>  		return -EOPNOTSUPP;
>  
> +
> +

Unnecessary white space changes.

Regards,
Luben

>  	switch (block_type) {
>  	/*
>  	 * Some legacy code of amdgpu_vcn.c and vcn_v2*.c still uses
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index a58ea08..02cf55e 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -112,6 +112,9 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>  	struct amdgpu_device *adev = smu->adev;
>  	int ret = 0, index = 0;
>  
> +	if (smu->adev->in_dpc)
> +		return 0;
> +
>  	index = smu_cmn_to_asic_specific_index(smu,
>  					       CMN2ASIC_MAPPING_MSG,
>  					       msg);
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-26 14:46 ` [PATCH 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
  2020-08-26 15:08   ` Alex Deucher
  2020-08-27  1:23   ` Li, Dennis
@ 2020-08-29  2:07   ` Luben Tuikov
  2020-08-31 14:30     ` Andrey Grodzovsky
  2 siblings, 1 reply; 29+ messages in thread
From: Luben Tuikov @ 2020-08-29  2:07 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: alexander.deucher, nirmodas

On 2020-08-26 10:46, Andrey Grodzovsky wrote:
> Add DPC handlers with basic recovery functionality.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 181 ++++++++++++++++++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
>  3 files changed, 196 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 49ea9fa..3399242 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -49,6 +49,8 @@
>  #include <linux/rbtree.h>
>  #include <linux/hashtable.h>
>  #include <linux/dma-fence.h>
> +#include <linux/pci.h>
> +#include <linux/aer.h>
>  
>  #include <drm/ttm/ttm_bo_api.h>
>  #include <drm/ttm/ttm_bo_driver.h>
> @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
>  void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
>  void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
>  
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
> +					   pci_channel_state_t state);
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
> +void amdgpu_pci_resume(struct pci_dev *pdev);
> +
> +
>  #include "amdgpu_object.h"
>  
>  /* used by df_v3_6.c and amdgpu_pmu.c */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5a948ed..84f8d14 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -350,7 +350,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>   *
>   * Returns the 8 bit value from the offset specified.
>   */
> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
> +{
> +
>  	if (offset < adev->rmmio_size)
>  		return (readb(adev->rmmio + offset));
>  	BUG();
> @@ -371,7 +373,9 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
>   *
>   * Writes the value specified to the offset specified.
>   */
> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
> +{
> +
>  	if (offset < adev->rmmio_size)
>  		writeb(value, adev->rmmio + offset);
>  	else
> @@ -380,6 +384,7 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>  
>  void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
>  {
> +
>  	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
>  
>  	if ((reg * 4) < adev->rmmio_size)
> @@ -407,6 +412,7 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
>  void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>  		    uint32_t acc_flags)
>  {
> +
>  	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>  		return amdgpu_kiq_wreg(adev, reg, v);
>  
> @@ -461,6 +467,7 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>   */
>  void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>  {
> +
>  	if ((reg * 4) < adev->rio_mem_size)
>  		iowrite32(v, adev->rio_mem + (reg * 4));
>  	else {
> @@ -480,6 +487,7 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>   */
>  u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>  {
> +
>  	if (index < adev->doorbell.num_doorbells) {
>  		return readl(adev->doorbell.ptr + index);
>  	} else {
> @@ -500,6 +508,7 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>   */
>  void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>  {
> +
>  	if (index < adev->doorbell.num_doorbells) {
>  		writel(v, adev->doorbell.ptr + index);
>  	} else {
> @@ -518,6 +527,7 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>   */
>  u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>  {
> +
>  	if (index < adev->doorbell.num_doorbells) {
>  		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
>  	} else {
> @@ -538,6 +548,7 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>   */
>  void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
>  {
> +
>  	if (index < adev->doorbell.num_doorbells) {
>  		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
>  	} else {
> @@ -2989,6 +3000,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>  	NULL
>  };
>  
> +
>  /**
>   * amdgpu_device_init - initialize the driver
>   *
> @@ -3207,6 +3219,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  		}
>  	}
>  
> +	pci_enable_pcie_error_reporting(adev->ddev.pdev);
> +
> +
>  	/* Post card if necessary */
>  	if (amdgpu_device_need_post(adev)) {
>  		if (!adev->bios) {
> @@ -3359,6 +3374,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  	if (r)
>  		dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>  
> +	if (pci_save_state(pdev))
> +		DRM_ERROR("Failed to save PCI state!!\n");
> +
>  	return 0;
>  
>  failed:
> @@ -4701,3 +4719,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>  
>  	return 0;
>  }
> +
> +/**
> + * amdgpu_pci_error_detected - Called when a PCI error is detected.
> + * @pdev: PCI device struct
> + * @state: PCI channel state
> + *
> + * Description: Called when a PCI error is detected.
> + *
> + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
> + */
> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
> +
> +	switch (state) {
> +	case pci_channel_io_normal:
> +		return PCI_ERS_RESULT_CAN_RECOVER;
> +	case pci_channel_io_frozen: {
> +		/* Fatal error, prepare for slot reset */
> +
> +		amdgpu_device_lock_adev(adev);
> +		return PCI_ERS_RESULT_NEED_RESET;
> +	}
> +	case pci_channel_io_perm_failure:
> +		/* Permanent error, prepare for device removal */
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	}
> +	return PCI_ERS_RESULT_NEED_RESET;
> +}

Perhaps an empty line before the "return".

> +
> +/**
> + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
> + * @pdev: pointer to PCI device
> + */
> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
> +{
> +
> +	DRM_INFO("PCI error: mmio enabled callback!!\n");
> +
> +	/* TODO - dump whatever for debugging purposes */
> +
> +	/* This called only if amdgpu_pci_error_detected returns
> +	 * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
> +	 * works, no need to reset slot.
> +	 */
> +
> +	return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
> + * @pdev: PCI device struct
> + *
> + * Description: This routine is called by the pci error recovery
> + * code after the PCI slot has been reset, just before we
> + * should resume normal operations.
> + */
> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int r;
> +	bool vram_lost;
> +
> +	DRM_INFO("PCI error: slot reset callback!!\n");
> +
> +	pci_restore_state(pdev);
> +
> +	r = amdgpu_device_ip_suspend(adev);
> +	if (r)
> +		goto out;
> +
> +
> +	/* post card */
> +	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_ip_resume_phase1(adev);
> +	if (r)
> +		goto out;
> +
> +	vram_lost = amdgpu_device_check_vram_lost(adev);
> +	if (vram_lost) {
> +		DRM_INFO("VRAM is lost due to GPU reset!\n");
> +		amdgpu_inc_vram_lost(adev);
> +	}
> +
> +	r = amdgpu_gtt_mgr_recover(
> +		&adev->mman.bdev.man[TTM_PL_TT]);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_fw_loading(adev);
> +	if (r)
> +		return r;
> +
> +	r = amdgpu_device_ip_resume_phase2(adev);
> +	if (r)
> +		goto out;
> +
> +	if (vram_lost)
> +		amdgpu_device_fill_reset_magic(adev);
> +
> +	/*
> +	 * Add this ASIC as tracked as reset was already
> +	 * complete successfully.
> +	 */
> +	amdgpu_register_gpu_instance(adev);
> +
> +	r = amdgpu_device_ip_late_init(adev);
> +	if (r)
> +		goto out;
> +
> +	amdgpu_fbdev_set_suspend(adev, 0);
> +
> +	/* must succeed. */
> +	amdgpu_ras_resume(adev);
> +
> +
> +	amdgpu_irq_gpu_reset_resume_helper(adev);
> +	r = amdgpu_ib_ring_tests(adev);
> +	if (r)
> +		goto out;
> +
> +	r = amdgpu_device_recover_vram(adev);
> +
> +out:
> +
> +	if (!r)
> +		DRM_INFO("PCIe error recovery succeeded\n");
> +	else {
> +		DRM_ERROR("PCIe error recovery failed, err:%d", r);
> +		amdgpu_device_unlock_adev(adev);
> +	}

Add braces around the "if ()" if you're going to add them
around the "else".

It's a good idea to run patches through checkpatch.pl--as
a general guidance--sometimes it finds good things.

Regards,
Luben

> +
> +	return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/**
> + * amdgpu_pci_resume() - resume normal ops after PCI reset
> + * @pdev: pointer to PCI device
> + *
> + * Called when the error recovery driver tells us that its
> + * OK to resume normal operation. Use completion to allow
> + * halted scsi ops to resume.
> + */
> +void amdgpu_pci_resume(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +
> +	amdgpu_device_unlock_adev(adev);
> +
> +	DRM_INFO("PCI error: resume callback!!\n");
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index d984c6a..4bbcc70 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -31,7 +31,6 @@
>  #include <drm/drm_pciids.h>
>  #include <linux/console.h>
>  #include <linux/module.h>
> -#include <linux/pci.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/vga_switcheroo.h>
>  #include <drm/drm_probe_helper.h>
> @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
>  	.patchlevel = KMS_DRIVER_PATCHLEVEL,
>  };
>  
> +static struct pci_error_handlers amdgpu_pci_err_handler = {
> +	.error_detected	= amdgpu_pci_error_detected,
> +	.mmio_enabled	= amdgpu_pci_mmio_enabled,
> +	.slot_reset	= amdgpu_pci_slot_reset,
> +	.resume		= amdgpu_pci_resume,
> +};
> +
>  static struct pci_driver amdgpu_kms_pci_driver = {
>  	.name = DRIVER_NAME,
>  	.id_table = pciidlist,
> @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>  	.remove = amdgpu_pci_remove,
>  	.shutdown = amdgpu_pci_shutdown,
>  	.driver.pm = &amdgpu_pm_ops,
> +	.err_handler = &amdgpu_pci_err_handler,
>  };
>  
>  static int __init amdgpu_init(void)
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery
  2020-08-26 14:46 ` [PATCH 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
@ 2020-08-29  2:16   ` Luben Tuikov
  2020-08-31 15:11     ` Andrey Grodzovsky
  0 siblings, 1 reply; 29+ messages in thread
From: Luben Tuikov @ 2020-08-29  2:16 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: alexander.deucher, nirmodas

On 2020-08-26 10:46, Andrey Grodzovsky wrote:
> DPC recovery involves ASIC reset just as normal GPU recovery so block
> SW GPU shcedulers and wait on all concurent GPU resets.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
>  1 file changed, 53 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e1bbefd..0a2130b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4747,6 +4747,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>  	return 0;
>  }
>  
> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> +{
> +	int i;
> +
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		struct amdgpu_ring *ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +		cancel_delayed_work_sync(&ring->sched.work_tdr);
> +	}
> +}
> +
>  /**
>   * amdgpu_pci_error_detected - Called when a PCI error is detected.
>   * @pdev: PCI device struct
> @@ -4760,16 +4774,38 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>  {
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int i;
>  
>  	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>  
>  	switch (state) {
>  	case pci_channel_io_normal:
>  		return PCI_ERS_RESULT_CAN_RECOVER;
> -	case pci_channel_io_frozen: {
> -		/* Fatal error, prepare for slot reset */
> +	case pci_channel_io_frozen: { /* Fatal error, prepare for slot reset */
> +
> +		/*
> +		 * Cancel and wait for all TDRs in progress if failing to
> +		 * set  adev->in_gpu_reset in amdgpu_device_lock_adev
> +		 *
> +		 * Locking adev->reset_sem will perevent any external access
> +		 * to GPU during PCI error recovery
> +		 */
> +		while (!amdgpu_device_lock_adev(adev, NULL))
> +			amdgpu_cancel_all_tdr(adev);

Is it possible for this to hang forever? It eventually
resolves to __cancel_work_timer() in kernel/workqueue.c
which provides no timeout.

If you cannot guarantee the TDRs to return back, then
you'd need a timeout here.

Regards,
Luben

> +
> +		/*
> +		 * Block any work scheduling as we do for regualr GPU reset
> +		 * for the duration of the recoveryq
> +		 */
> +		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +			struct amdgpu_ring *ring = adev->rings[i];
> +
> +			if (!ring || !ring->sched.thread)
> +				continue;
> +
> +			drm_sched_stop(&ring->sched, NULL);
> +		}
>  
> -		amdgpu_device_lock_adev(adev);
>  		return PCI_ERS_RESULT_NEED_RESET;
>  	}
>  	case pci_channel_io_perm_failure:
> @@ -4902,8 +4938,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>  {
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int i;
>  
> -	amdgpu_device_unlock_adev(adev);
>  
>  	DRM_INFO("PCI error: resume callback!!\n");
> +
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		struct amdgpu_ring *ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +
> +		drm_sched_resubmit_jobs(&ring->sched);
> +		drm_sched_start(&ring->sched, true);
> +	}
> +
> +	amdgpu_device_unlock_adev(adev);
>  }
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/7] drm/amdgpu: Implement DPC recovery
  2020-08-29  2:07   ` Luben Tuikov
@ 2020-08-31 14:30     ` Andrey Grodzovsky
  0 siblings, 0 replies; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 14:30 UTC (permalink / raw)
  To: Luben Tuikov, amd-gfx; +Cc: alexander.deucher, nirmodas


On 8/28/20 10:07 PM, Luben Tuikov wrote:
> On 2020-08-26 10:46, Andrey Grodzovsky wrote:
>> Add DPC handlers with basic recovery functionality.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 ++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 181 ++++++++++++++++++++++++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   9 +-
>>   3 files changed, 196 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 49ea9fa..3399242 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -49,6 +49,8 @@
>>   #include <linux/rbtree.h>
>>   #include <linux/hashtable.h>
>>   #include <linux/dma-fence.h>
>> +#include <linux/pci.h>
>> +#include <linux/aer.h>
>>   
>>   #include <drm/ttm/ttm_bo_api.h>
>>   #include <drm/ttm/ttm_bo_driver.h>
>> @@ -1263,6 +1265,13 @@ static inline int amdgpu_dm_display_resume(struct amdgpu_device *adev) { return
>>   void amdgpu_register_gpu_instance(struct amdgpu_device *adev);
>>   void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev);
>>   
>> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev,
>> +					   pci_channel_state_t state);
>> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
>> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
>> +void amdgpu_pci_resume(struct pci_dev *pdev);
>> +
>> +
>>   #include "amdgpu_object.h"
>>   
>>   /* used by df_v3_6.c and amdgpu_pmu.c */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 5a948ed..84f8d14 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -350,7 +350,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>>    *
>>    * Returns the 8 bit value from the offset specified.
>>    */
>> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
>> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>> +{
>> +
>>   	if (offset < adev->rmmio_size)
>>   		return (readb(adev->rmmio + offset));
>>   	BUG();
>> @@ -371,7 +373,9 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
>>    *
>>    * Writes the value specified to the offset specified.
>>    */
>> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
>> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>> +{
>> +
>>   	if (offset < adev->rmmio_size)
>>   		writeb(value, adev->rmmio + offset);
>>   	else
>> @@ -380,6 +384,7 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>>   
>>   void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
>>   {
>> +
>>   	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
>>   
>>   	if ((reg * 4) < adev->rmmio_size)
>> @@ -407,6 +412,7 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
>>   void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>>   		    uint32_t acc_flags)
>>   {
>> +
>>   	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>>   		return amdgpu_kiq_wreg(adev, reg, v);
>>   
>> @@ -461,6 +467,7 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>>    */
>>   void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>>   {
>> +
>>   	if ((reg * 4) < adev->rio_mem_size)
>>   		iowrite32(v, adev->rio_mem + (reg * 4));
>>   	else {
>> @@ -480,6 +487,7 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>>    */
>>   u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>>   {
>> +
>>   	if (index < adev->doorbell.num_doorbells) {
>>   		return readl(adev->doorbell.ptr + index);
>>   	} else {
>> @@ -500,6 +508,7 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>>    */
>>   void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>>   {
>> +
>>   	if (index < adev->doorbell.num_doorbells) {
>>   		writel(v, adev->doorbell.ptr + index);
>>   	} else {
>> @@ -518,6 +527,7 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>>    */
>>   u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>>   {
>> +
>>   	if (index < adev->doorbell.num_doorbells) {
>>   		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
>>   	} else {
>> @@ -538,6 +548,7 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>>    */
>>   void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
>>   {
>> +
>>   	if (index < adev->doorbell.num_doorbells) {
>>   		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
>>   	} else {
>> @@ -2989,6 +3000,7 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>>   	NULL
>>   };
>>   
>> +
>>   /**
>>    * amdgpu_device_init - initialize the driver
>>    *
>> @@ -3207,6 +3219,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>   		}
>>   	}
>>   
>> +	pci_enable_pcie_error_reporting(adev->ddev.pdev);
>> +
>> +
>>   	/* Post card if necessary */
>>   	if (amdgpu_device_need_post(adev)) {
>>   		if (!adev->bios) {
>> @@ -3359,6 +3374,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>   	if (r)
>>   		dev_err(adev->dev, "amdgpu_pmu_init failed\n");
>>   
>> +	if (pci_save_state(pdev))
>> +		DRM_ERROR("Failed to save PCI state!!\n");
>> +
>>   	return 0;
>>   
>>   failed:
>> @@ -4701,3 +4719,162 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>>   
>>   	return 0;
>>   }
>> +
>> +/**
>> + * amdgpu_pci_error_detected - Called when a PCI error is detected.
>> + * @pdev: PCI device struct
>> + * @state: PCI channel state
>> + *
>> + * Description: Called when a PCI error is detected.
>> + *
>> + * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT.
>> + */
>> +pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
>> +{
>> +	struct drm_device *dev = pci_get_drvdata(pdev);
>> +	struct amdgpu_device *adev = drm_to_adev(dev);
>> +
>> +	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>> +
>> +	switch (state) {
>> +	case pci_channel_io_normal:
>> +		return PCI_ERS_RESULT_CAN_RECOVER;
>> +	case pci_channel_io_frozen: {
>> +		/* Fatal error, prepare for slot reset */
>> +
>> +		amdgpu_device_lock_adev(adev);
>> +		return PCI_ERS_RESULT_NEED_RESET;
>> +	}
>> +	case pci_channel_io_perm_failure:
>> +		/* Permanent error, prepare for device removal */
>> +		return PCI_ERS_RESULT_DISCONNECT;
>> +	}
>> +	return PCI_ERS_RESULT_NEED_RESET;
>> +}
> Perhaps an empty line before the "return".
>
>> +
>> +/**
>> + * amdgpu_pci_mmio_enabled - Enable MMIO and dump debug registers
>> + * @pdev: pointer to PCI device
>> + */
>> +pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev)
>> +{
>> +
>> +	DRM_INFO("PCI error: mmio enabled callback!!\n");
>> +
>> +	/* TODO - dump whatever for debugging purposes */
>> +
>> +	/* This called only if amdgpu_pci_error_detected returns
>> +	 * PCI_ERS_RESULT_CAN_RECOVER. Read/write to the device still
>> +	 * works, no need to reset slot.
>> +	 */
>> +
>> +	return PCI_ERS_RESULT_RECOVERED;
>> +}
>> +
>> +/**
>> + * amdgpu_pci_slot_reset - Called when PCI slot has been reset.
>> + * @pdev: PCI device struct
>> + *
>> + * Description: This routine is called by the pci error recovery
>> + * code after the PCI slot has been reset, just before we
>> + * should resume normal operations.
>> + */
>> +pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>> +{
>> +	struct drm_device *dev = pci_get_drvdata(pdev);
>> +	struct amdgpu_device *adev = drm_to_adev(dev);
>> +	int r;
>> +	bool vram_lost;
>> +
>> +	DRM_INFO("PCI error: slot reset callback!!\n");
>> +
>> +	pci_restore_state(pdev);
>> +
>> +	r = amdgpu_device_ip_suspend(adev);
>> +	if (r)
>> +		goto out;
>> +
>> +
>> +	/* post card */
>> +	r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
>> +	if (r)
>> +		goto out;
>> +
>> +	r = amdgpu_device_ip_resume_phase1(adev);
>> +	if (r)
>> +		goto out;
>> +
>> +	vram_lost = amdgpu_device_check_vram_lost(adev);
>> +	if (vram_lost) {
>> +		DRM_INFO("VRAM is lost due to GPU reset!\n");
>> +		amdgpu_inc_vram_lost(adev);
>> +	}
>> +
>> +	r = amdgpu_gtt_mgr_recover(
>> +		&adev->mman.bdev.man[TTM_PL_TT]);
>> +	if (r)
>> +		goto out;
>> +
>> +	r = amdgpu_device_fw_loading(adev);
>> +	if (r)
>> +		return r;
>> +
>> +	r = amdgpu_device_ip_resume_phase2(adev);
>> +	if (r)
>> +		goto out;
>> +
>> +	if (vram_lost)
>> +		amdgpu_device_fill_reset_magic(adev);
>> +
>> +	/*
>> +	 * Add this ASIC as tracked as reset was already
>> +	 * complete successfully.
>> +	 */
>> +	amdgpu_register_gpu_instance(adev);
>> +
>> +	r = amdgpu_device_ip_late_init(adev);
>> +	if (r)
>> +		goto out;
>> +
>> +	amdgpu_fbdev_set_suspend(adev, 0);
>> +
>> +	/* must succeed. */
>> +	amdgpu_ras_resume(adev);
>> +
>> +
>> +	amdgpu_irq_gpu_reset_resume_helper(adev);
>> +	r = amdgpu_ib_ring_tests(adev);
>> +	if (r)
>> +		goto out;
>> +
>> +	r = amdgpu_device_recover_vram(adev);
>> +
>> +out:
>> +
>> +	if (!r)
>> +		DRM_INFO("PCIe error recovery succeeded\n");
>> +	else {
>> +		DRM_ERROR("PCIe error recovery failed, err:%d", r);
>> +		amdgpu_device_unlock_adev(adev);
>> +	}
> Add braces around the "if ()" if you're going to add them
> around the "else".
>
> It's a good idea to run patches through checkpatch.pl--as
> a general guidance--sometimes it finds good things.
>
> Regards,
> Luben


I did run it, I might have missed this of course.

Andrey


>
>> +
>> +	return r ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>> +}
>> +
>> +/**
>> + * amdgpu_pci_resume() - resume normal ops after PCI reset
>> + * @pdev: pointer to PCI device
>> + *
>> + * Called when the error recovery driver tells us that its
>> + * OK to resume normal operation. Use completion to allow
>> + * halted scsi ops to resume.
>> + */
>> +void amdgpu_pci_resume(struct pci_dev *pdev)
>> +{
>> +	struct drm_device *dev = pci_get_drvdata(pdev);
>> +	struct amdgpu_device *adev = drm_to_adev(dev);
>> +
>> +	amdgpu_device_unlock_adev(adev);
>> +
>> +	DRM_INFO("PCI error: resume callback!!\n");
>> +}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index d984c6a..4bbcc70 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -31,7 +31,6 @@
>>   #include <drm/drm_pciids.h>
>>   #include <linux/console.h>
>>   #include <linux/module.h>
>> -#include <linux/pci.h>
>>   #include <linux/pm_runtime.h>
>>   #include <linux/vga_switcheroo.h>
>>   #include <drm/drm_probe_helper.h>
>> @@ -1534,6 +1533,13 @@ static struct drm_driver kms_driver = {
>>   	.patchlevel = KMS_DRIVER_PATCHLEVEL,
>>   };
>>   
>> +static struct pci_error_handlers amdgpu_pci_err_handler = {
>> +	.error_detected	= amdgpu_pci_error_detected,
>> +	.mmio_enabled	= amdgpu_pci_mmio_enabled,
>> +	.slot_reset	= amdgpu_pci_slot_reset,
>> +	.resume		= amdgpu_pci_resume,
>> +};
>> +
>>   static struct pci_driver amdgpu_kms_pci_driver = {
>>   	.name = DRIVER_NAME,
>>   	.id_table = pciidlist,
>> @@ -1541,6 +1547,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>>   	.remove = amdgpu_pci_remove,
>>   	.shutdown = amdgpu_pci_shutdown,
>>   	.driver.pm = &amdgpu_pm_ops,
>> +	.err_handler = &amdgpu_pci_err_handler,
>>   };
>>   
>>   static int __init amdgpu_init(void)
>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state
  2020-08-29  2:03   ` Luben Tuikov
@ 2020-08-31 14:58     ` Andrey Grodzovsky
  0 siblings, 0 replies; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 14:58 UTC (permalink / raw)
  To: Luben Tuikov, amd-gfx; +Cc: alexander.deucher, nirmodas


On 8/28/20 10:03 PM, Luben Tuikov wrote:
> On 2020-08-26 10:46, Andrey Grodzovsky wrote:
>> At this point the ASIC is already post reset by the HW/PSP
>> so the HW not in proper state to be configured for suspension,
>> some bloks might be even gated and so best is to avoid touching it.
> "blocks"
>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 37 ++++++++++++++++++++++++++----
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
>>   drivers/gpu/drm/amd/amdgpu/atom.c          |  1 +
>>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 +++++++++------
>>   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  3 +++
>>   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
>>   8 files changed, 64 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 3399242..3489622 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -992,6 +992,7 @@ struct amdgpu_device {
>>   	atomic_t			throttling_logging_enabled;
>>   	struct ratelimit_state		throttling_logging_rs;
>>   	uint32_t			ras_features;
>> +	bool                            in_dpc;
>>   };
> This is very cryptic and if we do that, in 10 years,
> the code will be full of cryptic variables and comments.
>
> Better to add a comment
>
> 	bool				in_dpc; /* PCI Downstream Port Containment */


Per Nirmoy's comment I already modified the naming to a descriptive one in V2 
patchset

>
>>   
>>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 84f8d14..e1bbefd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>>   {
>>   	uint32_t ret;
>>   
>> +	if (adev->in_dpc)
>> +		return 0;
>> +
>>   	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>>   		return amdgpu_kiq_rreg(adev, reg);
>>   
>> @@ -350,8 +353,10 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>>    *
>>    * Returns the 8 bit value from the offset specified.
>>    */
>> -uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>> -{
>> +uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
>> +
>> +	if (adev->in_dpc)
>> +		return 0;
>>   
>>   	if (offset < adev->rmmio_size)
>>   		return (readb(adev->rmmio + offset));
>> @@ -373,8 +378,10 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>>    *
>>    * Writes the value specified to the offset specified.
>>    */
>> -void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>> -{
>> +void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
>> +
>> +	if (adev->in_dpc)
>> +		return;
>>   
>>   	if (offset < adev->rmmio_size)
>>   		writeb(value, adev->rmmio + offset);
>> @@ -384,6 +391,8 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>>   
>>   void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
>>   {
>> +	if (adev->in_dpc)
>> +		return;
>>   
>>   	trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
>>   
>> @@ -412,6 +421,8 @@ void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg,
>>   void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>>   		    uint32_t acc_flags)
>>   {
>> +	if (adev->in_dpc)
>> +		return;
>>   
>>   	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>>   		return amdgpu_kiq_wreg(adev, reg, v);
>> @@ -427,6 +438,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>>   void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
>>   		    uint32_t acc_flags)
>>   {
>> +	if (adev->in_dpc)
>> +		return;
>> +
>>   	if (amdgpu_sriov_fullaccess(adev) &&
>>   		adev->gfx.rlc.funcs &&
>>   		adev->gfx.rlc.funcs->is_rlcg_access_range) {
>> @@ -448,6 +462,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
>>    */
>>   u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>>   {
>> +	if (adev->in_dpc)
>> +		return 0;
>> +
>>   	if ((reg * 4) < adev->rio_mem_size)
>>   		return ioread32(adev->rio_mem + (reg * 4));
>>   	else {
>> @@ -467,6 +484,8 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>>    */
>>   void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>>   {
>> +	if (adev->in_dpc)
>> +		return;
>>   
>>   	if ((reg * 4) < adev->rio_mem_size)
>>   		iowrite32(v, adev->rio_mem + (reg * 4));
>> @@ -487,6 +506,8 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>>    */
>>   u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>>   {
>> +	if (adev->in_dpc)
>> +		return 0;
>>   
>>   	if (index < adev->doorbell.num_doorbells) {
>>   		return readl(adev->doorbell.ptr + index);
>> @@ -508,6 +529,8 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>>    */
>>   void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>>   {
>> +	if (adev->in_dpc)
>> +		return;
>>   
>>   	if (index < adev->doorbell.num_doorbells) {
>>   		writel(v, adev->doorbell.ptr + index);
>> @@ -527,6 +550,8 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>>    */
>>   u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>>   {
>> +	if (adev->in_dpc)
>> +		return 0;
>>   
>>   	if (index < adev->doorbell.num_doorbells) {
>>   		return atomic64_read((atomic64_t *)(adev->doorbell.ptr + index));
>> @@ -548,6 +573,8 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>>    */
>>   void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
>>   {
>> +	if (adev->in_dpc)
>> +			return;
>>   
>>   	if (index < adev->doorbell.num_doorbells) {
>>   		atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
>> @@ -4790,7 +4817,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>   
>>   	pci_restore_state(pdev);
>>   
>> +	adev->in_dpc = true;
>>   	r = amdgpu_device_ip_suspend(adev);
>> +	adev->in_dpc = false;
>>   	if (r)
>>   		goto out;
>>   
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> index d698142..50c42c9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> @@ -693,6 +693,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
>>   	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>>   	struct amdgpu_ring *ring = &kiq->ring;
>>   
>> +	if (adev->in_dpc)
>> +		return 0;
>> +
>>   	BUG_ON(!ring->funcs->emit_rreg);
>>   
>>   	spin_lock_irqsave(&kiq->ring_lock, flags);
>> @@ -757,6 +760,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>>   
>>   	BUG_ON(!ring->funcs->emit_wreg);
>>   
>> +	if (adev->in_dpc)
>> +		return;
>> +
> Don't know how I feel about such "architecture",
> where we introduce a flag-type variable and check it
> in a slew of places. If we did this a few times over,
> we'd have a few flags which we check almost everywhere,
> and it would make the code difficult to analyze.
> For instance, which flag to check first? Or, how
> are flags related to each other.
>
> There has to be a better way, where we check
> that flag only once before proceeding. A more
> unified approach.

I fail to see any other approach besides this one currently, there are a few 
different functions for
r/w register access and unless there would be a single register access generic 
function where I could
place this check (which I don't think is a good idea) I don't see how to do it 
in once place. On top of
that, this variable is being checked in wait for HW response use cases such as 
psp_wait_for where you
expect specific value to be returned from the driver/psp interfacing registers 
which you should simply
avoid expecting in this case. All in all I don't see how u can make this a 
generic check in once place that solves
all the use cases.

Andrey


>
>>   	spin_lock_irqsave(&kiq->ring_lock, flags);
>>   	amdgpu_ring_alloc(ring, 32);
>>   	amdgpu_ring_emit_wreg(ring, reg, v);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>> index d6c38e2..555ef59 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>> @@ -219,6 +219,9 @@ int psp_wait_for(struct psp_context *psp, uint32_t reg_index,
>>   	int i;
>>   	struct amdgpu_device *adev = psp->adev;
>>   
>> +	if (psp->adev->in_dpc)
>> +		return 0;
>> +
>>   	for (i = 0; i < adev->usec_timeout; i++) {
>>   		val = RREG32(reg_index);
>>   		if (check_changed) {
>> @@ -245,6 +248,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>   	bool ras_intr = false;
>>   	bool skip_unsupport = false;
>>   
>> +	if (psp->adev->in_dpc)
>> +		return 0;
>> +
>>   	mutex_lock(&psp->mutex);
>>   
>>   	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c
>> index 4cfc786..613dac1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/atom.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/atom.c
>> @@ -750,6 +750,7 @@ static void atom_op_jump(atom_exec_context *ctx, int *ptr, int arg)
>>   					DRM_ERROR("atombios stuck in loop for more than %dsecs aborting\n",
>>   						  ATOM_CMD_TIMEOUT_SEC);
>>   					ctx->abort = true;
>> +					dump_stack();
>>   				}
>>   			} else {
>>   				/* jiffies wrap around we will just wait a little longer */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> index 2db195e..86f268e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> @@ -6980,15 +6980,19 @@ static int gfx_v10_0_hw_fini(void *handle)
>>   
>>   	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
>>   	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
>> +
>> +	if (!adev->in_dpc) {
>>   #ifndef BRING_UP_DEBUG
>> -	if (amdgpu_async_gfx_ring) {
>> -		r = gfx_v10_0_kiq_disable_kgq(adev);
>> -		if (r)
>> -			DRM_ERROR("KGQ disable failed\n");
>> -	}
>> +		if (amdgpu_async_gfx_ring) {
>> +			r = gfx_v10_0_kiq_disable_kgq(adev);
>> +			if (r)
>> +				DRM_ERROR("KGQ disable failed\n");
>> +		}
>>   #endif
>> -	if (amdgpu_gfx_disable_kcq(adev))
>> -		DRM_ERROR("KCQ disable failed\n");
>> +		if (amdgpu_gfx_disable_kcq(adev))
>> +			DRM_ERROR("KCQ disable failed\n");
>> +	}
>> +
>>   	if (amdgpu_sriov_vf(adev)) {
>>   		gfx_v10_0_cp_gfx_enable(adev, false);
>>   		/* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
>> index 8462b30..306461d 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
>> @@ -224,9 +224,12 @@ int smu_dpm_set_power_gate(struct smu_context *smu, uint32_t block_type,
>>   {
>>   	int ret = 0;
>>   
>> +
>>   	if (!smu->pm_enabled || !smu->adev->pm.dpm_enabled)
>>   		return -EOPNOTSUPP;
>>   
>> +
>> +
> Unnecessary white space changes.
>
> Regards,
> Luben
>
>>   	switch (block_type) {
>>   	/*
>>   	 * Some legacy code of amdgpu_vcn.c and vcn_v2*.c still uses
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> index a58ea08..02cf55e 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> @@ -112,6 +112,9 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>>   	struct amdgpu_device *adev = smu->adev;
>>   	int ret = 0, index = 0;
>>   
>> +	if (smu->adev->in_dpc)
>> +		return 0;
>> +
>>   	index = smu_cmn_to_asic_specific_index(smu,
>>   					       CMN2ASIC_MAPPING_MSG,
>>   					       msg);
>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery
  2020-08-29  2:16   ` Luben Tuikov
@ 2020-08-31 15:11     ` Andrey Grodzovsky
  0 siblings, 0 replies; 29+ messages in thread
From: Andrey Grodzovsky @ 2020-08-31 15:11 UTC (permalink / raw)
  To: Luben Tuikov, amd-gfx; +Cc: alexander.deucher, nirmodas


On 8/28/20 10:16 PM, Luben Tuikov wrote:
> On 2020-08-26 10:46, Andrey Grodzovsky wrote:
>> DPC recovery involves ASIC reset just as normal GPU recovery so block
>> SW GPU shcedulers and wait on all concurent GPU resets.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
>>   1 file changed, 53 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index e1bbefd..0a2130b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4747,6 +4747,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>>   	return 0;
>>   }
>>   
>> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +		struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +		if (!ring || !ring->sched.thread)
>> +			continue;
>> +
>> +		cancel_delayed_work_sync(&ring->sched.work_tdr);
>> +	}
>> +}
>> +
>>   /**
>>    * amdgpu_pci_error_detected - Called when a PCI error is detected.
>>    * @pdev: PCI device struct
>> @@ -4760,16 +4774,38 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>>   {
>>   	struct drm_device *dev = pci_get_drvdata(pdev);
>>   	struct amdgpu_device *adev = drm_to_adev(dev);
>> +	int i;
>>   
>>   	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>>   
>>   	switch (state) {
>>   	case pci_channel_io_normal:
>>   		return PCI_ERS_RESULT_CAN_RECOVER;
>> -	case pci_channel_io_frozen: {
>> -		/* Fatal error, prepare for slot reset */
>> +	case pci_channel_io_frozen: { /* Fatal error, prepare for slot reset */
>> +
>> +		/*
>> +		 * Cancel and wait for all TDRs in progress if failing to
>> +		 * set  adev->in_gpu_reset in amdgpu_device_lock_adev
>> +		 *
>> +		 * Locking adev->reset_sem will perevent any external access
>> +		 * to GPU during PCI error recovery
>> +		 */
>> +		while (!amdgpu_device_lock_adev(adev, NULL))
>> +			amdgpu_cancel_all_tdr(adev);
> Is it possible for this to hang forever? It eventually
> resolves to __cancel_work_timer() in kernel/workqueue.c
> which provides no timeout.
>
> If you cannot guarantee the TDRs to return back, then
> you'd need a timeout here.
>
> Regards,
> Luben


Unless there is a bug in the  reset code (such as deadlocks which then should be 
fixed)
it should progress to completion and so I don't see an issue here.

Andrey


>
>> +
>> +		/*
>> +		 * Block any work scheduling as we do for regualr GPU reset
>> +		 * for the duration of the recoveryq
>> +		 */
>> +		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +			struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +			if (!ring || !ring->sched.thread)
>> +				continue;
>> +
>> +			drm_sched_stop(&ring->sched, NULL);
>> +		}
>>   
>> -		amdgpu_device_lock_adev(adev);
>>   		return PCI_ERS_RESULT_NEED_RESET;
>>   	}
>>   	case pci_channel_io_perm_failure:
>> @@ -4902,8 +4938,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>>   {
>>   	struct drm_device *dev = pci_get_drvdata(pdev);
>>   	struct amdgpu_device *adev = drm_to_adev(dev);
>> +	int i;
>>   
>> -	amdgpu_device_unlock_adev(adev);
>>   
>>   	DRM_INFO("PCI error: resume callback!!\n");
>> +
>> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +		struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +		if (!ring || !ring->sched.thread)
>> +			continue;
>> +
>> +
>> +		drm_sched_resubmit_jobs(&ring->sched);
>> +		drm_sched_start(&ring->sched, true);
>> +	}
>> +
>> +	amdgpu_device_unlock_adev(adev);
>>   }
>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-08-31 15:11 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-26 14:46 [PATCH 0/7] Implement PCI Error Recovery on Navi12 Andrey Grodzovsky
2020-08-26 14:46 ` [PATCH 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
2020-08-26 15:08   ` Alex Deucher
2020-08-27  1:23   ` Li, Dennis
2020-08-27 13:38     ` Andrey Grodzovsky
2020-08-29  2:07   ` Luben Tuikov
2020-08-31 14:30     ` Andrey Grodzovsky
2020-08-26 14:46 ` [PATCH 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
2020-08-26 15:09   ` Alex Deucher
2020-08-26 16:53   ` Nirmoy
2020-08-29  2:03   ` Luben Tuikov
2020-08-31 14:58     ` Andrey Grodzovsky
2020-08-26 14:46 ` [PATCH 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
2020-08-29  2:16   ` Luben Tuikov
2020-08-31 15:11     ` Andrey Grodzovsky
2020-08-26 14:46 ` [PATCH 4/7] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
2020-08-26 15:10   ` Alex Deucher
2020-08-26 14:46 ` [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure Andrey Grodzovsky
2020-08-26 15:20   ` Alex Deucher
2020-08-26 15:31     ` Andrey Grodzovsky
2020-08-26 15:39       ` Alex Deucher
2020-08-27 14:54     ` Andrey Grodzovsky
2020-08-28  0:00       ` Grodzovsky, Andrey
2020-08-28  6:56         ` Christian König
2020-08-28 14:13       ` Alex Deucher
2020-08-28 14:21         ` Andrey Grodzovsky
2020-08-28 14:32           ` Alex Deucher
2020-08-26 14:46 ` [PATCH 6/7] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
2020-08-26 14:46 ` [PATCH 7/7] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.