All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: <amd-gfx@lists.freedesktop.org>
Cc: alexander.deucher@amd.com,
	Andrey Grodzovsky <andrey.grodzovsky@amd.com>,
	nirmodas@amd.com, christian.koenig@amd.com, Dennis.Li@amd.com
Subject: [PATCH v2 0/7] Implement PCI Error Recovery on Navi12
Date: Fri, 28 Aug 2020 12:05:36 -0400	[thread overview]
Message-ID: <1598630743-21155-1-git-send-email-andrey.grodzovsky@amd.com> (raw)

Many PCI bus controllers are able to detect a variety of hardware PCI errors on the bus, 
such as parity errors on the data and address buses,  A typical action taken is to disconnect 
the affected device, halting all I/O to it. Typically, a reconnection mechanism is also offered, 
so that the affected PCI device(s) are reset and put back into working condition. 
In our case the reconnection mechanism is facilitated by kernel Downstream Port Containment (DPC) 
driver which will intercept the PCIe error, remove (isolate) the faulting device and reset the PCI 
link of the device after which it will call into PCIe recovery code of the PCI core. 
This code will call hooks which are implemented in this patchset where the error is 
first reported at which point we block the GPU scheduler, next DPC resets the 
PCI link which generates HW interrupt which is intercepted by SMU/PSP who 
start executing mode1 reset of the ASIC, next step is slot reset hook is called 
at which point we wait for ASIC reset to complete, restore PCI config space and run 
HW suspend/resume sequence to resinit the ASIC. 
Last hook called is resume normal operation at which point we will restart the GPU scheduler.

Andrey Grodzovsky (7):
  drm/amdgpu: Implement DPC recovery
  drm/amdgpu: Avoid accessing HW when suspending SW state
  drm/amdgpu: Block all job scheduling activity during DPC recovery
  drm/amdgpu: Fix SMU error failure
  drm/amdgpu: Fix consecutive DPC recovery failures.
  drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
  drm/amdgpu: Disable DPC for XGMI for now.

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  16 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 296 ++++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |   6 +
 drivers/gpu/drm/amd/amdgpu/atom.c          |   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     |  18 +-
 drivers/gpu/drm/amd/amdgpu/nv.c            |   4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c         |   4 +-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |   3 +
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |   3 +
 11 files changed, 348 insertions(+), 22 deletions(-)

-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

             reply	other threads:[~2020-08-28 16:06 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-28 16:05 Andrey Grodzovsky [this message]
2020-08-28 16:05 ` [PATCH v2 1/7] drm/amdgpu: Implement DPC recovery Andrey Grodzovsky
2020-08-28 19:23   ` Alex Deucher
2020-08-28 19:24     ` Alex Deucher
2020-08-31 14:26     ` Andrey Grodzovsky
2020-08-31 14:30       ` Alex Deucher
2020-08-28 19:25   ` Alex Deucher
2020-08-31 12:44   ` Christian König
2020-08-28 16:05 ` [PATCH v2 2/7] drm/amdgpu: Avoid accessing HW when suspending SW state Andrey Grodzovsky
2020-08-28 19:26   ` Alex Deucher
2020-08-31 20:19     ` Luben Tuikov
2020-08-28 16:05 ` [PATCH v2 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery Andrey Grodzovsky
2020-08-28 19:28   ` Alex Deucher
2020-08-28 16:05 ` [PATCH v2 4/7] drm/amdgpu: Fix SMU error failure Andrey Grodzovsky
2020-08-28 19:29   ` Alex Deucher
2020-08-28 20:28     ` Andrey Grodzovsky
2020-08-28 16:05 ` [PATCH v2 5/7] drm/amdgpu: Fix consecutive DPC recovery failures Andrey Grodzovsky
2020-08-28 19:19   ` Alex Deucher
2020-08-28 16:05 ` [PATCH v2 6/7] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code Andrey Grodzovsky
2020-08-28 19:30   ` Alex Deucher
2020-08-28 16:05 ` [PATCH v2 7/7] drm/amdgpu: Disable DPC for XGMI for now Andrey Grodzovsky
2020-08-28 19:30   ` Alex Deucher
2020-08-28 19:31     ` Alex Deucher

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1598630743-21155-1-git-send-email-andrey.grodzovsky@amd.com \
    --to=andrey.grodzovsky@amd.com \
    --cc=Dennis.Li@amd.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=nirmodas@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.