* [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
@ 2022-05-17 19:20 Andrey Grodzovsky
From: Andrey Grodzovsky @ 2022-05-17 19:20 UTC
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

Problem:
During a hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is unable to access
registers on the resetting ASIC.

Fix: Rework GPU reset to actively stop any pending reset
works while another reset is in progress.

v2: Switch from the generic list used in v1 [1] to explicitly
stopping each reset request from each reset source, per
request submitter.

[1] - https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
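
For illustration only, the idea described above can be modelled in
plain userspace C. This is a rough sketch, not the driver code: the
names reset_domain, reset_work, reset_source and the three helper
functions are hypothetical and do not come from the patches. It
assumes one pending-work slot per reset source, and that the reset
which actually runs explicitly cancels whatever the other sources
have queued.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reset sources, loosely named after the submitters
 * mentioned in this series (job timeout, RAS, debugfs, KFD). */
enum reset_source { SRC_JOB_TIMEOUT, SRC_RAS, SRC_DEBUGFS, SRC_KFD, SRC_COUNT };

/* Userspace stand-in for one delayed work item per source. */
struct reset_work {
	bool pending;
};

struct reset_domain {
	struct reset_work work[SRC_COUNT];
	int resets_done;	/* number of resets that actually ran */
};

/* Queue a reset request for src.
 * Returns false if that source already has a request pending. */
static bool reset_domain_schedule(struct reset_domain *d, enum reset_source src)
{
	assert(src >= 0 && src < SRC_COUNT);
	if (d->work[src].pending)
		return false;
	d->work[src].pending = true;
	return true;
}

/* Explicitly stop the pending request of every source.
 * Returns how many requests were cancelled. */
static int reset_domain_cancel_pending(struct reset_domain *d)
{
	int cancelled = 0;

	for (int i = 0; i < SRC_COUNT; i++) {
		if (d->work[i].pending) {
			d->work[i].pending = false;
			cancelled++;
		}
	}
	return cancelled;
}

/* Run the reset requested by src and drop everything else queued.
 * Returns the number of redundant requests dropped, or -1 if src
 * had nothing pending (for example, it was already cancelled). */
static int reset_domain_run(struct reset_domain *d, enum reset_source src)
{
	assert(src >= 0 && src < SRC_COUNT);
	if (!d->work[src].pending)
		return -1;
	d->work[src].pending = false;
	d->resets_done++;
	return reset_domain_cancel_pending(d);
}
```

In this model, if a job timeout and KFD both queue a request and the
job-timeout reset runs first, the KFD request is cancelled and only
one reset is performed. The real driver would additionally need
locking and kernel work-queue primitives, which this sketch leaves
out.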

Andrey Grodzovsky (7):
  drm/amdgpu: Cache result of last reset at reset domain level.
  drm/amdgpu: Switch to delayed work from work_struct.
  drm/admgpu: Serialize RAS recovery work directly into reset domain
    queue.
  drm/amdgpu: Add delayed work for GPU reset from debugfs
  drm/amdgpu: Add delayed work for GPU reset from kfd.
  drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
    amdgpu_device_gpu_recover
  drm/amdgpu: Stop any pending reset if another in progress.

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
 14 files changed, 87 insertions(+), 54 deletions(-)

-- 
2.25.1



Thread overview: 15+ messages
2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
2022-05-17 19:20 ` [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
2022-05-18  6:02   ` Christian König
2022-05-17 19:20 ` [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct Andrey Grodzovsky
2022-05-18  6:03   ` Christian König
2022-05-17 19:20 ` [PATCH v2 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue Andrey Grodzovsky
2022-05-17 19:20 ` [PATCH v2 4/7] drm/amdgpu: Add delayed work for GPU reset from debugfs Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 5/7] drm/amdgpu: Add delayed work for GPU reset from kfd Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
2022-05-17 20:56   ` Felix Kuehling
2022-05-18  6:07 ` [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Christian König
2022-05-18 14:24   ` Andrey Grodzovsky
2022-05-19  7:58     ` Christian König
2022-05-19 13:41       ` Andrey Grodzovsky
