* [PATCH 0/5] BAD GPU retirement policy by total bad pages
@ 2020-07-22  3:14 Guchun Chen
From: Guchun Chen @ 2020-07-22  3:14 UTC
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements
  Cc: Guchun Chen

This series enables GPU RMA (Return Merchandise Authorization),
which is triggered when the number of bad pages detected by RAS ECC
exceeds the threshold value.

When the number of saved bad pages written to the EEPROM reaches the
threshold, a RAS recovery is issued immediately, and that recovery
fails in order to tell the user that the GPU is BAD and needs to be
retired for further checking.

During boot-up, a similar BAD GPU check is also performed when the
EEPROM is initialized, and it aborts boot-up so that the user is
made aware of the problem.

The user can set bad_page_threshold=0 when probing the amdgpu driver
to disable this feature and bring up the GPU, and then reset the
EEPROM later.
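
The policy described above can be sketched as a small standalone C
helper. This is only an illustration under stated assumptions, not the
code from the patches: struct ras_policy, its fields, and gpu_is_bad()
are hypothetical names, and the ">=" comparison follows the "reach the
threshold" wording of this letter. Only the meaning of a threshold of 0
(feature disabled) is taken from the text; how other values are
interpreted is defined by the patches themselves.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the driver's RAS bookkeeping.
 * These are NOT the real amdgpu structures or field names.
 */
struct ras_policy {
	unsigned int bad_page_threshold; /* 0 disables the BAD GPU check */
	unsigned int saved_bad_pages;    /* bad pages recorded in EEPROM */
};

/*
 * Return true when the GPU should be reported as BAD, i.e. the
 * check is enabled and the saved bad page count has reached the
 * threshold. The same test would apply at boot-up and after a new
 * bad page is saved.
 */
static inline bool gpu_is_bad(const struct ras_policy *p)
{
	if (p->bad_page_threshold == 0)
		return false; /* feature disabled by the user */

	return p->saved_bad_pages >= p->bad_page_threshold;
}
```

With a threshold of 100, for example, this helper reports a GPU with
99 saved bad pages as usable and one with 100 as BAD, while a
threshold of 0 never reports BAD regardless of the count.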

Guchun Chen (5):
  drm/amdgpu: add bad page count threshold in module parameter
  drm/amdgpu: validate bad page threshold in ras
  drm/amdgpu: conduct bad gpu check during bootup/reset
  drm/amdgpu: restore ras flags when user resets eeprom
  drm/amdgpu: calculate actual size instead of hardcode size

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 21 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       | 11 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 70 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       | 19 +++-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 98 ++++++++++++++++++-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  8 +-
 7 files changed, 211 insertions(+), 17 deletions(-)

-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Thread overview: 17+ messages
2020-07-22  3:14 [PATCH 0/5] BAD GPU retirement policy by total bad pages Guchun Chen
2020-07-22  3:14 ` [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
2020-07-23  0:31   ` Li, Dennis
2020-07-23  3:47     ` Chen, Guchun
2020-07-23 13:10       ` Christian König
2020-07-23 13:39         ` Alex Deucher
2020-07-22  3:14 ` [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras Guchun Chen
2020-07-22  7:51   ` Yang, Stanley
2020-07-23  3:40     ` Chen, Guchun
2020-07-22  3:14 ` [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset Guchun Chen
2020-07-23  2:51   ` Zhou1, Tao
2020-07-23  3:38     ` Chen, Guchun
2020-07-23  4:03       ` Zhou1, Tao
2020-07-22  3:14 ` [PATCH 4/5] drm/amdgpu: restore ras flags when user resets eeprom Guchun Chen
2020-07-22  3:14 ` [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size Guchun Chen
2020-07-22 14:26   ` Andrey Grodzovsky
2020-07-22 14:29     ` Chen, Guchun
