* [PATCH 00/12] BAD GPU retirement policy by total bad pages
@ 2020-07-28  7:49 Guchun Chen
From: Guchun Chen @ 2020-07-28  7:49 UTC
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	andrey.grodzovsky, Tao.Zhou1, John.Clements, lijo.lazar,
	christian.koenig, stanley.yang
  Cc: Guchun Chen

This series makes the bad page retirement feature configurable: it can
be enabled or disabled, and the bad page reservation strategy depends on
the configured bad page threshold.

When the number of bad pages saved to the EEPROM reaches the threshold,
a RAS recovery is issued immediately. That recovery fails on purpose, to
tell the user that the GPU is BAD and must either be retired for further
checking or be given a valid, larger threshold at the next driver probe
so that the check is skipped.
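
The runtime policy described above can be sketched roughly as follows.
This is a standalone illustration, not the code in the patches: the
struct ras_sketch type, its fields, and both helper functions are
hypothetical names, and ">=" is only one reading of "reach the
threshold".

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver's RAS state (not the amdgpu structs). */
struct ras_sketch {
	unsigned int saved_bad_pages;    /* records already written to EEPROM */
	unsigned int bad_page_threshold; /* limit configured at driver probe */
	bool gpu_is_bad;                 /* set once the threshold is reached */
};

/*
 * Account for newly saved bad pages.
 * Returns true when a RAS recovery should be scheduled immediately.
 */
static bool ras_sketch_save_bad_pages(struct ras_sketch *ras,
				      unsigned int new_pages)
{
	ras->saved_bad_pages += new_pages;
	if (ras->saved_bad_pages >= ras->bad_page_threshold) {
		ras->gpu_is_bad = true;
		return true;
	}
	return false;
}

/* Recovery deliberately fails for a GPU that has been tagged as bad. */
static int ras_sketch_recover(const struct ras_sketch *ras)
{
	if (ras->gpu_is_bad) {
		fprintf(stderr,
			"GPU is BAD: retire it or probe with a larger threshold\n");
		return -1;
	}
	return 0;
}
```

With a threshold of 3, saving two pages leaves recovery working; saving
one more tags the GPU as bad and every later recovery attempt fails.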

A similar bad page threshold check runs during boot-up, when the EEPROM
is initialized, and it may interrupt boot-up to make the user aware of
the problem.

When the user sets bad_page_threshold=0 at driver probe, the bad page
retirement feature is completely disabled: the driver neither processes
bad page records nor writes them to the EEPROM.
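
As a rough illustration of the probe-time behaviour (the boot-up check
and the bad_page_threshold=0 case), here is a minimal standalone sketch.
All type and function names are hypothetical. It assumes the boot-up
check always aborts once the threshold is reached, whereas the text
above only says it may.

```c
#include <stdbool.h>

/* Hypothetical probe outcome; the driver's real return codes differ. */
enum probe_result { PROBE_OK, PROBE_BAD_GPU };

/* Hypothetical per-device state derived from the module parameter. */
struct retire_sketch {
	bool retirement_enabled; /* false when bad_page_threshold == 0 */
	unsigned int threshold;  /* value passed at driver probe */
};

/*
 * Apply the threshold at probe time.
 * pages_in_eeprom is the bad page count found when the EEPROM is
 * initialized.
 */
static enum probe_result retire_sketch_probe(struct retire_sketch *st,
					     unsigned int bad_page_threshold,
					     unsigned int pages_in_eeprom)
{
	st->threshold = bad_page_threshold;
	st->retirement_enabled = (bad_page_threshold != 0);

	/* Feature disabled: records are neither processed nor written. */
	if (!st->retirement_enabled)
		return PROBE_OK;

	/* Boot-up check: stop init so the user notices the bad GPU. */
	if (pages_in_eeprom >= bad_page_threshold)
		return PROBE_BAD_GPU;

	return PROBE_OK;
}
```

In this sketch a threshold of 0 always probes successfully, regardless
of how many records the EEPROM holds, because the check is never
reached.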

Guchun Chen (12):
  drm/amdgpu: add bad page count threshold in module parameter
  drm/amdgpu: validate bad page threshold in ras
  drm/amdgpu: add bad gpu tag definition
  drm/amdgpu: break driver init process when it's bad GPU
  drm/amdgpu: skip bad page reservation once issuing from eeprom write
  drm/amdgpu: schedule ras recovery when reaching bad page threshold
  drm/amdgpu: break GPU recovery once it's in bad state
  drm/amdgpu: restore ras flags when user resets eeprom
  drm/amdgpu: define one macro for RAS's sysfs/debugfs name
  drm/amdgpu: decouple sysfs creating of bad page node
  drm/amdgpu: disable page reservation when amdgpu_bad_page_threshold =
    0
  drm/amdgpu: reset eeprom once specifying one bigger threshold

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  32 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  11 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 186 ++++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |  19 +-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 102 +++++++++-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c       |   5 +-
 8 files changed, 312 insertions(+), 53 deletions(-)

-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

* [PATCH 00/12] BAD GPU retirement policy by total bad pages
@ 2020-07-29  2:56 Guchun Chen
From: Guchun Chen @ 2020-07-29  2:56 UTC
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	andrey.grodzovsky, Tao.Zhou1, John.Clements, lijo.lazar,
	christian.koenig
  Cc: Guchun Chen

This series makes the bad page retirement feature configurable: it can
be enabled or disabled, and the bad page reservation strategy depends on
the configured bad page threshold.

When the number of bad pages saved to the EEPROM reaches the threshold,
a RAS recovery is issued immediately. That recovery fails on purpose, to
tell the user that the GPU is BAD and must either be retired for further
checking or be given a valid, larger threshold at the next driver probe
so that the check is skipped.

A similar bad page threshold check runs during boot-up, when the EEPROM
is initialized, and it may interrupt boot-up to make the user aware of
the problem.

When the user sets bad_page_threshold=0 at driver probe, the bad page
retirement feature is completely disabled: the driver neither processes
bad page records nor writes them to the EEPROM.

Guchun Chen (12):
  drm/amdgpu: add bad page count threshold in module parameter
  drm/amdgpu: validate bad page threshold in ras
  drm/amdgpu: add bad gpu tag definition
  drm/amdgpu: break driver init process when it's bad GPU
  drm/amdgpu: skip bad page reservation once issuing from eeprom write
  drm/amdgpu: schedule ras recovery when reaching bad page threshold
  drm/amdgpu: break GPU recovery once it's in bad state
  drm/amdgpu: restore ras flags when user resets eeprom
  drm/amdgpu: add one definition for RAS's sysfs/debugfs name
  drm/amdgpu: decouple sysfs creating of bad page node
  drm/amdgpu: disable page reservation when amdgpu_bad_page_threshold =
    0
  drm/amdgpu: update eeprom once specifying one bigger threshold

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  32 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  11 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 186 ++++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |  19 +-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 121 +++++++++++-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c       |   5 +-
 8 files changed, 331 insertions(+), 53 deletions(-)

-- 
2.17.1


Thread overview: 18+ messages
2020-07-28  7:49 [PATCH 00/12] BAD GPU retirement policy by total bad pages Guchun Chen
2020-07-28  7:49 ` [PATCH 01/12] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
2020-07-28  7:49 ` [PATCH 02/12] drm/amdgpu: validate bad page threshold in ras Guchun Chen
2020-07-28  7:49 ` [PATCH 03/12] drm/amdgpu: add bad gpu tag definition Guchun Chen
2020-07-28  7:49 ` [PATCH 04/12] drm/amdgpu: break driver init process when it's bad GPU Guchun Chen
2020-07-28  9:43   ` Li, Dennis
2020-07-28 14:11     ` Chen, Guchun
2020-07-28  7:49 ` [PATCH 05/12] drm/amdgpu: skip bad page reservation once issuing from eeprom write Guchun Chen
2020-07-28  7:49 ` [PATCH 06/12] drm/amdgpu: schedule ras recovery when reaching bad page threshold Guchun Chen
2020-07-28  7:49 ` [PATCH 07/12] drm/amdgpu: break GPU recovery once it's in bad state Guchun Chen
2020-07-28  7:49 ` [PATCH 08/12] drm/amdgpu: restore ras flags when user resets eeprom Guchun Chen
2020-07-28  7:49 ` [PATCH 09/12] drm/amdgpu: define one macro for RAS's sysfs/debugfs name Guchun Chen
2020-07-28  7:55   ` Christian König
2020-07-28  8:00     ` Chen, Guchun
2020-07-28  7:49 ` [PATCH 10/12] drm/amdgpu: decouple sysfs creating of bad page node Guchun Chen
2020-07-28  7:49 ` [PATCH 11/12] drm/amdgpu: disable page reservation when amdgpu_bad_page_threshold = 0 Guchun Chen
2020-07-28  7:49 ` [PATCH 12/12] drm/amdgpu: reset eeprom once specifying one bigger threshold Guchun Chen
2020-07-29  2:56 [PATCH 00/12] BAD GPU retirement policy by total bad pages Guchun Chen
2020-07-29  2:56 ` [PATCH 04/12] drm/amdgpu: break driver init process when it's bad GPU Guchun Chen
