[PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold

* [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold
@ 2021-10-19 17:50 Kent Russell
  2021-10-19 17:50 ` [PATCH 2/4] drm/amdgpu: Clarify error when hitting bad page threshold Kent Russell
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Kent Russell @ 2021-10-19 17:50 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi

Currently dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. Warn when the number of bad pages is at
90% of the threshold or greater, for easier checks and planning, instead
of waiting until the GPU is full of bad pages.

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 98732518543e..8270aad23a06 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,16 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 		if (res)
 			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
 				  res);
+
+		/* threshold = -1 is automatic, threshold = 0 means that page
+		 * retirement is disabled.
+		 */
+		if (amdgpu_bad_page_threshold > 0 &&
+		    control->ras_num_recs >= 0 &&
+		    control->ras_num_recs >= (amdgpu_bad_page_threshold * 9 / 10))
+			DRM_WARN("RAS records:%u approaching threshold:%d",
+					control->ras_num_recs,
+					amdgpu_bad_page_threshold);
 	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
 		   amdgpu_bad_page_threshold != 0) {
 		res = __verify_ras_table_checksum(control);
-- 
2.25.1



* [PATCH 2/4] drm/amdgpu: Clarify error when hitting bad page threshold
  2021-10-19 17:50 [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold Kent Russell
@ 2021-10-19 17:50 ` Kent Russell
  2021-10-19 18:47   ` Luben Tuikov
  2021-10-19 17:50 ` [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring " Kent Russell
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Kent Russell @ 2021-10-19 17:50 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi

Change the error message when the bad_page_threshold is reached,
explicitly stating that the GPU will not be initialized.

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 8270aad23a06..7bb506a0ebd6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1111,7 +1111,7 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 			*exceed_err_limit = true;
 			dev_err(adev->dev,
 				"RAS records:%d exceed threshold:%d, "
-				"maybe retire this GPU?",
+				"GPU will not be initialized. Replace this GPU or increase the threshold",
 				control->ras_num_recs, ras->bad_page_cnt_threshold);
 		}
 	} else {
-- 
2.25.1



* [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold
  2021-10-19 17:50 [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold Kent Russell
  2021-10-19 17:50 ` [PATCH 2/4] drm/amdgpu: Clarify error when hitting bad page threshold Kent Russell
@ 2021-10-19 17:50 ` Kent Russell
  2021-10-19 18:13   ` Felix Kuehling
  2021-10-20 10:55   ` Christian König
  2021-10-19 17:50 ` [PATCH 4/4] drm/amdgpu: Implement ignore_bad_page_threshold parameter Kent Russell
  2021-10-19 18:08 ` [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold Felix Kuehling
  3 siblings, 2 replies; 13+ messages in thread
From: Kent Russell @ 2021-10-19 17:50 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi

When a GPU hits the bad_page_threshold, it will not be initialized by
the amdgpu driver. This means that the table cannot be cleared, nor can
information gathering be performed (getting serial number, BDF, etc).
Add an override called ignore_bad_page_threshold that can be set to true
to still initialize the GPU, even when the bad page threshold has been
reached.

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d58e37fd01f4..b85b67a88a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
 extern int amdgpu_bad_page_threshold;
+extern bool amdgpu_ignore_bad_page_threshold;
 extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 96bd63aeeddd..3e9a7b072888 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -189,6 +189,7 @@ struct amdgpu_mgpu_info mgpu_info = {
 int amdgpu_ras_enable = -1;
 uint amdgpu_ras_mask = 0xffffffff;
 int amdgpu_bad_page_threshold = -1;
+bool amdgpu_ignore_bad_page_threshold;
 struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
 	.timeout_fatal_disable = false,
 	.period = 0x0, /* default to 0x0 (timeout disable) */
@@ -880,6 +881,18 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
 MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
 module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
 
+/**
+ * DOC: ignore_bad_page_threshold (bool) Bad page threshold specifies
+ * the threshold value of faulty pages detected by RAS ECC. Once the
+ * threshold is hit, the GPU will not be initialized. Use this parameter
+ * to ignore the bad page threshold so that information gathering can
+ * still be performed. This also allows for booting the GPU to clear
+ * the RAS EEPROM table.
+ */
+
+MODULE_PARM_DESC(ignore_bad_page_threshold, "Ignore bad page threshold (false = respect bad page threshold (default value))");
+module_param_named(ignore_bad_page_threshold, amdgpu_ignore_bad_page_threshold, bool, 0644);
+
 MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
 module_param_named(num_kcq, amdgpu_num_kcq, int, 0444);
 
-- 
2.25.1



* [PATCH 4/4] drm/amdgpu: Implement ignore_bad_page_threshold parameter
  2021-10-19 17:50 [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold Kent Russell
  2021-10-19 17:50 ` [PATCH 2/4] drm/amdgpu: Clarify error when hitting bad page threshold Kent Russell
  2021-10-19 17:50 ` [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring " Kent Russell
@ 2021-10-19 17:50 ` Kent Russell
  2021-10-19 18:08 ` [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold Felix Kuehling
  3 siblings, 0 replies; 13+ messages in thread
From: Kent Russell @ 2021-10-19 17:50 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi

If the ignore_bad_page_threshold kernel parameter is set to true,
continue to post the GPU. Print a warning to dmesg that this action has
been done, and that page retirement will not work for said GPU.

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 7bb506a0ebd6..63a0548a05bf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1108,11 +1108,16 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 			res = amdgpu_ras_eeprom_correct_header_tag(control,
 								   RAS_TABLE_HDR_VAL);
 		} else {
-			*exceed_err_limit = true;
-			dev_err(adev->dev,
-				"RAS records:%d exceed threshold:%d, "
-				"GPU will not be initialized. Replace this GPU or increase the threshold",
+			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
 				control->ras_num_recs, ras->bad_page_cnt_threshold);
+			if (amdgpu_ignore_bad_page_threshold) {
+				dev_warn(adev->dev, "GPU will be initialized due to ignore_bad_page_threshold.");
+				dev_warn(adev->dev, "Page retirement will not work for this GPU in this state.");
+				res = 0;
+			} else {
+				*exceed_err_limit = true;
+				dev_err(adev->dev, "GPU will not be initialized. Replace this GPU or increase the threshold.");
+			}
 		}
 	} else {
 		DRM_INFO("Creating a new EEPROM table");
-- 
2.25.1



* Re: [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold
  2021-10-19 17:50 [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold Kent Russell
                   ` (2 preceding siblings ...)
  2021-10-19 17:50 ` [PATCH 4/4] drm/amdgpu: Implement ignore_bad_page_threshold parameter Kent Russell
@ 2021-10-19 18:08 ` Felix Kuehling
  2021-10-19 18:22   ` Russell, Kent
  3 siblings, 1 reply; 13+ messages in thread
From: Felix Kuehling @ 2021-10-19 18:08 UTC (permalink / raw)
  To: Kent Russell, amd-gfx; +Cc: Luben Tuikov, Mukul Joshi

On 2021-10-19 at 1:50 p.m., Kent Russell wrote:
> Currently dmesg doesn't warn when the number of bad pages approaches the
> threshold for page retirement. WARN when the number of bad pages
> is at 90% or greater for easier checks and planning, instead of waiting
> until the GPU is full of bad pages
>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> Signed-off-by: Kent Russell <kent.russell@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 98732518543e..8270aad23a06 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1077,6 +1077,16 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>  		if (res)
>  			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>  				  res);
> +
> +		/* threshold = -1 is automatic, threshold = 0 means that page
> +		 * retirement is disabled.
> +		 */
> +		if (amdgpu_bad_page_threshold > 0 &&
> +		    control->ras_num_recs >= 0 &&
> +		    control->ras_num_recs >= (amdgpu_bad_page_threshold * 9 / 10))
> +			DRM_WARN("RAS records:%u approaching threshold:%d",
> +					control->ras_num_recs,
> +					amdgpu_bad_page_threshold);

This won't work for the default setting amdgpu_bad_page_threshold=-1.
For this case, you'd have to take the threshold from
ras->bad_page_cnt_threshold.

Regards,
   Felix
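
To illustrate the point above, here is a minimal stand-alone sketch (not
driver code) of why the posted check never fires with the default
parameter, and how taking the threshold from the resolved value avoids
that. The function names and the idea of passing the resolved threshold
as a plain argument are assumptions made for illustration; in the driver
that role is played by ras->bad_page_cnt_threshold.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the posted check. module_param stands in for
 * amdgpu_bad_page_threshold: -1 = auto, 0 = retirement disabled,
 * >0 = explicit threshold.
 */
bool warn_with_param(int module_param, uint32_t num_recs)
{
	/* With the default of -1 the first condition is false: no warning. */
	return module_param > 0 &&
	       num_recs >= (uint32_t)((int64_t)module_param * 9 / 10);
}

/*
 * Hypothetical variant that compares against the threshold the driver
 * actually settled on (assumed to be passed in as resolved_threshold).
 */
bool warn_with_resolved(int module_param, uint32_t resolved_threshold,
			uint32_t num_recs)
{
	/* 0 means page retirement is disabled, so there is nothing to warn about. */
	if (module_param == 0 || resolved_threshold == 0)
		return false;
	return num_recs >= (uint64_t)resolved_threshold * 9 / 10;
}
```

With module_param = -1 and a resolved threshold of 100, 95 records pass
silently through the first function but trigger the second one.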


>  	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
>  		   amdgpu_bad_page_threshold != 0) {
>  		res = __verify_ras_table_checksum(control);


* Re: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold
  2021-10-19 17:50 ` [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring " Kent Russell
@ 2021-10-19 18:13   ` Felix Kuehling
  2021-10-19 18:23     ` Russell, Kent
  2021-10-20 10:55   ` Christian König
  1 sibling, 1 reply; 13+ messages in thread
From: Felix Kuehling @ 2021-10-19 18:13 UTC (permalink / raw)
  To: Kent Russell, amd-gfx; +Cc: Luben Tuikov, Mukul Joshi


On 2021-10-19 at 1:50 p.m., Kent Russell wrote:
> When a GPU hits the bad_page_threshold, it will not be initialized by
> the amdgpu driver. This means that the table cannot be cleared, nor can
> information gathering be performed (getting serial number, BDF, etc).
> Add an override called ignore_bad_page_threshold that can be set to true
> to still initialize the GPU, even when the bad page threshold has been
> reached.
Do you really need a new parameter for this? Wouldn't it be enough to
set bad_page_threshold to the VRAM size? You could use a new special
value (e.g. bad_page_threshold=-2) for that.

Regards,
  Felix
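
A rough sketch of how such a special value could be resolved, purely for
illustration. The -2 sentinel is only a suggestion in this thread, and
the names resolve_threshold, auto_value and max_records are hypothetical;
how the driver computes the automatic value is not shown here, so it is
passed in as an argument.

```c
#include <stdint.h>

#define THRESHOLD_AUTO   (-1) /* existing meaning per MODULE_PARM_DESC */
#define THRESHOLD_IGNORE (-2) /* hypothetical: only proposed in this thread */

/*
 * Map the module parameter to the threshold that is actually enforced.
 * max_records is an assumed stand-in for the largest record count that
 * could ever be stored, so a threshold equal to it is never exceeded.
 */
uint32_t resolve_threshold(int param, uint32_t auto_value,
			   uint32_t max_records)
{
	if (param == THRESHOLD_IGNORE)
		return max_records;
	if (param == THRESHOLD_AUTO)
		return auto_value;
	if (param <= 0)
		return 0; /* 0 = retirement disabled; other negatives treated alike in this sketch */
	return (uint32_t)param;
}
```

The appeal of this design is that the rest of the code keeps comparing
against a single threshold value, with no second parameter to consult.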


>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> Signed-off-by: Kent Russell <kent.russell@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 +++++++++++++
>  2 files changed, 14 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index d58e37fd01f4..b85b67a88a3d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
>  extern int amdgpu_ras_enable;
>  extern uint amdgpu_ras_mask;
>  extern int amdgpu_bad_page_threshold;
> +extern bool amdgpu_ignore_bad_page_threshold;
>  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
>  extern int amdgpu_async_gfx_ring;
>  extern int amdgpu_mcbp;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 96bd63aeeddd..3e9a7b072888 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -189,6 +189,7 @@ struct amdgpu_mgpu_info mgpu_info = {
>  int amdgpu_ras_enable = -1;
>  uint amdgpu_ras_mask = 0xffffffff;
>  int amdgpu_bad_page_threshold = -1;
> +bool amdgpu_ignore_bad_page_threshold;
>  struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
>  	.timeout_fatal_disable = false,
>  	.period = 0x0, /* default to 0x0 (timeout disable) */
> @@ -880,6 +881,18 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
>  MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
>  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
>  
> +/**
> + * DOC: ignore_bad_page_threshold (bool) Bad page threshold specifies
> + * the threshold value of faulty pages detected by RAS ECC. Once the
> + * threshold is hit, the GPU will not be initialized. Use this parameter
> + * to ignore the bad page threshold so that information gathering can
> + * still be performed. This also allows for booting the GPU to clear
> + * the RAS EEPROM table.
> + */
> +
> +MODULE_PARM_DESC(ignore_bad_page_threshold, "Ignore bad page threshold (false = respect bad page threshold (default value)");
> +module_param_named(ignore_bad_page_threshold, amdgpu_ignore_bad_page_threshold, bool, 0644);
> +
>  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
>  module_param_named(num_kcq, amdgpu_num_kcq, int, 0444);
>  


* RE: [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold
  2021-10-19 18:08 ` [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold Felix Kuehling
@ 2021-10-19 18:22   ` Russell, Kent
  2021-10-19 18:42     ` Luben Tuikov
  0 siblings, 1 reply; 13+ messages in thread
From: Russell, Kent @ 2021-10-19 18:22 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx; +Cc: Tuikov, Luben, Joshi, Mukul

[AMD Official Use Only]



> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@amd.com>
> Sent: Tuesday, October 19, 2021 2:09 PM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Joshi, Mukul <Mukul.Joshi@amd.com>
> Subject: Re: [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold
> 
> On 2021-10-19 at 1:50 p.m., Kent Russell wrote:
> > Currently dmesg doesn't warn when the number of bad pages approaches the
> > threshold for page retirement. WARN when the number of bad pages
> > is at 90% or greater for easier checks and planning, instead of waiting
> > until the GPU is full of bad pages
> >
> > Cc: Luben Tuikov <luben.tuikov@amd.com>
> > Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 98732518543e..8270aad23a06 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1077,6 +1077,16 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
> >  		if (res)
> >  			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
> >  				  res);
> > +
> > +		/* threshold = -1 is automatic, threshold = 0 means that page
> > +		 * retirement is disabled.
> > +		 */
> > +		if (amdgpu_bad_page_threshold > 0 &&
> > +		    control->ras_num_recs >= 0 &&
> > +		    control->ras_num_recs >= (amdgpu_bad_page_threshold * 9 / 10))
> > +			DRM_WARN("RAS records:%u approaching threshold:%d",
> > +					control->ras_num_recs,
> > +					amdgpu_bad_page_threshold);
> 
> This won't work for the default setting amdgpu_bad_page_threshold=-1.
> For this case, you'd have to take the threshold from
> ras->bad_page_cnt_threshold.

Yep, completely missed that. Thanks, I'll fix that up.

 Kent
> 
> Regards,
>    Felix
> 
> 
> >  	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
> >  		   amdgpu_bad_page_threshold != 0) {
> >  		res = __verify_ras_table_checksum(control);


* RE: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold
  2021-10-19 18:13   ` Felix Kuehling
@ 2021-10-19 18:23     ` Russell, Kent
  0 siblings, 0 replies; 13+ messages in thread
From: Russell, Kent @ 2021-10-19 18:23 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx; +Cc: Tuikov, Luben, Joshi, Mukul

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@amd.com>
> Sent: Tuesday, October 19, 2021 2:13 PM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Joshi, Mukul <Mukul.Joshi@amd.com>
> Subject: Re: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page
> threshold
> 
> 
> On 2021-10-19 at 1:50 p.m., Kent Russell wrote:
> > When a GPU hits the bad_page_threshold, it will not be initialized by
> > the amdgpu driver. This means that the table cannot be cleared, nor can
> > information gathering be performed (getting serial number, BDF, etc).
> > Add an override called ignore_bad_page_threshold that can be set to true
> > to still initialize the GPU, even when the bad page threshold has been
> > reached.
> Do you really need a new parameter for this? Wouldn't it be enough to
> set bad_page_threshold to the VRAM size? You could use a new special
> value (e.g. bad_page_threshold=-2) for that.

Ah interesting. That could definitely work here. I hadn't thought about co-opting another variable. We already check -1, so why not -2? Great insight. Thanks!

 Kent

> 
> Regards,
>   Felix
> 
> 
> >
> > Cc: Luben Tuikov <luben.tuikov@amd.com>
> > Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 +++++++++++++
> >  2 files changed, 14 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index d58e37fd01f4..b85b67a88a3d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
> >  extern int amdgpu_ras_enable;
> >  extern uint amdgpu_ras_mask;
> >  extern int amdgpu_bad_page_threshold;
> > +extern bool amdgpu_ignore_bad_page_threshold;
> >  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
> >  extern int amdgpu_async_gfx_ring;
> >  extern int amdgpu_mcbp;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 96bd63aeeddd..3e9a7b072888 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -189,6 +189,7 @@ struct amdgpu_mgpu_info mgpu_info = {
> >  int amdgpu_ras_enable = -1;
> >  uint amdgpu_ras_mask = 0xffffffff;
> >  int amdgpu_bad_page_threshold = -1;
> > +bool amdgpu_ignore_bad_page_threshold;
> >  struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
> >  	.timeout_fatal_disable = false,
> >  	.period = 0x0, /* default to 0x0 (timeout disable) */
> > @@ -880,6 +881,18 @@ module_param_named(reset_method, amdgpu_reset_method,
> int, 0444);
> >  MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default
> value), 0 = disable bad page retirement)");
> >  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
> >
> > +/**
> > + * DOC: ignore_bad_page_threshold (bool) Bad page threshold specifies
> > + * the threshold value of faulty pages detected by RAS ECC. Once the
> > + * threshold is hit, the GPU will not be initialized. Use this parameter
> > + * to ignore the bad page threshold so that information gathering can
> > + * still be performed. This also allows for booting the GPU to clear
> > + * the RAS EEPROM table.
> > + */
> > +
> > +MODULE_PARM_DESC(ignore_bad_page_threshold, "Ignore bad page threshold (false =
> respect bad page threshold (default value)");
> > +module_param_named(ignore_bad_page_threshold,
> amdgpu_ignore_bad_page_threshold, bool, 0644);
> > +
> >  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup
> (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
> >  module_param_named(num_kcq, amdgpu_num_kcq, int, 0444);
> >


* Re: [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold
  2021-10-19 18:22   ` Russell, Kent
@ 2021-10-19 18:42     ` Luben Tuikov
  0 siblings, 0 replies; 13+ messages in thread
From: Luben Tuikov @ 2021-10-19 18:42 UTC (permalink / raw)
  To: Russell, Kent, Kuehling, Felix, amd-gfx; +Cc: Joshi, Mukul

On 2021-10-19 14:22, Russell, Kent wrote:
> [AMD Official Use Only]
>
>
>
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling@amd.com>
>> Sent: Tuesday, October 19, 2021 2:09 PM
>> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Joshi, Mukul <Mukul.Joshi@amd.com>
>> Subject: Re: [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold
>>
>> On 2021-10-19 at 1:50 p.m., Kent Russell wrote:
>>> Currently dmesg doesn't warn when the number of bad pages approaches the
>>> threshold for page retirement. WARN when the number of bad pages
>>> is at 90% or greater for easier checks and planning, instead of waiting
>>> until the GPU is full of bad pages
>>>
>>> Cc: Luben Tuikov <luben.tuikov@amd.com>
>>> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
>>> Signed-off-by: Kent Russell <kent.russell@amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 ++++++++++
>>>  1 file changed, 10 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> index 98732518543e..8270aad23a06 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> @@ -1077,6 +1077,16 @@ int amdgpu_ras_eeprom_init(struct
>> amdgpu_ras_eeprom_control *control,
>>>  		if (res)
>>>  			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>>>  				  res);
>>> +
>>> +		/* threshold = -1 is automatic, threshold = 0 means that page
>>> +		 * retirement is disabled.
>>> +		 */
>>> +		if (amdgpu_bad_page_threshold > 0 &&
>>> +		    control->ras_num_recs >= 0 &&
>>> +		    control->ras_num_recs >= (amdgpu_bad_page_threshold * 9 / 10))
>>> +			DRM_WARN("RAS records:%u approaching threshold:%d",
>>> +					control->ras_num_recs,
>>> +					amdgpu_bad_page_threshold);
>> This won't work for the default setting amdgpu_bad_page_threshold=-1.
>> For this case, you'd have to take the threshold from
>> ras->bad_page_cnt_threshold.
> Yep, completely missed that. Thanks, I'll fix that up.

Please also fix the round off, third conditional:

a >= b * 9/10   <==>   10*a >= 9*b

Then, you can also drop the second line, since given the first and third:

b > 0  ==>   10*a >= 9*b > 0   ==>  10*a > 0  ==>  a > 0.

Which shows that

b > 0 && 10*a >= 9*b

already implies a > 0, so you don't need the middle line of the check.

Also in your message, say something like:

DRM_WARN("RAS records:%u approaching a 90%% threshold:%d",
             control->ras_num_recs,
             amdgpu_bad_page_threshold);

Regards,
Luben
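
The round-off point can be seen in a small stand-alone sketch. This is
not driver code; the function names are made up for illustration, with
num_recs standing in for control->ras_num_recs and threshold for the
bad page threshold.

```c
#include <stdbool.h>
#include <stdint.h>

/* Posted form: threshold * 9 / 10 truncates, so the cutoff is rounded down. */
bool near_threshold_truncating(uint32_t num_recs, int threshold)
{
	return threshold > 0 &&
	       num_recs >= (uint32_t)((int64_t)threshold * 9 / 10);
}

/* Cross-multiplied form: no division, so no rounding; 64-bit math avoids overflow. */
bool near_threshold_exact(uint32_t num_recs, int threshold)
{
	return threshold > 0 &&
	       (uint64_t)num_recs * 10 >= (uint64_t)threshold * 9;
}
```

With a threshold of 15, 15 * 9 / 10 truncates to 13, so the first form
already fires at 13 records (about 86.7%). The second form waits until
14 records, since 10 * 13 = 130 < 135 but 10 * 14 = 140 >= 135.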

>
>  Kent
>> Regards,
>>    Felix
>>
>>
>>>  	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
>>>  		   amdgpu_bad_page_threshold != 0) {
>>>  		res = __verify_ras_table_checksum(control);



* Re: [PATCH 2/4] drm/amdgpu: Clarify error when hitting bad page threshold
  2021-10-19 17:50 ` [PATCH 2/4] drm/amdgpu: Clarify error when hitting bad page threshold Kent Russell
@ 2021-10-19 18:47   ` Luben Tuikov
  0 siblings, 0 replies; 13+ messages in thread
From: Luben Tuikov @ 2021-10-19 18:47 UTC (permalink / raw)
  To: Kent Russell, amd-gfx; +Cc: Mukul Joshi

Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

Regards,
Luben

On 2021-10-19 13:50, Kent Russell wrote:
> Change the error message when the bad_page_threshold is reached,
> explicitly stating that the GPU will not be initialized.
>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> Signed-off-by: Kent Russell <kent.russell@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 8270aad23a06..7bb506a0ebd6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1111,7 +1111,7 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>  			*exceed_err_limit = true;
>  			dev_err(adev->dev,
>  				"RAS records:%d exceed threshold:%d, "
> -				"maybe retire this GPU?",
> +				"GPU will not be initialized. Replace this GPU or increase the threshold",
>  				control->ras_num_recs, ras->bad_page_cnt_threshold);
>  		}
>  	} else {



* Re: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold
  2021-10-19 17:50 ` [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring " Kent Russell
  2021-10-19 18:13   ` Felix Kuehling
@ 2021-10-20 10:55   ` Christian König
  2021-10-20 14:56     ` Russell, Kent
  1 sibling, 1 reply; 13+ messages in thread
From: Christian König @ 2021-10-20 10:55 UTC (permalink / raw)
  To: Kent Russell, amd-gfx; +Cc: Luben Tuikov, Mukul Joshi

On 19.10.21 at 19:50, Kent Russell wrote:
> When a GPU hits the bad_page_threshold, it will not be initialized by
> the amdgpu driver. This means that the table cannot be cleared, nor can
> information gathering be performed (getting serial number, BDF, etc).
> Add an override called ignore_bad_page_threshold that can be set to true
> to still initialize the GPU, even when the bad page threshold has been
> reached.

I would rather question the practice of this bad pages threshold.

As far as I know the hardware works perfectly fine even when we have
more bad pages than expected, we should just warn really loudly about it.

Christian.

>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> Signed-off-by: Kent Russell <kent.russell@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 +++++++++++++
>   2 files changed, 14 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index d58e37fd01f4..b85b67a88a3d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
>   extern int amdgpu_ras_enable;
>   extern uint amdgpu_ras_mask;
>   extern int amdgpu_bad_page_threshold;
> +extern bool amdgpu_ignore_bad_page_threshold;
>   extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
>   extern int amdgpu_async_gfx_ring;
>   extern int amdgpu_mcbp;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 96bd63aeeddd..3e9a7b072888 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -189,6 +189,7 @@ struct amdgpu_mgpu_info mgpu_info = {
>   int amdgpu_ras_enable = -1;
>   uint amdgpu_ras_mask = 0xffffffff;
>   int amdgpu_bad_page_threshold = -1;
> +bool amdgpu_ignore_bad_page_threshold;
>   struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
>   	.timeout_fatal_disable = false,
>   	.period = 0x0, /* default to 0x0 (timeout disable) */
> @@ -880,6 +881,18 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
>   MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
>   module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
>   
> +/**
> + * DOC: ignore_bad_page_threshold (bool) Bad page threshold specifies
> + * the threshold value of faulty pages detected by RAS ECC. Once the
> + * threshold is hit, the GPU will not be initialized. Use this parameter
> + * to ignore the bad page threshold so that information gathering can
> + * still be performed. This also allows for booting the GPU to clear
> + * the RAS EEPROM table.
> + */
> +
> +MODULE_PARM_DESC(ignore_bad_page_threshold, "Ignore bad page threshold (false = respect bad page threshold (default value)");
> +module_param_named(ignore_bad_page_threshold, amdgpu_ignore_bad_page_threshold, bool, 0644);
> +
>   MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
>   module_param_named(num_kcq, amdgpu_num_kcq, int, 0444);
>   



* RE: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold
  2021-10-20 10:55   ` Christian König
@ 2021-10-20 14:56     ` Russell, Kent
  2021-10-20 15:02       ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Russell, Kent @ 2021-10-20 14:56 UTC (permalink / raw)
  To: Christian König, amd-gfx; +Cc: Tuikov, Luben, Joshi, Mukul

I can see both sides of the argument. Having a configurable threshold means that you can determine what sort of "HW reliability" you want. The default value is unlikely to be hit by the average user. And users that DO hit that threshold can determine, through the kernel parameter, whether they want to ignore it, increase it, or replace the hardware.

But having that option means it's configurable based on what that customer wants. If they believe that all data is precious, setting the threshold to something like 1 bad page means that they won't ever run on a chip that ever had a bad page, thus ensuring data integrity. It seems ludicrous to me to have a value so low, but I am sure that someone out there would want to remove a GPU as soon as it has one bad page retired. And some people couldn't care any less, so they can set it to disabled or ignored or whatever they wish.

As it stands, we have at least two customers who are focused on having the threshold automatically remove the GPUs from use, to ensure data integrity. They just want warnings to know that it's getting bad (my 90% threshold patch), so that they can plan for HW replacement accordingly. 

 Kent
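
For reference, the 90% check described above can be sketched as a standalone helper. This is a minimal sketch modeled on the condition in patch 1/4, not the driver code itself: the function name ras_near_threshold is hypothetical, and the parameters stand in for control->ras_num_recs and amdgpu_bad_page_threshold.

```c
#include <stdbool.h>

/*
 * Hypothetical helper modeled on the check in patch 1/4.
 *
 * threshold: -1 = automatic, 0 = page retirement disabled,
 *            > 0 = explicit limit set by the user.
 * Returns true when num_recs is at or above 90% of an explicit limit.
 */
bool ras_near_threshold(unsigned int num_recs, int threshold)
{
	/* Only an explicit, positive threshold can be "approached". */
	if (threshold <= 0)
		return false;

	/* Integer arithmetic: 90% of the threshold, rounded down.
	 * Widened to long long so threshold * 9 cannot overflow.
	 */
	return (long long)num_recs >= (long long)threshold * 9 / 10;
}
```

Because the cut-off is rounded down, a threshold of 1 gives a cut-off of 0, so this expression reports "approaching" even with zero records.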

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Wednesday, October 20, 2021 6:55 AM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Joshi, Mukul <Mukul.Joshi@amd.com>
> Subject: Re: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold
> 
> On 19.10.21 at 19:50, Kent Russell wrote:
> > When a GPU hits the bad_page_threshold, it will not be initialized by
> > the amdgpu driver. This means that the table cannot be cleared, nor can
> > information gathering be performed (getting serial number, BDF, etc).
> > Add an override called ignore_bad_page_threshold that can be set to true
> > to still initialize the GPU, even when the bad page threshold has been
> > reached.
> 
> I would rather question the practice of this bad pages threshold.
> 
> As far as I know the hardware works perfectly fine even when we have
> more bad pages than expected; we should just warn really loudly about it.
> 
> Christian.
> 
> >
> > Cc: Luben Tuikov <luben.tuikov@amd.com>
> > Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 +++++++++++++
> >   2 files changed, 14 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index d58e37fd01f4..b85b67a88a3d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
> >   extern int amdgpu_ras_enable;
> >   extern uint amdgpu_ras_mask;
> >   extern int amdgpu_bad_page_threshold;
> > +extern bool amdgpu_ignore_bad_page_threshold;
> >   extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
> >   extern int amdgpu_async_gfx_ring;
> >   extern int amdgpu_mcbp;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 96bd63aeeddd..3e9a7b072888 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -189,6 +189,7 @@ struct amdgpu_mgpu_info mgpu_info = {
> >   int amdgpu_ras_enable = -1;
> >   uint amdgpu_ras_mask = 0xffffffff;
> >   int amdgpu_bad_page_threshold = -1;
> > +bool amdgpu_ignore_bad_page_threshold;
> >   struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
> >   	.timeout_fatal_disable = false,
> >   	.period = 0x0, /* default to 0x0 (timeout disable) */
> > @@ -880,6 +881,18 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
> >   MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
> >   module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
> >
> > +/**
> > + * DOC: ignore_bad_page_threshold (bool) Bad page threshold specifies
> > + * the threshold value of faulty pages detected by RAS ECC. Once the
> > + * threshold is hit, the GPU will not be initialized. Use this parameter
> > + * to ignore the bad page threshold so that information gathering can
> > + * still be performed. This also allows for booting the GPU to clear
> > + * the RAS EEPROM table.
> > + */
> > +
> > +MODULE_PARM_DESC(ignore_bad_page_threshold, "Ignore bad page threshold (false = respect bad page threshold (default value))");
> > +module_param_named(ignore_bad_page_threshold, amdgpu_ignore_bad_page_threshold, bool, 0644);
> > +
> >   MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
> >   module_param_named(num_kcq, amdgpu_num_kcq, int, 0444);
> >


* Re: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold
  2021-10-20 14:56     ` Russell, Kent
@ 2021-10-20 15:02       ` Christian König
  0 siblings, 0 replies; 13+ messages in thread
From: Christian König @ 2021-10-20 15:02 UTC (permalink / raw)
  To: Russell, Kent, amd-gfx; +Cc: Tuikov, Luben, Joshi, Mukul

> As it stands, we have at least two customers who are focused on having the threshold automatically remove the GPUs from use, to ensure data integrity. They just want warnings to know that it's getting bad (my 90% threshold patch), so that they can plan for HW replacement accordingly.

We could handle that outside of the kernel driver, e.g. in userspace.

Not loading the driver at all results in numerous problems, and I think 
not being able to retrieve or reset the bad pages counter is just the 
tip of the iceberg here.

To be honest, I'm pretty sure that refusing to load the driver when the 
hardware would otherwise work is a no-go, and that design is really 
questionable.

Christian.
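
One way to picture the userspace alternative suggested above is a small policy function that a monitoring tool could apply once the driver has loaded. This is only a sketch under stated assumptions: the enum, the function name classify_gpu and the idea of a userspace-side limit are all hypothetical, and how the bad page count is obtained from the driver is deliberately left out.

```c
/* Hypothetical actions a userspace monitoring tool might take. */
enum gpu_action {
	GPU_KEEP,              /* well below the limit */
	GPU_PLAN_REPLACEMENT,  /* at or above 90% of the limit */
	GPU_REMOVE_FROM_POOL   /* limit reached or exceeded */
};

/*
 * Classify a GPU from its bad page count and a limit chosen by the
 * operator. A limit of 0 is taken here to mean "no limit configured"
 * (an assumption of this sketch, not a driver convention).
 */
enum gpu_action classify_gpu(unsigned int bad_pages, unsigned int limit)
{
	if (limit == 0)
		return GPU_KEEP;

	if (bad_pages >= limit)
		return GPU_REMOVE_FROM_POOL;

	/* bad_pages / limit >= 9 / 10, cross-multiplied to avoid
	 * rounding; widened so the products cannot overflow.
	 */
	if ((unsigned long long)bad_pages * 10 >=
	    (unsigned long long)limit * 9)
		return GPU_PLAN_REPLACEMENT;

	return GPU_KEEP;
}
```

With a policy like this the driver always initializes the GPU, so the serial number, BDF and the RAS EEPROM table stay reachable, and the decision to pull the device is made outside the kernel.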


Thread overview: 13+ messages
2021-10-19 17:50 [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold Kent Russell
2021-10-19 17:50 ` [PATCH 2/4] drm/amdgpu: Clarify error when hitting bad page threshold Kent Russell
2021-10-19 18:47   ` Luben Tuikov
2021-10-19 17:50 ` [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring " Kent Russell
2021-10-19 18:13   ` Felix Kuehling
2021-10-19 18:23     ` Russell, Kent
2021-10-20 10:55   ` Christian König
2021-10-20 14:56     ` Russell, Kent
2021-10-20 15:02       ` Christian König
2021-10-19 17:50 ` [PATCH 4/4] drm/amdgpu: Implement ignore_bad_page_threshold parameter Kent Russell
2021-10-19 18:08 ` [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold Felix Kuehling
2021-10-19 18:22   ` Russell, Kent
2021-10-19 18:42     ` Luben Tuikov
