[PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold
@ 2021-10-20 16:35 Kent Russell
  2021-10-20 16:35 ` [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Kent Russell @ 2021-10-20 16:35 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi

Currently dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. WARN when the number of bad pages
is at 90% or greater for easier checks and planning, instead of waiting
until the GPU is full of bad pages

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..1ede0f0d6f55 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1071,12 +1071,29 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 	control->ras_fri = RAS_OFFSET_TO_INDEX(control, hdr->first_rec_offset);
 
 	if (hdr->header == RAS_TABLE_HDR_VAL) {
+		int threshold = 0;
 		DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
 				 control->ras_num_recs);
 		res = __verify_ras_table_checksum(control);
 		if (res)
 			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
 				  res);
+
+		/* threshold = 0 means that page retirement is disabled, while
+		 * threshold = -1 means default behaviour
+		 */
+		if (amdgpu_bad_page_threshold == -1)
+			threshold = ras->bad_page_cnt_threshold;
+		else if (amdgpu_bad_page_threshold > 0)
+			threshold = amdgpu_bad_page_threshold;
+
+		/* Since multiplcation is transitive, a = 9b/10 is the same
+		 * as 10a = 9b. Use this for our 90% limit to avoid rounding
+		 */
+		if (threshold > 0 && ((control->ras_num_recs * 10) >= (threshold * 9)))
+			DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
+					control->ras_num_recs,
+					threshold);
 	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
 		   amdgpu_bad_page_threshold != 0) {
 		res = __verify_ras_table_checksum(control);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold
  2021-10-20 16:35 [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
@ 2021-10-20 16:35 ` Kent Russell
  2021-10-20 16:35 ` [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case Kent Russell
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Kent Russell @ 2021-10-20 16:35 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi

When a GPU hits the bad_page_threshold, it will not be initialized by
the amdgpu driver. This means that the table cannot be cleared, nor can
information gathering be performed (getting serial number, BDF, etc).
Add an override by using amdgpu_bad_page_threshold = -2 which will still
initialize the GPU, even when the bad page threshold has been reached.

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d58e37fd01f4..b85b67a88a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
 extern int amdgpu_bad_page_threshold;
+extern bool amdgpu_ignore_bad_page_threshold;
 extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 96bd63aeeddd..eee3cf874e7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
  * result in the GPU entering bad status when the number of total
  * faulty pages by ECC exceeds the threshold value.
  */
-MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
 module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
 
 MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
  2021-10-20 16:35 [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
  2021-10-20 16:35 ` [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
@ 2021-10-20 16:35 ` Kent Russell
  2021-10-20 21:54   ` Felix Kuehling
  2021-10-21  5:24   ` Lazar, Lijo
  2021-10-20 21:47 ` [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Luben Tuikov
  2021-10-20 21:50 ` Felix Kuehling
  3 siblings, 2 replies; 14+ messages in thread
From: Kent Russell @ 2021-10-20 16:35 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi

If the bad_page_threshold kernel parameter is set to -2,
continue to post the GPU. Print a warning to dmesg that this action has
been done, and that page retirement will obviously not work for said GPU

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 1ede0f0d6f55..31852330c1db 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 			res = amdgpu_ras_eeprom_correct_header_tag(control,
 								   RAS_TABLE_HDR_VAL);
 		} else {
-			*exceed_err_limit = true;
-			dev_err(adev->dev,
-				"RAS records:%d exceed threshold:%d, "
-				"GPU will not be initialized. Replace this GPU or increase the threshold",
+			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
 				control->ras_num_recs, ras->bad_page_cnt_threshold);
+			if (amdgpu_bad_page_threshold == -2) {
+				dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
+				dev_warn(adev->dev, "Page retirement will not work for this GPU in this state.");
+				res = 0;
+			} else {
+				*exceed_err_limit = true;
+				dev_err(adev->dev, "GPU will not be initialized. Replace this GPU or increase the threshold.");
+			}
 		}
 	} else {
 		DRM_INFO("Creating a new EEPROM table");
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-20 16:35 [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
  2021-10-20 16:35 ` [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
  2021-10-20 16:35 ` [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case Kent Russell
@ 2021-10-20 21:47 ` Luben Tuikov
  2021-10-21 14:04   ` Russell, Kent
  2021-10-20 21:50 ` Felix Kuehling
  3 siblings, 1 reply; 14+ messages in thread
From: Luben Tuikov @ 2021-10-20 21:47 UTC (permalink / raw)
  To: Kent Russell, amd-gfx; +Cc: Mukul Joshi

[-- Attachment #1: Type: text/html, Size: 10133 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-20 16:35 [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
                   ` (2 preceding siblings ...)
  2021-10-20 21:47 ` [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Luben Tuikov
@ 2021-10-20 21:50 ` Felix Kuehling
  2021-10-20 22:09   ` Luben Tuikov
  2021-10-20 22:31   ` Felix Kuehling
  3 siblings, 2 replies; 14+ messages in thread
From: Felix Kuehling @ 2021-10-20 21:50 UTC (permalink / raw)
  To: amd-gfx, Russell, Kent; +Cc: Luben Tuikov, Mukul Joshi

On 2021-10-20 12:35 p.m., Kent Russell wrote:
> Currently dmesg doesn't warn when the number of bad pages approaches the
> threshold for page retirement. WARN when the number of bad pages
> is at 90% or greater for easier checks and planning, instead of waiting
> until the GPU is full of bad pages
>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> Signed-off-by: Kent Russell <kent.russell@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 17 +++++++++++++++++
>   1 file changed, 17 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index f4c05ff4b26c..1ede0f0d6f55 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1071,12 +1071,29 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>   	control->ras_fri = RAS_OFFSET_TO_INDEX(control, hdr->first_rec_offset);
>   
>   	if (hdr->header == RAS_TABLE_HDR_VAL) {
> +		int threshold = 0;

ras->bad_page_cnt_threshold is uint32_t. I'd recommend using the same 
type. Also add an empty line after the declaration to avoid a checkpatch 
warning.


>   		DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
>   				 control->ras_num_recs);
>   		res = __verify_ras_table_checksum(control);
>   		if (res)
>   			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>   				  res);
> +
> +		/* threshold = 0 means that page retirement is disabled, while
> +		 * threshold = -1 means default behaviour
> +		 */
> +		if (amdgpu_bad_page_threshold == -1)
> +			threshold = ras->bad_page_cnt_threshold;
> +		else if (amdgpu_bad_page_threshold > 0)
> +			threshold = amdgpu_bad_page_threshold;
> +
> +		/* Since multiplcation is transitive, a = 9b/10 is the same
> +		 * as 10a = 9b. Use this for our 90% limit to avoid rounding
> +		 */
> +		if (threshold > 0 && ((control->ras_num_recs * 10) >= (threshold * 9)))

Not sure how big these values can get, but you may need to cast to 
(uint64_t) before the multiplications to avoid overflows. Alternatively 
you could use (control->ras_num_recs / 9 >= threshold / 10). It'll 
round, but never overflow.


> +			DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",

Nitpick: I'd add space after the two colons for readability. The 
threshold should use %u if you make it uint32_t. This can never be negative.

Regards,
   Felix


> +					control->ras_num_recs,
> +					threshold);
>   	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
>   		   amdgpu_bad_page_threshold != 0) {
>   		res = __verify_ras_table_checksum(control);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
  2021-10-20 16:35 ` [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case Kent Russell
@ 2021-10-20 21:54   ` Felix Kuehling
  2021-10-20 22:01     ` Luben Tuikov
  2021-10-21  5:24   ` Lazar, Lijo
  1 sibling, 1 reply; 14+ messages in thread
From: Felix Kuehling @ 2021-10-20 21:54 UTC (permalink / raw)
  To: Kent Russell, amd-gfx; +Cc: Luben Tuikov, Mukul Joshi

On 2021-10-20 12:35 p.m., Kent Russell wrote:
> If the bad_page_threshold kernel parameter is set to -2,
> continue to post the GPU. Print a warning to dmesg that this action has
> been done, and that page retirement will obviously not work for said GPU

I'd squash patch 2 and 3. The squashed patch is

Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>


>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> Signed-off-by: Kent Russell <kent.russell@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
>   1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 1ede0f0d6f55..31852330c1db 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>   			res = amdgpu_ras_eeprom_correct_header_tag(control,
>   								   RAS_TABLE_HDR_VAL);
>   		} else {
> -			*exceed_err_limit = true;
> -			dev_err(adev->dev,
> -				"RAS records:%d exceed threshold:%d, "
> -				"GPU will not be initialized. Replace this GPU or increase the threshold",
> +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
>   				control->ras_num_recs, ras->bad_page_cnt_threshold);
> +			if (amdgpu_bad_page_threshold == -2) {
> +				dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
> +				dev_warn(adev->dev, "Page retirement will not work for this GPU in this state.");
> +				res = 0;
> +			} else {
> +				*exceed_err_limit = true;
> +				dev_err(adev->dev, "GPU will not be initialized. Replace this GPU or increase the threshold.");
> +			}
>   		}
>   	} else {
>   		DRM_INFO("Creating a new EEPROM table");

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
  2021-10-20 21:54   ` Felix Kuehling
@ 2021-10-20 22:01     ` Luben Tuikov
  2021-10-21 13:57       ` Russell, Kent
  0 siblings, 1 reply; 14+ messages in thread
From: Luben Tuikov @ 2021-10-20 22:01 UTC (permalink / raw)
  To: Felix Kuehling, Kent Russell, amd-gfx; +Cc: Mukul Joshi

On 2021-10-20 17:54, Felix Kuehling wrote:
> On 2021-10-20 12:35 p.m., Kent Russell wrote:
>> If the bad_page_threshold kernel parameter is set to -2,
>> continue to post the GPU. Print a warning to dmesg that this action has
>> been done, and that page retirement will obviously not work for said GPU
> I'd squash patch 2 and 3. The squashed patch is
>
> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>

I was just thinking the same thing. Keep the title and text of patch 2 and add the description of 3 to 2. With that done:

Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

Regards,
Luben

>
>
>> Cc: Luben Tuikov <luben.tuikov@amd.com>
>> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
>> Signed-off-by: Kent Russell <kent.russell@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
>>   1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> index 1ede0f0d6f55..31852330c1db 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> @@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>>   			res = amdgpu_ras_eeprom_correct_header_tag(control,
>>   								   RAS_TABLE_HDR_VAL);
>>   		} else {
>> -			*exceed_err_limit = true;
>> -			dev_err(adev->dev,
>> -				"RAS records:%d exceed threshold:%d, "
>> -				"GPU will not be initialized. Replace this GPU or increase the threshold",
>> +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
>>   				control->ras_num_recs, ras->bad_page_cnt_threshold);
>> +			if (amdgpu_bad_page_threshold == -2) {
>> +				dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
>> +				dev_warn(adev->dev, "Page retirement will not work for this GPU in this state.");
>> +				res = 0;
>> +			} else {
>> +				*exceed_err_limit = true;
>> +				dev_err(adev->dev, "GPU will not be initialized. Replace this GPU or increase the threshold.");
>> +			}
>>   		}
>>   	} else {
>>   		DRM_INFO("Creating a new EEPROM table");


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-20 21:50 ` Felix Kuehling
@ 2021-10-20 22:09   ` Luben Tuikov
  2021-10-20 22:31   ` Felix Kuehling
  1 sibling, 0 replies; 14+ messages in thread
From: Luben Tuikov @ 2021-10-20 22:09 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx, Russell, Kent; +Cc: Mukul Joshi

On 2021-10-20 17:50, Felix Kuehling wrote:
> On 2021-10-20 12:35 p.m., Kent Russell wrote:
>> Currently dmesg doesn't warn when the number of bad pages approaches the
>> threshold for page retirement. WARN when the number of bad pages
>> is at 90% or greater for easier checks and planning, instead of waiting
>> until the GPU is full of bad pages
>>
>> Cc: Luben Tuikov <luben.tuikov@amd.com>
>> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
>> Signed-off-by: Kent Russell <kent.russell@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> index f4c05ff4b26c..1ede0f0d6f55 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> @@ -1071,12 +1071,29 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>>   	control->ras_fri = RAS_OFFSET_TO_INDEX(control, hdr->first_rec_offset);
>>   
>>   	if (hdr->header == RAS_TABLE_HDR_VAL) {
>> +		int threshold = 0;
> ras->bad_page_cnt_threshold is uint32_t. I'd recommend using the same 
> type. Also add an empty line after the declaration to avoid a checkpatch 
> warning.
>
>
>>   		DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
>>   				 control->ras_num_recs);
>>   		res = __verify_ras_table_checksum(control);
>>   		if (res)
>>   			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>>   				  res);
>> +
>> +		/* threshold = 0 means that page retirement is disabled, while
>> +		 * threshold = -1 means default behaviour
>> +		 */
>> +		if (amdgpu_bad_page_threshold == -1)
>> +			threshold = ras->bad_page_cnt_threshold;
>> +		else if (amdgpu_bad_page_threshold > 0)
>> +			threshold = amdgpu_bad_page_threshold;
>> +
>> +		/* Since multiplcation is transitive, a = 9b/10 is the same
>> +		 * as 10a = 9b. Use this for our 90% limit to avoid rounding
>> +		 */
>> +		if (threshold > 0 && ((control->ras_num_recs * 10) >= (threshold * 9)))
> Not sure how big these values can get, but you may need to cast to 
> (uint64_t) before the multiplications to avoid overflows. Alternatively 
> you could use (control->ras_num_recs / 9 >= threshold / 10). It'll 
> round, but never overflow.

I sincerely hope that by the time those values overflow multiplication by 10, AI has taken over the planet. :-)
Avoiding rounding is preferable, since we deal with integers, and for small pages... could we get 0s after division? (if the page limit is 1 :-) )
(I think squashing numbers less than 10 to 0 is a bad idea, so as not to get false positives here on small numbers.)

Even for a 32-bit word size: can we be within 3 bits of 2^32? A value of 2^29? That's a lot of pages! Actually the number of pages of VRAM is *a lot smaller* than the number of pages we fit in a typical EEPROM we put with our parts. I think we're safe.

>
>
>> +			DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
> Nitpick: I'd add space after the two colons for readability. The 
> threshold should use %u if you make it uint32_t. This can never be negative.

Aha, so that's exactly what I like "word:%q word:%q word:%q", so when I read it in the log, my eyes scan over it, on the spaces in between the word blocks. I prefer no spaces after the colon, so as to make scanning blocks (one has to squint to see it).

Regards,
Luben


>
> Regards,
>    Felix
>
>
>> +					control->ras_num_recs,
>> +					threshold);
>>   	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
>>   		   amdgpu_bad_page_threshold != 0) {
>>   		res = __verify_ras_table_checksum(control);


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-20 21:50 ` Felix Kuehling
  2021-10-20 22:09   ` Luben Tuikov
@ 2021-10-20 22:31   ` Felix Kuehling
  1 sibling, 0 replies; 14+ messages in thread
From: Felix Kuehling @ 2021-10-20 22:31 UTC (permalink / raw)
  To: amd-gfx, Russell, Kent; +Cc: Luben Tuikov, Mukul Joshi

On 2021-10-20 5:50 p.m., Felix Kuehling wrote:
> On 2021-10-20 12:35 p.m., Kent Russell wrote:
>> Currently dmesg doesn't warn when the number of bad pages approaches the
>> threshold for page retirement. WARN when the number of bad pages
>> is at 90% or greater for easier checks and planning, instead of waiting
>> until the GPU is full of bad pages
>>
>> Cc: Luben Tuikov <luben.tuikov@amd.com>
>> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
>> Signed-off-by: Kent Russell <kent.russell@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> index f4c05ff4b26c..1ede0f0d6f55 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> @@ -1071,12 +1071,29 @@ int amdgpu_ras_eeprom_init(struct 
>> amdgpu_ras_eeprom_control *control,
>>       control->ras_fri = RAS_OFFSET_TO_INDEX(control, 
>> hdr->first_rec_offset);
>>         if (hdr->header == RAS_TABLE_HDR_VAL) {
>> +        int threshold = 0;
>
> ras->bad_page_cnt_threshold is uint32_t. I'd recommend using the same 
> type. Also add an empty line after the declaration to avoid a 
> checkpatch warning.
>
>
>>           DRM_DEBUG_DRIVER("Found existing EEPROM table with %d 
>> records",
>>                    control->ras_num_recs);
>>           res = __verify_ras_table_checksum(control);
>>           if (res)
>>               DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>>                     res);
>> +
>> +        /* threshold = 0 means that page retirement is disabled, while
>> +         * threshold = -1 means default behaviour
>> +         */
>> +        if (amdgpu_bad_page_threshold == -1)
>> +            threshold = ras->bad_page_cnt_threshold;
>> +        else if (amdgpu_bad_page_threshold > 0)
>> +            threshold = amdgpu_bad_page_threshold;
>> +
>> +        /* Since multiplcation is transitive, a = 9b/10 is the same
>> +         * as 10a = 9b. Use this for our 90% limit to avoid rounding
>> +         */
>> +        if (threshold > 0 && ((control->ras_num_recs * 10) >= 
>> (threshold * 9)))
>
> Not sure how big these values can get, but you may need to cast to 
> (uint64_t) before the multiplications to avoid overflows. 
> Alternatively you could use (control->ras_num_recs / 9 >= threshold / 
> 10). It'll round, but never overflow.

Ignore this comment. If you follow Luben's recommendation to check 
ras->bad_page_cnt_threshold instead of the raw module parameter, the 
limit is small enough that multiplication will never overflow.

Regards,
   Felix


>
>
>> +            DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
>
> Nitpick: I'd add space after the two colons for readability. The 
> threshold should use %u if you make it uint32_t. This can never be 
> negative.
>
> Regards,
>   Felix
>
>
>> + control->ras_num_recs,
>> +                    threshold);
>>       } else if (hdr->header == RAS_TABLE_HDR_BAD &&
>>              amdgpu_bad_page_threshold != 0) {
>>           res = __verify_ras_table_checksum(control);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
  2021-10-20 16:35 ` [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case Kent Russell
  2021-10-20 21:54   ` Felix Kuehling
@ 2021-10-21  5:24   ` Lazar, Lijo
  2021-10-21 13:56     ` Russell, Kent
  1 sibling, 1 reply; 14+ messages in thread
From: Lazar, Lijo @ 2021-10-21  5:24 UTC (permalink / raw)
  To: Kent Russell, amd-gfx; +Cc: Luben Tuikov, Mukul Joshi



On 10/20/2021 10:05 PM, Kent Russell wrote:
> If the bad_page_threshold kernel parameter is set to -2,
> continue to post the GPU. Print a warning to dmesg that this action has
> been done, and that page retirement will obviously not work for said GPU
> 
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> Signed-off-by: Kent Russell <kent.russell@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
>   1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 1ede0f0d6f55..31852330c1db 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>   			res = amdgpu_ras_eeprom_correct_header_tag(control,
>   								   RAS_TABLE_HDR_VAL);
>   		} else {
> -			*exceed_err_limit = true;
> -			dev_err(adev->dev,
> -				"RAS records:%d exceed threshold:%d, "
> -				"GPU will not be initialized. Replace this GPU or increase the threshold",
> +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
>   				control->ras_num_recs, ras->bad_page_cnt_threshold);
> +			if (amdgpu_bad_page_threshold == -2) {
> +				dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
> +				dev_warn(adev->dev, "Page retirement will not work for this GPU in this state.");

Now, this looks as good as booting with 0 = disable bad page retirement. 
I thought page retirement will work as long as EEPROM has space, but it 
won't bother about the threshold. If the intent is to ignore bad page 
retirement, then 0 is good enough and -2 is not required.

Also, when user passes threshold=-2, what is the threshold being 
compared against to say *exceed_err_limit = true?

Thanks,
Lijo

> +				res = 0;
> +			} else {
> +				*exceed_err_limit = true;
> +				dev_err(adev->dev, "GPU will not be initialized. Replace this GPU or increase the threshold.");
> +			}
>   		}
>   	} else {
>   		DRM_INFO("Creating a new EEPROM table");
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
  2021-10-21  5:24   ` Lazar, Lijo
@ 2021-10-21 13:56     ` Russell, Kent
  2021-10-22 11:26       ` Lazar, Lijo
  0 siblings, 1 reply; 14+ messages in thread
From: Russell, Kent @ 2021-10-21 13:56 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx; +Cc: Tuikov, Luben, Joshi, Mukul

[AMD Official Use Only]



> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> Sent: Thursday, October 21, 2021 1:25 AM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Joshi, Mukul <Mukul.Joshi@amd.com>
> Subject: Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
> 
> 
> 
> On 10/20/2021 10:05 PM, Kent Russell wrote:
> > If the bad_page_threshold kernel parameter is set to -2,
> > continue to post the GPU. Print a warning to dmesg that this action has
> > been done, and that page retirement will obviously not work for said GPU
> >
> > Cc: Luben Tuikov <luben.tuikov@amd.com>
> > Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
> >   1 file changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 1ede0f0d6f55..31852330c1db 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
> >   			res = amdgpu_ras_eeprom_correct_header_tag(control,
> >   								   RAS_TABLE_HDR_VAL);
> >   		} else {
> > -			*exceed_err_limit = true;
> > -			dev_err(adev->dev,
> > -				"RAS records:%d exceed threshold:%d, "
> > -				"GPU will not be initialized. Replace this GPU or increase the
> threshold",
> > +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
> >   				control->ras_num_recs, ras->bad_page_cnt_threshold);
> > +			if (amdgpu_bad_page_threshold == -2) {
> > +				dev_warn(adev->dev, "GPU will be initialized due to
> bad_page_threshold = -2.");
> > +				dev_warn(adev->dev, "Page retirement will not work for
> this GPU in this state.");
> 
> Now, this looks as good as booting with 0 = disable bad page retirement.
> I thought page retirement will work as long as EEPROM has space, but it
> won't bother about the threshold. If the intent is to ignore bad page
> retirement, then 0 is good enough and -2 is not required.
> 
> Also, when user passes threshold=-2, what is the threshold being
> compared against to say *exceed_err_limit = true?

My thought on having the -2 option is that we'll still enable page retirement, we just won't shut the GPU down when it hits the threshold. The bad pages will still be retired as we hit them, it will just never disable the GPU. The comment about retirement not working is definitely incorrect now (leftover from previous local patches), so I'll remove that. In this case, I don't think we'd ever exceed the error limit. exceed_err_limit is only really used when we are disabling the GPU, so we wouldn't want to set that to true. Otherwise we wouldn't be loading the bad pages in gpu_recovery_init, and we'll still return 0 from gpu_recovery_init.

 Kent
> 
> Thanks,
> Lijo
> 
> > +				res = 0;
> > +			} else {
> > +				*exceed_err_limit = true;
> > +				dev_err(adev->dev, "GPU will not be initialized. Replace this
> GPU or increase the threshold.");
> > +			}
> >   		}
> >   	} else {
> >   		DRM_INFO("Creating a new EEPROM table");
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
  2021-10-20 22:01     ` Luben Tuikov
@ 2021-10-21 13:57       ` Russell, Kent
  0 siblings, 0 replies; 14+ messages in thread
From: Russell, Kent @ 2021-10-21 13:57 UTC (permalink / raw)
  To: Tuikov, Luben, Kuehling, Felix, amd-gfx; +Cc: Joshi, Mukul

[AMD Official Use Only]



> -----Original Message-----
> From: Tuikov, Luben <Luben.Tuikov@amd.com>
> Sent: Wednesday, October 20, 2021 6:01 PM
> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Russell, Kent <Kent.Russell@amd.com>;
> amd-gfx@lists.freedesktop.org
> Cc: Joshi, Mukul <Mukul.Joshi@amd.com>
> Subject: Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
> 
> On 2021-10-20 17:54, Felix Kuehling wrote:
> > On 2021-10-20 12:35 p.m., Kent Russell wrote:
> >> If the bad_page_threshold kernel parameter is set to -2,
> >> continue to post the GPU. Print a warning to dmesg that this action has
> >> been done, and that page retirement will obviously not work for said GPU
> > I'd squash patch 2 and 3. The squashed patch is
> >
> > Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> 
> I was just thinking the same thing. Keep the title and text of patch 2 and add the description
> of 3 to 2. With that done:
> 
> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

Sounds good, thanks. I was on the fence about combining them from when I had the separate kernel param, and it was easier to squash it at review time than to separate it. I'll still need to work on patch #1 but thanks for the reviews here!

 Kent

> 
> Regards,
> Luben
> 
> >
> >
> >> Cc: Luben Tuikov <luben.tuikov@amd.com>
> >> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> >> Signed-off-by: Kent Russell <kent.russell@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
> >>   1 file changed, 9 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> index 1ede0f0d6f55..31852330c1db 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> @@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
> >>   			res = amdgpu_ras_eeprom_correct_header_tag(control,
> >>   								   RAS_TABLE_HDR_VAL);
> >>   		} else {
> >> -			*exceed_err_limit = true;
> >> -			dev_err(adev->dev,
> >> -				"RAS records:%d exceed threshold:%d, "
> >> -				"GPU will not be initialized. Replace this GPU or increase the
> threshold",
> >> +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
> >>   				control->ras_num_recs, ras->bad_page_cnt_threshold);
> >> +			if (amdgpu_bad_page_threshold == -2) {
> >> +				dev_warn(adev->dev, "GPU will be initialized due to
> bad_page_threshold = -2.");
> >> +				dev_warn(adev->dev, "Page retirement will not work for
> this GPU in this state.");
> >> +				res = 0;
> >> +			} else {
> >> +				*exceed_err_limit = true;
> >> +				dev_err(adev->dev, "GPU will not be initialized. Replace this
> GPU or increase the threshold.");
> >> +			}
> >>   		}
> >>   	} else {
> >>   		DRM_INFO("Creating a new EEPROM table");

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-20 21:47 ` [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Luben Tuikov
@ 2021-10-21 14:04   ` Russell, Kent
  0 siblings, 0 replies; 14+ messages in thread
From: Russell, Kent @ 2021-10-21 14:04 UTC (permalink / raw)
  To: Tuikov, Luben, amd-gfx; +Cc: Joshi, Mukul

[-- Attachment #1: Type: text/plain, Size: 6973 bytes --]

[AMD Official Use Only]

My editor won't let me reply in-line without making it look like garbage.

Thanks for the insight, Luben! They're all useful points, especially the consolidation and relying on the threshold_validation which has already occurred before we get to this point (which I should've checked).

And I overdid the transitive multiplication explanation, so I wouldn't have to answer questions about it later. But your concise comment below pretty much covers things and shouldn't cause any unnecessary inquiries.

Kent

From: Tuikov, Luben <Luben.Tuikov@amd.com>
Sent: Wednesday, October 20, 2021 5:48 PM
To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Joshi, Mukul <Mukul.Joshi@amd.com>
Subject: Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

On 2021-10-20 12:35, Kent Russell wrote:

Currently dmesg doesn't warn when the number of bad pages approaches the

"Currently" is redundant in this sentence as it is already in present simple tense.

Fair point



threshold for page retirement. WARN when the number of bad pages

is at 90% or greater for easier checks and planning, instead of waiting

until the GPU is full of bad pages

Missing full-stop (period) above.







Cc: Luben Tuikov <luben.tuikov@amd.com><mailto:luben.tuikov@amd.com>

Cc: Mukul Joshi <Mukul.Joshi@amd.com><mailto:Mukul.Joshi@amd.com>

Signed-off-by: Kent Russell <kent.russell@amd.com><mailto:kent.russell@amd.com>

---

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 17 +++++++++++++++++

 1 file changed, 17 insertions(+)



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

index f4c05ff4b26c..1ede0f0d6f55 100644

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

@@ -1071,12 +1071,29 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,

        control->ras_fri = RAS_OFFSET_TO_INDEX(control, hdr->first_rec_offset);



        if (hdr->header == RAS_TABLE_HDR_VAL) {

+               int threshold = 0;

                DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",

                                control->ras_num_recs);

                res = __verify_ras_table_checksum(control);

                if (res)

                        DRM_ERROR("RAS table incorrect checksum or error:%d\n",

                                 res);

+

+               /* threshold = 0 means that page retirement is disabled, while

+                * threshold = -1 means default behaviour

+                */

+               if (amdgpu_bad_page_threshold == -1)

+                       threshold = ras->bad_page_cnt_threshold;

+               else if (amdgpu_bad_page_threshold > 0)

+                       threshold = amdgpu_bad_page_threshold;

I believe we don't need this calculation here as it's already been done for us in amdgpu_ras_validate_threshold(), in amdgpu_ras.c.


I believe you want to use "ras->bad_page_cnt_threshold" here instead. For instance of this, see a bit further down in this very function this check including the comment, in italics:

    } else if (hdr->header == RAS_TABLE_HDR_BAD &&
           amdgpu_bad_page_threshold != 0) {
        res = __verify_ras_table_checksum(control);
        if (res)
            DRM_ERROR("RAS Table incorrect checksum or error:%d\n",
                  res);
        if (ras->bad_page_cnt_threshold > control->ras_num_recs) {
            /* This means that, the threshold was increased since
             * the last time the system was booted, and now,
             * ras->bad_page_cnt_threshold - control->num_recs > 0,
             * so that at least one more record can be saved,
             * before the page count threshold is reached.
             */

And on the "else", a bit further down, again in italics:

        } else {
            *exceed_err_limit = true;
            dev_err(adev->dev,
                "RAS records:%d exceed threshold:%d, "
                "maybe retire this GPU?",
                control->ras_num_recs, ras->bad_page_cnt_threshold);
        }


See how it says "records exceed threshold"--well, with this patch you want to say "records exceed 90% of threshold." :-) So these are the quantities we gauge each other to.

Clarification on this below.





+

+               /* Since multiplcation is transitive, a = 9b/10 is the same

+                * as 10a = 9b. Use this for our 90% limit to avoid rounding

+                */

I really like the format of the comment. But I feel that the comment itself isn't necessary... at least the way it is written ("9b" may mean "9 bits" or "9 binary". I'd avoid getting into arithmetic theory, and remove the comment completely. Anything else (explaining the math) really distracts from the real purpose of what we're doing. (After all, this is C, not a class on arithmetic--they who can, will figure it out.)

Perhaps something like:

/* Warn if we get past 90% of the threshold.
 */



+               if (threshold > 0 && ((control->ras_num_recs * 10) >= (threshold * 9)))

Right, so here perhaps we want to do this instead:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 98732518543e53..2aab62fa488eba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
                if (res)
                        DRM_ERROR("RAS table incorrect checksum or error:%d\n",
                                  res);
+               /* Warn if we get past 90% of the threshold.
+                */
+               if (10 * control->ras_num_recs >= 9 * ras->bad_page_cnt_threshold)
+                       DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
+                                control->ras_num_recs,
+                                ras->bad_page_cnt_threshold);
        } else if (hdr->header == RAS_TABLE_HDR_BAD &&
                   amdgpu_bad_page_threshold != 0) {
                res = __verify_ras_table_checksum(control);

Also note that up until this point of the boot process, we don't qualify the boot by amdgpu_bad_page_threshold and I feel that this check in this embedded if-conditional shouldn't do that as well.

Regards,
Luben





+                       DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",

+                               control->ras_num_recs,

+                               threshold);

        } else if (hdr->header == RAS_TABLE_HDR_BAD &&

                   amdgpu_bad_page_threshold != 0) {

                res = __verify_ras_table_checksum(control);


[-- Attachment #2: Type: text/html, Size: 56950 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
  2021-10-21 13:56     ` Russell, Kent
@ 2021-10-22 11:26       ` Lazar, Lijo
  0 siblings, 0 replies; 14+ messages in thread
From: Lazar, Lijo @ 2021-10-22 11:26 UTC (permalink / raw)
  To: Russell, Kent, amd-gfx; +Cc: Tuikov, Luben, Joshi, Mukul



On 10/21/2021 7:26 PM, Russell, Kent wrote:
> [AMD Official Use Only]
> 
> 
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>> Sent: Thursday, October 21, 2021 1:25 AM
>> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Joshi, Mukul <Mukul.Joshi@amd.com>
>> Subject: Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
>>
>>
>>
>> On 10/20/2021 10:05 PM, Kent Russell wrote:
>>> If the bad_page_threshold kernel parameter is set to -2,
>>> continue to post the GPU. Print a warning to dmesg that this action has
>>> been done, and that page retirement will obviously not work for said GPU
>>>
>>> Cc: Luben Tuikov <luben.tuikov@amd.com>
>>> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
>>> Signed-off-by: Kent Russell <kent.russell@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
>>>    1 file changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> index 1ede0f0d6f55..31852330c1db 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> @@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct
>> amdgpu_ras_eeprom_control *control,
>>>    			res = amdgpu_ras_eeprom_correct_header_tag(control,
>>>    								   RAS_TABLE_HDR_VAL);
>>>    		} else {
>>> -			*exceed_err_limit = true;
>>> -			dev_err(adev->dev,
>>> -				"RAS records:%d exceed threshold:%d, "
>>> -				"GPU will not be initialized. Replace this GPU or increase the
>> threshold",
>>> +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
>>>    				control->ras_num_recs, ras->bad_page_cnt_threshold);
>>> +			if (amdgpu_bad_page_threshold == -2) {
>>> +				dev_warn(adev->dev, "GPU will be initialized due to
>> bad_page_threshold = -2.");
>>> +				dev_warn(adev->dev, "Page retirement will not work for
>> this GPU in this state.");
>>
>> Now, this looks as good as booting with 0 = disable bad page retirement.
>> I thought page retirement will work as long as EEPROM has space, but it
>> won't bother about the threshold. If the intent is to ignore bad page
>> retirement, then 0 is good enough and -2 is not required.
>>
>> Also, when user passes threshold=-2, what is the threshold being
>> compared against to say *exceed_err_limit = true?
> 
> My thought on having the -2 option is that we'll still enable page retirement, we just won't shut the GPU down when it hits the threshold. The bad pages will still be retired as we hit them, it will just never disable the GPU. The comment about retirement not working is definitely incorrect now (leftover from previous local patches), so I'll remove that. In this case, I don't think we'd ever exceed the error limit. exceed_err_limit is only really used when we are disabling the GPU, so we wouldn't want to set that to true. Otherwise we wouldn't be loading the bad pages in gpu_recovery_init, and we'll still return 0 from gpu_recovery_init.
> 

Probably, this is too late as this patch is superseded. Replying to this 
to keep the context.

I realized that the question went in the wrong direction because of 
*exceed_err_limit = true. Saw it in the earlier line but didn't realize 
that it was removed.

Anyway, main question I wanted to ask was -
	
	"When user passes threshold=-2, what is the threshold being compared 
against?"

Looked it up today and saw this piece of code.
	        if (amdgpu_bad_page_threshold < 0) {
                 u64 val = adev->gmc.mc_vram_size;

                 do_div(val, RAS_BAD_PAGE_COVER);
                 con->bad_page_cnt_threshold = min(lower_32_bits(val),
                                                   max_count);


For a user who has been specifying say bad_page_threshold=100 in the 
command line and unable to boot, now this is doing something else.

The comparison in next boot is done against a different threshold. In 
fact, it won't even come to this comparison logic in the next boot as 
the threshold is raised by default when this parameter is specified.

The only case this parameter takes effect is when a user is unable to 
boot because bad page retirements have exceeded the "default threshold". 
In other cases it just resets the comparison limit. Hope that is the 
intention.

Thanks,
Lijo


>   Kent
>>
>> Thanks,
>> Lijo
>>
>>> +				res = 0;
>>> +			} else {
>>> +				*exceed_err_limit = true;
>>> +				dev_err(adev->dev, "GPU will not be initialized. Replace this
>> GPU or increase the threshold.");
>>> +			}
>>>    		}
>>>    	} else {
>>>    		DRM_INFO("Creating a new EEPROM table");
>>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-10-22 11:26 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-20 16:35 [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
2021-10-20 16:35 ` [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
2021-10-20 16:35 ` [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case Kent Russell
2021-10-20 21:54   ` Felix Kuehling
2021-10-20 22:01     ` Luben Tuikov
2021-10-21 13:57       ` Russell, Kent
2021-10-21  5:24   ` Lazar, Lijo
2021-10-21 13:56     ` Russell, Kent
2021-10-22 11:26       ` Lazar, Lijo
2021-10-20 21:47 ` [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Luben Tuikov
2021-10-21 14:04   ` Russell, Kent
2021-10-20 21:50 ` Felix Kuehling
2021-10-20 22:09   ` Luben Tuikov
2021-10-20 22:31   ` Felix Kuehling

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.