[PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold
@ 2021-10-21 17:26 Kent Russell
  2021-10-21 17:26 ` [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Kent Russell @ 2021-10-21 17:26 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi

dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. WARN when the number of bad pages
is at 90% or greater for easier checks and planning, instead of waiting
until the GPU is full of bad pages.

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..8309eea09df3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,13 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 		if (res)
 			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
 				  res);
+
+		/* Warn if we are at 90% of the threshold or above
+		 */
+		if (10 * control->ras_num_recs >= ras->bad_page_cnt_threshold * 9)
+			dev_warn(adev->dev, "RAS records:%u exceeds 90%% of threshold:%d",
+					control->ras_num_recs,
+					ras->bad_page_cnt_threshold);
 	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
 		   amdgpu_bad_page_threshold != 0) {
 		res = __verify_ras_table_checksum(control);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold
  2021-10-21 17:26 [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
@ 2021-10-21 17:26 ` Kent Russell
  2021-10-21 17:26 ` [PATCH 3/3] drm/amdgpu: Make EEPROM messages dev_ instead of DRM_ Kent Russell
  2021-10-21 17:40 ` [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Luben Tuikov
  2 siblings, 0 replies; 5+ messages in thread
From: Kent Russell @ 2021-10-21 17:26 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi, Felix Kuehling

When a GPU hits the bad_page_threshold, it will not be initialized by
the amdgpu driver. This means that the table cannot be cleared, nor can
information gathering be performed (getting serial number, BDF, etc).

If the bad_page_threshold kernel parameter is set to -2,
continue to initialize the GPU, while printing a warning to dmesg that
this action has been done

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h            |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c        |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 ++++++++----
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d58e37fd01f4..b85b67a88a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
 extern int amdgpu_bad_page_threshold;
+extern bool amdgpu_ignore_bad_page_threshold;
 extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 96bd63aeeddd..eee3cf874e7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
  * result in the GPU entering bad status when the number of total
  * faulty pages by ECC exceeds the threshold value.
  */
-MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
 module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
 
 MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 8309eea09df3..0428a1d3d22a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1105,11 +1105,15 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 			res = amdgpu_ras_eeprom_correct_header_tag(control,
 								   RAS_TABLE_HDR_VAL);
 		} else {
-			*exceed_err_limit = true;
-			dev_err(adev->dev,
-				"RAS records:%d exceed threshold:%d, "
-				"GPU will not be initialized. Replace this GPU or increase the threshold",
+			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
 				control->ras_num_recs, ras->bad_page_cnt_threshold);
+			if (amdgpu_bad_page_threshold == -2) {
+				dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
+				res = 0;
+			} else {
+				*exceed_err_limit = true;
+				dev_err(adev->dev, "GPU will not be initialized. Replace this GPU or increase the threshold.");
+			}
 		}
 	} else {
 		DRM_INFO("Creating a new EEPROM table");
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 3/3] drm/amdgpu: Make EEPROM messages dev_ instead of DRM_
  2021-10-21 17:26 [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
  2021-10-21 17:26 ` [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
@ 2021-10-21 17:26 ` Kent Russell
  2021-10-21 17:40 ` [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Luben Tuikov
  2 siblings, 0 replies; 5+ messages in thread
From: Kent Russell @ 2021-10-21 17:26 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell

Since the EEPROM is specific to the device for each of these messages,
use the dev_* macro instead of DRM_* to make it easier to identify the
GPU that correlates to the EEPROM messages.

Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 40 +++++++++----------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 0428a1d3d22a..3792a69b876f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -201,9 +201,9 @@ static int __write_table_header(struct amdgpu_ras_eeprom_control *control)
 	up_read(&adev->reset_sem);
 
 	if (res < 0) {
-		DRM_ERROR("Failed to write EEPROM table header:%d", res);
+		dev_err(adev->dev, "Failed to write EEPROM table header:%d", res);
 	} else if (res < RAS_TABLE_HEADER_SIZE) {
-		DRM_ERROR("Short write:%d out of %d\n",
+		dev_err(adev->dev, "Short write:%d out of %d\n",
 			  res, RAS_TABLE_HEADER_SIZE);
 		res = -EIO;
 	} else {
@@ -395,12 +395,12 @@ static int __amdgpu_ras_eeprom_write(struct amdgpu_ras_eeprom_control *control,
 				  buf, buf_size);
 	up_read(&adev->reset_sem);
 	if (res < 0) {
-		DRM_ERROR("Writing %d EEPROM table records error:%d",
+		dev_err(adev->dev, "Writing %d EEPROM table records error:%d",
 			  num, res);
 	} else if (res < buf_size) {
 		/* Short write, return error.
 		 */
-		DRM_ERROR("Wrote %d records out of %d",
+		dev_err(adev->dev, "Wrote %d records out of %d",
 			  res / RAS_TABLE_RECORD_SIZE, num);
 		res = -EIO;
 	} else {
@@ -541,7 +541,7 @@ amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control)
 	buf_size = control->ras_num_recs * RAS_TABLE_RECORD_SIZE;
 	buf = kcalloc(control->ras_num_recs, RAS_TABLE_RECORD_SIZE, GFP_KERNEL);
 	if (!buf) {
-		DRM_ERROR("allocating memory for table of size %d bytes failed\n",
+		dev_err(adev->dev, "allocating memory for table of size %d bytes failed\n",
 			  control->tbl_hdr.tbl_size);
 		res = -ENOMEM;
 		goto Out;
@@ -554,11 +554,11 @@ amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control)
 				 buf, buf_size);
 	up_read(&adev->reset_sem);
 	if (res < 0) {
-		DRM_ERROR("EEPROM failed reading records:%d\n",
+		dev_err(adev->dev, "EEPROM failed reading records:%d\n",
 			  res);
 		goto Out;
 	} else if (res < buf_size) {
-		DRM_ERROR("EEPROM read %d out of %d bytes\n",
+		dev_err(adev->dev, "EEPROM read %d out of %d bytes\n",
 			  res, buf_size);
 		res = -EIO;
 		goto Out;
@@ -604,10 +604,10 @@ int amdgpu_ras_eeprom_append(struct amdgpu_ras_eeprom_control *control,
 		return 0;
 
 	if (num == 0) {
-		DRM_ERROR("will not append 0 records\n");
+		dev_err(adev->dev, "will not append 0 records\n");
 		return -EINVAL;
 	} else if (num > control->ras_max_record_count) {
-		DRM_ERROR("cannot append %d records than the size of table %d\n",
+		dev_err(adev->dev, "cannot append %d records than the size of table %d\n",
 			  num, control->ras_max_record_count);
 		return -EINVAL;
 	}
@@ -650,12 +650,12 @@ static int __amdgpu_ras_eeprom_read(struct amdgpu_ras_eeprom_control *control,
 				 buf, buf_size);
 	up_read(&adev->reset_sem);
 	if (res < 0) {
-		DRM_ERROR("Reading %d EEPROM table records error:%d",
+		dev_err(adev->dev, "Reading %d EEPROM table records error:%d",
 			  num, res);
 	} else if (res < buf_size) {
 		/* Short read, return error.
 		 */
-		DRM_ERROR("Read %d records out of %d",
+		dev_err(adev->dev, "Read %d records out of %d",
 			  res / RAS_TABLE_RECORD_SIZE, num);
 		res = -EIO;
 	} else {
@@ -689,10 +689,10 @@ int amdgpu_ras_eeprom_read(struct amdgpu_ras_eeprom_control *control,
 		return 0;
 
 	if (num == 0) {
-		DRM_ERROR("will not read 0 records\n");
+		dev_err(adev->dev, "will not read 0 records\n");
 		return -EINVAL;
 	} else if (num > control->ras_num_recs) {
-		DRM_ERROR("too many records to read:%d available:%d\n",
+		dev_err(adev->dev, "too many records to read:%d available:%d\n",
 			  num, control->ras_num_recs);
 		return -EINVAL;
 	}
@@ -1005,7 +1005,7 @@ static int __verify_ras_table_checksum(struct amdgpu_ras_eeprom_control *control
 		control->ras_num_recs * RAS_TABLE_RECORD_SIZE;
 	buf = kzalloc(buf_size, GFP_KERNEL);
 	if (!buf) {
-		DRM_ERROR("Out of memory checking RAS table checksum.\n");
+		dev_err(adev->dev, "Out of memory checking RAS table checksum.\n");
 		return -ENOMEM;
 	}
 
@@ -1014,7 +1014,7 @@ static int __verify_ras_table_checksum(struct amdgpu_ras_eeprom_control *control
 				 control->ras_header_offset,
 				 buf, buf_size);
 	if (res < buf_size) {
-		DRM_ERROR("Partial read for checksum, res:%d\n", res);
+		dev_err(adev->dev, "Partial read for checksum, res:%d\n", res);
 		/* On partial reads, return -EIO.
 		 */
 		if (res >= 0)
@@ -1061,7 +1061,7 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 				 control->i2c_address + control->ras_header_offset,
 				 buf, RAS_TABLE_HEADER_SIZE);
 	if (res < RAS_TABLE_HEADER_SIZE) {
-		DRM_ERROR("Failed to read EEPROM table header, res:%d", res);
+		dev_err(adev->dev, "Failed to read EEPROM table header, res:%d", res);
 		return res >= 0 ? -EIO : res;
 	}
 
@@ -1071,11 +1071,11 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 	control->ras_fri = RAS_OFFSET_TO_INDEX(control, hdr->first_rec_offset);
 
 	if (hdr->header == RAS_TABLE_HDR_VAL) {
-		DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
+		dev_dbg(adev->dev, "Found existing EEPROM table with %d records",
 				 control->ras_num_recs);
 		res = __verify_ras_table_checksum(control);
 		if (res)
-			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
+			dev_err(adev->dev, "RAS table incorrect checksum or error:%d\n",
 				  res);
 
 		/* Warn if we are at 90% of the threshold or above
@@ -1088,7 +1088,7 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 		   amdgpu_bad_page_threshold != 0) {
 		res = __verify_ras_table_checksum(control);
 		if (res)
-			DRM_ERROR("RAS Table incorrect checksum or error:%d\n",
+			dev_err(adev->dev, "RAS Table incorrect checksum or error:%d\n",
 				  res);
 		if (ras->bad_page_cnt_threshold > control->ras_num_recs) {
 			/* This means that, the threshold was increased since
@@ -1116,7 +1116,7 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 			}
 		}
 	} else {
-		DRM_INFO("Creating a new EEPROM table");
+		dev_info(adev->dev, "Creating a new EEPROM table");
 
 		res = amdgpu_ras_eeprom_reset_table(control);
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-21 17:26 [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
  2021-10-21 17:26 ` [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
  2021-10-21 17:26 ` [PATCH 3/3] drm/amdgpu: Make EEPROM messages dev_ instead of DRM_ Kent Russell
@ 2021-10-21 17:40 ` Luben Tuikov
  2 siblings, 0 replies; 5+ messages in thread
From: Luben Tuikov @ 2021-10-21 17:40 UTC (permalink / raw)
  To: Kent Russell, amd-gfx; +Cc: Mukul Joshi, Tuikov, Luben

On 2021-10-21 13:26, Kent Russell wrote:
> dmesg doesn't warn when the number of bad pages approaches the
> threshold for page retirement. WARN when the number of bad pages
> is at 90% or greater for easier checks and planning, instead of waiting
> until the GPU is full of bad pages.
>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> Signed-off-by: Kent Russell <kent.russell@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index f4c05ff4b26c..8309eea09df3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1077,6 +1077,13 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>  		if (res)
>  			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>  				  res);
> +
> +		/* Warn if we are at 90% of the threshold or above
> +		 */
> +		if (10 * control->ras_num_recs >= ras->bad_page_cnt_threshold * 9)

Change this to " >= 9 * ras->bad_page_cnt_threshold ". With that fixed, this patch is:

Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

Regards,
Luben

> +			dev_warn(adev->dev, "RAS records:%u exceeds 90%% of threshold:%d",
> +					control->ras_num_recs,
> +					ras->bad_page_cnt_threshold);
>  	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
>  		   amdgpu_bad_page_threshold != 0) {
>  		res = __verify_ras_table_checksum(control);


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold
  2021-10-20 16:35 Kent Russell
@ 2021-10-20 16:35 ` Kent Russell
  0 siblings, 0 replies; 5+ messages in thread
From: Kent Russell @ 2021-10-20 16:35 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi

When a GPU hits the bad_page_threshold, it will not be initialized by
the amdgpu driver. This means that the table cannot be cleared, nor can
information gathering be performed (getting serial number, BDF, etc).
Add an override by using amdgpu_bad_page_threshold = -2 which will still
initialize the GPU, even when the bad page threshold has been reached.

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d58e37fd01f4..b85b67a88a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
 extern int amdgpu_bad_page_threshold;
+extern bool amdgpu_ignore_bad_page_threshold;
 extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 96bd63aeeddd..eee3cf874e7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
  * result in the GPU entering bad status when the number of total
  * faulty pages by ECC exceeds the threshold value.
  */
-MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
 module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
 
 MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-10-21 17:40 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-21 17:26 [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
2021-10-21 17:26 ` [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
2021-10-21 17:26 ` [PATCH 3/3] drm/amdgpu: Make EEPROM messages dev_ instead of DRM_ Kent Russell
2021-10-21 17:40 ` [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Luben Tuikov
  -- strict thread matches above, loose matches on Subject: below --
2021-10-20 16:35 Kent Russell
2021-10-20 16:35 ` [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.