* [PATCH 1/2] drm/amdgpu: change default behavior of bad_page_threshold parameter
@ 2023-02-22 2:51 Tao Zhou
2023-02-22 2:51 ` [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err Tao Zhou
0 siblings, 1 reply; 6+ messages in thread
From: Tao Zhou @ 2023-02-22 2:51 UTC (permalink / raw)
To: amd-gfx, hawking.zhang, stanley.yang, yipeng.chai, candice.li,
lijo.lazar
Cc: Tao Zhou
Ignore the RAS UMC bad page threshold by default; GPU initialization
won't be stopped in this mode.
v2: refine the description of bad_page_threshold.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 ++++---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 4 ++--
3 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6c2fe50b528e..8a375394db0c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -921,7 +921,7 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
* result in the GPU entering bad status when the number of total
* faulty pages by ECC exceeds the threshold value.
*/
-MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = ignore threshold (default value), 0 = disable bad page retirement, -2 = driver sets threshold)");
module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 5c02c6c9f773..63dfcc98152d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2196,11 +2196,12 @@ static void amdgpu_ras_validate_threshold(struct amdgpu_device *adev,
/*
* Justification of value bad_page_cnt_threshold in ras structure
*
- * Generally, -1 <= amdgpu_bad_page_threshold <= max record length
- * in eeprom, and introduce two scenarios accordingly.
+ * Generally, 0 <= amdgpu_bad_page_threshold <= max record length
+ * in eeprom or amdgpu_bad_page_threshold == -2, introduce two
+ * scenarios accordingly.
*
* Bad page retirement enablement:
- * - If amdgpu_bad_page_threshold = -1,
+ * - If amdgpu_bad_page_threshold = -2,
* bad_page_cnt_threshold = typical value by formula.
*
* - When the value from user is 0 < amdgpu_bad_page_threshold <
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 2d9f3f4cd79e..9d370465b08d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1191,8 +1191,8 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
} else {
dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
control->ras_num_recs, ras->bad_page_cnt_threshold);
- if (amdgpu_bad_page_threshold == -2) {
- dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
+ if (amdgpu_bad_page_threshold == -1) {
+ dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -1.");
res = 0;
} else {
*exceed_err_limit = true;
--
2.35.1
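For readers following the value changes in this patch, the meaning of each bad_page_threshold setting after the swap can be sketched as below. This is only an illustration based on the new MODULE_PARM_DESC text; the enum, its members and classify_bad_page_threshold() are hypothetical names and do not exist in the amdgpu driver. Negative values other than -1 and -2 are not described by the parameter text, so the sketch marks them as unspecified rather than guessing.

```c
/* Hypothetical illustration only; none of these names exist in amdgpu. */
enum bad_page_mode {
    BAD_PAGE_IGNORE_THRESHOLD,  /* -1: new default, threshold is ignored */
    BAD_PAGE_RETIREMENT_OFF,    /*  0: bad page retirement disabled */
    BAD_PAGE_DRIVER_THRESHOLD,  /* -2: driver sets the threshold */
    BAD_PAGE_USER_THRESHOLD,    /* > 0: threshold supplied by the user */
    BAD_PAGE_UNSPECIFIED        /* other negatives: not covered by the text */
};

/* Map a module parameter value to the mode described in MODULE_PARM_DESC. */
enum bad_page_mode classify_bad_page_threshold(int value)
{
    if (value == -1)
        return BAD_PAGE_IGNORE_THRESHOLD;
    if (value == 0)
        return BAD_PAGE_RETIREMENT_OFF;
    if (value == -2)
        return BAD_PAGE_DRIVER_THRESHOLD;
    if (value > 0)
        return BAD_PAGE_USER_THRESHOLD;
    return BAD_PAGE_UNSPECIFIED;
}
```

Before this patch the roles of -1 and -2 were the other way round, so a system that explicitly passed either value keeps the number but gets the opposite behavior.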
* [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err
2023-02-22 2:51 [PATCH 1/2] drm/amdgpu: change default behavior of bad_page_threshold parameter Tao Zhou
@ 2023-02-22 2:51 ` Tao Zhou
2023-02-22 3:16 ` Yang, Stanley
0 siblings, 1 reply; 6+ messages in thread
From: Tao Zhou @ 2023-02-22 2:51 UTC (permalink / raw)
To: amd-gfx, hawking.zhang, stanley.yang, yipeng.chai, candice.li,
lijo.lazar
Cc: Tao Zhou
bad_page_threshold controls page retirement behavior, so it should
also be checked here.
v2: simplify the condition of bad page handling path.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
---
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 9d370465b08d..2e08fce87521 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -417,7 +417,8 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
{
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
- if (!__is_ras_eeprom_supported(adev))
+ if (!__is_ras_eeprom_supported(adev) ||
+ !amdgpu_bad_page_threshold)
return false;
/* skip check eeprom table for VEGA20 Gaming */
@@ -428,10 +429,18 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
return false;
if (con->eeprom_control.tbl_hdr.header == RAS_TABLE_HDR_BAD) {
- dev_warn(adev->dev, "This GPU is in BAD status.");
- dev_warn(adev->dev, "Please retire it or set a larger "
- "threshold value when reloading driver.\n");
- return true;
+ if (amdgpu_bad_page_threshold == -1) {
+ dev_warn(adev->dev, "RAS records:%d exceed threshold:%d",
+ con->eeprom_control.ras_num_recs, con->bad_page_cnt_threshold);
+ dev_warn(adev->dev,
+ "But GPU can be operated due to bad_page_threshold = -1.\n");
+ return false;
+ } else {
+ dev_warn(adev->dev, "This GPU is in BAD status.");
+ dev_warn(adev->dev, "Please retire it or set a larger "
+ "threshold value when reloading driver.\n");
+ return true;
+ }
}
return false;
--
2.35.1
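The decision made by amdgpu_ras_eeprom_check_err_threshold() after this v2 patch can be sketched as a standalone function. This is a simplified model, not the driver code: struct ras_state and exceeds_err_threshold() are hypothetical stand-ins for the state the real function reads through adev, and the VEGA20 Gaming skip plus any checks hidden between the two hunks are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state the real function reads via adev. */
struct ras_state {
    bool eeprom_supported;  /* models __is_ras_eeprom_supported(adev) */
    bool table_hdr_bad;     /* models tbl_hdr.header == RAS_TABLE_HDR_BAD */
};

/* Returns true when the GPU should be treated as being in BAD status. */
bool exceeds_err_threshold(const struct ras_state *s, int bad_page_threshold)
{
    /* 0 disables bad page retirement, so there is nothing to check. */
    if (!s->eeprom_supported || bad_page_threshold == 0)
        return false;

    /* The table header was not marked bad: nothing to report. */
    if (!s->table_hdr_bad)
        return false;

    /* -1 only warns and keeps the GPU usable; every other value reports BAD. */
    return bad_page_threshold != -1;
}
```

In this model the only inputs that yield true are a supported EEPROM, a header marked bad, and a threshold that is neither 0 nor -1.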
* RE: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err
2023-02-22 2:51 ` [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err Tao Zhou
@ 2023-02-22 3:16 ` Yang, Stanley
0 siblings, 0 replies; 6+ messages in thread
From: Yang, Stanley @ 2023-02-22 3:16 UTC (permalink / raw)
To: Zhou1, Tao, amd-gfx, Zhang, Hawking, Chai, Thomas, Li, Candice,
Lazar, Lijo
[AMD Official Use Only - General]
The series is Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Regards,
Stanley
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@amd.com>
> Sent: Wednesday, February 22, 2023 10:52 AM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Chai,
> Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>; Lazar,
> Lijo <Lijo.Lazar@amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1@amd.com>
> Subject: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in
> ras_eeprom_check_err
>
> bad_page_threshold controls page retirement behavior and it should be also
> checked.
>
> v2: simplify the condition of bad page handling path.
>
> Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
> ---
> .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 9d370465b08d..2e08fce87521 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -417,7 +417,8 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
> {
> struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>
> - if (!__is_ras_eeprom_supported(adev))
> + if (!__is_ras_eeprom_supported(adev) ||
> + !amdgpu_bad_page_threshold)
> return false;
>
> /* skip check eeprom table for VEGA20 Gaming */
> @@ -428,10 +429,18 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
> return false;
>
> if (con->eeprom_control.tbl_hdr.header == RAS_TABLE_HDR_BAD) {
> - dev_warn(adev->dev, "This GPU is in BAD status.");
> - dev_warn(adev->dev, "Please retire it or set a larger "
> - "threshold value when reloading driver.\n");
> - return true;
> + if (amdgpu_bad_page_threshold == -1) {
> + dev_warn(adev->dev, "RAS records:%d exceed threshold:%d",
> + con->eeprom_control.ras_num_recs, con->bad_page_cnt_threshold);
> + dev_warn(adev->dev,
> + "But GPU can be operated due to bad_page_threshold = -1.\n");
> + return false;
> + } else {
> + dev_warn(adev->dev, "This GPU is in BAD status.");
> + dev_warn(adev->dev, "Please retire it or set a larger "
> + "threshold value when reloading driver.\n");
> + return true;
> + }
> }
>
> return false;
> --
> 2.35.1
* RE: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err
2023-02-21 9:33 ` Yang, Stanley
@ 2023-02-21 10:06 ` Zhou1, Tao
0 siblings, 0 replies; 6+ messages in thread
From: Zhou1, Tao @ 2023-02-21 10:06 UTC (permalink / raw)
To: Yang, Stanley, amd-gfx, Zhang, Hawking, Chai, Thomas, Li, Candice
[AMD Official Use Only - General]
> -----Original Message-----
> From: Yang, Stanley <Stanley.Yang@amd.com>
> Sent: Tuesday, February 21, 2023 5:34 PM
> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org; Zhang,
> Hawking <Hawking.Zhang@amd.com>; Chai, Thomas <YiPeng.Chai@amd.com>;
> Li, Candice <Candice.Li@amd.com>
> Subject: RE: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in
> ras_eeprom_check_err
>
> [AMD Official Use Only - General]
>
>
>
> > -----Original Message-----
> > From: Zhou1, Tao <Tao.Zhou1@amd.com>
> > Sent: Tuesday, February 21, 2023 4:29 PM
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Chai,
> > Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> > Cc: Zhou1, Tao <Tao.Zhou1@amd.com>
> > Subject: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in
> > ras_eeprom_check_err
> >
> > bad_page_threshold controls page retirement behavior and it should be
> > also checked.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
> > ---
> > .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 20 ++++++++++++++-----
> > 1 file changed, 15 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 9d370465b08d..c88123896fe8 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -417,7 +417,8 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
> > {
> > struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >
> > - if (!__is_ras_eeprom_supported(adev))
> > + if (!__is_ras_eeprom_supported(adev) ||
> > + !amdgpu_bad_page_threshold)
> > return false;
> >
> > /* skip check eeprom table for VEGA20 Gaming */
> > @@ -428,10 +429,19 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
> > return false;
> >
> > if (con->eeprom_control.tbl_hdr.header == RAS_TABLE_HDR_BAD) {
> > - dev_warn(adev->dev, "This GPU is in BAD status.");
> > - dev_warn(adev->dev, "Please retire it or set a larger "
> > - "threshold value when reloading driver.\n");
> > - return true;
> > + if (amdgpu_bad_page_threshold == -1) {
> > + dev_warn(adev->dev, "RAS records:%d exceed threshold:%d",
> > + con->eeprom_control.ras_num_recs, con->bad_page_cnt_threshold);
> > + dev_warn(adev->dev,
> > + "But GPU can be operated due to bad_page_threshold = -1.\n");
> > + return false;
> > + } else if (amdgpu_bad_page_threshold > 0 ||
> > + amdgpu_bad_page_threshold == -2) {
>
> Stanley: there is no guarantee that the user sets amdgpu_bad_page_threshold
> to an expected value (it could be -3, for example). How about changing this
> condition as below?
[Tao] Since "<= -2" and "> 0" can be treated as the same thing here, I will update the condition to a plain "else".
The value "-2" isn't retired; it indicates that the threshold is calculated by the driver.
> else if (amdgpu_bad_page_threshold) {
> ...
> }
> And in patch #1 the value -2 isn't needed anymore.
>
> Regards,
> Stanley
> > + dev_warn(adev->dev, "This GPU is in BAD status.");
> > + dev_warn(adev->dev, "Please retire it or set a larger "
> > + "threshold value when reloading driver.\n");
> > + return true;
> > + }
> > }
> >
> > return false;
> > --
> > 2.35.1
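The difference between the v1 condition and the plain "else" agreed on in this exchange can be made concrete with a small sketch. It assumes the EEPROM is supported and the table header is already RAS_TABLE_HDR_BAD, and models only the branch under discussion; reports_bad_v1() and reports_bad_v2() are hypothetical helper names, not driver functions.

```c
#include <stdbool.h>

/* v1: BAD status is reported only for positive values or exactly -2. */
bool reports_bad_v1(int bad_page_threshold)
{
    if (bad_page_threshold == 0)    /* early return added by the patch */
        return false;
    if (bad_page_threshold == -1)   /* warn only, GPU stays usable */
        return false;
    if (bad_page_threshold > 0 || bad_page_threshold == -2)
        return true;
    return false;                   /* e.g. -3 falls through silently */
}

/* v2: every value other than 0 and -1 reports BAD status. */
bool reports_bad_v2(int bad_page_threshold)
{
    if (bad_page_threshold == 0)
        return false;
    if (bad_page_threshold == -1)
        return false;
    return true;                    /* plain "else", so -3 is covered too */
}
```

The two versions agree for -2, -1, 0 and all positive values. They differ only for unexpected negatives such as -3, which v1 would let pass without reporting BAD status.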
* RE: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err
2023-02-21 8:29 ` [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err Tao Zhou
@ 2023-02-21 9:33 ` Yang, Stanley
2023-02-21 10:06 ` Zhou1, Tao
0 siblings, 1 reply; 6+ messages in thread
From: Yang, Stanley @ 2023-02-21 9:33 UTC (permalink / raw)
To: Zhou1, Tao, amd-gfx, Zhang, Hawking, Chai, Thomas, Li, Candice
[AMD Official Use Only - General]
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@amd.com>
> Sent: Tuesday, February 21, 2023 4:29 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Chai,
> Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1@amd.com>
> Subject: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in
> ras_eeprom_check_err
>
> bad_page_threshold controls page retirement behavior and it should be also
> checked.
>
> Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
> ---
> .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 20 ++++++++++++++-----
> 1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 9d370465b08d..c88123896fe8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -417,7 +417,8 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
> {
> struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>
> - if (!__is_ras_eeprom_supported(adev))
> + if (!__is_ras_eeprom_supported(adev) ||
> + !amdgpu_bad_page_threshold)
> return false;
>
> /* skip check eeprom table for VEGA20 Gaming */
> @@ -428,10 +429,19 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
> return false;
>
> if (con->eeprom_control.tbl_hdr.header == RAS_TABLE_HDR_BAD) {
> - dev_warn(adev->dev, "This GPU is in BAD status.");
> - dev_warn(adev->dev, "Please retire it or set a larger "
> - "threshold value when reloading driver.\n");
> - return true;
> + if (amdgpu_bad_page_threshold == -1) {
> + dev_warn(adev->dev, "RAS records:%d exceed threshold:%d",
> + con->eeprom_control.ras_num_recs, con->bad_page_cnt_threshold);
> + dev_warn(adev->dev,
> + "But GPU can be operated due to bad_page_threshold = -1.\n");
> + return false;
> + } else if (amdgpu_bad_page_threshold > 0 ||
> + amdgpu_bad_page_threshold == -2) {
Stanley: there is no guarantee that the user sets amdgpu_bad_page_threshold to an expected value (it could be -3, for example). How about changing this condition as below?
else if (amdgpu_bad_page_threshold) {
...
}
And in patch #1 the value -2 isn't needed anymore.
Regards,
Stanley
> + dev_warn(adev->dev, "This GPU is in BAD status.");
> + dev_warn(adev->dev, "Please retire it or set a larger "
> + "threshold value when reloading driver.\n");
> + return true;
> + }
> }
>
> return false;
> --
> 2.35.1
* [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err
2023-02-21 8:29 [PATCH 1/2] drm/amdgpu: change default behavior of bad_page_threshold parameter Tao Zhou
@ 2023-02-21 8:29 ` Tao Zhou
2023-02-21 9:33 ` Yang, Stanley
0 siblings, 1 reply; 6+ messages in thread
From: Tao Zhou @ 2023-02-21 8:29 UTC (permalink / raw)
To: amd-gfx, hawking.zhang, stanley.yang, yipeng.chai, candice.li; +Cc: Tao Zhou
bad_page_threshold controls page retirement behavior and it should be
also checked.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
---
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 20 ++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 9d370465b08d..c88123896fe8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -417,7 +417,8 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
{
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
- if (!__is_ras_eeprom_supported(adev))
+ if (!__is_ras_eeprom_supported(adev) ||
+ !amdgpu_bad_page_threshold)
return false;
/* skip check eeprom table for VEGA20 Gaming */
@@ -428,10 +429,19 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
return false;
if (con->eeprom_control.tbl_hdr.header == RAS_TABLE_HDR_BAD) {
- dev_warn(adev->dev, "This GPU is in BAD status.");
- dev_warn(adev->dev, "Please retire it or set a larger "
- "threshold value when reloading driver.\n");
- return true;
+ if (amdgpu_bad_page_threshold == -1) {
+ dev_warn(adev->dev, "RAS records:%d exceed threshold:%d",
+ con->eeprom_control.ras_num_recs, con->bad_page_cnt_threshold);
+ dev_warn(adev->dev,
+ "But GPU can be operated due to bad_page_threshold = -1.\n");
+ return false;
+ } else if (amdgpu_bad_page_threshold > 0 ||
+ amdgpu_bad_page_threshold == -2) {
+ dev_warn(adev->dev, "This GPU is in BAD status.");
+ dev_warn(adev->dev, "Please retire it or set a larger "
+ "threshold value when reloading driver.\n");
+ return true;
+ }
}
return false;
--
2.35.1