[PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
@ 2023-02-10  8:45 Tao Zhou
  2023-02-10 15:01 ` Zhang, Hawking
  2023-02-13 12:37 ` Lazar, Lijo
  0 siblings, 2 replies; 13+ messages in thread
From: Tao Zhou @ 2023-02-10  8:45 UTC (permalink / raw)
  To: amd-gfx, hawking.zhang, stanley.yang, yipeng.chai, candice.li; +Cc: Tao Zhou

If a UMC bad page is reserved but not freed by an application, the
application may trigger uncorrectable error repeatly by accessing the page.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index e85c4689ce2c..eafe01a24349 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_err_handler_data *data;
-	int ret = 0;
+	int ret = 0, old_cnt;
 	uint32_t i;
 
 	if (!con || !con->eh_data || !bps || pages <= 0)
@@ -2060,6 +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
 	if (!data)
 		goto out;
 
+	old_cnt = data->count;
+
 	for (i = 0; i < pages; i++) {
 		if (amdgpu_ras_check_bad_page_unlock(con,
 			bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
@@ -2079,6 +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
 		data->count++;
 		data->space_left--;
 	}
+
+	/* all pages have been reserved before, no new bad page */
+	if (old_cnt == data->count)
+		ret = -EEXIST;
+
 out:
 	mutex_unlock(&con->recovery_lock);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 1c7fcb4f2380..772c431e4065 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -145,8 +145,12 @@ static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
 
 		if ((amdgpu_bad_page_threshold != 0) &&
 			err_data->err_addr_cnt) {
-			amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
+			ret = amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
 						err_data->err_addr_cnt);
+			/* if no new bad page is found, no need to increase ue count */
+			if (ret == -EEXIST)
+				err_data->ue_count = 0;
+
 			amdgpu_ras_save_bad_pages(adev);
 
 			amdgpu_dpm_send_hbm_bad_pages_num(adev, con->eeprom_control.ras_num_recs);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
  2023-02-10  8:45 [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page Tao Zhou
@ 2023-02-10 15:01 ` Zhang, Hawking
  2023-02-13  8:25   ` Zhou1, Tao
  2023-02-13 12:37 ` Lazar, Lijo
  1 sibling, 1 reply; 13+ messages in thread
From: Zhang, Hawking @ 2023-02-10 15:01 UTC (permalink / raw)
  To: Zhou1, Tao, amd-gfx, Yang, Stanley, Chai, Thomas, Li, Candice

[AMD Official Use Only - General]

+                       /* if no new bad page is found, no need to increase ue count */
+                       if (ret == -EEXIST)
+                               err_data->ue_count = 0;

Returning EEXIST in such case is not reasonable. Might consider return a bool for amdgpu_ras_add_bad_pages: true means it does add some new bad page; false means it doesn't change anything.

Regards,
Hawking

-----Original Message-----
From: Zhou1, Tao <Tao.Zhou1@amd.com>
Sent: Friday, February 10, 2023 16:45
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Chai, Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
Cc: Zhou1, Tao <Tao.Zhou1@amd.com>
Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

If a UMC bad page is reserved but not freed by an application, the application may trigger uncorrectable error repeatly by accessing the page.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index e85c4689ce2c..eafe01a24349 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,  {
        struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
        struct ras_err_handler_data *data;
-       int ret = 0;
+       int ret = 0, old_cnt;
        uint32_t i;

        if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6 +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
        if (!data)
                goto out;

+       old_cnt = data->count;
+
        for (i = 0; i < pages; i++) {
                if (amdgpu_ras_check_bad_page_unlock(con,
                        bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT)) @@ -2079,6 +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
                data->count++;
                data->space_left--;
        }
+
+       /* all pages have been reserved before, no new bad page */
+       if (old_cnt == data->count)
+               ret = -EEXIST;
+
 out:
        mutex_unlock(&con->recovery_lock);

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 1c7fcb4f2380..772c431e4065 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -145,8 +145,12 @@ static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,

                if ((amdgpu_bad_page_threshold != 0) &&
                        err_data->err_addr_cnt) {
-                       amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
+                       ret = amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
                                                err_data->err_addr_cnt);
+                       /* if no new bad page is found, no need to increase ue count */
+                       if (ret == -EEXIST)
+                               err_data->ue_count = 0;
+
                        amdgpu_ras_save_bad_pages(adev);

                        amdgpu_dpm_send_hbm_bad_pages_num(adev, con->eeprom_control.ras_num_recs);
--
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
  2023-02-10 15:01 ` Zhang, Hawking
@ 2023-02-13  8:25   ` Zhou1, Tao
  2023-02-14  2:42     ` Yang, Stanley
  0 siblings, 1 reply; 13+ messages in thread
From: Zhou1, Tao @ 2023-02-13  8:25 UTC (permalink / raw)
  To: Zhang, Hawking, amd-gfx, Yang, Stanley, Chai, Thomas, Li, Candice

[AMD Official Use Only - General]

> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang@amd.com>
> Sent: Friday, February 10, 2023 11:02 PM
> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org; Yang,
> Stanley <Stanley.Yang@amd.com>; Chai, Thomas <YiPeng.Chai@amd.com>; Li,
> Candice <Candice.Li@amd.com>
> Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
> bad page
>
> [AMD Official Use Only - General]
>
> +                       /* if no new bad page is found, no need to increase ue count */
> +                       if (ret == -EEXIST)
> +                               err_data->ue_count = 0;
>
> Returning EEXIST in such case is not reasonable. Might consider return a bool for
> amdgpu_ras_add_bad_pages: true means it does add some new bad page; false
> means it doesn't change anything.
>
> Regards,
> Hawking

[Tao] but it can returns -ENOMEM, amdgpu_ras_load_bad_pages and amdgpu_ras_recovery_init also need to check the return value. I'd like to keep the type of return value unchanged.
How about -EINVAL?

>
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@amd.com>
> Sent: Friday, February 10, 2023 16:45
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Chai,
> Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1@amd.com>
> Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad
> page
>
> If a UMC bad page is reserved but not freed by an application, the application
> may trigger uncorrectable error repeatly by accessing the page.
>
> Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
>  2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index e85c4689ce2c..eafe01a24349 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
> amdgpu_device *adev,  {
>         struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>         struct ras_err_handler_data *data;
> -       int ret = 0;
> +       int ret = 0, old_cnt;
>         uint32_t i;
>
>         if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6 +2060,8 @@
> int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>         if (!data)
>                 goto out;
>
> +       old_cnt = data->count;
> +
>         for (i = 0; i < pages; i++) {
>                 if (amdgpu_ras_check_bad_page_unlock(con,
>                         bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT)) @@ -2079,6
> +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>                 data->count++;
>                 data->space_left--;
>         }
> +
> +       /* all pages have been reserved before, no new bad page */
> +       if (old_cnt == data->count)
> +               ret = -EEXIST;
> +
>  out:
>         mutex_unlock(&con->recovery_lock);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 1c7fcb4f2380..772c431e4065 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -145,8 +145,12 @@ static int amdgpu_umc_do_page_retirement(struct
> amdgpu_device *adev,
>
>                 if ((amdgpu_bad_page_threshold != 0) &&
>                         err_data->err_addr_cnt) {
> -                       amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> +                       ret = amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
>                                                 err_data->err_addr_cnt);
> +                       /* if no new bad page is found, no need to increase ue count */
> +                       if (ret == -EEXIST)
> +                               err_data->ue_count = 0;
> +
>                         amdgpu_ras_save_bad_pages(adev);
>
>                         amdgpu_dpm_send_hbm_bad_pages_num(adev, con-
> >eeprom_control.ras_num_recs);
> --
> 2.35.1
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
  2023-02-10  8:45 [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page Tao Zhou
  2023-02-10 15:01 ` Zhang, Hawking
@ 2023-02-13 12:37 ` Lazar, Lijo
  2023-02-14  2:26   ` Zhou1, Tao
  1 sibling, 1 reply; 13+ messages in thread
From: Lazar, Lijo @ 2023-02-13 12:37 UTC (permalink / raw)
  To: Tao Zhou, amd-gfx, hawking.zhang, stanley.yang, yipeng.chai, candice.li



On 2/10/2023 2:15 PM, Tao Zhou wrote:
> If a UMC bad page is reserved but not freed by an application, the
> application may trigger uncorrectable error repeatly by accessing the page.
> 

There is amdgpu_ras_check_bad_page which checks if address is already 
part of an existing bad page. Can't that be used?

Thanks,
Lijo

> Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
>   2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index e85c4689ce2c..eafe01a24349 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>   {
>   	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>   	struct ras_err_handler_data *data;
> -	int ret = 0;
> +	int ret = 0, old_cnt;
>   	uint32_t i;
>   
>   	if (!con || !con->eh_data || !bps || pages <= 0)
> @@ -2060,6 +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>   	if (!data)
>   		goto out;
>   
> +	old_cnt = data->count;
> +
>   	for (i = 0; i < pages; i++) {
>   		if (amdgpu_ras_check_bad_page_unlock(con,
>   			bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
> @@ -2079,6 +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>   		data->count++;
>   		data->space_left--;
>   	}
> +
> +	/* all pages have been reserved before, no new bad page */
> +	if (old_cnt == data->count)
> +		ret = -EEXIST;
> +
>   out:
>   	mutex_unlock(&con->recovery_lock);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 1c7fcb4f2380..772c431e4065 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -145,8 +145,12 @@ static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
>   
>   		if ((amdgpu_bad_page_threshold != 0) &&
>   			err_data->err_addr_cnt) {
> -			amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> +			ret = amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
>   						err_data->err_addr_cnt);
> +			/* if no new bad page is found, no need to increase ue count */
> +			if (ret == -EEXIST)
> +				err_data->ue_count = 0;
> +
>   			amdgpu_ras_save_bad_pages(adev);
>   
>   			amdgpu_dpm_send_hbm_bad_pages_num(adev, con->eeprom_control.ras_num_recs);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
  2023-02-13 12:37 ` Lazar, Lijo
@ 2023-02-14  2:26   ` Zhou1, Tao
  2023-02-14  4:55     ` Lazar, Lijo
  0 siblings, 1 reply; 13+ messages in thread
From: Zhou1, Tao @ 2023-02-14  2:26 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx, Zhang, Hawking, Yang, Stanley, Chai,
	Thomas, Li, Candice



> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> Sent: Monday, February 13, 2023 8:38 PM
> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org; Zhang,
> Hawking <Hawking.Zhang@amd.com>; Yang, Stanley
> <Stanley.Yang@amd.com>; Chai, Thomas <YiPeng.Chai@amd.com>; Li, Candice
> <Candice.Li@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
> bad page
> 
> 
> 
> On 2/10/2023 2:15 PM, Tao Zhou wrote:
> > If a UMC bad page is reserved but not freed by an application, the
> > application may trigger uncorrectable error repeatly by accessing the page.
> >
> 
> There is amdgpu_ras_check_bad_page which checks if address is already part of
> an existing bad page. Can't that be used?
> 
> Thanks,
> Lijo

[Tao] amdgpu_ras_check_bad_page is already called in amdgpu_ras_add_bad_pages, this patch just makes use of the result of amdgpu_ras_check_bad_page.

> 
> > Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
> >   2 files changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index e85c4689ce2c..eafe01a24349 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
> amdgpu_device *adev,
> >   {
> >   	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >   	struct ras_err_handler_data *data;
> > -	int ret = 0;
> > +	int ret = 0, old_cnt;
> >   	uint32_t i;
> >
> >   	if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6 +2060,8
> > @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> >   	if (!data)
> >   		goto out;
> >
> > +	old_cnt = data->count;
> > +
> >   	for (i = 0; i < pages; i++) {
> >   		if (amdgpu_ras_check_bad_page_unlock(con,
> >   			bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
> @@ -2079,6
> > +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> >   		data->count++;
> >   		data->space_left--;
> >   	}
> > +
> > +	/* all pages have been reserved before, no new bad page */
> > +	if (old_cnt == data->count)
> > +		ret = -EEXIST;
> > +
> >   out:
> >   	mutex_unlock(&con->recovery_lock);
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > index 1c7fcb4f2380..772c431e4065 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > @@ -145,8 +145,12 @@ static int amdgpu_umc_do_page_retirement(struct
> > amdgpu_device *adev,
> >
> >   		if ((amdgpu_bad_page_threshold != 0) &&
> >   			err_data->err_addr_cnt) {
> > -			amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> > +			ret = amdgpu_ras_add_bad_pages(adev, err_data-
> >err_addr,
> >   						err_data->err_addr_cnt);
> > +			/* if no new bad page is found, no need to increase ue
> count */
> > +			if (ret == -EEXIST)
> > +				err_data->ue_count = 0;
> > +
> >   			amdgpu_ras_save_bad_pages(adev);
> >
> >   			amdgpu_dpm_send_hbm_bad_pages_num(adev,
> > con->eeprom_control.ras_num_recs);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
  2023-02-13  8:25   ` Zhou1, Tao
@ 2023-02-14  2:42     ` Yang, Stanley
  2023-02-14 10:03       ` Zhang, Hawking
  0 siblings, 1 reply; 13+ messages in thread
From: Yang, Stanley @ 2023-02-14  2:42 UTC (permalink / raw)
  To: Zhou1, Tao, Zhang, Hawking, amd-gfx, Chai, Thomas, Li, Candice

[AMD Official Use Only - General]



> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@amd.com>
> Sent: Monday, February 13, 2023 4:25 PM
> To: Zhang, Hawking <Hawking.Zhang@amd.com>; amd-
> gfx@lists.freedesktop.org; Yang, Stanley <Stanley.Yang@amd.com>; Chai,
> Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> new bad page
> 
> [AMD Official Use Only - General]
> 
> > -----Original Message-----
> > From: Zhang, Hawking <Hawking.Zhang@amd.com>
> > Sent: Friday, February 10, 2023 11:02 PM
> > To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org;
> > Yang, Stanley <Stanley.Yang@amd.com>; Chai, Thomas
> > <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> > new bad page
> >
> > [AMD Official Use Only - General]
> >
> > +                       /* if no new bad page is found, no need to increase ue count
> */
> > +                       if (ret == -EEXIST)
> > +                               err_data->ue_count = 0;
> >
> > Returning EEXIST in such case is not reasonable. Might consider return
> > a bool for
> > amdgpu_ras_add_bad_pages: true means it does add some new bad page;
> > false means it doesn't change anything.
> >
> > Regards,
> > Hawking
> 
> [Tao] but it can returns -ENOMEM, amdgpu_ras_load_bad_pages and
> amdgpu_ras_recovery_init also need to check the return value. I'd like to
> keep the type of return value unchanged.
> How about -EINVAL?

Stanley: How about return -EALREADY?

Regards,
Stanley
> 
> >
> > -----Original Message-----
> > From: Zhou1, Tao <Tao.Zhou1@amd.com>
> > Sent: Friday, February 10, 2023 16:45
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>;
> Chai,
> > Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> > Cc: Zhou1, Tao <Tao.Zhou1@amd.com>
> > Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> new
> > bad page
> >
> > If a UMC bad page is reserved but not freed by an application, the
> > application may trigger uncorrectable error repeatly by accessing the page.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
> >  2 files changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index e85c4689ce2c..eafe01a24349 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
> > amdgpu_device *adev,  {
> >         struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >         struct ras_err_handler_data *data;
> > -       int ret = 0;
> > +       int ret = 0, old_cnt;
> >         uint32_t i;
> >
> >         if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6
> > +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> *adev,
> >         if (!data)
> >                 goto out;
> >
> > +       old_cnt = data->count;
> > +
> >         for (i = 0; i < pages; i++) {
> >                 if (amdgpu_ras_check_bad_page_unlock(con,
> >                         bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
> > @@ -2079,6
> > +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> *adev,
> >                 data->count++;
> >                 data->space_left--;
> >         }
> > +
> > +       /* all pages have been reserved before, no new bad page */
> > +       if (old_cnt == data->count)
> > +               ret = -EEXIST;
> > +
> >  out:
> >         mutex_unlock(&con->recovery_lock);
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > index 1c7fcb4f2380..772c431e4065 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > @@ -145,8 +145,12 @@ static int
> amdgpu_umc_do_page_retirement(struct
> > amdgpu_device *adev,
> >
> >                 if ((amdgpu_bad_page_threshold != 0) &&
> >                         err_data->err_addr_cnt) {
> > -                       amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> > +                       ret = amdgpu_ras_add_bad_pages(adev,
> > + err_data->err_addr,
> >
> > err_data->err_addr_cnt);
> > +                       /* if no new bad page is found, no need to increase ue count
> */
> > +                       if (ret == -EEXIST)
> > +                               err_data->ue_count = 0;
> > +
> >                         amdgpu_ras_save_bad_pages(adev);
> >
> >                         amdgpu_dpm_send_hbm_bad_pages_num(adev, con-
> > >eeprom_control.ras_num_recs);
> > --
> > 2.35.1
> >
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
  2023-02-14  2:26   ` Zhou1, Tao
@ 2023-02-14  4:55     ` Lazar, Lijo
  2023-02-14  6:12       ` Zhou1, Tao
  0 siblings, 1 reply; 13+ messages in thread
From: Lazar, Lijo @ 2023-02-14  4:55 UTC (permalink / raw)
  To: Zhou1, Tao, amd-gfx, Zhang, Hawking, Yang, Stanley, Chai, Thomas,
	Li, Candice



On 2/14/2023 7:56 AM, Zhou1, Tao wrote:
> 
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>> Sent: Monday, February 13, 2023 8:38 PM
>> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org; Zhang,
>> Hawking <Hawking.Zhang@amd.com>; Yang, Stanley
>> <Stanley.Yang@amd.com>; Chai, Thomas <YiPeng.Chai@amd.com>; Li, Candice
>> <Candice.Li@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
>> bad page
>>
>>
>>
>> On 2/10/2023 2:15 PM, Tao Zhou wrote:
>>> If a UMC bad page is reserved but not freed by an application, the
>>> application may trigger uncorrectable error repeatly by accessing the page.
>>>
>>
>> There is amdgpu_ras_check_bad_page which checks if address is already part of
>> an existing bad page. Can't that be used?
>>
>> Thanks,
>> Lijo
> 
> [Tao] amdgpu_ras_check_bad_page is already called in amdgpu_ras_add_bad_pages, this patch just makes use of the result of amdgpu_ras_check_bad_page.
> 

In the patch, below two are called after error count is set to 0.
	amdgpu_ras_save_bad_pages
	amdgpu_dpm_send_hbm_bad_pages_num

Instead of that, just check if it's an existing badpage which is 
repeatedly accessed and proceed directly to the next step (reset if 
specified)

	if (amdgpu_ras_check_bad_page(adev, address))
		set error count to 0;
		goto reset_logic;

Thanks,
Lijo

>>
>>> Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
>>>    2 files changed, 13 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index e85c4689ce2c..eafe01a24349 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
>> amdgpu_device *adev,
>>>    {
>>>    	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>>    	struct ras_err_handler_data *data;
>>> -	int ret = 0;
>>> +	int ret = 0, old_cnt;
>>>    	uint32_t i;
>>>
>>>    	if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6 +2060,8
>>> @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>    	if (!data)
>>>    		goto out;
>>>
>>> +	old_cnt = data->count;
>>> +
>>>    	for (i = 0; i < pages; i++) {
>>>    		if (amdgpu_ras_check_bad_page_unlock(con,
>>>    			bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
>> @@ -2079,6
>>> +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>    		data->count++;
>>>    		data->space_left--;
>>>    	}
>>> +
>>> +	/* all pages have been reserved before, no new bad page */
>>> +	if (old_cnt == data->count)
>>> +		ret = -EEXIST;
>>> +
>>>    out:
>>>    	mutex_unlock(&con->recovery_lock);
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>> index 1c7fcb4f2380..772c431e4065 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>> @@ -145,8 +145,12 @@ static int amdgpu_umc_do_page_retirement(struct
>>> amdgpu_device *adev,
>>>
>>>    		if ((amdgpu_bad_page_threshold != 0) &&
>>>    			err_data->err_addr_cnt) {
>>> -			amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
>>> +			ret = amdgpu_ras_add_bad_pages(adev, err_data-
>>> err_addr,
>>>    						err_data->err_addr_cnt);
>>> +			/* if no new bad page is found, no need to increase ue
>> count */
>>> +			if (ret == -EEXIST)
>>> +				err_data->ue_count = 0;
>>> +
>>>    			amdgpu_ras_save_bad_pages(adev);
>>>
>>>    			amdgpu_dpm_send_hbm_bad_pages_num(adev,
>>> con->eeprom_control.ras_num_recs);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
  2023-02-14  4:55     ` Lazar, Lijo
@ 2023-02-14  6:12       ` Zhou1, Tao
  2023-02-14  6:55         ` Lazar, Lijo
  0 siblings, 1 reply; 13+ messages in thread
From: Zhou1, Tao @ 2023-02-14  6:12 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx, Zhang, Hawking, Yang, Stanley, Chai,
	Thomas, Li, Candice



> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> Sent: Tuesday, February 14, 2023 12:55 PM
> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org; Zhang,
> Hawking <Hawking.Zhang@amd.com>; Yang, Stanley
> <Stanley.Yang@amd.com>; Chai, Thomas <YiPeng.Chai@amd.com>; Li, Candice
> <Candice.Li@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
> bad page
> 
> 
> 
> On 2/14/2023 7:56 AM, Zhou1, Tao wrote:
> >
> >
> >> -----Original Message-----
> >> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> >> Sent: Monday, February 13, 2023 8:38 PM
> >> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org;
> >> Zhang, Hawking <Hawking.Zhang@amd.com>; Yang, Stanley
> >> <Stanley.Yang@amd.com>; Chai, Thomas <YiPeng.Chai@amd.com>; Li,
> >> Candice <Candice.Li@amd.com>
> >> Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if
> >> no new bad page
> >>
> >>
> >>
> >> On 2/10/2023 2:15 PM, Tao Zhou wrote:
> >>> If a UMC bad page is reserved but not freed by an application, the
> >>> application may trigger uncorrectable error repeatly by accessing the page.
> >>>
> >>
> >> There is amdgpu_ras_check_bad_page which checks if address is already
> >> part of an existing bad page. Can't that be used?
> >>
> >> Thanks,
> >> Lijo
> >
> > [Tao] amdgpu_ras_check_bad_page is already called in
> amdgpu_ras_add_bad_pages, this patch just makes use of the result of
> amdgpu_ras_check_bad_page.
> >
> 
> In the patch, below two are called after error count is set to 0.
> 	amdgpu_ras_save_bad_pages
> 	amdgpu_dpm_send_hbm_bad_pages_num
> 
> Instead of that, just check if it's an existing badpage which is repeatedly
> accessed and proceed directly to the next step (reset if
> specified)
> 
> 	if (amdgpu_ras_check_bad_page(adev, address))
> 		set error count to 0;
> 		goto reset_logic;
> 
> Thanks,
> Lijo

[Tao] 1. amdgpu_ras_check_bad_page checks only one page, but one ue can generate 16 or more pages.
2. if no new bad page is found, amdgpu_ras_save_bad_pages will do nothing and ras_num_recs won't increase.
3. gpu reset logic should follow the old way even ue count is set to 0.

This patch only set ue count to 0 if there is no new bad page, but other logic has no change.

> 
> >>
> >>> Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
> >>> ---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
> >>>    2 files changed, 13 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >>> index e85c4689ce2c..eafe01a24349 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >>> @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
> >> amdgpu_device *adev,
> >>>    {
> >>>    	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >>>    	struct ras_err_handler_data *data;
> >>> -	int ret = 0;
> >>> +	int ret = 0, old_cnt;
> >>>    	uint32_t i;
> >>>
> >>>    	if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6
> >>> +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> >>>    	if (!data)
> >>>    		goto out;
> >>>
> >>> +	old_cnt = data->count;
> >>> +
> >>>    	for (i = 0; i < pages; i++) {
> >>>    		if (amdgpu_ras_check_bad_page_unlock(con,
> >>>    			bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
> >> @@ -2079,6
> >>> +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> *adev,
> >>>    		data->count++;
> >>>    		data->space_left--;
> >>>    	}
> >>> +
> >>> +	/* all pages have been reserved before, no new bad page */
> >>> +	if (old_cnt == data->count)
> >>> +		ret = -EEXIST;
> >>> +
> >>>    out:
> >>>    	mutex_unlock(&con->recovery_lock);
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> >>> index 1c7fcb4f2380..772c431e4065 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> >>> @@ -145,8 +145,12 @@ static int
> amdgpu_umc_do_page_retirement(struct
> >>> amdgpu_device *adev,
> >>>
> >>>    		if ((amdgpu_bad_page_threshold != 0) &&
> >>>    			err_data->err_addr_cnt) {
> >>> -			amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> >>> +			ret = amdgpu_ras_add_bad_pages(adev, err_data-
> >>> err_addr,
> >>>    						err_data->err_addr_cnt);
> >>> +			/* if no new bad page is found, no need to increase ue
> >> count */
> >>> +			if (ret == -EEXIST)
> >>> +				err_data->ue_count = 0;
> >>> +
> >>>    			amdgpu_ras_save_bad_pages(adev);
> >>>
> >>>    			amdgpu_dpm_send_hbm_bad_pages_num(adev,
> >>> con->eeprom_control.ras_num_recs);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
  2023-02-14  6:12       ` Zhou1, Tao
@ 2023-02-14  6:55         ` Lazar, Lijo
  0 siblings, 0 replies; 13+ messages in thread
From: Lazar, Lijo @ 2023-02-14  6:55 UTC (permalink / raw)
  To: Zhou1, Tao, amd-gfx, Zhang, Hawking, Yang, Stanley, Chai, Thomas,
	Li, Candice



On 2/14/2023 11:42 AM, Zhou1, Tao wrote:
> 
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>> Sent: Tuesday, February 14, 2023 12:55 PM
>> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org; Zhang,
>> Hawking <Hawking.Zhang@amd.com>; Yang, Stanley
>> <Stanley.Yang@amd.com>; Chai, Thomas <YiPeng.Chai@amd.com>; Li, Candice
>> <Candice.Li@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
>> bad page
>>
>>
>>
>> On 2/14/2023 7:56 AM, Zhou1, Tao wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>>> Sent: Monday, February 13, 2023 8:38 PM
>>>> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org;
>>>> Zhang, Hawking <Hawking.Zhang@amd.com>; Yang, Stanley
>>>> <Stanley.Yang@amd.com>; Chai, Thomas <YiPeng.Chai@amd.com>; Li,
>>>> Candice <Candice.Li@amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if
>>>> no new bad page
>>>>
>>>>
>>>>
>>>> On 2/10/2023 2:15 PM, Tao Zhou wrote:
>>>>> If a UMC bad page is reserved but not freed by an application, the
>>>>> application may trigger uncorrectable error repeatly by accessing the page.
>>>>>
>>>>
>>>> There is amdgpu_ras_check_bad_page which checks if address is already
>>>> part of an existing bad page. Can't that be used?
>>>>
>>>> Thanks,
>>>> Lijo
>>>
>>> [Tao] amdgpu_ras_check_bad_page is already called in
>> amdgpu_ras_add_bad_pages, this patch just makes use of the result of
>> amdgpu_ras_check_bad_page.
>>>
>>
>> In the patch, below two are called after error count is set to 0.
>> 	amdgpu_ras_save_bad_pages
>> 	amdgpu_dpm_send_hbm_bad_pages_num
>>
>> Instead of that, just check if it's an existing badpage which is repeatedly
>> accessed and proceed directly to the next step (reset if
>> specified)
>>
>> 	if (amdgpu_ras_check_bad_page(adev, address))
>> 		set error count to 0;
>> 		goto reset_logic;
>>
>> Thanks,
>> Lijo
> 
> [Tao] 1. amdgpu_ras_check_bad_page checks only one page, but one ue can generate 16 or more pages.

In this case, it makes sense to do as per the patch. Thanks for the 
explanation.

Thanks,
Lijo

> 2. if no new bad page is found, amdgpu_ras_save_bad_pages will do nothing and ras_num_recs won't increase.
> 3. gpu reset logic should follow the old way even ue count is set to 0.
> 
> This patch only set ue count to 0 if there is no new bad page, but other logic has no change.
> 
>>
>>>>
>>>>> Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
>>>>>     2 files changed, 13 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>>> index e85c4689ce2c..eafe01a24349 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>>> @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
>>>> amdgpu_device *adev,
>>>>>     {
>>>>>     	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>>>>     	struct ras_err_handler_data *data;
>>>>> -	int ret = 0;
>>>>> +	int ret = 0, old_cnt;
>>>>>     	uint32_t i;
>>>>>
>>>>>     	if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6
>>>>> +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>>>     	if (!data)
>>>>>     		goto out;
>>>>>
>>>>> +	old_cnt = data->count;
>>>>> +
>>>>>     	for (i = 0; i < pages; i++) {
>>>>>     		if (amdgpu_ras_check_bad_page_unlock(con,
>>>>>     			bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
>>>> @@ -2079,6
>>>>> +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
>> *adev,
>>>>>     		data->count++;
>>>>>     		data->space_left--;
>>>>>     	}
>>>>> +
>>>>> +	/* all pages have been reserved before, no new bad page */
>>>>> +	if (old_cnt == data->count)
>>>>> +		ret = -EEXIST;
>>>>> +
>>>>>     out:
>>>>>     	mutex_unlock(&con->recovery_lock);
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>>>> index 1c7fcb4f2380..772c431e4065 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>>>> @@ -145,8 +145,12 @@ static int
>> amdgpu_umc_do_page_retirement(struct
>>>>> amdgpu_device *adev,
>>>>>
>>>>>     		if ((amdgpu_bad_page_threshold != 0) &&
>>>>>     			err_data->err_addr_cnt) {
>>>>> -			amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
>>>>> +			ret = amdgpu_ras_add_bad_pages(adev, err_data-
>>>>> err_addr,
>>>>>     						err_data->err_addr_cnt);
>>>>> +			/* if no new bad page is found, no need to increase ue
>>>> count */
>>>>> +			if (ret == -EEXIST)
>>>>> +				err_data->ue_count = 0;
>>>>> +
>>>>>     			amdgpu_ras_save_bad_pages(adev);
>>>>>
>>>>>     			amdgpu_dpm_send_hbm_bad_pages_num(adev,
>>>>> con->eeprom_control.ras_num_recs);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
  2023-02-14  2:42     ` Yang, Stanley
@ 2023-02-14 10:03       ` Zhang, Hawking
  2023-02-14 10:27         ` Zhou1, Tao
  0 siblings, 1 reply; 13+ messages in thread
From: Zhang, Hawking @ 2023-02-14 10:03 UTC (permalink / raw)
  To: Yang, Stanley, Zhou1, Tao, amd-gfx, Chai, Thomas, Li, Candice

[AMD Official Use Only - General]

-EINVAL looks better than EEXIST, but it still doesn't fit the case? Alternatively, Can we compare the count before and after amdgpu_ras_add_bad_pages to decide whether reset ue_count to 0 or not. That could be straightforward than struggling for returning which error code in this specific case.

Regards,
Hawking

-----Original Message-----
From: Yang, Stanley <Stanley.Yang@amd.com>
Sent: Tuesday, February 14, 2023 10:42
To: Zhou1, Tao <Tao.Zhou1@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; amd-gfx@lists.freedesktop.org; Chai, Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

[AMD Official Use Only - General]



> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@amd.com>
> Sent: Monday, February 13, 2023 4:25 PM
> To: Zhang, Hawking <Hawking.Zhang@amd.com>; amd-
> gfx@lists.freedesktop.org; Yang, Stanley <Stanley.Yang@amd.com>; Chai,
> Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> new bad page
>
> [AMD Official Use Only - General]
>
> > -----Original Message-----
> > From: Zhang, Hawking <Hawking.Zhang@amd.com>
> > Sent: Friday, February 10, 2023 11:02 PM
> > To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org;
> > Yang, Stanley <Stanley.Yang@amd.com>; Chai, Thomas
> > <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if
> > no new bad page
> >
> > [AMD Official Use Only - General]
> >
> > +                       /* if no new bad page is found, no need to
> > + increase ue count
> */
> > +                       if (ret == -EEXIST)
> > +                               err_data->ue_count = 0;
> >
> > Returning EEXIST in such case is not reasonable. Might consider
> > return a bool for
> > amdgpu_ras_add_bad_pages: true means it does add some new bad page;
> > false means it doesn't change anything.
> >
> > Regards,
> > Hawking
>
> [Tao] but it can returns -ENOMEM, amdgpu_ras_load_bad_pages and
> amdgpu_ras_recovery_init also need to check the return value. I'd like
> to keep the type of return value unchanged.
> How about -EINVAL?

Stanley: How about return -EALREADY?

Regards,
Stanley
>
> >
> > -----Original Message-----
> > From: Zhou1, Tao <Tao.Zhou1@amd.com>
> > Sent: Friday, February 10, 2023 16:45
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>;
> Chai,
> > Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> > Cc: Zhou1, Tao <Tao.Zhou1@amd.com>
> > Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> new
> > bad page
> >
> > If a UMC bad page is reserved but not freed by an application, the
> > application may trigger uncorrectable error repeatly by accessing the page.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
> >  2 files changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index e85c4689ce2c..eafe01a24349 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
> > amdgpu_device *adev,  {
> >         struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >         struct ras_err_handler_data *data;
> > -       int ret = 0;
> > +       int ret = 0, old_cnt;
> >         uint32_t i;
> >
> >         if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6
> > +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> *adev,
> >         if (!data)
> >                 goto out;
> >
> > +       old_cnt = data->count;
> > +
> >         for (i = 0; i < pages; i++) {
> >                 if (amdgpu_ras_check_bad_page_unlock(con,
> >                         bps[i].retired_page <<
> > AMDGPU_GPU_PAGE_SHIFT)) @@ -2079,6
> > +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> *adev,
> >                 data->count++;
> >                 data->space_left--;
> >         }
> > +
> > +       /* all pages have been reserved before, no new bad page */
> > +       if (old_cnt == data->count)
> > +               ret = -EEXIST;
> > +
> >  out:
> >         mutex_unlock(&con->recovery_lock);
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > index 1c7fcb4f2380..772c431e4065 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > @@ -145,8 +145,12 @@ static int
> amdgpu_umc_do_page_retirement(struct
> > amdgpu_device *adev,
> >
> >                 if ((amdgpu_bad_page_threshold != 0) &&
> >                         err_data->err_addr_cnt) {
> > -                       amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> > +                       ret = amdgpu_ras_add_bad_pages(adev,
> > + err_data->err_addr,
> >
> > err_data->err_addr_cnt);
> > +                       /* if no new bad page is found, no need to
> > + increase ue count
> */
> > +                       if (ret == -EEXIST)
> > +                               err_data->ue_count = 0;
> > +
> >                         amdgpu_ras_save_bad_pages(adev);
> >
> >                         amdgpu_dpm_send_hbm_bad_pages_num(adev, con-
> > >eeprom_control.ras_num_recs);
> > --
> > 2.35.1
> >
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
  2023-02-14 10:03       ` Zhang, Hawking
@ 2023-02-14 10:27         ` Zhou1, Tao
  0 siblings, 0 replies; 13+ messages in thread
From: Zhou1, Tao @ 2023-02-14 10:27 UTC (permalink / raw)
  To: Zhang, Hawking, Yang, Stanley, amd-gfx, Chai, Thomas, Li, Candice

[AMD Official Use Only - General]

OK, I'll add a new function to do the check.

Tao

> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang@amd.com>
> Sent: Tuesday, February 14, 2023 6:03 PM
> To: Yang, Stanley <Stanley.Yang@amd.com>; Zhou1, Tao
> <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org; Chai, Thomas
> <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
> bad page
>
> [AMD Official Use Only - General]
>
> -EINVAL looks better than EEXIST, but it still doesn't fit the case? Alternatively,
> Can we compare the count before and after amdgpu_ras_add_bad_pages to
> decide whether reset ue_count to 0 or not. That could be straightforward than
> struggling for returning which error code in this specific case.
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Yang, Stanley <Stanley.Yang@amd.com>
> Sent: Tuesday, February 14, 2023 10:42
> To: Zhou1, Tao <Tao.Zhou1@amd.com>; Zhang, Hawking
> <Hawking.Zhang@amd.com>; amd-gfx@lists.freedesktop.org; Chai, Thomas
> <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
> bad page
>
> [AMD Official Use Only - General]
>
>
>
> > -----Original Message-----
> > From: Zhou1, Tao <Tao.Zhou1@amd.com>
> > Sent: Monday, February 13, 2023 4:25 PM
> > To: Zhang, Hawking <Hawking.Zhang@amd.com>; amd-
> > gfx@lists.freedesktop.org; Yang, Stanley <Stanley.Yang@amd.com>; Chai,
> > Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> > new bad page
> >
> > [AMD Official Use Only - General]
> >
> > > -----Original Message-----
> > > From: Zhang, Hawking <Hawking.Zhang@amd.com>
> > > Sent: Friday, February 10, 2023 11:02 PM
> > > To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org;
> > > Yang, Stanley <Stanley.Yang@amd.com>; Chai, Thomas
> > > <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> > > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if
> > > no new bad page
> > >
> > > [AMD Official Use Only - General]
> > >
> > > +                       /* if no new bad page is found, no need to
> > > + increase ue count
> > */
> > > +                       if (ret == -EEXIST)
> > > +                               err_data->ue_count = 0;
> > >
> > > Returning EEXIST in such case is not reasonable. Might consider
> > > return a bool for
> > > amdgpu_ras_add_bad_pages: true means it does add some new bad page;
> > > false means it doesn't change anything.
> > >
> > > Regards,
> > > Hawking
> >
> > [Tao] but it can returns -ENOMEM, amdgpu_ras_load_bad_pages and
> > amdgpu_ras_recovery_init also need to check the return value. I'd like
> > to keep the type of return value unchanged.
> > How about -EINVAL?
>
> Stanley: How about return -EALREADY?
>
> Regards,
> Stanley
> >
> > >
> > > -----Original Message-----
> > > From: Zhou1, Tao <Tao.Zhou1@amd.com>
> > > Sent: Friday, February 10, 2023 16:45
> > > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > > <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>;
> > Chai,
> > > Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>
> > > Cc: Zhou1, Tao <Tao.Zhou1@amd.com>
> > > Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> > new
> > > bad page
> > >
> > > If a UMC bad page is reserved but not freed by an application, the
> > > application may trigger uncorrectable error repeatly by accessing the page.
> > >
> > > Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
> > >  2 files changed, 13 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > > index e85c4689ce2c..eafe01a24349 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > > @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
> > > amdgpu_device *adev,  {
> > >         struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> > >         struct ras_err_handler_data *data;
> > > -       int ret = 0;
> > > +       int ret = 0, old_cnt;
> > >         uint32_t i;
> > >
> > >         if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6
> > > +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> > *adev,
> > >         if (!data)
> > >                 goto out;
> > >
> > > +       old_cnt = data->count;
> > > +
> > >         for (i = 0; i < pages; i++) {
> > >                 if (amdgpu_ras_check_bad_page_unlock(con,
> > >                         bps[i].retired_page <<
> > > AMDGPU_GPU_PAGE_SHIFT)) @@ -2079,6
> > > +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> > *adev,
> > >                 data->count++;
> > >                 data->space_left--;
> > >         }
> > > +
> > > +       /* all pages have been reserved before, no new bad page */
> > > +       if (old_cnt == data->count)
> > > +               ret = -EEXIST;
> > > +
> > >  out:
> > >         mutex_unlock(&con->recovery_lock);
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > > index 1c7fcb4f2380..772c431e4065 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > > @@ -145,8 +145,12 @@ static int
> > amdgpu_umc_do_page_retirement(struct
> > > amdgpu_device *adev,
> > >
> > >                 if ((amdgpu_bad_page_threshold != 0) &&
> > >                         err_data->err_addr_cnt) {
> > > -                       amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> > > +                       ret = amdgpu_ras_add_bad_pages(adev,
> > > + err_data->err_addr,
> > >
> > > err_data->err_addr_cnt);
> > > +                       /* if no new bad page is found, no need to
> > > + increase ue count
> > */
> > > +                       if (ret == -EEXIST)
> > > +                               err_data->ue_count = 0;
> > > +
> > >                         amdgpu_ras_save_bad_pages(adev);
> > >
> > >                         amdgpu_dpm_send_hbm_bad_pages_num(adev, con-
> > > >eeprom_control.ras_num_recs);
> > > --
> > > 2.35.1
> > >
> >


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
  2023-02-16  7:57 Tao Zhou
@ 2023-02-16 12:20 ` Yang, Stanley
  0 siblings, 0 replies; 13+ messages in thread
From: Yang, Stanley @ 2023-02-16 12:20 UTC (permalink / raw)
  To: Zhou1, Tao, amd-gfx, Zhang, Hawking, Chai, Thomas, Li, Candice,
	Lazar, Lijo

[AMD Official Use Only - General]



> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@amd.com>
> Sent: Thursday, February 16, 2023 3:58 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Chai,
> Thomas <YiPeng.Chai@amd.com>; Li, Candice <Candice.Li@amd.com>; Lazar,
> Lijo <Lijo.Lazar@amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1@amd.com>
> Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
> bad page
> 
> If a UMC bad page is reserved but not freed by an application, the application
> may trigger uncorrectable error repeatly by accessing the page.
> 
> v2: add specific function to do the check.
> 
> Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 24
> ++++++++++++++++++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  2 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c |  4 ++++
>  3 files changed, 30 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 6e543558386d..5214034e1b16 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2115,6 +2115,30 @@ int amdgpu_ras_save_bad_pages(struct
> amdgpu_device *adev)
>  	return 0;
>  }
> 
> +/* Return false if all pages have been reserved before, no new bad page
> + * is found, otherwise return true.
> + * Note: the function should be called between
> amdgpu_ras_add_bad_pages
> + * and amdgpu_ras_save_bad_pages.
> + */
> +bool amdgpu_ras_new_bad_page_is_added(struct amdgpu_device *adev)
> {
> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +	struct ras_err_handler_data *data;
> +	struct amdgpu_ras_eeprom_control *control;
> +	int save_count;
> +
> +	if (!con || !con->eh_data)
> +		return false;
> +
> +	mutex_lock(&con->recovery_lock);
> +	control = &con->eeprom_control;
> +	data = con->eh_data;
> +	save_count = data->count - control->ras_num_recs;
> +	mutex_unlock(&con->recovery_lock);
> +
> +	return (save_count ? true : false);
> +}
> +
>  /*
>   * read error record array in eeprom and reserve enough space for
>   * storing new bad pages
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index f2ad999993f6..606b75c36848 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -549,6 +549,8 @@ int amdgpu_ras_add_bad_pages(struct
> amdgpu_device *adev,
> 
>  int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
> 
> +bool amdgpu_ras_new_bad_page_is_added(struct amdgpu_device *adev);
> +
>  static inline enum ta_ras_block
>  amdgpu_ras_block_to_ta(enum amdgpu_ras_block block) {
>  	switch (block) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 1c7fcb4f2380..1146e65c22be 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -147,6 +147,10 @@ static int amdgpu_umc_do_page_retirement(struct
> amdgpu_device *adev,
>  			err_data->err_addr_cnt) {
>  			amdgpu_ras_add_bad_pages(adev, err_data-
> >err_addr,
>  						err_data->err_addr_cnt);
> +			/* if no new bad page is found, no need to increase
> ue count */
> +			if (!amdgpu_ras_new_bad_page_is_added(adev))
> +				err_data->ue_count = 0;

[Stanley]: There is a scenario, a UMC bad page is reserved but not freed by an application, the application accesses the above reserved page and it also
accesses a new bad page, driver read 2 ue count but save one new bad page, the err_data->ue_count should be set to 1.

> +
>  			amdgpu_ras_save_bad_pages(adev);
> 
>  			amdgpu_dpm_send_hbm_bad_pages_num(adev,
> con->eeprom_control.ras_num_recs);
> --
> 2.35.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
@ 2023-02-16  7:57 Tao Zhou
  2023-02-16 12:20 ` Yang, Stanley
  0 siblings, 1 reply; 13+ messages in thread
From: Tao Zhou @ 2023-02-16  7:57 UTC (permalink / raw)
  To: amd-gfx, hawking.zhang, stanley.yang, yipeng.chai, candice.li,
	lijo.lazar
  Cc: Tao Zhou

If a UMC bad page is reserved but not freed by an application, the
application may trigger uncorrectable error repeatly by accessing the page.

v2: add specific function to do the check.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 24 ++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c |  4 ++++
 3 files changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 6e543558386d..5214034e1b16 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2115,6 +2115,30 @@ int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
 	return 0;
 }
 
+/* Return false if all pages have been reserved before, no new bad page
+ * is found, otherwise return true.
+ * Note: the function should be called between amdgpu_ras_add_bad_pages
+ * and amdgpu_ras_save_bad_pages.
+ */
+bool amdgpu_ras_new_bad_page_is_added(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_err_handler_data *data;
+	struct amdgpu_ras_eeprom_control *control;
+	int save_count;
+
+	if (!con || !con->eh_data)
+		return false;
+
+	mutex_lock(&con->recovery_lock);
+	control = &con->eeprom_control;
+	data = con->eh_data;
+	save_count = data->count - control->ras_num_recs;
+	mutex_unlock(&con->recovery_lock);
+
+	return (save_count ? true : false);
+}
+
 /*
  * read error record array in eeprom and reserve enough space for
  * storing new bad pages
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index f2ad999993f6..606b75c36848 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -549,6 +549,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
 
 int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
 
+bool amdgpu_ras_new_bad_page_is_added(struct amdgpu_device *adev);
+
 static inline enum ta_ras_block
 amdgpu_ras_block_to_ta(enum amdgpu_ras_block block) {
 	switch (block) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 1c7fcb4f2380..1146e65c22be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -147,6 +147,10 @@ static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
 			err_data->err_addr_cnt) {
 			amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
 						err_data->err_addr_cnt);
+			/* if no new bad page is found, no need to increase ue count */
+			if (!amdgpu_ras_new_bad_page_is_added(adev))
+				err_data->ue_count = 0;
+
 			amdgpu_ras_save_bad_pages(adev);
 
 			amdgpu_dpm_send_hbm_bad_pages_num(adev, con->eeprom_control.ras_num_recs);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-02-16 12:20 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-10  8:45 [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page Tao Zhou
2023-02-10 15:01 ` Zhang, Hawking
2023-02-13  8:25   ` Zhou1, Tao
2023-02-14  2:42     ` Yang, Stanley
2023-02-14 10:03       ` Zhang, Hawking
2023-02-14 10:27         ` Zhou1, Tao
2023-02-13 12:37 ` Lazar, Lijo
2023-02-14  2:26   ` Zhou1, Tao
2023-02-14  4:55     ` Lazar, Lijo
2023-02-14  6:12       ` Zhou1, Tao
2023-02-14  6:55         ` Lazar, Lijo
2023-02-16  7:57 Tao Zhou
2023-02-16 12:20 ` Yang, Stanley

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.