amd-gfx.lists.freedesktop.org archive mirror
* [PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
@ 2019-11-13 22:45 Andrey Grodzovsky
  2019-11-13 22:45 ` Andrey Grodzovsky
       [not found] ` <1573685125-2335-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 2 replies; 14+ messages in thread
From: Andrey Grodzovsky @ 2019-11-13 22:45 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: alexdeucher-Re5JQEeQqe8AvxtiuMwx3w, Andrey Grodzovsky,
	guchun.chen-5C7GfCeVMHo, Hawking.Zhang-5C7GfCeVMHo

We want to be able to call them independently of one another,
so that only amdgpu_ras_save_bad_pages is called before a GPU reset.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 4044834..600a86d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
  * write error record array to eeprom, the function should be
  * protected by recovery_lock
  */
-static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
+int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_err_handler_data *data;
@@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)
 	struct ras_err_handler_data *data;
 	uint64_t bp;
 	struct amdgpu_bo *bo = NULL;
-	int i, ret = 0;
+	int i;
 
 	if (!con || !con->eh_data)
 		return 0;
@@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)
 		data->last_reserved = i + 1;
 		bo = NULL;
 	}
-
-	/* continue to save bad pages to eeprom even reesrve_vram fails */
-	ret = amdgpu_ras_save_bad_pages(adev);
 out:
 	mutex_unlock(&con->recovery_lock);
-	return ret;
+	return 0;
 }
 
 /* called when driver unload */
@@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 		ret = amdgpu_ras_load_bad_pages(adev);
 		if (ret)
 			goto free;
-		ret = amdgpu_ras_reserve_bad_pages(adev);
-		if (ret)
-			goto release;
+		amdgpu_ras_reserve_bad_pages(adev);
 	}
 
 	return 0;
 
-release:
 	amdgpu_ras_release_bad_pages(adev);
 free:
 	kfree((*data)->bps);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index f80fd34..12b0797 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -492,6 +492,8 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
 		struct eeprom_table_record *bps, int pages);
 
+int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
+
 int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
 
 static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
@@ -503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
 	 * i2c may be unstable in gpu reset
 	 */
 	if (in_task())
-		amdgpu_ras_reserve_bad_pages(adev);
+		amdgpu_ras_save_bad_pages(adev);
 
 	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
 		schedule_work(&ras->recovery_work);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 14+ messages in thread
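As an illustration of the split this patch makes, here is a minimal compile-level sketch. The `fake_` structs and functions are hypothetical stand-ins, not the real amdgpu code: the point is only that saving to eeprom and reserving VRAM become independent steps, so a reset path can save before the reset and re-reserve after it.

```c
#include <assert.h>

/* Hypothetical stand-in for amdgpu_device; tracks the three states
 * the patch series cares about. */
struct fake_adev {
	int bad_pages;       /* bad pages detected so far */
	int saved_to_eeprom; /* records persisted to eeprom */
	int reserved;        /* VRAM pages currently pinned */
};

/* After this patch, callable on its own (it used to be static and
 * only reachable through the reserve path). */
int fake_save_bad_pages(struct fake_adev *adev)
{
	adev->saved_to_eeprom = adev->bad_pages;
	return 0;
}

/* No longer writes eeprom as a side effect. */
int fake_reserve_bad_pages(struct fake_adev *adev)
{
	adev->reserved = adev->bad_pages;
	return 0;
}

/* The ordering the series aims for around a GPU reset. */
void fake_gpu_reset_flow(struct fake_adev *adev)
{
	fake_save_bad_pages(adev);    /* before reset: eeprom only */
	adev->reserved = 0;           /* reset may invalidate reservations */
	fake_reserve_bad_pages(adev); /* after reset: re-pin VRAM */
}
```

Either function can now be called from its own call site, which is what patch 2 in the series relies on.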


* [PATCH 2/2] drm/amdgpu: Reserve bad pages after ASIC was reset.
       [not found] ` <1573685125-2335-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
@ 2019-11-13 22:45   ` Andrey Grodzovsky
  2019-11-13 22:45     ` Andrey Grodzovsky
  2019-11-14  3:09   ` [PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages Zhou1, Tao
  1 sibling, 1 reply; 14+ messages in thread
From: Andrey Grodzovsky @ 2019-11-13 22:45 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: alexdeucher-Re5JQEeQqe8AvxtiuMwx3w, Andrey Grodzovsky,
	guchun.chen-5C7GfCeVMHo, Hawking.Zhang-5C7GfCeVMHo

Problem:
There is no point in reserving bad pages before a GPU reset, as was
done until now, since all reservations are lost during the ASIC reset;
the reservation needs to happen after it.

Fix:
Call amdgpu_ras_reserve_bad_pages after the ASIC is reset, not before.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f2feff4..eaac2b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3804,6 +3804,10 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 						break;
 				}
 			}
+
+			/* Mark vram pages with errors as bad after ASIC was reset */
+			list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head)
+			       amdgpu_ras_reserve_bad_pages(tmp_adev);
 		}
 	}
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 14+ messages in thread
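To make the new ordering concrete, here is a compile-level sketch of what the added hunk does. The `fake_` types are hypothetical stand-ins, with a plain array standing in for the XGMI hive's device list that the real code walks with list_for_each_entry():

```c
#include <assert.h>

struct fake_dev {
	int was_reset;
	int bad_pages_reserved;
};

void fake_asic_reset(struct fake_dev *d)
{
	d->was_reset = 1;
	d->bad_pages_reserved = 0; /* reservations assumed lost on reset */
}

void fake_reserve_bad_pages(struct fake_dev *d)
{
	d->bad_pages_reserved = 1;
}

/* Mirrors amdgpu_do_asic_reset after this patch: every device in the
 * hive is reset first, and only then is the list walked again to
 * re-reserve the bad pages. */
void fake_do_asic_reset(struct fake_dev *devs, int n)
{
	int i;

	for (i = 0; i < n; i++)
		fake_asic_reset(&devs[i]);

	/* corresponds to the added list_for_each_entry() loop */
	for (i = 0; i < n; i++)
		fake_reserve_bad_pages(&devs[i]);
}
```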


* RE: [PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
       [not found] ` <1573685125-2335-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  2019-11-13 22:45   ` [PATCH 2/2] drm/amdgpu: Reserve bad pages after ASIC was reset Andrey Grodzovsky
@ 2019-11-14  3:09   ` Zhou1, Tao
  2019-11-14  3:09     ` Zhou1, Tao
       [not found]     ` <MN2PR12MB3054761DA7F4EDEC9975F8AAB0710-rweVpJHSKTqnT25eLM+iUQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  1 sibling, 2 replies; 14+ messages in thread
From: Zhou1, Tao @ 2019-11-14  3:09 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: alexdeucher-Re5JQEeQqe8AvxtiuMwx3w, Grodzovsky, Andrey, Chen,
	Guchun, Zhang, Hawking

Two questions:

1. "we lose all reservation during ASIC reset"
Are you sure about that? I remember the content of vram may be lost after reset, but the allocation status could be preserved.

2. You change the bad page handle flow from:

detect bad page -> reserve vram for bad page -> save bad page info to eeprom -> gpu reset

to:

detect bad page (this time) -> save bad page (last time) info to eeprom -> gpu reset -> reserve vram for bad page (this time)

Is that right? Saving this time's bad page info to eeprom is delayed until the next time a bad page is detected, but the time of detecting a bad page is random.
I think the bad page info would be lost without being saved to eeprom if a power off occurs.

detect bad page (this time) -> save bad page (last time) info to eeprom -> gpu reset -> reserve vram for bad page (this time) -> poweroff/system reset (and bad page info (this time) is lost)

Tao
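
The two orderings above can be written down as a small event-log sketch (hypothetical and driver-free; it only encodes the step order being debated, not which side is right about the eeprom contents):

```c
#include <assert.h>
#include <string.h>

#define MAX_EVENTS 8

static const char *ev_log[MAX_EVENTS];
static int ev_count;

static void ev(const char *e)
{
	if (ev_count < MAX_EVENTS)
		ev_log[ev_count++] = e;
}

/* old flow: eeprom write happens before the reset, inside reserve */
void old_flow(void)
{
	ev_count = 0;
	ev("detect");
	ev("reserve_vram");
	ev("save_eeprom");
	ev("gpu_reset");
}

/* new flow: reserve moves after the reset; the eeprom write still
 * happens before the reset, now as a standalone step */
void new_flow(void)
{
	ev_count = 0;
	ev("detect");
	ev("save_eeprom");
	ev("gpu_reset");
	ev("reserve_vram");
}
```

The question in the thread is whether, in the new flow, the "save_eeprom" step covers the pages detected in this iteration or the previous one.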

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Andrey Grodzovsky
> Sent: 2019年11月14日 6:45
> To: amd-gfx@lists.freedesktop.org
> Cc: alexdeucher@gmail.com; Grodzovsky, Andrey
> <Andrey.Grodzovsky@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>;
> Zhang, Hawking <Hawking.Zhang@amd.com>
> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
> 
> We want to be be able to call them independently from one another so that
> before GPU reset just amdgpu_ras_save_bad_pages is called.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
>  2 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 4044834..600a86d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
> amdgpu_device *adev,
>   * write error record array to eeprom, the function should be
>   * protected by recovery_lock
>   */
> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>  {
>  	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  	struct ras_err_handler_data *data;
> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct
> amdgpu_device *adev)
>  	struct ras_err_handler_data *data;
>  	uint64_t bp;
>  	struct amdgpu_bo *bo = NULL;
> -	int i, ret = 0;
> +	int i;
> 
>  	if (!con || !con->eh_data)
>  		return 0;
> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
> amdgpu_device *adev)
>  		data->last_reserved = i + 1;
>  		bo = NULL;
>  	}
> -
> -	/* continue to save bad pages to eeprom even reesrve_vram fails */
> -	ret = amdgpu_ras_save_bad_pages(adev);
>  out:
>  	mutex_unlock(&con->recovery_lock);
> -	return ret;
> +	return 0;
>  }
> 
>  /* called when driver unload */
> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
> amdgpu_device *adev)
>  		ret = amdgpu_ras_load_bad_pages(adev);
>  		if (ret)
>  			goto free;
> -		ret = amdgpu_ras_reserve_bad_pages(adev);
> -		if (ret)
> -			goto release;
> +		amdgpu_ras_reserve_bad_pages(adev);
>  	}
> 
>  	return 0;
> 
> -release:
>  	amdgpu_ras_release_bad_pages(adev);
>  free:
>  	kfree((*data)->bps);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index f80fd34..12b0797 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -492,6 +492,8 @@ unsigned long
> amdgpu_ras_query_error_count(struct amdgpu_device *adev,  int
> amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>  		struct eeprom_table_record *bps, int pages);
> 
> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
> +
>  int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
> 
>  static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev, @@ -
> 503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
> amdgpu_device *adev,
>  	 * i2c may be unstable in gpu reset
>  	 */
>  	if (in_task())
> -		amdgpu_ras_reserve_bad_pages(adev);
> +		amdgpu_ras_save_bad_pages(adev);
> 
>  	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>  		schedule_work(&ras->recovery_work);
> --
> 2.7.4
> 
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread


* Re: [PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
       [not found]     ` <MN2PR12MB3054761DA7F4EDEC9975F8AAB0710-rweVpJHSKTqnT25eLM+iUQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2019-11-14 16:41       ` Andrey Grodzovsky
  2019-11-14 16:41         ` Andrey Grodzovsky
       [not found]         ` <63578cfb-83c2-6887-b7da-7715ef6ed3a9-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 2 replies; 14+ messages in thread
From: Andrey Grodzovsky @ 2019-11-14 16:41 UTC (permalink / raw)
  To: Zhou1, Tao, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Koenig, Christian
  Cc: alexdeucher-Re5JQEeQqe8AvxtiuMwx3w, Chen, Guchun, Zhang, Hawking


On 11/13/19 10:09 PM, Zhou1, Tao wrote:
> Two questions:
>
> 1. "we lose all reservation during ASIC reset"
> Are you sure of it? I remember the content of vram may be lost after reset but the allocated status could be reserved.

Yeah, now that I think about it, I might have confused this with the
BO content recovery in amdgpu_device_recover_vram for shadow buffers,
which covers page tables only; for plain VRAM reservation status this
might be irrelevant. Christian - can you confirm Tao is correct on this?

>
> 2. You change the bad page handle flow from:
>
> detect bad page -> reserve vram for bad page -> save bad page info to eeprom -> gpu reset
>
> to:
>
> detect bad page (this time) -> save bad page (last time) info to eeprom -> gpu reset -> reserve vram for bad page (this time)

Even if I am wrong on the first point and this becomes irrelevant,
the question still stands: why would the saved bad pages be from last
time? See
https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436
- the save count is the latest, as is the content of
data->bps[control->num_recs] up to data->bps[control->num_recs +
save_count], since those are updated in
amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages, which is called
right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages in the
interrupt sequence

Andrey
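
The indexing being argued here can be sketched like this (reduced, hypothetical structs; `num_recs` and `bps` mirror the driver's field names, but the logic is a simplification): only records past `control->num_recs` are written, i.e. the pages added since the last eeprom write.

```c
#include <assert.h>

struct fake_control {
	int num_recs;        /* records already in eeprom */
};

struct fake_err_data {
	int count;           /* all bad pages known to the driver */
	int bps[16];         /* per-page records (reduced to an int) */
};

/* Writes only the tail of bps that eeprom does not have yet and
 * returns how many records were written. */
int fake_save_bad_pages(struct fake_err_data *data, struct fake_control *ctl,
			int *eeprom, int eeprom_len)
{
	int save_count = data->count - ctl->num_recs;
	int i;

	if (save_count <= 0)
		return 0;

	for (i = 0; i < save_count && ctl->num_recs + i < eeprom_len; i++)
		eeprom[ctl->num_recs + i] = data->bps[ctl->num_recs + i];

	ctl->num_recs += i;
	return i;
}
```

Under this model the save performed just before the reset covers the records appended by the current detection, which is the point Andrey is making.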


>
> Is that right?  Saving bad page (this time) info to eeprom is delayed to the next time that bad page is detected? But the time of detecting bad page is random.
> I think the bad page info would be lost without saving to eeprom if power off occurs.
>
> detect bad page (this time) -> save bad page (last time) info to eeprom -> gpu reset -> reserve vram for bad page (this time) -> poweroff/system reset (and bad page info (this time) is lost)
>
> Tao
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>> Andrey Grodzovsky
>> Sent: 2019年11月14日 6:45
>> To: amd-gfx@lists.freedesktop.org
>> Cc: alexdeucher@gmail.com; Grodzovsky, Andrey
>> <Andrey.Grodzovsky@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>;
>> Zhang, Hawking <Hawking.Zhang@amd.com>
>> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
>> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
>>
>> We want to be be able to call them independently from one another so that
>> before GPU reset just amdgpu_ras_save_bad_pages is called.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
>>   2 files changed, 7 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index 4044834..600a86d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
>> amdgpu_device *adev,
>>    * write error record array to eeprom, the function should be
>>    * protected by recovery_lock
>>    */
>> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>   {
>>   	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>   	struct ras_err_handler_data *data;
>> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct
>> amdgpu_device *adev)
>>   	struct ras_err_handler_data *data;
>>   	uint64_t bp;
>>   	struct amdgpu_bo *bo = NULL;
>> -	int i, ret = 0;
>> +	int i;
>>
>>   	if (!con || !con->eh_data)
>>   		return 0;
>> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
>> amdgpu_device *adev)
>>   		data->last_reserved = i + 1;
>>   		bo = NULL;
>>   	}
>> -
>> -	/* continue to save bad pages to eeprom even reesrve_vram fails */
>> -	ret = amdgpu_ras_save_bad_pages(adev);
>>   out:
>>   	mutex_unlock(&con->recovery_lock);
>> -	return ret;
>> +	return 0;
>>   }
>>
>>   /* called when driver unload */
>> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
>> amdgpu_device *adev)
>>   		ret = amdgpu_ras_load_bad_pages(adev);
>>   		if (ret)
>>   			goto free;
>> -		ret = amdgpu_ras_reserve_bad_pages(adev);
>> -		if (ret)
>> -			goto release;
>> +		amdgpu_ras_reserve_bad_pages(adev);
>>   	}
>>
>>   	return 0;
>>
>> -release:
>>   	amdgpu_ras_release_bad_pages(adev);
>>   free:
>>   	kfree((*data)->bps);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> index f80fd34..12b0797 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> @@ -492,6 +492,8 @@ unsigned long
>> amdgpu_ras_query_error_count(struct amdgpu_device *adev,  int
>> amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>   		struct eeprom_table_record *bps, int pages);
>>
>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
>> +
>>   int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
>>
>>   static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev, @@ -
>> 503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
>> amdgpu_device *adev,
>>   	 * i2c may be unstable in gpu reset
>>   	 */
>>   	if (in_task())
>> -		amdgpu_ras_reserve_bad_pages(adev);
>> +		amdgpu_ras_save_bad_pages(adev);
>>
>>   	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>   		schedule_work(&ras->recovery_work);
>> --
>> 2.7.4
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

>>
>>   	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>   		schedule_work(&ras->recovery_work);
>> --
>> 2.7.4
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
       [not found]         ` <63578cfb-83c2-6887-b7da-7715ef6ed3a9-5C7GfCeVMHo@public.gmane.org>
@ 2019-11-15  2:42           ` Zhou1, Tao
  2019-11-15  2:42             ` Zhou1, Tao
  2019-11-18 14:05           ` Andrey Grodzovsky
  1 sibling, 1 reply; 14+ messages in thread
From: Zhou1, Tao @ 2019-11-15  2:42 UTC (permalink / raw)
  To: Grodzovsky, Andrey, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Koenig, Christian
  Cc: alexdeucher-Re5JQEeQqe8AvxtiuMwx3w, Chen, Guchun, Zhang, Hawking

Sorry, I confused reserve_bad_pages with add_bad_pages, you're right.

If the vram allocation info is preserved across gpu reset, it seems your patch is unnecessary.
If there is a risk that the info is lost, I think pages [0, data->count) instead of only [data->last_reserved, data->count) should be reserved.

Tao

> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Sent: 2019年11月15日 0:42
> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org;
> Koenig, Christian <Christian.Koenig@amd.com>
> Cc: alexdeucher@gmail.com; Chen, Guchun <Guchun.Chen@amd.com>;
> Zhang, Hawking <Hawking.Zhang@amd.com>
> Subject: Re: [PATCH 1/2] drm/amdgpu/ras: Extract
> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
> 
> 
> On 11/13/19 10:09 PM, Zhou1, Tao wrote:
> > Two questions:
> >
> > 1. "we lose all reservation during ASIC reset"
> > Are you sure of it? I remember the content of vram may be lost after reset
> but the allocated status could be preserved.
> 
> Yea, now that I am thinking of it, I might have confused it with BO
> content recovery in amdgpu_device_recover_vram for shadow buffers (which
> are page tables only); for VRAM reservation status alone this might be
> irrelevant... Christian - can you confirm Tao is correct on this?
> 
> >
> > 2. You change the bad page handle flow from:
> >
> > detect bad page -> reserve vram for bad page -> save bad page info to
> > eeprom -> gpu reset
> >
> > to:
> >
> > detect bad page (this time) -> save bad page (last time) info to
> > eeprom -> gpu reset -> reserve vram for bad page (this time)
> 
> Even if I am wrong on the first point and this is irrelevant, still - why
> would the saved bad pages be from last time? See
> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436
> - the save count is the latest, and so is the content of
> data->bps[control->num_recs] up to data->bps[control->num_recs +
> save_count], as those are updated in
> amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages, which is
> called right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages
> in the interrupt sequence.
> 
> Andrey
> 
> 
> >
> > Is that right?  Saving bad page (this time) info to eeprom is delayed to the
> next time that bad page is detected? But the time of detecting bad page is
> random.
> > I think the bad page info would be lost without saving to eeprom if power
> off occurs.
> >
> > detect bad page (this time) -> save bad page (last time) info to
> > eeprom -> gpu reset -> reserve vram for bad page (this time) ->
> > poweroff/system reset (and bad page info (this time) is lost)
> >
> > Tao
> >
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> >> Andrey Grodzovsky
> >> Sent: 2019年11月14日 6:45
> >> To: amd-gfx@lists.freedesktop.org
> >> Cc: alexdeucher@gmail.com; Grodzovsky, Andrey
> >> <Andrey.Grodzovsky@amd.com>; Chen, Guchun
> <Guchun.Chen@amd.com>;
> >> Zhang, Hawking <Hawking.Zhang@amd.com>
> >> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
> >> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
> >>
> >> We want to be able to call them independently from one another so
> >> that before GPU reset just amdgpu_ras_save_bad_pages is called.
> >>
> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
> >>   2 files changed, 7 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> index 4044834..600a86d 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
> >> amdgpu_device *adev,
> >>    * write error record array to eeprom, the function should be
> >>    * protected by recovery_lock
> >>    */
> >> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> >> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> >>   {
> >>   	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >>   	struct ras_err_handler_data *data; @@ -1520,7 +1520,7 @@ int
> >> amdgpu_ras_reserve_bad_pages(struct
> >> amdgpu_device *adev)
> >>   	struct ras_err_handler_data *data;
> >>   	uint64_t bp;
> >>   	struct amdgpu_bo *bo = NULL;
> >> -	int i, ret = 0;
> >> +	int i;
> >>
> >>   	if (!con || !con->eh_data)
> >>   		return 0;
> >> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
> >> amdgpu_device *adev)
> >>   		data->last_reserved = i + 1;
> >>   		bo = NULL;
> >>   	}
> >> -
> >> -	/* continue to save bad pages to eeprom even reesrve_vram fails */
> >> -	ret = amdgpu_ras_save_bad_pages(adev);
> >>   out:
> >>   	mutex_unlock(&con->recovery_lock);
> >> -	return ret;
> >> +	return 0;
> >>   }
> >>
> >>   /* called when driver unload */
> >> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
> >> amdgpu_device *adev)
> >>   		ret = amdgpu_ras_load_bad_pages(adev);
> >>   		if (ret)
> >>   			goto free;
> >> -		ret = amdgpu_ras_reserve_bad_pages(adev);
> >> -		if (ret)
> >> -			goto release;
> >> +		amdgpu_ras_reserve_bad_pages(adev);
> >>   	}
> >>
> >>   	return 0;
> >>
> >> -release:
> >>   	amdgpu_ras_release_bad_pages(adev);
> >>   free:
> >>   	kfree((*data)->bps);
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> index f80fd34..12b0797 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> @@ -492,6 +492,8 @@ unsigned long
> >> amdgpu_ras_query_error_count(struct amdgpu_device *adev,  int
> >> amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> >>   		struct eeprom_table_record *bps, int pages);
> >>
> >> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
> >> +
> >>   int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
> >>
> >>   static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
> >> @@ -
> >> 503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
> >> amdgpu_device *adev,
> >>   	 * i2c may be unstable in gpu reset
> >>   	 */
> >>   	if (in_task())
> >> -		amdgpu_ras_reserve_bad_pages(adev);
> >> +		amdgpu_ras_save_bad_pages(adev);
> >>
> >>   	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
> >>   		schedule_work(&ras->recovery_work);
> >> --
> >> 2.7.4
> >>
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
       [not found]         ` <63578cfb-83c2-6887-b7da-7715ef6ed3a9-5C7GfCeVMHo@public.gmane.org>
  2019-11-15  2:42           ` Zhou1, Tao
@ 2019-11-18 14:05           ` Andrey Grodzovsky
  2019-11-18 14:05             ` Andrey Grodzovsky
       [not found]             ` <9ac78bf3-664b-ba1a-8579-bb802e3860e4-5C7GfCeVMHo@public.gmane.org>
  1 sibling, 2 replies; 14+ messages in thread
From: Andrey Grodzovsky @ 2019-11-18 14:05 UTC (permalink / raw)
  To: Zhou1, Tao, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Koenig, Christian
  Cc: alexdeucher-Re5JQEeQqe8AvxtiuMwx3w, Chen, Guchun, Zhang, Hawking

Christian - ping.

Andrey

On 11/14/19 11:41 AM, Andrey Grodzovsky wrote:
>
> On 11/13/19 10:09 PM, Zhou1, Tao wrote:
>> Two questions:
>>
>> 1. "we lose all reservation during ASIC reset"
>> Are you sure of it? I remember the content of vram may be lost after 
>> reset but the allocated status could be preserved.
>
> Yea, now that I am thinking of it, I might have confused it with BO
> content recovery in amdgpu_device_recover_vram for shadow buffers (which
> are page tables only); for VRAM reservation status alone this might be
> irrelevant... Christian - can you confirm Tao is correct on this?
>
>>
>> 2. You change the bad page handle flow from:
>>
>> detect bad page -> reserve vram for bad page -> save bad page info to 
>> eeprom -> gpu reset
>>
>> to:
>>
>> detect bad page (this time) -> save bad page (last time) info to 
>> eeprom -> gpu reset -> reserve vram for bad page (this time)
>
> Even if I am wrong on the first point and this is irrelevant, still - why
> would the saved bad pages be from last time? See
> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436
> - the save count is the latest, and so is the content of
> data->bps[control->num_recs] up to data->bps[control->num_recs +
> save_count], as those are updated in
> amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages, which is
> called right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages
> in the interrupt sequence.
>
> Andrey
>
>
>>
>> Is that right?  Saving bad page (this time) info to eeprom is delayed 
>> to the next time that bad page is detected? But the time of detecting 
>> bad page is random.
>> I think the bad page info would be lost without saving to eeprom if 
>> power off occurs.
>>
>> detect bad page (this time) -> save bad page (last time) info to 
>> eeprom -> gpu reset -> reserve vram for bad page (this time) -> 
>> poweroff/system reset (and bad page info (this time) is lost)
>>
>> Tao
>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>>> Andrey Grodzovsky
>>> Sent: 2019年11月14日 6:45
>>> To: amd-gfx@lists.freedesktop.org
>>> Cc: alexdeucher@gmail.com; Grodzovsky, Andrey
>>> <Andrey.Grodzovsky@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>;
>>> Zhang, Hawking <Hawking.Zhang@amd.com>
>>> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
>>> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
>>>
>>> We want to be able to call them independently from one another so
>>> that
>>> before GPU reset just amdgpu_ras_save_bad_pages is called.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
>>>   2 files changed, 7 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index 4044834..600a86d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
>>> amdgpu_device *adev,
>>>    * write error record array to eeprom, the function should be
>>>    * protected by recovery_lock
>>>    */
>>> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>>   {
>>>       struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>>       struct ras_err_handler_data *data;
>>> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct
>>> amdgpu_device *adev)
>>>       struct ras_err_handler_data *data;
>>>       uint64_t bp;
>>>       struct amdgpu_bo *bo = NULL;
>>> -    int i, ret = 0;
>>> +    int i;
>>>
>>>       if (!con || !con->eh_data)
>>>           return 0;
>>> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
>>> amdgpu_device *adev)
>>>           data->last_reserved = i + 1;
>>>           bo = NULL;
>>>       }
>>> -
>>> -    /* continue to save bad pages to eeprom even reesrve_vram fails */
>>> -    ret = amdgpu_ras_save_bad_pages(adev);
>>>   out:
>>>       mutex_unlock(&con->recovery_lock);
>>> -    return ret;
>>> +    return 0;
>>>   }
>>>
>>>   /* called when driver unload */
>>> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
>>> amdgpu_device *adev)
>>>           ret = amdgpu_ras_load_bad_pages(adev);
>>>           if (ret)
>>>               goto free;
>>> -        ret = amdgpu_ras_reserve_bad_pages(adev);
>>> -        if (ret)
>>> -            goto release;
>>> +        amdgpu_ras_reserve_bad_pages(adev);
>>>       }
>>>
>>>       return 0;
>>>
>>> -release:
>>>       amdgpu_ras_release_bad_pages(adev);
>>>   free:
>>>       kfree((*data)->bps);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> index f80fd34..12b0797 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> @@ -492,6 +492,8 @@ unsigned long
>>> amdgpu_ras_query_error_count(struct amdgpu_device *adev,  int
>>> amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>           struct eeprom_table_record *bps, int pages);
>>>
>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
>>> +
>>>   int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
>>>
>>>   static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev, 
>>> @@ -
>>> 503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
>>> amdgpu_device *adev,
>>>        * i2c may be unstable in gpu reset
>>>        */
>>>       if (in_task())
>>> -        amdgpu_ras_reserve_bad_pages(adev);
>>> +        amdgpu_ras_save_bad_pages(adev);
>>>
>>>       if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>           schedule_work(&ras->recovery_work);
>>> -- 
>>> 2.7.4
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
       [not found]             ` <9ac78bf3-664b-ba1a-8579-bb802e3860e4-5C7GfCeVMHo@public.gmane.org>
@ 2019-11-18 14:17               ` Christian König
  2019-11-18 14:17                 ` Christian König
  0 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2019-11-18 14:17 UTC (permalink / raw)
  To: Andrey Grodzovsky, Zhou1, Tao, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: alexdeucher-Re5JQEeQqe8AvxtiuMwx3w, Chen, Guchun, Zhang, Hawking

On 18.11.19 at 15:05, Andrey Grodzovsky wrote:
> Christian - ping.
>
> Andrey
>
> On 11/14/19 11:41 AM, Andrey Grodzovsky wrote:
>>
>> On 11/13/19 10:09 PM, Zhou1, Tao wrote:
>>> Two questions:
>>>
>>> 1. "we lose all reservation during ASIC reset"
>>> Are you sure about that? I remember the content of vram may be lost 
>>> after reset, but the allocation status could be preserved.
>>
>> Yeah, now that I think about it, I might have confused this with the 
>> BO content recovery in amdgpu_device_recover_vram for shadow buffers, 
>> which are page tables only; for plain VRAM reservation status this 
>> might be irrelevant... Christian - can you confirm Tao is correct on 
>> this?

Yeah, that is correct. The BO structure stays, just the content is lost.

You guys should maybe consider moving all that stuff into 
amdgpu_vram_mgr.c. It is far easier to handle there because you don't 
need to care about memory which is currently allocated.

Regards,
Christian.

>>
>>>
>>> 2. You change the bad page handle flow from:
>>>
>>> detect bad page -> reserve vram for bad page -> save bad page info 
>>> to eeprom -> gpu reset
>>>
>>> to:
>>>
>>> detect bad page (this time) -> save bad page (last time) info to 
>>> eeprom -> gpu reset -> reserve vram for bad page (this time)
>>
>> Even if I am wrong on the first point this becomes irrelevant, but 
>> still - why would the saved bad pages be from last time? See 
>> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436 
>> - the save count is current, and so is the content of 
>> data->bps[control->num_recs] up to data->bps[control->num_recs + 
>> save_count], since those are updated in 
>> amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages, which is 
>> called right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages 
>> in the interrupt sequence.
>>
>> Andrey
>>
>>
>>>
>>> Is that right? Saving this time's bad page info to eeprom is 
>>> delayed until the next time a bad page is detected? But bad pages 
>>> are detected at random times.
>>> I think the bad page info would be lost, without having been saved 
>>> to eeprom, if a power off occurs first.
>>>
>>> detect bad page (this time) -> save bad page (last time) info to 
>>> eeprom -> gpu reset -> reserve vram for bad page (this time) -> 
>>> poweroff/system reset (and bad page info (this time) is lost)
>>>
>>> Tao
>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>>>> Andrey Grodzovsky
>>>> Sent: November 14, 2019 6:45
>>>> To: amd-gfx@lists.freedesktop.org
>>>> Cc: alexdeucher@gmail.com; Grodzovsky, Andrey
>>>> <Andrey.Grodzovsky@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>;
>>>> Zhang, Hawking <Hawking.Zhang@amd.com>
>>>> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
>>>> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
>>>>
>>>> We want to be able to call them independently from one another, so 
>>>> that before GPU reset only amdgpu_ras_save_bad_pages is called.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
>>>>   2 files changed, 7 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>> index 4044834..600a86d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>>    * write error record array to eeprom, the function should be
>>>>    * protected by recovery_lock
>>>>    */
>>>> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>>>   {
>>>>       struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>>>       struct ras_err_handler_data *data;
>>>> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)
>>>>       struct ras_err_handler_data *data;
>>>>       uint64_t bp;
>>>>       struct amdgpu_bo *bo = NULL;
>>>> -    int i, ret = 0;
>>>> +    int i;
>>>>
>>>>       if (!con || !con->eh_data)
>>>>           return 0;
>>>> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)
>>>>           data->last_reserved = i + 1;
>>>>           bo = NULL;
>>>>       }
>>>> -
>>>> -    /* continue to save bad pages to eeprom even reesrve_vram fails */
>>>> -    ret = amdgpu_ras_save_bad_pages(adev);
>>>>   out:
>>>>       mutex_unlock(&con->recovery_lock);
>>>> -    return ret;
>>>> +    return 0;
>>>>   }
>>>>
>>>>   /* called when driver unload */
>>>> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
>>>>           ret = amdgpu_ras_load_bad_pages(adev);
>>>>           if (ret)
>>>>               goto free;
>>>> -        ret = amdgpu_ras_reserve_bad_pages(adev);
>>>> -        if (ret)
>>>> -            goto release;
>>>> +        amdgpu_ras_reserve_bad_pages(adev);
>>>>       }
>>>>
>>>>       return 0;
>>>>
>>>> -release:
>>>>       amdgpu_ras_release_bad_pages(adev);
>>>>   free:
>>>>       kfree((*data)->bps);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>> index f80fd34..12b0797 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>> @@ -492,6 +492,8 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>>>>   int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>>           struct eeprom_table_record *bps, int pages);
>>>>
>>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
>>>> +
>>>>   int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
>>>>
>>>>   static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
>>>> @@ -503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
>>>>        * i2c may be unstable in gpu reset
>>>>        */
>>>>       if (in_task())
>>>> -        amdgpu_ras_reserve_bad_pages(adev);
>>>> +        amdgpu_ras_save_bad_pages(adev);
>>>>
>>>>       if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>>           schedule_work(&ras->recovery_work);
>>>> -- 
>>>> 2.7.4
>>>>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-11-18 14:17 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-13 22:45 [PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages Andrey Grodzovsky
2019-11-13 22:45 ` Andrey Grodzovsky
     [not found] ` <1573685125-2335-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
2019-11-13 22:45   ` [PATCH 2/2] drm/amdgpu: Reserve bad pages after ASIC was reset Andrey Grodzovsky
2019-11-13 22:45     ` Andrey Grodzovsky
2019-11-14  3:09   ` [PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages Zhou1, Tao
2019-11-14  3:09     ` Zhou1, Tao
     [not found]     ` <MN2PR12MB3054761DA7F4EDEC9975F8AAB0710-rweVpJHSKTqnT25eLM+iUQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2019-11-14 16:41       ` Andrey Grodzovsky
2019-11-14 16:41         ` Andrey Grodzovsky
     [not found]         ` <63578cfb-83c2-6887-b7da-7715ef6ed3a9-5C7GfCeVMHo@public.gmane.org>
2019-11-15  2:42           ` Zhou1, Tao
2019-11-15  2:42             ` Zhou1, Tao
2019-11-18 14:05           ` Andrey Grodzovsky
2019-11-18 14:05             ` Andrey Grodzovsky
     [not found]             ` <9ac78bf3-664b-ba1a-8579-bb802e3860e4-5C7GfCeVMHo@public.gmane.org>
2019-11-18 14:17               ` Christian König
2019-11-18 14:17                 ` Christian König

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).