* [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
@ 2019-09-10 19:41 Andrey Grodzovsky
       [not found] ` <1568144487-27802-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Andrey Grodzovsky @ 2019-09-10 19:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Alexander.Deucher-5C7GfCeVMHo, Andrey Grodzovsky,
	Tao.Zhou1-5C7GfCeVMHo, Guchun.Chen-5C7GfCeVMHo

Problem:
amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
because writing to EEPROM during ASIC reset was unstable.
But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
directly from ISR context and so locking is not allowed. Also it's
irrelevant for this partilcular interrupt as this is generic RAS
interrupt and not memory errors specific.

Fix:
Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 012034d..dd5da3c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
 	/* save bad page to eeprom before gpu reset,
 	 * i2c may be unstable in gpu reset
 	 */
-	amdgpu_ras_reserve_bad_pages(adev);
+	if (in_task())
+		amdgpu_ras_reserve_bad_pages(adev);
+
 	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
 		schedule_work(&ras->recovery_work);
 	return 0;
-- 
2.7.4
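
For reference on why the in_task() gate is enough here: amdgpu_ras_reserve_bad_pages
takes a mutex, which may sleep, while in_task() is true only in ordinary process
context. A minimal sketch of the intent (the wrapper below is illustrative only;
the real macro lives in include/linux/preempt.h):

#include <linux/preempt.h>

/*
 * in_task() is, roughly, for kernels of this era:
 *   !(preempt_count() & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET))
 * i.e. it is false in NMI, hard-IRQ and softirq context, where taking
 * a mutex would be invalid.
 */
static inline void ras_reserve_bad_pages_if_sleepable(struct amdgpu_device *adev)
{
	if (in_task())
		/* process context: the mutex inside the call is safe */
		amdgpu_ras_reserve_bad_pages(adev);
	/* otherwise (e.g. the ERREVENT_ATHUB ISR path): skip the reservation;
	 * the scheduled recovery_work still runs later in process context. */
}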

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
       [not found] ` <1568144487-27802-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-11  3:08   ` Zhou1, Tao
       [not found]     ` <MN2PR12MB3054A0B4D399377417213B76B0B10-rweVpJHSKTqnT25eLM+iUQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2019-09-11  6:54   ` Chen, Guchun
  1 sibling, 1 reply; 11+ messages in thread
From: Zhou1, Tao @ 2019-09-11  3:08 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Deucher, Alexander, Grodzovsky, Andrey, Chen, Guchun

amdgpu_ras_reserve_bad_pages is only used by umc block, so another approach is to move it into amdgpu_umc_process_ras_data_cb.
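
A rough sketch of what that relocation could look like (the callback's exact
signature, second argument of amdgpu_ras_reset_gpu and return convention are
assumptions here, not the actual driver code):

/* Illustrative only: reserve bad pages from the UMC RAS callback, which is
 * assumed to run in process context (reached via the RAS interrupt work
 * handler rather than the ISR), so the mutex inside
 * amdgpu_ras_reserve_bad_pages is safe to take there. */
static int amdgpu_umc_process_ras_data_cb(struct amdgpu_device *adev,
					  void *ras_error_status,
					  struct amdgpu_iv_entry *entry)
{
	/* ... query UMC error counts into ras_error_status ... */

	/* moved here from amdgpu_ras_reset_gpu() */
	amdgpu_ras_reserve_bad_pages(adev);

	/* still kicks off the recovery work; second argument as in
	 * existing callers (assumed) */
	amdgpu_ras_reset_gpu(adev, false);

	return 0;	/* the real callback uses the driver's RAS status codes */
}
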
Anyway, either way is OK and the patch is:

Reviewed-by: Tao Zhou <tao.zhou1@amd.com>

> -----Original Message-----
> From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Sent: 2019年9月11日 3:41
> To: amd-gfx@lists.freedesktop.org
> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
> <Tao.Zhou1@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Grodzovsky, Andrey
> <Andrey.Grodzovsky@amd.com>
> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
> 
> Problem:
> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
> because writing to EEPROM during ASIC reset was unstable.
> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
> directly from ISR context and so locking is not allowed. Also it's irrelevant for
> this partilcular interrupt as this is generic RAS interrupt and not memory
> errors specific.
> 
> Fix:
> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 012034d..dd5da3c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
> amdgpu_device *adev,
>  	/* save bad page to eeprom before gpu reset,
>  	 * i2c may be unstable in gpu reset
>  	 */
> -	amdgpu_ras_reserve_bad_pages(adev);
> +	if (in_task())
> +		amdgpu_ras_reserve_bad_pages(adev);
> +
>  	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>  		schedule_work(&ras->recovery_work);
>  	return 0;
> --
> 2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
       [not found] ` <1568144487-27802-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  2019-09-11  3:08   ` Zhou1, Tao
@ 2019-09-11  6:54   ` Chen, Guchun
  1 sibling, 0 replies; 11+ messages in thread
From: Chen, Guchun @ 2019-09-11  6:54 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Deucher, Alexander, Grodzovsky, Andrey, Zhou1, Tao

Also it's irrelevant for this partilcular interrupt as this is generic RAS interrupt and not memory errors specific.
[Guchun]One typo, it should be "particular", not " partilcular". With that fixed, the patch is: Reviewed-by: Guchun Chen <guchun.chen@amd.com>


-----Original Message-----
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com> 
Sent: Wednesday, September 11, 2019 3:41 AM
To: amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.

Problem:
amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu because writing to EEPROM during ASIC reset was unstable.
But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called directly from ISR context and so locking is not allowed. Also it's irrelevant for this partilcular interrupt as this is generic RAS interrupt and not memory errors specific.
[Guchun]One typo, it should be "particular", not " partilcular". With that fixed, the patch is: Reviewed-by: Guchun Chen <guchun.chen@amd.com>

Fix:
Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 012034d..dd5da3c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
 	/* save bad page to eeprom before gpu reset,
 	 * i2c may be unstable in gpu reset
 	 */
-	amdgpu_ras_reserve_bad_pages(adev);
+	if (in_task())
+		amdgpu_ras_reserve_bad_pages(adev);
+
 	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
 		schedule_work(&ras->recovery_work);
 	return 0;
--
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
       [not found]     ` <MN2PR12MB3054A0B4D399377417213B76B0B10-rweVpJHSKTqnT25eLM+iUQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2019-09-11 14:19       ` Grodzovsky, Andrey
       [not found]         ` <d35cc3f6-ff46-175e-3a92-5f7948f97bef-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Grodzovsky, Andrey @ 2019-09-11 14:19 UTC (permalink / raw)
  To: Zhou1, Tao, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Deucher, Alexander, Chen, Guchun

I like this much more, I will relocate it to amdgpu_umc_process_ras_data_cb
and push.

Andrey

On 9/10/19 11:08 PM, Zhou1, Tao wrote:
> amdgpu_ras_reserve_bad_pages is only used by umc block, so another approach is to move it into amdgpu_umc_process_ras_data_cb.
> Anyway, either way is OK and the patch is:
>
> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
>
>> -----Original Message-----
>> From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Sent: 2019年9月11日 3:41
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
>> <Tao.Zhou1@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>; Grodzovsky, Andrey
>> <Andrey.Grodzovsky@amd.com>
>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>
>> Problem:
>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
>> because writing to EEPROM during ASIC reset was unstable.
>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
>> directly from ISR context and so locking is not allowed. Also it's irrelevant for
>> this partilcular interrupt as this is generic RAS interrupt and not memory
>> errors specific.
>>
>> Fix:
>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> index 012034d..dd5da3c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
>> amdgpu_device *adev,
>>   	/* save bad page to eeprom before gpu reset,
>>   	 * i2c may be unstable in gpu reset
>>   	 */
>> -	amdgpu_ras_reserve_bad_pages(adev);
>> +	if (in_task())
>> +		amdgpu_ras_reserve_bad_pages(adev);
>> +
>>   	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>   		schedule_work(&ras->recovery_work);
>>   	return 0;
>> --
>> 2.7.4
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
       [not found]         ` <d35cc3f6-ff46-175e-3a92-5f7948f97bef-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-11 14:41           ` Grodzovsky, Andrey
       [not found]             ` <603add77-1476-ebc8-69f9-2cf88a788a6b-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Grodzovsky, Andrey @ 2019-09-11 14:41 UTC (permalink / raw)
  To: Zhou1, Tao, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Deucher, Alexander, Chen, Guchun

On second thought this will break reserving bad pages when resetting the
GPU for non-RAS error reasons such as manual reset, S3 or ring timeout
(amdgpu_ras_resume->amdgpu_ras_reset_gpu), so I will keep the
code as is.

Another possible issue in the existing code - it looks like no reservation
will take place in those cases even now, as in amdgpu_ras_reserve_bad_pages
data->last_reserved will be equal to data->count, no? Looks like for
this case you need to add a flag to FORCE reservation for all pages from
0 to data->count.
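
A minimal sketch of that FORCE idea, built on the count/last_reserved
bookkeeping mentioned above (lock, struct and helper names beyond those two
fields are assumptions, not the actual driver code):

/* Hypothetical variant: let non-RAS resets (manual reset, S3, ring timeout)
 * force re-reservation of the whole bad-page list. */
int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev, bool force)
{
	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
	struct ras_err_handler_data *data;
	int i;

	mutex_lock(&con->recovery_lock);
	data = con->eh_data;

	if (force)
		data->last_reserved = 0;	/* re-reserve pages 0..count-1 */

	for (i = data->last_reserved; i < data->count; i++) {
		/* reserve_one_bad_page() is a placeholder for the real VRAM
		 * reservation of record i; a page that is already reserved
		 * would simply fail here and can be skipped. */
		reserve_one_bad_page(adev, data, i);
		data->last_reserved = i + 1;
	}

	mutex_unlock(&con->recovery_lock);
	return 0;
}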

Andrey

On 9/11/19 10:19 AM, Andrey Grodzovsky wrote:
> I like this much more, I will relocate to 
> amdgpu_umc_process_ras_data_cb an push.
>
> Andrey
>
> On 9/10/19 11:08 PM, Zhou1, Tao wrote:
>> amdgpu_ras_reserve_bad_pages is only used by umc block, so another 
>> approach is to move it into amdgpu_umc_process_ras_data_cb.
>> Anyway, either way is OK and the patch is:
>>
>> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
>>
>>> -----Original Message-----
>>> From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> Sent: 2019年9月11日 3:41
>>> To: amd-gfx@lists.freedesktop.org
>>> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
>>> <Tao.Zhou1@amd.com>; Deucher, Alexander
>>> <Alexander.Deucher@amd.com>; Grodzovsky, Andrey
>>> <Andrey.Grodzovsky@amd.com>
>>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>
>>> Problem:
>>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
>>> because writing to EEPROM during ASIC reset was unstable.
>>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
>>> directly from ISR context and so locking is not allowed. Also it's 
>>> irrelevant for
>>> this partilcular interrupt as this is generic RAS interrupt and not 
>>> memory
>>> errors specific.
>>>
>>> Fix:
>>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> index 012034d..dd5da3c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
>>> amdgpu_device *adev,
>>>       /* save bad page to eeprom before gpu reset,
>>>        * i2c may be unstable in gpu reset
>>>        */
>>> -    amdgpu_ras_reserve_bad_pages(adev);
>>> +    if (in_task())
>>> +        amdgpu_ras_reserve_bad_pages(adev);
>>> +
>>>       if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>           schedule_work(&ras->recovery_work);
>>>       return 0;
>>> -- 
>>> 2.7.4
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
       [not found]             ` <603add77-1476-ebc8-69f9-2cf88a788a6b-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-12  1:53               ` Chen, Guchun
       [not found]                 ` <SN6PR12MB2813F0DFFE8EC027AAF6D6DAF1B00-kxOKjb6HO/Hw8A9fYknAbAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Chen, Guchun @ 2019-09-12  1:53 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Zhou1, Tao, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Deucher, Alexander

Comment inline.

Regards,
Guchun

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com> 
Sent: Wednesday, September 11, 2019 10:41 PM
To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun <Guchun.Chen@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.

On second though this will break  what about reserving bad pages when resetting GPU for non RAS error reason such as manual reset ,S3 or ring timeout, (amdgpu_ras_resume->amdgpu_ras_reset_gpu) so i will keep the code as is.

Another possible issue in existing code - looks like no reservation will take place in those case even now as amdgpu_ras_reserve_bad_pages 
data->last_reserved will be equal to data->count , no ? Looks like for
this case you need to add flag to FORCE reservation for all pages from
0 to data->counnt.
[Guchun]Yes, last_reserved is not updated any more unless we unload the driver. So it may always be equal to data->count, and then no new bad page will be reserved.
I see we have an eeprom reset triggered by the user; can we put this last_reserved cleanup in that same path as well?

Andrey

On 9/11/19 10:19 AM, Andrey Grodzovsky wrote:
> I like this much more, I will relocate to 
> amdgpu_umc_process_ras_data_cb an push.
>
> Andrey
>
> On 9/10/19 11:08 PM, Zhou1, Tao wrote:
>> amdgpu_ras_reserve_bad_pages is only used by umc block, so another 
>> approach is to move it into amdgpu_umc_process_ras_data_cb.
>> Anyway, either way is OK and the patch is:
>>
>> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
>>
>>> -----Original Message-----
>>> From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> Sent: 2019年9月11日 3:41
>>> To: amd-gfx@lists.freedesktop.org
>>> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao 
>>> <Tao.Zhou1@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; 
>>> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>
>>> Problem:
>>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu 
>>> because writing to EEPROM during ASIC reset was unstable.
>>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called 
>>> directly from ISR context and so locking is not allowed. Also it's 
>>> irrelevant for this partilcular interrupt as this is generic RAS 
>>> interrupt and not memory errors specific.
>>>
>>> Fix:
>>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> index 012034d..dd5da3c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct 
>>> amdgpu_device *adev,
>>>       /* save bad page to eeprom before gpu reset,
>>>        * i2c may be unstable in gpu reset
>>>        */
>>> -    amdgpu_ras_reserve_bad_pages(adev);
>>> +    if (in_task())
>>> +        amdgpu_ras_reserve_bad_pages(adev);
>>> +
>>>       if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>           schedule_work(&ras->recovery_work);
>>>       return 0;
>>> --
>>> 2.7.4
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
       [not found]                 ` <SN6PR12MB2813F0DFFE8EC027AAF6D6DAF1B00-kxOKjb6HO/Hw8A9fYknAbAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2019-09-12  2:32                   ` Grodzovsky, Andrey
       [not found]                     ` <MWHPR12MB14533B06E13B86E54520E991EAB00-Gy0DoCVfaSWZBIDmKHdw+wdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Grodzovsky, Andrey @ 2019-09-12  2:32 UTC (permalink / raw)
  To: Chen, Guchun, Zhou1, Tao, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Deucher, Alexander

That's not what I meant. Let's say you handled one bad page interrupt and as a result have one bad page reserved. Now an unrelated gfx ring timeout happens which triggers GPU reset and VRAM loss. When you come back from reset, amdgpu_ras_reserve_bad_pages will be called, but since last_reserved == data->count the bad page will not be reserved again; maybe we should just set data->last_reserved to 0 again if VRAM was lost during ASIC reset...
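
A minimal sketch of that idea in the recovery path (the vram_lost flag and the
function it lives in are placeholders for wherever the driver detects VRAM loss,
not the actual code):

/* Hypothetical placement inside the GPU recovery path: if the reset wiped
 * VRAM, forget the previous reservations so the whole bad-page list is
 * reserved again. */
static void ras_rereserve_after_reset(struct amdgpu_device *adev, bool vram_lost)
{
	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);

	if (vram_lost && con && con->eh_data)
		con->eh_data->last_reserved = 0;

	amdgpu_ras_reserve_bad_pages(adev);
}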

Andrey

________________________________________
From: Chen, Guchun <Guchun.Chen@amd.com>
Sent: 11 September 2019 21:53:03
To: Grodzovsky, Andrey; Zhou1, Tao; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander
Subject: RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.

Comment inline.

Regards,
Guchun

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Sent: Wednesday, September 11, 2019 10:41 PM
To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun <Guchun.Chen@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.

On second though this will break  what about reserving bad pages when resetting GPU for non RAS error reason such as manual reset ,S3 or ring timeout, (amdgpu_ras_resume->amdgpu_ras_reset_gpu) so i will keep the code as is.

Another possible issue in existing code - looks like no reservation will take place in those case even now as amdgpu_ras_reserve_bad_pages
data->last_reserved will be equal to data->count , no ? Looks like for
this case you need to add flag to FORCE reservation for all pages from
0 to data->counnt.
[Guchun]Yes, last_reserved is not updated any more, unless we unload our driver. So it maybe always equal to data->count, then no new bad page will be reserved.
I see we have one eeprom reset by user, can we put this last_reserved clean operation to user in the same stack as well?

Andrey

On 9/11/19 10:19 AM, Andrey Grodzovsky wrote:
> I like this much more, I will relocate to
> amdgpu_umc_process_ras_data_cb an push.
>
> Andrey
>
> On 9/10/19 11:08 PM, Zhou1, Tao wrote:
>> amdgpu_ras_reserve_bad_pages is only used by umc block, so another
>> approach is to move it into amdgpu_umc_process_ras_data_cb.
>> Anyway, either way is OK and the patch is:
>>
>> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
>>
>>> -----Original Message-----
>>> From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> Sent: 2019年9月11日 3:41
>>> To: amd-gfx@lists.freedesktop.org
>>> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
>>> <Tao.Zhou1@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>;
>>> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>
>>> Problem:
>>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
>>> because writing to EEPROM during ASIC reset was unstable.
>>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
>>> directly from ISR context and so locking is not allowed. Also it's
>>> irrelevant for this partilcular interrupt as this is generic RAS
>>> interrupt and not memory errors specific.
>>>
>>> Fix:
>>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> index 012034d..dd5da3c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
>>> amdgpu_device *adev,
>>>       /* save bad page to eeprom before gpu reset,
>>>        * i2c may be unstable in gpu reset
>>>        */
>>> -    amdgpu_ras_reserve_bad_pages(adev);
>>> +    if (in_task())
>>> +        amdgpu_ras_reserve_bad_pages(adev);
>>> +
>>>       if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>           schedule_work(&ras->recovery_work);
>>>       return 0;
>>> --
>>> 2.7.4
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
       [not found]                     ` <MWHPR12MB14533B06E13B86E54520E991EAB00-Gy0DoCVfaSWZBIDmKHdw+wdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2019-09-12 11:35                       ` Zhou1, Tao
       [not found]                         ` <MN2PR12MB3054CE8F6F6097847B188457B0B00-rweVpJHSKTqnT25eLM+iUQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Zhou1, Tao @ 2019-09-12 11:35 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Chen, Guchun,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Deucher, Alexander

Hi Andrey:

Are you sure the VRAM content is lost after gpu reset? I'm not very familiar with the details of gpu reset and I'll do an experiment to confirm the case you mentioned.

Regards,
Tao

> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Sent: 2019年9月12日 10:32
> To: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
> <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
> 
> That not what I meant. Let's say you handled one bad page interrupt and as
> a result have one bad page reserved. Now unrelated gfx ring timeout
> happens which triggers GPU reset and VRAM loss. When you come back from
> reset amdgpu_ras_reserve_bad_pages will be called but since last_reserved
> == data_count the bad page will not be reserved again, maybe we should just
> set data->last_reserved to 0 again if VRAM was lost during ASIC reset...
> 
> Andrey
> 
> ________________________________________
> From: Chen, Guchun <Guchun.Chen@amd.com>
> Sent: 11 September 2019 21:53:03
> To: Grodzovsky, Andrey; Zhou1, Tao; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander
> Subject: RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
> 
> Comment inline.
> 
> Regards,
> Guchun
> 
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Sent: Wednesday, September 11, 2019 10:41 PM
> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
> 
> On second though this will break  what about reserving bad pages when
> resetting GPU for non RAS error reason such as manual reset ,S3 or ring
> timeout, (amdgpu_ras_resume->amdgpu_ras_reset_gpu) so i will keep the
> code as is.
> 
> Another possible issue in existing code - looks like no reservation will take
> place in those case even now as amdgpu_ras_reserve_bad_pages
> data->last_reserved will be equal to data->count , no ? Looks like for
> this case you need to add flag to FORCE reservation for all pages from
> 0 to data->counnt.
> [Guchun]Yes, last_reserved is not updated any more, unless we unload our
> driver. So it maybe always equal to data->count, then no new bad page will
> be reserved.
> I see we have one eeprom reset by user, can we put this last_reserved clean
> operation to user in the same stack as well?
> 
> Andrey
> 
> On 9/11/19 10:19 AM, Andrey Grodzovsky wrote:
> > I like this much more, I will relocate to
> > amdgpu_umc_process_ras_data_cb an push.
> >
> > Andrey
> >
> > On 9/10/19 11:08 PM, Zhou1, Tao wrote:
> >> amdgpu_ras_reserve_bad_pages is only used by umc block, so another
> >> approach is to move it into amdgpu_umc_process_ras_data_cb.
> >> Anyway, either way is OK and the patch is:
> >>
> >> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
> >>
> >>> -----Original Message-----
> >>> From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >>> Sent: 2019年9月11日 3:41
> >>> To: amd-gfx@lists.freedesktop.org
> >>> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
> >>> <Tao.Zhou1@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>;
> >>> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> >>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
> >>>
> >>> Problem:
> >>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
> >>> because writing to EEPROM during ASIC reset was unstable.
> >>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
> >>> directly from ISR context and so locking is not allowed. Also it's
> >>> irrelevant for this partilcular interrupt as this is generic RAS
> >>> interrupt and not memory errors specific.
> >>>
> >>> Fix:
> >>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
> >>>
> >>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
> >>>   1 file changed, 3 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >>> index 012034d..dd5da3c 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
> >>> amdgpu_device *adev,
> >>>       /* save bad page to eeprom before gpu reset,
> >>>        * i2c may be unstable in gpu reset
> >>>        */
> >>> -    amdgpu_ras_reserve_bad_pages(adev);
> >>> +    if (in_task())
> >>> +        amdgpu_ras_reserve_bad_pages(adev);
> >>> +
> >>>       if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
> >>>           schedule_work(&ras->recovery_work);
> >>>       return 0;
> >>> --
> >>> 2.7.4
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
       [not found]                         ` <MN2PR12MB3054CE8F6F6097847B188457B0B00-rweVpJHSKTqnT25eLM+iUQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2019-09-12 14:09                           ` Grodzovsky, Andrey
       [not found]                             ` <1caeca1e-40e7-9b59-37f9-47704903655f-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Grodzovsky, Andrey @ 2019-09-12 14:09 UTC (permalink / raw)
  To: Zhou1, Tao, Chen, Guchun, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Deucher, Alexander

I am not sure VRAM loss happens every time, but when it does I would
assume you would have to reserve them again, as the page table contents
were lost. On the other hand, I do remember we keep shadow system memory
copies of all page tables, so maybe that's not an issue. So yes, just try
to allocate the bad page after reset and if it's still reserved you will
fail.

Andrey

On 9/12/19 7:35 AM, Zhou1, Tao wrote:
> Hi Andrey:
>
> Are you sure of the VRAM content loss after gpu reset? I'm not very familiar with the detail of gpu reset and I'll do experiment to confirm the case you mentioned.
>
> Regards,
> Tao
>
>> -----Original Message-----
>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>> Sent: 2019年9月12日 10:32
>> To: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
>> <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>
>> That not what I meant. Let's say you handled one bad page interrupt and as
>> a result have one bad page reserved. Now unrelated gfx ring timeout
>> happens which triggers GPU reset and VRAM loss. When you come back from
>> reset amdgpu_ras_reserve_bad_pages will be called but since last_reserved
>> == data_count the bad page will not be reserved again, maybe we should just
>> set data->last_reserved to 0 again if VRAM was lost during ASIC reset...
>>
>> Andrey
>>
>> ________________________________________
>> From: Chen, Guchun <Guchun.Chen@amd.com>
>> Sent: 11 September 2019 21:53:03
>> To: Grodzovsky, Andrey; Zhou1, Tao; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander
>> Subject: RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>
>> Comment inline.
>>
>> Regards,
>> Guchun
>>
>> -----Original Message-----
>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>> Sent: Wednesday, September 11, 2019 10:41 PM
>> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>
>> On second though this will break  what about reserving bad pages when
>> resetting GPU for non RAS error reason such as manual reset ,S3 or ring
>> timeout, (amdgpu_ras_resume->amdgpu_ras_reset_gpu) so i will keep the
>> code as is.
>>
>> Another possible issue in existing code - looks like no reservation will take
>> place in those case even now as amdgpu_ras_reserve_bad_pages
>> data->last_reserved will be equal to data->count , no ? Looks like for
>> this case you need to add flag to FORCE reservation for all pages from
>> 0 to data->counnt.
>> [Guchun]Yes, last_reserved is not updated any more, unless we unload our
>> driver. So it maybe always equal to data->count, then no new bad page will
>> be reserved.
>> I see we have one eeprom reset by user, can we put this last_reserved clean
>> operation to user in the same stack as well?
>>
>> Andrey
>>
>> On 9/11/19 10:19 AM, Andrey Grodzovsky wrote:
>>> I like this much more, I will relocate to
>>> amdgpu_umc_process_ras_data_cb an push.
>>>
>>> Andrey
>>>
>>> On 9/10/19 11:08 PM, Zhou1, Tao wrote:
>>>> amdgpu_ras_reserve_bad_pages is only used by umc block, so another
>>>> approach is to move it into amdgpu_umc_process_ras_data_cb.
>>>> Anyway, either way is OK and the patch is:
>>>>
>>>> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
>>>>
>>>>> -----Original Message-----
>>>>> From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> Sent: 2019年9月11日 3:41
>>>>> To: amd-gfx@lists.freedesktop.org
>>>>> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
>>>>> <Tao.Zhou1@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>;
>>>>> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>>>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>>
>>>>> Problem:
>>>>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
>>>>> because writing to EEPROM during ASIC reset was unstable.
>>>>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
>>>>> directly from ISR context and so locking is not allowed. Also it's
>>>>> irrelevant for this partilcular interrupt as this is generic RAS
>>>>> interrupt and not memory errors specific.
>>>>>
>>>>> Fix:
>>>>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>> index 012034d..dd5da3c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
>>>>> amdgpu_device *adev,
>>>>>        /* save bad page to eeprom before gpu reset,
>>>>>         * i2c may be unstable in gpu reset
>>>>>         */
>>>>> -    amdgpu_ras_reserve_bad_pages(adev);
>>>>> +    if (in_task())
>>>>> +        amdgpu_ras_reserve_bad_pages(adev);
>>>>> +
>>>>>        if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>>>            schedule_work(&ras->recovery_work);
>>>>>        return 0;
>>>>> --
>>>>> 2.7.4
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
       [not found]                             ` <1caeca1e-40e7-9b59-37f9-47704903655f-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-12 15:15                               ` Christian König
       [not found]                                 ` <91382817-97b0-9ca5-24c6-e7880c4bdb55-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2019-09-12 15:15 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Zhou1, Tao, Chen, Guchun,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Deucher, Alexander

Well I still hope to avoid VRAM loss in most cases, but that is
really not guaranteed.

What is the bad page and why do you need to reserve it?

Christian.

Am 12.09.19 um 16:09 schrieb Grodzovsky, Andrey:
> I am not sure VRAM loss happens every time, but when it does I would
> assume you would have to reserve them again as the page tables content
> was lost. On the other hand I do remember we keep shadow system memory
> copies of all page tables so maybe that not an issue, so yes, just try
> to allocate the bad page after reset and if it's still reserved you will
> fail.
>
> Andrey
>
> On 9/12/19 7:35 AM, Zhou1, Tao wrote:
>> Hi Andrey:
>>
>> Are you sure of the VRAM content loss after gpu reset? I'm not very familiar with the detail of gpu reset and I'll do experiment to confirm the case you mentioned.
>>
>> Regards,
>> Tao
>>
>>> -----Original Message-----
>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>> Sent: 2019年9月12日 10:32
>>> To: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
>>> <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>
>>> That not what I meant. Let's say you handled one bad page interrupt and as
>>> a result have one bad page reserved. Now unrelated gfx ring timeout
>>> happens which triggers GPU reset and VRAM loss. When you come back from
>>> reset amdgpu_ras_reserve_bad_pages will be called but since last_reserved
>>> == data_count the bad page will not be reserved again, maybe we should just
>>> set data->last_reserved to 0 again if VRAM was lost during ASIC reset...
>>>
>>> Andrey
>>>
>>> ________________________________________
>>> From: Chen, Guchun <Guchun.Chen@amd.com>
>>> Sent: 11 September 2019 21:53:03
>>> To: Grodzovsky, Andrey; Zhou1, Tao; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander
>>> Subject: RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>
>>> Comment inline.
>>>
>>> Regards,
>>> Guchun
>>>
>>> -----Original Message-----
>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>> Sent: Wednesday, September 11, 2019 10:41 PM
>>> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Deucher, Alexander
>>> <Alexander.Deucher@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>
>>> On second though this will break  what about reserving bad pages when
>>> resetting GPU for non RAS error reason such as manual reset ,S3 or ring
>>> timeout, (amdgpu_ras_resume->amdgpu_ras_reset_gpu) so i will keep the
>>> code as is.
>>>
>>> Another possible issue in existing code - looks like no reservation will take
>>> place in those case even now as amdgpu_ras_reserve_bad_pages
>>> data->last_reserved will be equal to data->count , no ? Looks like for
>>> this case you need to add flag to FORCE reservation for all pages from
>>> 0 to data->counnt.
>>> [Guchun]Yes, last_reserved is not updated any more, unless we unload our
>>> driver. So it maybe always equal to data->count, then no new bad page will
>>> be reserved.
>>> I see we have one eeprom reset by user, can we put this last_reserved clean
>>> operation to user in the same stack as well?
>>>
>>> Andrey
>>>
>>> On 9/11/19 10:19 AM, Andrey Grodzovsky wrote:
>>>> I like this much more, I will relocate to
>>>> amdgpu_umc_process_ras_data_cb an push.
>>>>
>>>> Andrey
>>>>
>>>> On 9/10/19 11:08 PM, Zhou1, Tao wrote:
>>>>> amdgpu_ras_reserve_bad_pages is only used by umc block, so another
>>>>> approach is to move it into amdgpu_umc_process_ras_data_cb.
>>>>> Anyway, either way is OK and the patch is:
>>>>>
>>>>> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> Sent: 2019年9月11日 3:41
>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
>>>>>> <Tao.Zhou1@amd.com>; Deucher, Alexander
>>> <Alexander.Deucher@amd.com>;
>>>>>> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>>>>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>>>
>>>>>> Problem:
>>>>>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
>>>>>> because writing to EEPROM during ASIC reset was unstable.
>>>>>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
>>>>>> directly from ISR context and so locking is not allowed. Also it's
>>>>>> irrelevant for this partilcular interrupt as this is generic RAS
>>>>>> interrupt and not memory errors specific.
>>>>>>
>>>>>> Fix:
>>>>>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>>>>>     1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>> index 012034d..dd5da3c 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
>>>>>> amdgpu_device *adev,
>>>>>>         /* save bad page to eeprom before gpu reset,
>>>>>>          * i2c may be unstable in gpu reset
>>>>>>          */
>>>>>> -    amdgpu_ras_reserve_bad_pages(adev);
>>>>>> +    if (in_task())
>>>>>> +        amdgpu_ras_reserve_bad_pages(adev);
>>>>>> +
>>>>>>         if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>>>>             schedule_work(&ras->recovery_work);
>>>>>>         return 0;
>>>>>> --
>>>>>> 2.7.4
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
       [not found]                                 ` <91382817-97b0-9ca5-24c6-e7880c4bdb55-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2019-09-12 15:19                                   ` Grodzovsky, Andrey
  0 siblings, 0 replies; 11+ messages in thread
From: Grodzovsky, Andrey @ 2019-09-12 15:19 UTC (permalink / raw)
  To: Koenig, Christian, Zhou1, Tao, Chen, Guchun,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Deucher, Alexander

A RAS error will be triggered if the HW identifies a faulty physical page.
The error comes through an interrupt whose data payload carries
information that can be translated into the bad page address. As a
recovery measure we then reset the ASIC, reserve this bad page so it
cannot be used by anyone else, and store its address to EEPROM for future
reservation after boot.
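
Put as an outline, the flow described above looks roughly like this (every
helper name here is a placeholder, not an actual driver function, and the
driver's real ordering differs slightly, e.g. the EEPROM write happens before
the reset per the comment in the patch):

/* Illustrative outline of the RAS bad-page handling flow. */
static void ras_handle_bad_page_error(struct amdgpu_device *adev,
				      struct amdgpu_iv_entry *entry)
{
	u64 bad_page_addr;

	/* 1. interrupt payload -> physical address of the faulty page */
	bad_page_addr = decode_bad_page_address(adev, entry);

	/* 2. reserve the page so nothing else can allocate it */
	reserve_bad_page(adev, bad_page_addr);

	/* 3. persist it to EEPROM so it is re-reserved on the next boot */
	save_bad_page_to_eeprom(adev, bad_page_addr);

	/* 4. reset the ASIC as the recovery measure */
	schedule_gpu_recovery(adev);
}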

Andrey

On 9/12/19 11:15 AM, Christian König wrote:
> Well I still hope to avoid VRAM lost in most of the cases, but that is 
> really not guaranteed.
>
> What is the bad page and why do you need to reserve it?
>
> Christian.
>
> Am 12.09.19 um 16:09 schrieb Grodzovsky, Andrey:
>> I am not sure VRAM loss happens every time, but when it does I would
>> assume you would have to reserve them again as the page tables content
>> was lost. On the other hand I do remember we keep shadow system memory
>> copies of all page tables so maybe that not an issue, so yes, just try
>> to allocate the bad page after reset and if it's still reserved you will
>> fail.
>>
>> Andrey
>>
>> On 9/12/19 7:35 AM, Zhou1, Tao wrote:
>>> Hi Andrey:
>>>
>>> Are you sure of the VRAM content loss after gpu reset? I'm not very 
>>> familiar with the detail of gpu reset and I'll do experiment to 
>>> confirm the case you mentioned.
>>>
>>> Regards,
>>> Tao
>>>
>>>> -----Original Message-----
>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>>> Sent: 2019年9月12日 10:32
>>>> To: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
>>>> <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>
>>>> That not what I meant. Let's say you handled one bad page interrupt 
>>>> and as
>>>> a result have one bad page reserved. Now unrelated gfx ring timeout
>>>> happens which triggers GPU reset and VRAM loss. When you come back 
>>>> from
>>>> reset amdgpu_ras_reserve_bad_pages will be called but since 
>>>> last_reserved
>>>> == data_count the bad page will not be reserved again, maybe we 
>>>> should just
>>>> set data->last_reserved to 0 again if VRAM was lost during ASIC 
>>>> reset...
>>>>
>>>> Andrey
>>>>
>>>> ________________________________________
>>>> From: Chen, Guchun <Guchun.Chen@amd.com>
>>>> Sent: 11 September 2019 21:53:03
>>>> To: Grodzovsky, Andrey; Zhou1, Tao; amd-gfx@lists.freedesktop.org
>>>> Cc: Deucher, Alexander
>>>> Subject: RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>
>>>> Comment inline.
>>>>
>>>> Regards,
>>>> Guchun
>>>>
>>>> -----Original Message-----
>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>>> Sent: Wednesday, September 11, 2019 10:41 PM
>>>> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Deucher, Alexander
>>>> <Alexander.Deucher@amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>
>>>> On second though this will break  what about reserving bad pages when
>>>> resetting GPU for non RAS error reason such as manual reset ,S3 or 
>>>> ring
>>>> timeout, (amdgpu_ras_resume->amdgpu_ras_reset_gpu) so i will keep the
>>>> code as is.
>>>>
>>>> Another possible issue in existing code - looks like no reservation 
>>>> will take
>>>> place in those case even now as amdgpu_ras_reserve_bad_pages
>>>> data->last_reserved will be equal to data->count , no ? Looks like for
>>>> this case you need to add flag to FORCE reservation for all pages from
>>>> 0 to data->counnt.
>>>> [Guchun]Yes, last_reserved is not updated any more, unless we 
>>>> unload our
>>>> driver. So it maybe always equal to data->count, then no new bad 
>>>> page will
>>>> be reserved.
>>>> I see we have one eeprom reset by user, can we put this 
>>>> last_reserved clean
>>>> operation to user in the same stack as well?
>>>>
>>>> Andrey
>>>>
>>>> On 9/11/19 10:19 AM, Andrey Grodzovsky wrote:
>>>>> I like this much more, I will relocate to
>>>>> amdgpu_umc_process_ras_data_cb an push.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 9/10/19 11:08 PM, Zhou1, Tao wrote:
>>>>>> amdgpu_ras_reserve_bad_pages is only used by umc block, so another
>>>>>> approach is to move it into amdgpu_umc_process_ras_data_cb.
>>>>>> Anyway, either way is OK and the patch is:
>>>>>>
>>>>>> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> Sent: 2019年9月11日 3:41
>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>> Cc: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao
>>>>>>> <Tao.Zhou1@amd.com>; Deucher, Alexander
>>>> <Alexander.Deucher@amd.com>;
>>>>>>> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>>>>>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>>>>
>>>>>>> Problem:
>>>>>>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
>>>>>>> because writing to EEPROM during ASIC reset was unstable.
>>>>>>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
>>>>>>> directly from ISR context and so locking is not allowed. Also it's
>>>>>>> irrelevant for this partilcular interrupt as this is generic RAS
>>>>>>> interrupt and not memory errors specific.
>>>>>>>
>>>>>>> Fix:
>>>>>>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>>>>>>     1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> index 012034d..dd5da3c 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>         /* save bad page to eeprom before gpu reset,
>>>>>>>          * i2c may be unstable in gpu reset
>>>>>>>          */
>>>>>>> -    amdgpu_ras_reserve_bad_pages(adev);
>>>>>>> +    if (in_task())
>>>>>>> +        amdgpu_ras_reserve_bad_pages(adev);
>>>>>>> +
>>>>>>>         if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>>>>>             schedule_work(&ras->recovery_work);
>>>>>>>         return 0;
>>>>>>> -- 
>>>>>>> 2.7.4
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread

Thread overview: 11+ messages
2019-09-10 19:41 [PATCH] drm/amdgpu: Fix mutex lock from atomic context Andrey Grodzovsky
     [not found] ` <1568144487-27802-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
2019-09-11  3:08   ` Zhou1, Tao
     [not found]     ` <MN2PR12MB3054A0B4D399377417213B76B0B10-rweVpJHSKTqnT25eLM+iUQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2019-09-11 14:19       ` Grodzovsky, Andrey
     [not found]         ` <d35cc3f6-ff46-175e-3a92-5f7948f97bef-5C7GfCeVMHo@public.gmane.org>
2019-09-11 14:41           ` Grodzovsky, Andrey
     [not found]             ` <603add77-1476-ebc8-69f9-2cf88a788a6b-5C7GfCeVMHo@public.gmane.org>
2019-09-12  1:53               ` Chen, Guchun
     [not found]                 ` <SN6PR12MB2813F0DFFE8EC027AAF6D6DAF1B00-kxOKjb6HO/Hw8A9fYknAbAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2019-09-12  2:32                   ` Grodzovsky, Andrey
     [not found]                     ` <MWHPR12MB14533B06E13B86E54520E991EAB00-Gy0DoCVfaSWZBIDmKHdw+wdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2019-09-12 11:35                       ` Zhou1, Tao
     [not found]                         ` <MN2PR12MB3054CE8F6F6097847B188457B0B00-rweVpJHSKTqnT25eLM+iUQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2019-09-12 14:09                           ` Grodzovsky, Andrey
     [not found]                             ` <1caeca1e-40e7-9b59-37f9-47704903655f-5C7GfCeVMHo@public.gmane.org>
2019-09-12 15:15                               ` Christian König
     [not found]                                 ` <91382817-97b0-9ca5-24c6-e7880c4bdb55-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2019-09-12 15:19                                   ` Grodzovsky, Andrey
2019-09-11  6:54   ` Chen, Guchun
