[PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened
@ 2020-12-09  9:46 Liu ChengZhe
  2020-12-09 10:06 ` Christian König
  0 siblings, 1 reply; 5+ messages in thread
From: Liu ChengZhe @ 2020-12-09  9:46 UTC (permalink / raw)
  To: amd-gfx
  Cc: Jack Xiao, Feifei Xu, Kevin Wang, Liu ChengZhe, Tuikov Luben,
	Deucher Alexander, Xiaojie Yuan, Christian König,
	Hawking Zhang

If CS init return -ECANCELED, UMD will free and create new context.
Job in this new context could conitnue exexcuting. In the case of
BACO or mode 1, we can't allow this happpen. Because VRAM has lost
after whole gpu reset, the job can't guarantee to succeed.

Signed-off-by: Liu ChengZhe <ChengZhe.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 85e48c29a57c..2a98f58134ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -120,6 +120,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
 	uint64_t *chunk_array;
 	unsigned size, num_ibs = 0;
 	uint32_t uf_offset = 0;
+	uint32_t vramlost_count = 0;
 	int i;
 	int ret;
 
@@ -140,7 +141,11 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
 
 	/* skip guilty context job */
 	if (atomic_read(&p->ctx->guilty) == 1) {
-		ret = -ECANCELED;
+		vramlost_count = atomic_read(&p->adev->vram_lost_counter);
+		if (p->ctx->vram_lost_counter != vramlost_count)
+			ret = -EINVAL;
+		else
+			ret = -ECANCELED;
 		goto free_chunk;
 	}
 
@@ -246,7 +251,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
 		goto free_all_kdata;
 
 	if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
-		ret = -ECANCELED;
+		ret = -EINVAL;
 		goto free_all_kdata;
 	}
 
-- 
2.25.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened
  2020-12-09  9:46 [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened Liu ChengZhe
@ 2020-12-09 10:06 ` Christian König
  2020-12-10  2:25   ` Liu, Cheng Zhe
  0 siblings, 1 reply; 5+ messages in thread
From: Christian König @ 2020-12-09 10:06 UTC (permalink / raw)
  To: Liu ChengZhe, amd-gfx
  Cc: Jack Xiao, Feifei Xu, Kevin Wang, Tuikov Luben,
	Deucher Alexander, Xiaojie Yuan, Hawking Zhang

Am 09.12.20 um 10:46 schrieb Liu ChengZhe:
> If CS init return -ECANCELED, UMD will free and create new context.
> Job in this new context could conitnue exexcuting. In the case of
> BACO or mode 1, we can't allow this happpen. Because VRAM has lost
> after whole gpu reset, the job can't guarantee to succeed.

NAK, this is intentional.

When ECANCELED is returned UMD should create new context after a GPU 
reset to get back into an usable state and continue to submit jobs.

Regards,
Christian.

>
> Signed-off-by: Liu ChengZhe <ChengZhe.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 9 +++++++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 85e48c29a57c..2a98f58134ed 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -120,6 +120,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>   	uint64_t *chunk_array;
>   	unsigned size, num_ibs = 0;
>   	uint32_t uf_offset = 0;
> +	uint32_t vramlost_count = 0;
>   	int i;
>   	int ret;
>   
> @@ -140,7 +141,11 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>   
>   	/* skip guilty context job */
>   	if (atomic_read(&p->ctx->guilty) == 1) {
> -		ret = -ECANCELED;
> +		vramlost_count = atomic_read(&p->adev->vram_lost_counter);
> +		if (p->ctx->vram_lost_counter != vramlost_count)
> +			ret = -EINVAL;
> +		else
> +			ret = -ECANCELED;
>   		goto free_chunk;
>   	}
>   
> @@ -246,7 +251,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>   		goto free_all_kdata;
>   
>   	if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
> -		ret = -ECANCELED;
> +		ret = -EINVAL;
>   		goto free_all_kdata;
>   	}
>   

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened
  2020-12-09 10:06 ` Christian König
@ 2020-12-10  2:25   ` Liu, Cheng Zhe
  2020-12-10  9:28     ` Christian König
  0 siblings, 1 reply; 5+ messages in thread
From: Liu, Cheng Zhe @ 2020-12-10  2:25 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx
  Cc: Xiao, Jack, Xu, Feifei, Wang,  Kevin(Yang),
	Tuikov, Luben, Deucher, Alexander, Yuan, Xiaojie, Zhang, Hawking

[AMD Public Use]

Yeah, we discussed this issue again. We think it's better UMD make some changes instead of changing in KMD. If FLR happened, it's OK for UMD create new context and continue to submit jobs.
However, if BACO or mode 1 reset happens, of course UMD could also submit jobs, but these jobs can't use any resource create before the reset, including page table.
Because all the contents in VRAM has lost after BACO or mode 1 reset, which including APP's buffer.

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com> 
Sent: Wednesday, December 9, 2020 6:06 PM
To: Liu, Cheng Zhe <ChengZhe.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Xiao, Jack <Jack.Xiao@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, Kevin(Yang) <Kevin1.Wang@amd.com>; Yuan, Xiaojie <Xiaojie.Yuan@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened

Am 09.12.20 um 10:46 schrieb Liu ChengZhe:
> If CS init return -ECANCELED, UMD will free and create new context.
> Job in this new context could conitnue exexcuting. In the case of BACO 
> or mode 1, we can't allow this happpen. Because VRAM has lost after 
> whole gpu reset, the job can't guarantee to succeed.

NAK, this is intentional.

When ECANCELED is returned UMD should create new context after a GPU reset to get back into an usable state and continue to submit jobs.

Regards,
Christian.

>
> Signed-off-by: Liu ChengZhe <ChengZhe.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 9 +++++++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 85e48c29a57c..2a98f58134ed 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -120,6 +120,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>   	uint64_t *chunk_array;
>   	unsigned size, num_ibs = 0;
>   	uint32_t uf_offset = 0;
> +	uint32_t vramlost_count = 0;
>   	int i;
>   	int ret;
>   
> @@ -140,7 +141,11 @@ static int amdgpu_cs_parser_init(struct 
> amdgpu_cs_parser *p, union drm_amdgpu_cs
>   
>   	/* skip guilty context job */
>   	if (atomic_read(&p->ctx->guilty) == 1) {
> -		ret = -ECANCELED;
> +		vramlost_count = atomic_read(&p->adev->vram_lost_counter);
> +		if (p->ctx->vram_lost_counter != vramlost_count)
> +			ret = -EINVAL;
> +		else
> +			ret = -ECANCELED;
>   		goto free_chunk;
>   	}
>   
> @@ -246,7 +251,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>   		goto free_all_kdata;
>   
>   	if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
> -		ret = -ECANCELED;
> +		ret = -EINVAL;
>   		goto free_all_kdata;
>   	}
>   
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened
  2020-12-10  2:25   ` Liu, Cheng Zhe
@ 2020-12-10  9:28     ` Christian König
  2020-12-10  9:40       ` Liu, Cheng Zhe
  0 siblings, 1 reply; 5+ messages in thread
From: Christian König @ 2020-12-10  9:28 UTC (permalink / raw)
  To: Liu, Cheng Zhe, Koenig, Christian, amd-gfx
  Cc: Xiao, Jack, Xu, Feifei, Wang, Kevin(Yang),
	Tuikov, Luben, Deucher, Alexander, Zhang, Hawking, Yuan, Xiaojie

Hi Cheng,

well with whom you discussed that? Such discussions should happen on the 
mailing list and at minimum include Alex and me.

Anyway as discussed before this is not acceptable. We already discussed 
this years ago and the behavior is fixed and not changeable.

Regards,
Christian.

Am 10.12.20 um 03:25 schrieb Liu, Cheng Zhe:
> [AMD Public Use]
>
> Yeah, we discussed this issue again. We think it's better UMD make some changes instead of changing in KMD. If FLR happened, it's OK for UMD create new context and continue to submit jobs.
> However, if BACO or mode 1 reset happens, of course UMD could also submit jobs, but these jobs can't use any resource create before the reset, including page table.
> Because all the contents in VRAM has lost after BACO or mode 1 reset, which including APP's buffer.
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Wednesday, December 9, 2020 6:06 PM
> To: Liu, Cheng Zhe <ChengZhe.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Xiao, Jack <Jack.Xiao@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, Kevin(Yang) <Kevin1.Wang@amd.com>; Yuan, Xiaojie <Xiaojie.Yuan@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened
>
> Am 09.12.20 um 10:46 schrieb Liu ChengZhe:
>> If CS init return -ECANCELED, UMD will free and create new context.
>> Job in this new context could conitnue exexcuting. In the case of BACO
>> or mode 1, we can't allow this happpen. Because VRAM has lost after
>> whole gpu reset, the job can't guarantee to succeed.
> NAK, this is intentional.
>
> When ECANCELED is returned UMD should create new context after a GPU reset to get back into an usable state and continue to submit jobs.
>
> Regards,
> Christian.
>
>> Signed-off-by: Liu ChengZhe <ChengZhe.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 9 +++++++--
>>    1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 85e48c29a57c..2a98f58134ed 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -120,6 +120,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>>    	uint64_t *chunk_array;
>>    	unsigned size, num_ibs = 0;
>>    	uint32_t uf_offset = 0;
>> +	uint32_t vramlost_count = 0;
>>    	int i;
>>    	int ret;
>>    
>> @@ -140,7 +141,11 @@ static int amdgpu_cs_parser_init(struct
>> amdgpu_cs_parser *p, union drm_amdgpu_cs
>>    
>>    	/* skip guilty context job */
>>    	if (atomic_read(&p->ctx->guilty) == 1) {
>> -		ret = -ECANCELED;
>> +		vramlost_count = atomic_read(&p->adev->vram_lost_counter);
>> +		if (p->ctx->vram_lost_counter != vramlost_count)
>> +			ret = -EINVAL;
>> +		else
>> +			ret = -ECANCELED;
>>    		goto free_chunk;
>>    	}
>>    
>> @@ -246,7 +251,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>>    		goto free_all_kdata;
>>    
>>    	if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
>> -		ret = -ECANCELED;
>> +		ret = -EINVAL;
>>    		goto free_all_kdata;
>>    	}
>>    
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened
  2020-12-10  9:28     ` Christian König
@ 2020-12-10  9:40       ` Liu, Cheng Zhe
  0 siblings, 0 replies; 5+ messages in thread
From: Liu, Cheng Zhe @ 2020-12-10  9:40 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx, Liu, Monk, Chen, Horace
  Cc: Xiao, Jack, Xu, Feifei, Wang,  Kevin(Yang),
	Tuikov, Luben, Deucher, Alexander, Zhang, Hawking, Yuan, Xiaojie

[AMD Public Use]

Hi Christian, 

I discussed with @Liu, Monk and @Chen, Horace, it hasn't decided yet.  I'll add you and Alex to the mailing list. 

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com> 
Sent: Thursday, December 10, 2020 5:29 PM
To: Liu, Cheng Zhe <ChengZhe.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Xiao, Jack <Jack.Xiao@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, Kevin(Yang) <Kevin1.Wang@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Yuan, Xiaojie <Xiaojie.Yuan@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened

Hi Cheng,

well with whom you discussed that? Such discussions should happen on the mailing list and at minimum include Alex and me.

Anyway as discussed before this is not acceptable. We already discussed this years ago and the behavior is fixed and not changeable.

Regards,
Christian.

Am 10.12.20 um 03:25 schrieb Liu, Cheng Zhe:
> [AMD Public Use]
>
> Yeah, we discussed this issue again. We think it's better UMD make some changes instead of changing in KMD. If FLR happened, it's OK for UMD create new context and continue to submit jobs.
> However, if BACO or mode 1 reset happens, of course UMD could also submit jobs, but these jobs can't use any resource create before the reset, including page table.
> Because all the contents in VRAM has lost after BACO or mode 1 reset, which including APP's buffer.
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Wednesday, December 9, 2020 6:06 PM
> To: Liu, Cheng Zhe <ChengZhe.Liu@amd.com>; 
> amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Xiao, Jack <Jack.Xiao@amd.com>; Zhang, 
> Hawking <Hawking.Zhang@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, 
> Kevin(Yang) <Kevin1.Wang@amd.com>; Yuan, Xiaojie 
> <Xiaojie.Yuan@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset 
> happened
>
> Am 09.12.20 um 10:46 schrieb Liu ChengZhe:
>> If CS init return -ECANCELED, UMD will free and create new context.
>> Job in this new context could conitnue exexcuting. In the case of 
>> BACO or mode 1, we can't allow this happpen. Because VRAM has lost 
>> after whole gpu reset, the job can't guarantee to succeed.
> NAK, this is intentional.
>
> When ECANCELED is returned UMD should create new context after a GPU reset to get back into an usable state and continue to submit jobs.
>
> Regards,
> Christian.
>
>> Signed-off-by: Liu ChengZhe <ChengZhe.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 9 +++++++--
>>    1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 85e48c29a57c..2a98f58134ed 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -120,6 +120,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>>    	uint64_t *chunk_array;
>>    	unsigned size, num_ibs = 0;
>>    	uint32_t uf_offset = 0;
>> +	uint32_t vramlost_count = 0;
>>    	int i;
>>    	int ret;
>>    
>> @@ -140,7 +141,11 @@ static int amdgpu_cs_parser_init(struct 
>> amdgpu_cs_parser *p, union drm_amdgpu_cs
>>    
>>    	/* skip guilty context job */
>>    	if (atomic_read(&p->ctx->guilty) == 1) {
>> -		ret = -ECANCELED;
>> +		vramlost_count = atomic_read(&p->adev->vram_lost_counter);
>> +		if (p->ctx->vram_lost_counter != vramlost_count)
>> +			ret = -EINVAL;
>> +		else
>> +			ret = -ECANCELED;
>>    		goto free_chunk;
>>    	}
>>    
>> @@ -246,7 +251,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>>    		goto free_all_kdata;
>>    
>>    	if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
>> -		ret = -ECANCELED;
>> +		ret = -EINVAL;
>>    		goto free_all_kdata;
>>    	}
>>    
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7CCh
> engZhe.Liu%40amd.com%7C56e4519b79184cdab3e508d89cedf8ba%7C3dd8961fe488
> 4e608e11a82d994e183d%7C0%7C0%7C637431893767203117%7CUnknown%7CTWFpbGZs
> b3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D
> %7C3000&amp;sdata=sQIc%2FhxC%2BotX49yD7Vhw5FfxL96Nz4IPAkcXMUVZS6c%3D&a
> mp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-12-10  9:41 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-09  9:46 [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened Liu ChengZhe
2020-12-09 10:06 ` Christian König
2020-12-10  2:25   ` Liu, Cheng Zhe
2020-12-10  9:28     ` Christian König
2020-12-10  9:40       ` Liu, Cheng Zhe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.