All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
@ 2021-07-19  5:57 Jingwen Chen
  2021-07-19  8:24 ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Jingwen Chen @ 2021-07-19  5:57 UTC (permalink / raw)
  To: amd-gfx; +Cc: horace.chen, Jingwen Chen, monk.liu

[Why]
Current vm_pte entities have NORMAL priority, in SRIOV multi-vf
use case, the vf flr happens first and then job time out is found.
There can be several jobs timeout during a very small time slice.
And if the innocent sdma job time out is found before the real bad
job, then the innocent sdma job will be set to guilty as it only
has NORMAL priority. This will lead to a page fault after
resubmitting job.

[How]
sdma should always have KERNEL priority. The kernel job will always
be resubmitted.

Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 358316d6a38c..f7526b67cc5d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2923,13 +2923,13 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 	INIT_LIST_HEAD(&vm->done);
 
 	/* create scheduler entities for page table updates */
-	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
+	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_KERNEL,
 				  adev->vm_manager.vm_pte_scheds,
 				  adev->vm_manager.vm_pte_num_scheds, NULL);
 	if (r)
 		return r;
 
-	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
+	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_KERNEL,
 				  adev->vm_manager.vm_pte_scheds,
 				  adev->vm_manager.vm_pte_num_scheds, NULL);
 	if (r)
-- 
2.25.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
  2021-07-19  5:57 [PATCH] drm/amd/amdgpu: vm entities should have kernel priority Jingwen Chen
@ 2021-07-19  8:24 ` Christian König
  2021-07-19  9:40   ` Liu, Monk
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2021-07-19  8:24 UTC (permalink / raw)
  To: Jingwen Chen, amd-gfx; +Cc: horace.chen, monk.liu

Am 19.07.21 um 07:57 schrieb Jingwen Chen:
> [Why]
> Current vm_pte entities have NORMAL priority, in SRIOV multi-vf
> use case, the vf flr happens first and then job time out is found.
> There can be several jobs timeout during a very small time slice.
> And if the innocent sdma job time out is found before the real bad
> job, then the innocent sdma job will be set to guilty as it only
> has NORMAL priority. This will lead to a page fault after
> resubmitting job.
>
> [How]
> sdma should always have KERNEL priority. The kernel job will always
> be resubmitted.

I'm not sure if that is a good idea. We intentionally didn't gave the 
page table updates kernel priority to avoid clashing with the move jobs.

Christian.

>
> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 358316d6a38c..f7526b67cc5d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2923,13 +2923,13 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>   	INIT_LIST_HEAD(&vm->done);
>   
>   	/* create scheduler entities for page table updates */
> -	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
> +	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_KERNEL,
>   				  adev->vm_manager.vm_pte_scheds,
>   				  adev->vm_manager.vm_pte_num_scheds, NULL);
>   	if (r)
>   		return r;
>   
> -	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
> +	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_KERNEL,
>   				  adev->vm_manager.vm_pte_scheds,
>   				  adev->vm_manager.vm_pte_num_scheds, NULL);
>   	if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
  2021-07-19  8:24 ` Christian König
@ 2021-07-19  9:40   ` Liu, Monk
  2021-07-19  9:42     ` Liu, Monk
  0 siblings, 1 reply; 9+ messages in thread
From: Liu, Monk @ 2021-07-19  9:40 UTC (permalink / raw)
  To: Christian König, Chen, JingWen, amd-gfx; +Cc: Chen, Horace

[AMD Official Use Only]

If there is move jobs clashing there we probably need to fix the bugs of those move jobs

Previously I believe you also remember that we agreed to always trust kernel jobs especially paging jobs,

Without set paging jobs' priority to KERNEL level how can we keep that protocol ? do you have a better idea?

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com> 
Sent: Monday, July 19, 2021 4:25 PM
To: Chen, JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Chen, Horace <Horace.Chen@amd.com>; Liu, Monk <Monk.Liu@amd.com>
Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority

Am 19.07.21 um 07:57 schrieb Jingwen Chen:
> [Why]
> Current vm_pte entities have NORMAL priority, in SRIOV multi-vf use 
> case, the vf flr happens first and then job time out is found.
> There can be several jobs timeout during a very small time slice.
> And if the innocent sdma job time out is found before the real bad 
> job, then the innocent sdma job will be set to guilty as it only has 
> NORMAL priority. This will lead to a page fault after resubmitting 
> job.
>
> [How]
> sdma should always have KERNEL priority. The kernel job will always be 
> resubmitted.

I'm not sure if that is a good idea. We intentionally didn't gave the page table updates kernel priority to avoid clashing with the move jobs.

Christian.

>
> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 358316d6a38c..f7526b67cc5d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2923,13 +2923,13 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>   	INIT_LIST_HEAD(&vm->done);
>   
>   	/* create scheduler entities for page table updates */
> -	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
> +	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_KERNEL,
>   				  adev->vm_manager.vm_pte_scheds,
>   				  adev->vm_manager.vm_pte_num_scheds, NULL);
>   	if (r)
>   		return r;
>   
> -	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
> +	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_KERNEL,
>   				  adev->vm_manager.vm_pte_scheds,
>   				  adev->vm_manager.vm_pte_num_scheds, NULL);
>   	if (r)
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
  2021-07-19  9:40   ` Liu, Monk
@ 2021-07-19  9:42     ` Liu, Monk
  2021-07-19 11:10       ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Liu, Monk @ 2021-07-19  9:42 UTC (permalink / raw)
  To: Christian König, Chen, JingWen, amd-gfx; +Cc: Chen, Horace

[AMD Official Use Only]

Besides, I think our current KMD have three types of kernel sdma jobs:
1) adev->mman.entity, it is already a KERNEL priority entity
2) vm->immediate
3) vm->delay

Do you mean now vm->immediate or delay are used as moving jobs instead of mman.entity ?

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Liu, Monk 
Sent: Monday, July 19, 2021 5:40 PM
To: 'Christian König' <ckoenig.leichtzumerken@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Chen, Horace <Horace.Chen@amd.com>
Subject: RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority

[AMD Official Use Only]

If there is move jobs clashing there we probably need to fix the bugs of those move jobs

Previously I believe you also remember that we agreed to always trust kernel jobs especially paging jobs,

Without set paging jobs' priority to KERNEL level how can we keep that protocol ? do you have a better idea?

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com>
Sent: Monday, July 19, 2021 4:25 PM
To: Chen, JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Chen, Horace <Horace.Chen@amd.com>; Liu, Monk <Monk.Liu@amd.com>
Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority

Am 19.07.21 um 07:57 schrieb Jingwen Chen:
> [Why]
> Current vm_pte entities have NORMAL priority, in SRIOV multi-vf use 
> case, the vf flr happens first and then job time out is found.
> There can be several jobs timeout during a very small time slice.
> And if the innocent sdma job time out is found before the real bad 
> job, then the innocent sdma job will be set to guilty as it only has 
> NORMAL priority. This will lead to a page fault after resubmitting 
> job.
>
> [How]
> sdma should always have KERNEL priority. The kernel job will always be 
> resubmitted.

I'm not sure if that is a good idea. We intentionally didn't gave the page table updates kernel priority to avoid clashing with the move jobs.

Christian.

>
> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 358316d6a38c..f7526b67cc5d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2923,13 +2923,13 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>   	INIT_LIST_HEAD(&vm->done);
>   
>   	/* create scheduler entities for page table updates */
> -	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
> +	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_KERNEL,
>   				  adev->vm_manager.vm_pte_scheds,
>   				  adev->vm_manager.vm_pte_num_scheds, NULL);
>   	if (r)
>   		return r;
>   
> -	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
> +	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_KERNEL,
>   				  adev->vm_manager.vm_pte_scheds,
>   				  adev->vm_manager.vm_pte_num_scheds, NULL);
>   	if (r)
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
  2021-07-19  9:42     ` Liu, Monk
@ 2021-07-19 11:10       ` Christian König
  2021-07-20  7:49         ` Chen, JingWen
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2021-07-19 11:10 UTC (permalink / raw)
  To: Liu, Monk, Chen, JingWen, amd-gfx; +Cc: Chen, Horace

Am 19.07.21 um 11:42 schrieb Liu, Monk:
> [AMD Official Use Only]
>
> Besides, I think our current KMD have three types of kernel sdma jobs:
> 1) adev->mman.entity, it is already a KERNEL priority entity
> 2) vm->immediate
> 3) vm->delay
>
> Do you mean now vm->immediate or delay are used as moving jobs instead of mman.entity ?

No, exactly that's the point. vm->immediate and vm->delayed are not for 
kernel paging jobs.

Those are used for userspace page table updates.

I agree that those should probably not considered guilty, but modifying 
the priority of them is not the right way of doing that.

Regards,
Christian.

>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Monday, July 19, 2021 5:40 PM
> To: 'Christian König' <ckoenig.leichtzumerken@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Chen, Horace <Horace.Chen@amd.com>
> Subject: RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
>
> [AMD Official Use Only]
>
> If there is move jobs clashing there we probably need to fix the bugs of those move jobs
>
> Previously I believe you also remember that we agreed to always trust kernel jobs especially paging jobs,
>
> Without set paging jobs' priority to KERNEL level how can we keep that protocol ? do you have a better idea?
>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Monday, July 19, 2021 4:25 PM
> To: Chen, JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Chen, Horace <Horace.Chen@amd.com>; Liu, Monk <Monk.Liu@amd.com>
> Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
>
> Am 19.07.21 um 07:57 schrieb Jingwen Chen:
>> [Why]
>> Current vm_pte entities have NORMAL priority, in SRIOV multi-vf use
>> case, the vf flr happens first and then job time out is found.
>> There can be several jobs timeout during a very small time slice.
>> And if the innocent sdma job time out is found before the real bad
>> job, then the innocent sdma job will be set to guilty as it only has
>> NORMAL priority. This will lead to a page fault after resubmitting
>> job.
>>
>> [How]
>> sdma should always have KERNEL priority. The kernel job will always be
>> resubmitted.
> I'm not sure if that is a good idea. We intentionally didn't gave the page table updates kernel priority to avoid clashing with the move jobs.
>
> Christian.
>
>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 358316d6a38c..f7526b67cc5d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -2923,13 +2923,13 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>    	INIT_LIST_HEAD(&vm->done);
>>    
>>    	/* create scheduler entities for page table updates */
>> -	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
>> +	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_KERNEL,
>>    				  adev->vm_manager.vm_pte_scheds,
>>    				  adev->vm_manager.vm_pte_num_scheds, NULL);
>>    	if (r)
>>    		return r;
>>    
>> -	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
>> +	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_KERNEL,
>>    				  adev->vm_manager.vm_pte_scheds,
>>    				  adev->vm_manager.vm_pte_num_scheds, NULL);
>>    	if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
  2021-07-19 11:10       ` Christian König
@ 2021-07-20  7:49         ` Chen, JingWen
  2021-07-20  8:29           ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, JingWen @ 2021-07-20  7:49 UTC (permalink / raw)
  To: Christian König, Liu, Monk, amd-gfx; +Cc: Chen, Horace

[AMD Official Use Only]

Hi Christian,

Even if this is a userspace mapping, it's still packaged by the kernel, so it's always assumed to be correct. In which case modifying the priority should have no side effects. May I know the detailed reason for your concern?

And if we eventually decide not to change the priority, do you have any suggestions about how to make sure the paging jobs are not considered guilty?

Best Regards,
JingWen Chen

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com>
Sent: Monday, July 19, 2021 7:10 PM
To: Liu, Monk <Monk.Liu@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Chen, Horace <Horace.Chen@amd.com>
Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority

Am 19.07.21 um 11:42 schrieb Liu, Monk:
> [AMD Official Use Only]
>
> Besides, I think our current KMD have three types of kernel sdma jobs:
> 1) adev->mman.entity, it is already a KERNEL priority entity
> 2) vm->immediate
> 3) vm->delay
>
> Do you mean now vm->immediate or delay are used as moving jobs instead of mman.entity ?

No, exactly that's the point. vm->immediate and vm->delayed are not for kernel paging jobs.

Those are used for userspace page table updates.

I agree that those should probably not considered guilty, but modifying the priority of them is not the right way of doing that.

Regards,
Christian.

>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Monday, July 19, 2021 5:40 PM
> To: 'Christian König' <ckoenig.leichtzumerken@gmail.com>; Chen,
> JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Chen, Horace <Horace.Chen@amd.com>
> Subject: RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel
> priority
>
> [AMD Official Use Only]
>
> If there is move jobs clashing there we probably need to fix the bugs
> of those move jobs
>
> Previously I believe you also remember that we agreed to always trust
> kernel jobs especially paging jobs,
>
> Without set paging jobs' priority to KERNEL level how can we keep that protocol ? do you have a better idea?
>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Monday, July 19, 2021 4:25 PM
> To: Chen, JingWen <JingWen.Chen2@amd.com>;
> amd-gfx@lists.freedesktop.org
> Cc: Chen, Horace <Horace.Chen@amd.com>; Liu, Monk <Monk.Liu@amd.com>
> Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel
> priority
>
> Am 19.07.21 um 07:57 schrieb Jingwen Chen:
>> [Why]
>> Current vm_pte entities have NORMAL priority, in SRIOV multi-vf use
>> case, the vf flr happens first and then job time out is found.
>> There can be several jobs timeout during a very small time slice.
>> And if the innocent sdma job time out is found before the real bad
>> job, then the innocent sdma job will be set to guilty as it only has
>> NORMAL priority. This will lead to a page fault after resubmitting
>> job.
>>
>> [How]
>> sdma should always have KERNEL priority. The kernel job will always
>> be resubmitted.
> I'm not sure if that is a good idea. We intentionally didn't gave the page table updates kernel priority to avoid clashing with the move jobs.
>
> Christian.
>
>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 358316d6a38c..f7526b67cc5d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -2923,13 +2923,13 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>      INIT_LIST_HEAD(&vm->done);
>>
>>      /* create scheduler entities for page table updates */
>> -    r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
>> +    r = drm_sched_entity_init(&vm->immediate,
>> +DRM_SCHED_PRIORITY_KERNEL,
>>                                adev->vm_manager.vm_pte_scheds,
>>                                adev->vm_manager.vm_pte_num_scheds, NULL);
>>      if (r)
>>              return r;
>>
>> -    r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
>> +    r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_KERNEL,
>>                                adev->vm_manager.vm_pte_scheds,
>>                                adev->vm_manager.vm_pte_num_scheds, NULL);
>>      if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
  2021-07-20  7:49         ` Chen, JingWen
@ 2021-07-20  8:29           ` Christian König
  2021-07-20  9:14             ` Chen, JingWen
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2021-07-20  8:29 UTC (permalink / raw)
  To: Chen, JingWen, Liu, Monk, amd-gfx; +Cc: Chen, Horace

Hi JingWen,

you can look at the job->vm pointer to distinct an userspace submission 
from a kernel submission.

The priority is not really related to the submission type, we just 
happen to treat paging jobs with the highest priority since they are 
important for the system as a whole.

Regards,
Christian.

Am 20.07.21 um 09:49 schrieb Chen, JingWen:
> [AMD Official Use Only]
>
> Hi Christian,
>
> Even if this is a userspace mapping, it's still packaged by the kernel, so it's always assumed to be correct. In which case modifying the priority should have no side effects. May I know the detailed reason for your concern?
>
> And if we eventually decide not to change the priority, do you have any suggestions about how to make sure the paging jobs are not considered guilty?
>
> Best Regards,
> JingWen Chen
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Monday, July 19, 2021 7:10 PM
> To: Liu, Monk <Monk.Liu@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Chen, Horace <Horace.Chen@amd.com>
> Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
>
> Am 19.07.21 um 11:42 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>> Besides, I think our current KMD have three types of kernel sdma jobs:
>> 1) adev->mman.entity, it is already a KERNEL priority entity
>> 2) vm->immediate
>> 3) vm->delay
>>
>> Do you mean now vm->immediate or delay are used as moving jobs instead of mman.entity ?
> No, exactly that's the point. vm->immediate and vm->delayed are not for kernel paging jobs.
>
> Those are used for userspace page table updates.
>
> I agree that those should probably not considered guilty, but modifying the priority of them is not the right way of doing that.
>
> Regards,
> Christian.
>
>> Thanks
>>
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Liu, Monk
>> Sent: Monday, July 19, 2021 5:40 PM
>> To: 'Christian König' <ckoenig.leichtzumerken@gmail.com>; Chen,
>> JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Chen, Horace <Horace.Chen@amd.com>
>> Subject: RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel
>> priority
>>
>> [AMD Official Use Only]
>>
>> If there is move jobs clashing there we probably need to fix the bugs
>> of those move jobs
>>
>> Previously I believe you also remember that we agreed to always trust
>> kernel jobs especially paging jobs,
>>
>> Without set paging jobs' priority to KERNEL level how can we keep that protocol ? do you have a better idea?
>>
>> Thanks
>>
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> Sent: Monday, July 19, 2021 4:25 PM
>> To: Chen, JingWen <JingWen.Chen2@amd.com>;
>> amd-gfx@lists.freedesktop.org
>> Cc: Chen, Horace <Horace.Chen@amd.com>; Liu, Monk <Monk.Liu@amd.com>
>> Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel
>> priority
>>
>> Am 19.07.21 um 07:57 schrieb Jingwen Chen:
>>> [Why]
>>> Current vm_pte entities have NORMAL priority, in SRIOV multi-vf use
>>> case, the vf flr happens first and then job time out is found.
>>> There can be several jobs timeout during a very small time slice.
>>> And if the innocent sdma job time out is found before the real bad
>>> job, then the innocent sdma job will be set to guilty as it only has
>>> NORMAL priority. This will lead to a page fault after resubmitting
>>> job.
>>>
>>> [How]
>>> sdma should always have KERNEL priority. The kernel job will always
>>> be resubmitted.
>> I'm not sure if that is a good idea. We intentionally didn't gave the page table updates kernel priority to avoid clashing with the move jobs.
>>
>> Christian.
>>
>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
>>>     1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 358316d6a38c..f7526b67cc5d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -2923,13 +2923,13 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>>       INIT_LIST_HEAD(&vm->done);
>>>
>>>       /* create scheduler entities for page table updates */
>>> -    r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
>>> +    r = drm_sched_entity_init(&vm->immediate,
>>> +DRM_SCHED_PRIORITY_KERNEL,
>>>                                 adev->vm_manager.vm_pte_scheds,
>>>                                 adev->vm_manager.vm_pte_num_scheds, NULL);
>>>       if (r)
>>>               return r;
>>>
>>> -    r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
>>> +    r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_KERNEL,
>>>                                 adev->vm_manager.vm_pte_scheds,
>>>                                 adev->vm_manager.vm_pte_num_scheds, NULL);
>>>       if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
  2021-07-20  8:29           ` Christian König
@ 2021-07-20  9:14             ` Chen, JingWen
  2021-07-20  9:28               ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, JingWen @ 2021-07-20  9:14 UTC (permalink / raw)
  To: Christian König, Liu, Monk, amd-gfx; +Cc: Chen, Horace

[AMD Official Use Only]

Hi Christian,

I agree that changing the priority is not the right way.
So to not consider paging job guilty, do you think it's OK to not increase_karma for it(if (job->vmid == 0))?

Best Regards,
JingWen Chen

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com>
Sent: Tuesday, July 20, 2021 4:29 PM
To: Chen, JingWen <JingWen.Chen2@amd.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Chen, Horace <Horace.Chen@amd.com>
Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority

Hi JingWen,

you can look at the job->vm pointer to distinct an userspace submission
from a kernel submission.

The priority is not really related to the submission type, we just
happen to treat paging jobs with the highest priority since they are
important for the system as a whole.

Regards,
Christian.

Am 20.07.21 um 09:49 schrieb Chen, JingWen:
> [AMD Official Use Only]
>
> Hi Christian,
>
> Even if this is a userspace mapping, it's still packaged by the kernel, so it's always assumed to be correct. In which case modifying the priority should have no side effects. May I know the detailed reason for your concern?
>
> And if we eventually decide not to change the priority, do you have any suggestions about how to make sure the paging jobs are not considered guilty?
>
> Best Regards,
> JingWen Chen
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Monday, July 19, 2021 7:10 PM
> To: Liu, Monk <Monk.Liu@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Chen, Horace <Horace.Chen@amd.com>
> Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
>
> Am 19.07.21 um 11:42 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>> Besides, I think our current KMD have three types of kernel sdma jobs:
>> 1) adev->mman.entity, it is already a KERNEL priority entity
>> 2) vm->immediate
>> 3) vm->delay
>>
>> Do you mean now vm->immediate or delay are used as moving jobs instead of mman.entity ?
> No, exactly that's the point. vm->immediate and vm->delayed are not for kernel paging jobs.
>
> Those are used for userspace page table updates.
>
> I agree that those should probably not considered guilty, but modifying the priority of them is not the right way of doing that.
>
> Regards,
> Christian.
>
>> Thanks
>>
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Liu, Monk
>> Sent: Monday, July 19, 2021 5:40 PM
>> To: 'Christian König' <ckoenig.leichtzumerken@gmail.com>; Chen,
>> JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Chen, Horace <Horace.Chen@amd.com>
>> Subject: RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel
>> priority
>>
>> [AMD Official Use Only]
>>
>> If there is move jobs clashing there we probably need to fix the bugs
>> of those move jobs
>>
>> Previously I believe you also remember that we agreed to always trust
>> kernel jobs especially paging jobs,
>>
>> Without set paging jobs' priority to KERNEL level how can we keep that protocol ? do you have a better idea?
>>
>> Thanks
>>
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> Sent: Monday, July 19, 2021 4:25 PM
>> To: Chen, JingWen <JingWen.Chen2@amd.com>;
>> amd-gfx@lists.freedesktop.org
>> Cc: Chen, Horace <Horace.Chen@amd.com>; Liu, Monk <Monk.Liu@amd.com>
>> Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel
>> priority
>>
>> Am 19.07.21 um 07:57 schrieb Jingwen Chen:
>>> [Why]
>>> Current vm_pte entities have NORMAL priority, in SRIOV multi-vf use
>>> case, the vf flr happens first and then job time out is found.
>>> There can be several jobs timeout during a very small time slice.
>>> And if the innocent sdma job time out is found before the real bad
>>> job, then the innocent sdma job will be set to guilty as it only has
>>> NORMAL priority. This will lead to a page fault after resubmitting
>>> job.
>>>
>>> [How]
>>> sdma should always have KERNEL priority. The kernel job will always
>>> be resubmitted.
>> I'm not sure if that is a good idea. We intentionally didn't gave the page table updates kernel priority to avoid clashing with the move jobs.
>>
>> Christian.
>>
>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
>>>     1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 358316d6a38c..f7526b67cc5d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -2923,13 +2923,13 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>>       INIT_LIST_HEAD(&vm->done);
>>>
>>>       /* create scheduler entities for page table updates */
>>> -    r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
>>> +    r = drm_sched_entity_init(&vm->immediate,
>>> +DRM_SCHED_PRIORITY_KERNEL,
>>>                                 adev->vm_manager.vm_pte_scheds,
>>>                                 adev->vm_manager.vm_pte_num_scheds, NULL);
>>>       if (r)
>>>               return r;
>>>
>>> -    r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
>>> +    r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_KERNEL,
>>>                                 adev->vm_manager.vm_pte_scheds,
>>>                                 adev->vm_manager.vm_pte_num_scheds, NULL);
>>>       if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
  2021-07-20  9:14             ` Chen, JingWen
@ 2021-07-20  9:28               ` Christian König
  0 siblings, 0 replies; 9+ messages in thread
From: Christian König @ 2021-07-20  9:28 UTC (permalink / raw)
  To: Chen, JingWen, Liu, Monk, amd-gfx; +Cc: Chen, Horace

I would check (job->vm == NULL) instead, but yes something like that 
should work.

Regards,
Christian.

Am 20.07.21 um 11:14 schrieb Chen, JingWen:
> [AMD Official Use Only]
>
> Hi Christian,
>
> I agree that changing the priority is not the right way.
> So to not consider paging job guilty, do you think it's OK to not increase_karma for it(if (job->vmid == 0))?
>
> Best Regards,
> JingWen Chen
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, July 20, 2021 4:29 PM
> To: Chen, JingWen <JingWen.Chen2@amd.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Chen, Horace <Horace.Chen@amd.com>
> Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
>
> Hi JingWen,
>
> you can look at the job->vm pointer to distinct an userspace submission
> from a kernel submission.
>
> The priority is not really related to the submission type, we just
> happen to treat paging jobs with the highest priority since they are
> important for the system as a whole.
>
> Regards,
> Christian.
>
> Am 20.07.21 um 09:49 schrieb Chen, JingWen:
>> [AMD Official Use Only]
>>
>> Hi Christian,
>>
>> Even if this is a userspace mapping, it's still packaged by the kernel, so it's always assumed to be correct. In which case modifying the priority should have no side effects. May I know the detailed reason for your concern?
>>
>> And if we eventually decide not to change the priority, do you have any suggestions about how to make sure the paging jobs are not considered guilty?
>>
>> Best Regards,
>> JingWen Chen
>>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> Sent: Monday, July 19, 2021 7:10 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Chen, Horace <Horace.Chen@amd.com>
>> Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
>>
>> Am 19.07.21 um 11:42 schrieb Liu, Monk:
>>> [AMD Official Use Only]
>>>
>>> Besides, I think our current KMD have three types of kernel sdma jobs:
>>> 1) adev->mman.entity, it is already a KERNEL priority entity
>>> 2) vm->immediate
>>> 3) vm->delay
>>>
>>> Do you mean now vm->immediate or delay are used as moving jobs instead of mman.entity ?
>> No, exactly that's the point. vm->immediate and vm->delayed are not for kernel paging jobs.
>>
>> Those are used for userspace page table updates.
>>
>> I agree that those should probably not considered guilty, but modifying the priority of them is not the right way of doing that.
>>
>> Regards,
>> Christian.
>>
>>> Thanks
>>>
>>> ------------------------------------------
>>> Monk Liu | Cloud-GPU Core team
>>> ------------------------------------------
>>>
>>> -----Original Message-----
>>> From: Liu, Monk
>>> Sent: Monday, July 19, 2021 5:40 PM
>>> To: 'Christian König' <ckoenig.leichtzumerken@gmail.com>; Chen,
>>> JingWen <JingWen.Chen2@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Chen, Horace <Horace.Chen@amd.com>
>>> Subject: RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel
>>> priority
>>>
>>> [AMD Official Use Only]
>>>
>>> If there is move jobs clashing there we probably need to fix the bugs
>>> of those move jobs
>>>
>>> Previously I believe you also remember that we agreed to always trust
>>> kernel jobs especially paging jobs,
>>>
>>> Without set paging jobs' priority to KERNEL level how can we keep that protocol ? do you have a better idea?
>>>
>>> Thanks
>>>
>>> ------------------------------------------
>>> Monk Liu | Cloud-GPU Core team
>>> ------------------------------------------
>>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>> Sent: Monday, July 19, 2021 4:25 PM
>>> To: Chen, JingWen <JingWen.Chen2@amd.com>;
>>> amd-gfx@lists.freedesktop.org
>>> Cc: Chen, Horace <Horace.Chen@amd.com>; Liu, Monk <Monk.Liu@amd.com>
>>> Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel
>>> priority
>>>
>>> Am 19.07.21 um 07:57 schrieb Jingwen Chen:
>>>> [Why]
>>>> Current vm_pte entities have NORMAL priority, in SRIOV multi-vf use
>>>> case, the vf flr happens first and then job time out is found.
>>>> There can be several jobs timeout during a very small time slice.
>>>> And if the innocent sdma job time out is found before the real bad
>>>> job, then the innocent sdma job will be set to guilty as it only has
>>>> NORMAL priority. This will lead to a page fault after resubmitting
>>>> job.
>>>>
>>>> [How]
>>>> sdma should always have KERNEL priority. The kernel job will always
>>>> be resubmitted.
>>> I'm not sure if that is a good idea. We intentionally didn't gave the page table updates kernel priority to avoid clashing with the move jobs.
>>>
>>> Christian.
>>>
>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
>>>>      1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> index 358316d6a38c..f7526b67cc5d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> @@ -2923,13 +2923,13 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>>>        INIT_LIST_HEAD(&vm->done);
>>>>
>>>>        /* create scheduler entities for page table updates */
>>>> -    r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
>>>> +    r = drm_sched_entity_init(&vm->immediate,
>>>> +DRM_SCHED_PRIORITY_KERNEL,
>>>>                                  adev->vm_manager.vm_pte_scheds,
>>>>                                  adev->vm_manager.vm_pte_num_scheds, NULL);
>>>>        if (r)
>>>>                return r;
>>>>
>>>> -    r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
>>>> +    r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_KERNEL,
>>>>                                  adev->vm_manager.vm_pte_scheds,
>>>>                                  adev->vm_manager.vm_pte_num_scheds, NULL);
>>>>        if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-07-20  9:28 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-19  5:57 [PATCH] drm/amd/amdgpu: vm entities should have kernel priority Jingwen Chen
2021-07-19  8:24 ` Christian König
2021-07-19  9:40   ` Liu, Monk
2021-07-19  9:42     ` Liu, Monk
2021-07-19 11:10       ` Christian König
2021-07-20  7:49         ` Chen, JingWen
2021-07-20  8:29           ` Christian König
2021-07-20  9:14             ` Chen, JingWen
2021-07-20  9:28               ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.