Re: [RFC PATCH 2/4] drm/scheduler: implement a function to modify sched list

From: Luben Tuikov <luben.tuikov@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
	"Nirmoy Das" <nirmoy.aiemd@gmail.com>,
	amd-gfx@lists.freedesktop.org
Cc: alexander.deucher@amd.com, luben.tuikov@amd.com,
	Ray.Huang@amd.com, nirmoy.das@amd.com
Subject: Re: [RFC PATCH 2/4] drm/scheduler: implement a function to modify sched list
Date: Mon, 2 Mar 2020 15:47:28 -0500	[thread overview]
Message-ID: <069d5a87-9792-87e5-432a-6c6456beeed6@amd.com> (raw)
In-Reply-To: <d87c9952-1eb8-5295-a1bc-475f5e94fc78@amd.com>

On 2020-02-28 2:47 a.m., Christian König wrote:
> Am 28.02.20 um 06:08 schrieb Luben Tuikov:
>> On 2020-02-27 4:40 p.m., Nirmoy Das wrote:
>>> implement drm_sched_entity_modify_sched() which can modify existing
>>> sched_list with a different one. This is going to be helpful when
>>> userspace changes priority of a ctx/entity then driver can switch to
>>> corresponding hw shced list for that priority
>>>
>>> Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_entity.c | 24 ++++++++++++++++++++++++
>>>   include/drm/gpu_scheduler.h              |  4 ++++
>>>   2 files changed, 28 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index 63bccd201b97..711e9d504bcb 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -83,6 +83,30 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>>   }
>>>   EXPORT_SYMBOL(drm_sched_entity_init);
>>>   
>>> +/**
>>> + * drm_sched_entity_modify_sched - Modify sched of an entity
>>> + *
>>> + * @entity: scheduler entity to init
>>> + * @sched_list: the list of new drm scheds which will replace
>>> + *		existing entity->sched_list
>>> + * @num_sched_list: number of drm sched in sched_list
>>> + *
>>> + * Returns 0 on success or a negative error code on failure.
>>> + */
>>> +int drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>> +				  struct drm_gpu_scheduler **sched_list,
>>> +				  unsigned int num_sched_list)
>>> +{
>>> +	if (!(entity && sched_list && (num_sched_list == 0 || sched_list[0])))
>>> +		return -EINVAL;
>> This seems unmaintainable. I'd write it in its natural form:
> 
> This is probably just copy & pasted from the init function and complete 
> overkill here.

It should be an easy rejection then. Statements like this make
the code unmaintainable. Regardless of whether it was copy-and-pasted
I wanted to emphasize the lack of simplification of what
was being done.

We need to put intention and sense into what we're doing, scrutinizing
every line we put into a patch. This is why I suggested
the simplification here:

> 
>>
>> int drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>> 				  struct drm_gpu_scheduler **sched_list,
>> 				  unsigned int num_sched_list)
>> {
>> 	if (entity && sched_list && (num_sched_list == 0 || sched_list[0] != NULL)) {
>> 		entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
>> 		entity->num_sched_list = num_sched_list;
>> 		return 0;
>> 	} else {
>> 		return -EINVAL;
>> 	}
>> }
> 
> Actually that's a rather bad idea. Error handling should always be in 

I actually don't think that it is a "rather bad idea". At all.
I actually think that it makes this leaf function more clear to
understand as the conditional would read like a sentence in prose.

> the form of:
> 
> if (check_error || missing_prerequisite)
>      return_or_goto_cleanup;

I don't think we should generalize across the board. We should be
more flexible in order to create clear and maintainable code.

> 
>> That's too heavy. Can we improve the architecture
>> so we don't have to check for this in leaf functions like this one?
>> We can just return a parameterization.
>>
>> Why would this be called with entity being NULL?
>> Or with sched_list being NULL? Or num_sched_list being not zero
>> yet sched_list[0] being NULL? Why not make sure that sched_list[0] is
>> never NULL and that num_sched_list is greater than 0 always?
>>
>> Does this make it to user space?
>> Why would the state of execution be one such that this is true/false
>> for the code to return -EINVAL?
>>  From patch 3/4 it seems that an error is printed inside amdgpu_ctx_priority_override()
>> and the result is not propagated up the stack.
>>
>> I think we should improve the code where here this condition above
>> is never true. Then we can use parameterization for those two
>> statements below:
>>
>>> +
>>> +	entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
>> So if we're here, we know from the top conditional that
>> either num_sched_list is 0 or that sched_list[0] not NULL
>> or both.
>>
>> So if num_sched_list is 0 or 1 we return NULL?
>> And if num_sched_list is 2 or greater we return sched_list
>> of which sched_list[0] could be NULL?
> 
> This is also copy&pasted from the init code and completely wrong here.

So even more reasons to reject this patch.

> 
> What we should do instead is just: WARN_ON(!num_sched_list || !sched_list);

Again, what does that *mean*? What does the check mean and what
does the num_sched_list == 0 or sched_list == NULL mean?
And how did we get into a situation like this where either or both
could be nil?

Wouldn't it be better to simplify or re-architecture this (we only recently
decided to hide physical rings from user-space), so that the code
is elegant (meaning no if-else) yet flexible and straightforward?

> 
> And to the checking for not keeping around the scheduler list in the 
> init function.
> 
>> Why not fix the architecture so that this is simply copied?
> 
> We had that and moved away from it because the scheduler list is 
> actually const and shouldn't be allocated with each entity (which we can 
> easily have thousands of).

I think that peppering the code with if-else conditionals
everywhere as these patch-series into the DRM scheduler have been,
would make the code unmaintainable in the long run.

Regards,
Luben

> 
> Regards,
> Christian.
> 
>>   (in which case
>> we can simply index-parameterize it and simply copy it.
>> Why are there so many checks everywhere?
>>
>>> +	entity->num_sched_list = num_sched_list;
>>> +
>> I mean, all we're doing in this function is initializing
>> entity->sched_list and entity->num_sched_list. Why does this
>> function have to be so complex and do so many checks?
>> Can we improve/fix the architecture instead?
>>
>> Regards,
>> Luben
>>
>>
>>> +	return 0;
>>> +}
>>> +EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>>> +
>>>   /**
>>>    * drm_sched_entity_is_idle - Check if entity is idle
>>>    *
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 589be851f8a1..0c164a96d51b 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -297,6 +297,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched);
>>>   int drm_sched_job_init(struct drm_sched_job *job,
>>>   		       struct drm_sched_entity *entity,
>>>   		       void *owner);
>>> +int drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>> +				  struct drm_gpu_scheduler **sched_list,
>>> +                                  unsigned int num_sched_list);
>>> +
>>>   void drm_sched_job_cleanup(struct drm_sched_job *job);
>>>   void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>>>   void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>>>
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx