Re: [Freedreno] [PATCH 2/2] drm/scheduler: Add a start_timeout_notify() backend function

From: Sharat Masetty <smasetty@codeaurora.org>
To: "Koenig, Christian" <Christian.Koenig@amd.com>,
	"freedreno@lists.freedesktop.org"
	<freedreno@lists.freedesktop.org>
Cc: "linux-arm-msm@vger.kernel.org" <linux-arm-msm@vger.kernel.org>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>
Subject: Re: [Freedreno] [PATCH 2/2] drm/scheduler: Add a start_timeout_notify() backend function
Date: Mon, 5 Nov 2018 13:25:08 +0530	[thread overview]
Message-ID: <bbfa8eae-23f5-5484-5abb-bf55dee64e34@codeaurora.org> (raw)
In-Reply-To: <ad50a045-3048-bd6f-8d21-30adf54771a7@amd.com>

On 11/2/2018 7:07 PM, Koenig, Christian wrote:
> Am 02.11.18 um 14:25 schrieb Sharat Masetty:
>>
>>
>> On 11/2/2018 4:09 PM, Koenig, Christian wrote:
>>> Am 02.11.18 um 11:31 schrieb Sharat Masetty:
>>>> Add an optional backend function op which will let the scheduler
>>>> clients
>>>> know when the timeout got scheduled on the scheduler instance. This
>>>> will
>>>> help drivers with multiple schedulers(one per ring) measure time
>>>> spent on
>>>> the ring accurately, eventually helping with better timeout detection.
>>>>
>>>> Signed-off-by: Sharat Masetty <smasetty@codeaurora.org>
>>>
>>> Well, NAK. drm_sched_start_timeout() is called whenever the timer needs
>>> to run, but that doesn't mean that the timer is started (e.g. it can
>>> already be running).
>>>
>>> So the callback would be called multiple times and not reflect the
>>> actual job run time.
>>>
>>> Additional to that it can be racy, e.g. we can complete multiple jobs at
>>> a time before the timer is started again.
>>>
>>> If you want to accurately count how much time you spend on each job/ring
>>> you need to do this by measuring the time inside your driver instead.
>>>
>>> E.g. for amdgpu I would get the time first in amdgpu_job_run() and then
>>> again in amdgpu_job_free_cb() and calculate the difference.
>> Hi Christian,
>>
>> Thank you for the comments and apologies if this was confusing. All I
>> want to determine(more accurately) is that when the scheduler instance
>> timer of say 500 ms goes off, is if the ring(associated with the
>> scheduler instance) actually spent 500 ms on the hardware - and for
>> this I need to know in the driver space when the timer actually started.
>>
>> In msm hardware we have ring preemption support enabled and the kernel
>> driver triggers a preemption switch to a higher priority ring if there
>> is work available on that ring for the GPU to work on. So in the
>> presence of preemption it is possible that a lower priority ring did
>> not actually get to spend the full 500 ms and this is what I am trying
>> to catch with this callback.
>>
>> I am *not* trying to profile per job time consumption with this.
>>
>>> Well, NAK. drm_sched_start_timeout() is called whenever the timer needs
>>> to run, but that doesn't mean that the timer is started (e.g. it can
>>> already be running).
>>
>> Regarding the case where the timer may already be running - good
>> point, but it should be easy to address the scenario. I will check the
>> output
>> of schedule_delayed_work() and only call the callback(new proposed) if
>> the timer was really scheduled.
> 
> Yeah, that should work.
> 
>>
>> In summary, when this timedout_job() callback is called, it is assumed
>> that the job actually did time out from the POV of the scheduler, but
>> this will not hold true with preemption switching and that is what I
>> am trying to better address with this patch.
> 
> Mhm, so what you actually need is to suspend the timeout when the lower
> priority ring is preempted and resume it when it is started again? I
> wonder if that wouldn't be simpler.
> 
> We have support for ring preemption as well, but not implemented yet. So
> it would be nice to have something that works for everybody.
> 
> But on the other hand a callback to notify the driver that the timer
> started isn't so bad either.
Hi Christian,

Yes something like a suspend timeout would be simpler for the drivers, 
but I could not find anything which does this for the delayed work or 
even for the general timers. All I could find was cancel/delete.

In lieu of this, I chose this approach. If you like it this way(proposed 
patch), then I will address the review comments and re-spin... please 
let me know.

Sharat
> 
> Regards,
> Christian.
> 
>>
>> Sharat
>>>
>>> Regards,
>>> Christian.
>>>
>>>> ---
>>>> Here is an example of how I plan to use this new function callback.
>>>>
>>>> [1] https://patchwork.freedesktop.org/patch/254227/
>>>>
>>>>     drivers/gpu/drm/scheduler/sched_main.c | 7 ++++++-
>>>>     include/drm/gpu_scheduler.h            | 6 ++++++
>>>>     2 files changed, 12 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index c993d10..afd461e 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -192,8 +192,13 @@ bool drm_sched_dependency_optimized(struct
>>>> dma_fence* fence,
>>>>     static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>>>     {
>>>>         if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>> -        !list_empty(&sched->ring_mirror_list))
>>>> +        !list_empty(&sched->ring_mirror_list)) {
>>>> +
>>>>             schedule_delayed_work(&sched->work_tdr, sched->timeout);
>>>> +
>>>> +        if (sched->ops->start_timeout_notify)
>>>> +            sched->ops->start_timeout_notify(sched);
>>>> +    }
>>>>     }
>>>>
>>>>     /* job_finish is called after hw fence signaled
>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>> index d87b268..faf28b4 100644
>>>> --- a/include/drm/gpu_scheduler.h
>>>> +++ b/include/drm/gpu_scheduler.h
>>>> @@ -239,6 +239,12 @@ struct drm_sched_backend_ops {
>>>>              * and it's time to clean it up.
>>>>          */
>>>>         void (*free_job)(struct drm_sched_job *sched_job);
>>>> +
>>>> +    /*
>>>> +     * (Optional) Called to let the driver know that a timeout
>>>> detection
>>>> +     * timer has been started.
>>>> +     */
>>>> +    void (*start_timeout_notify)(struct drm_gpu_scheduler *sched);
>>>>     };
>>>>
>>>>     /**
>>>> -- 
>>>> 1.9.1
>>>>
>>>
>>> _______________________________________________
>>> Freedreno mailing list
>>> Freedreno@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/freedreno
>>>
>>
> 
> _______________________________________________
> Freedreno mailing list
> Freedreno@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/freedreno
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel