Re: [Intel-gfx] [PATCH] drm/i915/guc/slpc: Use non-blocking H2G for waitboost

From: "Belgaumkar, Vinay" <vinay.belgaumkar@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	John Harrison <john.c.harrison@intel.com>,
	<intel-gfx@lists.freedesktop.org>,
	<dri-devel@lists.freedesktop.org>
Subject: Re: [Intel-gfx] [PATCH] drm/i915/guc/slpc: Use non-blocking H2G for waitboost
Date: Fri, 6 May 2022 09:21:27 -0700	[thread overview]
Message-ID: <f7e4b196-6ce9-ee61-4314-0baf8450bc45@intel.com> (raw)
In-Reply-To: <1d15f38e-c3d8-a521-4a15-50341dae5000@linux.intel.com>

On 5/6/2022 12:18 AM, Tvrtko Ursulin wrote:
>
> On 05/05/2022 19:36, John Harrison wrote:
>> On 5/5/2022 10:21, Belgaumkar, Vinay wrote:
>>> On 5/5/2022 5:13 AM, Tvrtko Ursulin wrote:
>>>> On 05/05/2022 06:40, Vinay Belgaumkar wrote:
>>>>> SLPC min/max frequency updates require H2G calls. We are seeing
>>>>> timeouts when GuC channel is backed up and it is unable to respond
>>>>> in a timely fashion causing warnings and affecting CI.
>>>>
>>>> Is it the "Unable to force min freq" error? Do you have a link to 
>>>> the GitLab issue to add to commit message?
>>> We don't have a specific error for this one, but have seen similar 
>>> issues with other H2G which are blocking.
>>>>
>>>>> This is seen when waitboosting happens during a stress test.
>>>>> this patch updates the waitboost path to use a non-blocking
>>>>> H2G call instead, which returns as soon as the message is
>>>>> successfully transmitted.
>>>>
>>>> AFAIU with this approach, when CT channel is congested, you instead 
>>>> achieve silent dropping of the waitboost request, right?
>>> We are hoping it makes it, but just not waiting for it to complete.
>> We are not 'hoping it makes it'. We know for a fact that it will make 
>> it. We just don't know when. The issue is not about whether the 
>> waitboost request itself gets dropped/lost it is about the ack that 
>> comes back. The GuC will process the message and it will send an ack. 
>> It's just a question of whether the i915 driver has given up waiting 
>> for it yet. And if it has, then you get the initial 'timed out 
>> waiting for ack' followed by a later 'got unexpected ack' message.
>>
>> Whereas, if we make the call asynchronous, there is no ack. i915 
>> doesn't bother waiting and it won't get surprised later.
>>
>> Also, note that this is only an issue when GuC itself is backed up. 
>> Normally that requires the creation/destruction of large numbers of 
>> contexts in rapid succession (context management is about the slowest 
>> thing we do with GuC). Some of the IGTs and selftests do that with 
>> thousands of contexts all at once. Those are generally where we see 
>> this kind of problem. It would be highly unlikely (but not 
>> impossible) to hit it in real world usage.
>
> Goto ->
>
>> The general design philosophy of H2G messages is that asynchronous 
>> mode should be used for everything if at all possible. It is fire and 
>> forget and will all get processed in the order sent (same as batch 
>> buffer execution, really). Synchronous messages should only be used 
>> when an ack/status is absolutely required. E.g. start of day 
>> initialisation or things like TLB invalidation where we need to know 
>> that a cache has been cleared/flushed before updating memory from the 
>> CPU.
>>
>> John.
>>
>>
>>>>
>>>> It sounds like a potentially important feedback from the field to 
>>>> lose so easily. How about you added drm_notice to the worker when 
>>>> it fails?
>>>>
>>>> Or simply a "one line patch" to replace i915_probe_error (!?) with 
>>>> drm_notice and keep the blocking behavior. (I have no idea what is 
>>>> the typical time to drain the CT buffer, and so to decide whether 
>>>> waiting or dropping makes more sense for effectiveness of 
>>>> waitboosting.)
>>>>
>>>> Or since the congestion /should not/ happen in production, then the 
>>>> argument is why complicate with more code, in which case going with 
>>>> one line patch is an easy way forward?
>
> Here. Where I did hint I understood the "should not happen in 
> production angle".
>
> So statement is GuC is congested in processing requests, but the h2g 
> buffer is not congested so no chance intel_guc_send_nb() will fail 
> with no space in that buffer? Sounds a bit un-intuitive.
>
> Anyway, it sounds okay to me to use the non-blocking, but I would like 
> to see some logging if the unexpected does happen. Hence I was 
> suggesting the option of adding drm_notice logging if the send fails 
> from the worker. (Because I think other callers would already 
> propagate the error, like sysfs.)
>
>   err = slpc_force_min_freq(slpc, slpc->boost_freq);
>   if (!err)
>        slpc->num_boosts++;
>   else
>        drm_notice(... "Failed to send waitboost request (%d)", err);

Ok, makes sense. Will send out another rev with this change.

Thanks,

Vinay.

>
> Something like that.
>
> Regards,
>
> Tvrtko
>
>
>>> Even if we soften the blow here, the actual timeout error occurs in 
>>> the intel_guc_ct.c code, so we cannot hide that error anyways. 
>>> Making this call non-blocking will achieve both things.
>>>
>>> Thanks,
>>>
>>> Vinay.
>>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 
>>>>> ++++++++++++++++-----
>>>>>   1 file changed, 30 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c 
>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> index 1db833da42df..c852f73cf521 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct intel_guc_slpc 
>>>>> *slpc)
>>>>>       return data->header.global_state;
>>>>>   }
>>>>>   +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, 
>>>>> u8 id, u32 value)
>>>>> +{
>>>>> +    u32 request[] = {
>>>>> +        GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>>>>> +        SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>>>>> +        id,
>>>>> +        value,
>>>>> +    };
>>>>> +    int ret;
>>>>> +
>>>>> +    ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
>>>>> +
>>>>> +    return ret > 0 ? -EPROTO : ret;
>>>>> +}
>>>>> +
>>>>> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, 
>>>>> u32 value)
>>>>> +{
>>>>> +    struct intel_guc *guc = slpc_to_guc(slpc);
>>>>> +
>>>>> +    GEM_BUG_ON(id >= SLPC_MAX_PARAM);
>>>>> +
>>>>> +    return guc_action_slpc_set_param_nb(guc, id, value);
>>>>> +}
>>>>> +
>>>>>   static int guc_action_slpc_set_param(struct intel_guc *guc, u8 
>>>>> id, u32 value)
>>>>>   {
>>>>>       u32 request[] = {
>>>>> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct 
>>>>> intel_guc_slpc *slpc, u32 freq)
>>>>>        */
>>>>>         with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>>>>> -        ret = slpc_set_param(slpc,
>>>>> - SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>> -                     freq);
>>>>> -        if (ret)
>>>>> -            i915_probe_error(i915, "Unable to force min freq to 
>>>>> %u: %d",
>>>>> -                     freq, ret);
>>>>> +        /* Non-blocking request will avoid stalls */
>>>>> +        ret = slpc_set_param_nb(slpc,
>>>>> + SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>> +                    freq);
>>>>>       }
>>>>>         return ret;
>>>>> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct work_struct 
>>>>> *work)
>>>>>        */
>>>>>       mutex_lock(&slpc->lock);
>>>>>       if (atomic_read(&slpc->num_waiters)) {
>>>>> -        slpc_force_min_freq(slpc, slpc->boost_freq);
>>>>> -        slpc->num_boosts++;
>>>>> +        if (!slpc_force_min_freq(slpc, slpc->boost_freq))
>>>>> +            slpc->num_boosts++;
>>>>>       }
>>>>>       mutex_unlock(&slpc->lock);
>>>>>   }
>>