From: "Sharma, Shashank" <shashank.sharma@amd.com>
To: "Lazar, Lijo" <lijo.lazar@amd.com>, Alex Deucher <alexdeucher@gmail.com>
Cc: alexander.deucher@amd.com,
	"Felix Kuehling" <felix.kuehling@amd.com>,
	amaranath.somalapuram@amd.com,
	"Christian König" <christian.koenig@amd.com>,
	amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v3 5/5] drm/amdgpu: switch workload context to/from compute
Date: Fri, 30 Sep 2022 12:09:36 +0200	[thread overview]
Message-ID: <3c69d3e3-d4a5-5119-7d48-8e60304b34b4@amd.com> (raw)
In-Reply-To: <abdda495-27ae-cf32-6d58-b44773797262@amd.com>



On 9/30/2022 11:54 AM, Lazar, Lijo wrote:
> 
> 
> On 9/30/2022 2:52 PM, Sharma, Shashank wrote:
>>
>>
>> On 9/30/2022 11:13 AM, Lazar, Lijo wrote:
>>>
>>>
>>> On 9/30/2022 2:07 PM, Sharma, Shashank wrote:
>>>>
>>>>
>>>> On 9/30/2022 7:08 AM, Lazar, Lijo wrote:
>>>>>
>>>>>
>>>>> On 9/30/2022 12:02 AM, Alex Deucher wrote:
>>>>>> On Thu, Sep 29, 2022 at 10:14 AM Lazar, Lijo <lijo.lazar@amd.com> 
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 9/29/2022 7:30 PM, Sharma, Shashank wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 9/29/2022 3:37 PM, Lazar, Lijo wrote:
>>>>>>>>> To clarify your understanding -
>>>>>>>>>
>>>>>>>>> Nothing is automatic in PMFW. PMFW picks a priority based on the
>>>>>>>>> actual mask sent by the driver.
>>>>>>>>>
>>>>>>>>> Assuming lower bits correspond to higher priority -
>>>>>>>>>
>>>>>>>>> If the driver sends a mask with Bit3 and Bit0 set, PMFW will choose
>>>>>>>>> the profile that corresponds to Bit0. If the driver sends a mask
>>>>>>>>> with Bit4 and Bit2 set and the rest unset, PMFW will choose the
>>>>>>>>> profile that corresponds to Bit2. However, if the driver sends a
>>>>>>>>> mask with only a single bit set, it chooses that profile regardless
>>>>>>>>> of whatever the previous profile was. It doesn't check whether the
>>>>>>>>> existing profile > the newly requested one. That is the behavior.
>>>>>>>>>
>>>>>>>>> So if a job chooses a profile that corresponds to Bit0, the driver
>>>>>>>>> will send that. Next time, if another job chooses a profile that
>>>>>>>>> corresponds to Bit1, PMFW will receive that as the new profile and
>>>>>>>>> switch to it. It trusts the driver to send the proper workload mask.
>>>>>>>>>
>>>>>>>>> Hope that gives the picture.
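>>>>>>>>>
>>>>>>>>> As a minimal sketch of that selection logic (assuming lower bit =
>>>>>>>>> higher priority; the function and the WORKLOAD_DEFAULT name are
>>>>>>>>> made up for illustration):
>>>>>>>>>
>>>>>>>>>     /* pick the lowest set bit; an empty mask falls back to default */
>>>>>>>>>     static int pmfw_pick_profile(u32 mask)
>>>>>>>>>     {
>>>>>>>>>             return mask ? ffs(mask) - 1 : WORKLOAD_DEFAULT;
>>>>>>>>>     }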
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, my understanding is also similar; referring to the core
>>>>>>>> power profile switch function here:
>>>>>>>> amd_powerplay.c::pp_dpm_switch_power_profile()
>>>>>>>> *snip code*
>>>>>>>>         hwmgr->workload_mask |= (1 << hwmgr->workload_prority[type]);
>>>>>>>>         index = fls(hwmgr->workload_mask);
>>>>>>>>         index = index <= Workload_Policy_Max ? index - 1 : 0;
>>>>>>>>         workload = hwmgr->workload_setting[index];
>>>>>>>> *snip_code*
>>>>>>>>         hwmgr->hwmgr_func->set_power_profile_mode(hwmgr, &workload, 0);
>>>>>>>>
>>>>>>>> Here I can see that the new workload mask is OR'ed into the existing
>>>>>>>> workload mask (not overwritten). So if we keep sending new workload
>>>>>>>> modes, they are appended into the workload flags, and finally the PM
>>>>>>>> will pick the most aggressive of all these flags, as per its policy.
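>>>>>>>>
>>>>>>>> A worked example of my reading: if the accumulated mask has bits 1
>>>>>>>> and 5 set, fls() returns 6, index becomes 5, and
>>>>>>>> workload_setting[5] is applied - i.e. the highest set bit wins.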
>>>>>>>>
>>>>>>>
>>>>>>> Actually it's misleading -
>>>>>>>
>>>>>>> The path for sienna is -
>>>>>>> set_power_profile_mode -> sienna_cichlid_set_power_profile_mode
>>>>>>>
>>>>>>>
>>>>>>> This code here picks one based on a lookup table:
>>>>>>>
>>>>>>>     workload_type = smu_cmn_to_asic_specific_index(smu,
>>>>>>>                                     CMN2ASIC_MAPPING_WORKLOAD,
>>>>>>>                                     smu->power_profile_mode);
>>>>>>>
>>>>>>> This is that lookup table -
>>>>>>>
>>>>>>> static struct cmn2asic_mapping
>>>>>>> sienna_cichlid_workload_map[PP_SMC_POWER_PROFILE_COUNT] = {
>>>>>>>         WORKLOAD_MAP(PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT,
>>>>>>>                      WORKLOAD_PPLIB_DEFAULT_BIT),
>>>>>>>         WORKLOAD_MAP(PP_SMC_POWER_PROFILE_FULLSCREEN3D,
>>>>>>>                      WORKLOAD_PPLIB_FULL_SCREEN_3D_BIT),
>>>>>>>         WORKLOAD_MAP(PP_SMC_POWER_PROFILE_POWERSAVING,
>>>>>>>                      WORKLOAD_PPLIB_POWER_SAVING_BIT),
>>>>>>>         WORKLOAD_MAP(PP_SMC_POWER_PROFILE_VIDEO,
>>>>>>>                      WORKLOAD_PPLIB_VIDEO_BIT),
>>>>>>>         WORKLOAD_MAP(PP_SMC_POWER_PROFILE_VR,
>>>>>>>                      WORKLOAD_PPLIB_VR_BIT),
>>>>>>>         WORKLOAD_MAP(PP_SMC_POWER_PROFILE_COMPUTE,
>>>>>>>                      WORKLOAD_PPLIB_COMPUTE_BIT),
>>>>>>>         WORKLOAD_MAP(PP_SMC_POWER_PROFILE_CUSTOM,
>>>>>>>                      WORKLOAD_PPLIB_CUSTOM_BIT),
>>>>>>> };
>>>>>>>
>>>>>>>
>>>>>>> And this is the place of interaction with PMFW; (1 << workload_type)
>>>>>>> is the mask being sent:
>>>>>>>
>>>>>>>     smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_SetWorkloadMask,
>>>>>>>                                     1 << workload_type, NULL);
>>>>>>>
>>>>>>> In the end, driver implementation expects only one bit to be set.
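>>>>>>>
>>>>>>> For example, a compute request maps through the table above to
>>>>>>> WORKLOAD_PPLIB_COMPUTE_BIT, so the parameter sent is
>>>>>>> 1 << WORKLOAD_PPLIB_COMPUTE_BIT - a mask with exactly one bit set.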
>>>>>>
>>>>>> Shashank and I had a discussion about this today.  I think there are
>>>>>> a few things we can do to handle this better:
>>>>>>
>>>>>> 1. Set a flag so that if the user changes the default via sysfs, that
>>>>>> overrides any runtime setting via an application, since presumably
>>>>>> that is what the user wants; we won't change the hint at runtime.
>>>>>> 2. Drop the GET API.  There's no need for this; the hint is just a
>>>>>> hint.
>>>>>
>>>>> Double-checked again based on Felix's comments on the API definition.
>>>>> The driver decides the priority instead of the FW. That way we can
>>>>> still keep the GET API.
>>>>>
>>>>>> 3. Have the driver arbitrate between the available workload profiles
>>>>>> based on the numeric value of the hint (e.g., default < 3D < video <
>>>>>> VR < compute), as the higher values are more aggressive in most cases.
>>>>>> If requests come in for 3D and compute at the same time, the driver
>>>>>> will select compute because its value is highest.  Each hint type
>>>>>> would be refcounted so we'd know what state to be in every time we go
>>>>>> to set the state.  If all of the clients requesting compute go away,
>>>>>> and only 3D requestors remain, we can switch to 3D.  If all refcounts
>>>>>> go to 0, we go back to default.  This will not require any change to
>>>>>> the current workload API in the SMU code.
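>>>>>>
>>>>>> Roughly, the arbitration could look like this (hint names and the
>>>>>> helper are hypothetical, just to illustrate the idea):
>>>>>>
>>>>>>     /* one refcount per hint; active profile = highest hint in use */
>>>>>>     static int workload_active_hint(const unsigned int *refcount)
>>>>>>     {
>>>>>>             int i;
>>>>>>
>>>>>>             for (i = HINT_COMPUTE; i > HINT_DEFAULT; i--)
>>>>>>                     if (refcount[i])
>>>>>>                             return i;
>>>>>>             return HINT_DEFAULT;
>>>>>>     }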
>>>>>
>>>>> Since the PM layer decides the priority, the refcount can be kept at
>>>>> the powerplay and swsmu layers instead of in any higher-level API.
>>>>>
>>>>> The user API may keep something like req_power_profile (for any
>>>>> logging/debug purposes) for the job preference.
>>>>
>>>> No. I think there has been enough confusion around this implementation
>>>> so far; we will implement this just as Alex/Felix suggested:
>>>> - No change will be done in the pm/SMU layer.
>>>
>>> Well, confusion doesn't justify a bad implementation. You could just
>>> keep the refcount in workload_setting.
>>
>> So far, none of us has any reason to believe it's a bad implementation.
>> Why is that, again?
>>
> 
> It's only about keeping track of requests at client layer.
> 

There is absolutely nothing bad or wrong with that; as a matter of fact,
some driver designs prefer to keep it like this and keep the core API
minimal and focused on core functionality. This is just about choice.

>>>
>>> Another API that uses the power profile indirectly will also need to
>>> take care of the refcount, and we don't need every other API to do that
>>> separately without knowing what the final outcome is.
>>>
>>
>> And why? The dpm_switch_power_profile API was introduced to be used by a
>> higher-level API, and if a consumer API wants to keep track of that, it's
>> their own call. This doesn't affect internal PM APIs. The whole idea is
>> to manage the PM calls without any change in the PM APIs.
>>
> 
> Just like per-job profile switching is a new usage, there could be other
> new cases as well. Also, there are other APIs besides sysfs which
> indirectly manipulate the power profile.
> 
Understood; there was no reference counting for PM profile changes so far,
as the code was probably written with the sysfs interface in mind and
never considered a multi-client environment.

Like the workload context, if there are other current/future clients who
could also use these APIs, that would be a very important reason to add
this workload reference counting in a central PM structure (rather than
in multiple scattered places), so that every new API/consumer can
understand and use it, and consider the system-wide scenario of the DPM
power profile instead of a narrow view of its own thread. The central
counter will indicate that more than one consumer/client can change the
power profile, not only this thread.
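
As a rough illustration of such a central structure (field names are
purely illustrative, not a proposal for the actual code):

    /* in the PM layer: one entry per workload profile */
    struct workload_ref {
            unsigned int priority;  /* bit position in workload_mask */
            unsigned int reqcount;  /* number of live requestors */
    };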

- Shashank

> All I'm saying is keep the refcount at core layer so that regardless of 
> wherever it comes from, it keeps the preference.
> 
> So instead of this -
>
>         smu->workload_mask &= ~(1 << smu->workload_prority[type]);
>
> Have something like this -
>
>         smu->workload[type].reqcount--;
>         if (!smu->workload[type].reqcount)
>                 smu->workload_mask &= ~(1 << smu->workload[type].priority);
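>
> And correspondingly on the request path, something like (again just a
> sketch) -
>
>         smu->workload[type].reqcount++;
>         smu->workload_mask |= (1 << smu->workload[type].priority);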
> 
> I guess the count was not there because there was no case of multiple
> clients preferring the same profile at the same time. Now that there is
> a case for this, fix it where required rather than keeping track of it
> at the client layer.
> 
> Thanks,
> Lijo
> 
>>
>> - Shashank
>>
>>> Thanks,
>>> Lijo
>>>
>>>> - The amdgpu_context_workload layer will keep the ref_counting and
>>>> user_workload_hint management, and it will just call and consume
>>>> pm_switch_workload_profile() like any other client.
>>>>
>>>> - We will add a force flag for calls coming from the sysfs() interface,
>>>> and it will take the highest priority (a rough sketch follows below).
>>>> No state machine will be managed for sysfs, and it will work as it is
>>>> working today.
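>>>>
>>>> A rough sketch of that entry point (the name, signature and force-flag
>>>> placement are hypothetical, not final):
>>>>
>>>>     int amdgpu_ctx_set_workload_profile(struct amdgpu_device *adev,
>>>>                                         uint32_t hint,
>>>>                                         bool force); /* sysfs override */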
>>>>
>>>> - Shashank
>>>>
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lijo
>>>>>>>
>>>>>>>> Now, when we have a single workload:
>>>>>>>> -> Job1: requests profile P1 via UAPI, ref_count = 1
>>>>>>>> -> driver sends flags for P1
>>>>>>>> -> PM FW applies profile P1
>>>>>>>> -> Job1 executes in profile P1
>>>>>>>> -> Job1 goes to reset function, ref_count = 0
>>>>>>>> -> Power profile resets
>>>>>>>>
>>>>>>>> Now, we have conflicts only when we see multiple workloads (Job1
>>>>>>>> and Job2):
>>>>>>>> -> Job1: requests profile P1 via UAPI, ref_count = 1
>>>>>>>> -> driver sends flags for P1
>>>>>>>> -> PM FW applies profile P1
>>>>>>>> -> Job1 executes in profile P1
>>>>>>>> -> Job2: requests profile P2 via UAPI, ref_count = 2
>>>>>>>> -> driver sends flags for (P1|P2)
>>>>>>>> -> PM FW picks the more aggressive of the two (say P1, stays in P1)
>>>>>>>> -> Job1 goes to reset function, ref_count = 1, Job1 does not reset
>>>>>>>> the power profile
>>>>>>>> -> Job2 goes to reset function, ref_count = 0, Job2 resets the
>>>>>>>> power profile
>>>>>>>> -> Power profile resets to None
>>>>>>>>
>>>>>>>> So this state machine means that if there is only one job, it will
>>>>>>>> be executed in the desired mode. But if there are multiple, the most
>>>>>>>> aggressive profile will be picked, and every job will be executed in
>>>>>>>> at least the requested power profile mode or higher.
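>>>>>>>>
>>>>>>>> A minimal get/put sketch of this flow (the refcount field is
>>>>>>>> hypothetical, and mask accumulation across different profiles is
>>>>>>>> left out for brevity):
>>>>>>>>
>>>>>>>>     /* job start: first requestor switches the profile on */
>>>>>>>>     if (atomic_inc_return(&ctx->workload_refcount) == 1)
>>>>>>>>             amdgpu_dpm_switch_power_profile(adev, profile, true);
>>>>>>>>
>>>>>>>>     /* job end: last requestor out restores the default */
>>>>>>>>     if (atomic_dec_and_test(&ctx->workload_refcount))
>>>>>>>>             amdgpu_dpm_switch_power_profile(adev, profile, false);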
>>>>>>>>
>>>>>>>> Do you find any problem so far?
>>>>>>>>
>>>>>>>> - Shashank
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Lijo
