Re: [PATCH v2 6/6] arm64: use activity monitors for frequency invariance

From: Lukasz Luba <lukasz.luba@arm.com>
To: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com,
	maz@kernel.org, suzuki.poulose@arm.com, sudeep.holla@arm.com,
	dietmar.eggemann@arm.com, peterz@infradead.org, mingo@redhat.com,
	ggherdovich@suse.cz, vincent.guittot@linaro.org,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 6/6] arm64: use activity monitors for frequency invariance
Date: Fri, 24 Jan 2020 15:17:48 +0000	[thread overview]
Message-ID: <d541c4ae-8419-0204-f399-7f0f0a18eb38@arm.com> (raw)
In-Reply-To: <20200124131134.GA15164@arm.com>

On 1/24/20 1:12 PM, Ionela Voinescu wrote:
> Hi Lukasz,
> 
> On Friday 24 Jan 2020 at 01:19:31 (+0000), Lukasz Luba wrote:
>>
>>
>> On 1/23/20 5:07 PM, Ionela Voinescu wrote:
>>> Hi Lukasz,
>>>
>>> Thank you for taking a look over the patches.
>>>
>>> On Thursday 23 Jan 2020 at 11:49:29 (+0000), Lukasz Luba wrote:
>>>> Hi Ionela,
>>>>
>>>> Please find my few comments below.
>>>>
>>>> On 12/18/19 6:26 PM, Ionela Voinescu wrote:
>>>>> The Frequency Invariance Engine (FIE) is providing a frequency
>>>>> scaling correction factor that helps achieve more accurate
>>>>> load-tracking.
>>>>>
>>>>> So far, for arm and arm64 platforms, this scale factor has been
>>>>> obtained based on the ratio between the current frequency and the
>>>>> maximum supported frequency recorded by the cpufreq policy. The
>>>>> setting of this scale factor is triggered from cpufreq drivers by
>>>>> calling arch_set_freq_scale. The current frequency used in computation
>>>>> is the frequency requested by a governor, but it may not be the
>>>>> frequency that was implemented by the platform.
>>>>>
>>>>> This correction factor can also be obtained using a core counter and a
>>>>> constant counter to get information on the performance (frequency based
>>>>> only) obtained in a period of time. This will more accurately reflect
>>>>> the actual current frequency of the CPU, compared with the alternative
>>>>> implementation that reflects the request of a performance level from
>>>>> the OS.
>>>>>
>>>>> Therefore, implement arch_scale_freq_tick to use activity monitors, if
>>>>> present, for the computation of the frequency scale factor.
>>>>>
>>>>> The use of AMU counters depends on:
>>>>>     - CONFIG_ARM64_AMU_EXTN - depents on the AMU extension being present
>>>>>     - CONFIG_CPU_FREQ - the current frequency obtained using counter
>>>>>       information is divided by the maximum frequency obtained from the
>>>>>       cpufreq policy.
>>>>>
>>>>> While it is possible to have a combination of CPUs in the system with
>>>>> and without support for activity monitors, the use of counters for
>>>>> frequency invariance is only enabled for a CPU, if all related CPUs
>>>>> (CPUs in the same frequency domain) support and have enabled the core
>>>>
>>>> This looks like an edge case scenario, for which we are designing the
>>>> whole machinery with workqueues. AFAIU we cannot run the code in
>>>> arch_set_freq_scale() and you want to be check all CPUs upfront.
>>>>
>>>
>>> Unfortunately, I don't believe it to be be an edge-case. Given that this
>>> is an optional feature, I do believe that people might skip on
>>> implementing it on some CPUs(LITTLEs) while keeping it for CPUs(bigs)
>>> where power and thermal mitigation is more probable to happen in firmware.
>>> This is the main reason to be conservative in the validation of CPUs and
>>> cpufreq policies.
>>>
>>> In regards to arch_set_freq_scale, I want to be able to tell, when that
>>> function is called, if I should return a scale factor based on cpufreq
>>> for the current policy. If activity monitors are useable for the CPUs in
>>> the full policy, than I'm bailing out and leave the AMU FIE machinery
>>> set the scale factor. Unfortunately this works at policy granularity.
>>>
>>> This could  be done in a nicer way by setting the scale factor per cpu
>>> and not for all CPUs in a policy in this arch_set_freq_scale function.
>>> But this would require some rewriting for the full frequency invariance
>>> support in drivers which we've talked about for a while but it was not
>>> the purpose of this patch set. But it would eliminate the policy
>>> verification I do with the second workqueue.
>>>
>>>> Maybe you can just wait till all CPUs boot and then set the proper
>>>> flags and finish initialization. Something like:
>>>> per_cpu(s8, amu_feat) /* form the patch 1/6 */
>>>> OR
>>>> per_cpu(u8, amu_scale_freq) /* from this patch */
>>>> with maybe some values:
>>>> 0 - not checked yet
>>>> 1 - checked and present
>>>> -1 - checked and not available
>>>> -2 - checked but in conflict with others in the freq domain
>>>> -3..-k - other odd configurations
>>>>
>>>> could potentially eliminate the need of workqueues.
>>>>
>>>> Then, if we could trigger this from i.e. late_initcall, the CPUs
>>>> should be online and you can validate them.
>>>>
>>>
>>> I did initially give such a state machine a try but it proved to be
>>> quite messy. A big reason for this is that the activity monitors unit
>>> has multiple counters that can be used for different purposes.
>>>
>>> The amu_feat per_cpu variable only flags that you have the AMU present
>>> for potential users (in this case FIE) to validate the counters they
>>> need for their respective usecase. For this reason I don't want to
>>> overload the meaning of amu_feat. For the same reason I'm not doing the
>>> validation of the counters in a generic way, but I'm tying it to the
>>> usecase for particular counters. For example, it would not matter if
>>> the instructions retired counter is not enabled from firmware for the
>>> usecase of FIE. For frequency invariance we only need the core and
>>> constant cycle counters and I'm making it the job of the user (arm64
>>> topology code) to do the checking.
>>>
>>> Secondly, for amu_scale_freq I could have added such a state machine,
>>> but I did not think it was useful. The only thing it would change is
>>> that I would not have to use the cpu_amu_fie variable in the data
>>> structure that gets passed to the work functions. The only way I would
>>> eliminate the second workqueue was if I did not do a check of all CPUs
>>> in a policy, as described above, and rewrite frequency invariance to
>>> work at CPU granularity and not policy granularity. This would eliminate
>>> the dependency on cpufreq policy all-together, so it would be worth
>>> doing if only for this reason alone :).
>>>
>>> But even in that case, it's probably not needed to have more than two
>>> states for amu_freq_scale.
>>>
>>> What do you think?
>>
>> I think currently we are the only users for this AMU and if there will
>> be another in the future, then we can start thinking about his proposed
>> changes. Let's cross that bridge when we come to it.
>>
>> Regarding the code, in the arch/arm64/cpufeature.c you can already
>> read the cycle registers. All the CPUs are going through that code
>> during start. If you use this fact in the late_initcall() all CPUs
>> should be checked and you can just ask for cpufreq policy, calculate the
>> max_freq ratio, set the per cpu config value to 'ready' state.
>>
>> Something like in the code below, it is on top of your patch set.
>>
>> ------------------------>8-------------------------------------
>>
>>
>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> index c639b3e052d7..837ea46d8867 100644
>> --- a/arch/arm64/kernel/cpufeature.c
>> +++ b/arch/arm64/kernel/cpufeature.c
>> @@ -1168,19 +1168,26 @@ static bool has_hw_dbm(const struct
>> arm64_cpu_capabilities *cap,
>>    * from the current cpu.
>>    *  - cpu_has_amu_feat()
>>    */
>> -static DEFINE_PER_CPU_READ_MOSTLY(u8, amu_feat);
>> -
>> -inline bool cpu_has_amu_feat(void)
>> -{
>> -	return !!this_cpu_read(amu_feat);
>> -}
>> +DECLARE_PER_CPU(u64, arch_const_cycles_prev);
>> +DECLARE_PER_CPU(u64, arch_core_cycles_prev);
>> +DECLARE_PER_CPU(u8, amu_scale_freq);
>>
>>   static void cpu_amu_enable(struct arm64_cpu_capabilities const *cap)
>>   {
>> +	u64 core_cnt, const_cnt;
>> +
>>   	if (has_cpuid_feature(cap, SCOPE_LOCAL_CPU)) {
>>   		pr_info("detected CPU%d: Activity Monitors Unit (AMU)\n",
>>   			smp_processor_id());
>> -		this_cpu_write(amu_feat, 1);
>> +		core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
>> +		const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
>> +
>> +		this_cpu_write(arch_core_cycles_prev, core_cnt);
>> +		this_cpu_write(arch_const_cycles_prev, const_cnt);
>> +
>> +		this_cpu_write(amu_scale_freq, 1);
>> +	} else {
>> +		this_cpu_write(amu_scale_freq, 2);
>>   	}
>>   }
> 
> 
> Yes, functionally this can be done here (it would need some extra checks
> on the initial values of core_cnt and const_cnt), but what I was saying
> in my previous comment is that I don't want to mix generic feature
> detection, which should happen here, with counter validation for
> frequency invariance. As you see, this would already bring here per-cpu
> variables for counters and amu_scale_freq flag, and I only see this
> getting more messy with the future use of more counters. I don't believe
> this code belongs here.
> 
> Looking a bit more over the code and checking against the new frequency
> invariance code for x86, there is a case of either doing this CPU
> validation in smp_prepare_cpus (separately for arm64 and x86) or calling
> an arch_init_freq_invariance() maybe in sched_init_smp to be defined with
> the proper frequency invariance counter initialisation code separately
> for x86 and arm64. I'll have to look more over the details to make sure
> this is feasible.

I have found that we could simply draw on from Mark's solution to
similar problem. In commit:

commit df857416a13734ed9356f6e4f0152d55e4fb748a
Author: Mark Rutland <mark.rutland@arm.com>
Date:   Wed Jul 16 16:32:44 2014 +0100

     arm64: cpuinfo: record cpu system register values

     Several kernel subsystems need to know details about CPU system 
register
     values, sometimes for CPUs other than that they are executing on. 
Rather
     than hard-coding system register accesses and cross-calls for these
     cases, this patch adds logic to record various system register 
values at
     boot-time. This may be used for feature reporting, firmware bug
     detection, etc.

     Separate hooks are added for the boot and hotplug paths to enable
     one-time intialisation and cold/warm boot value mismatch detection in
     later patches.

     Signed-off-by: Mark Rutland <mark.rutland@arm.com>
     Reviewed-by: Will Deacon <will.deacon@arm.com>
     Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

He added cpuinfo_store_cpu() call in secondary_start_kernel()
[in arm64 smp.c]. Please check the file:
arch/arm64/kernel/cpuinfo.c

We can probably add our read-amu-regs-and-setup-invariance call
just below his cpuinfo_store_cpu.

Then the arm64 cpufeature.c would be clean, we will be called for
each cpu, late_initcal() will finish setup with edge case policy
check like in the init_amu_feature() code below.

> 
>>
>> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
>> index 61f8264afec9..95b34085ae64 100644
>> --- a/arch/arm64/kernel/topology.c
>> +++ b/arch/arm64/kernel/topology.c
>> @@ -144,8 +144,8 @@ static struct cpu_amu_work __percpu *works;
>>   static cpumask_var_t cpus_to_visit;
>>
>>   static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale);
>> -static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
>> -static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
>> +DEFINE_PER_CPU(u64, arch_const_cycles_prev);
>> +DEFINE_PER_CPU(u64, arch_core_cycles_prev);
>>   DECLARE_PER_CPU(u8, amu_scale_freq);
>>
>>   static void cpu_amu_fie_init_workfn(struct work_struct *work)
>> @@ -323,12 +323,64 @@ static int __init
>> register_fie_counters_cpufreq_notifier(void)
>>   }
>>   core_initcall(register_fie_counters_cpufreq_notifier);
>>
>> +static int __init init_amu_feature(void)
>> +{
>> +	struct cpufreq_policy *policy;
>> +	struct cpumask *checked_cpus;
>> +	int count, total;
>> +	int cpu, i;
>> +	s8 amu_config;
>> +	u64 ratio;
>> +
>> +	checked_cpus = kzalloc(cpumask_size(), GFP_KERNEL);
>> +	if (!checked_cpus)
>> +		return -ENOMEM;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		if (cpumask_test_cpu(cpu, checked_cpus))
>> +			continue;
>> +
>> +		policy = cpufreq_cpu_get(cpu);
>> +		if (!policy) {
>> +			pr_warn("No cpufreq policy found for CPU%d\n", cpu);
>> +			continue;
>> +		}
>> +
>> +		count = total = 0;
>> +
>> +		for_each_cpu(i, policy->related_cpus) {
>> +			amu_config = per_cpu(amu_scale_freq, i);
>> +			if (amu_config == 1)
>> +				count++;
>> +			total++;
>> +		}
>> +
>> +		amu_config = (total == count) ? 3 : 4;
>> +
>> +		ratio = (u64)arch_timer_get_rate() << (2 * SCHED_CAPACITY_SHIFT);
>> +		ratio = div64_u64(ratio, policy->cpuinfo.max_freq * 1000);
>> +
>> +		for_each_cpu(i, policy->related_cpus) {
>> +			per_cpu(arch_max_freq_scale, i) = (unsigned long)ratio;
>> +			per_cpu(amu_scale_freq, i) = amu_config;
>> +			cpumask_set_cpu(i, checked_cpus);
>> +		}
>> +
>> +		cpufreq_cpu_put(policy);
>> +	}
>> +
>> +	kfree(checked_cpus);
>> +
>> +	return 0;
>> +}
>> +late_initcall(init_amu_feature);
>> +
> 
> Yes, with the design I mentioned above, this CPU policy validation could
> move to a late_initcall and I could drop the workqueues and the extra
> data structure. Thanks for this!
> 
> Let me know what you think!
> 

One think is still open, the file drivers/base/arch_topology.c and
#ifdef in function arch_set_freq_scale().

Generally, if there is such need, it's better to put such stuff into the
header and make dual implementation not polluting generic code with:
#if defined(CONFIG_ARM64_XZY)
#endif
#if defined(CONFIG_POWERPC_ABC)
#endif
#if defined(CONFIG_x86_QAZ)
#endif
...

In our case we would need i.e. linux/topology.h because it includes
asm/topology.h, which might provide a needed symbol. At the end of
linux/topology.h we can have:

#ifndef arch_cpu_auto_scaling
static __always_inline
bool arch_cpu_auto_scaling(void) { return False; }
#endif

Then, when the symbol was missing and we got the default one,
it should be easily optimized by the compiler.

We could have a much cleaner function arch_set_freq_scale()
in drivers/base/ and all architecture will deal with specific
#ifdef CONFIG in their <asm/topology.h> implementations or
use default.

Example:
arch_set_freq_scale()
{
	unsigned long scale;
	int i;

	if (arch_cpu_auto_scaling(cpu))
		return;

	scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
	for_each_cpu(i, cpus)
		per_cpu(freq_scale, i) = scale;
}

Regards,
Lukasz