From: Ionela Voinescu <ionela.voinescu@arm.com>
To: Lukasz Luba <lukasz.luba@arm.com>
Cc: catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com,
	maz@kernel.org, suzuki.poulose@arm.com, sudeep.holla@arm.com,
	dietmar.eggemann@arm.com, peterz@infradead.org, mingo@redhat.com,
	ggherdovich@suse.cz, vincent.guittot@linaro.org,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 6/6] arm64: use activity monitors for frequency invariance
Date: Fri, 24 Jan 2020 13:12:23 +0000	[thread overview]
Message-ID: <20200124131134.GA15164@arm.com> (raw)
In-Reply-To: <bddfdb83-c6cb-a809-8781-80e1117a2d6c@arm.com>

Hi Lukasz,

On Friday 24 Jan 2020 at 01:19:31 (+0000), Lukasz Luba wrote:
> 
> 
> On 1/23/20 5:07 PM, Ionela Voinescu wrote:
> > Hi Lukasz,
> > 
> > Thank you for taking a look over the patches.
> > 
> > On Thursday 23 Jan 2020 at 11:49:29 (+0000), Lukasz Luba wrote:
> > > Hi Ionela,
> > > 
> > > Please find my few comments below.
> > > 
> > > On 12/18/19 6:26 PM, Ionela Voinescu wrote:
> > > > The Frequency Invariance Engine (FIE) provides a frequency
> > > > scaling correction factor that helps achieve more accurate
> > > > load-tracking.
> > > > 
> > > > So far, for arm and arm64 platforms, this scale factor has been
> > > > obtained based on the ratio between the current frequency and the
> > > > maximum supported frequency recorded by the cpufreq policy. The
> > > > setting of this scale factor is triggered from cpufreq drivers by
> > > > calling arch_set_freq_scale. The current frequency used in computation
> > > > is the frequency requested by a governor, but it may not be the
> > > > frequency that was implemented by the platform.
> > > > 
> > > > This correction factor can also be obtained using a core counter and a
> > > > constant counter to get information on the performance (frequency based
> > > > only) obtained in a period of time. This will more accurately reflect
> > > > the actual current frequency of the CPU, compared with the alternative
> > > > implementation that reflects the request of a performance level from
> > > > the OS.
> > > > 
> > > > Therefore, implement arch_scale_freq_tick to use activity monitors, if
> > > > present, for the computation of the frequency scale factor.
> > > > 
> > > > The use of AMU counters depends on:
> > > >    - CONFIG_ARM64_AMU_EXTN - depends on the AMU extension being present
> > > >    - CONFIG_CPU_FREQ - the current frequency obtained using counter
> > > >      information is divided by the maximum frequency obtained from the
> > > >      cpufreq policy.
> > > > 
> > > > While it is possible to have a combination of CPUs in the system with
> > > > and without support for activity monitors, the use of counters for
> > > > frequency invariance is only enabled for a CPU if all related CPUs
> > > > (CPUs in the same frequency domain) support and have enabled the core
> > > 
> > > This looks like an edge case scenario, for which we are designing the
> > > whole machinery with workqueues. AFAIU we cannot run the code in
> > > arch_set_freq_scale() and you want to check all CPUs upfront.
> > > 
> > 
> > Unfortunately, I don't believe it to be an edge case. Given that this
> > is an optional feature, I do believe that people might skip
> > implementing it on some CPUs (LITTLEs) while keeping it for other
> > CPUs (bigs), where power and thermal mitigation is more likely to
> > happen in firmware. This is the main reason to be conservative in the
> > validation of CPUs and cpufreq policies.
> > 
> > In regards to arch_set_freq_scale, I want to be able to tell, when that
> > function is called, whether I should return a scale factor based on
> > cpufreq for the current policy. If activity monitors are usable for the
> > CPUs in the full policy, then I bail out and let the AMU FIE machinery
> > set the scale factor. Unfortunately this works at policy granularity.
> > 
> > This could be done in a nicer way by setting the scale factor per CPU,
> > and not for all CPUs in a policy, in this arch_set_freq_scale function.
> > But that would require some rewriting of the full frequency invariance
> > support in drivers, which we've talked about for a while, but it was
> > not the purpose of this patch set. It would, however, eliminate the
> > policy verification I do with the second workqueue.
> > 
> > > Maybe you can just wait till all CPUs boot and then set the proper
> > > flags and finish initialization. Something like:
> > > per_cpu(s8, amu_feat) /* from patch 1/6 */
> > > OR
> > > per_cpu(u8, amu_scale_freq) /* from this patch */
> > > with maybe some values:
> > > 0 - not checked yet
> > > 1 - checked and present
> > > -1 - checked and not available
> > > -2 - checked but in conflict with others in the freq domain
> > > -3..-k - other odd configurations
> > > 
> > > could potentially eliminate the need of workqueues.
> > > 
> > > Then, if we could trigger this from e.g. a late_initcall, the CPUs
> > > should be online and you can validate them.
> > > 
> > 
> > I did initially give such a state machine a try but it proved to be
> > quite messy. A big reason for this is that the activity monitors unit
> > has multiple counters that can be used for different purposes.
> > 
> > The amu_feat per_cpu variable only flags that you have the AMU present
> > for potential users (in this case FIE) to validate the counters they
> > need for their respective usecase. For this reason I don't want to
> > overload the meaning of amu_feat. For the same reason I'm not doing the
> > validation of the counters in a generic way, but I'm tying it to the
> > usecase for particular counters. For example, it would not matter if
> > the instructions retired counter is not enabled from firmware for the
> > usecase of FIE. For frequency invariance we only need the core and
> > constant cycle counters and I'm making it the job of the user (arm64
> > topology code) to do the checking.
> > 
> > Secondly, for amu_scale_freq I could have added such a state machine,
> > but I did not think it was useful. The only thing it would change is
> > that I would not have to use the cpu_amu_fie variable in the data
> > structure that gets passed to the work functions. The only way I could
> > eliminate the second workqueue would be to not check all the CPUs in a
> > policy, as described above, and instead rewrite frequency invariance to
> > work at CPU granularity and not policy granularity. This would eliminate
> > the dependency on cpufreq policy all-together, so it would be worth
> > doing if only for this reason alone :).
> > 
> > But even in that case, it's probably not needed to have more than two
> > states for amu_scale_freq.
> > 
> > What do you think?
> 
> I think currently we are the only users of this AMU and if there is
> another one in the future, then we can start thinking about these
> proposed changes. Let's cross that bridge when we come to it.
> 
> Regarding the code, in arch/arm64/kernel/cpufeature.c you can already
> read the cycle registers. All the CPUs go through that code during
> startup. If you use this fact, then in a late_initcall() all CPUs should
> already be checked and you can just ask for the cpufreq policy, calculate
> the max_freq ratio and set the per-cpu config value to the 'ready' state.
> 
> Something like the code below; it is on top of your patch set.
> 
> ------------------------>8-------------------------------------
> 
> 
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index c639b3e052d7..837ea46d8867 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -1168,19 +1168,26 @@ static bool has_hw_dbm(const struct arm64_cpu_capabilities *cap,
>   * from the current cpu.
>   *  - cpu_has_amu_feat()
>   */
> -static DEFINE_PER_CPU_READ_MOSTLY(u8, amu_feat);
> -
> -inline bool cpu_has_amu_feat(void)
> -{
> -	return !!this_cpu_read(amu_feat);
> -}
> +DECLARE_PER_CPU(u64, arch_const_cycles_prev);
> +DECLARE_PER_CPU(u64, arch_core_cycles_prev);
> +DECLARE_PER_CPU(u8, amu_scale_freq);
> 
>  static void cpu_amu_enable(struct arm64_cpu_capabilities const *cap)
>  {
> +	u64 core_cnt, const_cnt;
> +
>  	if (has_cpuid_feature(cap, SCOPE_LOCAL_CPU)) {
>  		pr_info("detected CPU%d: Activity Monitors Unit (AMU)\n",
>  			smp_processor_id());
> -		this_cpu_write(amu_feat, 1);
> +		core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
> +		const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
> +
> +		this_cpu_write(arch_core_cycles_prev, core_cnt);
> +		this_cpu_write(arch_const_cycles_prev, const_cnt);
> +
> +		this_cpu_write(amu_scale_freq, 1);
> +	} else {
> +		this_cpu_write(amu_scale_freq, 2);
>  	}
>  }


Yes, functionally this can be done here (it would need some extra checks
on the initial values of core_cnt and const_cnt), but what I was saying
in my previous comment is that I don't want to mix generic feature
detection, which should happen here, with counter validation for
frequency invariance. As you can see, this would already bring the
per-cpu variables for the counters and the amu_scale_freq flag into this
file, and I only see it getting messier with the future use of more
counters. I don't believe this code belongs here.
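
Just to illustrate what I mean by the extra checks, something along
these lines on top of your snippet (only a sketch, the exact handling
of the failure case is not final):

	core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
	const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);

	/*
	 * Counters that read as 0 have most likely not been enabled
	 * from firmware, so mark the AMU as not usable for frequency
	 * invariance on this CPU rather than flagging it as ready.
	 */
	if (!core_cnt || !const_cnt) {
		this_cpu_write(amu_scale_freq, 2);
		return;
	}

	this_cpu_write(arch_core_cycles_prev, core_cnt);
	this_cpu_write(arch_const_cycles_prev, const_cnt);
	this_cpu_write(amu_scale_freq, 1);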

Looking a bit more over the code and checking it against the new
frequency invariance code for x86, there is a case for either doing this
CPU validation in smp_prepare_cpus (separately for arm64 and x86), or
for calling an arch_init_freq_invariance(), maybe from sched_init_smp,
which each architecture would define with its own frequency invariance
counter initialisation code. I'll have to look more over the details to
make sure this is feasible.
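
Roughly, what I have in mind for the latter is something along these
lines (only a sketch, the hook name and the exact call site are not
final):

/* kernel/sched/topology.c (or core.c): weak default, a no-op */
void __weak arch_init_freq_invariance(void)
{
}

/* called once from sched_init_smp(), after all CPUs have been brought up */

/* arch/arm64/kernel/topology.c: arm64 override */
void arch_init_freq_invariance(void)
{
	/*
	 * All CPUs have gone through the AMU feature detection by now,
	 * so their counters can be validated here per frequency domain
	 * (much like init_amu_feature() below does) without any
	 * workqueues.
	 */
}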

> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 61f8264afec9..95b34085ae64 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -144,8 +144,8 @@ static struct cpu_amu_work __percpu *works;
>  static cpumask_var_t cpus_to_visit;
> 
>  static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale);
> -static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
> -static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
> +DEFINE_PER_CPU(u64, arch_const_cycles_prev);
> +DEFINE_PER_CPU(u64, arch_core_cycles_prev);
>  DECLARE_PER_CPU(u8, amu_scale_freq);
> 
>  static void cpu_amu_fie_init_workfn(struct work_struct *work)
> @@ -323,12 +323,64 @@ static int __init register_fie_counters_cpufreq_notifier(void)
>  }
>  core_initcall(register_fie_counters_cpufreq_notifier);
> 
> +static int __init init_amu_feature(void)
> +{
> +	struct cpufreq_policy *policy;
> +	struct cpumask *checked_cpus;
> +	int count, total;
> +	int cpu, i;
> +	s8 amu_config;
> +	u64 ratio;
> +
> +	checked_cpus = kzalloc(cpumask_size(), GFP_KERNEL);
> +	if (!checked_cpus)
> +		return -ENOMEM;
> +
> +	for_each_possible_cpu(cpu) {
> +		if (cpumask_test_cpu(cpu, checked_cpus))
> +			continue;
> +
> +		policy = cpufreq_cpu_get(cpu);
> +		if (!policy) {
> +			pr_warn("No cpufreq policy found for CPU%d\n", cpu);
> +			continue;
> +		}
> +
> +		count = total = 0;
> +
> +		for_each_cpu(i, policy->related_cpus) {
> +			amu_config = per_cpu(amu_scale_freq, i);
> +			if (amu_config == 1)
> +				count++;
> +			total++;
> +		}
> +
> +		amu_config = (total == count) ? 3 : 4;
> +
> +		ratio = (u64)arch_timer_get_rate() << (2 * SCHED_CAPACITY_SHIFT);
> +		ratio = div64_u64(ratio, policy->cpuinfo.max_freq * 1000);
> +
> +		for_each_cpu(i, policy->related_cpus) {
> +			per_cpu(arch_max_freq_scale, i) = (unsigned long)ratio;
> +			per_cpu(amu_scale_freq, i) = amu_config;
> +			cpumask_set_cpu(i, checked_cpus);
> +		}
> +
> +		cpufreq_cpu_put(policy);
> +	}
> +
> +	kfree(checked_cpus);
> +
> +	return 0;
> +}
> +late_initcall(init_amu_feature);
> +

Yes, with the design I mentioned above, this CPU policy validation could
move to a late_initcall and I could drop the workqueues and the extra
data structure. Thanks for this!

Let me know what you think!

Thank you,
Ionela.

>  void topology_scale_freq_tick(void)
>  {
>  	u64 prev_core_cnt, prev_const_cnt;
>  	u64 core_cnt, const_cnt, scale;
> 
> -	if (!this_cpu_read(amu_scale_freq))
> +	if (this_cpu_read(amu_scale_freq) != 3)
>  		return;
> 
>  	const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
> 
> 
> -------------------------8<------------------------------------
> 
> Regards,
> Lukasz
> 
> > 
> > Thank you,
> > Ionela.
> > 

