* [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
@ 2022-04-15 19:19 Thomas Gleixner
  2022-04-15 19:19 ` [patch 01/10] x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu() Thomas Gleixner
                   ` (13 more replies)
  0 siblings, 14 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-15 19:19 UTC (permalink / raw)
  To: LKML; +Cc: x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney

APERF/MPERF is utilized in two ways:

  1) Ad hoc readout of CPU frequency which requires IPIs

  2) Frequency scale calculation for frequency invariant scheduling which
     reads APERF/MPERF on every tick.

These are completely independent code parts. Eric observed long latencies
when reading /proc/cpuinfo, which reads out the CPU frequency via #1, and
proposed to replace the per-CPU single IPI with a broadcast IPI.

While this makes the latency smaller, it is not necessary at all because #2
samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs,
which are already excluded from the IPIs.
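
Both users boil down to the same arithmetic; a minimal sketch, assuming
<linux/math64.h> for div64_u64() (khz_from_deltas() is a hypothetical
helper, not a kernel function):

/*
 * Effective frequency over a sampling interval:
 *
 *	kHz = base_khz * delta_APERF / delta_MPERF
 */
static unsigned int khz_from_deltas(u64 acnt, u64 mcnt, unsigned int base_khz)
{
	if (!mcnt)	/* MPERF may not have advanced between reads */
		return 0;

	return div64_u64((u64)base_khz * acnt, mcnt);
}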

It could be argued that not all APERF/MPERF capable systems have the
required BIOS information to enable frequency invariance support, but in
practice most of them do. So the APERF/MPERF sampling can be made
unconditional and just the frequency scale calculation for the scheduler
excluded.

The following series consolidates that.

Thanks,

	tglx
---
 arch/x86/include/asm/cpu.h       |    2 
 arch/x86/include/asm/topology.h  |   17 -
 arch/x86/kernel/acpi/cppc.c      |   28 --
 arch/x86/kernel/cpu/aperfmperf.c |  474 +++++++++++++++++++++++++++++++--------
 arch/x86/kernel/cpu/proc.c       |    2 
 arch/x86/kernel/smpboot.c        |  358 -----------------------------
 fs/proc/cpuinfo.c                |    6 
 include/linux/cpufreq.h          |    1 
 8 files changed, 405 insertions(+), 483 deletions(-)



^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 01/10] x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu()
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
@ 2022-04-15 19:19 ` Thomas Gleixner
  2022-04-19 15:34   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  2022-04-15 19:19 ` [patch 02/10] x86/smp: Move APERF/MPERF code where it belongs Thomas Gleixner
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-15 19:19 UTC (permalink / raw)
  To: LKML; +Cc: x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney

aperfmperf_get_khz() already excludes idle CPUs from APERF/MPERF sampling
and that's a reasonable decision. There is no point in sending up to two
IPIs to an idle CPU just because someone reads a sysfs file.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/aperfmperf.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -139,6 +139,9 @@ unsigned int arch_freq_get_on_cpu(int cp
 	if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
 		return 0;
 
+	if (rcu_is_idle_cpu(cpu))
+		return 0;
+
 	if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
 		return per_cpu(samples.khz, cpu);
 

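Callers treat a zero return as "no usable sample" and fall back; a sketch of
the pattern, matching what show_cpuinfo() ends up doing by patch 10 (the
final cpu_khz fallback is in the surrounding proc.c code, not shown in that
hunk):

	unsigned int freq = arch_freq_get_on_cpu(cpu);

	if (!freq)
		freq = cpufreq_quick_get(cpu);	/* cpufreq's cached value */
	if (!freq)
		freq = cpu_khz;			/* last resort */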

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 02/10] x86/smp: Move APERF/MPERF code where it belongs
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
  2022-04-15 19:19 ` [patch 01/10] x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu() Thomas Gleixner
@ 2022-04-15 19:19 ` Thomas Gleixner
  2022-04-19 15:40   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  2022-04-15 19:19 ` [patch 03/10] x86/aperfmperf: Separate AP/BP frequency invariance init Thomas Gleixner
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-15 19:19 UTC (permalink / raw)
  To: LKML; +Cc: x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney

Move the APERF/MPERF frequency invariance code from smpboot.c to
arch/x86/kernel/cpu/aperfmperf.c, as this can share code with the
preexisting APERF/MPERF code there.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/aperfmperf.c |  366 ++++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/smpboot.c        |  355 -------------------------------------
 2 files changed, 362 insertions(+), 359 deletions(-)

--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -6,15 +6,19 @@
  * Copyright (C) 2017 Intel Corp.
  * Author: Len Brown <len.brown@intel.com>
  */
-
+#include <linux/cpufreq.h>
 #include <linux/delay.h>
 #include <linux/ktime.h>
 #include <linux/math64.h>
 #include <linux/percpu.h>
-#include <linux/cpufreq.h>
-#include <linux/smp.h>
-#include <linux/sched/isolation.h>
 #include <linux/rcupdate.h>
+#include <linux/sched/isolation.h>
+#include <linux/sched/topology.h>
+#include <linux/smp.h>
+#include <linux/syscore_ops.h>
+
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
 
 #include "cpu.h"
 
@@ -152,3 +156,357 @@ unsigned int arch_freq_get_on_cpu(int cp
 
 	return per_cpu(samples.khz, cpu);
 }
+
+#if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency, corresponding to
+ * freq_curr / freq_max.
+ *
+ * Since the frequency freq_curr on x86 is controlled by micro-controller and
+ * our P-state setting is little more than a request/hint, we need to observe
+ * the effective frequency 'BusyMHz', i.e. the average frequency over a time
+ * interval after discarding idle time. This is given by:
+ *
+ *   BusyMHz = delta_APERF / delta_MPERF * freq_base
+ *
+ * where freq_base is the max non-turbo P-state.
+ *
+ * The freq_max term has to be set to a somewhat arbitrary value, because we
+ * can't know which turbo states will be available at a given point in time:
+ * it all depends on the thermal headroom of the entire package. We set it to
+ * the turbo level with 4 cores active.
+ *
+ * Benchmarks show that's a good compromise between the 1C turbo ratio
+ * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
+ * which would ignore the entire turbo range (a conspicuous part, making
+ * freq_curr/freq_max always maxed out).
+ *
+ * An exception to the heuristic above is the Atom uarch, where we choose the
+ * highest turbo level for freq_max since Atom's are generally oriented towards
+ * power efficiency.
+ *
+ * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
+ * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
+ */
+
+DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
+static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
+
+void arch_set_max_freq_ratio(bool turbo_disabled)
+{
+	arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
+					arch_turbo_freq_ratio;
+}
+EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
+
+static bool turbo_disabled(void)
+{
+	u64 misc_en;
+	int err;
+
+	err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+	if (err)
+		return false;
+
+	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
+}
+
+static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+{
+	int err;
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 16) & 0x3F;     /* max P state */
+	*turbo_freq = *turbo_freq & 0x3F;           /* 1C turbo    */
+
+	return true;
+}
+
+#define X86_MATCH(model)					\
+	X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6,		\
+		INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
+
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+	X86_MATCH(XEON_PHI_KNL),
+	X86_MATCH(XEON_PHI_KNM),
+	{}
+};
+
+static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
+	X86_MATCH(SKYLAKE_X),
+	{}
+};
+
+static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
+	X86_MATCH(ATOM_GOLDMONT),
+	X86_MATCH(ATOM_GOLDMONT_D),
+	X86_MATCH(ATOM_GOLDMONT_PLUS),
+	{}
+};
+
+static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
+				int num_delta_fratio)
+{
+	int fratio, delta_fratio, found;
+	int err, i;
+	u64 msr;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 8) & 0xFF;	    /* max P state */
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+	if (err)
+		return false;
+
+	fratio = (msr >> 8) & 0xFF;
+	i = 16;
+	found = 0;
+	do {
+		if (found >= num_delta_fratio) {
+			*turbo_freq = fratio;
+			return true;
+		}
+
+		delta_fratio = (msr >> (i + 5)) & 0x7;
+
+		if (delta_fratio) {
+			found += 1;
+			fratio -= delta_fratio;
+		}
+
+		i += 8;
+	} while (i < 64);
+
+	return true;
+}
+
+static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
+{
+	u64 ratios, counts;
+	u32 group_size;
+	int err, i;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 8) & 0xFF;      /* max P state */
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
+	if (err)
+		return false;
+
+	for (i = 0; i < 64; i += 8) {
+		group_size = (counts >> i) & 0xFF;
+		if (group_size >= size) {
+			*turbo_freq = (ratios >> i) & 0xFF;
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+{
+	u64 msr;
+	int err;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 8) & 0xFF;    /* max P state */
+	*turbo_freq = (msr >> 24) & 0xFF;         /* 4C turbo    */
+
+	/* The CPU may have less than 4 cores */
+	if (!*turbo_freq)
+		*turbo_freq = msr & 0xFF;         /* 1C turbo    */
+
+	return true;
+}
+
+static bool intel_set_max_freq_ratio(void)
+{
+	u64 base_freq, turbo_freq;
+	u64 turbo_ratio;
+
+	if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
+		goto out;
+
+	if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
+	    skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+		goto out;
+
+	if (x86_match_cpu(has_knl_turbo_ratio_limits) &&
+	    knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+		goto out;
+
+	if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
+	    skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
+		goto out;
+
+	if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
+		goto out;
+
+	return false;
+
+out:
+	/*
+	 * Some hypervisors advertise X86_FEATURE_APERFMPERF
+	 * but then fill all MSR's with zeroes.
+	 * Some CPUs have turbo boost but don't declare any turbo ratio
+	 * in MSR_TURBO_RATIO_LIMIT.
+	 */
+	if (!base_freq || !turbo_freq) {
+		pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
+		return false;
+	}
+
+	turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
+	if (!turbo_ratio) {
+		pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
+		return false;
+	}
+
+	arch_turbo_freq_ratio = turbo_ratio;
+	arch_set_max_freq_ratio(turbo_disabled());
+
+	return true;
+}
+
+static void init_counter_refs(void)
+{
+	u64 aperf, mperf;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+}
+
+#ifdef CONFIG_PM_SLEEP
+static struct syscore_ops freq_invariance_syscore_ops = {
+	.resume = init_counter_refs,
+};
+
+static void register_freq_invariance_syscore_ops(void)
+{
+	/* Bail out if registered already. */
+	if (freq_invariance_syscore_ops.node.prev)
+		return;
+
+	register_syscore_ops(&freq_invariance_syscore_ops);
+}
+#else
+static inline void register_freq_invariance_syscore_ops(void) {}
+#endif
+
+void init_freq_invariance(bool secondary, bool cppc_ready)
+{
+	bool ret = false;
+
+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+		return;
+
+	if (secondary) {
+		if (static_branch_likely(&arch_scale_freq_key)) {
+			init_counter_refs();
+		}
+		return;
+	}
+
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		ret = intel_set_max_freq_ratio();
+	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
+		if (!cppc_ready) {
+			return;
+		}
+		ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
+	}
+
+	if (ret) {
+		init_counter_refs();
+		static_branch_enable(&arch_scale_freq_key);
+		register_freq_invariance_syscore_ops();
+		pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
+	} else {
+		pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
+	}
+}
+
+static void disable_freq_invariance_workfn(struct work_struct *work)
+{
+	static_branch_disable(&arch_scale_freq_key);
+}
+
+static DECLARE_WORK(disable_freq_invariance_work,
+		    disable_freq_invariance_workfn);
+
+DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
+
+void arch_scale_freq_tick(void)
+{
+	u64 freq_scale;
+	u64 aperf, mperf;
+	u64 acnt, mcnt;
+
+	if (!arch_scale_freq_invariant())
+		return;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	acnt = aperf - this_cpu_read(arch_prev_aperf);
+	mcnt = mperf - this_cpu_read(arch_prev_mperf);
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+
+	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
+		goto error;
+
+	if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
+		goto error;
+
+	freq_scale = div64_u64(acnt, mcnt);
+	if (!freq_scale)
+		goto error;
+
+	if (freq_scale > SCHED_CAPACITY_SCALE)
+		freq_scale = SCHED_CAPACITY_SCALE;
+
+	this_cpu_write(arch_freq_scale, freq_scale);
+	return;
+
+error:
+	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
+	schedule_work(&disable_freq_invariance_work);
+}
+#endif /* CONFIG_X86_64 && CONFIG_SMP */
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -56,7 +56,6 @@
 #include <linux/numa.h>
 #include <linux/pgtable.h>
 #include <linux/overflow.h>
-#include <linux/syscore_ops.h>
 
 #include <asm/acpi.h>
 #include <asm/desc.h>
@@ -1847,357 +1846,3 @@ void native_play_dead(void)
 }
 
 #endif
-
-#ifdef CONFIG_X86_64
-/*
- * APERF/MPERF frequency ratio computation.
- *
- * The scheduler wants to do frequency invariant accounting and needs a <1
- * ratio to account for the 'current' frequency, corresponding to
- * freq_curr / freq_max.
- *
- * Since the frequency freq_curr on x86 is controlled by micro-controller and
- * our P-state setting is little more than a request/hint, we need to observe
- * the effective frequency 'BusyMHz', i.e. the average frequency over a time
- * interval after discarding idle time. This is given by:
- *
- *   BusyMHz = delta_APERF / delta_MPERF * freq_base
- *
- * where freq_base is the max non-turbo P-state.
- *
- * The freq_max term has to be set to a somewhat arbitrary value, because we
- * can't know which turbo states will be available at a given point in time:
- * it all depends on the thermal headroom of the entire package. We set it to
- * the turbo level with 4 cores active.
- *
- * Benchmarks show that's a good compromise between the 1C turbo ratio
- * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
- * which would ignore the entire turbo range (a conspicuous part, making
- * freq_curr/freq_max always maxed out).
- *
- * An exception to the heuristic above is the Atom uarch, where we choose the
- * highest turbo level for freq_max since Atom's are generally oriented towards
- * power efficiency.
- *
- * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
- * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
- */
-
-DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
-
-static DEFINE_PER_CPU(u64, arch_prev_aperf);
-static DEFINE_PER_CPU(u64, arch_prev_mperf);
-static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
-static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
-
-void arch_set_max_freq_ratio(bool turbo_disabled)
-{
-	arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
-					arch_turbo_freq_ratio;
-}
-EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
-
-static bool turbo_disabled(void)
-{
-	u64 misc_en;
-	int err;
-
-	err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
-	if (err)
-		return false;
-
-	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
-}
-
-static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
-{
-	int err;
-
-	err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
-	if (err)
-		return false;
-
-	err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
-	if (err)
-		return false;
-
-	*base_freq = (*base_freq >> 16) & 0x3F;     /* max P state */
-	*turbo_freq = *turbo_freq & 0x3F;           /* 1C turbo    */
-
-	return true;
-}
-
-#define X86_MATCH(model)					\
-	X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6,		\
-		INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
-
-static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
-	X86_MATCH(XEON_PHI_KNL),
-	X86_MATCH(XEON_PHI_KNM),
-	{}
-};
-
-static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
-	X86_MATCH(SKYLAKE_X),
-	{}
-};
-
-static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
-	X86_MATCH(ATOM_GOLDMONT),
-	X86_MATCH(ATOM_GOLDMONT_D),
-	X86_MATCH(ATOM_GOLDMONT_PLUS),
-	{}
-};
-
-static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
-				int num_delta_fratio)
-{
-	int fratio, delta_fratio, found;
-	int err, i;
-	u64 msr;
-
-	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
-	if (err)
-		return false;
-
-	*base_freq = (*base_freq >> 8) & 0xFF;	    /* max P state */
-
-	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
-	if (err)
-		return false;
-
-	fratio = (msr >> 8) & 0xFF;
-	i = 16;
-	found = 0;
-	do {
-		if (found >= num_delta_fratio) {
-			*turbo_freq = fratio;
-			return true;
-		}
-
-		delta_fratio = (msr >> (i + 5)) & 0x7;
-
-		if (delta_fratio) {
-			found += 1;
-			fratio -= delta_fratio;
-		}
-
-		i += 8;
-	} while (i < 64);
-
-	return true;
-}
-
-static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
-{
-	u64 ratios, counts;
-	u32 group_size;
-	int err, i;
-
-	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
-	if (err)
-		return false;
-
-	*base_freq = (*base_freq >> 8) & 0xFF;      /* max P state */
-
-	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
-	if (err)
-		return false;
-
-	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
-	if (err)
-		return false;
-
-	for (i = 0; i < 64; i += 8) {
-		group_size = (counts >> i) & 0xFF;
-		if (group_size >= size) {
-			*turbo_freq = (ratios >> i) & 0xFF;
-			return true;
-		}
-	}
-
-	return false;
-}
-
-static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
-{
-	u64 msr;
-	int err;
-
-	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
-	if (err)
-		return false;
-
-	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
-	if (err)
-		return false;
-
-	*base_freq = (*base_freq >> 8) & 0xFF;    /* max P state */
-	*turbo_freq = (msr >> 24) & 0xFF;         /* 4C turbo    */
-
-	/* The CPU may have less than 4 cores */
-	if (!*turbo_freq)
-		*turbo_freq = msr & 0xFF;         /* 1C turbo    */
-
-	return true;
-}
-
-static bool intel_set_max_freq_ratio(void)
-{
-	u64 base_freq, turbo_freq;
-	u64 turbo_ratio;
-
-	if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
-		goto out;
-
-	if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
-	    skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
-		goto out;
-
-	if (x86_match_cpu(has_knl_turbo_ratio_limits) &&
-	    knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
-		goto out;
-
-	if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
-	    skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
-		goto out;
-
-	if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
-		goto out;
-
-	return false;
-
-out:
-	/*
-	 * Some hypervisors advertise X86_FEATURE_APERFMPERF
-	 * but then fill all MSR's with zeroes.
-	 * Some CPUs have turbo boost but don't declare any turbo ratio
-	 * in MSR_TURBO_RATIO_LIMIT.
-	 */
-	if (!base_freq || !turbo_freq) {
-		pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
-		return false;
-	}
-
-	turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
-	if (!turbo_ratio) {
-		pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
-		return false;
-	}
-
-	arch_turbo_freq_ratio = turbo_ratio;
-	arch_set_max_freq_ratio(turbo_disabled());
-
-	return true;
-}
-
-static void init_counter_refs(void)
-{
-	u64 aperf, mperf;
-
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
-}
-
-#ifdef CONFIG_PM_SLEEP
-static struct syscore_ops freq_invariance_syscore_ops = {
-	.resume = init_counter_refs,
-};
-
-static void register_freq_invariance_syscore_ops(void)
-{
-	/* Bail out if registered already. */
-	if (freq_invariance_syscore_ops.node.prev)
-		return;
-
-	register_syscore_ops(&freq_invariance_syscore_ops);
-}
-#else
-static inline void register_freq_invariance_syscore_ops(void) {}
-#endif
-
-void init_freq_invariance(bool secondary, bool cppc_ready)
-{
-	bool ret = false;
-
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return;
-
-	if (secondary) {
-		if (static_branch_likely(&arch_scale_freq_key)) {
-			init_counter_refs();
-		}
-		return;
-	}
-
-	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
-		ret = intel_set_max_freq_ratio();
-	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
-		if (!cppc_ready) {
-			return;
-		}
-		ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
-	}
-
-	if (ret) {
-		init_counter_refs();
-		static_branch_enable(&arch_scale_freq_key);
-		register_freq_invariance_syscore_ops();
-		pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
-	} else {
-		pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
-	}
-}
-
-static void disable_freq_invariance_workfn(struct work_struct *work)
-{
-	static_branch_disable(&arch_scale_freq_key);
-}
-
-static DECLARE_WORK(disable_freq_invariance_work,
-		    disable_freq_invariance_workfn);
-
-DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
-
-void arch_scale_freq_tick(void)
-{
-	u64 freq_scale;
-	u64 aperf, mperf;
-	u64 acnt, mcnt;
-
-	if (!arch_scale_freq_invariant())
-		return;
-
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	acnt = aperf - this_cpu_read(arch_prev_aperf);
-	mcnt = mperf - this_cpu_read(arch_prev_mperf);
-
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
-
-	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
-		goto error;
-
-	if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
-		goto error;
-
-	freq_scale = div64_u64(acnt, mcnt);
-	if (!freq_scale)
-		goto error;
-
-	if (freq_scale > SCHED_CAPACITY_SCALE)
-		freq_scale = SCHED_CAPACITY_SCALE;
-
-	this_cpu_write(arch_freq_scale, freq_scale);
-	return;
-
-error:
-	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
-	schedule_work(&disable_freq_invariance_work);
-}
-#endif /* CONFIG_X86_64 */

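The fixed-point math in arch_scale_freq_tick() is dense; a standalone sketch
of the same computation (plain C, overflow checks elided, values
hypothetical):

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)	/* 1024 */

/*
 * freq_scale = (acnt << 20) / (mcnt * max_ratio), clamped to 1024.
 * Example: acnt == mcnt (running at freq_base) with a 4C turbo ratio
 * of 1280 yields 2^20 / 1280 = 819, i.e. ~80% of max capacity.
 */
static unsigned long compute_freq_scale(u64 acnt, u64 mcnt, u64 max_ratio)
{
	u64 scale;

	acnt <<= 2 * SCHED_CAPACITY_SHIFT;
	mcnt *= max_ratio;
	if (!mcnt)
		return SCHED_CAPACITY_SCALE;

	scale = acnt / mcnt;		/* div64_u64() in the kernel */
	return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
}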

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 03/10] x86/aperfmperf: Separate AP/BP frequency invariance init
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
  2022-04-15 19:19 ` [patch 01/10] x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu() Thomas Gleixner
  2022-04-15 19:19 ` [patch 02/10] x86/smp: Move APERF/MPERF code where it belongs Thomas Gleixner
@ 2022-04-15 19:19 ` Thomas Gleixner
  2022-04-19 16:04   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  2022-04-15 19:19 ` [patch 04/10] x86/aperfmperf: Untangle Intel and AMD " Thomas Gleixner
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-15 19:19 UTC (permalink / raw)
  To: LKML; +Cc: x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney

This code is convoluted, and because it can be invoked post-init via the
ACPI/CPPC code, all of the initialization functionality is built in instead
of being part of init text and init data.

As a first step create separate calls for the boot and the application
processors.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/topology.h  |   12 +++++-------
 arch/x86/kernel/acpi/cppc.c      |    3 ++-
 arch/x86/kernel/cpu/aperfmperf.c |   23 +++++++++++------------
 arch/x86/kernel/smpboot.c        |    4 ++--
 4 files changed, 20 insertions(+), 22 deletions(-)

--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -216,14 +216,12 @@ extern void arch_scale_freq_tick(void);
 #define arch_scale_freq_tick arch_scale_freq_tick
 
 extern void arch_set_max_freq_ratio(bool turbo_disabled);
-void init_freq_invariance(bool secondary, bool cppc_ready);
+extern void bp_init_freq_invariance(bool cppc_ready);
+extern void ap_init_freq_invariance(void);
 #else
-static inline void arch_set_max_freq_ratio(bool turbo_disabled)
-{
-}
-static inline void init_freq_invariance(bool secondary, bool cppc_ready)
-{
-}
+static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
+static inline void bp_init_freq_invariance(bool cppc_ready) { }
+static inline void ap_init_freq_invariance(void) { }
 #endif
 
 #ifdef CONFIG_ACPI_CPPC_LIB
--- a/arch/x86/kernel/acpi/cppc.c
+++ b/arch/x86/kernel/acpi/cppc.c
@@ -96,7 +96,8 @@ void init_freq_invariance_cppc(void)
 
 	mutex_lock(&freq_invariance_lock);
 
-	init_freq_invariance(secondary, true);
+	if (!secondary)
+		bp_init_freq_invariance(true);
 	secondary = true;
 
 	mutex_unlock(&freq_invariance_lock);
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -428,31 +428,24 @@ static void register_freq_invariance_sys
 static inline void register_freq_invariance_syscore_ops(void) {}
 #endif
 
-void init_freq_invariance(bool secondary, bool cppc_ready)
+void bp_init_freq_invariance(bool cppc_ready)
 {
-	bool ret = false;
+	bool ret;
 
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
 		return;
 
-	if (secondary) {
-		if (static_branch_likely(&arch_scale_freq_key)) {
-			init_counter_refs();
-		}
-		return;
-	}
+	init_counter_refs();
 
 	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
 		ret = intel_set_max_freq_ratio();
 	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
-		if (!cppc_ready) {
+		if (!cppc_ready)
 			return;
-		}
 		ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
 	}
 
 	if (ret) {
-		init_counter_refs();
 		static_branch_enable(&arch_scale_freq_key);
 		register_freq_invariance_syscore_ops();
 		pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
@@ -461,6 +454,12 @@ void init_freq_invariance(bool secondary
 	}
 }
 
+void ap_init_freq_invariance(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		init_counter_refs();
+}
+
 static void disable_freq_invariance_workfn(struct work_struct *work)
 {
 	static_branch_disable(&arch_scale_freq_key);
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -186,7 +186,7 @@ static void smp_callin(void)
 	 */
 	set_cpu_sibling_map(raw_smp_processor_id());
 
-	init_freq_invariance(true, false);
+	ap_init_freq_invariance();
 
 	/*
 	 * Get our bogomips.
@@ -1396,7 +1396,7 @@ void __init native_smp_prepare_cpus(unsi
 {
 	smp_prepare_cpus_common();
 
-	init_freq_invariance(false, false);
+	bp_init_freq_invariance(false);
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {

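For orientation, the resulting init flow after this patch (sketch):

/*
 * Boot CPU:  native_smp_prepare_cpus()
 *              bp_init_freq_invariance(false)	- Intel path at boot
 *            init_freq_invariance_cppc()	- AMD, later via ACPI/CPPC
 *              bp_init_freq_invariance(true)
 *
 * Each AP:   smp_callin()
 *              ap_init_freq_invariance()
 *                init_counter_refs()		- if APERF/MPERF is available
 */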

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 04/10] x86/aperfmperf: Untangle Intel and AMD frequency invariance init
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
                   ` (2 preceding siblings ...)
  2022-04-15 19:19 ` [patch 03/10] x86/aperfmperf: Separate AP/BP frequency invariance init Thomas Gleixner
@ 2022-04-15 19:19 ` Thomas Gleixner
  2022-04-19 16:12   ` Rafael J. Wysocki
                     ` (2 more replies)
  2022-04-15 19:19 ` [patch 05/10] x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct Thomas Gleixner
                   ` (9 subsequent siblings)
  13 siblings, 3 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-15 19:19 UTC (permalink / raw)
  To: LKML; +Cc: x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney

AMD boot CPU initialization happens late via ACPI/CPPC, which prevents the
Intel parts from being marked __init.

Split out the common code and provide a dedicated interface for the AMD
initialization and mark the Intel specific code and data __init.

The remaining text size is almost cut in half:

  text:		2614	->	1350
  init.text:	   0	->	 786

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/topology.h  |   13 ++------
 arch/x86/kernel/acpi/cppc.c      |   29 +++++++-----------
 arch/x86/kernel/cpu/aperfmperf.c |   62 ++++++++++++++++++++-------------------
 arch/x86/kernel/smpboot.c        |    2 -
 4 files changed, 49 insertions(+), 57 deletions(-)

--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -216,24 +216,19 @@ extern void arch_scale_freq_tick(void);
 #define arch_scale_freq_tick arch_scale_freq_tick
 
 extern void arch_set_max_freq_ratio(bool turbo_disabled);
-extern void bp_init_freq_invariance(bool cppc_ready);
+extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
+extern void bp_init_freq_invariance(void);
 extern void ap_init_freq_invariance(void);
 #else
 static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
-static inline void bp_init_freq_invariance(bool cppc_ready) { }
+static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
+static inline void bp_init_freq_invariance(void) { }
 static inline void ap_init_freq_invariance(void) { }
 #endif
 
 #ifdef CONFIG_ACPI_CPPC_LIB
 void init_freq_invariance_cppc(void);
 #define arch_init_invariance_cppc init_freq_invariance_cppc
-
-bool amd_set_max_freq_ratio(u64 *ratio);
-#else
-static inline bool amd_set_max_freq_ratio(u64 *ratio)
-{
-	return false;
-}
 #endif
 
 #endif /* _ASM_X86_TOPOLOGY_H */
--- a/arch/x86/kernel/acpi/cppc.c
+++ b/arch/x86/kernel/acpi/cppc.c
@@ -50,20 +50,17 @@ int cpc_write_ffh(int cpunum, struct cpc
 	return err;
 }
 
-bool amd_set_max_freq_ratio(u64 *ratio)
+static void amd_set_max_freq_ratio(void)
 {
 	struct cppc_perf_caps perf_caps;
 	u64 highest_perf, nominal_perf;
 	u64 perf_ratio;
 	int rc;
 
-	if (!ratio)
-		return false;
-
 	rc = cppc_get_perf_caps(0, &perf_caps);
 	if (rc) {
 		pr_debug("Could not retrieve perf counters (%d)\n", rc);
-		return false;
+		return;
 	}
 
 	highest_perf = amd_get_highest_perf();
@@ -71,7 +68,7 @@ bool amd_set_max_freq_ratio(u64 *ratio)
 
 	if (!highest_perf || !nominal_perf) {
 		pr_debug("Could not retrieve highest or nominal performance\n");
-		return false;
+		return;
 	}
 
 	perf_ratio = div_u64(highest_perf * SCHED_CAPACITY_SCALE, nominal_perf);
@@ -79,26 +76,24 @@ bool amd_set_max_freq_ratio(u64 *ratio)
 	perf_ratio = (perf_ratio + SCHED_CAPACITY_SCALE) >> 1;
 	if (!perf_ratio) {
 		pr_debug("Non-zero highest/nominal perf values led to a 0 ratio\n");
-		return false;
+		return;
 	}
 
-	*ratio = perf_ratio;
-	arch_set_max_freq_ratio(false);
-
-	return true;
+	freq_invariance_set_perf_ratio(perf_ratio, false);
 }
 
 static DEFINE_MUTEX(freq_invariance_lock);
 
 void init_freq_invariance_cppc(void)
 {
-	static bool secondary;
+	static bool init_done;
 
-	mutex_lock(&freq_invariance_lock);
-
-	if (!secondary)
-		bp_init_freq_invariance(true);
-	secondary = true;
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		return;
 
+	mutex_lock(&freq_invariance_lock);
+	if (!init_done)
+		amd_set_max_freq_ratio();
+	init_done = true;
 	mutex_unlock(&freq_invariance_lock);
 }
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -206,7 +206,7 @@ void arch_set_max_freq_ratio(bool turbo_
 }
 EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
 
-static bool turbo_disabled(void)
+static bool __init turbo_disabled(void)
 {
 	u64 misc_en;
 	int err;
@@ -218,7 +218,7 @@ static bool turbo_disabled(void)
 	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
 }
 
-static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+static bool __init slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
 {
 	int err;
 
@@ -240,26 +240,26 @@ static bool slv_set_max_freq_ratio(u64 *
 	X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6,		\
 		INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
 
-static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] __initconst = {
 	X86_MATCH(XEON_PHI_KNL),
 	X86_MATCH(XEON_PHI_KNM),
 	{}
 };
 
-static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
+static const struct x86_cpu_id has_skx_turbo_ratio_limits[] __initconst = {
 	X86_MATCH(SKYLAKE_X),
 	{}
 };
 
-static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
+static const struct x86_cpu_id has_glm_turbo_ratio_limits[] __initconst = {
 	X86_MATCH(ATOM_GOLDMONT),
 	X86_MATCH(ATOM_GOLDMONT_D),
 	X86_MATCH(ATOM_GOLDMONT_PLUS),
 	{}
 };
 
-static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
-				int num_delta_fratio)
+static bool __init knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
+					  int num_delta_fratio)
 {
 	int fratio, delta_fratio, found;
 	int err, i;
@@ -297,7 +297,7 @@ static bool knl_set_max_freq_ratio(u64 *
 	return true;
 }
 
-static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
+static bool __init skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
 {
 	u64 ratios, counts;
 	u32 group_size;
@@ -328,7 +328,7 @@ static bool skx_set_max_freq_ratio(u64 *
 	return false;
 }
 
-static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+static bool __init core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
 {
 	u64 msr;
 	int err;
@@ -351,7 +351,7 @@ static bool core_set_max_freq_ratio(u64
 	return true;
 }
 
-static bool intel_set_max_freq_ratio(void)
+static bool __init intel_set_max_freq_ratio(void)
 {
 	u64 base_freq, turbo_freq;
 	u64 turbo_ratio;
@@ -418,40 +418,42 @@ static struct syscore_ops freq_invarianc
 
 static void register_freq_invariance_syscore_ops(void)
 {
-	/* Bail out if registered already. */
-	if (freq_invariance_syscore_ops.node.prev)
-		return;
-
 	register_syscore_ops(&freq_invariance_syscore_ops);
 }
 #else
 static inline void register_freq_invariance_syscore_ops(void) {}
 #endif
 
-void bp_init_freq_invariance(bool cppc_ready)
+static void freq_invariance_enable(void)
+{
+	if (static_branch_unlikely(&arch_scale_freq_key)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+	static_branch_enable(&arch_scale_freq_key);
+	register_freq_invariance_syscore_ops();
+	pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
+}
+
+void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled)
 {
-	bool ret;
+	arch_turbo_freq_ratio = ratio;
+	arch_set_max_freq_ratio(turbo_disabled);
+	freq_invariance_enable();
+}
 
+void __init bp_init_freq_invariance(void)
+{
 	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
 		return;
 
 	init_counter_refs();
 
-	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
-		ret = intel_set_max_freq_ratio();
-	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
-		if (!cppc_ready)
-			return;
-		ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
-	}
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+		return;
 
-	if (ret) {
-		static_branch_enable(&arch_scale_freq_key);
-		register_freq_invariance_syscore_ops();
-		pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
-	} else {
-		pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
-	}
+	if (intel_set_max_freq_ratio())
+		freq_invariance_enable();
 }
 
 void ap_init_freq_invariance(void)
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1396,7 +1396,7 @@ void __init native_smp_prepare_cpus(unsi
 {
 	smp_prepare_cpus_common();
 
-	bp_init_freq_invariance(false);
+	bp_init_freq_invariance();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {

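The AMD ratio computation above, condensed; the CPPC performance values are
hypothetical and purely illustrative:

/*
 * Midpoint between the boost and base performance ratios, scaled by
 * SCHED_CAPACITY_SCALE (1024). E.g. highest_perf = 300, nominal_perf
 * = 200: 300 * 1024 / 200 = 1536, (1536 + 1024) >> 1 = 1280.
 */
static u64 amd_perf_ratio(u64 highest_perf, u64 nominal_perf)
{
	u64 r = div_u64(highest_perf * SCHED_CAPACITY_SCALE, nominal_perf);

	return (r + SCHED_CAPACITY_SCALE) >> 1;
}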

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 05/10] x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
                   ` (3 preceding siblings ...)
  2022-04-15 19:19 ` [patch 04/10] x86/aperfmperf: Untangle Intel and AMD " Thomas Gleixner
@ 2022-04-15 19:19 ` Thomas Gleixner
  2022-04-19 16:15   ` Rafael J. Wysocki
                     ` (2 more replies)
  2022-04-15 19:19 ` [patch 06/10] x86/aperfmperf: Restructure arch_scale_freq_tick() Thomas Gleixner
                   ` (8 subsequent siblings)
  13 siblings, 3 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-15 19:19 UTC (permalink / raw)
  To: LKML; +Cc: x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney

Preparation for sharing code with the CPU frequency portion of the
aperf/mperf code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/aperfmperf.c |   26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -22,6 +22,13 @@
 
 #include "cpu.h"
 
+struct aperfmperf {
+	u64		aperf;
+	u64		mperf;
+};
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples);
+
 struct aperfmperf_sample {
 	unsigned int	khz;
 	atomic_t	scfpending;
@@ -194,8 +201,6 @@ unsigned int arch_freq_get_on_cpu(int cp
 
 DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
 
-static DEFINE_PER_CPU(u64, arch_prev_aperf);
-static DEFINE_PER_CPU(u64, arch_prev_mperf);
 static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
 static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
 
@@ -407,8 +412,8 @@ static void init_counter_refs(void)
 	rdmsrl(MSR_IA32_APERF, aperf);
 	rdmsrl(MSR_IA32_MPERF, mperf);
 
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
+	this_cpu_write(cpu_samples.aperf, aperf);
+	this_cpu_write(cpu_samples.mperf, mperf);
 }
 
 #ifdef CONFIG_PM_SLEEP
@@ -474,9 +479,8 @@ DEFINE_PER_CPU(unsigned long, arch_freq_
 
 void arch_scale_freq_tick(void)
 {
-	u64 freq_scale;
-	u64 aperf, mperf;
-	u64 acnt, mcnt;
+	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
+	u64 aperf, mperf, acnt, mcnt, freq_scale;
 
 	if (!arch_scale_freq_invariant())
 		return;
@@ -484,11 +488,11 @@ void arch_scale_freq_tick(void)
 	rdmsrl(MSR_IA32_APERF, aperf);
 	rdmsrl(MSR_IA32_MPERF, mperf);
 
-	acnt = aperf - this_cpu_read(arch_prev_aperf);
-	mcnt = mperf - this_cpu_read(arch_prev_mperf);
+	acnt = aperf - s->aperf;
+	mcnt = mperf - s->mperf;
 
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
+	s->aperf = aperf;
+	s->mperf = mperf;
 
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
 		goto error;

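A note on the two per-CPU accessor styles used above (sketch; cpu_samples as
defined in the hunk):

	/* Write one member through the per-CPU offset: */
	this_cpu_write(cpu_samples.aperf, aperf);

	/* Or take a pointer to this CPU's instance and dereference it: */
	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);

	s->aperf = aperf;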

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 06/10] x86/aperfmperf: Restructure arch_scale_freq_tick()
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
                   ` (4 preceding siblings ...)
  2022-04-15 19:19 ` [patch 05/10] x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct Thomas Gleixner
@ 2022-04-15 19:19 ` Thomas Gleixner
  2022-04-19 16:20   ` Rafael J. Wysocki
                     ` (2 more replies)
  2022-04-15 19:19 ` [patch 07/10] x86/aperfmperf: Make parts of the frequency invariance code unconditional Thomas Gleixner
                   ` (7 subsequent siblings)
  13 siblings, 3 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-15 19:19 UTC (permalink / raw)
  To: LKML; +Cc: x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney

Preparation for sharing code with the CPU frequency portion of the
aperf/mperf code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/aperfmperf.c |   36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -477,22 +477,9 @@ static DECLARE_WORK(disable_freq_invaria
 
 DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
 
-void arch_scale_freq_tick(void)
+static void scale_freq_tick(u64 acnt, u64 mcnt)
 {
-	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
-	u64 aperf, mperf, acnt, mcnt, freq_scale;
-
-	if (!arch_scale_freq_invariant())
-		return;
-
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	acnt = aperf - s->aperf;
-	mcnt = mperf - s->mperf;
-
-	s->aperf = aperf;
-	s->mperf = mperf;
+	u64 freq_scale;
 
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
 		goto error;
@@ -514,4 +501,23 @@ void arch_scale_freq_tick(void)
 	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
 	schedule_work(&disable_freq_invariance_work);
 }
+
+void arch_scale_freq_tick(void)
+{
+	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
+	u64 acnt, mcnt, aperf, mperf;
+
+	if (!arch_scale_freq_invariant())
+		return;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+	acnt = aperf - s->aperf;
+	mcnt = mperf - s->mperf;
+
+	s->aperf = aperf;
+	s->mperf = mperf;
+
+	scale_freq_tick(acnt, mcnt);
+}
 #endif /* CONFIG_X86_64 && CONFIG_SMP */


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 07/10] x86/aperfmperf: Make parts of the frequency invariance code unconditional
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
                   ` (5 preceding siblings ...)
  2022-04-15 19:19 ` [patch 06/10] x86/aperfmperf: Restructure arch_scale_freq_tick() Thomas Gleixner
@ 2022-04-15 19:19 ` Thomas Gleixner
  2022-04-19 16:27   ` Rafael J. Wysocki
                     ` (2 more replies)
  2022-04-15 19:20 ` [patch 08/10] x86/aperfmperf: Store aperf/mperf data for cpu frequency reads Thomas Gleixner
                   ` (6 subsequent siblings)
  13 siblings, 3 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-15 19:19 UTC (permalink / raw)
  To: LKML; +Cc: x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney

The frequency invariance support is currently limited to x86/64 and SMP,
which is the vast majority of machines.

arch_scale_freq_tick() is called every tick on all CPUs and reads the APERF
and MPERF MSRs. The CPU frequency getter functions do the same via dedicated
IPIs.

While it could be argued that on systems where frequency invariance support
is disabled (32bit, !SMP) the per tick read of the APERF and MPERF MSRs can
be avoided, it does not make sense to keep the extra code and the resulting
runtime issues of mass IPIs around.

As a first step, split out the initialization code that is not specific to
frequency invariance and the MSR-read portion of arch_scale_freq_tick(). The
rest of the code stays conditional and guarded with a static key.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/cpu.h       |    2 +
 arch/x86/include/asm/topology.h  |    4 --
 arch/x86/kernel/cpu/aperfmperf.c |   63 +++++++++++++++++++++++----------------
 arch/x86/kernel/smpboot.c        |    3 -
 4 files changed, 41 insertions(+), 31 deletions(-)

--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -36,6 +36,8 @@ extern int _debug_hotplug_cpu(int cpu, i
 #endif
 #endif
 
+extern void ap_init_aperfmperf(void);
+
 int mwait_usable(const struct cpuinfo_x86 *);
 
 unsigned int x86_family(unsigned int sig);
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -217,13 +217,9 @@ extern void arch_scale_freq_tick(void);
 
 extern void arch_set_max_freq_ratio(bool turbo_disabled);
 extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
-extern void bp_init_freq_invariance(void);
-extern void ap_init_freq_invariance(void);
 #else
 static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
 static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
-static inline void bp_init_freq_invariance(void) { }
-static inline void ap_init_freq_invariance(void) { }
 #endif
 
 #ifdef CONFIG_ACPI_CPPC_LIB
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -17,6 +17,7 @@
 #include <linux/smp.h>
 #include <linux/syscore_ops.h>
 
+#include <asm/cpu.h>
 #include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 
@@ -164,6 +165,17 @@ unsigned int arch_freq_get_on_cpu(int cp
 	return per_cpu(samples.khz, cpu);
 }
 
+static void init_counter_refs(void)
+{
+	u64 aperf, mperf;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	this_cpu_write(cpu_samples.aperf, aperf);
+	this_cpu_write(cpu_samples.mperf, mperf);
+}
+
 #if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
 /*
  * APERF/MPERF frequency ratio computation.
@@ -405,17 +417,6 @@ static bool __init intel_set_max_freq_ra
 	return true;
 }
 
-static void init_counter_refs(void)
-{
-	u64 aperf, mperf;
-
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	this_cpu_write(cpu_samples.aperf, aperf);
-	this_cpu_write(cpu_samples.mperf, mperf);
-}
-
 #ifdef CONFIG_PM_SLEEP
 static struct syscore_ops freq_invariance_syscore_ops = {
 	.resume = init_counter_refs,
@@ -447,13 +448,8 @@ void freq_invariance_set_perf_ratio(u64
 	freq_invariance_enable();
 }
 
-void __init bp_init_freq_invariance(void)
+static void __init bp_init_freq_invariance(void)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
-		return;
-
-	init_counter_refs();
-
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
 		return;
 
@@ -461,12 +457,6 @@ void __init bp_init_freq_invariance(void
 		freq_invariance_enable();
 }
 
-void ap_init_freq_invariance(void)
-{
-	if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
-		init_counter_refs();
-}
-
 static void disable_freq_invariance_workfn(struct work_struct *work)
 {
 	static_branch_disable(&arch_scale_freq_key);
@@ -481,6 +471,9 @@ static void scale_freq_tick(u64 acnt, u6
 {
 	u64 freq_scale;
 
+	if (!arch_scale_freq_invariant())
+		return;
+
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
 		goto error;
 
@@ -501,13 +494,17 @@ static void scale_freq_tick(u64 acnt, u6
 	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
 	schedule_work(&disable_freq_invariance_work);
 }
+#else
+static inline void bp_init_freq_invariance(void) { }
+static inline void scale_freq_tick(u64 acnt, u64 mcnt) { }
+#endif /* CONFIG_X86_64 && CONFIG_SMP */
 
 void arch_scale_freq_tick(void)
 {
 	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
 	u64 acnt, mcnt, aperf, mperf;
 
-	if (!arch_scale_freq_invariant())
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
 		return;
 
 	rdmsrl(MSR_IA32_APERF, aperf);
@@ -520,4 +517,20 @@ void arch_scale_freq_tick(void)
 
 	scale_freq_tick(acnt, mcnt);
 }
-#endif /* CONFIG_X86_64 && CONFIG_SMP */
+
+static int __init bp_init_aperfmperf(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		return 0;
+
+	init_counter_refs();
+	bp_init_freq_invariance();
+	return 0;
+}
+early_initcall(bp_init_aperfmperf);
+
+void ap_init_aperfmperf(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		init_counter_refs();
+}
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -186,7 +186,7 @@ static void smp_callin(void)
 	 */
 	set_cpu_sibling_map(raw_smp_processor_id());
 
-	ap_init_freq_invariance();
+	ap_init_aperfmperf();
 
 	/*
 	 * Get our bogomips.
@@ -1396,7 +1396,6 @@ void __init native_smp_prepare_cpus(unsi
 {
 	smp_prepare_cpus_common();
 
-	bp_init_freq_invariance();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {

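The resulting split after this patch, for orientation (sketch):

/*
 * bp_init_aperfmperf()		- early_initcall(), boot CPU:
 *   init_counter_refs()	- unconditional (given APERFMPERF)
 *   bp_init_freq_invariance()	- Intel ratio + static key only
 *
 * ap_init_aperfmperf()		- smp_callin(), each AP:
 *   init_counter_refs()
 */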

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 08/10] x86/aperfmperf: Store aperf/mperf data for cpu frequency reads
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
                   ` (6 preceding siblings ...)
  2022-04-15 19:19 ` [patch 07/10] x86/aperfmperf: Make parts of the frequency invariance code unconditional Thomas Gleixner
@ 2022-04-15 19:20 ` Thomas Gleixner
  2022-04-19 16:30   ` Rafael J. Wysocki
                     ` (2 more replies)
  2022-04-15 19:20 ` [patch 09/10] x86/aperfmperf: Replace aperfmperf_get_khz() Thomas Gleixner
                   ` (5 subsequent siblings)
  13 siblings, 3 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-15 19:20 UTC (permalink / raw)
  To: LKML; +Cc: x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney

Now that the MSR readout is unconditional, store the results in the per CPU
data structure along with a jiffies timestamp for the CPU frequency readout
code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/aperfmperf.c |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -24,11 +24,17 @@
 #include "cpu.h"
 
 struct aperfmperf {
+	seqcount_t	seq;
+	unsigned long	last_update;
+	u64		acnt;
+	u64		mcnt;
 	u64		aperf;
 	u64		mperf;
 };
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples) = {
+	.seq = SEQCNT_ZERO(cpu_samples.seq)
+};
 
 struct aperfmperf_sample {
 	unsigned int	khz;
@@ -515,6 +521,12 @@ void arch_scale_freq_tick(void)
 	s->aperf = aperf;
 	s->mperf = mperf;
 
+	raw_write_seqcount_begin(&s->seq);
+	s->last_update = jiffies;
+	s->acnt = acnt;
+	s->mcnt = mcnt;
+	raw_write_seqcount_end(&s->seq);
+
 	scale_freq_tick(acnt, mcnt);
 }
 

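The writer above pairs with a lockless reader added in the next patch; both
halves of the seqcount pattern side by side (variables as in the hunks):

	/* Writer, in the tick path on the sampled CPU: */
	raw_write_seqcount_begin(&s->seq);
	s->last_update = jiffies;
	s->acnt = acnt;
	s->mcnt = mcnt;
	raw_write_seqcount_end(&s->seq);

	/* Reader, on any CPU; retries if it raced with the writer: */
	do {
		seq = raw_read_seqcount_begin(&s->seq);
		last = s->last_update;
		acnt = s->acnt;
		mcnt = s->mcnt;
	} while (read_seqcount_retry(&s->seq, seq));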

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 09/10] x86/aperfmperf: Replace aperfmperf_get_khz()
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
                   ` (7 preceding siblings ...)
  2022-04-15 19:20 ` [patch 08/10] x86/aperfmperf: Store aperf/mperf data for cpu frequency reads Thomas Gleixner
@ 2022-04-15 19:20 ` Thomas Gleixner
  2022-04-19 16:35   ` Rafael J. Wysocki
                     ` (2 more replies)
  2022-04-15 19:20 ` [patch 10/10] x86/aperfmperf: Replace arch_freq_get_on_cpu() Thomas Gleixner
                   ` (4 subsequent siblings)
  13 siblings, 3 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-15 19:20 UTC (permalink / raw)
  To: LKML
  Cc: x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney,
	Eric Dumazet

The frequency invariance infrastructure provides the APERF/MPERF samples
already. Utilize them for the cpu frequency display in /proc/cpuinfo.

The sample is considered valid for 20ms. So for idle or isolated NOHZ full
CPUs the function returns 0, which matches the previous behaviour.

This gets rid of the mass IPIs and the 20ms stabilization delay that Eric
observed when reading /proc/cpuinfo.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/aperfmperf.c |   77 +++++++++++++++++----------------------
 fs/proc/cpuinfo.c                |    6 ---
 include/linux/cpufreq.h          |    1 
 3 files changed, 35 insertions(+), 49 deletions(-)

--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -100,49 +100,6 @@ static bool aperfmperf_snapshot_cpu(int
 	return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
 }
 
-unsigned int aperfmperf_get_khz(int cpu)
-{
-	if (!cpu_khz)
-		return 0;
-
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return 0;
-
-	if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
-		return 0;
-
-	if (rcu_is_idle_cpu(cpu))
-		return 0; /* Idle CPUs are completely uninteresting. */
-
-	aperfmperf_snapshot_cpu(cpu, ktime_get(), true);
-	return per_cpu(samples.khz, cpu);
-}
-
-void arch_freq_prepare_all(void)
-{
-	ktime_t now = ktime_get();
-	bool wait = false;
-	int cpu;
-
-	if (!cpu_khz)
-		return;
-
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return;
-
-	for_each_online_cpu(cpu) {
-		if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
-			continue;
-		if (rcu_is_idle_cpu(cpu))
-			continue; /* Idle CPUs are completely uninteresting. */
-		if (!aperfmperf_snapshot_cpu(cpu, now, false))
-			wait = true;
-	}
-
-	if (wait)
-		msleep(APERFMPERF_REFRESH_DELAY_MS);
-}
-
 unsigned int arch_freq_get_on_cpu(int cpu)
 {
 	struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
@@ -529,6 +486,40 @@ void arch_scale_freq_tick(void)
 	scale_freq_tick(acnt, mcnt);
 }
 
+/*
+ * Discard samples older than the defined maximum sample age of 20ms. There
+ * is no point in sending IPIs in such a case. If the scheduler tick was
+ * not running then the CPU is either idle or isolated.
+ */
+#define MAX_SAMPLE_AGE	((unsigned long)HZ / 50)
+
+unsigned int aperfmperf_get_khz(int cpu)
+{
+	struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
+	unsigned long last;
+	unsigned int seq;
+	u64 acnt, mcnt;
+
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		return 0;
+
+	do {
+		seq = raw_read_seqcount_begin(&s->seq);
+		last = s->last_update;
+		acnt = s->acnt;
+		mcnt = s->mcnt;
+	} while (read_seqcount_retry(&s->seq, seq));
+
+	/*
+	 * Bail on invalid count and when the last update was too long ago,
+	 * which covers idle and NOHZ full CPUs.
+	 */
+	if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
+		return 0;
+
+	return div64_u64((cpu_khz * acnt), mcnt);
+}
+
 static int __init bp_init_aperfmperf(void)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
--- a/fs/proc/cpuinfo.c
+++ b/fs/proc/cpuinfo.c
@@ -5,14 +5,10 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 
-__weak void arch_freq_prepare_all(void)
-{
-}
-
 extern const struct seq_operations cpuinfo_op;
+
 static int cpuinfo_open(struct inode *inode, struct file *file)
 {
-	arch_freq_prepare_all();
 	return seq_open(file, &cpuinfo_op);
 }
 
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -1199,7 +1199,6 @@ static inline void sched_cpufreq_governo
 			struct cpufreq_governor *old_gov) { }
 #endif
 
-extern void arch_freq_prepare_all(void);
 extern unsigned int arch_freq_get_on_cpu(int cpu);
 
 #ifndef arch_set_freq_scale


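The validity window is HZ independent:

/*
 * MAX_SAMPLE_AGE = HZ / 50 jiffies == 20ms regardless of HZ:
 *   HZ=1000 -> 20 jiffies, HZ=250 -> 5, HZ=100 -> 2.
 * If (jiffies - last) exceeds that, the tick has not sampled this CPU
 * recently, i.e. it is idle or NOHZ full, and 0 is returned.
 */
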
^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 10/10] x86/aperfmperf: Replace arch_freq_get_on_cpu()
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
                   ` (8 preceding siblings ...)
  2022-04-15 19:20 ` [patch 09/10] x86/aperfmperf: Replace aperfmperf_get_khz() Thomas Gleixner
@ 2022-04-15 19:20 ` Thomas Gleixner
  2022-04-19 16:37   ` Rafael J. Wysocki
                     ` (2 more replies)
  2022-04-19 15:51 ` [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Eric Dumazet
                   ` (3 subsequent siblings)
  13 siblings, 3 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-15 19:20 UTC (permalink / raw)
  To: LKML; +Cc: x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney

Reading the current CPU frequency from /sys/..../scaling_cur_freq involves,
in the worst case, two IPIs due to the ad hoc sampling.

The frequency invariance infrastructure provides the APERF/MPERF samples
already. Utilize them and consolidate this with the /proc/cpuinfo readout.

The sample is considered valid for 20ms. So for idle or isolated NOHZ full
CPUs the function returns 0, which matches the previous behaviour.

The resulting text size vs. the original APERF/MPERF plus the separate
frequency invariance code:

  text:		2411	->   723
  init.text:	   0	->   767

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/aperfmperf.c |   94 ---------------------------------------
 arch/x86/kernel/cpu/proc.c       |    2 
 2 files changed, 2 insertions(+), 94 deletions(-)

--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -35,98 +35,6 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(str
 	.seq = SEQCNT_ZERO(cpu_samples.seq)
 };
 
-struct aperfmperf_sample {
-	unsigned int	khz;
-	atomic_t	scfpending;
-	ktime_t	time;
-	u64	aperf;
-	u64	mperf;
-};
-
-static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
-
-#define APERFMPERF_CACHE_THRESHOLD_MS	10
-#define APERFMPERF_REFRESH_DELAY_MS	10
-#define APERFMPERF_STALE_THRESHOLD_MS	1000
-
-/*
- * aperfmperf_snapshot_khz()
- * On the current CPU, snapshot APERF, MPERF, and jiffies
- * unless we already did it within 10ms
- * calculate kHz, save snapshot
- */
-static void aperfmperf_snapshot_khz(void *dummy)
-{
-	u64 aperf, aperf_delta;
-	u64 mperf, mperf_delta;
-	struct aperfmperf_sample *s = this_cpu_ptr(&samples);
-	unsigned long flags;
-
-	local_irq_save(flags);
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-	local_irq_restore(flags);
-
-	aperf_delta = aperf - s->aperf;
-	mperf_delta = mperf - s->mperf;
-
-	/*
-	 * There is no architectural guarantee that MPERF
-	 * increments faster than we can read it.
-	 */
-	if (mperf_delta == 0)
-		return;
-
-	s->time = ktime_get();
-	s->aperf = aperf;
-	s->mperf = mperf;
-	s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
-	atomic_set_release(&s->scfpending, 0);
-}
-
-static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait)
-{
-	s64 time_delta = ktime_ms_delta(now, per_cpu(samples.time, cpu));
-	struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
-
-	/* Don't bother re-computing within the cache threshold time. */
-	if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
-		return true;
-
-	if (!atomic_xchg(&s->scfpending, 1) || wait)
-		smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait);
-
-	/* Return false if the previous iteration was too long ago. */
-	return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
-}
-
-unsigned int arch_freq_get_on_cpu(int cpu)
-{
-	struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
-
-	if (!cpu_khz)
-		return 0;
-
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return 0;
-
-	if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
-		return 0;
-
-	if (rcu_is_idle_cpu(cpu))
-		return 0;
-
-	if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
-		return per_cpu(samples.khz, cpu);
-
-	msleep(APERFMPERF_REFRESH_DELAY_MS);
-	atomic_set(&s->scfpending, 1);
-	smp_mb(); /* ->scfpending before smp_call_function_single(). */
-	smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
-
-	return per_cpu(samples.khz, cpu);
-}
-
 static void init_counter_refs(void)
 {
 	u64 aperf, mperf;
@@ -493,7 +401,7 @@ void arch_scale_freq_tick(void)
  */
 #define MAX_SAMPLE_AGE	((unsigned long)HZ / 50)
 
-unsigned int aperfmperf_get_khz(int cpu)
+unsigned int arch_freq_get_on_cpu(int cpu)
 {
 	struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
 	unsigned long last;
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -84,7 +84,7 @@ static int show_cpuinfo(struct seq_file
 		seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
 
 	if (cpu_has(c, X86_FEATURE_TSC)) {
-		unsigned int freq = aperfmperf_get_khz(cpu);
+		unsigned int freq = arch_freq_get_on_cpu(cpu);
 
 		if (!freq)
 			freq = cpufreq_quick_get(cpu);
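
As a usage illustration (a hypothetical userspace snippet, not part of the
patch): a read of scaling_cur_freq like the one below ends up in
arch_freq_get_on_cpu() and is now served from the tick samples without IPIs:

	#include <stdio.h>

	int main(void)
	{
		unsigned int khz = 0;
		FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");

		if (!f)
			return 1;
		if (fscanf(f, "%u", &khz) == 1)
			printf("cpu0: %u kHz\n", khz);
		fclose(f);
		return 0;
	}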


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 01/10] x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu()
  2022-04-15 19:19 ` [patch 01/10] x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu() Thomas Gleixner
@ 2022-04-19 15:34   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2022-04-19 15:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, Linux PM,
	Eric Dumazet, Paul E. McKenney

On Fri, Apr 15, 2022 at 9:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> aperfmperf_get_khz() already excludes idle CPUs from APERF/MPERF sampling
> and that's a reasonable decision. There is no point in sending up to two
> IPIs to an idle CPU just because someone reads a sysfs file.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/x86/kernel/cpu/aperfmperf.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -139,6 +139,9 @@ unsigned int arch_freq_get_on_cpu(int cp
>         if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
>                 return 0;
>
> +       if (rcu_is_idle_cpu(cpu))
> +               return 0;
> +
>         if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
>                 return per_cpu(samples.khz, cpu);
>
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 02/10] x86/smp: Move APERF/MPERF code where it belongs
  2022-04-15 19:19 ` [patch 02/10] x86/smp: Move APERF/MPERF code where it belongs Thomas Gleixner
@ 2022-04-19 15:40   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2022-04-19 15:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, Linux PM,
	Eric Dumazet, Paul E. McKenney

On Fri, Apr 15, 2022 at 9:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> as this can share code with the preexisting APERF/MPERF code.
>
> No functional change.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/x86/kernel/cpu/aperfmperf.c |  366 ++++++++++++++++++++++++++++++++++++++-
>  arch/x86/kernel/smpboot.c        |  355 -------------------------------------
>  2 files changed, 362 insertions(+), 359 deletions(-)
>
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -6,15 +6,19 @@
>   * Copyright (C) 2017 Intel Corp.
>   * Author: Len Brown <len.brown@intel.com>
>   */
> -
> +#include <linux/cpufreq.h>
>  #include <linux/delay.h>
>  #include <linux/ktime.h>
>  #include <linux/math64.h>
>  #include <linux/percpu.h>
> -#include <linux/cpufreq.h>
> -#include <linux/smp.h>
> -#include <linux/sched/isolation.h>
>  #include <linux/rcupdate.h>
> +#include <linux/sched/isolation.h>
> +#include <linux/sched/topology.h>
> +#include <linux/smp.h>
> +#include <linux/syscore_ops.h>
> +
> +#include <asm/cpu_device_id.h>
> +#include <asm/intel-family.h>
>
>  #include "cpu.h"
>
> @@ -152,3 +156,357 @@ unsigned int arch_freq_get_on_cpu(int cp
>
>         return per_cpu(samples.khz, cpu);
>  }
> +
> +#if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
> +/*
> + * APERF/MPERF frequency ratio computation.
> + *
> + * The scheduler wants to do frequency invariant accounting and needs a <1
> + * ratio to account for the 'current' frequency, corresponding to
> + * freq_curr / freq_max.
> + *
> + * Since the frequency freq_curr on x86 is controlled by micro-controller and
> + * our P-state setting is little more than a request/hint, we need to observe
> + * the effective frequency 'BusyMHz', i.e. the average frequency over a time
> + * interval after discarding idle time. This is given by:
> + *
> + *   BusyMHz = delta_APERF / delta_MPERF * freq_base
> + *
> + * where freq_base is the max non-turbo P-state.
> + *
> + * The freq_max term has to be set to a somewhat arbitrary value, because we
> + * can't know which turbo states will be available at a given point in time:
> + * it all depends on the thermal headroom of the entire package. We set it to
> + * the turbo level with 4 cores active.
> + *
> + * Benchmarks show that's a good compromise between the 1C turbo ratio
> + * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
> + * which would ignore the entire turbo range (a conspicuous part, making
> + * freq_curr/freq_max always maxed out).
> + *
> + * An exception to the heuristic above is the Atom uarch, where we choose the
> + * highest turbo level for freq_max since Atom's are generally oriented towards
> + * power efficiency.
> + *
> + * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
> + * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
> + */
> +
> +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
> +
> +static DEFINE_PER_CPU(u64, arch_prev_aperf);
> +static DEFINE_PER_CPU(u64, arch_prev_mperf);
> +static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
> +static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
> +
> +void arch_set_max_freq_ratio(bool turbo_disabled)
> +{
> +       arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
> +                                       arch_turbo_freq_ratio;
> +}
> +EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
> +
> +static bool turbo_disabled(void)
> +{
> +       u64 misc_en;
> +       int err;
> +
> +       err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
> +       if (err)
> +               return false;
> +
> +       return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
> +}
> +
> +static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
> +{
> +       int err;
> +
> +       err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
> +       if (err)
> +               return false;
> +
> +       err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
> +       if (err)
> +               return false;
> +
> +       *base_freq = (*base_freq >> 16) & 0x3F;     /* max P state */
> +       *turbo_freq = *turbo_freq & 0x3F;           /* 1C turbo    */
> +
> +       return true;
> +}
> +
> +#define X86_MATCH(model)                                       \
> +       X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6,            \
> +               INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
> +
> +static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
> +       X86_MATCH(XEON_PHI_KNL),
> +       X86_MATCH(XEON_PHI_KNM),
> +       {}
> +};
> +
> +static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
> +       X86_MATCH(SKYLAKE_X),
> +       {}
> +};
> +
> +static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
> +       X86_MATCH(ATOM_GOLDMONT),
> +       X86_MATCH(ATOM_GOLDMONT_D),
> +       X86_MATCH(ATOM_GOLDMONT_PLUS),
> +       {}
> +};
> +
> +static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
> +                               int num_delta_fratio)
> +{
> +       int fratio, delta_fratio, found;
> +       int err, i;
> +       u64 msr;
> +
> +       err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> +       if (err)
> +               return false;
> +
> +       *base_freq = (*base_freq >> 8) & 0xFF;      /* max P state */
> +
> +       err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
> +       if (err)
> +               return false;
> +
> +       fratio = (msr >> 8) & 0xFF;
> +       i = 16;
> +       found = 0;
> +       do {
> +               if (found >= num_delta_fratio) {
> +                       *turbo_freq = fratio;
> +                       return true;
> +               }
> +
> +               delta_fratio = (msr >> (i + 5)) & 0x7;
> +
> +               if (delta_fratio) {
> +                       found += 1;
> +                       fratio -= delta_fratio;
> +               }
> +
> +               i += 8;
> +       } while (i < 64);
> +
> +       return true;
> +}
> +
> +static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
> +{
> +       u64 ratios, counts;
> +       u32 group_size;
> +       int err, i;
> +
> +       err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> +       if (err)
> +               return false;
> +
> +       *base_freq = (*base_freq >> 8) & 0xFF;      /* max P state */
> +
> +       err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
> +       if (err)
> +               return false;
> +
> +       err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
> +       if (err)
> +               return false;
> +
> +       for (i = 0; i < 64; i += 8) {
> +               group_size = (counts >> i) & 0xFF;
> +               if (group_size >= size) {
> +                       *turbo_freq = (ratios >> i) & 0xFF;
> +                       return true;
> +               }
> +       }
> +
> +       return false;
> +}
> +
> +static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
> +{
> +       u64 msr;
> +       int err;
> +
> +       err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> +       if (err)
> +               return false;
> +
> +       err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
> +       if (err)
> +               return false;
> +
> +       *base_freq = (*base_freq >> 8) & 0xFF;    /* max P state */
> +       *turbo_freq = (msr >> 24) & 0xFF;         /* 4C turbo    */
> +
> +       /* The CPU may have less than 4 cores */
> +       if (!*turbo_freq)
> +               *turbo_freq = msr & 0xFF;         /* 1C turbo    */
> +
> +       return true;
> +}
> +
> +static bool intel_set_max_freq_ratio(void)
> +{
> +       u64 base_freq, turbo_freq;
> +       u64 turbo_ratio;
> +
> +       if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
> +               goto out;
> +
> +       if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
> +           skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
> +               goto out;
> +
> +       if (x86_match_cpu(has_knl_turbo_ratio_limits) &&
> +           knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
> +               goto out;
> +
> +       if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
> +           skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
> +               goto out;
> +
> +       if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
> +               goto out;
> +
> +       return false;
> +
> +out:
> +       /*
> +        * Some hypervisors advertise X86_FEATURE_APERFMPERF
> +        * but then fill all MSR's with zeroes.
> +        * Some CPUs have turbo boost but don't declare any turbo ratio
> +        * in MSR_TURBO_RATIO_LIMIT.
> +        */
> +       if (!base_freq || !turbo_freq) {
> +               pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
> +               return false;
> +       }
> +
> +       turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
> +       if (!turbo_ratio) {
> +               pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
> +               return false;
> +       }
> +
> +       arch_turbo_freq_ratio = turbo_ratio;
> +       arch_set_max_freq_ratio(turbo_disabled());
> +
> +       return true;
> +}
> +
> +static void init_counter_refs(void)
> +{
> +       u64 aperf, mperf;
> +
> +       rdmsrl(MSR_IA32_APERF, aperf);
> +       rdmsrl(MSR_IA32_MPERF, mperf);
> +
> +       this_cpu_write(arch_prev_aperf, aperf);
> +       this_cpu_write(arch_prev_mperf, mperf);
> +}
> +
> +#ifdef CONFIG_PM_SLEEP
> +static struct syscore_ops freq_invariance_syscore_ops = {
> +       .resume = init_counter_refs,
> +};
> +
> +static void register_freq_invariance_syscore_ops(void)
> +{
> +       /* Bail out if registered already. */
> +       if (freq_invariance_syscore_ops.node.prev)
> +               return;
> +
> +       register_syscore_ops(&freq_invariance_syscore_ops);
> +}
> +#else
> +static inline void register_freq_invariance_syscore_ops(void) {}
> +#endif
> +
> +void init_freq_invariance(bool secondary, bool cppc_ready)
> +{
> +       bool ret = false;
> +
> +       if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> +               return;
> +
> +       if (secondary) {
> +               if (static_branch_likely(&arch_scale_freq_key)) {
> +                       init_counter_refs();
> +               }
> +               return;
> +       }
> +
> +       if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> +               ret = intel_set_max_freq_ratio();
> +       else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
> +               if (!cppc_ready) {
> +                       return;
> +               }
> +               ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
> +       }
> +
> +       if (ret) {
> +               init_counter_refs();
> +               static_branch_enable(&arch_scale_freq_key);
> +               register_freq_invariance_syscore_ops();
> +               pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
> +       } else {
> +               pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
> +       }
> +}
> +
> +static void disable_freq_invariance_workfn(struct work_struct *work)
> +{
> +       static_branch_disable(&arch_scale_freq_key);
> +}
> +
> +static DECLARE_WORK(disable_freq_invariance_work,
> +                   disable_freq_invariance_workfn);
> +
> +DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
> +
> +void arch_scale_freq_tick(void)
> +{
> +       u64 freq_scale;
> +       u64 aperf, mperf;
> +       u64 acnt, mcnt;
> +
> +       if (!arch_scale_freq_invariant())
> +               return;
> +
> +       rdmsrl(MSR_IA32_APERF, aperf);
> +       rdmsrl(MSR_IA32_MPERF, mperf);
> +
> +       acnt = aperf - this_cpu_read(arch_prev_aperf);
> +       mcnt = mperf - this_cpu_read(arch_prev_mperf);
> +
> +       this_cpu_write(arch_prev_aperf, aperf);
> +       this_cpu_write(arch_prev_mperf, mperf);
> +
> +       if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
> +               goto error;
> +
> +       if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
> +               goto error;
> +
> +       freq_scale = div64_u64(acnt, mcnt);
> +       if (!freq_scale)
> +               goto error;
> +
> +       if (freq_scale > SCHED_CAPACITY_SCALE)
> +               freq_scale = SCHED_CAPACITY_SCALE;
> +
> +       this_cpu_write(arch_freq_scale, freq_scale);
> +       return;
> +
> +error:
> +       pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
> +       schedule_work(&disable_freq_invariance_work);
> +}
> +#endif /* CONFIG_X86_64 && CONFIG_SMP */
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -56,7 +56,6 @@
>  #include <linux/numa.h>
>  #include <linux/pgtable.h>
>  #include <linux/overflow.h>
> -#include <linux/syscore_ops.h>
>
>  #include <asm/acpi.h>
>  #include <asm/desc.h>
> @@ -1847,357 +1846,3 @@ void native_play_dead(void)
>  }
>
>  #endif
> -
> -#ifdef CONFIG_X86_64
> -/*
> - * APERF/MPERF frequency ratio computation.
> - *
> - * The scheduler wants to do frequency invariant accounting and needs a <1
> - * ratio to account for the 'current' frequency, corresponding to
> - * freq_curr / freq_max.
> - *
> - * Since the frequency freq_curr on x86 is controlled by micro-controller and
> - * our P-state setting is little more than a request/hint, we need to observe
> - * the effective frequency 'BusyMHz', i.e. the average frequency over a time
> - * interval after discarding idle time. This is given by:
> - *
> - *   BusyMHz = delta_APERF / delta_MPERF * freq_base
> - *
> - * where freq_base is the max non-turbo P-state.
> - *
> - * The freq_max term has to be set to a somewhat arbitrary value, because we
> - * can't know which turbo states will be available at a given point in time:
> - * it all depends on the thermal headroom of the entire package. We set it to
> - * the turbo level with 4 cores active.
> - *
> - * Benchmarks show that's a good compromise between the 1C turbo ratio
> - * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
> - * which would ignore the entire turbo range (a conspicuous part, making
> - * freq_curr/freq_max always maxed out).
> - *
> - * An exception to the heuristic above is the Atom uarch, where we choose the
> - * highest turbo level for freq_max since Atom's are generally oriented towards
> - * power efficiency.
> - *
> - * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
> - * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
> - */
> -
> -DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
> -
> -static DEFINE_PER_CPU(u64, arch_prev_aperf);
> -static DEFINE_PER_CPU(u64, arch_prev_mperf);
> -static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
> -static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
> -
> -void arch_set_max_freq_ratio(bool turbo_disabled)
> -{
> -       arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
> -                                       arch_turbo_freq_ratio;
> -}
> -EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
> -
> -static bool turbo_disabled(void)
> -{
> -       u64 misc_en;
> -       int err;
> -
> -       err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
> -       if (err)
> -               return false;
> -
> -       return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
> -}
> -
> -static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
> -{
> -       int err;
> -
> -       err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
> -       if (err)
> -               return false;
> -
> -       err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
> -       if (err)
> -               return false;
> -
> -       *base_freq = (*base_freq >> 16) & 0x3F;     /* max P state */
> -       *turbo_freq = *turbo_freq & 0x3F;           /* 1C turbo    */
> -
> -       return true;
> -}
> -
> -#define X86_MATCH(model)                                       \
> -       X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6,            \
> -               INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
> -
> -static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
> -       X86_MATCH(XEON_PHI_KNL),
> -       X86_MATCH(XEON_PHI_KNM),
> -       {}
> -};
> -
> -static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
> -       X86_MATCH(SKYLAKE_X),
> -       {}
> -};
> -
> -static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
> -       X86_MATCH(ATOM_GOLDMONT),
> -       X86_MATCH(ATOM_GOLDMONT_D),
> -       X86_MATCH(ATOM_GOLDMONT_PLUS),
> -       {}
> -};
> -
> -static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
> -                               int num_delta_fratio)
> -{
> -       int fratio, delta_fratio, found;
> -       int err, i;
> -       u64 msr;
> -
> -       err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> -       if (err)
> -               return false;
> -
> -       *base_freq = (*base_freq >> 8) & 0xFF;      /* max P state */
> -
> -       err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
> -       if (err)
> -               return false;
> -
> -       fratio = (msr >> 8) & 0xFF;
> -       i = 16;
> -       found = 0;
> -       do {
> -               if (found >= num_delta_fratio) {
> -                       *turbo_freq = fratio;
> -                       return true;
> -               }
> -
> -               delta_fratio = (msr >> (i + 5)) & 0x7;
> -
> -               if (delta_fratio) {
> -                       found += 1;
> -                       fratio -= delta_fratio;
> -               }
> -
> -               i += 8;
> -       } while (i < 64);
> -
> -       return true;
> -}
> -
> -static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
> -{
> -       u64 ratios, counts;
> -       u32 group_size;
> -       int err, i;
> -
> -       err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> -       if (err)
> -               return false;
> -
> -       *base_freq = (*base_freq >> 8) & 0xFF;      /* max P state */
> -
> -       err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
> -       if (err)
> -               return false;
> -
> -       err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
> -       if (err)
> -               return false;
> -
> -       for (i = 0; i < 64; i += 8) {
> -               group_size = (counts >> i) & 0xFF;
> -               if (group_size >= size) {
> -                       *turbo_freq = (ratios >> i) & 0xFF;
> -                       return true;
> -               }
> -       }
> -
> -       return false;
> -}
> -
> -static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
> -{
> -       u64 msr;
> -       int err;
> -
> -       err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> -       if (err)
> -               return false;
> -
> -       err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
> -       if (err)
> -               return false;
> -
> -       *base_freq = (*base_freq >> 8) & 0xFF;    /* max P state */
> -       *turbo_freq = (msr >> 24) & 0xFF;         /* 4C turbo    */
> -
> -       /* The CPU may have less than 4 cores */
> -       if (!*turbo_freq)
> -               *turbo_freq = msr & 0xFF;         /* 1C turbo    */
> -
> -       return true;
> -}
> -
> -static bool intel_set_max_freq_ratio(void)
> -{
> -       u64 base_freq, turbo_freq;
> -       u64 turbo_ratio;
> -
> -       if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
> -               goto out;
> -
> -       if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
> -           skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
> -               goto out;
> -
> -       if (x86_match_cpu(has_knl_turbo_ratio_limits) &&
> -           knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
> -               goto out;
> -
> -       if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
> -           skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
> -               goto out;
> -
> -       if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
> -               goto out;
> -
> -       return false;
> -
> -out:
> -       /*
> -        * Some hypervisors advertise X86_FEATURE_APERFMPERF
> -        * but then fill all MSR's with zeroes.
> -        * Some CPUs have turbo boost but don't declare any turbo ratio
> -        * in MSR_TURBO_RATIO_LIMIT.
> -        */
> -       if (!base_freq || !turbo_freq) {
> -               pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
> -               return false;
> -       }
> -
> -       turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
> -       if (!turbo_ratio) {
> -               pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
> -               return false;
> -       }
> -
> -       arch_turbo_freq_ratio = turbo_ratio;
> -       arch_set_max_freq_ratio(turbo_disabled());
> -
> -       return true;
> -}
> -
> -static void init_counter_refs(void)
> -{
> -       u64 aperf, mperf;
> -
> -       rdmsrl(MSR_IA32_APERF, aperf);
> -       rdmsrl(MSR_IA32_MPERF, mperf);
> -
> -       this_cpu_write(arch_prev_aperf, aperf);
> -       this_cpu_write(arch_prev_mperf, mperf);
> -}
> -
> -#ifdef CONFIG_PM_SLEEP
> -static struct syscore_ops freq_invariance_syscore_ops = {
> -       .resume = init_counter_refs,
> -};
> -
> -static void register_freq_invariance_syscore_ops(void)
> -{
> -       /* Bail out if registered already. */
> -       if (freq_invariance_syscore_ops.node.prev)
> -               return;
> -
> -       register_syscore_ops(&freq_invariance_syscore_ops);
> -}
> -#else
> -static inline void register_freq_invariance_syscore_ops(void) {}
> -#endif
> -
> -void init_freq_invariance(bool secondary, bool cppc_ready)
> -{
> -       bool ret = false;
> -
> -       if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> -               return;
> -
> -       if (secondary) {
> -               if (static_branch_likely(&arch_scale_freq_key)) {
> -                       init_counter_refs();
> -               }
> -               return;
> -       }
> -
> -       if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> -               ret = intel_set_max_freq_ratio();
> -       else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
> -               if (!cppc_ready) {
> -                       return;
> -               }
> -               ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
> -       }
> -
> -       if (ret) {
> -               init_counter_refs();
> -               static_branch_enable(&arch_scale_freq_key);
> -               register_freq_invariance_syscore_ops();
> -               pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
> -       } else {
> -               pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
> -       }
> -}
> -
> -static void disable_freq_invariance_workfn(struct work_struct *work)
> -{
> -       static_branch_disable(&arch_scale_freq_key);
> -}
> -
> -static DECLARE_WORK(disable_freq_invariance_work,
> -                   disable_freq_invariance_workfn);
> -
> -DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
> -
> -void arch_scale_freq_tick(void)
> -{
> -       u64 freq_scale;
> -       u64 aperf, mperf;
> -       u64 acnt, mcnt;
> -
> -       if (!arch_scale_freq_invariant())
> -               return;
> -
> -       rdmsrl(MSR_IA32_APERF, aperf);
> -       rdmsrl(MSR_IA32_MPERF, mperf);
> -
> -       acnt = aperf - this_cpu_read(arch_prev_aperf);
> -       mcnt = mperf - this_cpu_read(arch_prev_mperf);
> -
> -       this_cpu_write(arch_prev_aperf, aperf);
> -       this_cpu_write(arch_prev_mperf, mperf);
> -
> -       if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
> -               goto error;
> -
> -       if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
> -               goto error;
> -
> -       freq_scale = div64_u64(acnt, mcnt);
> -       if (!freq_scale)
> -               goto error;
> -
> -       if (freq_scale > SCHED_CAPACITY_SCALE)
> -               freq_scale = SCHED_CAPACITY_SCALE;
> -
> -       this_cpu_write(arch_freq_scale, freq_scale);
> -       return;
> -
> -error:
> -       pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
> -       schedule_work(&disable_freq_invariance_work);
> -}
> -#endif /* CONFIG_X86_64 */
>
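
As a sanity check of the scale math quoted above, with illustrative numbers:
arch_max_freq_ratio = 1536 (3.0 GHz 4C turbo over a 2.0 GHz base, times
1024) and a tick window with acnt = 2500000, mcnt = 2000000 yields
freq_scale = (2500000 << 20) / (2000000 * 1536) = 853, which matches
1024 * 2.5 / 3.0.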

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
                   ` (9 preceding siblings ...)
  2022-04-15 19:20 ` [patch 10/10] x86/aperfmperf: Replace arch_freq_get_on_cpu() Thomas Gleixner
@ 2022-04-19 15:51 ` Eric Dumazet
  2022-04-19 20:39   ` Thomas Gleixner
  2022-04-19 16:41 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2022-04-19 15:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, linux-pm,
	Paul E. McKenney

On Fri, Apr 15, 2022 at 12:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> APERF/MPERF is utilized in two ways:
>
>   1) Ad hoc readout of CPU frequency which requires IPIs
>
>   2) Frequency scale calculation for frequency invariant scheduling which
>      reads APERF/MPERF on every tick.
>
> These are completely independent code parts. Eric observed long latencies
> when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> proposed to replace the per CPU single IPI with a broadcast IPI.
>
> While this makes the latency smaller, it is not necessary at all because #2
> samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> which are excluded from IPI already.
>
> It could be argued that not all APERF/MPERF capable systems have the
> required BIOS information to enable frequency invariance support, but in
> practice most of them do. So the APERF/MPERF sampling can be made
> unconditional and just the frequency scale calculation for the scheduler
> excluded.
>
> The following series consolidates that.
>

Thanks a lot for working on that, Thomas.

I am not sure I will be able to backport this to a Google prodkernel,
as I guess there will be many merge conflicts.

Do you by any chance have this work available in a git branch?

Thanks.



> Thanks,
>
>         tglx
> ---
>  arch/x86/include/asm/cpu.h       |    2
>  arch/x86/include/asm/topology.h  |   17 -
>  arch/x86/kernel/acpi/cppc.c      |   28 --
>  arch/x86/kernel/cpu/aperfmperf.c |  474 +++++++++++++++++++++++++++++++--------
>  arch/x86/kernel/cpu/proc.c       |    2
>  arch/x86/kernel/smpboot.c        |  358 -----------------------------
>  fs/proc/cpuinfo.c                |    6
>  include/linux/cpufreq.h          |    1
>  8 files changed, 405 insertions(+), 483 deletions(-)
>
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 03/10] x86/aperfmperf: Separate AP/BP frequency invariance init
  2022-04-15 19:19 ` [patch 03/10] x86/aperfmperf: Separate AP/BP frequency invariance init Thomas Gleixner
@ 2022-04-19 16:04   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2022-04-19 16:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, Linux PM,
	Eric Dumazet, Paul E. McKenney

On Fri, Apr 15, 2022 at 9:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> This code is convoluted and because it can be invoked post init via the
> ACPI/CPPC code, all of the initialization functionality is built in instead
> of being part of init text and init data.
>
> As a first step create separate calls for the boot and the application
> processors.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/include/asm/topology.h  |   12 +++++-------
>  arch/x86/kernel/acpi/cppc.c      |    3 ++-
>  arch/x86/kernel/cpu/aperfmperf.c |   23 +++++++++++------------
>  arch/x86/kernel/smpboot.c        |    4 ++--
>  4 files changed, 20 insertions(+), 22 deletions(-)
>
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -216,14 +216,12 @@ extern void arch_scale_freq_tick(void);
>  #define arch_scale_freq_tick arch_scale_freq_tick
>
>  extern void arch_set_max_freq_ratio(bool turbo_disabled);
> -void init_freq_invariance(bool secondary, bool cppc_ready);
> +extern void bp_init_freq_invariance(bool cppc_ready);
> +extern void ap_init_freq_invariance(void);
>  #else
> -static inline void arch_set_max_freq_ratio(bool turbo_disabled)
> -{
> -}
> -static inline void init_freq_invariance(bool secondary, bool cppc_ready)
> -{
> -}
> +static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
> +static inline void bp_init_freq_invariance(bool cppc_ready) { }
> +static inline void ap_init_freq_invariance(void) { }
>  #endif
>
>  #ifdef CONFIG_ACPI_CPPC_LIB
> --- a/arch/x86/kernel/acpi/cppc.c
> +++ b/arch/x86/kernel/acpi/cppc.c
> @@ -96,7 +96,8 @@ void init_freq_invariance_cppc(void)
>
>         mutex_lock(&freq_invariance_lock);
>
> -       init_freq_invariance(secondary, true);
> +       if (!secondary)
> +               bp_init_freq_invariance(true);
>         secondary = true;
>
>         mutex_unlock(&freq_invariance_lock);
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -428,31 +428,24 @@ static void register_freq_invariance_sys
>  static inline void register_freq_invariance_syscore_ops(void) {}
>  #endif
>
> -void init_freq_invariance(bool secondary, bool cppc_ready)
> +void bp_init_freq_invariance(bool cppc_ready)
>  {
> -       bool ret = false;
> +       bool ret;
>
> -       if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> +       if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
>                 return;
>
> -       if (secondary) {
> -               if (static_branch_likely(&arch_scale_freq_key)) {
> -                       init_counter_refs();
> -               }
> -               return;
> -       }
> +       init_counter_refs();
>
>         if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
>                 ret = intel_set_max_freq_ratio();
>         else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
> -               if (!cppc_ready) {
> +               if (!cppc_ready)
>                         return;
> -               }
>                 ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
>         }
>
>         if (ret) {
> -               init_counter_refs();
>                 static_branch_enable(&arch_scale_freq_key);
>                 register_freq_invariance_syscore_ops();
>                 pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
> @@ -461,6 +454,12 @@ void init_freq_invariance(bool secondary
>         }
>  }
>
> +void ap_init_freq_invariance(void)
> +{
> +       if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> +               init_counter_refs();

This doesn't check arch_scale_freq_key now, which may be a good thing
to mention in the changelog.
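
That is, side by side (condensed from the hunks above):

	/* before: APs snapshotted only with invariance enabled */
	if (static_branch_likely(&arch_scale_freq_key))
		init_counter_refs();

	/* after: APs snapshot whenever APERF/MPERF is present */
	if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
		init_counter_refs();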

I don't see anything questionable in the patch, though, so

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> +}
> +
>  static void disable_freq_invariance_workfn(struct work_struct *work)
>  {
>         static_branch_disable(&arch_scale_freq_key);
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -186,7 +186,7 @@ static void smp_callin(void)
>          */
>         set_cpu_sibling_map(raw_smp_processor_id());
>
> -       init_freq_invariance(true, false);
> +       ap_init_freq_invariance();
>
>         /*
>          * Get our bogomips.
> @@ -1396,7 +1396,7 @@ void __init native_smp_prepare_cpus(unsi
>  {
>         smp_prepare_cpus_common();
>
> -       init_freq_invariance(false, false);
> +       bp_init_freq_invariance(false);
>         smp_sanity_check();
>
>         switch (apic_intr_mode) {
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 04/10] x86/aperfmperf: Untangle Intel and AMD frequency invariance init
  2022-04-15 19:19 ` [patch 04/10] x86/aperfmperf: Untangle Intel and AMD " Thomas Gleixner
@ 2022-04-19 16:12   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2022-04-19 16:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, Linux PM,
	Eric Dumazet, Paul E. McKenney

On Fri, Apr 15, 2022 at 9:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> AMD boot CPU initialization happens late via ACPI/CPPC which prevents the
> Intel parts from being marked __init.
>
> Split out the common code and provide a dedicated interface for the AMD
> initialization and mark the Intel specific code and data __init.
>
> The remaining text size is almost cut in half:
>
>   text:         2614    ->      1350
>   init.text:       0    ->       786
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

All good AFAICS:

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/x86/include/asm/topology.h  |   13 ++------
>  arch/x86/kernel/acpi/cppc.c      |   29 +++++++-----------
>  arch/x86/kernel/cpu/aperfmperf.c |   62 ++++++++++++++++++++-------------------
>  arch/x86/kernel/smpboot.c        |    2 -
>  4 files changed, 49 insertions(+), 57 deletions(-)
>
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -216,24 +216,19 @@ extern void arch_scale_freq_tick(void);
>  #define arch_scale_freq_tick arch_scale_freq_tick
>
>  extern void arch_set_max_freq_ratio(bool turbo_disabled);
> -extern void bp_init_freq_invariance(bool cppc_ready);
> +extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
> +extern void bp_init_freq_invariance(void);
>  extern void ap_init_freq_invariance(void);
>  #else
>  static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
> -static inline void bp_init_freq_invariance(bool cppc_ready) { }
> +static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
> +static inline void bp_init_freq_invariance(void) { }
>  static inline void ap_init_freq_invariance(void) { }
>  #endif
>
>  #ifdef CONFIG_ACPI_CPPC_LIB
>  void init_freq_invariance_cppc(void);
>  #define arch_init_invariance_cppc init_freq_invariance_cppc
> -
> -bool amd_set_max_freq_ratio(u64 *ratio);
> -#else
> -static inline bool amd_set_max_freq_ratio(u64 *ratio)
> -{
> -       return false;
> -}
>  #endif
>
>  #endif /* _ASM_X86_TOPOLOGY_H */
> --- a/arch/x86/kernel/acpi/cppc.c
> +++ b/arch/x86/kernel/acpi/cppc.c
> @@ -50,20 +50,17 @@ int cpc_write_ffh(int cpunum, struct cpc
>         return err;
>  }
>
> -bool amd_set_max_freq_ratio(u64 *ratio)
> +static void amd_set_max_freq_ratio(void)
>  {
>         struct cppc_perf_caps perf_caps;
>         u64 highest_perf, nominal_perf;
>         u64 perf_ratio;
>         int rc;
>
> -       if (!ratio)
> -               return false;
> -
>         rc = cppc_get_perf_caps(0, &perf_caps);
>         if (rc) {
>                 pr_debug("Could not retrieve perf counters (%d)\n", rc);
> -               return false;
> +               return;
>         }
>
>         highest_perf = amd_get_highest_perf();
> @@ -71,7 +68,7 @@ bool amd_set_max_freq_ratio(u64 *ratio)
>
>         if (!highest_perf || !nominal_perf) {
>                 pr_debug("Could not retrieve highest or nominal performance\n");
> -               return false;
> +               return;
>         }
>
>         perf_ratio = div_u64(highest_perf * SCHED_CAPACITY_SCALE, nominal_perf);
> @@ -79,26 +76,24 @@ bool amd_set_max_freq_ratio(u64 *ratio)
>         perf_ratio = (perf_ratio + SCHED_CAPACITY_SCALE) >> 1;
>         if (!perf_ratio) {
>                 pr_debug("Non-zero highest/nominal perf values led to a 0 ratio\n");
> -               return false;
> +               return;
>         }
>
> -       *ratio = perf_ratio;
> -       arch_set_max_freq_ratio(false);
> -
> -       return true;
> +       freq_invariance_set_perf_ratio(perf_ratio, false);
>  }
>
>  static DEFINE_MUTEX(freq_invariance_lock);
>
>  void init_freq_invariance_cppc(void)
>  {
> -       static bool secondary;
> +       static bool init_done;
>
> -       mutex_lock(&freq_invariance_lock);
> -
> -       if (!secondary)
> -               bp_init_freq_invariance(true);
> -       secondary = true;
> +       if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> +               return;
>
> +       mutex_lock(&freq_invariance_lock);
> +       if (!init_done)
> +               amd_set_max_freq_ratio();
> +       init_done = true;
>         mutex_unlock(&freq_invariance_lock);
>  }
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -206,7 +206,7 @@ void arch_set_max_freq_ratio(bool turbo_
>  }
>  EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
>
> -static bool turbo_disabled(void)
> +static bool __init turbo_disabled(void)
>  {
>         u64 misc_en;
>         int err;
> @@ -218,7 +218,7 @@ static bool turbo_disabled(void)
>         return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
>  }
>
> -static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
> +static bool __init slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
>  {
>         int err;
>
> @@ -240,26 +240,26 @@ static bool slv_set_max_freq_ratio(u64 *
>         X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6,            \
>                 INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
>
> -static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
> +static const struct x86_cpu_id has_knl_turbo_ratio_limits[] __initconst = {
>         X86_MATCH(XEON_PHI_KNL),
>         X86_MATCH(XEON_PHI_KNM),
>         {}
>  };
>
> -static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
> +static const struct x86_cpu_id has_skx_turbo_ratio_limits[] __initconst = {
>         X86_MATCH(SKYLAKE_X),
>         {}
>  };
>
> -static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
> +static const struct x86_cpu_id has_glm_turbo_ratio_limits[] __initconst = {
>         X86_MATCH(ATOM_GOLDMONT),
>         X86_MATCH(ATOM_GOLDMONT_D),
>         X86_MATCH(ATOM_GOLDMONT_PLUS),
>         {}
>  };
>
> -static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
> -                               int num_delta_fratio)
> +static bool __init knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
> +                                         int num_delta_fratio)
>  {
>         int fratio, delta_fratio, found;
>         int err, i;
> @@ -297,7 +297,7 @@ static bool knl_set_max_freq_ratio(u64 *
>         return true;
>  }
>
> -static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
> +static bool __init skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
>  {
>         u64 ratios, counts;
>         u32 group_size;
> @@ -328,7 +328,7 @@ static bool skx_set_max_freq_ratio(u64 *
>         return false;
>  }
>
> -static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
> +static bool __init core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
>  {
>         u64 msr;
>         int err;
> @@ -351,7 +351,7 @@ static bool core_set_max_freq_ratio(u64
>         return true;
>  }
>
> -static bool intel_set_max_freq_ratio(void)
> +static bool __init intel_set_max_freq_ratio(void)
>  {
>         u64 base_freq, turbo_freq;
>         u64 turbo_ratio;
> @@ -418,40 +418,42 @@ static struct syscore_ops freq_invarianc
>
>  static void register_freq_invariance_syscore_ops(void)
>  {
> -       /* Bail out if registered already. */
> -       if (freq_invariance_syscore_ops.node.prev)
> -               return;
> -
>         register_syscore_ops(&freq_invariance_syscore_ops);
>  }
>  #else
>  static inline void register_freq_invariance_syscore_ops(void) {}
>  #endif
>
> -void bp_init_freq_invariance(bool cppc_ready)
> +static void freq_invariance_enable(void)
> +{
> +       if (static_branch_unlikely(&arch_scale_freq_key)) {
> +               WARN_ON_ONCE(1);
> +               return;
> +       }
> +       static_branch_enable(&arch_scale_freq_key);
> +       register_freq_invariance_syscore_ops();
> +       pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
> +}
> +
> +void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled)
>  {
> -       bool ret;
> +       arch_turbo_freq_ratio = ratio;
> +       arch_set_max_freq_ratio(turbo_disabled);
> +       freq_invariance_enable();
> +}
>
> +void __init bp_init_freq_invariance(void)
> +{
>         if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
>                 return;
>
>         init_counter_refs();
>
> -       if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> -               ret = intel_set_max_freq_ratio();
> -       else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
> -               if (!cppc_ready)
> -                       return;
> -               ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
> -       }
> +       if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +               return;
>
> -       if (ret) {
> -               static_branch_enable(&arch_scale_freq_key);
> -               register_freq_invariance_syscore_ops();
> -               pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
> -       } else {
> -               pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
> -       }
> +       if (intel_set_max_freq_ratio())
> +               freq_invariance_enable();
>  }
>
>  void ap_init_freq_invariance(void)
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1396,7 +1396,7 @@ void __init native_smp_prepare_cpus(unsi
>  {
>         smp_prepare_cpus_common();
>
> -       bp_init_freq_invariance(false);
> +       bp_init_freq_invariance();
>         smp_sanity_check();
>
>         switch (apic_intr_mode) {
>
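
For the AMD ratio quoted above, a worked example with illustrative numbers:
highest_perf = 400 and nominal_perf = 280 give
perf_ratio = 400 * 1024 / 280 = 1462, and the midpoint adjustment
(1462 + 1024) >> 1 = 1243, i.e. freq_max ends up halfway between nominal
and boost performance.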

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 05/10] x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct
  2022-04-15 19:19 ` [patch 05/10] x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct Thomas Gleixner
@ 2022-04-19 16:15   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2022-04-19 16:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, Linux PM,
	Eric Dumazet, Paul E. McKenney

On Fri, Apr 15, 2022 at 9:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Preparation for sharing code with the CPU frequency portion of the
> aperf/mperf code.
>
> No functional change.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

All good AFAICS:

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/x86/kernel/cpu/aperfmperf.c |   26 +++++++++++++++-----------
>  1 file changed, 15 insertions(+), 11 deletions(-)
>
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -22,6 +22,13 @@
>
>  #include "cpu.h"
>
> +struct aperfmperf {
> +       u64             aperf;
> +       u64             mperf;
> +};
> +
> +static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples);
> +
>  struct aperfmperf_sample {
>         unsigned int    khz;
>         atomic_t        scfpending;
> @@ -194,8 +201,6 @@ unsigned int arch_freq_get_on_cpu(int cp
>
>  DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
>
> -static DEFINE_PER_CPU(u64, arch_prev_aperf);
> -static DEFINE_PER_CPU(u64, arch_prev_mperf);
>  static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
>  static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
>
> @@ -407,8 +412,8 @@ static void init_counter_refs(void)
>         rdmsrl(MSR_IA32_APERF, aperf);
>         rdmsrl(MSR_IA32_MPERF, mperf);
>
> -       this_cpu_write(arch_prev_aperf, aperf);
> -       this_cpu_write(arch_prev_mperf, mperf);
> +       this_cpu_write(cpu_samples.aperf, aperf);
> +       this_cpu_write(cpu_samples.mperf, mperf);
>  }
>
>  #ifdef CONFIG_PM_SLEEP
> @@ -474,9 +479,8 @@ DEFINE_PER_CPU(unsigned long, arch_freq_
>
>  void arch_scale_freq_tick(void)
>  {
> -       u64 freq_scale;
> -       u64 aperf, mperf;
> -       u64 acnt, mcnt;
> +       struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
> +       u64 aperf, mperf, acnt, mcnt, freq_scale;
>
>         if (!arch_scale_freq_invariant())
>                 return;
> @@ -484,11 +488,11 @@ void arch_scale_freq_tick(void)
>         rdmsrl(MSR_IA32_APERF, aperf);
>         rdmsrl(MSR_IA32_MPERF, mperf);
>
> -       acnt = aperf - this_cpu_read(arch_prev_aperf);
> -       mcnt = mperf - this_cpu_read(arch_prev_mperf);
> +       acnt = aperf - s->aperf;
> +       mcnt = mperf - s->mperf;
>
> -       this_cpu_write(arch_prev_aperf, aperf);
> -       this_cpu_write(arch_prev_mperf, mperf);
> +       s->aperf = aperf;
> +       s->mperf = mperf;
>
>         if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
>                 goto error;
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 06/10] x86/aperfmperf: Restructure arch_scale_freq_tick()
  2022-04-15 19:19 ` [patch 06/10] x86/aperfmperf: Restructure arch_scale_freq_tick() Thomas Gleixner
@ 2022-04-19 16:20   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2022-04-19 16:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, Linux PM,
	Eric Dumazet, Paul E. McKenney

On Fri, Apr 15, 2022 at 9:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Preparation for sharing code with the CPU frequency portion of the
> aperf/mperf code.
>
> No functional change.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

All good AFAICS:

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/x86/kernel/cpu/aperfmperf.c |   36 +++++++++++++++++++++---------------
>  1 file changed, 21 insertions(+), 15 deletions(-)
>
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -477,22 +477,9 @@ static DECLARE_WORK(disable_freq_invaria
>
>  DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
>
> -void arch_scale_freq_tick(void)
> +static void scale_freq_tick(u64 acnt, u64 mcnt)
>  {
> -       struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
> -       u64 aperf, mperf, acnt, mcnt, freq_scale;
> -
> -       if (!arch_scale_freq_invariant())
> -               return;
> -
> -       rdmsrl(MSR_IA32_APERF, aperf);
> -       rdmsrl(MSR_IA32_MPERF, mperf);
> -
> -       acnt = aperf - s->aperf;
> -       mcnt = mperf - s->mperf;
> -
> -       s->aperf = aperf;
> -       s->mperf = mperf;
> +       u64 freq_scale;
>
>         if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
>                 goto error;
> @@ -514,4 +501,23 @@ void arch_scale_freq_tick(void)
>         pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
>         schedule_work(&disable_freq_invariance_work);
>  }
> +
> +void arch_scale_freq_tick(void)
> +{
> +       struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
> +       u64 acnt, mcnt, aperf, mperf;
> +
> +       if (!arch_scale_freq_invariant())
> +               return;
> +
> +       rdmsrl(MSR_IA32_APERF, aperf);
> +       rdmsrl(MSR_IA32_MPERF, mperf);
> +       acnt = aperf - s->aperf;
> +       mcnt = mperf - s->mperf;
> +
> +       s->aperf = aperf;
> +       s->mperf = mperf;
> +
> +       scale_freq_tick(acnt, mcnt);
> +}
>  #endif /* CONFIG_X86_64 && CONFIG_SMP */
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 07/10] x86/aperfmperf: Make parts of the frequency invariance code unconditional
  2022-04-15 19:19 ` [patch 07/10] x86/aperfmperf: Make parts of the frequency invariance code unconditional Thomas Gleixner
@ 2022-04-19 16:27   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2022-04-19 16:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, Linux PM,
	Eric Dumazet, Paul E. McKenney

On Fri, Apr 15, 2022 at 9:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> The frequency invariance support is currently limited to x86/64 and SMP,
> which is the vast majority of machines.
>
> arch_scale_freq_tick() is called every tick on all CPUs and reads the APERF
> and MPERF MSRs. The CPU frequency getters function do the same via dedicated
> IPIs.
>
> While it could be argued that on systems where frequency invariance support
> is disabled (32bit, !SMP) the per tick read of the APERF and MPERF MSRs can
> be avoided, it does not make sense to keep the extra code and the resulting
> runtime issues of mass IPIs around.
>
> As a first step split out the non frequency invariance specific
> initialization code and the read MSR portion of arch_scale_freq_tick(). The
> rest of the code is still conditional and guarded with a static key.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

All good AFAICS:

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/x86/include/asm/cpu.h       |    2 +
>  arch/x86/include/asm/topology.h  |    4 --
>  arch/x86/kernel/cpu/aperfmperf.c |   63 +++++++++++++++++++++++----------------
>  arch/x86/kernel/smpboot.c        |    3 -
>  4 files changed, 41 insertions(+), 31 deletions(-)
>
> --- a/arch/x86/include/asm/cpu.h
> +++ b/arch/x86/include/asm/cpu.h
> @@ -36,6 +36,8 @@ extern int _debug_hotplug_cpu(int cpu, i
>  #endif
>  #endif
>
> +extern void ap_init_aperfmperf(void);
> +
>  int mwait_usable(const struct cpuinfo_x86 *);
>
>  unsigned int x86_family(unsigned int sig);
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -217,13 +217,9 @@ extern void arch_scale_freq_tick(void);
>
>  extern void arch_set_max_freq_ratio(bool turbo_disabled);
>  extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
> -extern void bp_init_freq_invariance(void);
> -extern void ap_init_freq_invariance(void);
>  #else
>  static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
>  static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
> -static inline void bp_init_freq_invariance(void) { }
> -static inline void ap_init_freq_invariance(void) { }
>  #endif
>
>  #ifdef CONFIG_ACPI_CPPC_LIB
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -17,6 +17,7 @@
>  #include <linux/smp.h>
>  #include <linux/syscore_ops.h>
>
> +#include <asm/cpu.h>
>  #include <asm/cpu_device_id.h>
>  #include <asm/intel-family.h>
>
> @@ -164,6 +165,17 @@ unsigned int arch_freq_get_on_cpu(int cp
>         return per_cpu(samples.khz, cpu);
>  }
>
> +static void init_counter_refs(void)
> +{
> +       u64 aperf, mperf;
> +
> +       rdmsrl(MSR_IA32_APERF, aperf);
> +       rdmsrl(MSR_IA32_MPERF, mperf);
> +
> +       this_cpu_write(cpu_samples.aperf, aperf);
> +       this_cpu_write(cpu_samples.mperf, mperf);
> +}
> +
>  #if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
>  /*
>   * APERF/MPERF frequency ratio computation.
> @@ -405,17 +417,6 @@ static bool __init intel_set_max_freq_ra
>         return true;
>  }
>
> -static void init_counter_refs(void)
> -{
> -       u64 aperf, mperf;
> -
> -       rdmsrl(MSR_IA32_APERF, aperf);
> -       rdmsrl(MSR_IA32_MPERF, mperf);
> -
> -       this_cpu_write(cpu_samples.aperf, aperf);
> -       this_cpu_write(cpu_samples.mperf, mperf);
> -}
> -
>  #ifdef CONFIG_PM_SLEEP
>  static struct syscore_ops freq_invariance_syscore_ops = {
>         .resume = init_counter_refs,
> @@ -447,13 +448,8 @@ void freq_invariance_set_perf_ratio(u64
>         freq_invariance_enable();
>  }
>
> -void __init bp_init_freq_invariance(void)
> +static void __init bp_init_freq_invariance(void)
>  {
> -       if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> -               return;
> -
> -       init_counter_refs();
> -
>         if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
>                 return;
>
> @@ -461,12 +457,6 @@ void __init bp_init_freq_invariance(void
>                 freq_invariance_enable();
>  }
>
> -void ap_init_freq_invariance(void)
> -{
> -       if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> -               init_counter_refs();
> -}
> -
>  static void disable_freq_invariance_workfn(struct work_struct *work)
>  {
>         static_branch_disable(&arch_scale_freq_key);
> @@ -481,6 +471,9 @@ static void scale_freq_tick(u64 acnt, u6
>  {
>         u64 freq_scale;
>
> +       if (!arch_scale_freq_invariant())
> +               return;
> +
>         if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
>                 goto error;
>
> @@ -501,13 +494,17 @@ static void scale_freq_tick(u64 acnt, u6
>         pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
>         schedule_work(&disable_freq_invariance_work);
>  }
> +#else
> +static inline void bp_init_freq_invariance(void) { }
> +static inline void scale_freq_tick(u64 acnt, u64 mcnt) { }
> +#endif /* CONFIG_X86_64 && CONFIG_SMP */
>
>  void arch_scale_freq_tick(void)
>  {
>         struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
>         u64 acnt, mcnt, aperf, mperf;
>
> -       if (!arch_scale_freq_invariant())
> +       if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
>                 return;
>
>         rdmsrl(MSR_IA32_APERF, aperf);
> @@ -520,4 +517,20 @@ void arch_scale_freq_tick(void)
>
>         scale_freq_tick(acnt, mcnt);
>  }
> -#endif /* CONFIG_X86_64 && CONFIG_SMP */
> +
> +static int __init bp_init_aperfmperf(void)
> +{
> +       if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> +               return 0;
> +
> +       init_counter_refs();
> +       bp_init_freq_invariance();
> +       return 0;
> +}
> +early_initcall(bp_init_aperfmperf);
> +
> +void ap_init_aperfmperf(void)
> +{
> +       if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> +               init_counter_refs();
> +}
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -186,7 +186,7 @@ static void smp_callin(void)
>          */
>         set_cpu_sibling_map(raw_smp_processor_id());
>
> -       ap_init_freq_invariance();
> +       ap_init_aperfmperf();
>
>         /*
>          * Get our bogomips.
> @@ -1396,7 +1396,6 @@ void __init native_smp_prepare_cpus(unsi
>  {
>         smp_prepare_cpus_common();
>
> -       bp_init_freq_invariance();
>         smp_sanity_check();
>
>         switch (apic_intr_mode) {
>
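
For readers following the split, the resulting tick path condenses to
roughly this (an illustrative sketch distilled from the hunks above,
not a verbatim copy of the patch):

        void arch_scale_freq_tick(void)
        {
                struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
                u64 acnt, mcnt, aperf, mperf;

                /* Sampling now depends only on the CPU feature bit ... */
                if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
                        return;

                rdmsrl(MSR_IA32_APERF, aperf);
                rdmsrl(MSR_IA32_MPERF, mperf);
                acnt = aperf - s->aperf;
                mcnt = mperf - s->mperf;

                s->aperf = aperf;
                s->mperf = mperf;

                /* ... while the scheduler scale update bails out early
                 * unless frequency invariance is enabled. */
                scale_freq_tick(acnt, mcnt);
        }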

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 08/10] x86/aperfmperf: Store aperf/mperf data for cpu frequency reads
  2022-04-15 19:20 ` [patch 08/10] x86/aperfmperf: Store aperf/mperf data for cpu frequency reads Thomas Gleixner
@ 2022-04-19 16:30   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2022-04-19 16:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, Linux PM,
	Eric Dumazet, Paul E. McKenney

On Fri, Apr 15, 2022 at 9:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Now that the MSR readout is unconditional, store the results in the per CPU
> data structure along with a jiffies timestamp for the CPU frequency readout
> code.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/x86/kernel/cpu/aperfmperf.c |   14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -24,11 +24,17 @@
>  #include "cpu.h"
>
>  struct aperfmperf {
> +       seqcount_t      seq;
> +       unsigned long   last_update;
> +       u64             acnt;
> +       u64             mcnt;
>         u64             aperf;
>         u64             mperf;
>  };
>
> -static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples);
> +static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples) = {
> +       .seq = SEQCNT_ZERO(cpu_samples.seq)
> +};
>
>  struct aperfmperf_sample {
>         unsigned int    khz;
> @@ -515,6 +521,12 @@ void arch_scale_freq_tick(void)
>         s->aperf = aperf;
>         s->mperf = mperf;
>
> +       raw_write_seqcount_begin(&s->seq);
> +       s->last_update = jiffies;
> +       s->acnt = acnt;
> +       s->mcnt = mcnt;
> +       raw_write_seqcount_end(&s->seq);
> +
>         scale_freq_tick(acnt, mcnt);
>  }
>
>
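
The raw_write_seqcount_begin/end pair above makes the sample consumable
without locking. A minimal sketch of the matching lockless read side
(the actual consumer is only added in the next patch):

        do {
                seq = raw_read_seqcount_begin(&s->seq);
                last = s->last_update;
                acnt = s->acnt;
                mcnt = s->mcnt;
        } while (read_seqcount_retry(&s->seq, seq));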

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 09/10] x86/aperfmperf: Replace aperfmperf_get_khz()
  2022-04-15 19:20 ` [patch 09/10] x86/aperfmperf: Replace aperfmperf_get_khz() Thomas Gleixner
@ 2022-04-19 16:35   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2022-04-19 16:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, Linux PM,
	Eric Dumazet, Paul E. McKenney, Eric Dumazet

On Fri, Apr 15, 2022 at 9:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> The frequency invariance infrastructure provides the APERF/MPERF samples
> already. Utilize them for the cpu frequency display in /proc/cpuinfo.
>
> The sample is considered valid for 20ms. So for idle or isolated NOHZ full
> CPUs the function returns 0, which matches the previous behaviour.
>
> This gets rid of the mass IPIs and the 20ms stabilization delay observed
> by Eric when reading /proc/cpuinfo.
>
> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

All fine IMV, one minor nit below.


Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/x86/kernel/cpu/aperfmperf.c |   77 +++++++++++++++++----------------------
>  fs/proc/cpuinfo.c                |    6 ---
>  include/linux/cpufreq.h          |    1
>  3 files changed, 35 insertions(+), 49 deletions(-)
>
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -100,49 +100,6 @@ static bool aperfmperf_snapshot_cpu(int
>         return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
>  }
>
> -unsigned int aperfmperf_get_khz(int cpu)
> -{
> -       if (!cpu_khz)
> -               return 0;
> -
> -       if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> -               return 0;
> -
> -       if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
> -               return 0;
> -
> -       if (rcu_is_idle_cpu(cpu))
> -               return 0; /* Idle CPUs are completely uninteresting. */
> -
> -       aperfmperf_snapshot_cpu(cpu, ktime_get(), true);
> -       return per_cpu(samples.khz, cpu);
> -}
> -
> -void arch_freq_prepare_all(void)
> -{
> -       ktime_t now = ktime_get();
> -       bool wait = false;
> -       int cpu;
> -
> -       if (!cpu_khz)
> -               return;
> -
> -       if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> -               return;
> -
> -       for_each_online_cpu(cpu) {
> -               if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
> -                       continue;
> -               if (rcu_is_idle_cpu(cpu))
> -                       continue; /* Idle CPUs are completely uninteresting. */
> -               if (!aperfmperf_snapshot_cpu(cpu, now, false))
> -                       wait = true;
> -       }
> -
> -       if (wait)
> -               msleep(APERFMPERF_REFRESH_DELAY_MS);
> -}
> -
>  unsigned int arch_freq_get_on_cpu(int cpu)
>  {
>         struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
> @@ -529,6 +486,40 @@ void arch_scale_freq_tick(void)
>         scale_freq_tick(acnt, mcnt);
>  }
>
> +/*
> + * Discard samples older than the defined maximum sample age of 20ms. There
> + * is no point in sending IPIs in such a case. If the scheduler tick was
> + * not running then the CPU is either idle or isolated.
> + */
> +#define MAX_SAMPLE_AGE ((unsigned long)HZ / 50)
> +
> +unsigned int aperfmperf_get_khz(int cpu)
> +{
> +       struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
> +       unsigned long last;
> +       unsigned int seq;
> +       u64 acnt, mcnt;
> +
> +       if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> +               return 0;
> +
> +       do {
> +               seq = raw_read_seqcount_begin(&s->seq);
> +               last = s->last_update;
> +               acnt = s->acnt;
> +               mcnt = s->mcnt;
> +       } while (read_seqcount_retry(&s->seq, seq));
> +
> +       /*
> +        * Bail on invalid count and when the last update was too long ago,
> +        * which covers idle and NOHZ full CPUs.
> +        */
> +       if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)

The inner parens are not needed here.

> +               return 0;
> +
> +       return div64_u64((cpu_khz * acnt), mcnt);
> +}
> +
>  static int __init bp_init_aperfmperf(void)
>  {
>         if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> --- a/fs/proc/cpuinfo.c
> +++ b/fs/proc/cpuinfo.c
> @@ -5,14 +5,10 @@
>  #include <linux/proc_fs.h>
>  #include <linux/seq_file.h>
>
> -__weak void arch_freq_prepare_all(void)
> -{
> -}
> -
>  extern const struct seq_operations cpuinfo_op;
> +
>  static int cpuinfo_open(struct inode *inode, struct file *file)
>  {
> -       arch_freq_prepare_all();
>         return seq_open(file, &cpuinfo_op);
>  }
>
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -1199,7 +1199,6 @@ static inline void sched_cpufreq_governo
>                         struct cpufreq_governor *old_gov) { }
>  #endif
>
> -extern void arch_freq_prepare_all(void);
>  extern unsigned int arch_freq_get_on_cpu(int cpu);
>
>  #ifndef arch_set_freq_scale
>
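
A worked example of the readout math above, assuming HZ=250 and a base
frequency of cpu_khz = 2300000 (2.3 GHz):

        MAX_SAMPLE_AGE = HZ / 50 = 250 / 50 = 5 jiffies = 20ms

If the last tick stored acnt = 46000000 and mcnt = 23000000, then

        aperfmperf_get_khz() = 2300000 * 46000000 / 23000000 = 4600000 kHz

i.e. the CPU ran at twice its base clock during the last tick period.
A sample older than 5 jiffies yields 0 instead, which covers idle and
NOHZ full CPUs.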

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 10/10] x86/aperfmperf: Replace arch_freq_get_on_cpu()
  2022-04-15 19:20 ` [patch 10/10] x86/aperfmperf: Replace arch_freq_get_on_cpu() Thomas Gleixner
@ 2022-04-19 16:37   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2022-04-19 16:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, Linux PM,
	Eric Dumazet, Paul E. McKenney

On Fri, Apr 15, 2022 at 9:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Reading the current CPU frequency from /sys/..../scaling_cur_freq involves
> in the worst case two IPIs due to the ad hoc sampling.
>
> The frequency invariance infrastructure provides the APERF/MPERF samples
> already. Utilize them and consolidate this with the /proc/cpuinfo readout.
>
> The sample is considered valid for 20ms. So for idle or isolated NOHZ full
> CPUs the function returns 0, which matches the previous behaviour.
>
> The resulting text size vs. the original APERF/MPERF plus the separate
> frequency invariance code:
>
>   text:         2411    ->   723
>   init.text:       0    ->   767
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

All good AFAICS.

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/x86/kernel/cpu/aperfmperf.c |   94 ---------------------------------------
>  arch/x86/kernel/cpu/proc.c       |    2
>  2 files changed, 2 insertions(+), 94 deletions(-)
>
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -35,98 +35,6 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(str
>         .seq = SEQCNT_ZERO(cpu_samples.seq)
>  };
>
> -struct aperfmperf_sample {
> -       unsigned int    khz;
> -       atomic_t        scfpending;
> -       ktime_t time;
> -       u64     aperf;
> -       u64     mperf;
> -};
> -
> -static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
> -
> -#define APERFMPERF_CACHE_THRESHOLD_MS  10
> -#define APERFMPERF_REFRESH_DELAY_MS    10
> -#define APERFMPERF_STALE_THRESHOLD_MS  1000
> -
> -/*
> - * aperfmperf_snapshot_khz()
> - * On the current CPU, snapshot APERF, MPERF, and jiffies
> - * unless we already did it within 10ms
> - * calculate kHz, save snapshot
> - */
> -static void aperfmperf_snapshot_khz(void *dummy)
> -{
> -       u64 aperf, aperf_delta;
> -       u64 mperf, mperf_delta;
> -       struct aperfmperf_sample *s = this_cpu_ptr(&samples);
> -       unsigned long flags;
> -
> -       local_irq_save(flags);
> -       rdmsrl(MSR_IA32_APERF, aperf);
> -       rdmsrl(MSR_IA32_MPERF, mperf);
> -       local_irq_restore(flags);
> -
> -       aperf_delta = aperf - s->aperf;
> -       mperf_delta = mperf - s->mperf;
> -
> -       /*
> -        * There is no architectural guarantee that MPERF
> -        * increments faster than we can read it.
> -        */
> -       if (mperf_delta == 0)
> -               return;
> -
> -       s->time = ktime_get();
> -       s->aperf = aperf;
> -       s->mperf = mperf;
> -       s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
> -       atomic_set_release(&s->scfpending, 0);
> -}
> -
> -static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait)
> -{
> -       s64 time_delta = ktime_ms_delta(now, per_cpu(samples.time, cpu));
> -       struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
> -
> -       /* Don't bother re-computing within the cache threshold time. */
> -       if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
> -               return true;
> -
> -       if (!atomic_xchg(&s->scfpending, 1) || wait)
> -               smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait);
> -
> -       /* Return false if the previous iteration was too long ago. */
> -       return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
> -}
> -
> -unsigned int arch_freq_get_on_cpu(int cpu)
> -{
> -       struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
> -
> -       if (!cpu_khz)
> -               return 0;
> -
> -       if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> -               return 0;
> -
> -       if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
> -               return 0;
> -
> -       if (rcu_is_idle_cpu(cpu))
> -               return 0;
> -
> -       if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
> -               return per_cpu(samples.khz, cpu);
> -
> -       msleep(APERFMPERF_REFRESH_DELAY_MS);
> -       atomic_set(&s->scfpending, 1);
> -       smp_mb(); /* ->scfpending before smp_call_function_single(). */
> -       smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
> -
> -       return per_cpu(samples.khz, cpu);
> -}
> -
>  static void init_counter_refs(void)
>  {
>         u64 aperf, mperf;
> @@ -493,7 +401,7 @@ void arch_scale_freq_tick(void)
>   */
>  #define MAX_SAMPLE_AGE ((unsigned long)HZ / 50)
>
> -unsigned int aperfmperf_get_khz(int cpu)
> +unsigned int arch_freq_get_on_cpu(int cpu)
>  {
>         struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
>         unsigned long last;
> --- a/arch/x86/kernel/cpu/proc.c
> +++ b/arch/x86/kernel/cpu/proc.c
> @@ -84,7 +84,7 @@ static int show_cpuinfo(struct seq_file
>                 seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
>
>         if (cpu_has(c, X86_FEATURE_TSC)) {
> -               unsigned int freq = aperfmperf_get_khz(cpu);
> +               unsigned int freq = arch_freq_get_on_cpu(cpu);
>
>                 if (!freq)
>                         freq = cpufreq_quick_get(cpu);
>
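
The "two IPIs in the worst case" are visible in the removed code above:
aperfmperf_snapshot_cpu() fires the first smp_call_function_single(),
and when the cached sample turned out to be stale arch_freq_get_on_cpu()
slept for APERFMPERF_REFRESH_DELAY_MS and fired a second one. Both
round trips are gone now that the tick provides the sample.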

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
                   ` (10 preceding siblings ...)
  2022-04-19 15:51 ` [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Eric Dumazet
@ 2022-04-19 16:41 ` Peter Zijlstra
  2022-04-19 17:32 ` Doug Smythies
  2022-04-19 21:56 ` [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Paul E. McKenney
  13 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2022-04-19 16:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Rafael J. Wysocki, linux-pm, Eric Dumazet, Paul E. McKenney

On Fri, Apr 15, 2022 at 09:19:48PM +0200, Thomas Gleixner wrote:
> ---
>  arch/x86/include/asm/cpu.h       |    2 
>  arch/x86/include/asm/topology.h  |   17 -
>  arch/x86/kernel/acpi/cppc.c      |   28 --
>  arch/x86/kernel/cpu/aperfmperf.c |  474 +++++++++++++++++++++++++++++++--------
>  arch/x86/kernel/cpu/proc.c       |    2 
>  arch/x86/kernel/smpboot.c        |  358 -----------------------------
>  fs/proc/cpuinfo.c                |    6 
>  include/linux/cpufreq.h          |    1 
>  8 files changed, 405 insertions(+), 483 deletions(-)


Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
                   ` (11 preceding siblings ...)
  2022-04-19 16:41 ` Peter Zijlstra
@ 2022-04-19 17:32 ` Doug Smythies
  2022-04-19 18:49   ` Rafael J. Wysocki
  2022-04-19 21:56 ` [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Paul E. McKenney
  13 siblings, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2022-04-19 17:32 UTC (permalink / raw)
  To: 'Thomas Gleixner'
  Cc: x86, 'Rafael J. Wysocki',
	linux-pm, 'Eric Dumazet', 'Paul E. McKenney',
	'LKML',
	Doug Smythies

Hi Thomas,

On 2022.04.15 12:20 Thomas Gleixner wrote:

> APERF/MPERF is utilized in two ways:
>
>  1) Ad hoc readout of CPU frequency which requires IPIs
>
>  2) Frequency scale calculation for frequency invariant scheduling which
>     reads APERF/MPERF on every tick.
>
> These are completely independent code parts. Eric observed long latencies
> when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> proposed to replace the per CPU single IPI with a broadcast IPI.
>
> While this makes the latency smaller, it is not necessary at all because #2
> samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> which are excluded from IPI already.
>
> It could be argued that not all APERF/MPERF capable systems have the
> required BIOS information to enable frequency invariance support, but in
> practice most of them do. So the APERF/MPERF sampling can be made
> unconditional and just the frequency scale calculation for the scheduler
> excluded.
>
> The following series consolidates that.

I have used this patch set with the acpi-cpufreq, intel_cpufreq (passive),
and intel_pstate (active) CPU frequency scaling drivers and various
governors. Additionally, with HWP both enabled and disabled.

For intel_pstate (active), with HWP either enabled or disabled, the
behaviour of scaling_cur_freq is inconsistent with its behaviour prior
to this patch set and with other scaling driver/governor combinations.

Note there is no issue with " grep MHz /proc/cpuinfo" for any
combination.

Examples:

No-HWP:

active/powersave:
doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

active/performance:
doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

HWP:

active/powersave:
doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:799993
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800069
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800131
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:799844

active/performance:

doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:4800186
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:4800016
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0 

Other configurations:
intel_cpufreq /schedutil (no HWP), for example:

doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:1067573
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800011
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:800109
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800000

Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz

> Thanks,
>
>	tglx
> ---
> arch/x86/include/asm/cpu.h       |    2 
> arch/x86/include/asm/topology.h  |   17 -
> arch/x86/kernel/acpi/cppc.c      |   28 --
> arch/x86/kernel/cpu/aperfmperf.c |  474 +++++++++++++++++++++++++++++++--------
> arch/x86/kernel/cpu/proc.c       |    2 
> arch/x86/kernel/smpboot.c        |  358 -----------------------------
> fs/proc/cpuinfo.c                |    6 
> include/linux/cpufreq.h          |    1 
> 8 files changed, 405 insertions(+), 483 deletions(-)



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
  2022-04-19 17:32 ` Doug Smythies
@ 2022-04-19 18:49   ` Rafael J. Wysocki
  2022-04-19 21:11     ` Thomas Gleixner
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2022-04-19 18:49 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Thomas Gleixner, the arch/x86 maintainers, Rafael J. Wysocki,
	Linux PM, Eric Dumazet, Paul E. McKenney, LKML

On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Thomas,
>
> On 2022.04.15 12:20 Thomas Gleixner wrote:
>
> > APERF/MPERF is utilized in two ways:
> >
> >  1) Ad hoc readout of CPU frequency which requires IPIs
> >
> >  2) Frequency scale calculation for frequency invariant scheduling which
> >     reads APERF/MPERF on every tick.
> >
> > These are completely independent code parts. Eric observed long latencies
> > when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> > proposed to replace the per CPU single IPI with a broadcast IPI.
> >
> > While this makes the latency smaller, it is not necessary at all because #2
> > samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> > which are excluded from IPI already.
> >
> > It could be argued that not all APERF/MPERF capable systems have the
> > required BIOS information to enable frequency invariance support, but in
> > practice most of them do. So the APERF/MPERF sampling can be made
> > unconditional and just the frequency scale calculation for the scheduler
> > excluded.
> >
> > The following series consolidates that.
>
> I have used this patch set with the acpi-cpufreq, intel_cpufreq (passive),
> and intel_pstate (active) CPU frequency scaling drivers and various
> governors. Additionally, with HWP both enabled and disabled.
>
> For intel_pstate (active), with HWP either enabled or disabled, the
> behaviour of scaling_cur_freq is inconsistent with its behaviour prior
> to this patch set and with other scaling driver/governor combinations.
>
> Note there is no issue with " grep MHz /proc/cpuinfo" for any
> combination.
>
> Examples:
>
> No-HWP:
>
> active/powersave:
> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

That's because after the changes in this series scaling_cur_freq
returns 0 if the given CPU is idle.

I guess it could return the last known result, but that wouldn't be
more meaningful.
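
More precisely, 0 is returned whenever the per CPU sample is older than
MAX_SAMPLE_AGE (20ms), which is the case for idle CPUs because the tick
does not run there and therefore never refreshes the sample.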

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
  2022-04-19 15:51 ` [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Eric Dumazet
@ 2022-04-19 20:39   ` Thomas Gleixner
  2022-04-19 21:20     ` Eric Dumazet
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-19 20:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, linux-pm,
	Paul E. McKenney

Eric,

On Tue, Apr 19 2022 at 08:51, Eric Dumazet wrote:
> On Fri, Apr 15, 2022 at 12:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> It could be argued that not all APERF/MPERF capable systems have the
>> required BIOS information to enable frequency invariance support, but in
>> practice most of them do. So the APERF/MPERF sampling can be made
>> unconditional and just the frequency scale calculation for the scheduler
>> excluded.
>>
>> The following series consolidates that.
>>
>
> Thanks a lot for working on that Thomas.
>
> I am not sure I will be able to backport this to a Google prodkernel,
> as I guess there will be many merge conflicts.

:)

> Do you have by any chance this work available in a git branch ?

 git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/amperf

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
  2022-04-19 18:49   ` Rafael J. Wysocki
@ 2022-04-19 21:11     ` Thomas Gleixner
  2022-04-20 22:08       ` Doug Smythies
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-19 21:11 UTC (permalink / raw)
  To: Rafael J. Wysocki, Doug Smythies
  Cc: the arch/x86 maintainers, Rafael J. Wysocki, Linux PM,
	Eric Dumazet, Paul E. McKenney, LKML

On Tue, Apr 19 2022 at 20:49, Rafael J. Wysocki wrote:
> On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <dsmythies@telus.net> wrote:
>> For intel_pstate (active), with HWP either enabled or disabled, the
>> behaviour of scaling_cur_freq is inconsistent with its behaviour prior
>> to this patch set and with other scaling driver/governor combinations.
>>
>> Note there is no issue with " grep MHz /proc/cpuinfo" for any
>> combination.
>>
>> Examples:
>>
>> No-HWP:
>>
>> active/powersave:
>> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
>> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
>> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
>> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0
>
> That's because after the changes in this series scaling_cur_freq
> returns 0 if the given CPU is idle.

Which is sensible IMO as there is really no point in waking an idle CPU
just to read those MSRs, then wait 20ms and wake it up again to read
those MSRs again.

> I guess it could return the last known result, but that wouldn't be
> more meaningful.

Right.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
  2022-04-19 20:39   ` Thomas Gleixner
@ 2022-04-19 21:20     ` Eric Dumazet
  0 siblings, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2022-04-19 21:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Rafael J. Wysocki, linux-pm,
	Paul E. McKenney

On Tue, Apr 19, 2022 at 1:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/amperf
>

Excellent, things look fine to me.

Before:
# grep MHz /proc/cpuinfo | sort | uniq -c
    255 cpu MHz : 2249.998
      1 cpu MHz : 3297.719

After:
# grep MHz /proc/cpuinfo | sort|uniq -c
      1 cpu MHz : 1590.400
      1 cpu MHz : 1684.772
      1 cpu MHz : 1693.890
      1 cpu MHz : 1780.072
      1 cpu MHz : 1784.513
      1 cpu MHz : 1831.106
      1 cpu MHz : 1880.344
      1 cpu MHz : 1953.481
      1 cpu MHz : 1980.636
      1 cpu MHz : 2013.620
      1 cpu MHz : 2219.617
    240 cpu MHz : 2250.173
      1 cpu MHz : 3292.206
      1 cpu MHz : 3294.956
      1 cpu MHz : 3297.653
      1 cpu MHz : 3298.385
      1 cpu MHz : 3300.197

Tested-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
  2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
                   ` (12 preceding siblings ...)
  2022-04-19 17:32 ` Doug Smythies
@ 2022-04-19 21:56 ` Paul E. McKenney
  13 siblings, 0 replies; 51+ messages in thread
From: Paul E. McKenney @ 2022-04-19 21:56 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, x86, Rafael J. Wysocki, linux-pm, Eric Dumazet

On Fri, Apr 15, 2022 at 09:19:48PM +0200, Thomas Gleixner wrote:
> APERF/MPERF is utilized in two ways:
> 
>   1) Ad hoc readout of CPU frequency which requires IPIs
> 
>   2) Frequency scale calculation for frequency invariant scheduling which
>      reads APERF/MPERF on every tick.
> 
> These are completely independent code parts. Eric observed long latencies
> when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> proposed to replace the per CPU single IPI with a broadcast IPI.
> 
> While this makes the latency smaller, it is not necessary at all because #2
> samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> which are excluded from IPI already.
> 
> It could be argued that not all APERF/MPERF capable systems have the
> required BIOS information to enable frequency invariance support, but in
> practice most of them do. So the APERF/MPERF sampling can be made
> unconditional and just the frequency scale calculation for the scheduler
> excluded.
> 
> The following series consolidates that.

Acked-by: Paul E. McKenney <paulmck@kernel.org>

> Thanks,
> 
> 	tglx
> ---
>  arch/x86/include/asm/cpu.h       |    2 
>  arch/x86/include/asm/topology.h  |   17 -
>  arch/x86/kernel/acpi/cppc.c      |   28 --
>  arch/x86/kernel/cpu/aperfmperf.c |  474 +++++++++++++++++++++++++++++++--------
>  arch/x86/kernel/cpu/proc.c       |    2 
>  arch/x86/kernel/smpboot.c        |  358 -----------------------------
>  fs/proc/cpuinfo.c                |    6 
>  include/linux/cpufreq.h          |    1 
>  8 files changed, 405 insertions(+), 483 deletions(-)
> 
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
  2022-04-19 21:11     ` Thomas Gleixner
@ 2022-04-20 22:08       ` Doug Smythies
  2022-04-25 15:45         ` Thomas Gleixner
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2022-04-20 22:08 UTC (permalink / raw)
  To: 'Thomas Gleixner', 'Rafael J. Wysocki'
  Cc: 'the arch/x86 maintainers', 'Rafael J. Wysocki',
	'Linux PM', 'Eric Dumazet',
	'Paul E. McKenney', 'LKML',
	Doug Smythies

Hi Thomas, Rafael,

Thank you for your replies.

On 2022.04.19 14:11 Thomas Gleixner wrote:
> On Tue, Apr 19 2022 at 20:49, Rafael J. Wysocki wrote:
>> On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <dsmythies@telus.net> wrote:
>>> For intel_pstate (active), with HWP either enabled or disabled, the
>>> behaviour of scaling_cur_freq is inconsistent with its behaviour prior
>>> to this patch set and with other scaling driver/governor combinations.
>>>
>>> Note there is no issue with " grep MHz /proc/cpuinfo" for any
>>> combination.
>>>
>>> Examples:
>>>
>>> No-HWP:
>>>
>>> active/powersave:
>>> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
>>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
>>> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
>>> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
>>> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0
>>
>> That's because after the changes in this series scaling_cur_freq
>> returns 0 if the given CPU is idle.
>
> Which is sensible IMO as there is really no point in waking an idle CPU
> just to read those MSRs, then wait 20ms and wake it up again to read
> those MSRs again.

I totally agree.
It is the inconsistency in what is displayed as a function of driver/governor
that is my concern.

>
>> I guess it could return the last known result, but that wouldn't be
>> more meaningful.
>
> Right.

How about something like this, which I realize might break something else,
but just to demonstrate:

doug@s19:~/kernel/linux$ git diff
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 80f535cc8a75..a161e75794cd 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -710,7 +710,7 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
        else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
                ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
        else
-               ret = sprintf(buf, "%u\n", policy->cur);
+               ret = sprintf(buf, "%u\n", freq);
        return ret;
 }

Note: I left the other 0 return condition, because I do not know what uses it.

Which gives:

acpi-cpufreq/schedutil
doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:4100723

intel_pstate/powersave (no-HWP)
doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800295
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800015

intel_cpufreq/schedutil (no-HWP)
doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:1971265
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2785446
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

Which I suggest is more consistent.

Note: because it was deleted from this thread, 
and just for reference, I'll repost the previous
intel_cpufreq/schedutil (no-HWP) output:

doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:1067573
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800011
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:800109
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800000

... Doug



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* RE: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
  2022-04-20 22:08       ` Doug Smythies
@ 2022-04-25 15:45         ` Thomas Gleixner
  2022-04-25 23:20           ` Doug Smythies
                             ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Thomas Gleixner @ 2022-04-25 15:45 UTC (permalink / raw)
  To: Doug Smythies, 'Rafael J. Wysocki'
  Cc: 'the arch/x86 maintainers', 'Rafael J. Wysocki',
	'Linux PM', 'Eric Dumazet',
	'Paul E. McKenney', 'LKML',
	Doug Smythies

On Wed, Apr 20 2022 at 15:08, Doug Smythies wrote:
> On 2022.04.19 14:11 Thomas Gleixner wrote:
>>> That's because after the changes in this series scaling_cur_freq
>>> returns 0 if the given CPU is idle.
>>
>> Which is sensible IMO as there is really no point in waking an idle CPU
>> just to read those MSRs, then wait 20ms and wake it up again to read
>> those MSRs again.
>
> I totally agree.
> It is the inconsistency in what is displayed as a function of driver/governor
> that is my concern.

Rafael suggested moving the show_cpuinfo() logic into the a/mperf
code. See below.

Thanks,

        tglx
---
Subject: x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
From: Thomas Gleixner <tglx@linutronix.de>
Date: Mon, 25 Apr 2022 15:19:29 +0200

Due to the avoidance of IPIs to idle CPUs arch_freq_get_on_cpu() can return
0 when the last sample was too long ago.

show_cpuinfo() has a fallback to cpufreq_quick_get() and, if that fails,
to cpu_khz, but the readout code for the per CPU scaling frequency in
sysfs does not.

Move that fallback into arch_freq_get_on_cpu() so the behaviour is the same
when reading /proc/cpuinfo and /sys/..../scaling_cur_freq.

Suggested-by: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/aperfmperf.c |   10 +++++++---
 arch/x86/kernel/cpu/proc.c       |    7 +------
 2 files changed, 8 insertions(+), 9 deletions(-)

--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -405,12 +405,12 @@ void arch_scale_freq_tick(void)
 unsigned int arch_freq_get_on_cpu(int cpu)
 {
 	struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
+	unsigned int seq, freq;
 	unsigned long last;
-	unsigned int seq;
 	u64 acnt, mcnt;
 
 	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
-		return 0;
+		goto fallback;
 
 	do {
 		seq = raw_read_seqcount_begin(&s->seq);
@@ -424,9 +424,13 @@ unsigned int arch_freq_get_on_cpu(int cp
 	 * which covers idle and NOHZ full CPUs.
 	 */
 	if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
-		return 0;
+		goto fallback;
 
 	return div64_u64((cpu_khz * acnt), mcnt);
+
+fallback:
+	freq = cpufreq_quick_get(cpu);
+	return freq ? freq : cpu_khz;
 }
 
 static int __init bp_init_aperfmperf(void)
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -86,12 +86,7 @@ static int show_cpuinfo(struct seq_file
 	if (cpu_has(c, X86_FEATURE_TSC)) {
 		unsigned int freq = arch_freq_get_on_cpu(cpu);
 
-		if (!freq)
-			freq = cpufreq_quick_get(cpu);
-		if (!freq)
-			freq = cpu_khz;
-		seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
-			   freq / 1000, (freq % 1000));
+		seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000));
 	}
 
 	/* Cache size */
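
With the fallback folded in, /proc/cpuinfo and scaling_cur_freq resolve
the frequency in the same order (a condensed sketch of the logic above):

        /*
         * 1) APERF/MPERF sample, if the feature exists and the sample is
         *    younger than MAX_SAMPLE_AGE -> div64_u64(cpu_khz * acnt, mcnt)
         * 2) cpufreq_quick_get(cpu)      -> frequency as known to cpufreq
         * 3) cpu_khz                     -> boot time calibration value
         */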

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code
  2022-04-25 15:45         ` Thomas Gleixner
@ 2022-04-25 23:20           ` Doug Smythies
  2022-04-27 13:56           ` [tip: x86/cleanups] x86/aperfmperf: Integrate the fallback code from show_cpuinfo() tip-bot2 for Thomas Gleixner
  2022-04-27 18:27           ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: Doug Smythies @ 2022-04-25 23:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Rafael J. Wysocki, the arch/x86 maintainers, Linux PM,
	Eric Dumazet, Paul E. McKenney, LKML, dsmythies

On Mon, Apr 25, 2022 at 8:45 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, Apr 20 2022 at 15:08, Doug Smythies wrote:
>> On 2022.04.19 14:11 Thomas Gleixner wrote:
>>>> That's because after the changes in this series scaling_cur_freq
>>>> returns 0 if the given CPU is idle.
>>>
>>> Which is sensible IMO as there is really no point in waking an idle CPU
>>> just to read those MSRs, then wait 20ms and wake it up again to read
>>> those MSRs again.
>>
>> I totally agree.
>> It is the inconsistency in what is displayed as a function of driver/governor
>> that is my concern.
>
> Raphael suggested to move the show_cpuinfo() logic into the a/mperf
> code. See below.

Hi Thomas,

I tested the patch on top of your 10 patch set on kernel 5.18-rc3.
It addresses my consistency concerns.

Thank you

... Doug

> ---
> Subject: x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Mon, 25 Apr 2022 15:19:29 +0200
>
...

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [tip: x86/cleanups] x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
  2022-04-25 15:45         ` Thomas Gleixner
  2022-04-25 23:20           ` Doug Smythies
@ 2022-04-27 13:56           ` tip-bot2 for Thomas Gleixner
  2022-04-27 18:27           ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 13:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rafael J. Wysocki, Thomas Gleixner, Doug Smythies, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     e696cabf5da2b4ed104508674de6125b860f3c9f
Gitweb:        https://git.kernel.org/tip/e696cabf5da2b4ed104508674de6125b860f3c9f
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Mon, 25 Apr 2022 17:45:42 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 15:51:09 +02:00

x86/aperfmperf: Integrate the fallback code from show_cpuinfo()

Due to the avoidance of IPIs to idle CPUs arch_freq_get_on_cpu() can return
0 when the last sample was too long ago.

show_cpuinfo() has a fallback to cpufreq_quick_get() and, if that fails,
to cpu_khz, but the readout code for the per CPU scaling frequency in
sysfs does not.

Move that fallback into arch_freq_get_on_cpu() so the behaviour is the same
when reading /proc/cpuinfo and /sys/..../scaling_cur_freq.

Suggested-by: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Doug Smythies <dsmythies@telus.net>
Link: https://lore.kernel.org/r/87pml5180p.ffs@tglx

---
 arch/x86/kernel/cpu/aperfmperf.c | 10 +++++++---
 arch/x86/kernel/cpu/proc.c       |  7 +------
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index b15c884..1f60a2b 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -405,12 +405,12 @@ void arch_scale_freq_tick(void)
 unsigned int arch_freq_get_on_cpu(int cpu)
 {
 	struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
+	unsigned int seq, freq;
 	unsigned long last;
-	unsigned int seq;
 	u64 acnt, mcnt;
 
 	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
-		return 0;
+		goto fallback;
 
 	do {
 		seq = raw_read_seqcount_begin(&s->seq);
@@ -424,9 +424,13 @@ unsigned int arch_freq_get_on_cpu(int cpu)
 	 * which covers idle and NOHZ full CPUs.
 	 */
 	if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
-		return 0;
+		goto fallback;
 
 	return div64_u64((cpu_khz * acnt), mcnt);
+
+fallback:
+	freq = cpufreq_quick_get(cpu);
+	return freq ? freq : cpu_khz;
 }
 
 static int __init bp_init_aperfmperf(void)
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 0a0ee55..099b6f0 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -86,12 +86,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 	if (cpu_has(c, X86_FEATURE_TSC)) {
 		unsigned int freq = arch_freq_get_on_cpu(cpu);
 
-		if (!freq)
-			freq = cpufreq_quick_get(cpu);
-		if (!freq)
-			freq = cpu_khz;
-		seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
-			   freq / 1000, (freq % 1000));
+		seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000));
 	}
 
 	/* Cache size */

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [tip: x86/cleanups] x86/aperfmperf: Replace arch_freq_get_on_cpu()
  2022-04-15 19:20 ` [patch 10/10] x86/aperfmperf: Replace arch_freq_get_on_cpu() Thomas Gleixner
  2022-04-19 16:37   ` Rafael J. Wysocki
@ 2022-04-27 13:56   ` tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 13:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Eric Dumazet, Rafael J. Wysocki,
	Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     a86a1356ee79da6bc4abfb9e0499f963ff6da9ae
Gitweb:        https://git.kernel.org/tip/a86a1356ee79da6bc4abfb9e0499f963ff6da9ae
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:20:04 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 15:51:09 +02:00

x86/aperfmperf: Replace arch_freq_get_on_cpu()

Reading the current CPU frequency from /sys/..../scaling_cur_freq involves
in the worst case two IPIs due to the ad hoc sampling.

The frequency invariance infrastructure provides the APERF/MPERF samples
already. Utilize them and consolidate this with the /proc/cpuinfo readout.

The sample is considered valid for 20ms. So for idle or isolated NOHZ full
CPUs the function returns 0, which matches the previous behaviour.

The resulting text size vs. the original APERF/MPERF plus the separate
frequency invariance code:

  text:		2411	->   723
  init.text:	   0	->   767

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.934040006@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 94 +-------------------------------
 arch/x86/kernel/cpu/proc.c       |  2 +-
 2 files changed, 2 insertions(+), 94 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index e9d2da7..b15c884 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -36,98 +36,6 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples) = {
 	.seq = SEQCNT_ZERO(cpu_samples.seq)
 };
 
-struct aperfmperf_sample {
-	unsigned int	khz;
-	atomic_t	scfpending;
-	ktime_t	time;
-	u64	aperf;
-	u64	mperf;
-};
-
-static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
-
-#define APERFMPERF_CACHE_THRESHOLD_MS	10
-#define APERFMPERF_REFRESH_DELAY_MS	10
-#define APERFMPERF_STALE_THRESHOLD_MS	1000
-
-/*
- * aperfmperf_snapshot_khz()
- * On the current CPU, snapshot APERF, MPERF, and jiffies
- * unless we already did it within 10ms
- * calculate kHz, save snapshot
- */
-static void aperfmperf_snapshot_khz(void *dummy)
-{
-	u64 aperf, aperf_delta;
-	u64 mperf, mperf_delta;
-	struct aperfmperf_sample *s = this_cpu_ptr(&samples);
-	unsigned long flags;
-
-	local_irq_save(flags);
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-	local_irq_restore(flags);
-
-	aperf_delta = aperf - s->aperf;
-	mperf_delta = mperf - s->mperf;
-
-	/*
-	 * There is no architectural guarantee that MPERF
-	 * increments faster than we can read it.
-	 */
-	if (mperf_delta == 0)
-		return;
-
-	s->time = ktime_get();
-	s->aperf = aperf;
-	s->mperf = mperf;
-	s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
-	atomic_set_release(&s->scfpending, 0);
-}
-
-static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait)
-{
-	s64 time_delta = ktime_ms_delta(now, per_cpu(samples.time, cpu));
-	struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
-
-	/* Don't bother re-computing within the cache threshold time. */
-	if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
-		return true;
-
-	if (!atomic_xchg(&s->scfpending, 1) || wait)
-		smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait);
-
-	/* Return false if the previous iteration was too long ago. */
-	return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
-}
-
-unsigned int arch_freq_get_on_cpu(int cpu)
-{
-	struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
-
-	if (!cpu_khz)
-		return 0;
-
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return 0;
-
-	if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
-		return 0;
-
-	if (rcu_is_idle_cpu(cpu))
-		return 0;
-
-	if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
-		return per_cpu(samples.khz, cpu);
-
-	msleep(APERFMPERF_REFRESH_DELAY_MS);
-	atomic_set(&s->scfpending, 1);
-	smp_mb(); /* ->scfpending before smp_call_function_single(). */
-	smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
-
-	return per_cpu(samples.khz, cpu);
-}
-
 static void init_counter_refs(void)
 {
 	u64 aperf, mperf;
@@ -494,7 +402,7 @@ void arch_scale_freq_tick(void)
  */
 #define MAX_SAMPLE_AGE	((unsigned long)HZ / 50)
 
-unsigned int aperfmperf_get_khz(int cpu)
+unsigned int arch_freq_get_on_cpu(int cpu)
 {
 	struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
 	unsigned long last;
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 4eec888..0a0ee55 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -84,7 +84,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 		seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
 
 	if (cpu_has(c, X86_FEATURE_TSC)) {
-		unsigned int freq = aperfmperf_get_khz(cpu);
+		unsigned int freq = arch_freq_get_on_cpu(cpu);
 
 		if (!freq)
 			freq = cpufreq_quick_get(cpu);

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [tip: x86/cleanups] x86/aperfmperf: Replace aperfmperf_get_khz()
  2022-04-15 19:20 ` [patch 09/10] x86/aperfmperf: Replace aperfmperf_get_khz() Thomas Gleixner
  2022-04-19 16:35   ` Rafael J. Wysocki
@ 2022-04-27 13:56   ` tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 13:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Eric Dumazet, Thomas Gleixner, Eric Dumazet, Rafael J. Wysocki,
	Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     61551f094837f77952eba2fdf8b913bb5b191ced
Gitweb:        https://git.kernel.org/tip/61551f094837f77952eba2fdf8b913bb5b191ced
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:20:02 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 15:51:08 +02:00

x86/aperfmperf: Replace aperfmperf_get_khz()

The frequency invariance infrastructure provides the APERF/MPERF samples
already. Utilize them for the cpu frequency display in /proc/cpuinfo.

The sample is considered valid for 20ms. So for idle or isolated NOHZ full
CPUs the function returns 0, which matches the previous behaviour.

This gets rid of the mass IPIs and the 20ms stabilization delay which Eric
observed when reading /proc/cpuinfo.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.875029458@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 77 +++++++++++++------------------
 fs/proc/cpuinfo.c                |  6 +--
 include/linux/cpufreq.h          |  1 +-
 3 files changed, 35 insertions(+), 49 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 963c069..e9d2da7 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -101,49 +101,6 @@ static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait)
 	return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
 }
 
-unsigned int aperfmperf_get_khz(int cpu)
-{
-	if (!cpu_khz)
-		return 0;
-
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return 0;
-
-	if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
-		return 0;
-
-	if (rcu_is_idle_cpu(cpu))
-		return 0; /* Idle CPUs are completely uninteresting. */
-
-	aperfmperf_snapshot_cpu(cpu, ktime_get(), true);
-	return per_cpu(samples.khz, cpu);
-}
-
-void arch_freq_prepare_all(void)
-{
-	ktime_t now = ktime_get();
-	bool wait = false;
-	int cpu;
-
-	if (!cpu_khz)
-		return;
-
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return;
-
-	for_each_online_cpu(cpu) {
-		if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
-			continue;
-		if (rcu_is_idle_cpu(cpu))
-			continue; /* Idle CPUs are completely uninteresting. */
-		if (!aperfmperf_snapshot_cpu(cpu, now, false))
-			wait = true;
-	}
-
-	if (wait)
-		msleep(APERFMPERF_REFRESH_DELAY_MS);
-}
-
 unsigned int arch_freq_get_on_cpu(int cpu)
 {
 	struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
@@ -530,6 +487,40 @@ void arch_scale_freq_tick(void)
 	scale_freq_tick(acnt, mcnt);
 }
 
+/*
+ * Discard samples older than the defined maximum sample age of 20ms. There
+ * is no point in sending IPIs in such a case. If the scheduler tick was
+ * not running then the CPU is either idle or isolated.
+ */
+#define MAX_SAMPLE_AGE	((unsigned long)HZ / 50)
+
+unsigned int aperfmperf_get_khz(int cpu)
+{
+	struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
+	unsigned long last;
+	unsigned int seq;
+	u64 acnt, mcnt;
+
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		return 0;
+
+	do {
+		seq = raw_read_seqcount_begin(&s->seq);
+		last = s->last_update;
+		acnt = s->acnt;
+		mcnt = s->mcnt;
+	} while (read_seqcount_retry(&s->seq, seq));
+
+	/*
+	 * Bail on invalid count and when the last update was too long ago,
+	 * which covers idle and NOHZ full CPUs.
+	 */
+	if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
+		return 0;
+
+	return div64_u64((cpu_khz * acnt), mcnt);
+}
+
 static int __init bp_init_aperfmperf(void)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
diff --git a/fs/proc/cpuinfo.c b/fs/proc/cpuinfo.c
index 419760f..f38bda5 100644
--- a/fs/proc/cpuinfo.c
+++ b/fs/proc/cpuinfo.c
@@ -5,14 +5,10 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 
-__weak void arch_freq_prepare_all(void)
-{
-}
-
 extern const struct seq_operations cpuinfo_op;
+
 static int cpuinfo_open(struct inode *inode, struct file *file)
 {
-	arch_freq_prepare_all();
 	return seq_open(file, &cpuinfo_op);
 }
 
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 35c7d6d..d5595d5 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -1199,7 +1199,6 @@ static inline void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
 			struct cpufreq_governor *old_gov) { }
 #endif
 
-extern void arch_freq_prepare_all(void);
 extern unsigned int arch_freq_get_on_cpu(int cpu);
 
 #ifndef arch_set_freq_scale

^ permalink raw reply related	[flat|nested] 51+ messages in thread
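
The tail of the new arch_freq_get_on_cpu() above is plain ratio
arithmetic: the effective frequency is the base frequency scaled by the
APERF/MPERF delta ratio. A minimal sketch of the same computation; the
cpu_khz, acnt and mcnt values below are made up for illustration:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t cpu_khz = 2400000;	/* assumed 2.4 GHz base clock */
	uint64_t acnt = 1800000;	/* APERF delta since last tick */
	uint64_t mcnt = 1200000;	/* MPERF delta since last tick */

	/* khz = cpu_khz * acnt / mcnt, as in the function above. */
	printf("%" PRIu64 " kHz\n", cpu_khz * acnt / mcnt);	/* 3600000 */
	return 0;
}

With the CPU running 1.5x faster than base while busy, this reports
3.6 GHz.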

* [tip: x86/cleanups] x86/aperfmperf: Store aperf/mperf data for cpu frequency reads
  2022-04-15 19:20 ` [patch 08/10] x86/aperfmperf: Store aperf/mperf data for cpu frequency reads Thomas Gleixner
  2022-04-19 16:30   ` Rafael J. Wysocki
@ 2022-04-27 13:56   ` tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 13:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     3b1fc17f635164d74934f67d3bb46cdf877fb67f
Gitweb:        https://git.kernel.org/tip/3b1fc17f635164d74934f67d3bb46cdf877fb67f
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:20:01 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 15:51:08 +02:00

x86/aperfmperf: Store aperf/mperf data for cpu frequency reads

Now that the MSR readout is unconditional, store the results in the per CPU
data structure along with a jiffies timestamp for the CPU frequency readout
code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.817702355@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index df528a4..963c069 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -24,11 +24,17 @@
 #include "cpu.h"
 
 struct aperfmperf {
+	seqcount_t	seq;
+	unsigned long	last_update;
+	u64		acnt;
+	u64		mcnt;
 	u64		aperf;
 	u64		mperf;
 };
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples) = {
+	.seq = SEQCNT_ZERO(cpu_samples.seq)
+};
 
 struct aperfmperf_sample {
 	unsigned int	khz;
@@ -515,6 +521,12 @@ void arch_scale_freq_tick(void)
 	s->aperf = aperf;
 	s->mperf = mperf;
 
+	raw_write_seqcount_begin(&s->seq);
+	s->last_update = jiffies;
+	s->acnt = acnt;
+	s->mcnt = mcnt;
+	raw_write_seqcount_end(&s->seq);
+
 	scale_freq_tick(acnt, mcnt);
 }
 

^ permalink raw reply related	[flat|nested] 51+ messages in thread
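
The raw_write_seqcount_begin()/end() pair above publishes the sampled
fields locklessly: the writer makes the sequence count odd while the
update is in flight, and a reader retries whenever it saw an odd or
changed count. A userspace analogue with C11 atomics, assuming a single
writer (the field names mirror the patch, everything else is
illustrative):

#include <stdatomic.h>
#include <stdint.h>

struct sample {
	atomic_uint	 seq;
	_Atomic uint64_t acnt;
	_Atomic uint64_t mcnt;
};

/* Writer side, e.g. the tick: odd seq marks an update in flight. */
static void sample_publish(struct sample *s, uint64_t a, uint64_t m)
{
	unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->acnt, a, memory_order_relaxed);
	atomic_store_explicit(&s->mcnt, m, memory_order_relaxed);
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader side: loop until a consistent, even-sequence snapshot. */
static void sample_read(struct sample *s, uint64_t *a, uint64_t *m)
{
	unsigned int seq1, seq2;

	do {
		seq1 = atomic_load_explicit(&s->seq, memory_order_acquire);
		*a = atomic_load_explicit(&s->acnt, memory_order_relaxed);
		*m = atomic_load_explicit(&s->mcnt, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		seq2 = atomic_load_explicit(&s->seq, memory_order_relaxed);
	} while ((seq1 & 1) || seq1 != seq2);
}

int main(void)
{
	struct sample s = { .seq = 0 };
	uint64_t a, m;

	sample_publish(&s, 100, 80);
	sample_read(&s, &a, &m);
	return !(a == 100 && m == 80);
}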

* [tip: x86/cleanups] x86/aperfmperf: Make parts of the frequency invariance code unconditional
  2022-04-15 19:19 ` [patch 07/10] x86/aperfmperf: Make parts of the frequency invariance code unconditional Thomas Gleixner
  2022-04-19 16:27   ` Rafael J. Wysocki
@ 2022-04-27 13:56   ` tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 13:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     93989bbee7b21fa779ab22343b0463d395d020fd
Gitweb:        https://git.kernel.org/tip/93989bbee7b21fa779ab22343b0463d395d020fd
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:19:59 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 15:51:08 +02:00

x86/aperfmperf: Make parts of the frequency invariance code unconditional

The frequency invariance support is currently limited to x86/64 and SMP,
which is the vast majority of machines.

arch_scale_freq_tick() is called every tick on all CPUs and reads the APERF
and MPERF MSRs. The CPU frequency getter functions do the same via dedicated
IPIs.

While it could be argued that on systems where frequency invariance support
is disabled (32bit, !SMP) the per tick read of the APERF and MPERF MSRs can
be avoided, it does not make sense to keep the extra code and the resulting
runtime issues of mass IPIs around.

As a first step, split out the initialization code that is not specific to
frequency invariance and the MSR read portion of arch_scale_freq_tick(). The
rest of the code is still conditional and guarded with a static key.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.761988704@linutronix.de

---
 arch/x86/include/asm/cpu.h       |  2 +-
 arch/x86/include/asm/topology.h  |  4 +--
 arch/x86/kernel/cpu/aperfmperf.c | 63 ++++++++++++++++++-------------
 arch/x86/kernel/smpboot.c        |  3 +-
 4 files changed, 41 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index 86e5e4e..e89772d 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -36,6 +36,8 @@ extern int _debug_hotplug_cpu(int cpu, int action);
 #endif
 #endif
 
+extern void ap_init_aperfmperf(void);
+
 int mwait_usable(const struct cpuinfo_x86 *);
 
 unsigned int x86_family(unsigned int sig);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index cc31707..1b2553d 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -217,13 +217,9 @@ extern void arch_scale_freq_tick(void);
 
 extern void arch_set_max_freq_ratio(bool turbo_disabled);
 extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
-extern void bp_init_freq_invariance(void);
-extern void ap_init_freq_invariance(void);
 #else
 static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
 static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
-static inline void bp_init_freq_invariance(void) { }
-static inline void ap_init_freq_invariance(void) { }
 #endif
 
 #ifdef CONFIG_ACPI_CPPC_LIB
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 6220503..df528a4 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -17,6 +17,7 @@
 #include <linux/smp.h>
 #include <linux/syscore_ops.h>
 
+#include <asm/cpu.h>
 #include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 
@@ -164,6 +165,17 @@ unsigned int arch_freq_get_on_cpu(int cpu)
 	return per_cpu(samples.khz, cpu);
 }
 
+static void init_counter_refs(void)
+{
+	u64 aperf, mperf;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	this_cpu_write(cpu_samples.aperf, aperf);
+	this_cpu_write(cpu_samples.mperf, mperf);
+}
+
 #if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
 /*
  * APERF/MPERF frequency ratio computation.
@@ -405,17 +417,6 @@ out:
 	return true;
 }
 
-static void init_counter_refs(void)
-{
-	u64 aperf, mperf;
-
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	this_cpu_write(cpu_samples.aperf, aperf);
-	this_cpu_write(cpu_samples.mperf, mperf);
-}
-
 #ifdef CONFIG_PM_SLEEP
 static struct syscore_ops freq_invariance_syscore_ops = {
 	.resume = init_counter_refs,
@@ -447,13 +448,8 @@ void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled)
 	freq_invariance_enable();
 }
 
-void __init bp_init_freq_invariance(void)
+static void __init bp_init_freq_invariance(void)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
-		return;
-
-	init_counter_refs();
-
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
 		return;
 
@@ -461,12 +457,6 @@ void __init bp_init_freq_invariance(void)
 		freq_invariance_enable();
 }
 
-void ap_init_freq_invariance(void)
-{
-	if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
-		init_counter_refs();
-}
-
 static void disable_freq_invariance_workfn(struct work_struct *work)
 {
 	static_branch_disable(&arch_scale_freq_key);
@@ -481,6 +471,9 @@ static void scale_freq_tick(u64 acnt, u64 mcnt)
 {
 	u64 freq_scale;
 
+	if (!arch_scale_freq_invariant())
+		return;
+
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
 		goto error;
 
@@ -501,13 +494,17 @@ error:
 	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
 	schedule_work(&disable_freq_invariance_work);
 }
+#else
+static inline void bp_init_freq_invariance(void) { }
+static inline void scale_freq_tick(u64 acnt, u64 mcnt) { }
+#endif /* CONFIG_X86_64 && CONFIG_SMP */
 
 void arch_scale_freq_tick(void)
 {
 	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
 	u64 acnt, mcnt, aperf, mperf;
 
-	if (!arch_scale_freq_invariant())
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
 		return;
 
 	rdmsrl(MSR_IA32_APERF, aperf);
@@ -520,4 +517,20 @@ void arch_scale_freq_tick(void)
 
 	scale_freq_tick(acnt, mcnt);
 }
-#endif /* CONFIG_X86_64 && CONFIG_SMP */
+
+static int __init bp_init_aperfmperf(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		return 0;
+
+	init_counter_refs();
+	bp_init_freq_invariance();
+	return 0;
+}
+early_initcall(bp_init_aperfmperf);
+
+void ap_init_aperfmperf(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		init_counter_refs();
+}
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index b1ba7dd..eb7de77 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -186,7 +186,7 @@ static void smp_callin(void)
 	 */
 	set_cpu_sibling_map(raw_smp_processor_id());
 
-	ap_init_freq_invariance();
+	ap_init_aperfmperf();
 
 	/*
 	 * Get our bogomips.
@@ -1396,7 +1396,6 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 {
 	smp_prepare_cpus_common();
 
-	bp_init_freq_invariance();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {

^ permalink raw reply related	[flat|nested] 51+ messages in thread
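
For reference, the fixed point math in scale_freq_tick() boils down to
freq_scale = (acnt << 2 * SCHED_CAPACITY_SHIFT) / (mcnt * max_freq_ratio),
clamped to SCHED_CAPACITY_SCALE. A standalone sketch with made-up deltas
and ratio (SCHED_CAPACITY_SHIFT is 10; the kernel additionally checks the
shift and the multiply for overflow):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

int main(void)
{
	uint64_t acnt = 1000000;	/* made-up APERF delta: busy at base freq */
	uint64_t mcnt = 1000000;	/* made-up MPERF delta */
	uint64_t max_freq_ratio = 1280;	/* 1.25x turbo, in 1/1024 units */
	uint64_t freq_scale;

	acnt <<= 2 * SCHED_CAPACITY_SHIFT;
	mcnt *= max_freq_ratio;

	freq_scale = acnt / mcnt;
	if (freq_scale > SCHED_CAPACITY_SCALE)
		freq_scale = SCHED_CAPACITY_SCALE;

	/* Prints 819/1024: a CPU at base frequency has ~80% of the
	 * capacity it would have at full turbo.
	 */
	printf("freq_scale = %" PRIu64 "/%lu\n", freq_scale, SCHED_CAPACITY_SCALE);
	return 0;
}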

* [tip: x86/cleanups] x86/aperfmperf: Restructure arch_scale_freq_tick()
  2022-04-15 19:19 ` [patch 06/10] x86/aperfmperf: Restructure arch_scale_freq_tick() Thomas Gleixner
  2022-04-19 16:20   ` Rafael J. Wysocki
@ 2022-04-27 13:56   ` tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 13:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     76225ca3666c5b05afb8dcc18a73f002b097af24
Gitweb:        https://git.kernel.org/tip/76225ca3666c5b05afb8dcc18a73f002b097af24
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:19:57 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 15:51:08 +02:00

x86/aperfmperf: Restructure arch_scale_freq_tick()

Preparation for sharing code with the CPU frequency portion of the
aperf/mperf code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.706185092@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 36 ++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 6922c77..6220503 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -477,22 +477,9 @@ static DECLARE_WORK(disable_freq_invariance_work,
 
 DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
 
-void arch_scale_freq_tick(void)
+static void scale_freq_tick(u64 acnt, u64 mcnt)
 {
-	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
-	u64 aperf, mperf, acnt, mcnt, freq_scale;
-
-	if (!arch_scale_freq_invariant())
-		return;
-
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	acnt = aperf - s->aperf;
-	mcnt = mperf - s->mperf;
-
-	s->aperf = aperf;
-	s->mperf = mperf;
+	u64 freq_scale;
 
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
 		goto error;
@@ -514,4 +501,23 @@ error:
 	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
 	schedule_work(&disable_freq_invariance_work);
 }
+
+void arch_scale_freq_tick(void)
+{
+	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
+	u64 acnt, mcnt, aperf, mperf;
+
+	if (!arch_scale_freq_invariant())
+		return;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+	acnt = aperf - s->aperf;
+	mcnt = mperf - s->mperf;
+
+	s->aperf = aperf;
+	s->mperf = mperf;
+
+	scale_freq_tick(acnt, mcnt);
+}
 #endif /* CONFIG_X86_64 && CONFIG_SMP */

^ permalink raw reply related	[flat|nested] 51+ messages in thread
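
One detail worth spelling out in the split-out delta computation: the
unsigned subtraction acnt = aperf - s->aperf stays correct even if the
64bit counter wraps, since the arithmetic is modulo 2^64. A trivial
check with made-up values:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t prev = UINT64_MAX - 5;	/* sample taken just before a wrap */
	uint64_t curr = 10;		/* sample taken just after the wrap */

	printf("delta = %" PRIu64 "\n", curr - prev);	/* prints 16 */
	return 0;
}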

* [tip: x86/cleanups] x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct
  2022-04-15 19:19 ` [patch 05/10] x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct Thomas Gleixner
  2022-04-19 16:15   ` Rafael J. Wysocki
@ 2022-04-27 13:56   ` tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 13:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     d512d9a127285226c370ec7e92efb9ef61a6515b
Gitweb:        https://git.kernel.org/tip/d512d9a127285226c370ec7e92efb9ef61a6515b
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:19:56 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 15:51:08 +02:00

x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct

Preparation for sharing code with the CPU frequency portion of the
aperf/mperf code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.648485667@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index b4f4ea5..6922c77 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -22,6 +22,13 @@
 
 #include "cpu.h"
 
+struct aperfmperf {
+	u64		aperf;
+	u64		mperf;
+};
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples);
+
 struct aperfmperf_sample {
 	unsigned int	khz;
 	atomic_t	scfpending;
@@ -194,8 +201,6 @@ unsigned int arch_freq_get_on_cpu(int cpu)
 
 DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
 
-static DEFINE_PER_CPU(u64, arch_prev_aperf);
-static DEFINE_PER_CPU(u64, arch_prev_mperf);
 static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
 static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
 
@@ -407,8 +412,8 @@ static void init_counter_refs(void)
 	rdmsrl(MSR_IA32_APERF, aperf);
 	rdmsrl(MSR_IA32_MPERF, mperf);
 
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
+	this_cpu_write(cpu_samples.aperf, aperf);
+	this_cpu_write(cpu_samples.mperf, mperf);
 }
 
 #ifdef CONFIG_PM_SLEEP
@@ -474,9 +479,8 @@ DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
 
 void arch_scale_freq_tick(void)
 {
-	u64 freq_scale;
-	u64 aperf, mperf;
-	u64 acnt, mcnt;
+	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
+	u64 aperf, mperf, acnt, mcnt, freq_scale;
 
 	if (!arch_scale_freq_invariant())
 		return;
@@ -484,11 +488,11 @@ void arch_scale_freq_tick(void)
 	rdmsrl(MSR_IA32_APERF, aperf);
 	rdmsrl(MSR_IA32_MPERF, mperf);
 
-	acnt = aperf - this_cpu_read(arch_prev_aperf);
-	mcnt = mperf - this_cpu_read(arch_prev_mperf);
+	acnt = aperf - s->aperf;
+	mcnt = mperf - s->mperf;
 
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
+	s->aperf = aperf;
+	s->mperf = mperf;
 
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
 		goto error;

^ permalink raw reply related	[flat|nested] 51+ messages in thread
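
Packing both counters into one cache-line aligned per CPU structure lets
the tick path touch a single line instead of two separate per CPU
variables. A plain C approximation of what DEFINE_PER_CPU_SHARED_ALIGNED
provides; the array stands in for the kernel's per CPU machinery, and
NR_CPUS plus the line size are illustrative:

#include <stdint.h>
#include <stdio.h>

#define NR_CPUS		64
#define CACHELINE	64	/* typical x86 cache line size */

struct aperfmperf {
	_Alignas(CACHELINE) uint64_t aperf;
	uint64_t mperf;
};

/* One full cache line per CPU slot: no false sharing between CPUs. */
static struct aperfmperf cpu_samples[NR_CPUS];

int main(void)
{
	printf("slot stride = %zu bytes\n", sizeof(cpu_samples) / NR_CPUS);
	return 0;
}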

* [tip: x86/cleanups] x86/aperfmperf: Untangle Intel and AMD frequency invariance init
  2022-04-15 19:19 ` [patch 04/10] x86/aperfmperf: Untangle Intel and AMD " Thomas Gleixner
  2022-04-19 16:12   ` Rafael J. Wysocki
@ 2022-04-27 13:56   ` tip-bot2 for Thomas Gleixner
  2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 13:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     39a80f50743fd38d4a96603ca3ddbe4fc8723c01
Gitweb:        https://git.kernel.org/tip/39a80f50743fd38d4a96603ca3ddbe4fc8723c01
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:19:54 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 15:51:08 +02:00

x86/aperfmperf: Untangle Intel and AMD frequency invariance init

AMD boot CPU initialization happens late via ACPI/CPPC which prevents the
Intel parts from being marked __init.

Split out the common code and provide a dedicated interface for the AMD
initialization and mark the Intel specific code and data __init.

The remaining text size is almost cut in half:

  text:		2614	->	1350
  init.text:	   0	->	 786

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.592465719@linutronix.de

---
 arch/x86/include/asm/topology.h  | 13 ++-----
 arch/x86/kernel/acpi/cppc.c      | 29 ++++++---------
 arch/x86/kernel/cpu/aperfmperf.c | 62 ++++++++++++++++---------------
 arch/x86/kernel/smpboot.c        |  2 +-
 4 files changed, 49 insertions(+), 57 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index e2faedc..cc31707 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -216,24 +216,19 @@ extern void arch_scale_freq_tick(void);
 #define arch_scale_freq_tick arch_scale_freq_tick
 
 extern void arch_set_max_freq_ratio(bool turbo_disabled);
-extern void bp_init_freq_invariance(bool cppc_ready);
+extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
+extern void bp_init_freq_invariance(void);
 extern void ap_init_freq_invariance(void);
 #else
 static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
-static inline void bp_init_freq_invariance(bool cppc_ready) { }
+static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
+static inline void bp_init_freq_invariance(void) { }
 static inline void ap_init_freq_invariance(void) { }
 #endif
 
 #ifdef CONFIG_ACPI_CPPC_LIB
 void init_freq_invariance_cppc(void);
 #define arch_init_invariance_cppc init_freq_invariance_cppc
-
-bool amd_set_max_freq_ratio(u64 *ratio);
-#else
-static inline bool amd_set_max_freq_ratio(u64 *ratio)
-{
-	return false;
-}
 #endif
 
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/acpi/cppc.c b/arch/x86/kernel/acpi/cppc.c
index 06109d9..79a0e92 100644
--- a/arch/x86/kernel/acpi/cppc.c
+++ b/arch/x86/kernel/acpi/cppc.c
@@ -50,20 +50,17 @@ int cpc_write_ffh(int cpunum, struct cpc_reg *reg, u64 val)
 	return err;
 }
 
-bool amd_set_max_freq_ratio(u64 *ratio)
+static void amd_set_max_freq_ratio(void)
 {
 	struct cppc_perf_caps perf_caps;
 	u64 highest_perf, nominal_perf;
 	u64 perf_ratio;
 	int rc;
 
-	if (!ratio)
-		return false;
-
 	rc = cppc_get_perf_caps(0, &perf_caps);
 	if (rc) {
 		pr_debug("Could not retrieve perf counters (%d)\n", rc);
-		return false;
+		return;
 	}
 
 	highest_perf = amd_get_highest_perf();
@@ -71,7 +68,7 @@ bool amd_set_max_freq_ratio(u64 *ratio)
 
 	if (!highest_perf || !nominal_perf) {
 		pr_debug("Could not retrieve highest or nominal performance\n");
-		return false;
+		return;
 	}
 
 	perf_ratio = div_u64(highest_perf * SCHED_CAPACITY_SCALE, nominal_perf);
@@ -79,26 +76,24 @@ bool amd_set_max_freq_ratio(u64 *ratio)
 	perf_ratio = (perf_ratio + SCHED_CAPACITY_SCALE) >> 1;
 	if (!perf_ratio) {
 		pr_debug("Non-zero highest/nominal perf values led to a 0 ratio\n");
-		return false;
+		return;
 	}
 
-	*ratio = perf_ratio;
-	arch_set_max_freq_ratio(false);
-
-	return true;
+	freq_invariance_set_perf_ratio(perf_ratio, false);
 }
 
 static DEFINE_MUTEX(freq_invariance_lock);
 
 void init_freq_invariance_cppc(void)
 {
-	static bool secondary;
+	static bool init_done;
 
-	mutex_lock(&freq_invariance_lock);
-
-	if (!secondary)
-		bp_init_freq_invariance(true);
-	secondary = true;
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		return;
 
+	mutex_lock(&freq_invariance_lock);
+	if (!init_done)
+		amd_set_max_freq_ratio();
+	init_done = true;
 	mutex_unlock(&freq_invariance_lock);
 }
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 87f34f2..b4f4ea5 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -206,7 +206,7 @@ void arch_set_max_freq_ratio(bool turbo_disabled)
 }
 EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
 
-static bool turbo_disabled(void)
+static bool __init turbo_disabled(void)
 {
 	u64 misc_en;
 	int err;
@@ -218,7 +218,7 @@ static bool turbo_disabled(void)
 	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
 }
 
-static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+static bool __init slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
 {
 	int err;
 
@@ -240,26 +240,26 @@ static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
 	X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6,		\
 		INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
 
-static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] __initconst = {
 	X86_MATCH(XEON_PHI_KNL),
 	X86_MATCH(XEON_PHI_KNM),
 	{}
 };
 
-static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
+static const struct x86_cpu_id has_skx_turbo_ratio_limits[] __initconst = {
 	X86_MATCH(SKYLAKE_X),
 	{}
 };
 
-static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
+static const struct x86_cpu_id has_glm_turbo_ratio_limits[] __initconst = {
 	X86_MATCH(ATOM_GOLDMONT),
 	X86_MATCH(ATOM_GOLDMONT_D),
 	X86_MATCH(ATOM_GOLDMONT_PLUS),
 	{}
 };
 
-static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
-				int num_delta_fratio)
+static bool __init knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
+					  int num_delta_fratio)
 {
 	int fratio, delta_fratio, found;
 	int err, i;
@@ -297,7 +297,7 @@ static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
 	return true;
 }
 
-static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
+static bool __init skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
 {
 	u64 ratios, counts;
 	u32 group_size;
@@ -328,7 +328,7 @@ static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
 	return false;
 }
 
-static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+static bool __init core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
 {
 	u64 msr;
 	int err;
@@ -351,7 +351,7 @@ static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
 	return true;
 }
 
-static bool intel_set_max_freq_ratio(void)
+static bool __init intel_set_max_freq_ratio(void)
 {
 	u64 base_freq, turbo_freq;
 	u64 turbo_ratio;
@@ -418,40 +418,42 @@ static struct syscore_ops freq_invariance_syscore_ops = {
 
 static void register_freq_invariance_syscore_ops(void)
 {
-	/* Bail out if registered already. */
-	if (freq_invariance_syscore_ops.node.prev)
-		return;
-
 	register_syscore_ops(&freq_invariance_syscore_ops);
 }
 #else
 static inline void register_freq_invariance_syscore_ops(void) {}
 #endif
 
-void bp_init_freq_invariance(bool cppc_ready)
+static void freq_invariance_enable(void)
+{
+	if (static_branch_unlikely(&arch_scale_freq_key)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+	static_branch_enable(&arch_scale_freq_key);
+	register_freq_invariance_syscore_ops();
+	pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
+}
+
+void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled)
 {
-	bool ret;
+	arch_turbo_freq_ratio = ratio;
+	arch_set_max_freq_ratio(turbo_disabled);
+	freq_invariance_enable();
+}
 
+void __init bp_init_freq_invariance(void)
+{
 	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
 		return;
 
 	init_counter_refs();
 
-	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
-		ret = intel_set_max_freq_ratio();
-	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
-		if (!cppc_ready)
-			return;
-		ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
-	}
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+		return;
 
-	if (ret) {
-		static_branch_enable(&arch_scale_freq_key);
-		register_freq_invariance_syscore_ops();
-		pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
-	} else {
-		pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
-	}
+	if (intel_set_max_freq_ratio())
+		freq_invariance_enable();
 }
 
 void ap_init_freq_invariance(void)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 023feb4..b1ba7dd 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1396,7 +1396,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 {
 	smp_prepare_cpus_common();
 
-	bp_init_freq_invariance(false);
+	bp_init_freq_invariance();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {

^ permalink raw reply related	[flat|nested] 51+ messages in thread
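
The text/init.text numbers in the changelog are the point of the __init
and __initconst annotations: the marked code and data land in sections
which the kernel frees once boot is complete. A freestanding GCC/Clang
sketch of the mechanism (the macros below mirror the spirit of the
kernel's definitions, not their exact expansion):

#define __init      __attribute__((__section__(".init.text")))
#define __initconst __attribute__((__section__(".init.rodata")))

static const int boot_ratios[] __initconst = { 8, 16, 24 };

static int __init compute_boot_ratio(void)
{
	return boot_ratios[0];
}

int main(void)
{
	/* Only callable "during boot" in this sketch; the kernel
	 * discards the .init.* sections after the initcalls have run.
	 */
	return compute_boot_ratio();
}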

* [tip: x86/cleanups] x86/aperfmperf: Separate AP/BP frequency invariance init
  2022-04-15 19:19 ` [patch 03/10] x86/aperfmperf: Separate AP/BP frequency invariance init Thomas Gleixner
  2022-04-19 16:04   ` Rafael J. Wysocki
@ 2022-04-27 13:56   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 13:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     138a7f9c6beae8d652113b8e7a44994b4200bbcd
Gitweb:        https://git.kernel.org/tip/138a7f9c6beae8d652113b8e7a44994b4200bbcd
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:19:53 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 15:51:08 +02:00

x86/aperfmperf: Separate AP/BP frequency invariance init

This code is convoluted, and because it can be invoked post init via the
ACPI/CPPC code, all of the initialization functionality is built in instead
of being part of init text and init data.

As a first step create separate calls for the boot and the application
processors.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.536733494@linutronix.de

---
 arch/x86/include/asm/topology.h  | 12 +++++-------
 arch/x86/kernel/acpi/cppc.c      |  3 ++-
 arch/x86/kernel/cpu/aperfmperf.c | 23 +++++++++++------------
 arch/x86/kernel/smpboot.c        |  4 ++--
 4 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9619385..e2faedc 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -216,14 +216,12 @@ extern void arch_scale_freq_tick(void);
 #define arch_scale_freq_tick arch_scale_freq_tick
 
 extern void arch_set_max_freq_ratio(bool turbo_disabled);
-void init_freq_invariance(bool secondary, bool cppc_ready);
+extern void bp_init_freq_invariance(bool cppc_ready);
+extern void ap_init_freq_invariance(void);
 #else
-static inline void arch_set_max_freq_ratio(bool turbo_disabled)
-{
-}
-static inline void init_freq_invariance(bool secondary, bool cppc_ready)
-{
-}
+static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
+static inline void bp_init_freq_invariance(bool cppc_ready) { }
+static inline void ap_init_freq_invariance(void) { }
 #endif
 
 #ifdef CONFIG_ACPI_CPPC_LIB
diff --git a/arch/x86/kernel/acpi/cppc.c b/arch/x86/kernel/acpi/cppc.c
index df1644d..06109d9 100644
--- a/arch/x86/kernel/acpi/cppc.c
+++ b/arch/x86/kernel/acpi/cppc.c
@@ -96,7 +96,8 @@ void init_freq_invariance_cppc(void)
 
 	mutex_lock(&freq_invariance_lock);
 
-	init_freq_invariance(secondary, true);
+	if (!secondary)
+		bp_init_freq_invariance(true);
 	secondary = true;
 
 	mutex_unlock(&freq_invariance_lock);
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 35fff01..87f34f2 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -428,31 +428,24 @@ static void register_freq_invariance_syscore_ops(void)
 static inline void register_freq_invariance_syscore_ops(void) {}
 #endif
 
-void init_freq_invariance(bool secondary, bool cppc_ready)
+void bp_init_freq_invariance(bool cppc_ready)
 {
-	bool ret = false;
+	bool ret;
 
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
 		return;
 
-	if (secondary) {
-		if (static_branch_likely(&arch_scale_freq_key)) {
-			init_counter_refs();
-		}
-		return;
-	}
+	init_counter_refs();
 
 	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
 		ret = intel_set_max_freq_ratio();
 	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
-		if (!cppc_ready) {
+		if (!cppc_ready)
 			return;
-		}
 		ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
 	}
 
 	if (ret) {
-		init_counter_refs();
 		static_branch_enable(&arch_scale_freq_key);
 		register_freq_invariance_syscore_ops();
 		pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
@@ -461,6 +454,12 @@ void init_freq_invariance(bool secondary, bool cppc_ready)
 	}
 }
 
+void ap_init_freq_invariance(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		init_counter_refs();
+}
+
 static void disable_freq_invariance_workfn(struct work_struct *work)
 {
 	static_branch_disable(&arch_scale_freq_key);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index a9fc16a..023feb4 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -186,7 +186,7 @@ static void smp_callin(void)
 	 */
 	set_cpu_sibling_map(raw_smp_processor_id());
 
-	init_freq_invariance(true, false);
+	ap_init_freq_invariance();
 
 	/*
 	 * Get our bogomips.
@@ -1396,7 +1396,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 {
 	smp_prepare_cpus_common();
 
-	init_freq_invariance(false, false);
+	bp_init_freq_invariance(false);
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {

^ permalink raw reply related	[flat|nested] 51+ messages in thread
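
init_freq_invariance_cppc() above may be invoked once per CPU as CPPC
comes up; the mutex plus the static flag collapse that into a single
boot-time initialization. A userspace analogue of the same once-only
pattern (the names mirror the patch, the init body is a placeholder):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t freq_invariance_lock = PTHREAD_MUTEX_INITIALIZER;

static void bp_init_freq_invariance(bool cppc_ready)
{
	printf("one-time init, cppc_ready=%d\n", cppc_ready);
}

/* Safe to call concurrently and repeatedly; initializes exactly once. */
static void init_freq_invariance_cppc(void)
{
	static bool secondary;

	pthread_mutex_lock(&freq_invariance_lock);
	if (!secondary)
		bp_init_freq_invariance(true);
	secondary = true;
	pthread_mutex_unlock(&freq_invariance_lock);
}

int main(void)
{
	init_freq_invariance_cppc();
	init_freq_invariance_cppc();	/* no-op on the second call */
	return 0;
}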

* [tip: x86/cleanups] x86/smp: Move APERF/MPERF code where it belongs
  2022-04-15 19:19 ` [patch 02/10] x86/smp: Move APERF/MPERF code where it belongs Thomas Gleixner
  2022-04-19 15:40   ` Rafael J. Wysocki
@ 2022-04-27 13:56   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 13:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     55cb0b70749361d7f82a979768c77ac301f07da9
Gitweb:        https://git.kernel.org/tip/55cb0b70749361d7f82a979768c77ac301f07da9
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:19:51 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 15:51:08 +02:00

x86/smp: Move APERF/MPERF code where it belongs

Move it to arch/x86/kernel/cpu/aperfmperf.c, as it can share code with the
preexisting APERF/MPERF code there.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.478362457@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 366 +++++++++++++++++++++++++++++-
 arch/x86/kernel/smpboot.c        | 355 +-----------------------------
 2 files changed, 362 insertions(+), 359 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index ea9160f..35fff01 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -6,15 +6,19 @@
  * Copyright (C) 2017 Intel Corp.
  * Author: Len Brown <len.brown@intel.com>
  */
-
+#include <linux/cpufreq.h>
 #include <linux/delay.h>
 #include <linux/ktime.h>
 #include <linux/math64.h>
 #include <linux/percpu.h>
-#include <linux/cpufreq.h>
-#include <linux/smp.h>
-#include <linux/sched/isolation.h>
 #include <linux/rcupdate.h>
+#include <linux/sched/isolation.h>
+#include <linux/sched/topology.h>
+#include <linux/smp.h>
+#include <linux/syscore_ops.h>
+
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
 
 #include "cpu.h"
 
@@ -152,3 +156,357 @@ unsigned int arch_freq_get_on_cpu(int cpu)
 
 	return per_cpu(samples.khz, cpu);
 }
+
+#if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency, corresponding to
+ * freq_curr / freq_max.
+ *
+ * Since the frequency freq_curr on x86 is controlled by micro-controller and
+ * our P-state setting is little more than a request/hint, we need to observe
+ * the effective frequency 'BusyMHz', i.e. the average frequency over a time
+ * interval after discarding idle time. This is given by:
+ *
+ *   BusyMHz = delta_APERF / delta_MPERF * freq_base
+ *
+ * where freq_base is the max non-turbo P-state.
+ *
+ * The freq_max term has to be set to a somewhat arbitrary value, because we
+ * can't know which turbo states will be available at a given point in time:
+ * it all depends on the thermal headroom of the entire package. We set it to
+ * the turbo level with 4 cores active.
+ *
+ * Benchmarks show that's a good compromise between the 1C turbo ratio
+ * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
+ * which would ignore the entire turbo range (a conspicuous part, making
+ * freq_curr/freq_max always maxed out).
+ *
+ * An exception to the heuristic above is the Atom uarch, where we choose the
+ * highest turbo level for freq_max since Atom's are generally oriented towards
+ * power efficiency.
+ *
+ * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
+ * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
+ */
+
+DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
+static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
+
+void arch_set_max_freq_ratio(bool turbo_disabled)
+{
+	arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
+					arch_turbo_freq_ratio;
+}
+EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
+
+static bool turbo_disabled(void)
+{
+	u64 misc_en;
+	int err;
+
+	err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+	if (err)
+		return false;
+
+	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
+}
+
+static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+{
+	int err;
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 16) & 0x3F;     /* max P state */
+	*turbo_freq = *turbo_freq & 0x3F;           /* 1C turbo    */
+
+	return true;
+}
+
+#define X86_MATCH(model)					\
+	X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6,		\
+		INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
+
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+	X86_MATCH(XEON_PHI_KNL),
+	X86_MATCH(XEON_PHI_KNM),
+	{}
+};
+
+static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
+	X86_MATCH(SKYLAKE_X),
+	{}
+};
+
+static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
+	X86_MATCH(ATOM_GOLDMONT),
+	X86_MATCH(ATOM_GOLDMONT_D),
+	X86_MATCH(ATOM_GOLDMONT_PLUS),
+	{}
+};
+
+static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
+				int num_delta_fratio)
+{
+	int fratio, delta_fratio, found;
+	int err, i;
+	u64 msr;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 8) & 0xFF;	    /* max P state */
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+	if (err)
+		return false;
+
+	fratio = (msr >> 8) & 0xFF;
+	i = 16;
+	found = 0;
+	do {
+		if (found >= num_delta_fratio) {
+			*turbo_freq = fratio;
+			return true;
+		}
+
+		delta_fratio = (msr >> (i + 5)) & 0x7;
+
+		if (delta_fratio) {
+			found += 1;
+			fratio -= delta_fratio;
+		}
+
+		i += 8;
+	} while (i < 64);
+
+	return true;
+}
+
+static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
+{
+	u64 ratios, counts;
+	u32 group_size;
+	int err, i;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 8) & 0xFF;      /* max P state */
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
+	if (err)
+		return false;
+
+	for (i = 0; i < 64; i += 8) {
+		group_size = (counts >> i) & 0xFF;
+		if (group_size >= size) {
+			*turbo_freq = (ratios >> i) & 0xFF;
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+{
+	u64 msr;
+	int err;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 8) & 0xFF;    /* max P state */
+	*turbo_freq = (msr >> 24) & 0xFF;         /* 4C turbo    */
+
+	/* The CPU may have less than 4 cores */
+	if (!*turbo_freq)
+		*turbo_freq = msr & 0xFF;         /* 1C turbo    */
+
+	return true;
+}
+
+static bool intel_set_max_freq_ratio(void)
+{
+	u64 base_freq, turbo_freq;
+	u64 turbo_ratio;
+
+	if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
+		goto out;
+
+	if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
+	    skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+		goto out;
+
+	if (x86_match_cpu(has_knl_turbo_ratio_limits) &&
+	    knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+		goto out;
+
+	if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
+	    skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
+		goto out;
+
+	if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
+		goto out;
+
+	return false;
+
+out:
+	/*
+	 * Some hypervisors advertise X86_FEATURE_APERFMPERF
+	 * but then fill all MSR's with zeroes.
+	 * Some CPUs have turbo boost but don't declare any turbo ratio
+	 * in MSR_TURBO_RATIO_LIMIT.
+	 */
+	if (!base_freq || !turbo_freq) {
+		pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
+		return false;
+	}
+
+	turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
+	if (!turbo_ratio) {
+		pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
+		return false;
+	}
+
+	arch_turbo_freq_ratio = turbo_ratio;
+	arch_set_max_freq_ratio(turbo_disabled());
+
+	return true;
+}
+
+static void init_counter_refs(void)
+{
+	u64 aperf, mperf;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+}
+
+#ifdef CONFIG_PM_SLEEP
+static struct syscore_ops freq_invariance_syscore_ops = {
+	.resume = init_counter_refs,
+};
+
+static void register_freq_invariance_syscore_ops(void)
+{
+	/* Bail out if registered already. */
+	if (freq_invariance_syscore_ops.node.prev)
+		return;
+
+	register_syscore_ops(&freq_invariance_syscore_ops);
+}
+#else
+static inline void register_freq_invariance_syscore_ops(void) {}
+#endif
+
+void init_freq_invariance(bool secondary, bool cppc_ready)
+{
+	bool ret = false;
+
+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+		return;
+
+	if (secondary) {
+		if (static_branch_likely(&arch_scale_freq_key)) {
+			init_counter_refs();
+		}
+		return;
+	}
+
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		ret = intel_set_max_freq_ratio();
+	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
+		if (!cppc_ready) {
+			return;
+		}
+		ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
+	}
+
+	if (ret) {
+		init_counter_refs();
+		static_branch_enable(&arch_scale_freq_key);
+		register_freq_invariance_syscore_ops();
+		pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
+	} else {
+		pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
+	}
+}
+
+static void disable_freq_invariance_workfn(struct work_struct *work)
+{
+	static_branch_disable(&arch_scale_freq_key);
+}
+
+static DECLARE_WORK(disable_freq_invariance_work,
+		    disable_freq_invariance_workfn);
+
+DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
+
+void arch_scale_freq_tick(void)
+{
+	u64 freq_scale;
+	u64 aperf, mperf;
+	u64 acnt, mcnt;
+
+	if (!arch_scale_freq_invariant())
+		return;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	acnt = aperf - this_cpu_read(arch_prev_aperf);
+	mcnt = mperf - this_cpu_read(arch_prev_mperf);
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+
+	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
+		goto error;
+
+	if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
+		goto error;
+
+	freq_scale = div64_u64(acnt, mcnt);
+	if (!freq_scale)
+		goto error;
+
+	if (freq_scale > SCHED_CAPACITY_SCALE)
+		freq_scale = SCHED_CAPACITY_SCALE;
+
+	this_cpu_write(arch_freq_scale, freq_scale);
+	return;
+
+error:
+	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
+	schedule_work(&disable_freq_invariance_work);
+}
+#endif /* CONFIG_X86_64 && CONFIG_SMP */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 2ef1477..a9fc16a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -56,7 +56,6 @@
 #include <linux/numa.h>
 #include <linux/pgtable.h>
 #include <linux/overflow.h>
-#include <linux/syscore_ops.h>
 
 #include <asm/acpi.h>
 #include <asm/desc.h>
@@ -1847,357 +1846,3 @@ void native_play_dead(void)
 }
 
 #endif
-
-#ifdef CONFIG_X86_64
-/*
- * APERF/MPERF frequency ratio computation.
- *
- * The scheduler wants to do frequency invariant accounting and needs a <1
- * ratio to account for the 'current' frequency, corresponding to
- * freq_curr / freq_max.
- *
- * Since the frequency freq_curr on x86 is controlled by micro-controller and
- * our P-state setting is little more than a request/hint, we need to observe
- * the effective frequency 'BusyMHz', i.e. the average frequency over a time
- * interval after discarding idle time. This is given by:
- *
- *   BusyMHz = delta_APERF / delta_MPERF * freq_base
- *
- * where freq_base is the max non-turbo P-state.
- *
- * The freq_max term has to be set to a somewhat arbitrary value, because we
- * can't know which turbo states will be available at a given point in time:
- * it all depends on the thermal headroom of the entire package. We set it to
- * the turbo level with 4 cores active.
- *
- * Benchmarks show that's a good compromise between the 1C turbo ratio
- * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
- * which would ignore the entire turbo range (a conspicuous part, making
- * freq_curr/freq_max always maxed out).
- *
- * An exception to the heuristic above is the Atom uarch, where we choose the
- * highest turbo level for freq_max since Atom's are generally oriented towards
- * power efficiency.
- *
- * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
- * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
- */
-
-DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
-
-static DEFINE_PER_CPU(u64, arch_prev_aperf);
-static DEFINE_PER_CPU(u64, arch_prev_mperf);
-static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
-static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
-
-void arch_set_max_freq_ratio(bool turbo_disabled)
-{
-	arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
-					arch_turbo_freq_ratio;
-}
-EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
-
-static bool turbo_disabled(void)
-{
-	u64 misc_en;
-	int err;
-
-	err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
-	if (err)
-		return false;
-
-	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
-}
-
-static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
-{
-	int err;
-
-	err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
-	if (err)
-		return false;
-
-	err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
-	if (err)
-		return false;
-
-	*base_freq = (*base_freq >> 16) & 0x3F;     /* max P state */
-	*turbo_freq = *turbo_freq & 0x3F;           /* 1C turbo    */
-
-	return true;
-}
-
-#define X86_MATCH(model)					\
-	X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6,		\
-		INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
-
-static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
-	X86_MATCH(XEON_PHI_KNL),
-	X86_MATCH(XEON_PHI_KNM),
-	{}
-};
-
-static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
-	X86_MATCH(SKYLAKE_X),
-	{}
-};
-
-static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
-	X86_MATCH(ATOM_GOLDMONT),
-	X86_MATCH(ATOM_GOLDMONT_D),
-	X86_MATCH(ATOM_GOLDMONT_PLUS),
-	{}
-};
-
-static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
-				int num_delta_fratio)
-{
-	int fratio, delta_fratio, found;
-	int err, i;
-	u64 msr;
-
-	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
-	if (err)
-		return false;
-
-	*base_freq = (*base_freq >> 8) & 0xFF;	    /* max P state */
-
-	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
-	if (err)
-		return false;
-
-	fratio = (msr >> 8) & 0xFF;
-	i = 16;
-	found = 0;
-	do {
-		if (found >= num_delta_fratio) {
-			*turbo_freq = fratio;
-			return true;
-		}
-
-		delta_fratio = (msr >> (i + 5)) & 0x7;
-
-		if (delta_fratio) {
-			found += 1;
-			fratio -= delta_fratio;
-		}
-
-		i += 8;
-	} while (i < 64);
-
-	return true;
-}
-
-static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
-{
-	u64 ratios, counts;
-	u32 group_size;
-	int err, i;
-
-	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
-	if (err)
-		return false;
-
-	*base_freq = (*base_freq >> 8) & 0xFF;      /* max P state */
-
-	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
-	if (err)
-		return false;
-
-	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
-	if (err)
-		return false;
-
-	for (i = 0; i < 64; i += 8) {
-		group_size = (counts >> i) & 0xFF;
-		if (group_size >= size) {
-			*turbo_freq = (ratios >> i) & 0xFF;
-			return true;
-		}
-	}
-
-	return false;
-}
-
-static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
-{
-	u64 msr;
-	int err;
-
-	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
-	if (err)
-		return false;
-
-	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
-	if (err)
-		return false;
-
-	*base_freq = (*base_freq >> 8) & 0xFF;    /* max P state */
-	*turbo_freq = (msr >> 24) & 0xFF;         /* 4C turbo    */
-
-	/* The CPU may have less than 4 cores */
-	if (!*turbo_freq)
-		*turbo_freq = msr & 0xFF;         /* 1C turbo    */
-
-	return true;
-}
-
-static bool intel_set_max_freq_ratio(void)
-{
-	u64 base_freq, turbo_freq;
-	u64 turbo_ratio;
-
-	if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
-		goto out;
-
-	if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
-	    skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
-		goto out;
-
-	if (x86_match_cpu(has_knl_turbo_ratio_limits) &&
-	    knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
-		goto out;
-
-	if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
-	    skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
-		goto out;
-
-	if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
-		goto out;
-
-	return false;
-
-out:
-	/*
-	 * Some hypervisors advertise X86_FEATURE_APERFMPERF
-	 * but then fill all MSR's with zeroes.
-	 * Some CPUs have turbo boost but don't declare any turbo ratio
-	 * in MSR_TURBO_RATIO_LIMIT.
-	 */
-	if (!base_freq || !turbo_freq) {
-		pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
-		return false;
-	}
-
-	turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
-	if (!turbo_ratio) {
-		pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
-		return false;
-	}
-
-	arch_turbo_freq_ratio = turbo_ratio;
-	arch_set_max_freq_ratio(turbo_disabled());
-
-	return true;
-}
-
-static void init_counter_refs(void)
-{
-	u64 aperf, mperf;
-
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
-}
-
-#ifdef CONFIG_PM_SLEEP
-static struct syscore_ops freq_invariance_syscore_ops = {
-	.resume = init_counter_refs,
-};
-
-static void register_freq_invariance_syscore_ops(void)
-{
-	/* Bail out if registered already. */
-	if (freq_invariance_syscore_ops.node.prev)
-		return;
-
-	register_syscore_ops(&freq_invariance_syscore_ops);
-}
-#else
-static inline void register_freq_invariance_syscore_ops(void) {}
-#endif
-
-void init_freq_invariance(bool secondary, bool cppc_ready)
-{
-	bool ret = false;
-
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return;
-
-	if (secondary) {
-		if (static_branch_likely(&arch_scale_freq_key)) {
-			init_counter_refs();
-		}
-		return;
-	}
-
-	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
-		ret = intel_set_max_freq_ratio();
-	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
-		if (!cppc_ready) {
-			return;
-		}
-		ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
-	}
-
-	if (ret) {
-		init_counter_refs();
-		static_branch_enable(&arch_scale_freq_key);
-		register_freq_invariance_syscore_ops();
-		pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
-	} else {
-		pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
-	}
-}
-
-static void disable_freq_invariance_workfn(struct work_struct *work)
-{
-	static_branch_disable(&arch_scale_freq_key);
-}
-
-static DECLARE_WORK(disable_freq_invariance_work,
-		    disable_freq_invariance_workfn);
-
-DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
-
-void arch_scale_freq_tick(void)
-{
-	u64 freq_scale;
-	u64 aperf, mperf;
-	u64 acnt, mcnt;
-
-	if (!arch_scale_freq_invariant())
-		return;
-
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	acnt = aperf - this_cpu_read(arch_prev_aperf);
-	mcnt = mperf - this_cpu_read(arch_prev_mperf);
-
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
-
-	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
-		goto error;
-
-	if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
-		goto error;
-
-	freq_scale = div64_u64(acnt, mcnt);
-	if (!freq_scale)
-		goto error;
-
-	if (freq_scale > SCHED_CAPACITY_SCALE)
-		freq_scale = SCHED_CAPACITY_SCALE;
-
-	this_cpu_write(arch_freq_scale, freq_scale);
-	return;
-
-error:
-	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
-	schedule_work(&disable_freq_invariance_work);
-}
-#endif /* CONFIG_X86_64 */


* [tip: x86/cleanups] x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu()
  2022-04-15 19:19 ` [patch 01/10] x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu() Thomas Gleixner
  2022-04-19 15:34   ` Rafael J. Wysocki
@ 2022-04-27 13:56   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 13:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     6d108c96bf23598cc3b4f91d60e9b7694abcd2a7
Gitweb:        https://git.kernel.org/tip/6d108c96bf23598cc3b4f91d60e9b7694abcd2a7
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:19:50 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 15:51:08 +02:00

x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu()

aperfmperf_get_khz() already excludes idle CPUs from APERF/MPERF sampling
and that's a reasonable decision. There is no point in sending up to two
IPIs to an idle CPU just because someone reads a sysfs file.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.419880163@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 9ca008f..ea9160f 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -139,6 +139,9 @@ unsigned int arch_freq_get_on_cpu(int cpu)
 	if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
 		return 0;
 
+	if (rcu_is_idle_cpu(cpu))
+		return 0;
+
 	if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
 		return per_cpu(samples.khz, cpu);
 


* [tip: x86/cleanups] x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
  2022-04-25 15:45         ` Thomas Gleixner
  2022-04-25 23:20           ` Doug Smythies
  2022-04-27 13:56           ` [tip: x86/cleanups] x86/aperfmperf: Integrate the fallback code from show_cpuinfo() tip-bot2 for Thomas Gleixner
@ 2022-04-27 18:27           ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 18:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rafael J. Wysocki, Thomas Gleixner, Doug Smythies, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     fb4c77c21aba03677f283acda3cae748ef866abf
Gitweb:        https://git.kernel.org/tip/fb4c77c21aba03677f283acda3cae748ef866abf
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Mon, 25 Apr 2022 17:45:42 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 20:22:20 +02:00

x86/aperfmperf: Integrate the fallback code from show_cpuinfo()

Due to the avoidance of IPIs to idle CPUs, arch_freq_get_on_cpu() can return
0 when the last sample was too long ago.

show_cpuinfo() has a fallback to cpufreq_quick_get() and, if that fails too,
to cpu_khz, but the readout code for the per CPU scaling frequency in sysfs
does not.

Move that fallback into arch_freq_get_on_cpu() so the behaviour is the same
when reading /proc/cpuinfo and /sys/..../scaling_cur_freq.
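
As a plain C illustration of the resulting lookup order (the stub names and
values below are invented; the real symbols live in
arch/x86/kernel/cpu/aperfmperf.c and the cpufreq core):

#include <stdio.h>

static unsigned int cpu_khz = 2400000;	/* stand-in: TSC derived base, kHz */

static unsigned int sample_khz(int cpu)	{ return 0; } /* stale sample */
static unsigned int cpufreq_quick_get_stub(int cpu) { return 0; } /* no data */

/* Minimal model of the fallback chain arch_freq_get_on_cpu() now has:
 * fresh APERF/MPERF sample -> cpufreq -> cpu_khz. */
static unsigned int freq_get(int cpu)
{
	unsigned int freq = sample_khz(cpu);

	if (freq)
		return freq;
	freq = cpufreq_quick_get_stub(cpu);
	return freq ? freq : cpu_khz;
}

int main(void)
{
	printf("cpu0: %u kHz\n", freq_get(0));	/* falls back to cpu_khz */
	return 0;
}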

Suggested-by: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Doug Smythies <dsmythies@telus.net>
Link: https://lore.kernel.org/r/87pml5180p.ffs@tglx

---
 arch/x86/kernel/cpu/aperfmperf.c | 10 +++++++---
 arch/x86/kernel/cpu/proc.c       |  7 +------
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index b15c884..1f60a2b 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -405,12 +405,12 @@ void arch_scale_freq_tick(void)
 unsigned int arch_freq_get_on_cpu(int cpu)
 {
 	struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
+	unsigned int seq, freq;
 	unsigned long last;
-	unsigned int seq;
 	u64 acnt, mcnt;
 
 	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
-		return 0;
+		goto fallback;
 
 	do {
 		seq = raw_read_seqcount_begin(&s->seq);
@@ -424,9 +424,13 @@ unsigned int arch_freq_get_on_cpu(int cpu)
 	 * which covers idle and NOHZ full CPUs.
 	 */
 	if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
-		return 0;
+		goto fallback;
 
 	return div64_u64((cpu_khz * acnt), mcnt);
+
+fallback:
+	freq = cpufreq_quick_get(cpu);
+	return freq ? freq : cpu_khz;
 }
 
 static int __init bp_init_aperfmperf(void)
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 0a0ee55..099b6f0 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -86,12 +86,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 	if (cpu_has(c, X86_FEATURE_TSC)) {
 		unsigned int freq = arch_freq_get_on_cpu(cpu);
 
-		if (!freq)
-			freq = cpufreq_quick_get(cpu);
-		if (!freq)
-			freq = cpu_khz;
-		seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
-			   freq / 1000, (freq % 1000));
+		seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000));
 	}
 
 	/* Cache size */


* [tip: x86/cleanups] x86/aperfmperf: Replace arch_freq_get_on_cpu()
  2022-04-15 19:20 ` [patch 10/10] x86/aperfmperf: Replace arch_freq_get_on_cpu() Thomas Gleixner
  2022-04-19 16:37   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
@ 2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 18:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Eric Dumazet, Rafael J. Wysocki,
	Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     f3eca381bd49d708073ba1a9af4fa6ea5d5810a6
Gitweb:        https://git.kernel.org/tip/f3eca381bd49d708073ba1a9af4fa6ea5d5810a6
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:20:04 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 20:22:19 +02:00

x86/aperfmperf: Replace arch_freq_get_on_cpu()

Reading the current CPU frequency from /sys/..../scaling_cur_freq involves,
in the worst case, two IPIs due to the ad hoc sampling.

The frequency invariance infrastructure provides the APERF/MPERF samples
already. Utilize them and consolidate this with the /proc/cpuinfo readout.

The sample is considered valid for 20ms. So for idle or isolated NOHZ full
CPUs the function returns 0, which matches the previous behaviour.
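
The 20ms window falls out of the jiffies arithmetic. A standalone sketch of
the staleness check, with HZ hardcoded since userspace has no kernel config
(the helper name is made up):

#include <stdbool.h>
#include <stdio.h>

#define HZ		250			 /* assumed tick rate */
#define MAX_SAMPLE_AGE	((unsigned long)HZ / 50) /* 250/50 = 5 ticks = 20ms */

static bool sample_is_stale(unsigned long now, unsigned long last)
{
	/* Unsigned subtraction copes with jiffies wraparound. */
	return (now - last) > MAX_SAMPLE_AGE;
}

int main(void)
{
	printf("6 ticks old: %d\n", sample_is_stale(106, 100));	/* 1: stale */
	printf("3 ticks old: %d\n", sample_is_stale(103, 100));	/* 0: fresh */
	return 0;
}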

The resulting text size vs. the original APERF/MPERF plus the separate
frequency invariance code:

  text:		2411	->   723
  init.text:	   0	->   767

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.934040006@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 94 +-------------------------------
 arch/x86/kernel/cpu/proc.c       |  2 +-
 2 files changed, 2 insertions(+), 94 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index e9d2da7..b15c884 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -36,98 +36,6 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples) = {
 	.seq = SEQCNT_ZERO(cpu_samples.seq)
 };
 
-struct aperfmperf_sample {
-	unsigned int	khz;
-	atomic_t	scfpending;
-	ktime_t	time;
-	u64	aperf;
-	u64	mperf;
-};
-
-static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
-
-#define APERFMPERF_CACHE_THRESHOLD_MS	10
-#define APERFMPERF_REFRESH_DELAY_MS	10
-#define APERFMPERF_STALE_THRESHOLD_MS	1000
-
-/*
- * aperfmperf_snapshot_khz()
- * On the current CPU, snapshot APERF, MPERF, and jiffies
- * unless we already did it within 10ms
- * calculate kHz, save snapshot
- */
-static void aperfmperf_snapshot_khz(void *dummy)
-{
-	u64 aperf, aperf_delta;
-	u64 mperf, mperf_delta;
-	struct aperfmperf_sample *s = this_cpu_ptr(&samples);
-	unsigned long flags;
-
-	local_irq_save(flags);
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-	local_irq_restore(flags);
-
-	aperf_delta = aperf - s->aperf;
-	mperf_delta = mperf - s->mperf;
-
-	/*
-	 * There is no architectural guarantee that MPERF
-	 * increments faster than we can read it.
-	 */
-	if (mperf_delta == 0)
-		return;
-
-	s->time = ktime_get();
-	s->aperf = aperf;
-	s->mperf = mperf;
-	s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
-	atomic_set_release(&s->scfpending, 0);
-}
-
-static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait)
-{
-	s64 time_delta = ktime_ms_delta(now, per_cpu(samples.time, cpu));
-	struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
-
-	/* Don't bother re-computing within the cache threshold time. */
-	if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
-		return true;
-
-	if (!atomic_xchg(&s->scfpending, 1) || wait)
-		smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait);
-
-	/* Return false if the previous iteration was too long ago. */
-	return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
-}
-
-unsigned int arch_freq_get_on_cpu(int cpu)
-{
-	struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
-
-	if (!cpu_khz)
-		return 0;
-
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return 0;
-
-	if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
-		return 0;
-
-	if (rcu_is_idle_cpu(cpu))
-		return 0;
-
-	if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
-		return per_cpu(samples.khz, cpu);
-
-	msleep(APERFMPERF_REFRESH_DELAY_MS);
-	atomic_set(&s->scfpending, 1);
-	smp_mb(); /* ->scfpending before smp_call_function_single(). */
-	smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
-
-	return per_cpu(samples.khz, cpu);
-}
-
 static void init_counter_refs(void)
 {
 	u64 aperf, mperf;
@@ -494,7 +402,7 @@ void arch_scale_freq_tick(void)
  */
 #define MAX_SAMPLE_AGE	((unsigned long)HZ / 50)
 
-unsigned int aperfmperf_get_khz(int cpu)
+unsigned int arch_freq_get_on_cpu(int cpu)
 {
 	struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
 	unsigned long last;
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 4eec888..0a0ee55 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -84,7 +84,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 		seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
 
 	if (cpu_has(c, X86_FEATURE_TSC)) {
-		unsigned int freq = aperfmperf_get_khz(cpu);
+		unsigned int freq = arch_freq_get_on_cpu(cpu);
 
 		if (!freq)
 			freq = cpufreq_quick_get(cpu);


* [tip: x86/cleanups] x86/aperfmperf: Replace aperfmperf_get_khz()
  2022-04-15 19:20 ` [patch 09/10] x86/aperfmperf: Replace aperfmperf_get_khz() Thomas Gleixner
  2022-04-19 16:35   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
@ 2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 18:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Eric Dumazet, Thomas Gleixner, Eric Dumazet, Rafael J. Wysocki,
	Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     7d84c1ebf9ddafca27b481e6da7d24a023dacaa2
Gitweb:        https://git.kernel.org/tip/7d84c1ebf9ddafca27b481e6da7d24a023dacaa2
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:20:02 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 20:22:19 +02:00

x86/aperfmperf: Replace aperfmperf_get_khz()

The frequency invariance infrastructure provides the APERF/MPERF samples
already. Utilize them for the cpu frequency display in /proc/cpuinfo.

The sample is considered valid for 20ms. So for idle or isolated NOHZ full
CPUs the function returns 0, which matches the previous behaviour.

This gets rid of the mass IPIs and the 20ms stabilization delay observed by
Eric when reading /proc/cpuinfo.
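
The displayed value is derived from the counter deltas as
cpu_khz * acnt / mcnt. A quick standalone check with invented deltas:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Invented numbers: 2.4 GHz base clock, APERF advanced 10% faster
	 * than MPERF over the sample window, i.e. the CPU ran in turbo. */
	uint64_t cpu_khz = 2400000;
	uint64_t acnt = 1100000;	/* delta of MSR_IA32_APERF */
	uint64_t mcnt = 1000000;	/* delta of MSR_IA32_MPERF */

	/* div64_u64((cpu_khz * acnt), mcnt) in the kernel */
	printf("effective frequency: %llu kHz\n",
	       (unsigned long long)(cpu_khz * acnt / mcnt));	/* 2640000 */
	return 0;
}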

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.875029458@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 77 +++++++++++++------------------
 fs/proc/cpuinfo.c                |  6 +--
 include/linux/cpufreq.h          |  1 +-
 3 files changed, 35 insertions(+), 49 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 963c069..e9d2da7 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -101,49 +101,6 @@ static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait)
 	return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
 }
 
-unsigned int aperfmperf_get_khz(int cpu)
-{
-	if (!cpu_khz)
-		return 0;
-
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return 0;
-
-	if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
-		return 0;
-
-	if (rcu_is_idle_cpu(cpu))
-		return 0; /* Idle CPUs are completely uninteresting. */
-
-	aperfmperf_snapshot_cpu(cpu, ktime_get(), true);
-	return per_cpu(samples.khz, cpu);
-}
-
-void arch_freq_prepare_all(void)
-{
-	ktime_t now = ktime_get();
-	bool wait = false;
-	int cpu;
-
-	if (!cpu_khz)
-		return;
-
-	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return;
-
-	for_each_online_cpu(cpu) {
-		if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
-			continue;
-		if (rcu_is_idle_cpu(cpu))
-			continue; /* Idle CPUs are completely uninteresting. */
-		if (!aperfmperf_snapshot_cpu(cpu, now, false))
-			wait = true;
-	}
-
-	if (wait)
-		msleep(APERFMPERF_REFRESH_DELAY_MS);
-}
-
 unsigned int arch_freq_get_on_cpu(int cpu)
 {
 	struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
@@ -530,6 +487,40 @@ void arch_scale_freq_tick(void)
 	scale_freq_tick(acnt, mcnt);
 }
 
+/*
+ * Discard samples older than the defined maximum sample age of 20ms. There
+ * is no point in sending IPIs in such a case. If the scheduler tick was
+ * not running then the CPU is either idle or isolated.
+ */
+#define MAX_SAMPLE_AGE	((unsigned long)HZ / 50)
+
+unsigned int aperfmperf_get_khz(int cpu)
+{
+	struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
+	unsigned long last;
+	unsigned int seq;
+	u64 acnt, mcnt;
+
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		return 0;
+
+	do {
+		seq = raw_read_seqcount_begin(&s->seq);
+		last = s->last_update;
+		acnt = s->acnt;
+		mcnt = s->mcnt;
+	} while (read_seqcount_retry(&s->seq, seq));
+
+	/*
+	 * Bail on invalid count and when the last update was too long ago,
+	 * which covers idle and NOHZ full CPUs.
+	 */
+	if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
+		return 0;
+
+	return div64_u64((cpu_khz * acnt), mcnt);
+}
+
 static int __init bp_init_aperfmperf(void)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
diff --git a/fs/proc/cpuinfo.c b/fs/proc/cpuinfo.c
index 419760f..f38bda5 100644
--- a/fs/proc/cpuinfo.c
+++ b/fs/proc/cpuinfo.c
@@ -5,14 +5,10 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 
-__weak void arch_freq_prepare_all(void)
-{
-}
-
 extern const struct seq_operations cpuinfo_op;
+
 static int cpuinfo_open(struct inode *inode, struct file *file)
 {
-	arch_freq_prepare_all();
 	return seq_open(file, &cpuinfo_op);
 }
 
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 35c7d6d..d5595d5 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -1199,7 +1199,6 @@ static inline void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
 			struct cpufreq_governor *old_gov) { }
 #endif
 
-extern void arch_freq_prepare_all(void);
 extern unsigned int arch_freq_get_on_cpu(int cpu);
 
 #ifndef arch_set_freq_scale


* [tip: x86/cleanups] x86/aperfmperf: Store aperf/mperf data for cpu frequency reads
  2022-04-15 19:20 ` [patch 08/10] x86/aperfmperf: Store aperf/mperf data for cpu frequency reads Thomas Gleixner
  2022-04-19 16:30   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
@ 2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 18:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     cd8c0e142daf9de9ce594e61b75509b0af7bfb26
Gitweb:        https://git.kernel.org/tip/cd8c0e142daf9de9ce594e61b75509b0af7bfb26
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:20:01 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 20:22:19 +02:00

x86/aperfmperf: Store aperf/mperf data for cpu frequency reads

Now that the MSR readout is unconditional, store the results in the per CPU
data structure along with a jiffies timestamp for the CPU frequency readout
code.
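
The protocol can be modeled in userspace with C11 atomics. This is a
simplified single-writer sketch, not the kernel API; the plain stores to the
payload gloss over strict C11 data-race rules, which the kernel's seqcount
primitives handle properly:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct sample {
	atomic_uint	seq;		/* odd while an update is in flight */
	unsigned long	last_update;
	uint64_t	acnt, mcnt;
};

static void tick_writer(struct sample *s, unsigned long now,
			uint64_t a, uint64_t m)
{
	atomic_fetch_add_explicit(&s->seq, 1, memory_order_acq_rel);
	s->last_update = now;
	s->acnt = a;
	s->mcnt = m;
	atomic_fetch_add_explicit(&s->seq, 1, memory_order_release);
}

static void freq_reader(struct sample *s, uint64_t *a, uint64_t *m)
{
	unsigned int seq;

	do {	/* retry if a writer was or became active in between */
		seq = atomic_load_explicit(&s->seq, memory_order_acquire);
		*a = s->acnt;
		*m = s->mcnt;
	} while ((seq & 1) ||
		 seq != atomic_load_explicit(&s->seq, memory_order_acquire));
}

int main(void)
{
	struct sample s = { .seq = 0 };
	uint64_t a, m;

	tick_writer(&s, 1000, 42, 40);
	freq_reader(&s, &a, &m);
	printf("acnt=%llu mcnt=%llu\n", (unsigned long long)a,
	       (unsigned long long)m);
	return 0;
}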

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.817702355@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index df528a4..963c069 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -24,11 +24,17 @@
 #include "cpu.h"
 
 struct aperfmperf {
+	seqcount_t	seq;
+	unsigned long	last_update;
+	u64		acnt;
+	u64		mcnt;
 	u64		aperf;
 	u64		mperf;
 };
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples) = {
+	.seq = SEQCNT_ZERO(cpu_samples.seq)
+};
 
 struct aperfmperf_sample {
 	unsigned int	khz;
@@ -515,6 +521,12 @@ void arch_scale_freq_tick(void)
 	s->aperf = aperf;
 	s->mperf = mperf;
 
+	raw_write_seqcount_begin(&s->seq);
+	s->last_update = jiffies;
+	s->acnt = acnt;
+	s->mcnt = mcnt;
+	raw_write_seqcount_end(&s->seq);
+
 	scale_freq_tick(acnt, mcnt);
 }
 


* [tip: x86/cleanups] x86/aperfmperf: Make parts of the frequency invariance code unconditional
  2022-04-15 19:19 ` [patch 07/10] x86/aperfmperf: Make parts of the frequency invariance code unconditional Thomas Gleixner
  2022-04-19 16:27   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
@ 2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 18:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     bb6e89df9028b2fab0ce6ac71cd9ef25b6ada32d
Gitweb:        https://git.kernel.org/tip/bb6e89df9028b2fab0ce6ac71cd9ef25b6ada32d
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:19:59 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 20:22:19 +02:00

x86/aperfmperf: Make parts of the frequency invariance code unconditional

The frequency invariance support is currently limited to x86/64 and SMP,
which is the vast majority of machines.

arch_scale_freq_tick() is called every tick on all CPUs and reads the APERF
and MPERF MSRs. The CPU frequency getter functions do the same via dedicated
IPIs.

While it could be argued that on systems where frequency invariance support
is disabled (32bit, !SMP) the per tick read of the APERF and MPERF MSRs can
be avoided, it does not make sense to keep the extra code and the resulting
runtime issues of mass IPIs around.

As a first step, split out the initialization code that is not specific to
frequency invariance and the MSR read portion of arch_scale_freq_tick(). The
rest of the code is still conditional and guarded with a static key.
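
The shape of the split, sketched in plain C (the boolean stands in for the
static key; names are illustrative only):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool scale_freq_enabled;		/* models arch_scale_freq_key */

/* Consumer side: a no-op unless frequency invariance was enabled. */
static void scale_freq_tick(uint64_t acnt, uint64_t mcnt)
{
	if (!scale_freq_enabled)
		return;
	printf("scale input: acnt=%llu mcnt=%llu\n",
	       (unsigned long long)acnt, (unsigned long long)mcnt);
}

/* Sampling side: runs on every tick whenever APERF/MPERF exists, keeping
 * the per CPU deltas current for all consumers. */
static void tick(uint64_t aperf, uint64_t mperf, uint64_t *pa, uint64_t *pm)
{
	uint64_t acnt = aperf - *pa;
	uint64_t mcnt = mperf - *pm;

	*pa = aperf;
	*pm = mperf;
	scale_freq_tick(acnt, mcnt);
}

int main(void)
{
	uint64_t pa = 0, pm = 0;

	tick(1100, 1000, &pa, &pm);	/* sampled; scale path skipped */
	scale_freq_enabled = true;
	tick(2200, 2000, &pa, &pm);	/* sampled; scale path fed */
	return 0;
}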

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.761988704@linutronix.de

---
 arch/x86/include/asm/cpu.h       |  2 +-
 arch/x86/include/asm/topology.h  |  4 +--
 arch/x86/kernel/cpu/aperfmperf.c | 63 ++++++++++++++++++-------------
 arch/x86/kernel/smpboot.c        |  3 +-
 4 files changed, 41 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index 86e5e4e..e89772d 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -36,6 +36,8 @@ extern int _debug_hotplug_cpu(int cpu, int action);
 #endif
 #endif
 
+extern void ap_init_aperfmperf(void);
+
 int mwait_usable(const struct cpuinfo_x86 *);
 
 unsigned int x86_family(unsigned int sig);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index cc31707..1b2553d 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -217,13 +217,9 @@ extern void arch_scale_freq_tick(void);
 
 extern void arch_set_max_freq_ratio(bool turbo_disabled);
 extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
-extern void bp_init_freq_invariance(void);
-extern void ap_init_freq_invariance(void);
 #else
 static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
 static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
-static inline void bp_init_freq_invariance(void) { }
-static inline void ap_init_freq_invariance(void) { }
 #endif
 
 #ifdef CONFIG_ACPI_CPPC_LIB
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 6220503..df528a4 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -17,6 +17,7 @@
 #include <linux/smp.h>
 #include <linux/syscore_ops.h>
 
+#include <asm/cpu.h>
 #include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 
@@ -164,6 +165,17 @@ unsigned int arch_freq_get_on_cpu(int cpu)
 	return per_cpu(samples.khz, cpu);
 }
 
+static void init_counter_refs(void)
+{
+	u64 aperf, mperf;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	this_cpu_write(cpu_samples.aperf, aperf);
+	this_cpu_write(cpu_samples.mperf, mperf);
+}
+
 #if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
 /*
  * APERF/MPERF frequency ratio computation.
@@ -405,17 +417,6 @@ out:
 	return true;
 }
 
-static void init_counter_refs(void)
-{
-	u64 aperf, mperf;
-
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	this_cpu_write(cpu_samples.aperf, aperf);
-	this_cpu_write(cpu_samples.mperf, mperf);
-}
-
 #ifdef CONFIG_PM_SLEEP
 static struct syscore_ops freq_invariance_syscore_ops = {
 	.resume = init_counter_refs,
@@ -447,13 +448,8 @@ void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled)
 	freq_invariance_enable();
 }
 
-void __init bp_init_freq_invariance(void)
+static void __init bp_init_freq_invariance(void)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
-		return;
-
-	init_counter_refs();
-
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
 		return;
 
@@ -461,12 +457,6 @@ void __init bp_init_freq_invariance(void)
 		freq_invariance_enable();
 }
 
-void ap_init_freq_invariance(void)
-{
-	if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
-		init_counter_refs();
-}
-
 static void disable_freq_invariance_workfn(struct work_struct *work)
 {
 	static_branch_disable(&arch_scale_freq_key);
@@ -481,6 +471,9 @@ static void scale_freq_tick(u64 acnt, u64 mcnt)
 {
 	u64 freq_scale;
 
+	if (!arch_scale_freq_invariant())
+		return;
+
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
 		goto error;
 
@@ -501,13 +494,17 @@ error:
 	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
 	schedule_work(&disable_freq_invariance_work);
 }
+#else
+static inline void bp_init_freq_invariance(void) { }
+static inline void scale_freq_tick(u64 acnt, u64 mcnt) { }
+#endif /* CONFIG_X86_64 && CONFIG_SMP */
 
 void arch_scale_freq_tick(void)
 {
 	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
 	u64 acnt, mcnt, aperf, mperf;
 
-	if (!arch_scale_freq_invariant())
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
 		return;
 
 	rdmsrl(MSR_IA32_APERF, aperf);
@@ -520,4 +517,20 @@ void arch_scale_freq_tick(void)
 
 	scale_freq_tick(acnt, mcnt);
 }
-#endif /* CONFIG_X86_64 && CONFIG_SMP */
+
+static int __init bp_init_aperfmperf(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		return 0;
+
+	init_counter_refs();
+	bp_init_freq_invariance();
+	return 0;
+}
+early_initcall(bp_init_aperfmperf);
+
+void ap_init_aperfmperf(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		init_counter_refs();
+}
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index b1ba7dd..eb7de77 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -186,7 +186,7 @@ static void smp_callin(void)
 	 */
 	set_cpu_sibling_map(raw_smp_processor_id());
 
-	ap_init_freq_invariance();
+	ap_init_aperfmperf();
 
 	/*
 	 * Get our bogomips.
@@ -1396,7 +1396,6 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 {
 	smp_prepare_cpus_common();
 
-	bp_init_freq_invariance();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {


* [tip: x86/cleanups] x86/aperfmperf: Restructure arch_scale_freq_tick()
  2022-04-15 19:19 ` [patch 06/10] x86/aperfmperf: Restructure arch_scale_freq_tick() Thomas Gleixner
  2022-04-19 16:20   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
@ 2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 18:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     73a5fa7d51366a549a9f2e3ee875ae51aa0b5580
Gitweb:        https://git.kernel.org/tip/73a5fa7d51366a549a9f2e3ee875ae51aa0b5580
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:19:57 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 20:22:19 +02:00

x86/aperfmperf: Restructure arch_scale_freq_tick()

Preparation for sharing code with the CPU frequency portion of the
aperf/mperf code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.706185092@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 36 ++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 6922c77..6220503 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -477,22 +477,9 @@ static DECLARE_WORK(disable_freq_invariance_work,
 
 DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
 
-void arch_scale_freq_tick(void)
+static void scale_freq_tick(u64 acnt, u64 mcnt)
 {
-	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
-	u64 aperf, mperf, acnt, mcnt, freq_scale;
-
-	if (!arch_scale_freq_invariant())
-		return;
-
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	acnt = aperf - s->aperf;
-	mcnt = mperf - s->mperf;
-
-	s->aperf = aperf;
-	s->mperf = mperf;
+	u64 freq_scale;
 
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
 		goto error;
@@ -514,4 +501,23 @@ error:
 	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
 	schedule_work(&disable_freq_invariance_work);
 }
+
+void arch_scale_freq_tick(void)
+{
+	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
+	u64 acnt, mcnt, aperf, mperf;
+
+	if (!arch_scale_freq_invariant())
+		return;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+	acnt = aperf - s->aperf;
+	mcnt = mperf - s->mperf;
+
+	s->aperf = aperf;
+	s->mperf = mperf;
+
+	scale_freq_tick(acnt, mcnt);
+}
 #endif /* CONFIG_X86_64 && CONFIG_SMP */


* [tip: x86/cleanups] x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct
  2022-04-15 19:19 ` [patch 05/10] x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct Thomas Gleixner
  2022-04-19 16:15   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
@ 2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 18:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     24620d94a52adc0cafe65dc65bed1d586ca04a6e
Gitweb:        https://git.kernel.org/tip/24620d94a52adc0cafe65dc65bed1d586ca04a6e
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:19:56 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 20:22:19 +02:00

x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct

Preparation for sharing code with the CPU frequency portion of the
aperf/mperf code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.648485667@linutronix.de

---
 arch/x86/kernel/cpu/aperfmperf.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index b4f4ea5..6922c77 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -22,6 +22,13 @@
 
 #include "cpu.h"
 
+struct aperfmperf {
+	u64		aperf;
+	u64		mperf;
+};
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples);
+
 struct aperfmperf_sample {
 	unsigned int	khz;
 	atomic_t	scfpending;
@@ -194,8 +201,6 @@ unsigned int arch_freq_get_on_cpu(int cpu)
 
 DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
 
-static DEFINE_PER_CPU(u64, arch_prev_aperf);
-static DEFINE_PER_CPU(u64, arch_prev_mperf);
 static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
 static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
 
@@ -407,8 +412,8 @@ static void init_counter_refs(void)
 	rdmsrl(MSR_IA32_APERF, aperf);
 	rdmsrl(MSR_IA32_MPERF, mperf);
 
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
+	this_cpu_write(cpu_samples.aperf, aperf);
+	this_cpu_write(cpu_samples.mperf, mperf);
 }
 
 #ifdef CONFIG_PM_SLEEP
@@ -474,9 +479,8 @@ DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
 
 void arch_scale_freq_tick(void)
 {
-	u64 freq_scale;
-	u64 aperf, mperf;
-	u64 acnt, mcnt;
+	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
+	u64 aperf, mperf, acnt, mcnt, freq_scale;
 
 	if (!arch_scale_freq_invariant())
 		return;
@@ -484,11 +488,11 @@ void arch_scale_freq_tick(void)
 	rdmsrl(MSR_IA32_APERF, aperf);
 	rdmsrl(MSR_IA32_MPERF, mperf);
 
-	acnt = aperf - this_cpu_read(arch_prev_aperf);
-	mcnt = mperf - this_cpu_read(arch_prev_mperf);
+	acnt = aperf - s->aperf;
+	mcnt = mperf - s->mperf;
 
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
+	s->aperf = aperf;
+	s->mperf = mperf;
 
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
 		goto error;


* [tip: x86/cleanups] x86/aperfmperf: Untangle Intel and AMD frequency invariance init
  2022-04-15 19:19 ` [patch 04/10] x86/aperfmperf: Untangle Intel and AMD " Thomas Gleixner
  2022-04-19 16:12   ` Rafael J. Wysocki
  2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
@ 2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 51+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-27 18:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Rafael J. Wysocki, Peter Zijlstra (Intel),
	Paul E. McKenney, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     0dfaf3f6ecc0c7f4f876255aa82e8959d3721365
Gitweb:        https://git.kernel.org/tip/0dfaf3f6ecc0c7f4f876255aa82e8959d3721365
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 15 Apr 2022 21:19:54 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 27 Apr 2022 20:22:19 +02:00

x86/aperfmperf: Untangle Intel and AMD frequency invariance init

AMD boot CPU initialization happens late via ACPI/CPPC, which prevents the
Intel parts from being marked __init.

Split out the common code and provide a dedicated interface for the AMD
initialization and mark the Intel specific code and data __init.

The remaining text size is almost cut in half:

  text:		2614	->	1350
  init.text:	   0	->	 786

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.592465719@linutronix.de

---
 arch/x86/include/asm/topology.h  | 13 ++-----
 arch/x86/kernel/acpi/cppc.c      | 30 +++++++--------
 arch/x86/kernel/cpu/aperfmperf.c | 62 ++++++++++++++++---------------
 arch/x86/kernel/smpboot.c        |  2 +-
 4 files changed, 51 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index e2faedc..cc31707 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -216,24 +216,19 @@ extern void arch_scale_freq_tick(void);
 #define arch_scale_freq_tick arch_scale_freq_tick
 
 extern void arch_set_max_freq_ratio(bool turbo_disabled);
-extern void bp_init_freq_invariance(bool cppc_ready);
+extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
+extern void bp_init_freq_invariance(void);
 extern void ap_init_freq_invariance(void);
 #else
 static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
-static inline void bp_init_freq_invariance(bool cppc_ready) { }
+static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
+static inline void bp_init_freq_invariance(void) { }
 static inline void ap_init_freq_invariance(void) { }
 #endif
 
 #ifdef CONFIG_ACPI_CPPC_LIB
 void init_freq_invariance_cppc(void);
 #define arch_init_invariance_cppc init_freq_invariance_cppc
-
-bool amd_set_max_freq_ratio(u64 *ratio);
-#else
-static inline bool amd_set_max_freq_ratio(u64 *ratio)
-{
-	return false;
-}
 #endif
 
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/acpi/cppc.c b/arch/x86/kernel/acpi/cppc.c
index 06109d9..8b8cbf2 100644
--- a/arch/x86/kernel/acpi/cppc.c
+++ b/arch/x86/kernel/acpi/cppc.c
@@ -50,20 +50,17 @@ int cpc_write_ffh(int cpunum, struct cpc_reg *reg, u64 val)
 	return err;
 }
 
-bool amd_set_max_freq_ratio(u64 *ratio)
+static void amd_set_max_freq_ratio(void)
 {
 	struct cppc_perf_caps perf_caps;
 	u64 highest_perf, nominal_perf;
 	u64 perf_ratio;
 	int rc;
 
-	if (!ratio)
-		return false;
-
 	rc = cppc_get_perf_caps(0, &perf_caps);
 	if (rc) {
 		pr_debug("Could not retrieve perf counters (%d)\n", rc);
-		return false;
+		return;
 	}
 
 	highest_perf = amd_get_highest_perf();
@@ -71,7 +68,7 @@ bool amd_set_max_freq_ratio(u64 *ratio)
 
 	if (!highest_perf || !nominal_perf) {
 		pr_debug("Could not retrieve highest or nominal performance\n");
-		return false;
+		return;
 	}
 
 	perf_ratio = div_u64(highest_perf * SCHED_CAPACITY_SCALE, nominal_perf);
@@ -79,26 +76,27 @@ bool amd_set_max_freq_ratio(u64 *ratio)
 	perf_ratio = (perf_ratio + SCHED_CAPACITY_SCALE) >> 1;
 	if (!perf_ratio) {
 		pr_debug("Non-zero highest/nominal perf values led to a 0 ratio\n");
-		return false;
+		return;
 	}
 
-	*ratio = perf_ratio;
-	arch_set_max_freq_ratio(false);
-
-	return true;
+	freq_invariance_set_perf_ratio(perf_ratio, false);
 }
 
 static DEFINE_MUTEX(freq_invariance_lock);
 
 void init_freq_invariance_cppc(void)
 {
-	static bool secondary;
+	static bool init_done;
 
-	mutex_lock(&freq_invariance_lock);
+	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+		return;
 
-	if (!secondary)
-		bp_init_freq_invariance(true);
-	secondary = true;
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD)
+		return;
 
+	mutex_lock(&freq_invariance_lock);
+	if (!init_done)
+		amd_set_max_freq_ratio();
+	init_done = true;
 	mutex_unlock(&freq_invariance_lock);
 }
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 87f34f2..b4f4ea5 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -206,7 +206,7 @@ void arch_set_max_freq_ratio(bool turbo_disabled)
 }
 EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
 
-static bool turbo_disabled(void)
+static bool __init turbo_disabled(void)
 {
 	u64 misc_en;
 	int err;
@@ -218,7 +218,7 @@ static bool turbo_disabled(void)
 	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
 }
 
-static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+static bool __init slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
 {
 	int err;
 
@@ -240,26 +240,26 @@ static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
 	X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6,		\
 		INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
 
-static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] __initconst = {
 	X86_MATCH(XEON_PHI_KNL),
 	X86_MATCH(XEON_PHI_KNM),
 	{}
 };
 
-static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
+static const struct x86_cpu_id has_skx_turbo_ratio_limits[] __initconst = {
 	X86_MATCH(SKYLAKE_X),
 	{}
 };
 
-static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
+static const struct x86_cpu_id has_glm_turbo_ratio_limits[] __initconst = {
 	X86_MATCH(ATOM_GOLDMONT),
 	X86_MATCH(ATOM_GOLDMONT_D),
 	X86_MATCH(ATOM_GOLDMONT_PLUS),
 	{}
 };
 
-static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
-				int num_delta_fratio)
+static bool __init knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
+					  int num_delta_fratio)
 {
 	int fratio, delta_fratio, found;
 	int err, i;
@@ -297,7 +297,7 @@ static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
 	return true;
 }
 
-static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
+static bool __init skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
 {
 	u64 ratios, counts;
 	u32 group_size;
@@ -328,7 +328,7 @@ static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
 	return false;
 }
 
-static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+static bool __init core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
 {
 	u64 msr;
 	int err;
@@ -351,7 +351,7 @@ static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
 	return true;
 }
 
-static bool intel_set_max_freq_ratio(void)
+static bool __init intel_set_max_freq_ratio(void)
 {
 	u64 base_freq, turbo_freq;
 	u64 turbo_ratio;
@@ -418,40 +418,42 @@ static struct syscore_ops freq_invariance_syscore_ops = {
 
 static void register_freq_invariance_syscore_ops(void)
 {
-	/* Bail out if registered already. */
-	if (freq_invariance_syscore_ops.node.prev)
-		return;
-
 	register_syscore_ops(&freq_invariance_syscore_ops);
 }
 #else
 static inline void register_freq_invariance_syscore_ops(void) {}
 #endif
 
-void bp_init_freq_invariance(bool cppc_ready)
+static void freq_invariance_enable(void)
+{
+	if (static_branch_unlikely(&arch_scale_freq_key)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+	static_branch_enable(&arch_scale_freq_key);
+	register_freq_invariance_syscore_ops();
+	pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
+}
+
+void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled)
 {
-	bool ret;
+	arch_turbo_freq_ratio = ratio;
+	arch_set_max_freq_ratio(turbo_disabled);
+	freq_invariance_enable();
+}
 
+void __init bp_init_freq_invariance(void)
+{
 	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
 		return;
 
 	init_counter_refs();
 
-	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
-		ret = intel_set_max_freq_ratio();
-	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
-		if (!cppc_ready)
-			return;
-		ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
-	}
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+		return;
 
-	if (ret) {
-		static_branch_enable(&arch_scale_freq_key);
-		register_freq_invariance_syscore_ops();
-		pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
-	} else {
-		pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
-	}
+	if (intel_set_max_freq_ratio())
+		freq_invariance_enable();
 }
 
 void ap_init_freq_invariance(void)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 023feb4..b1ba7dd 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1396,7 +1396,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 {
 	smp_prepare_cpus_common();
 
-	bp_init_freq_invariance(false);
+	bp_init_freq_invariance();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {


end of thread, other threads:[~2022-04-27 18:43 UTC | newest]

Thread overview: 51+ messages
2022-04-15 19:19 [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Thomas Gleixner
2022-04-15 19:19 ` [patch 01/10] x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu() Thomas Gleixner
2022-04-19 15:34   ` Rafael J. Wysocki
2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
2022-04-15 19:19 ` [patch 02/10] x86/smp: Move APERF/MPERF code where it belongs Thomas Gleixner
2022-04-19 15:40   ` Rafael J. Wysocki
2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
2022-04-15 19:19 ` [patch 03/10] x86/aperfmperf: Separate AP/BP frequency invariance init Thomas Gleixner
2022-04-19 16:04   ` Rafael J. Wysocki
2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
2022-04-15 19:19 ` [patch 04/10] x86/aperfmperf: Untangle Intel and AMD " Thomas Gleixner
2022-04-19 16:12   ` Rafael J. Wysocki
2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
2022-04-15 19:19 ` [patch 05/10] x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct Thomas Gleixner
2022-04-19 16:15   ` Rafael J. Wysocki
2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
2022-04-15 19:19 ` [patch 06/10] x86/aperfmperf: Restructure arch_scale_freq_tick() Thomas Gleixner
2022-04-19 16:20   ` Rafael J. Wysocki
2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
2022-04-15 19:19 ` [patch 07/10] x86/aperfmperf: Make parts of the frequency invariance code unconditional Thomas Gleixner
2022-04-19 16:27   ` Rafael J. Wysocki
2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
2022-04-15 19:20 ` [patch 08/10] x86/aperfmperf: Store aperf/mperf data for cpu frequency reads Thomas Gleixner
2022-04-19 16:30   ` Rafael J. Wysocki
2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
2022-04-15 19:20 ` [patch 09/10] x86/aperfmperf: Replace aperfmperf_get_khz() Thomas Gleixner
2022-04-19 16:35   ` Rafael J. Wysocki
2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
2022-04-15 19:20 ` [patch 10/10] x86/aperfmperf: Replace arch_freq_get_on_cpu() Thomas Gleixner
2022-04-19 16:37   ` Rafael J. Wysocki
2022-04-27 13:56   ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
2022-04-27 18:27   ` tip-bot2 for Thomas Gleixner
2022-04-19 15:51 ` [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Eric Dumazet
2022-04-19 20:39   ` Thomas Gleixner
2022-04-19 21:20     ` Eric Dumazet
2022-04-19 16:41 ` Peter Zijlstra
2022-04-19 17:32 ` Doug Smythies
2022-04-19 18:49   ` Rafael J. Wysocki
2022-04-19 21:11     ` Thomas Gleixner
2022-04-20 22:08       ` Doug Smythies
2022-04-25 15:45         ` Thomas Gleixner
2022-04-25 23:20           ` Doug Smythies
2022-04-27 13:56           ` [tip: x86/cleanups] x86/aperfmperf: Integrate the fallback code from show_cpuinfo() tip-bot2 for Thomas Gleixner
2022-04-27 18:27           ` tip-bot2 for Thomas Gleixner
2022-04-19 21:56 ` [patch 00/10] x86/cpu: Consolidate APERF/MPERF code Paul E. McKenney
