[PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance
@ 2021-06-16  6:48 Viresh Kumar
  2021-06-16  6:48 ` [PATCH V2 1/3] cpufreq: Add start_cpu() and stop_cpu() callbacks Viresh Kumar
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Viresh Kumar @ 2021-06-16  6:48 UTC (permalink / raw)
  To: Rafael Wysocki, Ionela Voinescu, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Greg Kroah-Hartman,
	Ingo Molnar, Jonathan Corbet, Juri Lelli, Mel Gorman,
	Peter Zijlstra, Rafael J. Wysocki, Steven Rostedt, Sudeep Holla,
	Vincent Guittot, Viresh Kumar
  Cc: linux-pm, Qian Cai, linux-acpi, linux-doc, linux-kernel,
	Paul E. McKenney

Hello,

Changes since V1:

- Few of the patches migrating users to ->exit() callback are posted separately.

- The CPPC patch was completely reverted and so the support for FIE is again
  added here from scratch.

- The start_cpu() and stop_cpu() interface is reworked a little so stop_cpu() is
  only ever called for a CPU if start_cpu() was called for it earlier.

- A new patch to implement RCU locking in arch_topology core to avoid some
  races.

- Some cleanup and very clear/separate paths for FIE in cppc driver now.


-------------------------8<-------------------------

CPPC cpufreq driver is used for ARM servers and this patch series tries to
provide counter-based frequency invariance support for them in the absence for
architecture specific counters (like AMUs).

This was reverted earlier for the 5.13 kernel after Qian Cai reported kernel
oops during suspend/resume.

This is based of pm/linux-next + a cleanup patchset:

https://lore.kernel.org/linux-pm/cover.1623825358.git.viresh.kumar@linaro.org/

All the patches are pushed here together for people to run.

https://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git/log/?h=cpufreq/cppc

This is tested on my Hikey platform (without the actual read/write to
performance counters), with this script for over an hour:

while true; do
    for i in `seq 1 7`;
    do
        echo 0 > /sys/devices/system/cpu/cpu$i/online;
    done;

    for i in `seq 1 7`;
    do
        echo 1 > /sys/devices/system/cpu/cpu$i/online;
    done;
done


Vincent will be giving this patchset a try on ThunderX2.

Meanwhile it is up for review. Ideally I would like to get this merged for 5.14,
but lets see how it goes.

Thanks.

--
Viresh

Viresh Kumar (3):
  cpufreq: Add start_cpu() and stop_cpu() callbacks
  arch_topology: Avoid use-after-free for scale_freq_data
  cpufreq: CPPC: Add support for frequency invariance

 Documentation/cpu-freq/cpu-drivers.rst |   7 +-
 drivers/base/arch_topology.c           |  27 ++-
 drivers/cpufreq/Kconfig.arm            |  10 ++
 drivers/cpufreq/cppc_cpufreq.c         | 232 +++++++++++++++++++++++--
 drivers/cpufreq/cpufreq.c              |  19 +-
 include/linux/arch_topology.h          |   1 +
 include/linux/cpufreq.h                |   5 +-
 kernel/sched/core.c                    |   1 +
 8 files changed, 272 insertions(+), 30 deletions(-)

-- 
2.31.1.272.g89b43f80a514


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V2 1/3] cpufreq: Add start_cpu() and stop_cpu() callbacks
  2021-06-16  6:48 [PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance Viresh Kumar
@ 2021-06-16  6:48 ` Viresh Kumar
  2021-06-17 13:33   ` Rafael J. Wysocki
  2021-06-16  6:48 ` [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data Viresh Kumar
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Viresh Kumar @ 2021-06-16  6:48 UTC (permalink / raw)
  To: Rafael Wysocki, Ionela Voinescu, Viresh Kumar, Jonathan Corbet
  Cc: linux-pm, Vincent Guittot, Qian Cai, linux-doc, linux-kernel

On CPU hot-unplug, the cpufreq core doesn't call any driver specific
callback unless all the CPUs of a policy went away, in which case we
call stop_cpu() callback.

For the CPPC cpufreq driver, we need to perform per-cpu init/exit work
which that can't be performed from policy specific init()/exit()
callbacks.

This patch adds a new callback, start_cpu() and modifies the stop_cpu()
callback, to perform such CPU specific work.

These routines are called whenever a CPU is added or removed from a
policy.

Note that this also moves the setting of policy->cpus to online CPUs
only, outside of rwsem as we needed to call start_cpu() for online CPUs
only. This shouldn't have any side effects.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 Documentation/cpu-freq/cpu-drivers.rst |  7 +++++--
 drivers/cpufreq/cpufreq.c              | 19 +++++++++++++++----
 include/linux/cpufreq.h                |  5 ++++-
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/Documentation/cpu-freq/cpu-drivers.rst b/Documentation/cpu-freq/cpu-drivers.rst
index a697278ce190..15cfe42b4075 100644
--- a/Documentation/cpu-freq/cpu-drivers.rst
+++ b/Documentation/cpu-freq/cpu-drivers.rst
@@ -71,8 +71,11 @@ And optionally
  .exit - A pointer to a per-policy cleanup function called during
  CPU_POST_DEAD phase of cpu hotplug process.
 
- .stop_cpu - A pointer to a per-policy stop function called during
- CPU_DOWN_PREPARE phase of cpu hotplug process.
+ .start_cpu - A pointer to a per-policy per-cpu start function called
+ during CPU online phase.
+
+ .stop_cpu - A pointer to a per-policy per-cpu stop function called
+ during CPU offline phase.
 
  .suspend - A pointer to a per-policy suspend function which is called
  with interrupts disabled and _after_ the governor is stopped for the
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 802abc925b2a..128dfb1b0cdf 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1119,6 +1119,10 @@ static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy, unsigned int cp
 
 	cpumask_set_cpu(cpu, policy->cpus);
 
+	/* Do CPU specific initialization if required */
+	if (cpufreq_driver->start_cpu)
+		cpufreq_driver->start_cpu(policy, cpu);
+
 	if (has_target()) {
 		ret = cpufreq_start_governor(policy);
 		if (ret)
@@ -1375,13 +1379,19 @@ static int cpufreq_online(unsigned int cpu)
 		cpumask_copy(policy->related_cpus, policy->cpus);
 	}
 
-	down_write(&policy->rwsem);
 	/*
 	 * affected cpus must always be the one, which are online. We aren't
 	 * managing offline cpus here.
 	 */
 	cpumask_and(policy->cpus, policy->cpus, cpu_online_mask);
 
+	/* Do CPU specific initialization if required */
+	if (cpufreq_driver->start_cpu) {
+		for_each_cpu(j, policy->cpus)
+			cpufreq_driver->start_cpu(policy, j);
+	}
+
+	down_write(&policy->rwsem);
 	if (new_policy) {
 		for_each_cpu(j, policy->related_cpus) {
 			per_cpu(cpufreq_cpu_data, j) = policy;
@@ -1581,6 +1591,10 @@ static int cpufreq_offline(unsigned int cpu)
 		policy->cpu = cpumask_any(policy->cpus);
 	}
 
+	/* Do CPU specific de-initialization if required */
+	if (cpufreq_driver->stop_cpu)
+		cpufreq_driver->stop_cpu(policy, cpu);
+
 	/* Start governor again for active policy */
 	if (!policy_is_inactive(policy)) {
 		if (has_target()) {
@@ -1597,9 +1611,6 @@ static int cpufreq_offline(unsigned int cpu)
 		policy->cdev = NULL;
 	}
 
-	if (cpufreq_driver->stop_cpu)
-		cpufreq_driver->stop_cpu(policy);
-
 	if (has_target())
 		cpufreq_exit_governor(policy);
 
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 353969c7acd3..c281b3df4e2f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -371,7 +371,10 @@ struct cpufreq_driver {
 	int		(*online)(struct cpufreq_policy *policy);
 	int		(*offline)(struct cpufreq_policy *policy);
 	int		(*exit)(struct cpufreq_policy *policy);
-	void		(*stop_cpu)(struct cpufreq_policy *policy);
+
+	/* CPU specific start/stop */
+	void		(*start_cpu)(struct cpufreq_policy *policy, unsigned int cpu);
+	void		(*stop_cpu)(struct cpufreq_policy *policy, unsigned int cpu);
 	int		(*suspend)(struct cpufreq_policy *policy);
 	int		(*resume)(struct cpufreq_policy *policy);
 
-- 
2.31.1.272.g89b43f80a514


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data
  2021-06-16  6:48 [PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance Viresh Kumar
  2021-06-16  6:48 ` [PATCH V2 1/3] cpufreq: Add start_cpu() and stop_cpu() callbacks Viresh Kumar
@ 2021-06-16  6:48 ` Viresh Kumar
  2021-06-16  7:57   ` Greg Kroah-Hartman
  2021-06-16 11:25   ` Ionela Voinescu
  2021-06-16  6:48 ` [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance Viresh Kumar
  2021-06-16 10:02 ` [PATCH V2 0/3] cpufreq: cppc: " Vincent Guittot
  3 siblings, 2 replies; 24+ messages in thread
From: Viresh Kumar @ 2021-06-16  6:48 UTC (permalink / raw)
  To: Rafael Wysocki, Ionela Voinescu, Sudeep Holla,
	Greg Kroah-Hartman, Rafael J. Wysocki
  Cc: Viresh Kumar, linux-pm, Vincent Guittot, Qian Cai,
	Paul E . McKenney, linux-kernel

Currently topology_scale_freq_tick() may end up using a pointer to
struct scale_freq_data, which was previously cleared by
topology_clear_scale_freq_source(), as there is no protection in place
here. The users of topology_clear_scale_freq_source() though needs a
guarantee that the previous scale_freq_data isn't used anymore.

Since topology_scale_freq_tick() is called from scheduler tick, we don't
want to add locking in there. Use the RCU update mechanism instead
(which is already used by the scheduler's utilization update path) to
guarantee race free updates here.

Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/base/arch_topology.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index c1179edc0f3b..921312a8d957 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -18,10 +18,11 @@
 #include <linux/cpumask.h>
 #include <linux/init.h>
 #include <linux/percpu.h>
+#include <linux/rcupdate.h>
 #include <linux/sched.h>
 #include <linux/smp.h>
 
-static DEFINE_PER_CPU(struct scale_freq_data *, sft_data);
+static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
 static struct cpumask scale_freq_counters_mask;
 static bool scale_freq_invariant;
 
@@ -66,16 +67,20 @@ void topology_set_scale_freq_source(struct scale_freq_data *data,
 	if (cpumask_empty(&scale_freq_counters_mask))
 		scale_freq_invariant = topology_scale_freq_invariant();
 
+	rcu_read_lock();
+
 	for_each_cpu(cpu, cpus) {
-		sfd = per_cpu(sft_data, cpu);
+		sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
 
 		/* Use ARCH provided counters whenever possible */
 		if (!sfd || sfd->source != SCALE_FREQ_SOURCE_ARCH) {
-			per_cpu(sft_data, cpu) = data;
+			rcu_assign_pointer(per_cpu(sft_data, cpu), data);
 			cpumask_set_cpu(cpu, &scale_freq_counters_mask);
 		}
 	}
 
+	rcu_read_unlock();
+
 	update_scale_freq_invariant(true);
 }
 EXPORT_SYMBOL_GPL(topology_set_scale_freq_source);
@@ -86,22 +91,32 @@ void topology_clear_scale_freq_source(enum scale_freq_source source,
 	struct scale_freq_data *sfd;
 	int cpu;
 
+	rcu_read_lock();
+
 	for_each_cpu(cpu, cpus) {
-		sfd = per_cpu(sft_data, cpu);
+		sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
 
 		if (sfd && sfd->source == source) {
-			per_cpu(sft_data, cpu) = NULL;
+			rcu_assign_pointer(per_cpu(sft_data, cpu), NULL);
 			cpumask_clear_cpu(cpu, &scale_freq_counters_mask);
 		}
 	}
 
+	rcu_read_unlock();
+
+	/*
+	 * Make sure all references to previous sft_data are dropped to avoid
+	 * use-after-free races.
+	 */
+	synchronize_rcu();
+
 	update_scale_freq_invariant(false);
 }
 EXPORT_SYMBOL_GPL(topology_clear_scale_freq_source);
 
 void topology_scale_freq_tick(void)
 {
-	struct scale_freq_data *sfd = *this_cpu_ptr(&sft_data);
+	struct scale_freq_data *sfd = rcu_dereference_sched(*this_cpu_ptr(&sft_data));
 
 	if (sfd)
 		sfd->set_freq_scale();
-- 
2.31.1.272.g89b43f80a514


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance
  2021-06-16  6:48 [PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance Viresh Kumar
  2021-06-16  6:48 ` [PATCH V2 1/3] cpufreq: Add start_cpu() and stop_cpu() callbacks Viresh Kumar
  2021-06-16  6:48 ` [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data Viresh Kumar
@ 2021-06-16  6:48 ` Viresh Kumar
  2021-06-16 12:48   ` Ionela Voinescu
  2021-06-16 10:02 ` [PATCH V2 0/3] cpufreq: cppc: " Vincent Guittot
  3 siblings, 1 reply; 24+ messages in thread
From: Viresh Kumar @ 2021-06-16  6:48 UTC (permalink / raw)
  To: Rafael Wysocki, Ionela Voinescu, Viresh Kumar, Sudeep Holla,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: linux-pm, Qian Cai, linux-acpi, linux-kernel

The Frequency Invariance Engine (FIE) is providing a frequency scaling
correction factor that helps achieve more accurate load-tracking.

Normally, this scaling factor can be obtained directly with the help of
the cpufreq drivers as they know the exact frequency the hardware is
running at. But that isn't the case for CPPC cpufreq driver.

Another way of obtaining that is using the arch specific counter
support, which is already present in kernel, but that hardware is
optional for platforms.

This patch updates the CPPC driver to register itself with the topology
core to provide its own implementation (cppc_scale_freq_tick()) of
topology_scale_freq_tick() which gets called by the scheduler on every
tick. Note that the arch specific counters have higher priority than
CPPC counters, if available, though the CPPC driver doesn't need to have
any special handling for that.

On an invocation of cppc_scale_freq_tick(), we schedule an irq work
(since we reach here from hard-irq context), which then schedules a
normal work item and cppc_scale_freq_workfn() updates the per_cpu
arch_freq_scale variable based on the counter updates since the last
tick.

To allow platforms to disable this CPPC counter-based frequency
invariance support, this is all done under CONFIG_ACPI_CPPC_CPUFREQ_FIE,
which is enabled by default.

This also exports sched_setattr_nocheck() as the CPPC driver can be
built as a module.

Cc: linux-acpi@vger.kernel.org
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/Kconfig.arm    |  10 ++
 drivers/cpufreq/cppc_cpufreq.c | 232 ++++++++++++++++++++++++++++++---
 include/linux/arch_topology.h  |   1 +
 kernel/sched/core.c            |   1 +
 4 files changed, 227 insertions(+), 17 deletions(-)

diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
index e65e0a43be64..a5c5f70acfc9 100644
--- a/drivers/cpufreq/Kconfig.arm
+++ b/drivers/cpufreq/Kconfig.arm
@@ -19,6 +19,16 @@ config ACPI_CPPC_CPUFREQ
 
 	  If in doubt, say N.
 
+config ACPI_CPPC_CPUFREQ_FIE
+	bool "Frequency Invariance support for CPPC cpufreq driver"
+	depends on ACPI_CPPC_CPUFREQ && GENERIC_ARCH_TOPOLOGY
+	default y
+	help
+	  This extends frequency invariance support in the CPPC cpufreq driver,
+	  by using CPPC delivered and reference performance counters.
+
+	  If in doubt, say N.
+
 config ARM_ALLWINNER_SUN50I_CPUFREQ_NVMEM
 	tristate "Allwinner nvmem based SUN50I CPUFreq driver"
 	depends on ARCH_SUNXI
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index be4f62e2c5f1..63e4cff46f95 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -10,14 +10,18 @@
 
 #define pr_fmt(fmt)	"CPPC Cpufreq:"	fmt
 
+#include <linux/arch_topology.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/delay.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
 #include <linux/dmi.h>
+#include <linux/irq_work.h>
+#include <linux/kthread.h>
 #include <linux/time.h>
 #include <linux/vmalloc.h>
+#include <uapi/linux/sched/types.h>
 
 #include <asm/unaligned.h>
 
@@ -57,6 +61,183 @@ static struct cppc_workaround_oem_info wa_info[] = {
 	}
 };
 
+#ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
+
+/* Frequency invariance support */
+struct cppc_freq_invariance {
+	int cpu;
+	struct irq_work irq_work;
+	struct kthread_work work;
+	struct cppc_perf_fb_ctrs prev_perf_fb_ctrs;
+	struct cppc_cpudata *cpu_data;
+};
+
+static DEFINE_PER_CPU(struct cppc_freq_invariance, cppc_freq_inv);
+static struct kthread_worker *kworker_fie;
+
+static struct cpufreq_driver cppc_cpufreq_driver;
+static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu);
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t1);
+
+/**
+ * cppc_scale_freq_workfn - CPPC arch_freq_scale updater for frequency invariance
+ * @work: The work item.
+ *
+ * The CPPC driver register itself with the topology core to provide its own
+ * implementation (cppc_scale_freq_tick()) of topology_scale_freq_tick() which
+ * gets called by the scheduler on every tick.
+ *
+ * Note that the arch specific counters have higher priority than CPPC counters,
+ * if available, though the CPPC driver doesn't need to have any special
+ * handling for that.
+ *
+ * On an invocation of cppc_scale_freq_tick(), we schedule an irq work (since we
+ * reach here from hard-irq context), which then schedules a normal work item
+ * and cppc_scale_freq_workfn() updates the per_cpu arch_freq_scale variable
+ * based on the counter updates since the last tick.
+ */
+static void cppc_scale_freq_workfn(struct kthread_work *work)
+{
+	struct cppc_freq_invariance *cppc_fi;
+	struct cppc_perf_fb_ctrs fb_ctrs = {0};
+	struct cppc_cpudata *cpu_data;
+	unsigned long local_freq_scale;
+	u64 perf;
+
+	cppc_fi = container_of(work, struct cppc_freq_invariance, work);
+	cpu_data = cppc_fi->cpu_data;
+
+	if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) {
+		pr_warn("%s: failed to read perf counters\n", __func__);
+		return;
+	}
+
+	perf = cppc_perf_from_fbctrs(cpu_data, &cppc_fi->prev_perf_fb_ctrs,
+				     &fb_ctrs);
+	cppc_fi->prev_perf_fb_ctrs = fb_ctrs;
+
+	perf <<= SCHED_CAPACITY_SHIFT;
+	local_freq_scale = div64_u64(perf, cpu_data->perf_caps.highest_perf);
+	if (WARN_ON(local_freq_scale > 1024))
+		local_freq_scale = 1024;
+
+	per_cpu(arch_freq_scale, cppc_fi->cpu) = local_freq_scale;
+}
+
+static void cppc_irq_work(struct irq_work *irq_work)
+{
+	struct cppc_freq_invariance *cppc_fi;
+
+	cppc_fi = container_of(irq_work, struct cppc_freq_invariance, irq_work);
+	kthread_queue_work(kworker_fie, &cppc_fi->work);
+}
+
+static void cppc_scale_freq_tick(void)
+{
+	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, smp_processor_id());
+
+	/*
+	 * cppc_get_perf_ctrs() can potentially sleep, call that from the right
+	 * context.
+	 */
+	irq_work_queue(&cppc_fi->irq_work);
+}
+
+static struct scale_freq_data cppc_sftd = {
+	.source = SCALE_FREQ_SOURCE_CPPC,
+	.set_freq_scale = cppc_scale_freq_tick,
+};
+
+static void cppc_cpufreq_start_cpu(struct cpufreq_policy *policy,
+				   unsigned int cpu)
+{
+	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+	int ret;
+
+	cppc_fi->cpu = cpu;
+	cppc_fi->cpu_data = policy->driver_data;
+	kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn);
+	init_irq_work(&cppc_fi->irq_work, cppc_irq_work);
+
+	ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs);
+	if (ret) {
+		pr_warn("%s: failed to read perf counters: %d\n", __func__,
+			ret);
+		return;
+	}
+
+	/* Register for freq-invariance */
+	topology_set_scale_freq_source(&cppc_sftd, cpumask_of(cpu));
+}
+
+static void cppc_cpufreq_stop_cpu(struct cpufreq_policy *policy,
+				  unsigned int cpu)
+{
+	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+
+	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, cpumask_of(cpu));
+
+	irq_work_sync(&cppc_fi->irq_work);
+	kthread_cancel_work_sync(&cppc_fi->work);
+}
+
+static void __init cppc_freq_invariance_init(void)
+{
+	struct sched_attr attr = {
+		.size		= sizeof(struct sched_attr),
+		.sched_policy	= SCHED_DEADLINE,
+		.sched_nice	= 0,
+		.sched_priority	= 0,
+		/*
+		 * Fake (unused) bandwidth; workaround to "fix"
+		 * priority inheritance.
+		 */
+		.sched_runtime	= 1000000,
+		.sched_deadline = 10000000,
+		.sched_period	= 10000000,
+	};
+	int ret;
+
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	kworker_fie = kthread_create_worker(0, "cppc_fie");
+	if (IS_ERR(kworker_fie))
+		return;
+
+	ret = sched_setattr_nocheck(kworker_fie->task, &attr);
+	if (ret) {
+		pr_warn("%s: failed to set SCHED_DEADLINE: %d\n", __func__,
+			ret);
+		kthread_destroy_worker(kworker_fie);
+		return;
+	}
+
+	cppc_cpufreq_driver.start_cpu = cppc_cpufreq_start_cpu;
+	cppc_cpufreq_driver.stop_cpu = cppc_cpufreq_stop_cpu;
+}
+
+static void cppc_freq_invariance_exit(void)
+{
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	kthread_destroy_worker(kworker_fie);
+	kworker_fie = NULL;
+}
+
+#else
+static inline void cppc_freq_invariance_init(void)
+{
+}
+
+static inline void cppc_freq_invariance_exit(void)
+{
+}
+#endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */
+
 /* Callback function used to retrieve the max frequency from DMI */
 static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
 {
@@ -362,26 +543,35 @@ static inline u64 get_delta(u64 t1, u64 t0)
 	return (u32)t1 - (u32)t0;
 }
 
-static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
-				     struct cppc_perf_fb_ctrs fb_ctrs_t0,
-				     struct cppc_perf_fb_ctrs fb_ctrs_t1)
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t1)
 {
 	u64 delta_reference, delta_delivered;
-	u64 reference_perf, delivered_perf;
+	u64 reference_perf;
 
-	reference_perf = fb_ctrs_t0.reference_perf;
+	reference_perf = fb_ctrs_t0->reference_perf;
 
-	delta_reference = get_delta(fb_ctrs_t1.reference,
-				    fb_ctrs_t0.reference);
-	delta_delivered = get_delta(fb_ctrs_t1.delivered,
-				    fb_ctrs_t0.delivered);
+	delta_reference = get_delta(fb_ctrs_t1->reference,
+				    fb_ctrs_t0->reference);
+	delta_delivered = get_delta(fb_ctrs_t1->delivered,
+				    fb_ctrs_t0->delivered);
 
-	/* Check to avoid divide-by zero */
-	if (delta_reference || delta_delivered)
-		delivered_perf = (reference_perf * delta_delivered) /
-					delta_reference;
-	else
-		delivered_perf = cpu_data->perf_ctrls.desired_perf;
+	/* Check to avoid divide-by zero and invalid delivered_perf */
+	if (!delta_reference || !delta_delivered)
+		return cpu_data->perf_ctrls.desired_perf;
+
+	return (reference_perf * delta_delivered) / delta_reference;
+}
+
+static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
+				     struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				     struct cppc_perf_fb_ctrs *fb_ctrs_t1)
+{
+	u64 delivered_perf;
+
+	delivered_perf = cppc_perf_from_fbctrs(cpu_data, fb_ctrs_t0,
+					       fb_ctrs_t1);
 
 	return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);
 }
@@ -405,7 +595,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
 	if (ret)
 		return ret;
 
-	return cppc_get_rate_from_fbctrs(cpu_data, fb_ctrs_t0, fb_ctrs_t1);
+	return cppc_get_rate_from_fbctrs(cpu_data, &fb_ctrs_t0, &fb_ctrs_t1);
 }
 
 static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
@@ -506,14 +696,21 @@ static void cppc_check_hisi_workaround(void)
 
 static int __init cppc_cpufreq_init(void)
 {
+	int ret;
+
 	if ((acpi_disabled) || !acpi_cpc_valid())
 		return -ENODEV;
 
 	INIT_LIST_HEAD(&cpu_data_list);
 
 	cppc_check_hisi_workaround();
+	cppc_freq_invariance_init();
 
-	return cpufreq_register_driver(&cppc_cpufreq_driver);
+	ret = cpufreq_register_driver(&cppc_cpufreq_driver);
+	if (ret)
+		cppc_freq_invariance_exit();
+
+	return ret;
 }
 
 static inline void free_cpu_data(void)
@@ -531,6 +728,7 @@ static inline void free_cpu_data(void)
 static void __exit cppc_cpufreq_exit(void)
 {
 	cpufreq_unregister_driver(&cppc_cpufreq_driver);
+	cppc_freq_invariance_exit();
 
 	free_cpu_data();
 }
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 11e555cfaecb..f180240dc95f 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -37,6 +37,7 @@ bool topology_scale_freq_invariant(void);
 enum scale_freq_source {
 	SCALE_FREQ_SOURCE_CPUFREQ = 0,
 	SCALE_FREQ_SOURCE_ARCH,
+	SCALE_FREQ_SOURCE_CPPC,
 };
 
 struct scale_freq_data {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ca80df205ce..5226cc26a095 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6389,6 +6389,7 @@ int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)
 {
 	return __sched_setscheduler(p, attr, false, true);
 }
+EXPORT_SYMBOL_GPL(sched_setattr_nocheck);
 
 /**
  * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
-- 
2.31.1.272.g89b43f80a514


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data
  2021-06-16  6:48 ` [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data Viresh Kumar
@ 2021-06-16  7:57   ` Greg Kroah-Hartman
  2021-06-16  8:18     ` Viresh Kumar
  2021-06-16 11:25   ` Ionela Voinescu
  1 sibling, 1 reply; 24+ messages in thread
From: Greg Kroah-Hartman @ 2021-06-16  7:57 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael Wysocki, Ionela Voinescu, Sudeep Holla, Rafael J. Wysocki,
	linux-pm, Vincent Guittot, Qian Cai, Paul E . McKenney,
	linux-kernel

On Wed, Jun 16, 2021 at 12:18:08PM +0530, Viresh Kumar wrote:
> Currently topology_scale_freq_tick() may end up using a pointer to
> struct scale_freq_data, which was previously cleared by
> topology_clear_scale_freq_source(), as there is no protection in place
> here. The users of topology_clear_scale_freq_source() though needs a
> guarantee that the previous scale_freq_data isn't used anymore.
> 
> Since topology_scale_freq_tick() is called from scheduler tick, we don't
> want to add locking in there. Use the RCU update mechanism instead
> (which is already used by the scheduler's utilization update path) to
> guarantee race free updates here.
> 
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

So this is a bugfix for problems in the current codebase?  What commit
does this fix?  Should it go to the stable kernels?

> ---
>  drivers/base/arch_topology.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index c1179edc0f3b..921312a8d957 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -18,10 +18,11 @@
>  #include <linux/cpumask.h>
>  #include <linux/init.h>
>  #include <linux/percpu.h>
> +#include <linux/rcupdate.h>
>  #include <linux/sched.h>
>  #include <linux/smp.h>
>  
> -static DEFINE_PER_CPU(struct scale_freq_data *, sft_data);
> +static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
>  static struct cpumask scale_freq_counters_mask;
>  static bool scale_freq_invariant;
>  
> @@ -66,16 +67,20 @@ void topology_set_scale_freq_source(struct scale_freq_data *data,
>  	if (cpumask_empty(&scale_freq_counters_mask))
>  		scale_freq_invariant = topology_scale_freq_invariant();
>  
> +	rcu_read_lock();
> +
>  	for_each_cpu(cpu, cpus) {
> -		sfd = per_cpu(sft_data, cpu);
> +		sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
>  
>  		/* Use ARCH provided counters whenever possible */
>  		if (!sfd || sfd->source != SCALE_FREQ_SOURCE_ARCH) {
> -			per_cpu(sft_data, cpu) = data;
> +			rcu_assign_pointer(per_cpu(sft_data, cpu), data);
>  			cpumask_set_cpu(cpu, &scale_freq_counters_mask);
>  		}
>  	}
>  
> +	rcu_read_unlock();
> +
>  	update_scale_freq_invariant(true);
>  }
>  EXPORT_SYMBOL_GPL(topology_set_scale_freq_source);
> @@ -86,22 +91,32 @@ void topology_clear_scale_freq_source(enum scale_freq_source source,
>  	struct scale_freq_data *sfd;
>  	int cpu;
>  
> +	rcu_read_lock();
> +
>  	for_each_cpu(cpu, cpus) {
> -		sfd = per_cpu(sft_data, cpu);
> +		sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
>  
>  		if (sfd && sfd->source == source) {
> -			per_cpu(sft_data, cpu) = NULL;
> +			rcu_assign_pointer(per_cpu(sft_data, cpu), NULL);
>  			cpumask_clear_cpu(cpu, &scale_freq_counters_mask);
>  		}
>  	}
>  
> +	rcu_read_unlock();
> +
> +	/*
> +	 * Make sure all references to previous sft_data are dropped to avoid
> +	 * use-after-free races.
> +	 */
> +	synchronize_rcu();

What race is happening?  How could the current code race?  Only when a
cpu is removed?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data
  2021-06-16  7:57   ` Greg Kroah-Hartman
@ 2021-06-16  8:18     ` Viresh Kumar
  2021-06-16  8:31       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 24+ messages in thread
From: Viresh Kumar @ 2021-06-16  8:18 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Rafael Wysocki, Ionela Voinescu, Sudeep Holla, Rafael J. Wysocki,
	linux-pm, Vincent Guittot, Qian Cai, Paul E . McKenney,
	linux-kernel

On 16-06-21, 09:57, Greg Kroah-Hartman wrote:
> On Wed, Jun 16, 2021 at 12:18:08PM +0530, Viresh Kumar wrote:
> > Currently topology_scale_freq_tick() may end up using a pointer to
> > struct scale_freq_data, which was previously cleared by
> > topology_clear_scale_freq_source(), as there is no protection in place
> > here. The users of topology_clear_scale_freq_source() though needs a
> > guarantee that the previous scale_freq_data isn't used anymore.
> > 
> > Since topology_scale_freq_tick() is called from scheduler tick, we don't
> > want to add locking in there. Use the RCU update mechanism instead
> > (which is already used by the scheduler's utilization update path) to
> > guarantee race free updates here.
> > 
> > Cc: Paul E. McKenney <paulmck@kernel.org>
> > Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> 
> So this is a bugfix for problems in the current codebase?  What commit
> does this fix?  Should it go to the stable kernels?

There is only one user of topology_clear_scale_freq_source()
(cppc-cpufreq driver, which is already reverted in pm/linux-next). So
in the upcoming 5.13 kernel release, there will be no one using this
API and so no one will break.

And so I skipped the fixes tag, I can add it though.

> > ---
> >  drivers/base/arch_topology.c | 27 +++++++++++++++++++++------
> >  1 file changed, 21 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index c1179edc0f3b..921312a8d957 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -18,10 +18,11 @@
> >  #include <linux/cpumask.h>
> >  #include <linux/init.h>
> >  #include <linux/percpu.h>
> > +#include <linux/rcupdate.h>
> >  #include <linux/sched.h>
> >  #include <linux/smp.h>
> >  
> > -static DEFINE_PER_CPU(struct scale_freq_data *, sft_data);
> > +static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
> >  static struct cpumask scale_freq_counters_mask;
> >  static bool scale_freq_invariant;
> >  
> > @@ -66,16 +67,20 @@ void topology_set_scale_freq_source(struct scale_freq_data *data,
> >  	if (cpumask_empty(&scale_freq_counters_mask))
> >  		scale_freq_invariant = topology_scale_freq_invariant();
> >  
> > +	rcu_read_lock();
> > +
> >  	for_each_cpu(cpu, cpus) {
> > -		sfd = per_cpu(sft_data, cpu);
> > +		sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
> >  
> >  		/* Use ARCH provided counters whenever possible */
> >  		if (!sfd || sfd->source != SCALE_FREQ_SOURCE_ARCH) {
> > -			per_cpu(sft_data, cpu) = data;
> > +			rcu_assign_pointer(per_cpu(sft_data, cpu), data);
> >  			cpumask_set_cpu(cpu, &scale_freq_counters_mask);
> >  		}
> >  	}
> >  
> > +	rcu_read_unlock();
> > +
> >  	update_scale_freq_invariant(true);
> >  }
> >  EXPORT_SYMBOL_GPL(topology_set_scale_freq_source);
> > @@ -86,22 +91,32 @@ void topology_clear_scale_freq_source(enum scale_freq_source source,
> >  	struct scale_freq_data *sfd;
> >  	int cpu;
> >  
> > +	rcu_read_lock();
> > +
> >  	for_each_cpu(cpu, cpus) {
> > -		sfd = per_cpu(sft_data, cpu);
> > +		sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
> >  
> >  		if (sfd && sfd->source == source) {
> > -			per_cpu(sft_data, cpu) = NULL;
> > +			rcu_assign_pointer(per_cpu(sft_data, cpu), NULL);
> >  			cpumask_clear_cpu(cpu, &scale_freq_counters_mask);
> >  		}
> >  	}
> >  
> > +	rcu_read_unlock();
> > +
> > +	/*
> > +	 * Make sure all references to previous sft_data are dropped to avoid
> > +	 * use-after-free races.
> > +	 */
> > +	synchronize_rcu();
> 
> What race is happening?  How could the current code race?  Only when a
> cpu is removed?

topology_scale_freq_tick() is called by the scheduler for each CPU
from scheduler_tick().

It is possible that topology_scale_freq_tick() ends up using an older
copy of sft_data pointer, while it is being removed by
topology_clear_scale_freq_source() because a CPU went away or a
cpufreq driver went away, or during normal suspend/resume (where CPUs
are hot-unplugged).

synchronize_rcu() makes sure that all RCU critical sections that
started before it is called, will finish before it returns. And so the
callers of topology_clear_scale_freq_source() don't need to worry
about their callback getting called anymore.

-- 
viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data
  2021-06-16  8:18     ` Viresh Kumar
@ 2021-06-16  8:31       ` Greg Kroah-Hartman
  2021-06-16  9:10         ` Viresh Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Greg Kroah-Hartman @ 2021-06-16  8:31 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael Wysocki, Ionela Voinescu, Sudeep Holla, Rafael J. Wysocki,
	linux-pm, Vincent Guittot, Qian Cai, Paul E . McKenney,
	linux-kernel

On Wed, Jun 16, 2021 at 01:48:59PM +0530, Viresh Kumar wrote:
> On 16-06-21, 09:57, Greg Kroah-Hartman wrote:
> > On Wed, Jun 16, 2021 at 12:18:08PM +0530, Viresh Kumar wrote:
> > > Currently topology_scale_freq_tick() may end up using a pointer to
> > > struct scale_freq_data, which was previously cleared by
> > > topology_clear_scale_freq_source(), as there is no protection in place
> > > here. The users of topology_clear_scale_freq_source() though needs a
> > > guarantee that the previous scale_freq_data isn't used anymore.
> > > 
> > > Since topology_scale_freq_tick() is called from scheduler tick, we don't
> > > want to add locking in there. Use the RCU update mechanism instead
> > > (which is already used by the scheduler's utilization update path) to
> > > guarantee race free updates here.
> > > 
> > > Cc: Paul E. McKenney <paulmck@kernel.org>
> > > Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> > 
> > So this is a bugfix for problems in the current codebase?  What commit
> > does this fix?  Should it go to the stable kernels?
> 
> There is only one user of topology_clear_scale_freq_source()
> (cppc-cpufreq driver, which is already reverted in pm/linux-next). So
> in the upcoming 5.13 kernel release, there will be no one using this
> API and so no one will break.
> 
> And so I skipped the fixes tag, I can add it though.

It would be nice to have to answer this type of question, otherwise you
will have automated scripts trying to backport this to kernels where it
does not belong :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data
  2021-06-16  8:31       ` Greg Kroah-Hartman
@ 2021-06-16  9:10         ` Viresh Kumar
  0 siblings, 0 replies; 24+ messages in thread
From: Viresh Kumar @ 2021-06-16  9:10 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Rafael Wysocki, Ionela Voinescu, Sudeep Holla, Rafael J. Wysocki,
	linux-pm, Vincent Guittot, Qian Cai, Paul E . McKenney,
	linux-kernel

On 16-06-21, 10:31, Greg Kroah-Hartman wrote:
> On Wed, Jun 16, 2021 at 01:48:59PM +0530, Viresh Kumar wrote:
> > On 16-06-21, 09:57, Greg Kroah-Hartman wrote:
> > > On Wed, Jun 16, 2021 at 12:18:08PM +0530, Viresh Kumar wrote:
> > > > Currently topology_scale_freq_tick() may end up using a pointer to
> > > > struct scale_freq_data, which was previously cleared by
> > > > topology_clear_scale_freq_source(), as there is no protection in place
> > > > here. The users of topology_clear_scale_freq_source() though needs a
> > > > guarantee that the previous scale_freq_data isn't used anymore.
> > > > 
> > > > Since topology_scale_freq_tick() is called from scheduler tick, we don't
> > > > want to add locking in there. Use the RCU update mechanism instead
> > > > (which is already used by the scheduler's utilization update path) to
> > > > guarantee race free updates here.
> > > > 
> > > > Cc: Paul E. McKenney <paulmck@kernel.org>
> > > > Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> > > 
> > > So this is a bugfix for problems in the current codebase?  What commit
> > > does this fix?  Should it go to the stable kernels?
> > 
> > There is only one user of topology_clear_scale_freq_source()
> > (cppc-cpufreq driver, which is already reverted in pm/linux-next). So
> > in the upcoming 5.13 kernel release, there will be no one using this
> > API and so no one will break.
> > 
> > And so I skipped the fixes tag, I can add it though.
> 
> It would be nice to have to answer this type of question, otherwise you
> will have automated scripts trying to backport this to kernels where it
> does not belong :)

Sure, I will add these to the next version.

-- 
viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance
  2021-06-16  6:48 [PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance Viresh Kumar
                   ` (2 preceding siblings ...)
  2021-06-16  6:48 ` [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance Viresh Kumar
@ 2021-06-16 10:02 ` Vincent Guittot
  2021-06-16 11:54   ` Viresh Kumar
  3 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2021-06-16 10:02 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael Wysocki, Ionela Voinescu, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Greg Kroah-Hartman,
	Ingo Molnar, Jonathan Corbet, Juri Lelli, Mel Gorman,
	Peter Zijlstra, Rafael J. Wysocki, Steven Rostedt, Sudeep Holla,
	open list:THERMAL, Qian Cai, ACPI Devel Maling List,
	Linux Doc Mailing List, linux-kernel, Paul E. McKenney

On Wed, 16 Jun 2021 at 08:48, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> Hello,
>
> Changes since V1:
>
> - Few of the patches migrating users to ->exit() callback are posted separately.
>
> - The CPPC patch was completely reverted and so the support for FIE is again
>   added here from scratch.
>
> - The start_cpu() and stop_cpu() interface is reworked a little so stop_cpu() is
>   only ever called for a CPU if start_cpu() was called for it earlier.
>
> - A new patch to implement RCU locking in arch_topology core to avoid some
>   races.
>
> - Some cleanup and very clear/separate paths for FIE in cppc driver now.
>
>
> -------------------------8<-------------------------
>
> CPPC cpufreq driver is used for ARM servers and this patch series tries to
> provide counter-based frequency invariance support for them in the absence for
> architecture specific counters (like AMUs).
>
> This was reverted earlier for the 5.13 kernel after Qian Cai reported kernel
> oops during suspend/resume.
>
> This is based of pm/linux-next + a cleanup patchset:
>
> https://lore.kernel.org/linux-pm/cover.1623825358.git.viresh.kumar@linaro.org/
>
> All the patches are pushed here together for people to run.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git/log/?h=cpufreq/cppc
>
> This is tested on my Hikey platform (without the actual read/write to
> performance counters), with this script for over an hour:
>
> while true; do
>     for i in `seq 1 7`;
>     do
>         echo 0 > /sys/devices/system/cpu/cpu$i/online;
>     done;
>
>     for i in `seq 1 7`;
>     do
>         echo 1 > /sys/devices/system/cpu/cpu$i/online;
>     done;
> done
>
>
> Vincent will be giving this patchset a try on ThunderX2.

I tested your branch and got the following while booting:

[   24.454543] zswap: loaded using pool lzo/zbud
[   24.454753] pstore: Using crash dump compression: deflate
[   24.454776] AppArmor: AppArmor sha1 policy hashing enabled
[   24.454784] ima: No TPM chip found, activating TPM-bypass!
[   24.454789] ima: Allocated hash algorithm: sha256
[   24.454801] ima: No architecture policies found
[   24.455750] pcieport 0000:0f:00.0: Adding to iommu group 0
[   24.893888] ------------[ cut here ]------------
[   24.893891] WARNING: CPU: 95 PID: 1442 at
drivers/cpufreq/cppc_cpufreq.c:123 cppc_scale_freq_workfn+0xc8/0xf8
[   24.893901] Modules linked in:
[   24.893906] CPU: 95 PID: 1442 Comm: cppc_fie Not tainted 5.13.0-rc6+ #359
[   24.893910] Hardware name: To be filled by O.E.M. Saber/Saber, BIOS
0ACKL026 03/19/2019
[   24.893912] pstate: 20400009 (nzCv daif +PAN -UAO -TCO BTYPE=--)
[   24.893915] pc : cppc_scale_freq_workfn+0xc8/0xf8
[   24.893918] lr : cppc_scale_freq_workfn+0x5c/0xf8
[   24.893921] sp : ffff80003727bd90
[   24.893922] x29: ffff80003727bd90 x28: 0000000000000000 x27: ffff800010ec2000
[   24.893928] x26: ffff800010ec2000 x25: ffff8000107c3d90 x24: 0000000000000001
[   24.893932] x23: ffff000816244880 x22: ffff8000113f9000 x21: ffff009f825a0a80
[   24.893935] x20: ffff009efc394220 x19: ffff800011199000 x18: 000000000000001b
[   24.893939] x17: 0000000000000007 x16: 0000000000000001 x15: 00000000000000bf
[   24.893943] x14: 0000000000000016 x13: 000000000000029b x12: 0000000000000016
[   24.893946] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff009efc6958c0
[   24.893950] x8 : ffff009efc394248 x7 : 0000000002bde780 x6 : 00000000ffffffff
[   24.893954] x5 : 00000000916e502a x4 : 00000000d9730e80 x3 : ffffffffffffffff
[   24.893958] x2 : 00000000001e8480 x1 : 00000000002625a0 x0 : 0000000000000401
[   24.893962] Call trace:
[   24.893964]  cppc_scale_freq_workfn+0xc8/0xf8
[   24.893967]  kthread_worker_fn+0x110/0x318
[   24.893971]  kthread+0xf4/0x120
[   24.893973]  ret_from_fork+0x10/0x18
[   24.893977] ---[ end trace ea6dbaf832bce3e4 ]---


>
> Meanwhile it is up for review. Ideally I would like to get this merged for 5.14,
> but lets see how it goes.
>
> Thanks.
>
> --
> Viresh
>
> Viresh Kumar (3):
>   cpufreq: Add start_cpu() and stop_cpu() callbacks
>   arch_topology: Avoid use-after-free for scale_freq_data
>   cpufreq: CPPC: Add support for frequency invariance
>
>  Documentation/cpu-freq/cpu-drivers.rst |   7 +-
>  drivers/base/arch_topology.c           |  27 ++-
>  drivers/cpufreq/Kconfig.arm            |  10 ++
>  drivers/cpufreq/cppc_cpufreq.c         | 232 +++++++++++++++++++++++--
>  drivers/cpufreq/cpufreq.c              |  19 +-
>  include/linux/arch_topology.h          |   1 +
>  include/linux/cpufreq.h                |   5 +-
>  kernel/sched/core.c                    |   1 +
>  8 files changed, 272 insertions(+), 30 deletions(-)
>
> --
> 2.31.1.272.g89b43f80a514
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data
  2021-06-16  6:48 ` [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data Viresh Kumar
  2021-06-16  7:57   ` Greg Kroah-Hartman
@ 2021-06-16 11:25   ` Ionela Voinescu
  2021-06-16 11:36     ` Viresh Kumar
  1 sibling, 1 reply; 24+ messages in thread
From: Ionela Voinescu @ 2021-06-16 11:25 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael Wysocki, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-pm, Vincent Guittot, Qian Cai,
	Paul E . McKenney, linux-kernel

Hey,

On Wednesday 16 Jun 2021 at 12:18:08 (+0530), Viresh Kumar wrote:
> Currently topology_scale_freq_tick() may end up using a pointer to
> struct scale_freq_data, which was previously cleared by
> topology_clear_scale_freq_source(), as there is no protection in place
> here. The users of topology_clear_scale_freq_source() though needs a
> guarantee that the previous scale_freq_data isn't used anymore.
> 

Please correct me if I'm wrong, but my understanding is that this is
only a problem for the cppc cpufreq invariance functionality. Let's
consider a scenario where CPUs are either hotplugged out or the cpufreq
CPPC driver module is removed; topology_clear_scale_freq_source() would
get called and the sfd_data will be set to NULL. But if at the same
time topology_scale_freq_tick() got an old reference of sfd_data,
set_freq_scale() will be called. This is only a problem for CPPC cpufreq
as cppc_scale_freq_tick() will end up using driver internal data that
might have been freed during the hotplug callbacks or the exit path.

If this is the case, wouldn't the synchronisation issue be better
resolved in the CPPC cpufreq driver, rather than here?

Thank you,
Ionela.

> Since topology_scale_freq_tick() is called from scheduler tick, we don't
> want to add locking in there. Use the RCU update mechanism instead
> (which is already used by the scheduler's utilization update path) to
> guarantee race free updates here.
> 
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  drivers/base/arch_topology.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index c1179edc0f3b..921312a8d957 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -18,10 +18,11 @@
>  #include <linux/cpumask.h>
>  #include <linux/init.h>
>  #include <linux/percpu.h>
> +#include <linux/rcupdate.h>
>  #include <linux/sched.h>
>  #include <linux/smp.h>
>  
> -static DEFINE_PER_CPU(struct scale_freq_data *, sft_data);
> +static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
>  static struct cpumask scale_freq_counters_mask;
>  static bool scale_freq_invariant;
>  
> @@ -66,16 +67,20 @@ void topology_set_scale_freq_source(struct scale_freq_data *data,
>  	if (cpumask_empty(&scale_freq_counters_mask))
>  		scale_freq_invariant = topology_scale_freq_invariant();
>  
> +	rcu_read_lock();
> +
>  	for_each_cpu(cpu, cpus) {
> -		sfd = per_cpu(sft_data, cpu);
> +		sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
>  
>  		/* Use ARCH provided counters whenever possible */
>  		if (!sfd || sfd->source != SCALE_FREQ_SOURCE_ARCH) {
> -			per_cpu(sft_data, cpu) = data;
> +			rcu_assign_pointer(per_cpu(sft_data, cpu), data);
>  			cpumask_set_cpu(cpu, &scale_freq_counters_mask);
>  		}
>  	}
>  
> +	rcu_read_unlock();
> +
>  	update_scale_freq_invariant(true);
>  }
>  EXPORT_SYMBOL_GPL(topology_set_scale_freq_source);
> @@ -86,22 +91,32 @@ void topology_clear_scale_freq_source(enum scale_freq_source source,
>  	struct scale_freq_data *sfd;
>  	int cpu;
>  
> +	rcu_read_lock();
> +
>  	for_each_cpu(cpu, cpus) {
> -		sfd = per_cpu(sft_data, cpu);
> +		sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
>  
>  		if (sfd && sfd->source == source) {
> -			per_cpu(sft_data, cpu) = NULL;
> +			rcu_assign_pointer(per_cpu(sft_data, cpu), NULL);
>  			cpumask_clear_cpu(cpu, &scale_freq_counters_mask);
>  		}
>  	}
>  
> +	rcu_read_unlock();
> +
> +	/*
> +	 * Make sure all references to previous sft_data are dropped to avoid
> +	 * use-after-free races.
> +	 */
> +	synchronize_rcu();
> +
>  	update_scale_freq_invariant(false);
>  }
>  EXPORT_SYMBOL_GPL(topology_clear_scale_freq_source);
>  
>  void topology_scale_freq_tick(void)
>  {
> -	struct scale_freq_data *sfd = *this_cpu_ptr(&sft_data);
> +	struct scale_freq_data *sfd = rcu_dereference_sched(*this_cpu_ptr(&sft_data));
>  
>  	if (sfd)
>  		sfd->set_freq_scale();
> -- 
> 2.31.1.272.g89b43f80a514
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data
  2021-06-16 11:25   ` Ionela Voinescu
@ 2021-06-16 11:36     ` Viresh Kumar
  2021-06-16 12:00       ` Ionela Voinescu
  0 siblings, 1 reply; 24+ messages in thread
From: Viresh Kumar @ 2021-06-16 11:36 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Rafael Wysocki, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-pm, Vincent Guittot, Qian Cai,
	Paul E . McKenney, linux-kernel

Hi Ionela,

On 16-06-21, 12:25, Ionela Voinescu wrote:
> Please correct me if I'm wrong, but my understanding is that this is
> only a problem for the cppc cpufreq invariance functionality. Let's
> consider a scenario where CPUs are either hotplugged out or the cpufreq
> CPPC driver module is removed; topology_clear_scale_freq_source() would
> get called and the sfd_data will be set to NULL. But if at the same
> time topology_scale_freq_tick() got an old reference of sfd_data,
> set_freq_scale() will be called. This is only a problem for CPPC cpufreq
> as cppc_scale_freq_tick() will end up using driver internal data that
> might have been freed during the hotplug callbacks or the exit path.

For now, yes, CPPC is the only one affected.

> If this is the case, wouldn't the synchronisation issue be better
> resolved in the CPPC cpufreq driver, rather than here?

Hmm, the way I see it is that topology_clear_scale_freq_source() is an API
provided by topology core and the topology core needs to guarantee that it
doesn't use the data any longer after topology_clear_scale_freq_source() is
called.

The same is true for other APIs, like:

irq_work_sync();
kthread_cancel_work_sync();

It isn't the user which needs to take this into account, but the API provider.

There may be more users of this in the future, lets say another cpufreq driver,
and so keeping this synchronization at the API provider is the right thing to do
IMHO.

And from the user's perspective, like cppc, it doesn't have any control over who
is using its callback and how and when. It is very very difficult to provide
something like this at the users, redundant anyway. For example cppc won't ever
know when topology_scale_freq_tick() has stopped calling its callback.

For example this is what cppc driver needs to do now:

+static void cppc_cpufreq_stop_cpu(struct cpufreq_policy *policy,
+                                 unsigned int cpu)
+{
+       struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+
+       topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, cpumask_of(cpu));
+       irq_work_sync(&cppc_fi->irq_work);
+       kthread_cancel_work_sync(&cppc_fi->work);
+}

The driver uses APIs provided by 3 layers, topology, irq-work, kthread and all
must provide these guarantees.

A very similar thing is implemented in kernel/sched/cpufreq.c for example.

-- 
viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance
  2021-06-16 10:02 ` [PATCH V2 0/3] cpufreq: cppc: " Vincent Guittot
@ 2021-06-16 11:54   ` Viresh Kumar
  0 siblings, 0 replies; 24+ messages in thread
From: Viresh Kumar @ 2021-06-16 11:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael Wysocki, Ionela Voinescu, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Greg Kroah-Hartman,
	Ingo Molnar, Jonathan Corbet, Juri Lelli, Mel Gorman,
	Peter Zijlstra, Rafael J. Wysocki, Steven Rostedt, Sudeep Holla,
	open list:THERMAL, Qian Cai, ACPI Devel Maling List,
	Linux Doc Mailing List, linux-kernel, Paul E. McKenney

On 16-06-21, 12:02, Vincent Guittot wrote:
> I tested your branch and got the following while booting:
> 
> [   24.454543] zswap: loaded using pool lzo/zbud
> [   24.454753] pstore: Using crash dump compression: deflate
> [   24.454776] AppArmor: AppArmor sha1 policy hashing enabled
> [   24.454784] ima: No TPM chip found, activating TPM-bypass!
> [   24.454789] ima: Allocated hash algorithm: sha256
> [   24.454801] ima: No architecture policies found
> [   24.455750] pcieport 0000:0f:00.0: Adding to iommu group 0
> [   24.893888] ------------[ cut here ]------------
> [   24.893891] WARNING: CPU: 95 PID: 1442 at
> drivers/cpufreq/cppc_cpufreq.c:123 cppc_scale_freq_workfn+0xc8/0xf8
> [   24.893901] Modules linked in:
> [   24.893906] CPU: 95 PID: 1442 Comm: cppc_fie Not tainted 5.13.0-rc6+ #359
> [   24.893910] Hardware name: To be filled by O.E.M. Saber/Saber, BIOS
> 0ACKL026 03/19/2019
> [   24.893912] pstate: 20400009 (nzCv daif +PAN -UAO -TCO BTYPE=--)
> [   24.893915] pc : cppc_scale_freq_workfn+0xc8/0xf8
> [   24.893918] lr : cppc_scale_freq_workfn+0x5c/0xf8
> [   24.893921] sp : ffff80003727bd90
> [   24.893922] x29: ffff80003727bd90 x28: 0000000000000000 x27: ffff800010ec2000
> [   24.893928] x26: ffff800010ec2000 x25: ffff8000107c3d90 x24: 0000000000000001
> [   24.893932] x23: ffff000816244880 x22: ffff8000113f9000 x21: ffff009f825a0a80
> [   24.893935] x20: ffff009efc394220 x19: ffff800011199000 x18: 000000000000001b
> [   24.893939] x17: 0000000000000007 x16: 0000000000000001 x15: 00000000000000bf
> [   24.893943] x14: 0000000000000016 x13: 000000000000029b x12: 0000000000000016
> [   24.893946] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff009efc6958c0
> [   24.893950] x8 : ffff009efc394248 x7 : 0000000002bde780 x6 : 00000000ffffffff
> [   24.893954] x5 : 00000000916e502a x4 : 00000000d9730e80 x3 : ffffffffffffffff
> [   24.893958] x2 : 00000000001e8480 x1 : 00000000002625a0 x0 : 0000000000000401
> [   24.893962] Call trace:
> [   24.893964]  cppc_scale_freq_workfn+0xc8/0xf8
> [   24.893967]  kthread_worker_fn+0x110/0x318
> [   24.893971]  kthread+0xf4/0x120
> [   24.893973]  ret_from_fork+0x10/0x18
> [   24.893977] ---[ end trace ea6dbaf832bce3e4 ]---

Thanks Vincent.

This is triggering from cppc_scale_freq_workfn():

        if (WARN_ON(local_freq_scale > 1024))

Looks like there is something fishy about the perf calculations here
after reading the counters, we tried to scale that in the range 0-1024
and it came larger than that.

Will keep you posted.

-- 
viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data
  2021-06-16 11:36     ` Viresh Kumar
@ 2021-06-16 12:00       ` Ionela Voinescu
  2021-06-17  3:06         ` Viresh Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Ionela Voinescu @ 2021-06-16 12:00 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael Wysocki, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-pm, Vincent Guittot, Qian Cai,
	Paul E . McKenney, linux-kernel

On Wednesday 16 Jun 2021 at 17:06:04 (+0530), Viresh Kumar wrote:
> Hi Ionela,
> 
> On 16-06-21, 12:25, Ionela Voinescu wrote:
> > Please correct me if I'm wrong, but my understanding is that this is
> > only a problem for the cppc cpufreq invariance functionality. Let's
> > consider a scenario where CPUs are either hotplugged out or the cpufreq
> > CPPC driver module is removed; topology_clear_scale_freq_source() would
> > get called and the sfd_data will be set to NULL. But if at the same
> > time topology_scale_freq_tick() got an old reference of sfd_data,
> > set_freq_scale() will be called. This is only a problem for CPPC cpufreq
> > as cppc_scale_freq_tick() will end up using driver internal data that
> > might have been freed during the hotplug callbacks or the exit path.
> 
> For now, yes, CPPC is the only one affected.
> 
> > If this is the case, wouldn't the synchronisation issue be better
> > resolved in the CPPC cpufreq driver, rather than here?
> 
> Hmm, the way I see it is that topology_clear_scale_freq_source() is an API
> provided by topology core and the topology core needs to guarantee that it
> doesn't use the data any longer after topology_clear_scale_freq_source() is
> called.
> 
> The same is true for other APIs, like:
> 
> irq_work_sync();
> kthread_cancel_work_sync();
> 
> It isn't the user which needs to take this into account, but the API provider.
> 

I would agree if it wasn't for the fact that the driver provides the
set_freq_scale() implementation that ends up using driver internal data
which could have been freed by the driver's own .exit()/stop_cpu()
callbacks. The API and the generic implementation has the responsibility
of making sure of sane access to its own structures.

Even if we would want to keep drivers from shooting themselves in the
foot, I would prefer we postpone it until we have more users for this,
before we add any synchronisation mechanisms to functionality called
on the tick.

Let's see if there's a less invasive solution to fix CPPC for now, what
do you think?

Thanks,
Ionela.

> There may be more users of this in the future, lets say another cpufreq driver,
> and so keeping this synchronization at the API provider is the right thing to do
> IMHO.
> 
> And from the user's perspective, like cppc, it doesn't have any control over who
> is using its callback and how and when. It is very very difficult to provide

> something like this at the users, redundant anyway. For example cppc won't ever
> know when topology_scale_freq_tick() has stopped calling its callback.
> 
> For example this is what cppc driver needs to do now:
> 
> +static void cppc_cpufreq_stop_cpu(struct cpufreq_policy *policy,
> +                                 unsigned int cpu)
> +{
> +       struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
> +
> +       topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, cpumask_of(cpu));
> +       irq_work_sync(&cppc_fi->irq_work);
> +       kthread_cancel_work_sync(&cppc_fi->work);
> +}
> 
> The driver uses APIs provided by 3 layers, topology, irq-work, kthread and all
> must provide these guarantees.
> 
> A very similar thing is implemented in kernel/sched/cpufreq.c for example.
> 
> -- 
> viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance
  2021-06-16  6:48 ` [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance Viresh Kumar
@ 2021-06-16 12:48   ` Ionela Voinescu
  2021-06-17  3:24     ` Viresh Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Ionela Voinescu @ 2021-06-16 12:48 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael Wysocki, Sudeep Holla, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-pm,
	Qian Cai, linux-acpi, linux-kernel

Hi,

I was looking forward to the complete removal of stop_cpu() :).

On Wednesday 16 Jun 2021 at 12:18:09 (+0530), Viresh Kumar wrote:
> The Frequency Invariance Engine (FIE) is providing a frequency scaling
> correction factor that helps achieve more accurate load-tracking.
> 
> Normally, this scaling factor can be obtained directly with the help of
> the cpufreq drivers as they know the exact frequency the hardware is
> running at. But that isn't the case for CPPC cpufreq driver.
> 
> Another way of obtaining that is using the arch specific counter
> support, which is already present in kernel, but that hardware is
> optional for platforms.
> 
> This patch updates the CPPC driver to register itself with the topology
> core to provide its own implementation (cppc_scale_freq_tick()) of
> topology_scale_freq_tick() which gets called by the scheduler on every
> tick. Note that the arch specific counters have higher priority than
> CPPC counters, if available, though the CPPC driver doesn't need to have
> any special handling for that.
> 
> On an invocation of cppc_scale_freq_tick(), we schedule an irq work
> (since we reach here from hard-irq context), which then schedules a
> normal work item and cppc_scale_freq_workfn() updates the per_cpu
> arch_freq_scale variable based on the counter updates since the last
> tick.
> 
> To allow platforms to disable this CPPC counter-based frequency
> invariance support, this is all done under CONFIG_ACPI_CPPC_CPUFREQ_FIE,
> which is enabled by default.
> 
> This also exports sched_setattr_nocheck() as the CPPC driver can be
> built as a module.
> 
> Cc: linux-acpi@vger.kernel.org
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  drivers/cpufreq/Kconfig.arm    |  10 ++
>  drivers/cpufreq/cppc_cpufreq.c | 232 ++++++++++++++++++++++++++++++---
>  include/linux/arch_topology.h  |   1 +
>  kernel/sched/core.c            |   1 +
>  4 files changed, 227 insertions(+), 17 deletions(-)
> 
[..]
> +static void cppc_cpufreq_start_cpu(struct cpufreq_policy *policy,
> +				   unsigned int cpu)
> +{
> +	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
> +	int ret;
> +
> +	cppc_fi->cpu = cpu;
> +	cppc_fi->cpu_data = policy->driver_data;
> +	kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn);
> +	init_irq_work(&cppc_fi->irq_work, cppc_irq_work);
> +
> +	ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs);
> +	if (ret) {
> +		pr_warn("%s: failed to read perf counters: %d\n", __func__,
> +			ret);
> +		return;
> +	}
> +
> +	/* Register for freq-invariance */
> +	topology_set_scale_freq_source(&cppc_sftd, cpumask_of(cpu));
> +}
> +
> +static void cppc_cpufreq_stop_cpu(struct cpufreq_policy *policy,
> +				  unsigned int cpu)
> +{
> +	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
> +
> +	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, cpumask_of(cpu));
> +
> +	irq_work_sync(&cppc_fi->irq_work);
> +	kthread_cancel_work_sync(&cppc_fi->work);
> +}

I'll only comment on this for now as I should know the rest.

Let's assume we don't have these, what happens now is the following:

1. We hotplug out the last CPU in a policy, we call the
   .stop_cpu()/exit() function which will free the cppc_cpudata structure.

   The only vulnerability is if we have a last tick on that last CPU,
   after the above callback was called.

2. When the CPU at 1. gets hotplugged back in, the cppc_fi->cpu_data is
   stale.

We do not have a problem when removing the CPPC cpufreq module as we're
doing cppc_freq_invariance_exit() before unregistering the driver and
freeing the data.

Are 1. and 2 the only problems we have, or have I missed any?

Thanks,
Ionela.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data
  2021-06-16 12:00       ` Ionela Voinescu
@ 2021-06-17  3:06         ` Viresh Kumar
  0 siblings, 0 replies; 24+ messages in thread
From: Viresh Kumar @ 2021-06-17  3:06 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Rafael Wysocki, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-pm, Vincent Guittot, Qian Cai,
	Paul E . McKenney, linux-kernel

On 16-06-21, 13:00, Ionela Voinescu wrote:
> I would agree if it wasn't for the fact that the driver provides the
> set_freq_scale() implementation that ends up using driver internal data
> which could have been freed by the driver's own .exit()/stop_cpu()
> callbacks. The API and the generic implementation has the responsibility
> of making sure of sane access to its own structures.

How do you see timer core or workqueue core then ? Why do they make
sure they don't end up calling user's function once we ask them not to
?

And also scheduler's own util update mechanism, the exact thing
happens there. Users (cpufreq governors) call
cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook()
to pass their own data structure "struct update_util_data", which has
ia callback within. This is what happens from scheduler's context in
those cases.

static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
{
	struct update_util_data *data;

	data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
						  cpu_of(rq)));
	if (data)
		data->func(data, rq_clock(rq), flags);
}

Also note that some kind of synchronisation mechanism is indeed
required between topology_set_scale_freq_source() and
topology_clear_scale_freq_source(), there is no race there today,
sure, but this is an API which is made generic.

> Even if we would want to keep drivers from shooting themselves in the
> foot, I would prefer we postpone it until we have more users for this,
> before we add any synchronisation mechanisms to functionality called
> on the tick.

The rcu mechanism is very much used in the scheduler itself because it
is lightweight. Honestly I don't even see any other way (w.r.t.
locking) users can fix it at their end. They don't know which was the
last tick that used their callback.

> Let's see if there's a less invasive solution to fix CPPC for now, what
> do you think?

For me, this change is required in the API despite how CPPC ends up
using it. Else we are failing at defining the API itself IMHO.

-- 
viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance
  2021-06-16 12:48   ` Ionela Voinescu
@ 2021-06-17  3:24     ` Viresh Kumar
  2021-06-17 10:34       ` Ionela Voinescu
  0 siblings, 1 reply; 24+ messages in thread
From: Viresh Kumar @ 2021-06-17  3:24 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Rafael Wysocki, Sudeep Holla, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-pm,
	Qian Cai, linux-acpi, linux-kernel

On 16-06-21, 13:48, Ionela Voinescu wrote:
> I was looking forward to the complete removal of stop_cpu() :).

No one wants to remove more code than I do :)

> I'll only comment on this for now as I should know the rest.
> 
> Let's assume we don't have these, what happens now is the following:
> 
> 1. We hotplug out the last CPU in a policy, we call the
>    .stop_cpu()/exit() function which will free the cppc_cpudata structure.
> 
>    The only vulnerability is if we have a last tick on that last CPU,
>    after the above callback was called.
> 
> 2. When the CPU at 1. gets hotplugged back in, the cppc_fi->cpu_data is
>    stale.
> 
> We do not have a problem when removing the CPPC cpufreq module as we're
> doing cppc_freq_invariance_exit() before unregistering the driver and
> freeing the data.
> 
> Are 1. and 2 the only problems we have, or have I missed any?

There is more to it. For simplicity, lets assume a quad-core setup,
with all 4 CPUs sharing the cpufreq policy. And here is what happens
without the new changes:

- On CPPC cpufreq driver insertion, we start 4 kthreads/irq-works (1
  for each CPU as it fires from tick) from the ->init() callback.

- Now lets say we offline CPU3. The CPPC cpufreq driver isn't notified
  by anyone and it hasn't registered itself to hotplug notifier as
  well. So, the irq-work/kthread isn't stopped. This results in the
  issue reported by Qian earlier.

  The very same thing happens with schedutil governor too, which uses
  very similar mechanisms, and the cpufreq core takes care of it there
  by stopping the governor before removing the CPU from policy->cpus
  and starting it again. So there we stop irq-work/kthread for all 4
  CPUs, then start them only for remaining 3.

  I thought about that approach as well, but it was too heavy to stop
  everyone and start again in this case. And so I invented start_cpu()
  and stop_cpu() callbacks.

- In this case, because the CPU is going away, we need to make sure we
  don't queue any more irq-work or kthread to it and this is one of
  the main reasons for adding synchronization in the topology core,
  because we need a hard guarantee here that irq-work won't fire
  again, as the CPU won't be there or will not be in a sane state.

- The same sequence of events is true for the case where the last CPU
  of a policy goes away (not in this example, but lets say quad-core
  setup with separate policies for each CPU).

- Not just the policy, but the CPU may be going away as well.

I hope I was able to clarify a bit here.

-- 
viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance
  2021-06-17  3:24     ` Viresh Kumar
@ 2021-06-17 10:34       ` Ionela Voinescu
  2021-06-17 11:19         ` Viresh Kumar
  2021-06-18  7:37         ` Viresh Kumar
  0 siblings, 2 replies; 24+ messages in thread
From: Ionela Voinescu @ 2021-06-17 10:34 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael Wysocki, Sudeep Holla, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-pm,
	Qian Cai, linux-acpi, linux-kernel

Many thanks for the details!

On Thursday 17 Jun 2021 at 08:54:16 (+0530), Viresh Kumar wrote:
> On 16-06-21, 13:48, Ionela Voinescu wrote:
> > I was looking forward to the complete removal of stop_cpu() :).
> 
> No one wants to remove more code than I do :)
> 
> > I'll only comment on this for now as I should know the rest.
> > 
> > Let's assume we don't have these, what happens now is the following:
> > 
> > 1. We hotplug out the last CPU in a policy, we call the
> >    .stop_cpu()/exit() function which will free the cppc_cpudata structure.
> > 
> >    The only vulnerability is if we have a last tick on that last CPU,
> >    after the above callback was called.
> > 
> > 2. When the CPU at 1. gets hotplugged back in, the cppc_fi->cpu_data is
> >    stale.
> > 
> > We do not have a problem when removing the CPPC cpufreq module as we're
> > doing cppc_freq_invariance_exit() before unregistering the driver and
> > freeing the data.
> > 
> > Are 1. and 2 the only problems we have, or have I missed any?
> 
> There is more to it. For simplicity, lets assume a quad-core setup,
> with all 4 CPUs sharing the cpufreq policy. And here is what happens
> without the new changes:
> 
> - On CPPC cpufreq driver insertion, we start 4 kthreads/irq-works (1
>   for each CPU as it fires from tick) from the ->init() callback.
> 
> - Now lets say we offline CPU3. The CPPC cpufreq driver isn't notified
>   by anyone and it hasn't registered itself to hotplug notifier as
>   well. So, the irq-work/kthread isn't stopped. This results in the
>   issue reported by Qian earlier.
> 

I might be missing something, but when you offline a single CPU in a
policy, the worse that can happen is that a last call to
cppc_scale_freq_tick() would have sneaked in before irqs and the tick
are disabled. But even if we have a last call to
cppc_scale_freq_workfn(), the counter read methods would know how to
cope with hotplug, and the cppc_cpudata structure would still be
allocated and have valid desired_perf and highest_perf values.

Worse case, the last scale factor set for the CPU will be meaningless,
but it's already meaningless as the CPU is going down.

When you are referring to the issue reported by Qian I suppose you are
referring to this [1]. I think this is the case where you hotplug the
last CPU in a policy and free cppc_cpudata.

[1] https://lore.kernel.org/linux-pm/41f5195e-0e5f-fdfe-ba37-34e1fd8e4064@quicinc.com/

>   The very same thing happens with schedutil governor too, which uses
>   very similar mechanisms, and the cpufreq core takes care of it there
>   by stopping the governor before removing the CPU from policy->cpus
>   and starting it again. So there we stop irq-work/kthread for all 4
>   CPUs, then start them only for remaining 3.
> 

Things are different for sugov: you have to stop the governor when one
CPU in the policy goes down, and it's normal for sugov not to allow its
hooks to be called while the governor is down so it has to do a full
cleanup when going down and a full bringup when going back up.

The difference for CPPC invariance is that only a CPU can trigger the
work to update its own scale factor, through the tick. No other CPU x
can trigger a scale factor update for CPU y, but x can carry out the
work for CPU y (x can run the cppc_scale_freq_workfn(y)).

So when y goes down, it won't have a tick to queue any irq or kthread
work any longer until it's brought back up. So I believe that the only
cleanup that needs to be done when a CPU goes offline, is to ensure
that the work triggered by that last tick is done safely.

>   I thought about that approach as well, but it was too heavy to stop
>   everyone and start again in this case. And so I invented start_cpu()
>   and stop_cpu() callbacks.
> 

> - In this case, because the CPU is going away, we need to make sure we
>   don't queue any more irq-work or kthread to it and this is one of
>   the main reasons for adding synchronization in the topology core,
>   because we need a hard guarantee here that irq-work won't fire
>   again, as the CPU won't be there or will not be in a sane state.
> 

We can't queue any more work for it as there's no tick for the offline
CPU.

> - The same sequence of events is true for the case where the last CPU
>   of a policy goes away (not in this example, but lets say quad-core
>   setup with separate policies for each CPU).
> 
> - Not just the policy, but the CPU may be going away as well.
> 
> I hope I was able to clarify a bit here.
> 

Thanks! I do agree it is better to be cautious, but I initially wanted to
understand we don't see the problem bigger than it actually is.

Thanks,
Ionela.

P.S. I'll give more thought to the rcu use in the arch_topology driver.
I'm the boring kind that likes to err on the side of caution, so I tend
to agree that it might be good to have even if the current problem could
be solved in this driver.

> -- 
> viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance
  2021-06-17 10:34       ` Ionela Voinescu
@ 2021-06-17 11:19         ` Viresh Kumar
  2021-06-17 12:22           ` Peter Zijlstra
  2021-06-18  7:37         ` Viresh Kumar
  1 sibling, 1 reply; 24+ messages in thread
From: Viresh Kumar @ 2021-06-17 11:19 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Rafael Wysocki, Sudeep Holla, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-pm,
	Qian Cai, linux-acpi, linux-kernel

On 17-06-21, 11:34, Ionela Voinescu wrote:
> I might be missing something, but when you offline a single CPU in a
> policy, the worse that can happen is that a last call to
> cppc_scale_freq_tick() would have sneaked in before irqs and the tick
> are disabled. But even if we have a last call to
> cppc_scale_freq_workfn(), the counter read methods would know how to
> cope with hotplug, and the cppc_cpudata structure would still be
> allocated and have valid desired_perf and highest_perf values.

Okay, I somehow assumed that cppc_scale_freq_workfn() needs to run on the local
CPU, while it can actually land anywhere. My fault.

But the irq-work being queued here is per-cpu and it will get queued on the
local CPU where the tick occurred.

Now I am not sure of what will happen to this irq-work in that case. I tried to
look now and I see that these irq-work items are processed first on tick and
then the tick handler of scheduler is called, so that will again queue the cppc
irq-work.

What happens if this happens along with CPU hotplug ? I am not sure I understand
that. There may or may not be any side effects of this. Lets assume the work
item is left in the queue as is and no tick happens after that as the CPU is
offlined. We are good.

Now if we unload the cpufreq driver at this moment, the driver will call
irq_work_sync(), which may end up in a while loop ? There is no
irq-work-cancel() API.

Peter: Can you help here on this ? Lemme try to explain a bit here:

We are starting an irq-work (in cppc cpufreq driver) from
scheduler_tick()->arch_scale_freq_tick(). What will happen if the driver doesn't
take care of CPU hotplug explicitly and make sure this work isn't queued again
from the next tick.

Is it important for user to make sure it gets rid of the irq-work during hotplug
here ?

> Worse case, the last scale factor set for the CPU will be meaningless,
> but it's already meaningless as the CPU is going down.
> 
> When you are referring to the issue reported by Qian I suppose you are
> referring to this [1]. I think this is the case where you hotplug the
> last CPU in a policy and free cppc_cpudata.
> 
> [1] https://lore.kernel.org/linux-pm/41f5195e-0e5f-fdfe-ba37-34e1fd8e4064@quicinc.com/

Yes, I was talking about this report only, I am not sure now if I understand
what actually happened here :)

Ionela: I have skipped replying to rest of your email, will get back to that
once we have clarification on the above first.

Thanks a lot for your reviews, always on time :)

-- 
viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance
  2021-06-17 11:19         ` Viresh Kumar
@ 2021-06-17 12:22           ` Peter Zijlstra
  2021-06-18  3:45             ` Viresh Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2021-06-17 12:22 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Ionela Voinescu, Rafael Wysocki, Sudeep Holla, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-pm,
	Qian Cai, linux-acpi, linux-kernel

On Thu, Jun 17, 2021 at 04:49:36PM +0530, Viresh Kumar wrote:
> Peter: Can you help here on this ? Lemme try to explain a bit here:
> 
> We are starting an irq-work (in cppc cpufreq driver) from
> scheduler_tick()->arch_scale_freq_tick(). What will happen if the driver doesn't
> take care of CPU hotplug explicitly and make sure this work isn't queued again
> from the next tick.
> 
> Is it important for user to make sure it gets rid of the irq-work during hotplug
> here ?

irq-work is flushed on hotplug, see smpcfd_dying_cpu().

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 1/3] cpufreq: Add start_cpu() and stop_cpu() callbacks
  2021-06-16  6:48 ` [PATCH V2 1/3] cpufreq: Add start_cpu() and stop_cpu() callbacks Viresh Kumar
@ 2021-06-17 13:33   ` Rafael J. Wysocki
  2021-06-18  7:46     ` Viresh Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Rafael J. Wysocki @ 2021-06-17 13:33 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael Wysocki, Ionela Voinescu, Jonathan Corbet, Linux PM,
	Vincent Guittot, Qian Cai, open list:DOCUMENTATION,
	Linux Kernel Mailing List

On Wed, Jun 16, 2021 at 8:48 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On CPU hot-unplug, the cpufreq core doesn't call any driver specific
> callback unless all the CPUs of a policy went away, in which case we
> call stop_cpu() callback.
>
> For the CPPC cpufreq driver, we need to perform per-cpu init/exit work
> which that can't be performed from policy specific init()/exit()
> callbacks.
>
> This patch adds a new callback, start_cpu() and modifies the stop_cpu()
> callback, to perform such CPU specific work.
>
> These routines are called whenever a CPU is added or removed from a
> policy.
>
> Note that this also moves the setting of policy->cpus to online CPUs
> only, outside of rwsem as we needed to call start_cpu() for online CPUs
> only. This shouldn't have any side effects.
>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  Documentation/cpu-freq/cpu-drivers.rst |  7 +++++--
>  drivers/cpufreq/cpufreq.c              | 19 +++++++++++++++----
>  include/linux/cpufreq.h                |  5 ++++-
>  3 files changed, 24 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/cpu-freq/cpu-drivers.rst b/Documentation/cpu-freq/cpu-drivers.rst
> index a697278ce190..15cfe42b4075 100644
> --- a/Documentation/cpu-freq/cpu-drivers.rst
> +++ b/Documentation/cpu-freq/cpu-drivers.rst
> @@ -71,8 +71,11 @@ And optionally
>   .exit - A pointer to a per-policy cleanup function called during
>   CPU_POST_DEAD phase of cpu hotplug process.
>
> - .stop_cpu - A pointer to a per-policy stop function called during
> - CPU_DOWN_PREPARE phase of cpu hotplug process.
> + .start_cpu - A pointer to a per-policy per-cpu start function called
> + during CPU online phase.
> +
> + .stop_cpu - A pointer to a per-policy per-cpu stop function called
> + during CPU offline phase.
>
>   .suspend - A pointer to a per-policy suspend function which is called
>   with interrupts disabled and _after_ the governor is stopped for the
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 802abc925b2a..128dfb1b0cdf 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -1119,6 +1119,10 @@ static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy, unsigned int cp
>
>         cpumask_set_cpu(cpu, policy->cpus);
>
> +       /* Do CPU specific initialization if required */
> +       if (cpufreq_driver->start_cpu)
> +               cpufreq_driver->start_cpu(policy, cpu);
> +
>         if (has_target()) {
>                 ret = cpufreq_start_governor(policy);
>                 if (ret)
> @@ -1375,13 +1379,19 @@ static int cpufreq_online(unsigned int cpu)
>                 cpumask_copy(policy->related_cpus, policy->cpus);
>         }
>
> -       down_write(&policy->rwsem);
>         /*
>          * affected cpus must always be the one, which are online. We aren't
>          * managing offline cpus here.
>          */
>         cpumask_and(policy->cpus, policy->cpus, cpu_online_mask);
>
> +       /* Do CPU specific initialization if required */
> +       if (cpufreq_driver->start_cpu) {
> +               for_each_cpu(j, policy->cpus)
> +                       cpufreq_driver->start_cpu(policy, j);
> +       }
> +
> +       down_write(&policy->rwsem);
>         if (new_policy) {
>                 for_each_cpu(j, policy->related_cpus) {
>                         per_cpu(cpufreq_cpu_data, j) = policy;
> @@ -1581,6 +1591,10 @@ static int cpufreq_offline(unsigned int cpu)
>                 policy->cpu = cpumask_any(policy->cpus);
>         }
>
> +       /* Do CPU specific de-initialization if required */
> +       if (cpufreq_driver->stop_cpu)
> +               cpufreq_driver->stop_cpu(policy, cpu);
> +
>         /* Start governor again for active policy */
>         if (!policy_is_inactive(policy)) {
>                 if (has_target()) {
> @@ -1597,9 +1611,6 @@ static int cpufreq_offline(unsigned int cpu)
>                 policy->cdev = NULL;
>         }
>
> -       if (cpufreq_driver->stop_cpu)
> -               cpufreq_driver->stop_cpu(policy);
> -

This should be a separate patch IMO, after you've migrated everyone to
->offline/->exit.

BTW, IMO it might be better to migrate ->stop_cpu to ->offline rather
than to ->exit.

>         if (has_target())
>                 cpufreq_exit_governor(policy);
>
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 353969c7acd3..c281b3df4e2f 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -371,7 +371,10 @@ struct cpufreq_driver {
>         int             (*online)(struct cpufreq_policy *policy);
>         int             (*offline)(struct cpufreq_policy *policy);
>         int             (*exit)(struct cpufreq_policy *policy);
> -       void            (*stop_cpu)(struct cpufreq_policy *policy);
> +
> +       /* CPU specific start/stop */
> +       void            (*start_cpu)(struct cpufreq_policy *policy, unsigned int cpu);
> +       void            (*stop_cpu)(struct cpufreq_policy *policy, unsigned int cpu);
>         int             (*suspend)(struct cpufreq_policy *policy);
>         int             (*resume)(struct cpufreq_policy *policy);
>
> --
> 2.31.1.272.g89b43f80a514
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance
  2021-06-17 12:22           ` Peter Zijlstra
@ 2021-06-18  3:45             ` Viresh Kumar
  0 siblings, 0 replies; 24+ messages in thread
From: Viresh Kumar @ 2021-06-18  3:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ionela Voinescu, Rafael Wysocki, Sudeep Holla, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-pm,
	Qian Cai, linux-acpi, linux-kernel

On 17-06-21, 14:22, Peter Zijlstra wrote:
> On Thu, Jun 17, 2021 at 04:49:36PM +0530, Viresh Kumar wrote:
> > Peter: Can you help here on this ? Lemme try to explain a bit here:
> > 
> > We are starting an irq-work (in cppc cpufreq driver) from
> > scheduler_tick()->arch_scale_freq_tick(). What will happen if the driver doesn't
> > take care of CPU hotplug explicitly and make sure this work isn't queued again
> > from the next tick.
> > 
> > Is it important for user to make sure it gets rid of the irq-work during hotplug
> > here ?
> 
> irq-work is flushed on hotplug, see smpcfd_dying_cpu().

Thanks Peter.

-- 
viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance
  2021-06-17 10:34       ` Ionela Voinescu
  2021-06-17 11:19         ` Viresh Kumar
@ 2021-06-18  7:37         ` Viresh Kumar
  2021-06-18 12:26           ` Ionela Voinescu
  1 sibling, 1 reply; 24+ messages in thread
From: Viresh Kumar @ 2021-06-18  7:37 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Rafael Wysocki, Sudeep Holla, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-pm,
	Qian Cai, linux-acpi, linux-kernel

On 17-06-21, 11:34, Ionela Voinescu wrote:
> We can't queue any more work for it as there's no tick for the offline
> CPU.

Okay, so as discussed yesterday and confirmed by Peter, CPU hotplug shouldn't be
a problem and we can avoid handling that here. Good.

The patch 1/3 from this series is not required anymore. I will get rid of
stop_cpu() as well.

The second patch stays as is, as I still don't see another way of fixing the
same problem on policy ->exit(). We still need that guarantee from topology
core.

> P.S. I'll give more thought to the rcu use in the arch_topology driver.
> I'm the boring kind that likes to err on the side of caution, so I tend
> to agree that it might be good to have even if the current problem could
> be solved in this driver.

Before I resend the series again, maybe it is better to align on the idea here
itself first (as I need to fix some existing potential memleaks in policy
->init() first). So here is the new diff, looks okay now ?

diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index be4f62e2c5f1..3e9070f107c5 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -10,14 +10,18 @@
 
 #define pr_fmt(fmt)	"CPPC Cpufreq:"	fmt
 
+#include <linux/arch_topology.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/delay.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
 #include <linux/dmi.h>
+#include <linux/irq_work.h>
+#include <linux/kthread.h>
 #include <linux/time.h>
 #include <linux/vmalloc.h>
+#include <uapi/linux/sched/types.h>
 
 #include <asm/unaligned.h>
 
@@ -57,6 +61,214 @@ static struct cppc_workaround_oem_info wa_info[] = {
 	}
 };
 
+#ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
+
+/* Frequency invariance support */
+struct cppc_freq_invariance {
+	int cpu;
+	struct irq_work irq_work;
+	struct kthread_work work;
+	struct cppc_perf_fb_ctrs prev_perf_fb_ctrs;
+	struct cppc_cpudata *cpu_data;
+};
+
+static DEFINE_PER_CPU(struct cppc_freq_invariance, cppc_freq_inv);
+static struct kthread_worker *kworker_fie;
+
+static struct cpufreq_driver cppc_cpufreq_driver;
+static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu);
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t1);
+
+/**
+ * cppc_scale_freq_workfn - CPPC arch_freq_scale updater for frequency invariance
+ * @work: The work item.
+ *
+ * The CPPC driver register itself with the topology core to provide its own
+ * implementation (cppc_scale_freq_tick()) of topology_scale_freq_tick() which
+ * gets called by the scheduler on every tick.
+ *
+ * Note that the arch specific counters have higher priority than CPPC counters,
+ * if available, though the CPPC driver doesn't need to have any special
+ * handling for that.
+ *
+ * On an invocation of cppc_scale_freq_tick(), we schedule an irq work (since we
+ * reach here from hard-irq context), which then schedules a normal work item
+ * and cppc_scale_freq_workfn() updates the per_cpu arch_freq_scale variable
+ * based on the counter updates since the last tick.
+ */
+static void cppc_scale_freq_workfn(struct kthread_work *work)
+{
+	struct cppc_freq_invariance *cppc_fi;
+	struct cppc_perf_fb_ctrs fb_ctrs = {0};
+	struct cppc_cpudata *cpu_data;
+	unsigned long local_freq_scale;
+	u64 perf;
+
+	cppc_fi = container_of(work, struct cppc_freq_invariance, work);
+	cpu_data = cppc_fi->cpu_data;
+
+	if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) {
+		pr_warn("%s: failed to read perf counters\n", __func__);
+		return;
+	}
+
+	perf = cppc_perf_from_fbctrs(cpu_data, &cppc_fi->prev_perf_fb_ctrs,
+				     &fb_ctrs);
+	cppc_fi->prev_perf_fb_ctrs = fb_ctrs;
+
+	perf <<= SCHED_CAPACITY_SHIFT;
+	local_freq_scale = div64_u64(perf, cpu_data->perf_caps.highest_perf);
+
+	/* This can happen due to counter overflows */
+	if (unlikely(local_freq_scale > 1024))
+		local_freq_scale = 1024;
+
+	per_cpu(arch_freq_scale, cppc_fi->cpu) = local_freq_scale;
+}
+
+static void cppc_irq_work(struct irq_work *irq_work)
+{
+	struct cppc_freq_invariance *cppc_fi;
+
+	cppc_fi = container_of(irq_work, struct cppc_freq_invariance, irq_work);
+	kthread_queue_work(kworker_fie, &cppc_fi->work);
+}
+
+static void cppc_scale_freq_tick(void)
+{
+	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, smp_processor_id());
+
+	/*
+	 * cppc_get_perf_ctrs() can potentially sleep, call that from the right
+	 * context.
+	 */
+	irq_work_queue(&cppc_fi->irq_work);
+}
+
+static struct scale_freq_data cppc_sftd = {
+	.source = SCALE_FREQ_SOURCE_CPPC,
+	.set_freq_scale = cppc_scale_freq_tick,
+};
+
+static int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)
+{
+	struct cppc_freq_invariance *cppc_fi;
+	int cpu, ret;
+
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return 0;
+
+	for_each_cpu(cpu, policy->cpus) {
+		cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+		cppc_fi->cpu = cpu;
+		cppc_fi->cpu_data = policy->driver_data;
+		kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn);
+		init_irq_work(&cppc_fi->irq_work, cppc_irq_work);
+
+		ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs);
+		if (ret) {
+			pr_warn("%s: failed to read perf counters for cpu:%d: %d\n",
+				__func__, cpu, ret);
+
+			/* Avoid failing driver's probe because of FIE */
+			return 0;
+		}
+	}
+
+	/* Register for freq-invariance */
+	topology_set_scale_freq_source(&cppc_sftd, policy->cpus);
+	return 0;
+}
+
+/*
+ * We free all the resources on policy's removal and not on CPU removal as the
+ * irq-work are per-cpu and the hotplug core takes care of flushing the pending
+ * irq-works (hint: smpcfd_dying_cpu()) on CPU hotplug. Even if the kthread-work
+ * fires on another CPU after the concerned CPU is removed, it won't harm.
+ *
+ * We just need to make sure to remove them all on policy->exit().
+ */
+static void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)
+{
+	struct cppc_freq_invariance *cppc_fi;
+	int cpu;
+
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	/* policy->cpus will be empty here, use related_cpus instead */
+	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, policy->related_cpus);
+
+	for_each_cpu(cpu, policy->related_cpus) {
+		cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+		irq_work_sync(&cppc_fi->irq_work);
+		kthread_cancel_work_sync(&cppc_fi->work);
+	}
+}
+
+static void __init cppc_freq_invariance_init(void)
+{
+	struct sched_attr attr = {
+		.size		= sizeof(struct sched_attr),
+		.sched_policy	= SCHED_DEADLINE,
+		.sched_nice	= 0,
+		.sched_priority	= 0,
+		/*
+		 * Fake (unused) bandwidth; workaround to "fix"
+		 * priority inheritance.
+		 */
+		.sched_runtime	= 1000000,
+		.sched_deadline = 10000000,
+		.sched_period	= 10000000,
+	};
+	int ret;
+
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	kworker_fie = kthread_create_worker(0, "cppc_fie");
+	if (IS_ERR(kworker_fie))
+		return;
+
+	ret = sched_setattr_nocheck(kworker_fie->task, &attr);
+	if (ret) {
+		pr_warn("%s: failed to set SCHED_DEADLINE: %d\n", __func__,
+			ret);
+		kthread_destroy_worker(kworker_fie);
+		return;
+	}
+}
+
+static void cppc_freq_invariance_exit(void)
+{
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	kthread_destroy_worker(kworker_fie);
+	kworker_fie = NULL;
+}
+
+#else
+static inline int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)
+{
+	return 0;
+}
+
+static inline void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)
+{
+}
+
+static inline void cppc_freq_invariance_init(void)
+{
+}
+
+static inline void cppc_freq_invariance_exit(void)
+{
+}
+#endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */
+
 /* Callback function used to retrieve the max frequency from DMI */
 static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
 {
@@ -324,11 +536,13 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	cpu_data->perf_ctrls.desired_perf =  caps->highest_perf;
 
 	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
-	if (ret)
+	if (ret) {
 		pr_debug("Err setting perf value:%d on CPU:%d. ret:%d\n",
 			 caps->highest_perf, cpu, ret);
+		return ret;
+	}
 
-	return ret;
+	return cppc_cpufreq_cpu_fie_init(policy);
 }
 
 static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
@@ -338,6 +552,8 @@ static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
 	unsigned int cpu = policy->cpu;
 	int ret;
 
+	cppc_cpufreq_cpu_fie_exit(policy);
+
 	cpu_data->perf_ctrls.desired_perf = caps->lowest_perf;
 
 	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
@@ -362,26 +578,35 @@ static inline u64 get_delta(u64 t1, u64 t0)
 	return (u32)t1 - (u32)t0;
 }
 
-static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
-				     struct cppc_perf_fb_ctrs fb_ctrs_t0,
-				     struct cppc_perf_fb_ctrs fb_ctrs_t1)
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t1)
 {
 	u64 delta_reference, delta_delivered;
-	u64 reference_perf, delivered_perf;
+	u64 reference_perf;
 
-	reference_perf = fb_ctrs_t0.reference_perf;
+	reference_perf = fb_ctrs_t0->reference_perf;
 
-	delta_reference = get_delta(fb_ctrs_t1.reference,
-				    fb_ctrs_t0.reference);
-	delta_delivered = get_delta(fb_ctrs_t1.delivered,
-				    fb_ctrs_t0.delivered);
+	delta_reference = get_delta(fb_ctrs_t1->reference,
+				    fb_ctrs_t0->reference);
+	delta_delivered = get_delta(fb_ctrs_t1->delivered,
+				    fb_ctrs_t0->delivered);
 
-	/* Check to avoid divide-by zero */
-	if (delta_reference || delta_delivered)
-		delivered_perf = (reference_perf * delta_delivered) /
-					delta_reference;
-	else
-		delivered_perf = cpu_data->perf_ctrls.desired_perf;
+	/* Check to avoid divide-by zero and invalid delivered_perf */
+	if (!delta_reference || !delta_delivered)
+		return cpu_data->perf_ctrls.desired_perf;
+
+	return (reference_perf * delta_delivered) / delta_reference;
+}
+
+static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
+				     struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				     struct cppc_perf_fb_ctrs *fb_ctrs_t1)
+{
+	u64 delivered_perf;
+
+	delivered_perf = cppc_perf_from_fbctrs(cpu_data, fb_ctrs_t0,
+					       fb_ctrs_t1);
 
 	return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);
 }
@@ -405,7 +630,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
 	if (ret)
 		return ret;
 
-	return cppc_get_rate_from_fbctrs(cpu_data, fb_ctrs_t0, fb_ctrs_t1);
+	return cppc_get_rate_from_fbctrs(cpu_data, &fb_ctrs_t0, &fb_ctrs_t1);
 }
 
 static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
@@ -506,14 +731,21 @@ static void cppc_check_hisi_workaround(void)
 
 static int __init cppc_cpufreq_init(void)
 {
+	int ret;
+
 	if ((acpi_disabled) || !acpi_cpc_valid())
 		return -ENODEV;
 
 	INIT_LIST_HEAD(&cpu_data_list);
 
 	cppc_check_hisi_workaround();
+	cppc_freq_invariance_init();
 
-	return cpufreq_register_driver(&cppc_cpufreq_driver);
+	ret = cpufreq_register_driver(&cppc_cpufreq_driver);
+	if (ret)
+		cppc_freq_invariance_exit();
+
+	return ret;
 }
 
 static inline void free_cpu_data(void)
@@ -531,6 +763,7 @@ static inline void free_cpu_data(void)
 static void __exit cppc_cpufreq_exit(void)
 {
 	cpufreq_unregister_driver(&cppc_cpufreq_driver);
+	cppc_freq_invariance_exit();
 
 	free_cpu_data();
 }

-- 
viresh

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 1/3] cpufreq: Add start_cpu() and stop_cpu() callbacks
  2021-06-17 13:33   ` Rafael J. Wysocki
@ 2021-06-18  7:46     ` Viresh Kumar
  0 siblings, 0 replies; 24+ messages in thread
From: Viresh Kumar @ 2021-06-18  7:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael Wysocki, Ionela Voinescu, Jonathan Corbet, Linux PM,
	Vincent Guittot, Qian Cai, open list:DOCUMENTATION,
	Linux Kernel Mailing List

On 17-06-21, 15:33, Rafael J. Wysocki wrote:
> On Wed, Jun 16, 2021 at 8:48 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > +       /* Do CPU specific de-initialization if required */
> > +       if (cpufreq_driver->stop_cpu)
> > +               cpufreq_driver->stop_cpu(policy, cpu);
> > +
> >         /* Start governor again for active policy */
> >         if (!policy_is_inactive(policy)) {
> >                 if (has_target()) {
> > @@ -1597,9 +1611,6 @@ static int cpufreq_offline(unsigned int cpu)
> >                 policy->cdev = NULL;
> >         }
> >
> > -       if (cpufreq_driver->stop_cpu)
> > -               cpufreq_driver->stop_cpu(policy);
> > -
> 
> This should be a separate patch IMO, after you've migrated everyone to
> ->offline/->exit.

Right, anyway this patch may not be required anymore. I will send a patch to
remove this.

> BTW, IMO it might be better to migrate ->stop_cpu to ->offline rather
> than to ->exit.

This is a bit tricky IMO.

First, offline() isn't implemented by everyone, out of the three implementations
which were using stop_cpu(), only intel-pstate had offline() as well.

Second, the primary purpose of online/offline callbacks was suspend/resume
oriented and for the same reason, we don't call online() when the policy first
comes up and so in case of errors during bring up, we end up calling exit()
directly and not offline().

IMO this is a very specific thing to drivers and they need to see what fits best
for them, exit() or offline() or both.

-- 
viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance
  2021-06-18  7:37         ` Viresh Kumar
@ 2021-06-18 12:26           ` Ionela Voinescu
  0 siblings, 0 replies; 24+ messages in thread
From: Ionela Voinescu @ 2021-06-18 12:26 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael Wysocki, Sudeep Holla, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-pm,
	Qian Cai, linux-acpi, linux-kernel

Hey,

On Friday 18 Jun 2021 at 13:07:13 (+0530), Viresh Kumar wrote:
> On 17-06-21, 11:34, Ionela Voinescu wrote:
> > We can't queue any more work for it as there's no tick for the offline
> > CPU.
> 
> Okay, so as discussed yesterday and confirmed by Peter, CPU hotplug shouldn't be
> a problem and we can avoid handling that here. Good.
> 
> The patch 1/3 from this series is not required anymore. I will get rid of
> stop_cpu() as well.
> 
> The second patch stays as is, as I still don't see another way of fixing the
> same problem on policy ->exit(). We still need that guarantee from topology
> core.
>

Right! That is because we need to make sure the tick does not queue any
more irq work while we do our cleanup (irq_work_sync() and then
kthread_cancel_work_sync()). Then kthread_cancel_work_sync() would
ensure the last worker wraps up so we can free the data.

I was hoping to avoid this but I don't currently see another way either.

> > P.S. I'll give more thought to the rcu use in the arch_topology driver.
> > I'm the boring kind that likes to err on the side of caution, so I tend
> > to agree that it might be good to have even if the current problem could
> > be solved in this driver.
> 
> Before I resend the series again, maybe it is better to align on the idea here
> itself first (as I need to fix some existing potential memleaks in policy
> ->init() first). So here is the new diff, looks okay now ?
> 
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index be4f62e2c5f1..3e9070f107c5 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -10,14 +10,18 @@
>  
>  #define pr_fmt(fmt)	"CPPC Cpufreq:"	fmt
>  
> +#include <linux/arch_topology.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>
>  #include <linux/delay.h>
>  #include <linux/cpu.h>
>  #include <linux/cpufreq.h>
>  #include <linux/dmi.h>
> +#include <linux/irq_work.h>
> +#include <linux/kthread.h>
>  #include <linux/time.h>
>  #include <linux/vmalloc.h>
> +#include <uapi/linux/sched/types.h>
>  
>  #include <asm/unaligned.h>
>  
> @@ -57,6 +61,214 @@ static struct cppc_workaround_oem_info wa_info[] = {
>  	}
>  };
>  
> +#ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
> +
> +/* Frequency invariance support */
> +struct cppc_freq_invariance {
> +	int cpu;
> +	struct irq_work irq_work;
> +	struct kthread_work work;
> +	struct cppc_perf_fb_ctrs prev_perf_fb_ctrs;
> +	struct cppc_cpudata *cpu_data;
> +};
> +
> +static DEFINE_PER_CPU(struct cppc_freq_invariance, cppc_freq_inv);
> +static struct kthread_worker *kworker_fie;
> +
> +static struct cpufreq_driver cppc_cpufreq_driver;
> +static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu);
> +static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t1);
> +
> +/**
> + * cppc_scale_freq_workfn - CPPC arch_freq_scale updater for frequency invariance
> + * @work: The work item.
> + *
> + * The CPPC driver register itself with the topology core to provide its own
> + * implementation (cppc_scale_freq_tick()) of topology_scale_freq_tick() which
> + * gets called by the scheduler on every tick.
> + *
> + * Note that the arch specific counters have higher priority than CPPC counters,
> + * if available, though the CPPC driver doesn't need to have any special
> + * handling for that.
> + *
> + * On an invocation of cppc_scale_freq_tick(), we schedule an irq work (since we
> + * reach here from hard-irq context), which then schedules a normal work item
> + * and cppc_scale_freq_workfn() updates the per_cpu arch_freq_scale variable
> + * based on the counter updates since the last tick.
> + */
> +static void cppc_scale_freq_workfn(struct kthread_work *work)
> +{
> +	struct cppc_freq_invariance *cppc_fi;
> +	struct cppc_perf_fb_ctrs fb_ctrs = {0};
> +	struct cppc_cpudata *cpu_data;
> +	unsigned long local_freq_scale;
> +	u64 perf;
> +
> +	cppc_fi = container_of(work, struct cppc_freq_invariance, work);
> +	cpu_data = cppc_fi->cpu_data;
> +
> +	if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) {
> +		pr_warn("%s: failed to read perf counters\n", __func__);
> +		return;
> +	}
> +
> +	perf = cppc_perf_from_fbctrs(cpu_data, &cppc_fi->prev_perf_fb_ctrs,
> +				     &fb_ctrs);
> +	cppc_fi->prev_perf_fb_ctrs = fb_ctrs;
> +
> +	perf <<= SCHED_CAPACITY_SHIFT;
> +	local_freq_scale = div64_u64(perf, cpu_data->perf_caps.highest_perf);
> +
> +	/* This can happen due to counter overflows */
> +	if (unlikely(local_freq_scale > 1024))
> +		local_freq_scale = 1024;
> +
> +	per_cpu(arch_freq_scale, cppc_fi->cpu) = local_freq_scale;
> +}
> +
> +static void cppc_irq_work(struct irq_work *irq_work)
> +{
> +	struct cppc_freq_invariance *cppc_fi;
> +
> +	cppc_fi = container_of(irq_work, struct cppc_freq_invariance, irq_work);
> +	kthread_queue_work(kworker_fie, &cppc_fi->work);
> +}
> +
> +static void cppc_scale_freq_tick(void)
> +{
> +	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, smp_processor_id());
> +
> +	/*
> +	 * cppc_get_perf_ctrs() can potentially sleep, call that from the right
> +	 * context.
> +	 */
> +	irq_work_queue(&cppc_fi->irq_work);
> +}
> +
> +static struct scale_freq_data cppc_sftd = {
> +	.source = SCALE_FREQ_SOURCE_CPPC,
> +	.set_freq_scale = cppc_scale_freq_tick,
> +};
> +
> +static int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)
> +{
> +	struct cppc_freq_invariance *cppc_fi;
> +	int cpu, ret;
> +
> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
> +		return 0;
> +
> +	for_each_cpu(cpu, policy->cpus) {
> +		cppc_fi = &per_cpu(cppc_freq_inv, cpu);
> +		cppc_fi->cpu = cpu;
> +		cppc_fi->cpu_data = policy->driver_data;
> +		kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn);
> +		init_irq_work(&cppc_fi->irq_work, cppc_irq_work);
> +
> +		ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs);
> +		if (ret) {
> +			pr_warn("%s: failed to read perf counters for cpu:%d: %d\n",
> +				__func__, cpu, ret);
> +
> +			/* Avoid failing driver's probe because of FIE */
> +			return 0;
> +		}
> +	}
> +
> +	/* Register for freq-invariance */
> +	topology_set_scale_freq_source(&cppc_sftd, policy->cpus);
> +	return 0;
> +}
> +
> +/*
> + * We free all the resources on policy's removal and not on CPU removal as the
> + * irq-work are per-cpu and the hotplug core takes care of flushing the pending
> + * irq-works (hint: smpcfd_dying_cpu()) on CPU hotplug. Even if the kthread-work
> + * fires on another CPU after the concerned CPU is removed, it won't harm.
> + *
> + * We just need to make sure to remove them all on policy->exit().
> + */
> +static void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)
> +{
> +	struct cppc_freq_invariance *cppc_fi;
> +	int cpu;
> +
> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
> +		return;
> +
> +	/* policy->cpus will be empty here, use related_cpus instead */
> +	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, policy->related_cpus);
> +
> +	for_each_cpu(cpu, policy->related_cpus) {
> +		cppc_fi = &per_cpu(cppc_freq_inv, cpu);
> +		irq_work_sync(&cppc_fi->irq_work);
> +		kthread_cancel_work_sync(&cppc_fi->work);
> +	}
> +}

Yep, looks good (with its call on cppc_cpufreq_cpu_exit()).

Feel free to post V3 and I'll take an even closer look during testing.

Thanks,
Ionela.


> +
> +static void __init cppc_freq_invariance_init(void)
> +{
> +	struct sched_attr attr = {
> +		.size		= sizeof(struct sched_attr),
> +		.sched_policy	= SCHED_DEADLINE,
> +		.sched_nice	= 0,
> +		.sched_priority	= 0,
> +		/*
> +		 * Fake (unused) bandwidth; workaround to "fix"
> +		 * priority inheritance.
> +		 */
> +		.sched_runtime	= 1000000,
> +		.sched_deadline = 10000000,
> +		.sched_period	= 10000000,
> +	};
> +	int ret;
> +
> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
> +		return;
> +
> +	kworker_fie = kthread_create_worker(0, "cppc_fie");
> +	if (IS_ERR(kworker_fie))
> +		return;
> +
> +	ret = sched_setattr_nocheck(kworker_fie->task, &attr);
> +	if (ret) {
> +		pr_warn("%s: failed to set SCHED_DEADLINE: %d\n", __func__,
> +			ret);
> +		kthread_destroy_worker(kworker_fie);
> +		return;
> +	}
> +}
> +
> +static void cppc_freq_invariance_exit(void)
> +{
> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
> +		return;
> +
> +	kthread_destroy_worker(kworker_fie);
> +	kworker_fie = NULL;
> +}
> +
> +#else
> +static inline int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)
> +{
> +	return 0;
> +}
> +
> +static inline void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)
> +{
> +}
> +
> +static inline void cppc_freq_invariance_init(void)
> +{
> +}
> +
> +static inline void cppc_freq_invariance_exit(void)
> +{
> +}
> +#endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */
> +
>  /* Callback function used to retrieve the max frequency from DMI */
>  static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
>  {
> @@ -324,11 +536,13 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  	cpu_data->perf_ctrls.desired_perf =  caps->highest_perf;
>  
>  	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
> -	if (ret)
> +	if (ret) {
>  		pr_debug("Err setting perf value:%d on CPU:%d. ret:%d\n",
>  			 caps->highest_perf, cpu, ret);
> +		return ret;
> +	}
>  
> -	return ret;
> +	return cppc_cpufreq_cpu_fie_init(policy);
>  }
>  
>  static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> @@ -338,6 +552,8 @@ static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
>  	unsigned int cpu = policy->cpu;
>  	int ret;
>  
> +	cppc_cpufreq_cpu_fie_exit(policy);
> +
>  	cpu_data->perf_ctrls.desired_perf = caps->lowest_perf;
>  
>  	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
> @@ -362,26 +578,35 @@ static inline u64 get_delta(u64 t1, u64 t0)
>  	return (u32)t1 - (u32)t0;
>  }
>  
> -static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
> -				     struct cppc_perf_fb_ctrs fb_ctrs_t0,
> -				     struct cppc_perf_fb_ctrs fb_ctrs_t1)
> +static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t1)
>  {
>  	u64 delta_reference, delta_delivered;
> -	u64 reference_perf, delivered_perf;
> +	u64 reference_perf;
>  
> -	reference_perf = fb_ctrs_t0.reference_perf;
> +	reference_perf = fb_ctrs_t0->reference_perf;
>  
> -	delta_reference = get_delta(fb_ctrs_t1.reference,
> -				    fb_ctrs_t0.reference);
> -	delta_delivered = get_delta(fb_ctrs_t1.delivered,
> -				    fb_ctrs_t0.delivered);
> +	delta_reference = get_delta(fb_ctrs_t1->reference,
> +				    fb_ctrs_t0->reference);
> +	delta_delivered = get_delta(fb_ctrs_t1->delivered,
> +				    fb_ctrs_t0->delivered);
>  
> -	/* Check to avoid divide-by zero */
> -	if (delta_reference || delta_delivered)
> -		delivered_perf = (reference_perf * delta_delivered) /
> -					delta_reference;
> -	else
> -		delivered_perf = cpu_data->perf_ctrls.desired_perf;
> +	/* Check to avoid divide-by zero and invalid delivered_perf */
> +	if (!delta_reference || !delta_delivered)
> +		return cpu_data->perf_ctrls.desired_perf;
> +
> +	return (reference_perf * delta_delivered) / delta_reference;
> +}
> +
> +static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
> +				     struct cppc_perf_fb_ctrs *fb_ctrs_t0,
> +				     struct cppc_perf_fb_ctrs *fb_ctrs_t1)
> +{
> +	u64 delivered_perf;
> +
> +	delivered_perf = cppc_perf_from_fbctrs(cpu_data, fb_ctrs_t0,
> +					       fb_ctrs_t1);
>  
>  	return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);
>  }
> @@ -405,7 +630,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
>  	if (ret)
>  		return ret;
>  
> -	return cppc_get_rate_from_fbctrs(cpu_data, fb_ctrs_t0, fb_ctrs_t1);
> +	return cppc_get_rate_from_fbctrs(cpu_data, &fb_ctrs_t0, &fb_ctrs_t1);
>  }
>  
>  static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
> @@ -506,14 +731,21 @@ static void cppc_check_hisi_workaround(void)
>  
>  static int __init cppc_cpufreq_init(void)
>  {
> +	int ret;
> +
>  	if ((acpi_disabled) || !acpi_cpc_valid())
>  		return -ENODEV;
>  
>  	INIT_LIST_HEAD(&cpu_data_list);
>  
>  	cppc_check_hisi_workaround();
> +	cppc_freq_invariance_init();
>  
> -	return cpufreq_register_driver(&cppc_cpufreq_driver);
> +	ret = cpufreq_register_driver(&cppc_cpufreq_driver);
> +	if (ret)
> +		cppc_freq_invariance_exit();
> +
> +	return ret;
>  }
>  
>  static inline void free_cpu_data(void)
> @@ -531,6 +763,7 @@ static inline void free_cpu_data(void)
>  static void __exit cppc_cpufreq_exit(void)
>  {
>  	cpufreq_unregister_driver(&cppc_cpufreq_driver);
> +	cppc_freq_invariance_exit();
>  
>  	free_cpu_data();
>  }
> 
> -- 
> viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-06-18 12:26 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-16  6:48 [PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance Viresh Kumar
2021-06-16  6:48 ` [PATCH V2 1/3] cpufreq: Add start_cpu() and stop_cpu() callbacks Viresh Kumar
2021-06-17 13:33   ` Rafael J. Wysocki
2021-06-18  7:46     ` Viresh Kumar
2021-06-16  6:48 ` [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data Viresh Kumar
2021-06-16  7:57   ` Greg Kroah-Hartman
2021-06-16  8:18     ` Viresh Kumar
2021-06-16  8:31       ` Greg Kroah-Hartman
2021-06-16  9:10         ` Viresh Kumar
2021-06-16 11:25   ` Ionela Voinescu
2021-06-16 11:36     ` Viresh Kumar
2021-06-16 12:00       ` Ionela Voinescu
2021-06-17  3:06         ` Viresh Kumar
2021-06-16  6:48 ` [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance Viresh Kumar
2021-06-16 12:48   ` Ionela Voinescu
2021-06-17  3:24     ` Viresh Kumar
2021-06-17 10:34       ` Ionela Voinescu
2021-06-17 11:19         ` Viresh Kumar
2021-06-17 12:22           ` Peter Zijlstra
2021-06-18  3:45             ` Viresh Kumar
2021-06-18  7:37         ` Viresh Kumar
2021-06-18 12:26           ` Ionela Voinescu
2021-06-16 10:02 ` [PATCH V2 0/3] cpufreq: cppc: " Vincent Guittot
2021-06-16 11:54   ` Viresh Kumar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.