linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/4] Per CPU idle injection
@ 2022-11-29 23:34 Srinivas Pandruvada
  2022-11-29 23:34 ` [PATCH v2 1/4] powercap: idle_inject: Export symbols Srinivas Pandruvada
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Srinivas Pandruvada @ 2022-11-29 23:34 UTC (permalink / raw)
  To: rafael
  Cc: linux-pm, linux-kernel, daniel.lezcano, rui.zhang, amitk,
	Srinivas Pandruvada, kernel test robot

This series introduces per CPU idle injection. In preparation for this
enhance the existing powercap/idle_inject and modify intel_powerclamp
to use this. Then add per core idle injection driver.

v2
- Update based on feedback from Rafael on patch 2/4
- Kconfig dependency issue
Reported-by: kernel test robot <lkp@intel.com>

Srinivas Pandruvada (4):
  powercap: idle_inject: Export symbols
  powercap: idle_inject: Add prepare/complete callbacks
  thermal/drivers/intel_powerclamp: Use powercap idle-inject framework
  thermal/drivers/intel_cpu_idle_cooling: Introduce Intel cpu idle
    cooling driver

 drivers/powercap/idle_inject.c                |  69 ++++-
 drivers/thermal/intel/Kconfig                 |  12 +
 drivers/thermal/intel/Makefile                |   1 +
 .../thermal/intel/intel_cpu_idle_cooling.c    | 261 ++++++++++++++++
 drivers/thermal/intel/intel_powerclamp.c      | 292 ++++++++----------
 include/linux/idle_inject.h                   |   4 +
 6 files changed, 467 insertions(+), 172 deletions(-)
 create mode 100644 drivers/thermal/intel/intel_cpu_idle_cooling.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/4] powercap: idle_inject: Export symbols
  2022-11-29 23:34 [PATCH v2 0/4] Per CPU idle injection Srinivas Pandruvada
@ 2022-11-29 23:34 ` Srinivas Pandruvada
  2023-01-12 15:05   ` Rafael J. Wysocki
  2022-11-29 23:34 ` [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks Srinivas Pandruvada
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Srinivas Pandruvada @ 2022-11-29 23:34 UTC (permalink / raw)
  To: rafael
  Cc: linux-pm, linux-kernel, daniel.lezcano, rui.zhang, amitk,
	Srinivas Pandruvada

Export symbols for external interfaces, so that they can be used in
other loadable modules.

Export is done under name space IDLE_INJECT.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
v2:
	No change

 drivers/powercap/idle_inject.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
index fe86a09e3b67..dfa989182e71 100644
--- a/drivers/powercap/idle_inject.c
+++ b/drivers/powercap/idle_inject.c
@@ -160,6 +160,7 @@ void idle_inject_set_duration(struct idle_inject_device *ii_dev,
 		WRITE_ONCE(ii_dev->idle_duration_us, idle_duration_us);
 	}
 }
+EXPORT_SYMBOL_NS_GPL(idle_inject_set_duration, IDLE_INJECT);
 
 /**
  * idle_inject_get_duration - idle and run duration retrieval helper
@@ -174,6 +175,7 @@ void idle_inject_get_duration(struct idle_inject_device *ii_dev,
 	*run_duration_us = READ_ONCE(ii_dev->run_duration_us);
 	*idle_duration_us = READ_ONCE(ii_dev->idle_duration_us);
 }
+EXPORT_SYMBOL_NS_GPL(idle_inject_get_duration, IDLE_INJECT);
 
 /**
  * idle_inject_set_latency - set the maximum latency allowed
@@ -185,6 +187,7 @@ void idle_inject_set_latency(struct idle_inject_device *ii_dev,
 {
 	WRITE_ONCE(ii_dev->latency_us, latency_us);
 }
+EXPORT_SYMBOL_NS_GPL(idle_inject_set_latency, IDLE_INJECT);
 
 /**
  * idle_inject_start - start idle injections
@@ -216,6 +219,7 @@ int idle_inject_start(struct idle_inject_device *ii_dev)
 
 	return 0;
 }
+EXPORT_SYMBOL_NS_GPL(idle_inject_start, IDLE_INJECT);
 
 /**
  * idle_inject_stop - stops idle injections
@@ -262,6 +266,7 @@ void idle_inject_stop(struct idle_inject_device *ii_dev)
 
 	cpu_hotplug_enable();
 }
+EXPORT_SYMBOL_NS_GPL(idle_inject_stop, IDLE_INJECT);
 
 /**
  * idle_inject_setup - prepare the current task for idle injection
@@ -337,6 +342,7 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
 
 	return NULL;
 }
+EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
 
 /**
  * idle_inject_unregister - unregister idle injection control device
@@ -357,6 +363,7 @@ void idle_inject_unregister(struct idle_inject_device *ii_dev)
 
 	kfree(ii_dev);
 }
+EXPORT_SYMBOL_NS_GPL(idle_inject_unregister, IDLE_INJECT);
 
 static struct smp_hotplug_thread idle_inject_threads = {
 	.store = &idle_inject_thread.tsk,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks
  2022-11-29 23:34 [PATCH v2 0/4] Per CPU idle injection Srinivas Pandruvada
  2022-11-29 23:34 ` [PATCH v2 1/4] powercap: idle_inject: Export symbols Srinivas Pandruvada
@ 2022-11-29 23:34 ` Srinivas Pandruvada
  2022-12-21 14:52   ` Daniel Lezcano
  2023-01-12 14:53   ` Rafael J. Wysocki
  2022-11-29 23:34 ` [PATCH v2 3/4] thermal/drivers/intel_powerclamp: Use powercap idle-inject framework Srinivas Pandruvada
  2022-11-29 23:34 ` [PATCH v2 4/4] thermal/drivers/intel_cpu_idle_cooling: Introduce Intel cpu idle cooling driver Srinivas Pandruvada
  3 siblings, 2 replies; 18+ messages in thread
From: Srinivas Pandruvada @ 2022-11-29 23:34 UTC (permalink / raw)
  To: rafael
  Cc: linux-pm, linux-kernel, daniel.lezcano, rui.zhang, amitk,
	Srinivas Pandruvada

The actual idle percentage can be less than the desired because of
interrupts. Since the objective for CPU Idle injection is for thermal
control, there should be some way to compensate for lost idle percentage.
Some architectures provide interface to get actual idle percent observed
by the hardware. So, the idle percent can be adjusted using the hardware
feedback. For example, Intel CPUs provides package idle counters, which
is currently used by intel powerclamp driver to adjust idle time.

The only way this can be done currently is by monitoring hardware idle
percent from a different software thread. This can be avoided by adding
callbacks.

Add a capability to register a prepare and complete callback during idle
inject registry. Add a new register function idle_inject_register_full()
which also allows to register callbacks.

If they are not NULL, then prepare callback is called before calling
play_idle_precise() and complete callback is called after calling
play_idle_precise().

If prepare callback is present and returns non 0 value then
play_idle_precise() is not called to avoid over compensation.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
v2
- Replace begin/end with prepare/complete
- Add new interface idle_inject_register_full with callbacks
- Update kernel doc
- Update commit description

 drivers/powercap/idle_inject.c | 62 +++++++++++++++++++++++++++++++---
 include/linux/idle_inject.h    |  4 +++
 2 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
index dfa989182e71..f48e71501429 100644
--- a/drivers/powercap/idle_inject.c
+++ b/drivers/powercap/idle_inject.c
@@ -63,13 +63,31 @@ struct idle_inject_thread {
  * @idle_duration_us: duration of CPU idle time to inject
  * @run_duration_us: duration of CPU run time to allow
  * @latency_us: max allowed latency
+ * @prepare: Callback function which is called before calling
+ *		play_idle_precise()
+ * @complete: Callback function which is called after calling
+ *		play_idle_precise()
  * @cpumask: mask of CPUs affected by idle injection
+ *
+ * This structure is used to define per instance idle inject device data. Each
+ * instance has an idle duration, a run duration and mask of CPUs to inject
+ * idle.
+ * Actual idle is injected by calling kernel scheduler interface
+ * play_idle_precise(). There are two optional callbacks which the caller can
+ * register by calling idle_inject_register_full():
+ * prepare() - This callback is called just before calling play_idle_precise()
+ *		If this callback returns non zero value then
+ *		play_idle_precise() is not called. This means skip injecting
+ *		idle during this period.
+ * complete() - This callback is called after calling play_idle_precise().
  */
 struct idle_inject_device {
 	struct hrtimer timer;
 	unsigned int idle_duration_us;
 	unsigned int run_duration_us;
 	unsigned int latency_us;
+	int (*prepare)(unsigned int cpu);
+	void (*complete)(unsigned int cpu);
 	unsigned long cpumask[];
 };
 
@@ -132,6 +150,7 @@ static void idle_inject_fn(unsigned int cpu)
 {
 	struct idle_inject_device *ii_dev;
 	struct idle_inject_thread *iit;
+	int ret;
 
 	ii_dev = per_cpu(idle_inject_device, cpu);
 	iit = per_cpu_ptr(&idle_inject_thread, cpu);
@@ -141,8 +160,18 @@ static void idle_inject_fn(unsigned int cpu)
 	 */
 	iit->should_run = 0;
 
+	if (ii_dev->prepare) {
+		ret = ii_dev->prepare(cpu);
+		if (ret)
+			goto skip;
+	}
+
 	play_idle_precise(READ_ONCE(ii_dev->idle_duration_us) * NSEC_PER_USEC,
 			  READ_ONCE(ii_dev->latency_us) * NSEC_PER_USEC);
+
+skip:
+	if (ii_dev->complete)
+		ii_dev->complete(cpu);
 }
 
 /**
@@ -295,17 +324,23 @@ static int idle_inject_should_run(unsigned int cpu)
 }
 
 /**
- * idle_inject_register - initialize idle injection on a set of CPUs
+ * idle_inject_register_full - initialize idle injection on a set of CPUs
  * @cpumask: CPUs to be affected by idle injection
+ * @prepare: callback called before calling play_idle_precise()
+ * @complete: callback called after calling play_idle_precise()
  *
  * This function creates an idle injection control device structure for the
- * given set of CPUs and initializes the timer associated with it.  It does not
- * start any injection cycles.
+ * given set of CPUs and initializes the timer associated with it. This
+ * function also allows to register prepare() and complete() callbacks.
+ * It does not start any injection cycles.
  *
  * Return: NULL if memory allocation fails, idle injection control device
  * pointer on success.
  */
-struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
+
+struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
+						     int (*prepare)(unsigned int cpu),
+						     void (*complete)(unsigned int cpu))
 {
 	struct idle_inject_device *ii_dev;
 	int cpu, cpu_rb;
@@ -318,6 +353,8 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
 	hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	ii_dev->timer.function = idle_inject_timer_fn;
 	ii_dev->latency_us = UINT_MAX;
+	ii_dev->prepare = prepare;
+	ii_dev->complete = complete;
 
 	for_each_cpu(cpu, to_cpumask(ii_dev->cpumask)) {
 
@@ -342,6 +379,23 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
 
 	return NULL;
 }
+EXPORT_SYMBOL_NS_GPL(idle_inject_register_full, IDLE_INJECT);
+
+/**
+ * idle_inject_register - initialize idle injection on a set of CPUs
+ * @cpumask: CPUs to be affected by idle injection
+ *
+ * This function creates an idle injection control device structure for the
+ * given set of CPUs and initializes the timer associated with it.  It does not
+ * start any injection cycles.
+ *
+ * Return: NULL if memory allocation fails, idle injection control device
+ * pointer on success.
+ */
+struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
+{
+	return idle_inject_register_full(cpumask, NULL, NULL);
+}
 EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
 
 /**
diff --git a/include/linux/idle_inject.h b/include/linux/idle_inject.h
index fb88e23a99d3..e18d89793490 100644
--- a/include/linux/idle_inject.h
+++ b/include/linux/idle_inject.h
@@ -13,6 +13,10 @@ struct idle_inject_device;
 
 struct idle_inject_device *idle_inject_register(struct cpumask *cpumask);
 
+struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
+						     int (*prepare)(unsigned int cpu),
+						     void (*complete)(unsigned int cpu));
+
 void idle_inject_unregister(struct idle_inject_device *ii_dev);
 
 int idle_inject_start(struct idle_inject_device *ii_dev);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 3/4] thermal/drivers/intel_powerclamp: Use powercap idle-inject framework
  2022-11-29 23:34 [PATCH v2 0/4] Per CPU idle injection Srinivas Pandruvada
  2022-11-29 23:34 ` [PATCH v2 1/4] powercap: idle_inject: Export symbols Srinivas Pandruvada
  2022-11-29 23:34 ` [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks Srinivas Pandruvada
@ 2022-11-29 23:34 ` Srinivas Pandruvada
  2023-01-12 18:32   ` Rafael J. Wysocki
  2022-11-29 23:34 ` [PATCH v2 4/4] thermal/drivers/intel_cpu_idle_cooling: Introduce Intel cpu idle cooling driver Srinivas Pandruvada
  3 siblings, 1 reply; 18+ messages in thread
From: Srinivas Pandruvada @ 2022-11-29 23:34 UTC (permalink / raw)
  To: rafael
  Cc: linux-pm, linux-kernel, daniel.lezcano, rui.zhang, amitk,
	Srinivas Pandruvada, kernel test robot

There are two idle injection implementation in the Linux kernel. One
via intel_powerclamp and the other using powercap/idle_inject. Both
implementation end up in calling play_idle* function from a FIFO
priority thread. Both can't be used at the same time.

Currently per core idle injection (cpuidle_cooling) is using
powercap/idle_inject, which is not used in platforms where
intel_powerclamp is used for system wide idle injection. So there is
no conflict. But there are some use cases where per core idle injection
is beneficial on the same system where system wide idle injection is
also used via intel_powerclamp. To avoid conflict only one of the idle
injection type must be in use at a time. This require a common framework
which both per core and system wide idle injection can use.

Here powercap/idle_inject can be used for both per-core and for system
wide idle injection. This framework has a well defined interface
which allow registry for per-core or for all CPUs (system wide). If
particular CPU is already participating in idle injection, the call
to registry fails. Here the registry can be done when user space
changes the current cooling device state.

Also one framework for idle injection is better as there is one loop
calling play_idle*, instead of multiple for better maintenance.

So, reuse powercap/idle_inject calls in intel_powerclamp. This simplifies
the code as all per CPU kthreads which calls play_idle* can be removed.

The changes include:
- Remove unneeded include files
- Remove per CPU kthread workers: balancing_work and idle_injection_work
- Reuse the compensation related code by moving from previous worker
thread to idle_injection callbacks
- Adjust the idle_duration and runtime by using powercap/idle_inject
interface
- Remove all variables, which are not required once powercap/idle_inject
is used
- Add mutex to avoid race during removal of idle injection during module
unload and user action to change idle inject percent
- Use READ_ONCE and WRITE_ONCE for data accessed from multiple CPUs

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
v2:
- Use idle_inject_register_full instead of idle_inject_register
- Also fix dependency issue with POWERCAP config
Reported-by: kernel test robot <lkp@intel.com>

 drivers/thermal/intel/Kconfig            |   2 +
 drivers/thermal/intel/intel_powerclamp.c | 292 ++++++++++-------------
 2 files changed, 126 insertions(+), 168 deletions(-)

diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
index f0c845679250..6c2a95f41c81 100644
--- a/drivers/thermal/intel/Kconfig
+++ b/drivers/thermal/intel/Kconfig
@@ -3,6 +3,8 @@ config INTEL_POWERCLAMP
 	tristate "Intel PowerClamp idle injection driver"
 	depends on X86
 	depends on CPU_SUP_INTEL
+	select POWERCAP
+	select IDLE_INJECT
 	help
 	  Enable this to enable Intel PowerClamp idle injection driver. This
 	  enforce idle time which results in more package C-state residency. The
diff --git a/drivers/thermal/intel/intel_powerclamp.c b/drivers/thermal/intel/intel_powerclamp.c
index b80e25ec1261..3f2b20ae8f68 100644
--- a/drivers/thermal/intel/intel_powerclamp.c
+++ b/drivers/thermal/intel/intel_powerclamp.c
@@ -2,7 +2,7 @@
 /*
  * intel_powerclamp.c - package c-state idle injection
  *
- * Copyright (c) 2012, Intel Corporation.
+ * Copyright (c) 2022, Intel Corporation.
  *
  * Authors:
  *     Arjan van de Ven <arjan@linux.intel.com>
@@ -27,21 +27,15 @@
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/delay.h>
-#include <linux/kthread.h>
 #include <linux/cpu.h>
 #include <linux/thermal.h>
-#include <linux/slab.h>
-#include <linux/tick.h>
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
-#include <linux/sched/rt.h>
-#include <uapi/linux/sched/types.h>
+#include <linux/idle_inject.h>
 
-#include <asm/nmi.h>
 #include <asm/msr.h>
 #include <asm/mwait.h>
 #include <asm/cpu_device_id.h>
-#include <asm/hardirq.h>
 
 #define MAX_TARGET_RATIO (50U)
 /* For each undisturbed clamping period (no extra wake ups during idle time),
@@ -60,6 +54,7 @@ static struct dentry *debug_dir;
 
 /* user selected target */
 static unsigned int set_target_ratio;
+static bool target_ratio_updated;
 static unsigned int current_ratio;
 static bool should_skip;
 
@@ -67,26 +62,20 @@ static unsigned int control_cpu; /* The cpu assigned to collect stat and update
 				  * control parameters. default to BSP but BSP
 				  * can be offlined.
 				  */
-static bool clamping;
-
-struct powerclamp_worker_data {
-	struct kthread_worker *worker;
-	struct kthread_work balancing_work;
-	struct kthread_delayed_work idle_injection_work;
+struct powerclamp_data {
 	unsigned int cpu;
 	unsigned int count;
 	unsigned int guard;
 	unsigned int window_size_now;
 	unsigned int target_ratio;
-	unsigned int duration_jiffies;
 	bool clamping;
 };
 
-static struct powerclamp_worker_data __percpu *worker_data;
+static struct powerclamp_data powerclamp_data;
+
 static struct thermal_cooling_device *cooling_dev;
-static unsigned long *cpu_clamping_mask;  /* bit map for tracking per cpu
-					   * clamping kthread worker
-					   */
+
+static DEFINE_MUTEX(powerclamp_lock);
 
 static unsigned int duration;
 static unsigned int pkg_cstate_ratio_cur;
@@ -344,79 +333,33 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
 	return set_target_ratio + guard <= current_ratio;
 }
 
-static void clamp_balancing_func(struct kthread_work *work)
+static unsigned int get_run_time(void)
 {
-	struct powerclamp_worker_data *w_data;
-	int sleeptime;
-	unsigned long target_jiffies;
 	unsigned int compensated_ratio;
-	int interval; /* jiffies to sleep for each attempt */
-
-	w_data = container_of(work, struct powerclamp_worker_data,
-			      balancing_work);
+	unsigned int runtime;
 
 	/*
 	 * make sure user selected ratio does not take effect until
 	 * the next round. adjust target_ratio if user has changed
 	 * target such that we can converge quickly.
 	 */
-	w_data->target_ratio = READ_ONCE(set_target_ratio);
-	w_data->guard = 1 + w_data->target_ratio / 20;
-	w_data->window_size_now = window_size;
-	w_data->duration_jiffies = msecs_to_jiffies(duration);
-	w_data->count++;
+	powerclamp_data.target_ratio = READ_ONCE(set_target_ratio);
+	powerclamp_data.guard = 1 + powerclamp_data.target_ratio / 20;
+	powerclamp_data.window_size_now = window_size;
 
 	/*
 	 * systems may have different ability to enter package level
 	 * c-states, thus we need to compensate the injected idle ratio
 	 * to achieve the actual target reported by the HW.
 	 */
-	compensated_ratio = w_data->target_ratio +
-		get_compensation(w_data->target_ratio);
+	compensated_ratio = powerclamp_data.target_ratio +
+		get_compensation(powerclamp_data.target_ratio);
 	if (compensated_ratio <= 0)
 		compensated_ratio = 1;
-	interval = w_data->duration_jiffies * 100 / compensated_ratio;
-
-	/* align idle time */
-	target_jiffies = roundup(jiffies, interval);
-	sleeptime = target_jiffies - jiffies;
-	if (sleeptime <= 0)
-		sleeptime = 1;
-
-	if (clamping && w_data->clamping && cpu_online(w_data->cpu))
-		kthread_queue_delayed_work(w_data->worker,
-					   &w_data->idle_injection_work,
-					   sleeptime);
-}
 
-static void clamp_idle_injection_func(struct kthread_work *work)
-{
-	struct powerclamp_worker_data *w_data;
-
-	w_data = container_of(work, struct powerclamp_worker_data,
-			      idle_injection_work.work);
+	runtime = duration * 100 / compensated_ratio - duration;
 
-	/*
-	 * only elected controlling cpu can collect stats and update
-	 * control parameters.
-	 */
-	if (w_data->cpu == control_cpu &&
-	    !(w_data->count % w_data->window_size_now)) {
-		should_skip =
-			powerclamp_adjust_controls(w_data->target_ratio,
-						   w_data->guard,
-						   w_data->window_size_now);
-		smp_mb();
-	}
-
-	if (should_skip)
-		goto balance;
-
-	play_idle(jiffies_to_usecs(w_data->duration_jiffies));
-
-balance:
-	if (clamping && w_data->clamping && cpu_online(w_data->cpu))
-		kthread_queue_work(w_data->worker, &w_data->balancing_work);
+	return runtime;
 }
 
 /*
@@ -452,104 +395,127 @@ static void poll_pkg_cstate(struct work_struct *dummy)
 	msr_last = msr_now;
 	tsc_last = tsc_now;
 
-	if (true == clamping)
+	if (powerclamp_data.clamping)
 		schedule_delayed_work(&poll_pkg_cstate_work, HZ);
 }
 
-static void start_power_clamp_worker(unsigned long cpu)
+static struct idle_inject_device *ii_dev;
+
+static int idle_inject_begin(unsigned int cpu)
 {
-	struct powerclamp_worker_data *w_data = per_cpu_ptr(worker_data, cpu);
-	struct kthread_worker *worker;
+	/*
+	 * only elected controlling cpu can collect stats and update
+	 * control parameters.
+	 */
+	if (cpu == control_cpu) {
+		bool update = READ_ONCE(target_ratio_updated);
+
+		if (!(powerclamp_data.count % powerclamp_data.window_size_now)) {
+			bool skip = powerclamp_adjust_controls(powerclamp_data.target_ratio,
+						       powerclamp_data.guard,
+						       powerclamp_data.window_size_now);
+			WRITE_ONCE(should_skip, skip);
+			update = true;
+		}
 
-	worker = kthread_create_worker_on_cpu(cpu, 0, "kidle_inj/%ld", cpu);
-	if (IS_ERR(worker))
-		return;
+		if (update) {
+			unsigned int runtime;
+
+			runtime = get_run_time();
+			idle_inject_set_duration(ii_dev, runtime, duration);
+			WRITE_ONCE(target_ratio_updated, false);
+		}
+		powerclamp_data.count++;
+	}
+
+	if (READ_ONCE(should_skip))
+		return -EAGAIN;
 
-	w_data->worker = worker;
-	w_data->count = 0;
-	w_data->cpu = cpu;
-	w_data->clamping = true;
-	set_bit(cpu, cpu_clamping_mask);
-	sched_set_fifo(worker->task);
-	kthread_init_work(&w_data->balancing_work, clamp_balancing_func);
-	kthread_init_delayed_work(&w_data->idle_injection_work,
-				  clamp_idle_injection_func);
-	kthread_queue_work(w_data->worker, &w_data->balancing_work);
+	return 0;
 }
 
-static void stop_power_clamp_worker(unsigned long cpu)
+static void trigger_idle_injection(void)
 {
-	struct powerclamp_worker_data *w_data = per_cpu_ptr(worker_data, cpu);
+	unsigned int runtime = get_run_time();
 
-	if (!w_data->worker)
-		return;
+	idle_inject_set_duration(ii_dev, runtime, duration);
+	idle_inject_start(ii_dev);
+	powerclamp_data.clamping = true;
+}
+
+static int powerclamp_idle_injection_register(void)
+{
+	static cpumask_t idle_injection_cpu_mask;
+	unsigned long cpu;
 
-	w_data->clamping = false;
-	/*
-	 * Make sure that all works that get queued after this point see
-	 * the clamping disabled. The counter part is not needed because
-	 * there is an implicit memory barrier when the queued work
-	 * is proceed.
-	 */
-	smp_wmb();
-	kthread_cancel_work_sync(&w_data->balancing_work);
-	kthread_cancel_delayed_work_sync(&w_data->idle_injection_work);
 	/*
-	 * The balancing work still might be queued here because
-	 * the handling of the "clapming" variable, cancel, and queue
-	 * operations are not synchronized via a lock. But it is not
-	 * a big deal. The balancing work is fast and destroy kthread
-	 * will wait for it.
+	 * The idle inject core will only inject for online CPUs,
+	 * So we can register for all present CPUs. In this way
+	 * if some CPU goes online/offline while idle inject
+	 * is registered, nothing additional calls are required.
+	 * The same runtime and idle time is applicable for
+	 * newly onlined CPUs if any.
 	 */
-	clear_bit(w_data->cpu, cpu_clamping_mask);
-	kthread_destroy_worker(w_data->worker);
+	for_each_present_cpu(cpu) {
+		cpumask_set_cpu(cpu, &idle_injection_cpu_mask);
+	}
+
+	ii_dev = idle_inject_register_full(&idle_injection_cpu_mask,
+					   idle_inject_begin, NULL);
+	if (!ii_dev) {
+		pr_err("powerclamp: idle_inject_register failed\n");
+		return -EAGAIN;
+	}
 
-	w_data->worker = NULL;
+	idle_inject_set_duration(ii_dev, TICK_USEC, duration);
+	idle_inject_set_latency(ii_dev, UINT_MAX);
+
+	return 0;
+}
+
+static void remove_idle_injection(void)
+{
+	if (!powerclamp_data.clamping)
+		return;
+
+	powerclamp_data.clamping = false;
+	idle_inject_stop(ii_dev);
 }
 
 static int start_power_clamp(void)
 {
-	unsigned long cpu;
+	int ret;
 
-	set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1);
 	/* prevent cpu hotplug */
 	cpus_read_lock();
 
 	/* prefer BSP */
 	control_cpu = cpumask_first(cpu_online_mask);
 
-	clamping = true;
-	schedule_delayed_work(&poll_pkg_cstate_work, 0);
-
-	/* start one kthread worker per online cpu */
-	for_each_online_cpu(cpu) {
-		start_power_clamp_worker(cpu);
+	ret = powerclamp_idle_injection_register();
+	if (!ret) {
+		trigger_idle_injection();
+		schedule_delayed_work(&poll_pkg_cstate_work, 0);
 	}
+
 	cpus_read_unlock();
 
-	return 0;
+	return ret;
 }
 
 static void end_power_clamp(void)
 {
-	int i;
-
-	/*
-	 * Block requeuing in all the kthread workers. They will flush and
-	 * stop faster.
-	 */
-	clamping = false;
-	for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
-		pr_debug("clamping worker for cpu %d alive, destroy\n", i);
-		stop_power_clamp_worker(i);
+	if (powerclamp_data.clamping) {
+		remove_idle_injection();
+		idle_inject_unregister(ii_dev);
 	}
 }
 
 static int powerclamp_cpu_online(unsigned int cpu)
 {
-	if (clamping == false)
+	if (!powerclamp_data.clamping)
 		return 0;
-	start_power_clamp_worker(cpu);
+
 	/* prefer BSP as controlling CPU */
 	if (cpu == 0) {
 		control_cpu = 0;
@@ -560,10 +526,6 @@ static int powerclamp_cpu_online(unsigned int cpu)
 
 static int powerclamp_cpu_predown(unsigned int cpu)
 {
-	if (clamping == false)
-		return 0;
-
-	stop_power_clamp_worker(cpu);
 	if (cpu != control_cpu)
 		return 0;
 
@@ -585,7 +547,7 @@ static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
 static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
 				 unsigned long *state)
 {
-	if (true == clamping)
+	if (powerclamp_data.clamping)
 		*state = pkg_cstate_ratio_cur;
 	else
 		/* to save power, do not poll idle ratio while not clamping */
@@ -599,24 +561,30 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
 {
 	int ret = 0;
 
+	mutex_lock(&powerclamp_lock);
+
 	new_target_ratio = clamp(new_target_ratio, 0UL,
-				(unsigned long) (MAX_TARGET_RATIO-1));
-	if (set_target_ratio == 0 && new_target_ratio > 0) {
+				(unsigned long) (MAX_TARGET_RATIO - 1));
+	if (READ_ONCE(set_target_ratio) == 0 && new_target_ratio > 0) {
 		pr_info("Start idle injection to reduce power\n");
-		set_target_ratio = new_target_ratio;
+		WRITE_ONCE(set_target_ratio, new_target_ratio);
 		ret = start_power_clamp();
+		if (ret)
+			WRITE_ONCE(set_target_ratio, 0);
 		goto exit_set;
-	} else	if (set_target_ratio > 0 && new_target_ratio == 0) {
+	} else	if (READ_ONCE(set_target_ratio) > 0 && new_target_ratio == 0) {
 		pr_info("Stop forced idle injection\n");
 		end_power_clamp();
-		set_target_ratio = 0;
+		WRITE_ONCE(set_target_ratio, 0);
+		WRITE_ONCE(target_ratio_updated, false);
 	} else	/* adjust currently running */ {
-		set_target_ratio = new_target_ratio;
-		/* make new set_target_ratio visible to other cpus */
-		smp_mb();
+		WRITE_ONCE(set_target_ratio, new_target_ratio);
+		WRITE_ONCE(target_ratio_updated, true);
 	}
 
 exit_set:
+	mutex_unlock(&powerclamp_lock);
+
 	return ret;
 }
 
@@ -686,14 +654,10 @@ static int __init powerclamp_init(void)
 {
 	int retval;
 
-	cpu_clamping_mask = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
-	if (!cpu_clamping_mask)
-		return -ENOMEM;
-
 	/* probe cpu features and ids here */
 	retval = powerclamp_probe();
 	if (retval)
-		goto exit_free;
+		return retval;
 
 	/* set default limit, maybe adjusted during runtime based on feedback */
 	window_size = 2;
@@ -702,53 +666,45 @@ static int __init powerclamp_init(void)
 					   powerclamp_cpu_online,
 					   powerclamp_cpu_predown);
 	if (retval < 0)
-		goto exit_free;
+		return retval;
 
 	hp_state = retval;
 
-	worker_data = alloc_percpu(struct powerclamp_worker_data);
-	if (!worker_data) {
-		retval = -ENOMEM;
-		goto exit_unregister;
-	}
-
 	cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
-						&powerclamp_cooling_ops);
+						      &powerclamp_cooling_ops);
 	if (IS_ERR(cooling_dev)) {
 		retval = -ENODEV;
-		goto exit_free_thread;
+		goto exit_unregister;
 	}
 
 	if (!duration)
-		duration = jiffies_to_msecs(DEFAULT_DURATION_JIFFIES);
+		duration = jiffies_to_usecs(DEFAULT_DURATION_JIFFIES);
 
 	powerclamp_create_debug_files();
 
 	return 0;
 
-exit_free_thread:
-	free_percpu(worker_data);
 exit_unregister:
 	cpuhp_remove_state_nocalls(hp_state);
-exit_free:
-	bitmap_free(cpu_clamping_mask);
 	return retval;
 }
 module_init(powerclamp_init);
 
 static void __exit powerclamp_exit(void)
 {
+	mutex_lock(&powerclamp_lock);
 	end_power_clamp();
+	mutex_unlock(&powerclamp_lock);
 	cpuhp_remove_state_nocalls(hp_state);
-	free_percpu(worker_data);
 	thermal_cooling_device_unregister(cooling_dev);
-	bitmap_free(cpu_clamping_mask);
 
 	cancel_delayed_work_sync(&poll_pkg_cstate_work);
 	debugfs_remove_recursive(debug_dir);
 }
 module_exit(powerclamp_exit);
 
+MODULE_IMPORT_NS(IDLE_INJECT);
+
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Arjan van de Ven <arjan@linux.intel.com>");
 MODULE_AUTHOR("Jacob Pan <jacob.jun.pan@linux.intel.com>");
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 4/4] thermal/drivers/intel_cpu_idle_cooling: Introduce Intel cpu idle cooling driver
  2022-11-29 23:34 [PATCH v2 0/4] Per CPU idle injection Srinivas Pandruvada
                   ` (2 preceding siblings ...)
  2022-11-29 23:34 ` [PATCH v2 3/4] thermal/drivers/intel_powerclamp: Use powercap idle-inject framework Srinivas Pandruvada
@ 2022-11-29 23:34 ` Srinivas Pandruvada
  2022-12-21 15:10   ` Daniel Lezcano
  3 siblings, 1 reply; 18+ messages in thread
From: Srinivas Pandruvada @ 2022-11-29 23:34 UTC (permalink / raw)
  To: rafael
  Cc: linux-pm, linux-kernel, daniel.lezcano, rui.zhang, amitk,
	Srinivas Pandruvada

The cpu idle cooling is used to cool down a CPU by injecting idle cycles
at runtime. The objective is similar to intel_powerclamp driver, which is
used for system wide cooling by injecting idle on each CPU.

This driver is modeled after drivers/thermal/cpuidle_cooling.c by reusing
powercap/idle_inject framework.

On each CPU online a thermal cooling device is registered. The minimum
state of the cooling device is 0 and maximum is 100. When user space
changes the current state to non zero, then register with idle inject
framework and start idle inject.

The default idle duration is 24 milli seconds, matching intel_powerclamp,
which doesn't change based on the current state of cooling device. The
runtime is changed based on the current state.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
v2:
- Removed callback arguments for idle_inject_register

 drivers/thermal/intel/Kconfig                 |  10 +
 drivers/thermal/intel/Makefile                |   1 +
 .../thermal/intel/intel_cpu_idle_cooling.c    | 261 ++++++++++++++++++
 3 files changed, 272 insertions(+)
 create mode 100644 drivers/thermal/intel/intel_cpu_idle_cooling.c

diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
index 6c2a95f41c81..8c88d6e18414 100644
--- a/drivers/thermal/intel/Kconfig
+++ b/drivers/thermal/intel/Kconfig
@@ -115,3 +115,13 @@ config INTEL_HFI_THERMAL
 	  These capabilities may change as a result of changes in the operating
 	  conditions of the system such power and thermal limits. If selected,
 	  the kernel relays updates in CPUs' capabilities to userspace.
+
+config INTEL_CPU_IDLE_COOLING
+	tristate "Intel CPU idle cooling device"
+	depends on IDLE_INJECT
+	help
+	  This implements the CPU cooling mechanism through
+	  idle injection. This will throttle the CPU by injecting
+	  idle cycle.
+	  Unlike Intel Power clamp driver, this driver provides
+	  idle injection for each CPU.
diff --git a/drivers/thermal/intel/Makefile b/drivers/thermal/intel/Makefile
index 9a8d8054f316..8d5f7b5cf9b7 100644
--- a/drivers/thermal/intel/Makefile
+++ b/drivers/thermal/intel/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_INTEL_TCC_COOLING)	+= intel_tcc_cooling.o
 obj-$(CONFIG_X86_THERMAL_VECTOR) += therm_throt.o
 obj-$(CONFIG_INTEL_MENLOW)	+= intel_menlow.o
 obj-$(CONFIG_INTEL_HFI_THERMAL) += intel_hfi.o
+obj-$(CONFIG_INTEL_CPU_IDLE_COOLING) += intel_cpu_idle_cooling.o
diff --git a/drivers/thermal/intel/intel_cpu_idle_cooling.c b/drivers/thermal/intel/intel_cpu_idle_cooling.c
new file mode 100644
index 000000000000..cdd62756cc3d
--- /dev/null
+++ b/drivers/thermal/intel/intel_cpu_idle_cooling.c
@@ -0,0 +1,261 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per CPU Idle injection cooling device implementation
+ *
+ * Copyright (c) 2022, Intel Corporation.
+ * All rights reserved.
+ *
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cpufeature.h>
+#include <linux/cpuhotplug.h>
+#include <linux/idle_inject.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/thermal.h>
+#include <linux/topology.h>
+
+#include <asm/cpu_device_id.h>
+
+/* Duration match with intel_powerclamp driver */
+#define IDLE_DURATION		24000
+#define IDLE_LATENCY		UINT_MAX
+
+static int idle_duration_us = IDLE_DURATION;
+static int idle_latency_us = IDLE_LATENCY;
+
+module_param(idle_duration_us, int, 0644);
+MODULE_PARM_DESC(idle_duration_us,
+		 "Idle duration in us.");
+
+module_param(idle_latency_us, int, 0644);
+MODULE_PARM_DESC(idle_latency_us,
+		 "Idle latency in us.");
+
+/**
+ * struct cpuidle_cooling - Per instance data for cooling device
+ * @cpu: CPU number for this cooling device
+ * @ii_dev: Idle inject core instance pointer
+ * @cdev: Thermal core cooling device instance
+ * @state:  Current cooling device state
+ *
+ * Stores per instance cooling device state.
+ */
+struct cpuidle_cooling {
+	int cpu;
+	struct idle_inject_device *ii_dev;
+	struct thermal_cooling_device *cdev;
+	unsigned long state;
+};
+
+static DEFINE_PER_CPU(struct cpuidle_cooling, cooling_devs);
+static cpumask_t cpuidle_cpu_mask;
+
+/* Used for module unload protection with idle injection operations */
+static DEFINE_MUTEX(idle_cooling_lock);
+
+static unsigned int cpuidle_cooling_runtime(unsigned int idle_duration_us,
+					    unsigned long state)
+{
+	if (!state)
+		return 0;
+
+	return ((idle_duration_us * 100) / state) - idle_duration_us;
+}
+
+static int cpuidle_idle_injection_register(struct cpuidle_cooling *cooling_dev)
+{
+	struct idle_inject_device *ii_dev;
+
+	ii_dev = idle_inject_register((struct cpumask *)cpumask_of(cooling_dev->cpu));
+	if (!ii_dev) {
+		/*
+		 * It is busy as some other device claimed idle injection for this CPU
+		 * Also it is possible that memory allocation failure.
+		 */
+		pr_err("idle_inject_register failed for cpu:%d\n", cooling_dev->cpu);
+		return -EAGAIN;
+	}
+
+	idle_inject_set_duration(ii_dev, TICK_USEC, idle_duration_us);
+	idle_inject_set_latency(ii_dev, idle_latency_us);
+
+	cooling_dev->ii_dev = ii_dev;
+
+	return 0;
+}
+
+static void cpuidle_idle_injection_unregister(struct cpuidle_cooling *cooling_dev)
+{
+	idle_inject_unregister(cooling_dev->ii_dev);
+}
+
+static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
+					 unsigned long *state)
+{
+	*state = 100;
+
+	return 0;
+}
+
+static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
+					 unsigned long *state)
+{
+	struct cpuidle_cooling *cooling_dev = cdev->devdata;
+
+	*state = READ_ONCE(cooling_dev->state);
+
+	return 0;
+}
+
+static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
+					 unsigned long state)
+{
+	struct cpuidle_cooling *cooling_dev = cdev->devdata;
+	unsigned int runtime_us;
+	unsigned long curr_state;
+	int ret = 0;
+
+	mutex_lock(&idle_cooling_lock);
+
+	curr_state = READ_ONCE(cooling_dev->state);
+
+	if (!curr_state && state > 0) {
+		/*
+		 * This is the first time to start cooling, so register with
+		 * idle injection framework.
+		 */
+		if (!cooling_dev->ii_dev) {
+			ret = cpuidle_idle_injection_register(cooling_dev);
+			if (ret)
+				goto unlock_set_state;
+		}
+
+		runtime_us = cpuidle_cooling_runtime(idle_duration_us, state);
+
+		idle_inject_set_duration(cooling_dev->ii_dev, runtime_us, idle_duration_us);
+		idle_inject_start(cooling_dev->ii_dev);
+	} else if (curr_state > 0 && state) {
+		/* Simply update runtime */
+		runtime_us = cpuidle_cooling_runtime(idle_duration_us, state);
+		idle_inject_set_duration(cooling_dev->ii_dev, runtime_us, idle_duration_us);
+	} else if (curr_state > 0 && !state) {
+		idle_inject_stop(cooling_dev->ii_dev);
+		cpuidle_idle_injection_unregister(cooling_dev);
+		cooling_dev->ii_dev = NULL;
+	}
+
+	WRITE_ONCE(cooling_dev->state, state);
+
+unlock_set_state:
+	mutex_unlock(&idle_cooling_lock);
+
+	return ret;
+}
+
+/**
+ * cpuidle_cooling_ops - thermal cooling device ops
+ */
+static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
+	.get_max_state = cpuidle_cooling_get_max_state,
+	.get_cur_state = cpuidle_cooling_get_cur_state,
+	.set_cur_state = cpuidle_cooling_set_cur_state,
+};
+
+static int cpuidle_cooling_register(int cpu)
+{
+	struct cpuidle_cooling *cooling_dev = &per_cpu(cooling_devs, cpu);
+	struct thermal_cooling_device *cdev;
+	char name[14]; /* storage for cpuidle-XXXX */
+	int ret = 0;
+
+	mutex_lock(&idle_cooling_lock);
+
+	snprintf(name, sizeof(name), "cpuidle-%d", cpu);
+	cdev = thermal_cooling_device_register(name, cooling_dev, &cpuidle_cooling_ops);
+	if (IS_ERR(cdev)) {
+		ret = PTR_ERR(cdev);
+		goto unlock_register;
+	}
+
+	cooling_dev->cdev = cdev;
+	cpumask_set_cpu(cpu, &cpuidle_cpu_mask);
+	cooling_dev->cpu = cpu;
+
+unlock_register:
+	mutex_unlock(&idle_cooling_lock);
+
+	return ret;
+}
+
+static void cpuidle_cooling_unregister(int cpu)
+{
+	struct cpuidle_cooling *cooling_dev = &per_cpu(cooling_devs, cpu);
+
+	mutex_lock(&idle_cooling_lock);
+
+	if (cooling_dev->state) {
+		idle_inject_stop(cooling_dev->ii_dev);
+		cpuidle_idle_injection_unregister(cooling_dev);
+	}
+
+	thermal_cooling_device_unregister(cooling_dev->cdev);
+	cooling_dev->state = 0;
+
+	mutex_unlock(&idle_cooling_lock);
+}
+
+static int cpuidle_cooling_cpu_online(unsigned int cpu)
+{
+	cpuidle_cooling_register(cpu);
+
+	return 0;
+}
+
+static int cpuidle_cooling_cpu_offline(unsigned int cpu)
+{
+	cpuidle_cooling_unregister(cpu);
+
+	return 0;
+}
+
+static enum cpuhp_state cpuidle_cooling_hp_state __read_mostly;
+
+static const struct x86_cpu_id intel_cpuidle_cooling_ids[] __initconst = {
+	X86_MATCH_VENDOR_FEATURE(INTEL, X86_FEATURE_MWAIT, NULL),
+	{}
+};
+MODULE_DEVICE_TABLE(x86cpu, intel_cpuidle_cooling_ids);
+
+static int __init cpuidle_cooling_init(void)
+{
+	int ret;
+
+	if (!x86_match_cpu(intel_cpuidle_cooling_ids))
+		return -ENODEV;
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+				"thermal/cpuidle_cooling:online",
+				cpuidle_cooling_cpu_online,
+				cpuidle_cooling_cpu_offline);
+	if (ret < 0)
+		return ret;
+
+	cpuidle_cooling_hp_state = ret;
+
+	return 0;
+}
+module_init(cpuidle_cooling_init)
+
+static void __exit cpuidle_cooling_exit(void)
+{
+	cpuhp_remove_state(cpuidle_cooling_hp_state);
+}
+module_exit(cpuidle_cooling_exit)
+
+MODULE_IMPORT_NS(IDLE_INJECT);
+
+MODULE_LICENSE("GPL");
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks
  2022-11-29 23:34 ` [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks Srinivas Pandruvada
@ 2022-12-21 14:52   ` Daniel Lezcano
  2022-12-21 20:58     ` srinivas pandruvada
  2023-01-12 14:53   ` Rafael J. Wysocki
  1 sibling, 1 reply; 18+ messages in thread
From: Daniel Lezcano @ 2022-12-21 14:52 UTC (permalink / raw)
  To: Srinivas Pandruvada, rafael; +Cc: linux-pm, linux-kernel, rui.zhang, amitk


Hi Srinivas,

On 30/11/2022 00:34, Srinivas Pandruvada wrote:
> The actual idle percentage can be less than the desired because of
> interrupts. Since the objective for CPU Idle injection is for thermal
> control, there should be some way to compensate for lost idle percentage.
> Some architectures provide interface to get actual idle percent observed
> by the hardware. So, the idle percent can be adjusted using the hardware
> feedback. For example, Intel CPUs provides package idle counters, which
> is currently used by intel powerclamp driver to adjust idle time.
Can you provide an example in terms of timings?

I'm not getting how 'prepare' would do by returning a positive value to 
skip the play_idle_precise() and what will do 'complete' ?


> The only way this can be done currently is by monitoring hardware idle
> percent from a different software thread. This can be avoided by adding
> callbacks.
> 
> Add a capability to register a prepare and complete callback during idle
> inject registry. Add a new register function idle_inject_register_full()
> which also allows to register callbacks.
> 
> If they are not NULL, then prepare callback is called before calling
> play_idle_precise() and complete callback is called after calling
> play_idle_precise().
> 
> If prepare callback is present and returns non 0 value then
> play_idle_precise() is not called to avoid over compensation.
> 
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
> v2
> - Replace begin/end with prepare/complete
> - Add new interface idle_inject_register_full with callbacks
> - Update kernel doc
> - Update commit description
> 
>   drivers/powercap/idle_inject.c | 62 +++++++++++++++++++++++++++++++---
>   include/linux/idle_inject.h    |  4 +++
>   2 files changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
> index dfa989182e71..f48e71501429 100644
> --- a/drivers/powercap/idle_inject.c
> +++ b/drivers/powercap/idle_inject.c
> @@ -63,13 +63,31 @@ struct idle_inject_thread {
>    * @idle_duration_us: duration of CPU idle time to inject
>    * @run_duration_us: duration of CPU run time to allow
>    * @latency_us: max allowed latency
> + * @prepare: Callback function which is called before calling
> + *		play_idle_precise()
> + * @complete: Callback function which is called after calling
> + *		play_idle_precise()
>    * @cpumask: mask of CPUs affected by idle injection
> + *
> + * This structure is used to define per instance idle inject device data. Each
> + * instance has an idle duration, a run duration and mask of CPUs to inject
> + * idle.
> + * Actual idle is injected by calling kernel scheduler interface
> + * play_idle_precise(). There are two optional callbacks which the caller can
> + * register by calling idle_inject_register_full():
> + * prepare() - This callback is called just before calling play_idle_precise()
> + *		If this callback returns non zero value then
> + *		play_idle_precise() is not called. This means skip injecting
> + *		idle during this period.
> + * complete() - This callback is called after calling play_idle_precise().
>    */
>   struct idle_inject_device {
>   	struct hrtimer timer;
>   	unsigned int idle_duration_us;
>   	unsigned int run_duration_us;
>   	unsigned int latency_us;
> +	int (*prepare)(unsigned int cpu);
> +	void (*complete)(unsigned int cpu);
>   	unsigned long cpumask[];
>   };
>   
> @@ -132,6 +150,7 @@ static void idle_inject_fn(unsigned int cpu)
>   {
>   	struct idle_inject_device *ii_dev;
>   	struct idle_inject_thread *iit;
> +	int ret;
>   
>   	ii_dev = per_cpu(idle_inject_device, cpu);
>   	iit = per_cpu_ptr(&idle_inject_thread, cpu);
> @@ -141,8 +160,18 @@ static void idle_inject_fn(unsigned int cpu)
>   	 */
>   	iit->should_run = 0;
>   
> +	if (ii_dev->prepare) {
> +		ret = ii_dev->prepare(cpu);
> +		if (ret)
> +			goto skip;
> +	}
> +
>   	play_idle_precise(READ_ONCE(ii_dev->idle_duration_us) * NSEC_PER_USEC,
>   			  READ_ONCE(ii_dev->latency_us) * NSEC_PER_USEC);
> +
> +skip:
> +	if (ii_dev->complete)
> +		ii_dev->complete(cpu);
>   }
>   
>   /**
> @@ -295,17 +324,23 @@ static int idle_inject_should_run(unsigned int cpu)
>   }
>   
>   /**
> - * idle_inject_register - initialize idle injection on a set of CPUs
> + * idle_inject_register_full - initialize idle injection on a set of CPUs
>    * @cpumask: CPUs to be affected by idle injection
> + * @prepare: callback called before calling play_idle_precise()
> + * @complete: callback called after calling play_idle_precise()
>    *
>    * This function creates an idle injection control device structure for the
> - * given set of CPUs and initializes the timer associated with it.  It does not
> - * start any injection cycles.
> + * given set of CPUs and initializes the timer associated with it. This
> + * function also allows to register prepare() and complete() callbacks.
> + * It does not start any injection cycles.
>    *
>    * Return: NULL if memory allocation fails, idle injection control device
>    * pointer on success.
>    */
> -struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
> +
> +struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
> +						     int (*prepare)(unsigned int cpu),
> +						     void (*complete)(unsigned int cpu))
>   {
>   	struct idle_inject_device *ii_dev;
>   	int cpu, cpu_rb;
> @@ -318,6 +353,8 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
>   	hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>   	ii_dev->timer.function = idle_inject_timer_fn;
>   	ii_dev->latency_us = UINT_MAX;
> +	ii_dev->prepare = prepare;
> +	ii_dev->complete = complete;
>   
>   	for_each_cpu(cpu, to_cpumask(ii_dev->cpumask)) {
>   
> @@ -342,6 +379,23 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
>   
>   	return NULL;
>   }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_register_full, IDLE_INJECT);
> +
> +/**
> + * idle_inject_register - initialize idle injection on a set of CPUs
> + * @cpumask: CPUs to be affected by idle injection
> + *
> + * This function creates an idle injection control device structure for the
> + * given set of CPUs and initializes the timer associated with it.  It does not
> + * start any injection cycles.
> + *
> + * Return: NULL if memory allocation fails, idle injection control device
> + * pointer on success.
> + */
> +struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
> +{
> +	return idle_inject_register_full(cpumask, NULL, NULL);
> +}
>   EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
>   
>   /**
> diff --git a/include/linux/idle_inject.h b/include/linux/idle_inject.h
> index fb88e23a99d3..e18d89793490 100644
> --- a/include/linux/idle_inject.h
> +++ b/include/linux/idle_inject.h
> @@ -13,6 +13,10 @@ struct idle_inject_device;
>   
>   struct idle_inject_device *idle_inject_register(struct cpumask *cpumask);
>   
> +struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
> +						     int (*prepare)(unsigned int cpu),
> +						     void (*complete)(unsigned int cpu));
> +
>   void idle_inject_unregister(struct idle_inject_device *ii_dev);
>   
>   int idle_inject_start(struct idle_inject_device *ii_dev);

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/4] thermal/drivers/intel_cpu_idle_cooling: Introduce Intel cpu idle cooling driver
  2022-11-29 23:34 ` [PATCH v2 4/4] thermal/drivers/intel_cpu_idle_cooling: Introduce Intel cpu idle cooling driver Srinivas Pandruvada
@ 2022-12-21 15:10   ` Daniel Lezcano
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Lezcano @ 2022-12-21 15:10 UTC (permalink / raw)
  To: Srinivas Pandruvada, rafael; +Cc: linux-pm, linux-kernel, rui.zhang, amitk

On 30/11/2022 00:34, Srinivas Pandruvada wrote:
> The cpu idle cooling is used to cool down a CPU by injecting idle cycles
> at runtime. The objective is similar to intel_powerclamp driver, which is
> used for system wide cooling by injecting idle on each CPU.
> 
> This driver is modeled after drivers/thermal/cpuidle_cooling.c by reusing
> powercap/idle_inject framework.
> 
> On each CPU online a thermal cooling device is registered. The minimum
> state of the cooling device is 0 and maximum is 100. When user space
> changes the current state to non zero, then register with idle inject
> framework and start idle inject.
> 
> The default idle duration is 24 milli seconds, matching intel_powerclamp,
> which doesn't change based on the current state of cooling device. The
> runtime is changed based on the current state.
> 
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
> v2:
> - Removed callback arguments for idle_inject_register
> 
>   drivers/thermal/intel/Kconfig                 |  10 +
>   drivers/thermal/intel/Makefile                |   1 +
>   .../thermal/intel/intel_cpu_idle_cooling.c    | 261 ++++++++++++++++++
>   3 files changed, 272 insertions(+)
>   create mode 100644 drivers/thermal/intel/intel_cpu_idle_cooling.c
> 
> diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
> index 6c2a95f41c81..8c88d6e18414 100644
> --- a/drivers/thermal/intel/Kconfig
> +++ b/drivers/thermal/intel/Kconfig
> @@ -115,3 +115,13 @@ config INTEL_HFI_THERMAL
>   	  These capabilities may change as a result of changes in the operating
>   	  conditions of the system such power and thermal limits. If selected,
>   	  the kernel relays updates in CPUs' capabilities to userspace.
> +
> +config INTEL_CPU_IDLE_COOLING
> +	tristate "Intel CPU idle cooling device"
> +	depends on IDLE_INJECT
> +	help
> +	  This implements the CPU cooling mechanism through
> +	  idle injection. This will throttle the CPU by injecting
> +	  idle cycle.
> +	  Unlike Intel Power clamp driver, this driver provides
> +	  idle injection for each CPU.
> diff --git a/drivers/thermal/intel/Makefile b/drivers/thermal/intel/Makefile
> index 9a8d8054f316..8d5f7b5cf9b7 100644
> --- a/drivers/thermal/intel/Makefile
> +++ b/drivers/thermal/intel/Makefile
> @@ -14,3 +14,4 @@ obj-$(CONFIG_INTEL_TCC_COOLING)	+= intel_tcc_cooling.o
>   obj-$(CONFIG_X86_THERMAL_VECTOR) += therm_throt.o
>   obj-$(CONFIG_INTEL_MENLOW)	+= intel_menlow.o
>   obj-$(CONFIG_INTEL_HFI_THERMAL) += intel_hfi.o
> +obj-$(CONFIG_INTEL_CPU_IDLE_COOLING) += intel_cpu_idle_cooling.o
> diff --git a/drivers/thermal/intel/intel_cpu_idle_cooling.c b/drivers/thermal/intel/intel_cpu_idle_cooling.c
> new file mode 100644
> index 000000000000..cdd62756cc3d
> --- /dev/null
> +++ b/drivers/thermal/intel/intel_cpu_idle_cooling.c
> @@ -0,0 +1,261 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Per CPU Idle injection cooling device implementation
> + *
> + * Copyright (c) 2022, Intel Corporation.
> + * All rights reserved.
> + *
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/cpufeature.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/idle_inject.h>
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/thermal.h>
> +#include <linux/topology.h>
> +
> +#include <asm/cpu_device_id.h>
> +
> +/* Duration match with intel_powerclamp driver */
> +#define IDLE_DURATION		24000
> +#define IDLE_LATENCY		UINT_MAX
> +
> +static int idle_duration_us = IDLE_DURATION;
> +static int idle_latency_us = IDLE_LATENCY;
> +
> +module_param(idle_duration_us, int, 0644);
> +MODULE_PARM_DESC(idle_duration_us,
> +		 "Idle duration in us.");
> +
> +module_param(idle_latency_us, int, 0644);
> +MODULE_PARM_DESC(idle_latency_us,
> +		 "Idle latency in us.");
> +
> +/**
> + * struct cpuidle_cooling - Per instance data for cooling device
> + * @cpu: CPU number for this cooling device
> + * @ii_dev: Idle inject core instance pointer
> + * @cdev: Thermal core cooling device instance
> + * @state:  Current cooling device state
> + *
> + * Stores per instance cooling device state.
> + */
> +struct cpuidle_cooling {
> +	int cpu;
> +	struct idle_inject_device *ii_dev;
> +	struct thermal_cooling_device *cdev;
> +	unsigned long state;
> +};
> +
> +static DEFINE_PER_CPU(struct cpuidle_cooling, cooling_devs);
> +static cpumask_t cpuidle_cpu_mask;

I don't see where it is used except its affectation

> +/* Used for module unload protection with idle injection operations */
> +static DEFINE_MUTEX(idle_cooling_lock);
> +
> +static unsigned int cpuidle_cooling_runtime(unsigned int idle_duration_us,
> +					    unsigned long state)
> +{
> +	if (!state)
> +		return 0;
> +
> +	return ((idle_duration_us * 100) / state) - idle_duration_us;
> +}
> +
> +static int cpuidle_idle_injection_register(struct cpuidle_cooling *cooling_dev)
> +{
> +	struct idle_inject_device *ii_dev;
> +
> +	ii_dev = idle_inject_register((struct cpumask *)cpumask_of(cooling_dev->cpu));
> +	if (!ii_dev) {
> +		/*
> +		 * It is busy as some other device claimed idle injection for this CPU
> +		 * Also it is possible that memory allocation failure.
> +		 */
> +		pr_err("idle_inject_register failed for cpu:%d\n", cooling_dev->cpu);
> +		return -EAGAIN;
> +	}
> +
> +	idle_inject_set_duration(ii_dev, TICK_USEC, idle_duration_us);
> +	idle_inject_set_latency(ii_dev, idle_latency_us);
> +
> +	cooling_dev->ii_dev = ii_dev;
> +
> +	return 0;
> +}
> +
> +static void cpuidle_idle_injection_unregister(struct cpuidle_cooling *cooling_dev)
> +{
> +	idle_inject_unregister(cooling_dev->ii_dev);
> +}
> +
> +static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
> +					 unsigned long *state)
> +{
> +	*state = 100;
> +
> +	return 0;
> +}
> +
> +static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
> +					 unsigned long *state)
> +{
> +	struct cpuidle_cooling *cooling_dev = cdev->devdata;
> +
> +	*state = READ_ONCE(cooling_dev->state);
> +
> +	return 0;
> +}
> +
> +static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
> +					 unsigned long state)
> +{
> +	struct cpuidle_cooling *cooling_dev = cdev->devdata;
> +	unsigned int runtime_us;
> +	unsigned long curr_state;
> +	int ret = 0;
> +
> +	mutex_lock(&idle_cooling_lock);
> +
> +	curr_state = READ_ONCE(cooling_dev->state);
> +
> +	if (!curr_state && state > 0) {
> +		/*
> +		 * This is the first time to start cooling, so register with
> +		 * idle injection framework.
> +		 */
> +		if (!cooling_dev->ii_dev) {
> +			ret = cpuidle_idle_injection_register(cooling_dev);
> +			if (ret)
> +				goto unlock_set_state;
> +		}
> +
> +		runtime_us = cpuidle_cooling_runtime(idle_duration_us, state);
> +
> +		idle_inject_set_duration(cooling_dev->ii_dev, runtime_us, idle_duration_us);
> +		idle_inject_start(cooling_dev->ii_dev);
> +	} else if (curr_state > 0 && state) {
> +		/* Simply update runtime */
> +		runtime_us = cpuidle_cooling_runtime(idle_duration_us, state);
> +		idle_inject_set_duration(cooling_dev->ii_dev, runtime_us, idle_duration_us);
> +	} else if (curr_state > 0 && !state) {
> +		idle_inject_stop(cooling_dev->ii_dev);
> +		cpuidle_idle_injection_unregister(cooling_dev);
> +		cooling_dev->ii_dev = NULL;
> +	}
> +
> +	WRITE_ONCE(cooling_dev->state, state);
> +
> +unlock_set_state:
> +	mutex_unlock(&idle_cooling_lock);
> +
> +	return ret;
> +}
> +
> +/**
> + * cpuidle_cooling_ops - thermal cooling device ops
> + */
> +static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
> +	.get_max_state = cpuidle_cooling_get_max_state,
> +	.get_cur_state = cpuidle_cooling_get_cur_state,
> +	.set_cur_state = cpuidle_cooling_set_cur_state,
> +};
> +
> +static int cpuidle_cooling_register(int cpu)
> +{
> +	struct cpuidle_cooling *cooling_dev = &per_cpu(cooling_devs, cpu);
> +	struct thermal_cooling_device *cdev;
> +	char name[14]; /* storage for cpuidle-XXXX */
> +	int ret = 0;

Why not register idle_inject here before calling 
thermal_cooling_device_register() so you get ride of the lock.

Well actually, I'm wondering if you can just have the same code than 
cpuidle_cooling_device and just replace the of_device cpu initialization 
by the cpu hotplug.

> +	mutex_lock(&idle_cooling_lock);
> +
> +	snprintf(name, sizeof(name), "cpuidle-%d", cpu);
> +	cdev = thermal_cooling_device_register(name, cooling_dev, &cpuidle_cooling_ops);
> +	if (IS_ERR(cdev)) {
> +		ret = PTR_ERR(cdev);
> +		goto unlock_register;
> +	}
> +
> +	cooling_dev->cdev = cdev;
> +	cpumask_set_cpu(cpu, &cpuidle_cpu_mask);
> +	cooling_dev->cpu = cpu;
> +
> +unlock_register:
> +	mutex_unlock(&idle_cooling_lock);
> +
> +	return ret;
> +}
> +
> +static void cpuidle_cooling_unregister(int cpu)
> +{
> +	struct cpuidle_cooling *cooling_dev = &per_cpu(cooling_devs, cpu);
> +
> +	mutex_lock(&idle_cooling_lock);
> +
> +	if (cooling_dev->state) {
> +		idle_inject_stop(cooling_dev->ii_dev);
> +		cpuidle_idle_injection_unregister(cooling_dev);
> +	}
> +
> +	thermal_cooling_device_unregister(cooling_dev->cdev);
> +	cooling_dev->state = 0;
> +
> +	mutex_unlock(&idle_cooling_lock);
> +}
> +
> +static int cpuidle_cooling_cpu_online(unsigned int cpu)
> +{
> +	cpuidle_cooling_register(cpu);
> +
> +	return 0;
> +}
> +
> +static int cpuidle_cooling_cpu_offline(unsigned int cpu)
> +{
> +	cpuidle_cooling_unregister(cpu);
> +
> +	return 0;
> +}
> +
> +static enum cpuhp_state cpuidle_cooling_hp_state __read_mostly;
> +
> +static const struct x86_cpu_id intel_cpuidle_cooling_ids[] __initconst = {
> +	X86_MATCH_VENDOR_FEATURE(INTEL, X86_FEATURE_MWAIT, NULL),
> +	{}
> +};
> +MODULE_DEVICE_TABLE(x86cpu, intel_cpuidle_cooling_ids);
> +
> +static int __init cpuidle_cooling_init(void)
> +{
> +	int ret;
> +
> +	if (!x86_match_cpu(intel_cpuidle_cooling_ids))
> +		return -ENODEV;
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> +				"thermal/cpuidle_cooling:online",
> +				cpuidle_cooling_cpu_online,
> +				cpuidle_cooling_cpu_offline);
> +	if (ret < 0)
> +		return ret;
> +
> +	cpuidle_cooling_hp_state = ret;
> +
> +	return 0;
> +}
> +module_init(cpuidle_cooling_init)
> +
> +static void __exit cpuidle_cooling_exit(void)
> +{
> +	cpuhp_remove_state(cpuidle_cooling_hp_state);
> +}
> +module_exit(cpuidle_cooling_exit)
> +
> +MODULE_IMPORT_NS(IDLE_INJECT);
> +
> +MODULE_LICENSE("GPL");

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks
  2022-12-21 14:52   ` Daniel Lezcano
@ 2022-12-21 20:58     ` srinivas pandruvada
  2022-12-22  9:50       ` Daniel Lezcano
  0 siblings, 1 reply; 18+ messages in thread
From: srinivas pandruvada @ 2022-12-21 20:58 UTC (permalink / raw)
  To: Daniel Lezcano, rafael; +Cc: linux-pm, linux-kernel, rui.zhang, amitk

Hi Daniel,

On Wed, 2022-12-21 at 15:52 +0100, Daniel Lezcano wrote:
> 
> Hi Srinivas,
> 
> On 30/11/2022 00:34, Srinivas Pandruvada wrote:
> > The actual idle percentage can be less than the desired because of
> > interrupts. Since the objective for CPU Idle injection is for
> > thermal
> > control, there should be some way to compensate for lost idle
> > percentage.
> > Some architectures provide interface to get actual idle percent
> > observed
> > by the hardware. So, the idle percent can be adjusted using the
> > hardware
> > feedback. For example, Intel CPUs provides package idle counters,
> > which
> > is currently used by intel powerclamp driver to adjust idle time.
> Can you provide an example in terms of timings?
> 
> I'm not getting how 'prepare' would do by returning a positive value
> to 
> skip the play_idle_precise() and what will do 'complete' ?
> 
intel_powerclamp has a logic where if the current idle percentage
observed from hardware is more than the desired target inject percent,
it skips calling play_idle().

For example if you want to inject 50% idle and system is naturally idle
for 60%, there is no use of calling play_idle in the idle injection
framework to induce more idle. In this way a workload can run
immediately.

So trying to emulate the same logic by using powercap/idle_inject
framework. So prepare() callback in the intel_powerclamp driver calls
the existing function to check if idle-inject should skip for this time
or not.

The complete() callback is to do just to adjust run duration. I don't
have a use case, but to add complementary callback to prepare() if
required.

Thanks,
Srinivas

> 
> > The only way this can be done currently is by monitoring hardware
> > idle
> > percent from a different software thread. This can be avoided by
> > adding
> > callbacks.
> > 
> > Add a capability to register a prepare and complete callback during
> > idle
> > inject registry. Add a new register function
> > idle_inject_register_full()
> > which also allows to register callbacks.
> > 
> > If they are not NULL, then prepare callback is called before
> > calling
> > play_idle_precise() and complete callback is called after calling
> > play_idle_precise().
> > 
> > If prepare callback is present and returns non 0 value then
> > play_idle_precise() is not called to avoid over compensation.
> > 
> > Signed-off-by: Srinivas Pandruvada
> > <srinivas.pandruvada@linux.intel.com>
> > ---
> > v2
> > - Replace begin/end with prepare/complete
> > - Add new interface idle_inject_register_full with callbacks
> > - Update kernel doc
> > - Update commit description
> > 
> >   drivers/powercap/idle_inject.c | 62
> > +++++++++++++++++++++++++++++++---
> >   include/linux/idle_inject.h    |  4 +++
> >   2 files changed, 62 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/powercap/idle_inject.c
> > b/drivers/powercap/idle_inject.c
> > index dfa989182e71..f48e71501429 100644
> > --- a/drivers/powercap/idle_inject.c
> > +++ b/drivers/powercap/idle_inject.c
> > @@ -63,13 +63,31 @@ struct idle_inject_thread {
> >    * @idle_duration_us: duration of CPU idle time to inject
> >    * @run_duration_us: duration of CPU run time to allow
> >    * @latency_us: max allowed latency
> > + * @prepare: Callback function which is called before calling
> > + *             play_idle_precise()
> > + * @complete: Callback function which is called after calling
> > + *             play_idle_precise()
> >    * @cpumask: mask of CPUs affected by idle injection
> > + *
> > + * This structure is used to define per instance idle inject
> > device data. Each
> > + * instance has an idle duration, a run duration and mask of CPUs
> > to inject
> > + * idle.
> > + * Actual idle is injected by calling kernel scheduler interface
> > + * play_idle_precise(). There are two optional callbacks which the
> > caller can
> > + * register by calling idle_inject_register_full():
> > + * prepare() - This callback is called just before calling
> > play_idle_precise()
> > + *             If this callback returns non zero value then
> > + *             play_idle_precise() is not called. This means skip
> > injecting
> > + *             idle during this period.
> > + * complete() - This callback is called after calling
> > play_idle_precise().
> >    */
> >   struct idle_inject_device {
> >         struct hrtimer timer;
> >         unsigned int idle_duration_us;
> >         unsigned int run_duration_us;
> >         unsigned int latency_us;
> > +       int (*prepare)(unsigned int cpu);
> > +       void (*complete)(unsigned int cpu);
> >         unsigned long cpumask[];
> >   };
> >   
> > @@ -132,6 +150,7 @@ static void idle_inject_fn(unsigned int cpu)
> >   {
> >         struct idle_inject_device *ii_dev;
> >         struct idle_inject_thread *iit;
> > +       int ret;
> >   
> >         ii_dev = per_cpu(idle_inject_device, cpu);
> >         iit = per_cpu_ptr(&idle_inject_thread, cpu);
> > @@ -141,8 +160,18 @@ static void idle_inject_fn(unsigned int cpu)
> >          */
> >         iit->should_run = 0;
> >   
> > +       if (ii_dev->prepare) {
> > +               ret = ii_dev->prepare(cpu);
> > +               if (ret)
> > +                       goto skip;
> > +       }
> > +
> >         play_idle_precise(READ_ONCE(ii_dev->idle_duration_us) *
> > NSEC_PER_USEC,
> >                           READ_ONCE(ii_dev->latency_us) *
> > NSEC_PER_USEC);
> > +
> > +skip:
> > +       if (ii_dev->complete)
> > +               ii_dev->complete(cpu);
> >   }
> >   
> >   /**
> > @@ -295,17 +324,23 @@ static int idle_inject_should_run(unsigned
> > int cpu)
> >   }
> >   
> >   /**
> > - * idle_inject_register - initialize idle injection on a set of
> > CPUs
> > + * idle_inject_register_full - initialize idle injection on a set
> > of CPUs
> >    * @cpumask: CPUs to be affected by idle injection
> > + * @prepare: callback called before calling play_idle_precise()
> > + * @complete: callback called after calling play_idle_precise()
> >    *
> >    * This function creates an idle injection control device
> > structure for the
> > - * given set of CPUs and initializes the timer associated with
> > it.  It does not
> > - * start any injection cycles.
> > + * given set of CPUs and initializes the timer associated with it.
> > This
> > + * function also allows to register prepare() and complete()
> > callbacks.
> > + * It does not start any injection cycles.
> >    *
> >    * Return: NULL if memory allocation fails, idle injection
> > control device
> >    * pointer on success.
> >    */
> > -struct idle_inject_device *idle_inject_register(struct cpumask
> > *cpumask)
> > +
> > +struct idle_inject_device *idle_inject_register_full(struct
> > cpumask *cpumask,
> > +                                                    int
> > (*prepare)(unsigned int cpu),
> > +                                                    void
> > (*complete)(unsigned int cpu))
> >   {
> >         struct idle_inject_device *ii_dev;
> >         int cpu, cpu_rb;
> > @@ -318,6 +353,8 @@ struct idle_inject_device
> > *idle_inject_register(struct cpumask *cpumask)
> >         hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC,
> > HRTIMER_MODE_REL);
> >         ii_dev->timer.function = idle_inject_timer_fn;
> >         ii_dev->latency_us = UINT_MAX;
> > +       ii_dev->prepare = prepare;
> > +       ii_dev->complete = complete;
> >   
> >         for_each_cpu(cpu, to_cpumask(ii_dev->cpumask)) {
> >   
> > @@ -342,6 +379,23 @@ struct idle_inject_device
> > *idle_inject_register(struct cpumask *cpumask)
> >   
> >         return NULL;
> >   }
> > +EXPORT_SYMBOL_NS_GPL(idle_inject_register_full, IDLE_INJECT);
> > +
> > +/**
> > + * idle_inject_register - initialize idle injection on a set of
> > CPUs
> > + * @cpumask: CPUs to be affected by idle injection
> > + *
> > + * This function creates an idle injection control device
> > structure for the
> > + * given set of CPUs and initializes the timer associated with
> > it.  It does not
> > + * start any injection cycles.
> > + *
> > + * Return: NULL if memory allocation fails, idle injection control
> > device
> > + * pointer on success.
> > + */
> > +struct idle_inject_device *idle_inject_register(struct cpumask
> > *cpumask)
> > +{
> > +       return idle_inject_register_full(cpumask, NULL, NULL);
> > +}
> >   EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
> >   
> >   /**
> > diff --git a/include/linux/idle_inject.h
> > b/include/linux/idle_inject.h
> > index fb88e23a99d3..e18d89793490 100644
> > --- a/include/linux/idle_inject.h
> > +++ b/include/linux/idle_inject.h
> > @@ -13,6 +13,10 @@ struct idle_inject_device;
> >   
> >   struct idle_inject_device *idle_inject_register(struct cpumask
> > *cpumask);
> >   
> > +struct idle_inject_device *idle_inject_register_full(struct
> > cpumask *cpumask,
> > +                                                    int
> > (*prepare)(unsigned int cpu),
> > +                                                    void
> > (*complete)(unsigned int cpu));
> > +
> >   void idle_inject_unregister(struct idle_inject_device *ii_dev);
> >   
> >   int idle_inject_start(struct idle_inject_device *ii_dev);
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks
  2022-12-21 20:58     ` srinivas pandruvada
@ 2022-12-22  9:50       ` Daniel Lezcano
  2022-12-22 17:36         ` srinivas pandruvada
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Lezcano @ 2022-12-22  9:50 UTC (permalink / raw)
  To: srinivas pandruvada, rafael; +Cc: linux-pm, linux-kernel, rui.zhang, amitk


Hi Srinivas,


On 21/12/2022 21:58, srinivas pandruvada wrote:
> Hi Daniel,
> 
> On Wed, 2022-12-21 at 15:52 +0100, Daniel Lezcano wrote:
>>
>> Hi Srinivas,
>>
>> On 30/11/2022 00:34, Srinivas Pandruvada wrote:
>>> The actual idle percentage can be less than the desired because of
>>> interrupts. Since the objective for CPU Idle injection is for
>>> thermal
>>> control, there should be some way to compensate for lost idle
>>> percentage.
>>> Some architectures provide interface to get actual idle percent
>>> observed
>>> by the hardware. So, the idle percent can be adjusted using the
>>> hardware
>>> feedback. For example, Intel CPUs provides package idle counters,
>>> which
>>> is currently used by intel powerclamp driver to adjust idle time.
>> Can you provide an example in terms of timings?
>>
>> I'm not getting how 'prepare' would do by returning a positive value
>> to
>> skip the play_idle_precise() and what will do 'complete' ?
>>
> intel_powerclamp has a logic where if the current idle percentage
> observed from hardware is more than the desired target inject percent,
> it skips calling play_idle().
> 
> For example if you want to inject 50% idle and system is naturally idle
> for 60%, there is no use of calling play_idle in the idle injection
> framework to induce more idle. In this way a workload can run
> immediately.
> 
> So trying to emulate the same logic by using powercap/idle_inject
> framework. So prepare() callback in the intel_powerclamp driver calls
> the existing function to check if idle-inject should skip for this time
> or not.

The function 'prepare' has the 'cpu' parameter. How can it compare with 
the desired idle duration as this information is not passed to the 
callback ?


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks
  2022-12-22  9:50       ` Daniel Lezcano
@ 2022-12-22 17:36         ` srinivas pandruvada
  2023-01-12 14:37           ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: srinivas pandruvada @ 2022-12-22 17:36 UTC (permalink / raw)
  To: Daniel Lezcano, rafael; +Cc: linux-pm, linux-kernel, rui.zhang, amitk

Hi Daniel,

On Thu, 2022-12-22 at 10:50 +0100, Daniel Lezcano wrote:
> 
> Hi Srinivas,
> 
> 
> On 21/12/2022 21:58, srinivas pandruvada wrote:
> > Hi Daniel,
> > 
> > On Wed, 2022-12-21 at 15:52 +0100, Daniel Lezcano wrote:
> > > 
> > > Hi Srinivas,
> > > 
> > > On 30/11/2022 00:34, Srinivas Pandruvada wrote:
> > > > The actual idle percentage can be less than the desired because
> > > > of
> > > > interrupts. Since the objective for CPU Idle injection is for
> > > > thermal
> > > > control, there should be some way to compensate for lost idle
> > > > percentage.
> > > > Some architectures provide interface to get actual idle percent
> > > > observed
> > > > by the hardware. So, the idle percent can be adjusted using the
> > > > hardware
> > > > feedback. For example, Intel CPUs provides package idle
> > > > counters,
> > > > which
> > > > is currently used by intel powerclamp driver to adjust idle
> > > > time.
> > > Can you provide an example in terms of timings?
> > > 
> > > I'm not getting how 'prepare' would do by returning a positive
> > > value
> > > to
> > > skip the play_idle_precise() and what will do 'complete' ?
> > > 
> > intel_powerclamp has a logic where if the current idle percentage
> > observed from hardware is more than the desired target inject
> > percent,
> > it skips calling play_idle().
> > 
> > For example if you want to inject 50% idle and system is naturally
> > idle
> > for 60%, there is no use of calling play_idle in the idle injection
> > framework to induce more idle. In this way a workload can run
> > immediately.
> > 
> > So trying to emulate the same logic by using powercap/idle_inject
> > framework. So prepare() callback in the intel_powerclamp driver
> > calls
> > the existing function to check if idle-inject should skip for this
> > time
> > or not.
> 
> The function 'prepare' has the 'cpu' parameter. How can it compare
> with 
> the desired idle duration as this information is not passed to the 
> callback ?
Good question.

Calling driver knows what idle_duration he set.
In my first version, I passed *idle_duration (with current
idle_duration set), so the caller can change this for the current
play_idle call if required for one time.

But in powerclamp case we either skip the whole play_idle or not. It
doesn't change idle duration. So didn't add.

But we can add this back.

Thanks,
Srinivas



> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks
  2022-12-22 17:36         ` srinivas pandruvada
@ 2023-01-12 14:37           ` Rafael J. Wysocki
  0 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2023-01-12 14:37 UTC (permalink / raw)
  To: srinivas pandruvada
  Cc: Daniel Lezcano, rafael, linux-pm, linux-kernel, rui.zhang, amitk

On Thu, Dec 22, 2022 at 6:36 PM srinivas pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> Hi Daniel,
>
> On Thu, 2022-12-22 at 10:50 +0100, Daniel Lezcano wrote:
> >
> > Hi Srinivas,
> >
> >
> > On 21/12/2022 21:58, srinivas pandruvada wrote:
> > > Hi Daniel,
> > >
> > > On Wed, 2022-12-21 at 15:52 +0100, Daniel Lezcano wrote:
> > > >
> > > > Hi Srinivas,
> > > >
> > > > On 30/11/2022 00:34, Srinivas Pandruvada wrote:
> > > > > The actual idle percentage can be less than the desired because
> > > > > of
> > > > > interrupts. Since the objective for CPU Idle injection is for
> > > > > thermal
> > > > > control, there should be some way to compensate for lost idle
> > > > > percentage.
> > > > > Some architectures provide interface to get actual idle percent
> > > > > observed
> > > > > by the hardware. So, the idle percent can be adjusted using the
> > > > > hardware
> > > > > feedback. For example, Intel CPUs provides package idle
> > > > > counters,
> > > > > which
> > > > > is currently used by intel powerclamp driver to adjust idle
> > > > > time.
> > > > Can you provide an example in terms of timings?
> > > >
> > > > I'm not getting how 'prepare' would do by returning a positive
> > > > value
> > > > to
> > > > skip the play_idle_precise() and what will do 'complete' ?
> > > >
> > > intel_powerclamp has a logic where if the current idle percentage
> > > observed from hardware is more than the desired target inject
> > > percent,
> > > it skips calling play_idle().
> > >
> > > For example if you want to inject 50% idle and system is naturally
> > > idle
> > > for 60%, there is no use of calling play_idle in the idle injection
> > > framework to induce more idle. In this way a workload can run
> > > immediately.
> > >
> > > So trying to emulate the same logic by using powercap/idle_inject
> > > framework. So prepare() callback in the intel_powerclamp driver
> > > calls
> > > the existing function to check if idle-inject should skip for this
> > > time
> > > or not.
> >
> > The function 'prepare' has the 'cpu' parameter. How can it compare
> > with
> > the desired idle duration as this information is not passed to the
> > callback ?
> Good question.
>
> Calling driver knows what idle_duration he set.
> In my first version, I passed *idle_duration (with current
> idle_duration set), so the caller can change this for the current
> play_idle call if required for one time.
>
> But in powerclamp case we either skip the whole play_idle or not. It
> doesn't change idle duration. So didn't add.
>
> But we can add this back.

I don't think that it is necessary at this point.

Since powerclamp is the only user and it doesn't need idle_duration, I
would just not add it ATM.

I have a couple of other comments to the patch, but let me send them separately.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks
  2022-11-29 23:34 ` [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks Srinivas Pandruvada
  2022-12-21 14:52   ` Daniel Lezcano
@ 2023-01-12 14:53   ` Rafael J. Wysocki
  2023-01-12 17:25     ` Rafael J. Wysocki
  1 sibling, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2023-01-12 14:53 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: rafael, linux-pm, linux-kernel, daniel.lezcano, rui.zhang, amitk

On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> The actual idle percentage can be less than the desired because of
> interrupts.

This is somewhat unclear.

> Since the objective for CPU Idle injection is for thermal
> control, there should be some way to compensate for lost idle percentage.

What does "lost idle percentage" mean here?

> Some architectures provide interface to get actual idle percent observed
> by the hardware. So, the idle percent can be adjusted using the hardware
> feedback. For example, Intel CPUs provides package idle counters, which
> is currently used by intel powerclamp driver to adjust idle time.
>
> The only way this can be done currently is by monitoring hardware idle
> percent from a different software thread. This can be avoided by adding
> callbacks.
>
> Add a capability to register a prepare and complete callback during idle
> inject registry. Add a new register function idle_inject_register_full()
> which also allows to register callbacks.
>
> If they are not NULL, then prepare callback is called before calling
> play_idle_precise() and complete callback is called after calling
> play_idle_precise().
>
> If prepare callback is present and returns non 0 value then
> play_idle_precise() is not called to avoid over compensation.

This mechanism isn't particularly straightforward, but maybe there's
no better way.

> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
> v2
> - Replace begin/end with prepare/complete
> - Add new interface idle_inject_register_full with callbacks
> - Update kernel doc
> - Update commit description
>
>  drivers/powercap/idle_inject.c | 62 +++++++++++++++++++++++++++++++---
>  include/linux/idle_inject.h    |  4 +++
>  2 files changed, 62 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
> index dfa989182e71..f48e71501429 100644
> --- a/drivers/powercap/idle_inject.c
> +++ b/drivers/powercap/idle_inject.c
> @@ -63,13 +63,31 @@ struct idle_inject_thread {
>   * @idle_duration_us: duration of CPU idle time to inject
>   * @run_duration_us: duration of CPU run time to allow
>   * @latency_us: max allowed latency
> + * @prepare: Callback function which is called before calling
> + *             play_idle_precise()
> + * @complete: Callback function which is called after calling
> + *             play_idle_precise()

What about:

@prepare: Optional callback deciding whether or not to skip idle
injection in the given cycle.
@complete: Optional callback updating the state after idle injection.

>   * @cpumask: mask of CPUs affected by idle injection
> + *
> + * This structure is used to define per instance idle inject device data. Each
> + * instance has an idle duration, a run duration and mask of CPUs to inject
> + * idle.
> + * Actual idle is injected by calling kernel scheduler interface
> + * play_idle_precise(). There are two optional callbacks which the caller can
> + * register by calling idle_inject_register_full():
> + * prepare() - This callback is called just before calling play_idle_precise()
> + *             If this callback returns non zero value then
> + *             play_idle_precise() is not called. This means skip injecting
> + *             idle during this period.
> + * complete() - This callback is called after calling play_idle_precise().

I would keep the format of the comment more consistent with the
general information at the top and the member descriptions at the
bottom.

>   */
>  struct idle_inject_device {
>         struct hrtimer timer;
>         unsigned int idle_duration_us;
>         unsigned int run_duration_us;
>         unsigned int latency_us;
> +       int (*prepare)(unsigned int cpu);

Can it be bool?

> +       void (*complete)(unsigned int cpu);
>         unsigned long cpumask[];
>  };
>
> @@ -132,6 +150,7 @@ static void idle_inject_fn(unsigned int cpu)
>  {
>         struct idle_inject_device *ii_dev;
>         struct idle_inject_thread *iit;
> +       int ret;

This is redundant.

>
>         ii_dev = per_cpu(idle_inject_device, cpu);
>         iit = per_cpu_ptr(&idle_inject_thread, cpu);
> @@ -141,8 +160,18 @@ static void idle_inject_fn(unsigned int cpu)
>          */
>         iit->should_run = 0;
>
> +       if (ii_dev->prepare) {
> +               ret = ii_dev->prepare(cpu);
> +               if (ret)
> +                       goto skip;
> +       }

Because the above can be written as

if (ii_dev->prepare && ii_dev->prepare(cpu))
         goto skip;

> +
>         play_idle_precise(READ_ONCE(ii_dev->idle_duration_us) * NSEC_PER_USEC,
>                           READ_ONCE(ii_dev->latency_us) * NSEC_PER_USEC);
> +
> +skip:
> +       if (ii_dev->complete)
> +               ii_dev->complete(cpu);
>  }
>
>  /**
> @@ -295,17 +324,23 @@ static int idle_inject_should_run(unsigned int cpu)
>  }
>
>  /**
> - * idle_inject_register - initialize idle injection on a set of CPUs
> + * idle_inject_register_full - initialize idle injection on a set of CPUs
>   * @cpumask: CPUs to be affected by idle injection
> + * @prepare: callback called before calling play_idle_precise()
> + * @complete: callback called after calling play_idle_precise()

IMO it would be slightly cleaner to say "invoked" instead of "called".

>   *
>   * This function creates an idle injection control device structure for the
> - * given set of CPUs and initializes the timer associated with it.  It does not
> - * start any injection cycles.
> + * given set of CPUs and initializes the timer associated with it. This
> + * function also allows to register prepare() and complete() callbacks.
> + * It does not start any injection cycles.
>   *
>   * Return: NULL if memory allocation fails, idle injection control device
>   * pointer on success.
>   */
> -struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
> +
> +struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
> +                                                    int (*prepare)(unsigned int cpu),
> +                                                    void (*complete)(unsigned int cpu))
>  {
>         struct idle_inject_device *ii_dev;
>         int cpu, cpu_rb;
> @@ -318,6 +353,8 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
>         hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>         ii_dev->timer.function = idle_inject_timer_fn;
>         ii_dev->latency_us = UINT_MAX;
> +       ii_dev->prepare = prepare;
> +       ii_dev->complete = complete;
>
>         for_each_cpu(cpu, to_cpumask(ii_dev->cpumask)) {
>
> @@ -342,6 +379,23 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
>
>         return NULL;
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_register_full, IDLE_INJECT);
> +
> +/**
> + * idle_inject_register - initialize idle injection on a set of CPUs
> + * @cpumask: CPUs to be affected by idle injection
> + *
> + * This function creates an idle injection control device structure for the
> + * given set of CPUs and initializes the timer associated with it.  It does not
> + * start any injection cycles.
> + *
> + * Return: NULL if memory allocation fails, idle injection control device
> + * pointer on success.

It would be sufficient to say "Pass @cpumask to
idle_inject_register_full() to initialize idle injection on a set of
CPUs".

> + */
> +struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
> +{
> +       return idle_inject_register_full(cpumask, NULL, NULL);
> +}
>  EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
>
>  /**
> diff --git a/include/linux/idle_inject.h b/include/linux/idle_inject.h
> index fb88e23a99d3..e18d89793490 100644
> --- a/include/linux/idle_inject.h
> +++ b/include/linux/idle_inject.h
> @@ -13,6 +13,10 @@ struct idle_inject_device;
>
>  struct idle_inject_device *idle_inject_register(struct cpumask *cpumask);
>
> +struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
> +                                                    int (*prepare)(unsigned int cpu),
> +                                                    void (*complete)(unsigned int cpu));
> +
>  void idle_inject_unregister(struct idle_inject_device *ii_dev);
>
>  int idle_inject_start(struct idle_inject_device *ii_dev);
> --

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/4] powercap: idle_inject: Export symbols
  2022-11-29 23:34 ` [PATCH v2 1/4] powercap: idle_inject: Export symbols Srinivas Pandruvada
@ 2023-01-12 15:05   ` Rafael J. Wysocki
  0 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2023-01-12 15:05 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: rafael, linux-pm, linux-kernel, daniel.lezcano, rui.zhang, amitk

On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> Export symbols for external interfaces, so that they can be used in
> other loadable modules.
>
> Export is done under name space IDLE_INJECT.
>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
> v2:
>         No change
>
>  drivers/powercap/idle_inject.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
> index fe86a09e3b67..dfa989182e71 100644
> --- a/drivers/powercap/idle_inject.c
> +++ b/drivers/powercap/idle_inject.c
> @@ -160,6 +160,7 @@ void idle_inject_set_duration(struct idle_inject_device *ii_dev,
>                 WRITE_ONCE(ii_dev->idle_duration_us, idle_duration_us);
>         }
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_set_duration, IDLE_INJECT);
>
>  /**
>   * idle_inject_get_duration - idle and run duration retrieval helper
> @@ -174,6 +175,7 @@ void idle_inject_get_duration(struct idle_inject_device *ii_dev,
>         *run_duration_us = READ_ONCE(ii_dev->run_duration_us);
>         *idle_duration_us = READ_ONCE(ii_dev->idle_duration_us);
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_get_duration, IDLE_INJECT);
>
>  /**
>   * idle_inject_set_latency - set the maximum latency allowed
> @@ -185,6 +187,7 @@ void idle_inject_set_latency(struct idle_inject_device *ii_dev,
>  {
>         WRITE_ONCE(ii_dev->latency_us, latency_us);
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_set_latency, IDLE_INJECT);
>
>  /**
>   * idle_inject_start - start idle injections
> @@ -216,6 +219,7 @@ int idle_inject_start(struct idle_inject_device *ii_dev)
>
>         return 0;
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_start, IDLE_INJECT);
>
>  /**
>   * idle_inject_stop - stops idle injections
> @@ -262,6 +266,7 @@ void idle_inject_stop(struct idle_inject_device *ii_dev)
>
>         cpu_hotplug_enable();
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_stop, IDLE_INJECT);
>
>  /**
>   * idle_inject_setup - prepare the current task for idle injection
> @@ -337,6 +342,7 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
>
>         return NULL;
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
>
>  /**
>   * idle_inject_unregister - unregister idle injection control device
> @@ -357,6 +363,7 @@ void idle_inject_unregister(struct idle_inject_device *ii_dev)
>
>         kfree(ii_dev);
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_unregister, IDLE_INJECT);
>
>  static struct smp_hotplug_thread idle_inject_threads = {
>         .store = &idle_inject_thread.tsk,
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks
  2023-01-12 14:53   ` Rafael J. Wysocki
@ 2023-01-12 17:25     ` Rafael J. Wysocki
  2023-01-12 18:13       ` srinivas pandruvada
  0 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2023-01-12 17:25 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: rafael, linux-pm, linux-kernel, daniel.lezcano, rui.zhang, amitk

On Thu, Jan 12, 2023 at 3:53 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
> <srinivas.pandruvada@linux.intel.com> wrote:
> >
> > The actual idle percentage can be less than the desired because of
> > interrupts.
>
> This is somewhat unclear.
>
> > Since the objective for CPU Idle injection is for thermal
> > control, there should be some way to compensate for lost idle percentage.
>
> What does "lost idle percentage" mean here?
>
> > Some architectures provide interface to get actual idle percent observed
> > by the hardware. So, the idle percent can be adjusted using the hardware
> > feedback. For example, Intel CPUs provides package idle counters, which
> > is currently used by intel powerclamp driver to adjust idle time.
> >
> > The only way this can be done currently is by monitoring hardware idle
> > percent from a different software thread. This can be avoided by adding
> > callbacks.
> >
> > Add a capability to register a prepare and complete callback during idle
> > inject registry. Add a new register function idle_inject_register_full()
> > which also allows to register callbacks.
> >
> > If they are not NULL, then prepare callback is called before calling
> > play_idle_precise() and complete callback is called after calling
> > play_idle_precise().
> >
> > If prepare callback is present and returns non 0 value then
> > play_idle_precise() is not called to avoid over compensation.
>
> This mechanism isn't particularly straightforward, but maybe there's
> no better way.
>
> > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > ---
> > v2
> > - Replace begin/end with prepare/complete
> > - Add new interface idle_inject_register_full with callbacks
> > - Update kernel doc
> > - Update commit description
> >
> >  drivers/powercap/idle_inject.c | 62 +++++++++++++++++++++++++++++++---
> >  include/linux/idle_inject.h    |  4 +++
> >  2 files changed, 62 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
> > index dfa989182e71..f48e71501429 100644
> > --- a/drivers/powercap/idle_inject.c
> > +++ b/drivers/powercap/idle_inject.c
> > @@ -63,13 +63,31 @@ struct idle_inject_thread {
> >   * @idle_duration_us: duration of CPU idle time to inject
> >   * @run_duration_us: duration of CPU run time to allow
> >   * @latency_us: max allowed latency
> > + * @prepare: Callback function which is called before calling
> > + *             play_idle_precise()
> > + * @complete: Callback function which is called after calling
> > + *             play_idle_precise()
>
> What about:
>
> @prepare: Optional callback deciding whether or not to skip idle
> injection in the given cycle.
> @complete: Optional callback updating the state after idle injection.

One more thing: ->complete() is not even used by powerclamp AFAICS, so
I wouldn't add it at this time, because it isn't clear if it's going
to be useful at all.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks
  2023-01-12 17:25     ` Rafael J. Wysocki
@ 2023-01-12 18:13       ` srinivas pandruvada
  0 siblings, 0 replies; 18+ messages in thread
From: srinivas pandruvada @ 2023-01-12 18:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, daniel.lezcano, rui.zhang, amitk

On Thu, 2023-01-12 at 18:25 +0100, Rafael J. Wysocki wrote:
> On Thu, Jan 12, 2023 at 3:53 PM Rafael J. Wysocki <rafael@kernel.org>
> wrote:
> > 
> > On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
> > <srinivas.pandruvada@linux.intel.com> wrote:
> > > 
> > > The actual idle percentage can be less than the desired because
> > > of
> > > interrupts.
> > 
> > This is somewhat unclear.
> > 
Will try to make it better.

> > > Since the objective for CPU Idle injection is for thermal
> > > control, there should be some way to compensate for lost idle
> > > percentage.
> > 
> > What does "lost idle percentage" mean here?
User space wants 50% idle, but because of interrupts on target cpus we
were less than 50% idle. This may be because of too many interrupt
wakes. If we measure from the hardware counters, this may say we were
45% idle. So we were 5% less than user desired setting.

> > 
> > > Some architectures provide interface to get actual idle percent
> > > observed
> > > by the hardware. So, the idle percent can be adjusted using the
> > > hardware
> > > feedback. For example, Intel CPUs provides package idle counters,
> > > which
> > > is currently used by intel powerclamp driver to adjust idle time.
> > > 
> > > The only way this can be done currently is by monitoring hardware
> > > idle
> > > percent from a different software thread. This can be avoided by
> > > adding
> > > callbacks.
> > > 
> > > Add a capability to register a prepare and complete callback
> > > during idle
> > > inject registry. Add a new register function
> > > idle_inject_register_full()
> > > which also allows to register callbacks.
> > > 
> > > If they are not NULL, then prepare callback is called before
> > > calling
> > > play_idle_precise() and complete callback is called after calling
> > > play_idle_precise().
> > > 
> > > If prepare callback is present and returns non 0 value then
> > > play_idle_precise() is not called to avoid over compensation.
> > 
> > This mechanism isn't particularly straightforward, but maybe
> > there's
> > no better way.
> > 
> > > Signed-off-by: Srinivas Pandruvada
> > > <srinivas.pandruvada@linux.intel.com>
> > > ---
> > > v2
> > > - Replace begin/end with prepare/complete
> > > - Add new interface idle_inject_register_full with callbacks
> > > - Update kernel doc
> > > - Update commit description
> > > 
> > >  drivers/powercap/idle_inject.c | 62
> > > +++++++++++++++++++++++++++++++---
> > >  include/linux/idle_inject.h    |  4 +++
> > >  2 files changed, 62 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/powercap/idle_inject.c
> > > b/drivers/powercap/idle_inject.c
> > > index dfa989182e71..f48e71501429 100644
> > > --- a/drivers/powercap/idle_inject.c
> > > +++ b/drivers/powercap/idle_inject.c
> > > @@ -63,13 +63,31 @@ struct idle_inject_thread {
> > >   * @idle_duration_us: duration of CPU idle time to inject
> > >   * @run_duration_us: duration of CPU run time to allow
> > >   * @latency_us: max allowed latency
> > > + * @prepare: Callback function which is called before calling
> > > + *             play_idle_precise()
> > > + * @complete: Callback function which is called after calling
> > > + *             play_idle_precise()
> > 
> > What about:
> > 
> > @prepare: Optional callback deciding whether or not to skip idle
> > injection in the given cycle.
> > @complete: Optional callback updating the state after idle
> > injection.
Looks good.

> 
> One more thing: ->complete() is not even used by powerclamp AFAICS,
> so
> I wouldn't add it at this time, because it isn't clear if it's going
> to be useful at all.
Sure.

Thanks,
Srinivas


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/4] thermal/drivers/intel_powerclamp: Use powercap idle-inject framework
  2022-11-29 23:34 ` [PATCH v2 3/4] thermal/drivers/intel_powerclamp: Use powercap idle-inject framework Srinivas Pandruvada
@ 2023-01-12 18:32   ` Rafael J. Wysocki
  2023-01-12 20:23     ` srinivas pandruvada
  0 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2023-01-12 18:32 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: rafael, linux-pm, linux-kernel, daniel.lezcano, rui.zhang, amitk,
	kernel test robot

On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> There are two idle injection implementation in the Linux kernel. One
> via intel_powerclamp and the other using powercap/idle_inject. Both
> implementation end up in calling play_idle* function from a FIFO
> priority thread. Both can't be used at the same time.
>
> Currently per core idle injection (cpuidle_cooling) is using
> powercap/idle_inject, which is not used in platforms where
> intel_powerclamp is used for system wide idle injection. So there is
> no conflict. But there are some use cases where per core idle injection
> is beneficial on the same system where system wide idle injection is
> also used via intel_powerclamp. To avoid conflict only one of the idle
> injection type must be in use at a time. This require a common framework
> which both per core and system wide idle injection can use.
>
> Here powercap/idle_inject can be used for both per-core and for system
> wide idle injection. This framework has a well defined interface
> which allow registry for per-core or for all CPUs (system wide). If
> particular CPU is already participating in idle injection, the call
> to registry fails. Here the registry can be done when user space
> changes the current cooling device state.
>
> Also one framework for idle injection is better as there is one loop
> calling play_idle*, instead of multiple for better maintenance.
>
> So, reuse powercap/idle_inject calls in intel_powerclamp. This simplifies
> the code as all per CPU kthreads which calls play_idle* can be removed.
>
> The changes include:
> - Remove unneeded include files
> - Remove per CPU kthread workers: balancing_work and idle_injection_work
> - Reuse the compensation related code by moving from previous worker
> thread to idle_injection callbacks
> - Adjust the idle_duration and runtime by using powercap/idle_inject
> interface
> - Remove all variables, which are not required once powercap/idle_inject
> is used
> - Add mutex to avoid race during removal of idle injection during module
> unload and user action to change idle inject percent
> - Use READ_ONCE and WRITE_ONCE for data accessed from multiple CPUs
>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
> v2:
> - Use idle_inject_register_full instead of idle_inject_register
> - Also fix dependency issue with POWERCAP config
> Reported-by: kernel test robot <lkp@intel.com>
>
>  drivers/thermal/intel/Kconfig            |   2 +
>  drivers/thermal/intel/intel_powerclamp.c | 292 ++++++++++-------------
>  2 files changed, 126 insertions(+), 168 deletions(-)
>
> diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
> index f0c845679250..6c2a95f41c81 100644
> --- a/drivers/thermal/intel/Kconfig
> +++ b/drivers/thermal/intel/Kconfig
> @@ -3,6 +3,8 @@ config INTEL_POWERCLAMP
>         tristate "Intel PowerClamp idle injection driver"
>         depends on X86
>         depends on CPU_SUP_INTEL
> +       select POWERCAP
> +       select IDLE_INJECT
>         help
>           Enable this to enable Intel PowerClamp idle injection driver. This
>           enforce idle time which results in more package C-state residency. The
> diff --git a/drivers/thermal/intel/intel_powerclamp.c b/drivers/thermal/intel/intel_powerclamp.c
> index b80e25ec1261..3f2b20ae8f68 100644
> --- a/drivers/thermal/intel/intel_powerclamp.c
> +++ b/drivers/thermal/intel/intel_powerclamp.c
> @@ -2,7 +2,7 @@
>  /*
>   * intel_powerclamp.c - package c-state idle injection
>   *
> - * Copyright (c) 2012, Intel Corporation.
> + * Copyright (c) 2022, Intel Corporation.

Nit: I would retain the original year of introduction, so 2012 - 2022.

>   *
>   * Authors:
>   *     Arjan van de Ven <arjan@linux.intel.com>
> @@ -27,21 +27,15 @@
>  #include <linux/module.h>
>  #include <linux/kernel.h>
>  #include <linux/delay.h>
> -#include <linux/kthread.h>
>  #include <linux/cpu.h>
>  #include <linux/thermal.h>
> -#include <linux/slab.h>
> -#include <linux/tick.h>
>  #include <linux/debugfs.h>
>  #include <linux/seq_file.h>
> -#include <linux/sched/rt.h>
> -#include <uapi/linux/sched/types.h>
> +#include <linux/idle_inject.h>
>
> -#include <asm/nmi.h>
>  #include <asm/msr.h>
>  #include <asm/mwait.h>
>  #include <asm/cpu_device_id.h>
> -#include <asm/hardirq.h>
>
>  #define MAX_TARGET_RATIO (50U)
>  /* For each undisturbed clamping period (no extra wake ups during idle time),
> @@ -60,6 +54,7 @@ static struct dentry *debug_dir;
>
>  /* user selected target */
>  static unsigned int set_target_ratio;
> +static bool target_ratio_updated;
>  static unsigned int current_ratio;
>  static bool should_skip;
>
> @@ -67,26 +62,20 @@ static unsigned int control_cpu; /* The cpu assigned to collect stat and update
>                                   * control parameters. default to BSP but BSP
>                                   * can be offlined.
>                                   */
> -static bool clamping;
> -
> -struct powerclamp_worker_data {
> -       struct kthread_worker *worker;
> -       struct kthread_work balancing_work;
> -       struct kthread_delayed_work idle_injection_work;
> +struct powerclamp_data {
>         unsigned int cpu;
>         unsigned int count;
>         unsigned int guard;
>         unsigned int window_size_now;
>         unsigned int target_ratio;
> -       unsigned int duration_jiffies;
>         bool clamping;
>  };
>
> -static struct powerclamp_worker_data __percpu *worker_data;
> +static struct powerclamp_data powerclamp_data;
> +
>  static struct thermal_cooling_device *cooling_dev;
> -static unsigned long *cpu_clamping_mask;  /* bit map for tracking per cpu
> -                                          * clamping kthread worker
> -                                          */
> +
> +static DEFINE_MUTEX(powerclamp_lock);
>
>  static unsigned int duration;
>  static unsigned int pkg_cstate_ratio_cur;
> @@ -344,79 +333,33 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
>         return set_target_ratio + guard <= current_ratio;
>  }
>
> -static void clamp_balancing_func(struct kthread_work *work)
> +static unsigned int get_run_time(void)
>  {
> -       struct powerclamp_worker_data *w_data;
> -       int sleeptime;
> -       unsigned long target_jiffies;
>         unsigned int compensated_ratio;
> -       int interval; /* jiffies to sleep for each attempt */
> -
> -       w_data = container_of(work, struct powerclamp_worker_data,
> -                             balancing_work);
> +       unsigned int runtime;
>
>         /*
>          * make sure user selected ratio does not take effect until
>          * the next round. adjust target_ratio if user has changed
>          * target such that we can converge quickly.
>          */
> -       w_data->target_ratio = READ_ONCE(set_target_ratio);
> -       w_data->guard = 1 + w_data->target_ratio / 20;
> -       w_data->window_size_now = window_size;
> -       w_data->duration_jiffies = msecs_to_jiffies(duration);
> -       w_data->count++;
> +       powerclamp_data.target_ratio = READ_ONCE(set_target_ratio);
> +       powerclamp_data.guard = 1 + powerclamp_data.target_ratio / 20;
> +       powerclamp_data.window_size_now = window_size;
>
>         /*
>          * systems may have different ability to enter package level
>          * c-states, thus we need to compensate the injected idle ratio
>          * to achieve the actual target reported by the HW.
>          */
> -       compensated_ratio = w_data->target_ratio +
> -               get_compensation(w_data->target_ratio);
> +       compensated_ratio = powerclamp_data.target_ratio +
> +               get_compensation(powerclamp_data.target_ratio);
>         if (compensated_ratio <= 0)
>                 compensated_ratio = 1;
> -       interval = w_data->duration_jiffies * 100 / compensated_ratio;
> -
> -       /* align idle time */
> -       target_jiffies = roundup(jiffies, interval);
> -       sleeptime = target_jiffies - jiffies;
> -       if (sleeptime <= 0)
> -               sleeptime = 1;
> -
> -       if (clamping && w_data->clamping && cpu_online(w_data->cpu))
> -               kthread_queue_delayed_work(w_data->worker,
> -                                          &w_data->idle_injection_work,
> -                                          sleeptime);
> -}
>
> -static void clamp_idle_injection_func(struct kthread_work *work)
> -{
> -       struct powerclamp_worker_data *w_data;
> -
> -       w_data = container_of(work, struct powerclamp_worker_data,
> -                             idle_injection_work.work);
> +       runtime = duration * 100 / compensated_ratio - duration;
>
> -       /*
> -        * only elected controlling cpu can collect stats and update
> -        * control parameters.
> -        */
> -       if (w_data->cpu == control_cpu &&
> -           !(w_data->count % w_data->window_size_now)) {
> -               should_skip =
> -                       powerclamp_adjust_controls(w_data->target_ratio,
> -                                                  w_data->guard,
> -                                                  w_data->window_size_now);
> -               smp_mb();
> -       }
> -
> -       if (should_skip)
> -               goto balance;
> -
> -       play_idle(jiffies_to_usecs(w_data->duration_jiffies));
> -
> -balance:
> -       if (clamping && w_data->clamping && cpu_online(w_data->cpu))
> -               kthread_queue_work(w_data->worker, &w_data->balancing_work);
> +       return runtime;
>  }
>
>  /*
> @@ -452,104 +395,127 @@ static void poll_pkg_cstate(struct work_struct *dummy)
>         msr_last = msr_now;
>         tsc_last = tsc_now;
>
> -       if (true == clamping)
> +       if (powerclamp_data.clamping)
>                 schedule_delayed_work(&poll_pkg_cstate_work, HZ);
>  }
>
> -static void start_power_clamp_worker(unsigned long cpu)
> +static struct idle_inject_device *ii_dev;
> +
> +static int idle_inject_begin(unsigned int cpu)

So this would be the ->prepare() callback to be invoked on each CPU
from idle_inject_fn() IIUC.

>  {
> -       struct powerclamp_worker_data *w_data = per_cpu_ptr(worker_data, cpu);
> -       struct kthread_worker *worker;
> +       /*
> +        * only elected controlling cpu can collect stats and update
> +        * control parameters.
> +        */
> +       if (cpu == control_cpu) {
> +               bool update = READ_ONCE(target_ratio_updated);
> +
> +               if (!(powerclamp_data.count % powerclamp_data.window_size_now)) {
> +                       bool skip = powerclamp_adjust_controls(powerclamp_data.target_ratio,
> +                                                      powerclamp_data.guard,
> +                                                      powerclamp_data.window_size_now);
> +                       WRITE_ONCE(should_skip, skip);
> +                       update = true;
> +               }
>
> -       worker = kthread_create_worker_on_cpu(cpu, 0, "kidle_inj/%ld", cpu);
> -       if (IS_ERR(worker))
> -               return;
> +               if (update) {
> +                       unsigned int runtime;
> +
> +                       runtime = get_run_time();
> +                       idle_inject_set_duration(ii_dev, runtime, duration);
> +                       WRITE_ONCE(target_ratio_updated, false);
> +               }
> +               powerclamp_data.count++;
> +       }
> +
> +       if (READ_ONCE(should_skip))
> +               return -EAGAIN;

This has a bit of a synchronization issue, because the control CPU is
not guaranteed to run this code before any other CPUs in the given
cycle, so at least some of them may see a stale value of should_skip
and they will still inject idle in this cycle.  Or else, they may skip
idle injection when it should be done.

I think that it would be better to run the callback from
idle_inject_timer_fn() where it would decide whether or not to call
idle_inject_wakeup(), in which case the control CPU would not be
needed any more (which would be a plus), because the "control" could
be done by the CPU running the timer function, whichever it is.

Does this sound viable?  Or if it doesn't, then why?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/4] thermal/drivers/intel_powerclamp: Use powercap idle-inject framework
  2023-01-12 18:32   ` Rafael J. Wysocki
@ 2023-01-12 20:23     ` srinivas pandruvada
  2023-01-12 21:03       ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: srinivas pandruvada @ 2023-01-12 20:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, daniel.lezcano, rui.zhang, amitk,
	kernel test robot

On Thu, 2023-01-12 at 19:32 +0100, Rafael J. Wysocki wrote:
> On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
> <srinivas.pandruvada@linux.intel.com> wrote:
> > 
> > There are two idle injection implementation in the Linux kernel.
> > One
> > via intel_powerclamp and the other using powercap/idle_inject. Both
> > implementation end up in calling play_idle* function from a FIFO
> > priority thread. Both can't be used at the same time.
> > 
> > Currently per core idle injection (cpuidle_cooling) is using
> > powercap/idle_inject, which is not used in platforms where
> > intel_powerclamp is used for system wide idle injection. So there
> > is
> > no conflict. But there are some use cases where per core idle
> > injection
> > is beneficial on the same system where system wide idle injection
> > is
> > also used via intel_powerclamp. To avoid conflict only one of the
> > idle
> > injection type must be in use at a time. This require a common
> > framework
> > which both per core and system wide idle injection can use.
> > 
> > Here powercap/idle_inject can be used for both per-core and for
> > system
> > wide idle injection. This framework has a well defined interface
> > which allow registry for per-core or for all CPUs (system wide). If
> > particular CPU is already participating in idle injection, the call
> > to registry fails. Here the registry can be done when user space
> > changes the current cooling device state.
> > 
> > Also one framework for idle injection is better as there is one
> > loop
> > calling play_idle*, instead of multiple for better maintenance.
> > 
> > So, reuse powercap/idle_inject calls in intel_powerclamp. This
> > simplifies
> > the code as all per CPU kthreads which calls play_idle* can be
> > removed.
> > 
> > The changes include:
> > - Remove unneeded include files
> > - Remove per CPU kthread workers: balancing_work and
> > idle_injection_work
> > - Reuse the compensation related code by moving from previous
> > worker
> > thread to idle_injection callbacks
> > - Adjust the idle_duration and runtime by using
> > powercap/idle_inject
> > interface
> > - Remove all variables, which are not required once
> > powercap/idle_inject
> > is used
> > - Add mutex to avoid race during removal of idle injection during
> > module
> > unload and user action to change idle inject percent
> > - Use READ_ONCE and WRITE_ONCE for data accessed from multiple CPUs
> > 
> > Signed-off-by: Srinivas Pandruvada
> > <srinivas.pandruvada@linux.intel.com>
> > ---
> > v2:
> > - Use idle_inject_register_full instead of idle_inject_register
> > - Also fix dependency issue with POWERCAP config
> > Reported-by: kernel test robot <lkp@intel.com>
> > 
> >  drivers/thermal/intel/Kconfig            |   2 +
> >  drivers/thermal/intel/intel_powerclamp.c | 292 ++++++++++---------
> > ----
> >  2 files changed, 126 insertions(+), 168 deletions(-)
> > 
> > diff --git a/drivers/thermal/intel/Kconfig
> > b/drivers/thermal/intel/Kconfig
> > index f0c845679250..6c2a95f41c81 100644
> > --- a/drivers/thermal/intel/Kconfig
> > +++ b/drivers/thermal/intel/Kconfig
> > @@ -3,6 +3,8 @@ config INTEL_POWERCLAMP
> >         tristate "Intel PowerClamp idle injection driver"
> >         depends on X86
> >         depends on CPU_SUP_INTEL
> > +       select POWERCAP
> > +       select IDLE_INJECT
> >         help
> >           Enable this to enable Intel PowerClamp idle injection
> > driver. This
> >           enforce idle time which results in more package C-state
> > residency. The
> > diff --git a/drivers/thermal/intel/intel_powerclamp.c
> > b/drivers/thermal/intel/intel_powerclamp.c
> > index b80e25ec1261..3f2b20ae8f68 100644
> > --- a/drivers/thermal/intel/intel_powerclamp.c
> > +++ b/drivers/thermal/intel/intel_powerclamp.c
> > @@ -2,7 +2,7 @@
> >  /*
> >   * intel_powerclamp.c - package c-state idle injection
> >   *
> > - * Copyright (c) 2012, Intel Corporation.
> > + * Copyright (c) 2022, Intel Corporation.
> 
> Nit: I would retain the original year of introduction, so 2012 -
> 2022.
OK

> 
> >   *
> > 

[...]

> > +
> > +static int idle_inject_begin(unsigned int cpu)
> 
> So this would be the ->prepare() callback to be invoked on each CPU
> from idle_inject_fn() IIUC.
> 
Yes

> >  {
> > -       struct powerclamp_worker_data *w_data =
> > per_cpu_ptr(worker_data, cpu);
> > -       struct kthread_worker *worker;
> > +       /*
> > +        * only elected controlling cpu can collect stats and
> > update
> > +        * control parameters.
> > +        */
> > +       if (cpu == control_cpu) {
> > +               bool update = READ_ONCE(target_ratio_updated);
> > +
> > +               if (!(powerclamp_data.count %
> > powerclamp_data.window_size_now)) {
> > +                       bool skip =
> > powerclamp_adjust_controls(powerclamp_data.target_ratio,
> > +                                                     
> > powerclamp_data.guard,
> > +                                                     
> > powerclamp_data.window_size_now);
> > +                       WRITE_ONCE(should_skip, skip);
> > +                       update = true;
> > +               }
> > 
> > -       worker = kthread_create_worker_on_cpu(cpu, 0,
> > "kidle_inj/%ld", cpu);
> > -       if (IS_ERR(worker))
> > -               return;
> > +               if (update) {
> > +                       unsigned int runtime;
> > +
> > +                       runtime = get_run_time();
> > +                       idle_inject_set_duration(ii_dev, runtime,
> > duration);
> > +                       WRITE_ONCE(target_ratio_updated, false);
> > +               }
> > +               powerclamp_data.count++;
> > +       }
> > +
> > +       if (READ_ONCE(should_skip))
> > +               return -EAGAIN;
> 
> This has a bit of a synchronization issue, because the control CPU is
> not guaranteed to run this code before any other CPUs in the given
> cycle, so at least some of them may see a stale value of should_skip
> and they will still inject idle in this cycle.  Or else, they may
> skip
> idle injection when it should be done.
This is correct observation. This is true in in even in current
implementation. The per thread timer in the existing implementation has
this sync issue. So I tried to just mimic current implementation as is.


> 
> I think that it would be better to run the callback from
> idle_inject_timer_fn() where it would decide whether or not to call
> idle_inject_wakeup(), in which case the control CPU would not be
> needed any more (which would be a plus), because the "control" could
> be done by the CPU running the timer function, whichever it is.
> 
> Does this sound viable? 
Yes it is. In this case prepare() callback from idle_inject core is not
per CPU, but per device.

Thanks,
Srinivas

>  Or if it doesn't, then why?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/4] thermal/drivers/intel_powerclamp: Use powercap idle-inject framework
  2023-01-12 20:23     ` srinivas pandruvada
@ 2023-01-12 21:03       ` Rafael J. Wysocki
  0 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2023-01-12 21:03 UTC (permalink / raw)
  To: srinivas pandruvada
  Cc: Rafael J. Wysocki, linux-pm, linux-kernel, daniel.lezcano,
	rui.zhang, amitk, kernel test robot

On Thu, Jan 12, 2023 at 9:23 PM srinivas pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> On Thu, 2023-01-12 at 19:32 +0100, Rafael J. Wysocki wrote:
> > On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
> > <srinivas.pandruvada@linux.intel.com> wrote:
> > >
> > > There are two idle injection implementation in the Linux kernel.
> > > One
> > > via intel_powerclamp and the other using powercap/idle_inject. Both
> > > implementation end up in calling play_idle* function from a FIFO
> > > priority thread. Both can't be used at the same time.
> > >
> > > Currently per core idle injection (cpuidle_cooling) is using
> > > powercap/idle_inject, which is not used in platforms where
> > > intel_powerclamp is used for system wide idle injection. So there
> > > is
> > > no conflict. But there are some use cases where per core idle
> > > injection
> > > is beneficial on the same system where system wide idle injection
> > > is
> > > also used via intel_powerclamp. To avoid conflict only one of the
> > > idle
> > > injection type must be in use at a time. This require a common
> > > framework
> > > which both per core and system wide idle injection can use.
> > >
> > > Here powercap/idle_inject can be used for both per-core and for
> > > system
> > > wide idle injection. This framework has a well defined interface
> > > which allow registry for per-core or for all CPUs (system wide). If
> > > particular CPU is already participating in idle injection, the call
> > > to registry fails. Here the registry can be done when user space
> > > changes the current cooling device state.
> > >
> > > Also one framework for idle injection is better as there is one
> > > loop
> > > calling play_idle*, instead of multiple for better maintenance.
> > >
> > > So, reuse powercap/idle_inject calls in intel_powerclamp. This
> > > simplifies
> > > the code as all per CPU kthreads which calls play_idle* can be
> > > removed.
> > >
> > > The changes include:
> > > - Remove unneeded include files
> > > - Remove per CPU kthread workers: balancing_work and
> > > idle_injection_work
> > > - Reuse the compensation related code by moving from previous
> > > worker
> > > thread to idle_injection callbacks
> > > - Adjust the idle_duration and runtime by using
> > > powercap/idle_inject
> > > interface
> > > - Remove all variables, which are not required once
> > > powercap/idle_inject
> > > is used
> > > - Add mutex to avoid race during removal of idle injection during
> > > module
> > > unload and user action to change idle inject percent
> > > - Use READ_ONCE and WRITE_ONCE for data accessed from multiple CPUs
> > >
> > > Signed-off-by: Srinivas Pandruvada
> > > <srinivas.pandruvada@linux.intel.com>
> > > ---
> > > v2:
> > > - Use idle_inject_register_full instead of idle_inject_register
> > > - Also fix dependency issue with POWERCAP config
> > > Reported-by: kernel test robot <lkp@intel.com>
> > >
> > >  drivers/thermal/intel/Kconfig            |   2 +
> > >  drivers/thermal/intel/intel_powerclamp.c | 292 ++++++++++---------
> > > ----
> > >  2 files changed, 126 insertions(+), 168 deletions(-)
> > >
> > > diff --git a/drivers/thermal/intel/Kconfig
> > > b/drivers/thermal/intel/Kconfig
> > > index f0c845679250..6c2a95f41c81 100644
> > > --- a/drivers/thermal/intel/Kconfig
> > > +++ b/drivers/thermal/intel/Kconfig
> > > @@ -3,6 +3,8 @@ config INTEL_POWERCLAMP
> > >         tristate "Intel PowerClamp idle injection driver"
> > >         depends on X86
> > >         depends on CPU_SUP_INTEL
> > > +       select POWERCAP
> > > +       select IDLE_INJECT
> > >         help
> > >           Enable this to enable Intel PowerClamp idle injection
> > > driver. This
> > >           enforce idle time which results in more package C-state
> > > residency. The
> > > diff --git a/drivers/thermal/intel/intel_powerclamp.c
> > > b/drivers/thermal/intel/intel_powerclamp.c
> > > index b80e25ec1261..3f2b20ae8f68 100644
> > > --- a/drivers/thermal/intel/intel_powerclamp.c
> > > +++ b/drivers/thermal/intel/intel_powerclamp.c
> > > @@ -2,7 +2,7 @@
> > >  /*
> > >   * intel_powerclamp.c - package c-state idle injection
> > >   *
> > > - * Copyright (c) 2012, Intel Corporation.
> > > + * Copyright (c) 2022, Intel Corporation.
> >
> > Nit: I would retain the original year of introduction, so 2012 -
> > 2022.
> OK
>
> >
> > >   *
> > >
>
> [...]
>
> > > +
> > > +static int idle_inject_begin(unsigned int cpu)
> >
> > So this would be the ->prepare() callback to be invoked on each CPU
> > from idle_inject_fn() IIUC.
> >
> Yes
>
> > >  {
> > > -       struct powerclamp_worker_data *w_data =
> > > per_cpu_ptr(worker_data, cpu);
> > > -       struct kthread_worker *worker;
> > > +       /*
> > > +        * only elected controlling cpu can collect stats and
> > > update
> > > +        * control parameters.
> > > +        */
> > > +       if (cpu == control_cpu) {
> > > +               bool update = READ_ONCE(target_ratio_updated);
> > > +
> > > +               if (!(powerclamp_data.count %
> > > powerclamp_data.window_size_now)) {
> > > +                       bool skip =
> > > powerclamp_adjust_controls(powerclamp_data.target_ratio,
> > > +
> > > powerclamp_data.guard,
> > > +
> > > powerclamp_data.window_size_now);
> > > +                       WRITE_ONCE(should_skip, skip);
> > > +                       update = true;
> > > +               }
> > >
> > > -       worker = kthread_create_worker_on_cpu(cpu, 0,
> > > "kidle_inj/%ld", cpu);
> > > -       if (IS_ERR(worker))
> > > -               return;
> > > +               if (update) {
> > > +                       unsigned int runtime;
> > > +
> > > +                       runtime = get_run_time();
> > > +                       idle_inject_set_duration(ii_dev, runtime,
> > > duration);
> > > +                       WRITE_ONCE(target_ratio_updated, false);
> > > +               }
> > > +               powerclamp_data.count++;
> > > +       }
> > > +
> > > +       if (READ_ONCE(should_skip))
> > > +               return -EAGAIN;
> >
> > This has a bit of a synchronization issue, because the control CPU is
> > not guaranteed to run this code before any other CPUs in the given
> > cycle, so at least some of them may see a stale value of should_skip
> > and they will still inject idle in this cycle.  Or else, they may
> > skip
> > idle injection when it should be done.
> This is correct observation. This is true in in even in current
> implementation. The per thread timer in the existing implementation has
> this sync issue. So I tried to just mimic current implementation as is.

I see, but I don't think that the new implementation has to be bug
compatible with the old one.

> >
> > I think that it would be better to run the callback from
> > idle_inject_timer_fn() where it would decide whether or not to call
> > idle_inject_wakeup(), in which case the control CPU would not be
> > needed any more (which would be a plus), because the "control" could
> > be done by the CPU running the timer function, whichever it is.
> >
> > Does this sound viable?
>
> Yes it is. In this case prepare() callback from idle_inject core is not
> per CPU, but per device.

Right.

BTW, I also would call it "update" and make it return bool, so
idle_inject_wakeup() would be called when it returned 'true'.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-01-12 21:24 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-29 23:34 [PATCH v2 0/4] Per CPU idle injection Srinivas Pandruvada
2022-11-29 23:34 ` [PATCH v2 1/4] powercap: idle_inject: Export symbols Srinivas Pandruvada
2023-01-12 15:05   ` Rafael J. Wysocki
2022-11-29 23:34 ` [PATCH v2 2/4] powercap: idle_inject: Add prepare/complete callbacks Srinivas Pandruvada
2022-12-21 14:52   ` Daniel Lezcano
2022-12-21 20:58     ` srinivas pandruvada
2022-12-22  9:50       ` Daniel Lezcano
2022-12-22 17:36         ` srinivas pandruvada
2023-01-12 14:37           ` Rafael J. Wysocki
2023-01-12 14:53   ` Rafael J. Wysocki
2023-01-12 17:25     ` Rafael J. Wysocki
2023-01-12 18:13       ` srinivas pandruvada
2022-11-29 23:34 ` [PATCH v2 3/4] thermal/drivers/intel_powerclamp: Use powercap idle-inject framework Srinivas Pandruvada
2023-01-12 18:32   ` Rafael J. Wysocki
2023-01-12 20:23     ` srinivas pandruvada
2023-01-12 21:03       ` Rafael J. Wysocki
2022-11-29 23:34 ` [PATCH v2 4/4] thermal/drivers/intel_cpu_idle_cooling: Introduce Intel cpu idle cooling driver Srinivas Pandruvada
2022-12-21 15:10   ` Daniel Lezcano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).