* [RFC PATCH v2 0/3] CFS idle injection
@ 2015-11-10  0:21 Jacob Pan
  2015-11-10  0:21 ` [RFC PATCH v2 1/3] ktime: add a roundup function Jacob Pan
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Jacob Pan @ 2015-11-10  0:21 UTC (permalink / raw)
  To: Peter Zijlstra, LKML
  Cc: Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada, Jacob Pan

Hi Peter and all,

Changes since initial RFC:
	- integrated Peter's patch to optimize hot path
	- check softirq, allow softirqd
	- changed tracing function to include string message

Kconfig is not changed; I wanted to hear more feedback on whether this should
be a CONFIG option at all, since we have already optimized the hot path.


Intro:
A while ago, we had a discussion about how powerclamp is broken with respect
to turning off idle ticks during the forced idle period.
https://lkml.org/lkml/2014/12/18/369

It was suggested to replace the current kthread play-idle loop with a
timer-based runqueue throttling scheme. I finally got around to implementing
this, and the code is much simpler. I also have good test results in terms of
efficiency, scalability, etc.
http://events.linuxfoundation.org/sites/events/files/slides/LinuxCon_Japan_2015_idle_injection1_0.pdf
Slides 18 onward show the data on client and server platforms.

I have two choices for this code:
1) be part of the existing powerclamp driver, but this requires exporting
   some sched APIs.
2) be part of sched, since the general rule applies when it comes down
   to synchronized idle time for best power savings.

The patches below are for #2. There is a known problem with LOW-RES timer
mode that I am still working on, but I am hoping to get review feedback earlier.

We are entering a very power-limited environment on the client side, where
frequency scaling is only efficient within a certain range, e.g. on SKL down
to ~900MHz; below that, it becomes increasingly more efficient to do
coordinated C-state insertion instead.

Looking forward, there are use cases beyond thermal/power capping. I think
we can consolidate balanced, partially busy workloads that are evenly
distributed among CPUs.

Please let me know what you think.

Thanks,



Jacob Pan (3):
  ktime: add a roundup function
  timer: relax tick stop in idle entry
  sched: introduce synchronized idle injection

 include/linux/ktime.h        |  10 ++
 include/linux/sched.h        |  12 ++
 include/linux/sched/sysctl.h |   5 +
 include/trace/events/sched.h |  23 +++
 init/Kconfig                 |   8 +
 kernel/sched/fair.c          | 381 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h         |   2 +-
 kernel/sysctl.c              |  20 +++
 kernel/time/tick-sched.c     |   2 +-
 9 files changed, 457 insertions(+), 6 deletions(-)

-- 
1.9.1



* [RFC PATCH v2 1/3] ktime: add a roundup function
  2015-11-10  0:21 [RFC PATCH v2 0/3] CFS idle injection Jacob Pan
@ 2015-11-10  0:21 ` Jacob Pan
  2015-11-10  0:21 ` [RFC PATCH v2 2/3] timer: relax tick stop in idle entry Jacob Pan
  2015-11-10  0:21 ` [RFC PATCH v2 3/3] sched: introduce synchronized idle injection Jacob Pan
  2 siblings, 0 replies; 13+ messages in thread
From: Jacob Pan @ 2015-11-10  0:21 UTC (permalink / raw)
  To: Peter Zijlstra, LKML
  Cc: Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada, Jacob Pan

The ktime roundup function can be used to keep timers aligned and to
prevent drift for recurring timeouts.
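
For illustration only (not part of this patch), rounding a 25 ms offset up to
the next 10 ms period boundary would look like:

	ktime_t period = ms_to_ktime(10);
	ktime_t offset = ms_to_ktime(25);

	/* (25 + 10 - 1) / 10 = 3 whole periods, so offset becomes 30 ms */
	offset = ktime_roundup(offset, period);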

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 include/linux/ktime.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index 2b6a204..2e293fa 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -233,6 +233,16 @@ static inline ktime_t ktime_sub_us(const ktime_t kt, const u64 usec)
 
 extern ktime_t ktime_add_safe(const ktime_t lhs, const ktime_t rhs);
 
+static inline ktime_t ktime_roundup(ktime_t x, ktime_t y)
+{
+	u64 temp_tv64;
+
+	temp_tv64 = x.tv64 + y.tv64 - 1;
+	temp_tv64 = div64_u64(temp_tv64, y.tv64);
+	x.tv64 = temp_tv64 * y.tv64;
+
+	return x;
+}
 /**
  * ktime_to_timespec_cond - convert a ktime_t variable to timespec
  *			    format only if the variable contains data
-- 
1.9.1



* [RFC PATCH v2 2/3] timer: relax tick stop in idle entry
  2015-11-10  0:21 [RFC PATCH v2 0/3] CFS idle injection Jacob Pan
  2015-11-10  0:21 ` [RFC PATCH v2 1/3] ktime: add a roundup function Jacob Pan
@ 2015-11-10  0:21 ` Jacob Pan
  2015-11-10  0:21 ` [RFC PATCH v2 3/3] sched: introduce synchronized idle injection Jacob Pan
  2 siblings, 0 replies; 13+ messages in thread
From: Jacob Pan @ 2015-11-10  0:21 UTC (permalink / raw)
  To: Peter Zijlstra, LKML
  Cc: Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada, Jacob Pan

Upon entering idle, we can turn off the tick if the next timeout is exactly
one tick away. Otherwise, we could enter the inner idle loop with the tick
still enabled; without resched set, the tick will continue during idle, which
is less optimal in terms of energy savings.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 kernel/time/tick-sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7c7ec45..bcadaab 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -606,7 +606,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 	 * restart it proper.
 	 */
 	delta = next_tick - basemono;
-	if (delta <= (u64)TICK_NSEC) {
+	if (delta < (u64)TICK_NSEC) {
 		tick.tv64 = 0;
 		if (!ts->tick_stopped)
 			goto out;
-- 
1.9.1



* [RFC PATCH v2 3/3] sched: introduce synchronized idle injection
  2015-11-10  0:21 [RFC PATCH v2 0/3] CFS idle injection Jacob Pan
  2015-11-10  0:21 ` [RFC PATCH v2 1/3] ktime: add a roundup function Jacob Pan
  2015-11-10  0:21 ` [RFC PATCH v2 2/3] timer: relax tick stop in idle entry Jacob Pan
@ 2015-11-10  0:21 ` Jacob Pan
  2015-11-10 13:23   ` Peter Zijlstra
  2 siblings, 1 reply; 13+ messages in thread
From: Jacob Pan @ 2015-11-10  0:21 UTC (permalink / raw)
  To: Peter Zijlstra, LKML
  Cc: Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada, Jacob Pan

With increasingly constrained power and thermal budget, it's often
necessary to cap power via throttling. Throttling individual CPUs
or devices at random times can help power capping but may not be
optimal in terms of energy efficiency.

In general, the optimal solution in terms of energy efficiency is
to align idle periods such that more shared circuits can be power
gated to enter lower power states.

This patch introduces a scheduler-based idle injection method. It works by
blocking the CFS runqueues synchronously and periodically; the actions on all
online CPUs are orchestrated by per-CPU hrtimers.

Two sysctl knobs are exposed to userspace for selecting the percentage of
idle time as well as the forced idle duration for each injected idle
period.
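
For illustration (names made up here, mirroring the arithmetic in the proc
handler below), the forced idle duration is fixed per period and the non-idle
interval is derived from the requested idle percentage:

	static unsigned int compute_interval_msecs(unsigned int duration_msecs,
						   unsigned int idle_pct)
	{
		if (!idle_pct)
			return 0;
		/* e.g. duration = 10 ms and 33% idle give an interval of ~20 ms */
		return duration_msecs * (100 - idle_pct) / idle_pct;
	}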

Since only the CFS class is targeted, other higher-priority activity such as
RT tasks, softirqs, and interrupts is not affected.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 include/linux/sched.h        |  12 ++
 include/linux/sched/sysctl.h |   5 +
 include/trace/events/sched.h |  23 +++
 init/Kconfig                 |   8 +
 kernel/sched/fair.c          | 381 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h         |   2 +-
 kernel/sysctl.c              |  20 +++
 7 files changed, 446 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b7b9501..ae8be25 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3181,3 +3181,15 @@ static inline unsigned long rlimit_max(unsigned int limit)
 }
 
 #endif
+
+#ifdef CONFIG_CFS_IDLE_INJECT
+extern int proc_sched_cfs_idle_inject_pct_handler(struct ctl_table *table,
+						int write,
+						void __user *buffer,
+						size_t *length, loff_t *ppos);
+extern int proc_sched_cfs_idle_inject_duration_handler(struct ctl_table *table,
+						int write,
+						void __user *buffer,
+						size_t *length, loff_t *ppos);
+
+#endif
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731..d32da45 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -81,6 +81,11 @@ extern unsigned int sysctl_sched_cfs_bandwidth_slice;
 extern unsigned int sysctl_sched_autogroup_enabled;
 #endif
 
+#ifdef CONFIG_CFS_IDLE_INJECT
+extern unsigned int sysctl_sched_cfs_idle_inject_pct;
+extern unsigned int sysctl_sched_cfs_idle_inject_duration;
+#endif
+
 extern int sched_rr_timeslice;
 
 extern int sched_rr_handler(struct ctl_table *table, int write,
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 539d6bc..50bd7b6 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -566,6 +566,29 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
 
 	TP_printk("cpu=%d", __entry->cpu)
 );
+
+/*
+ * Tracepoint for idle injection
+ */
+TRACE_EVENT(sched_cfs_idle_inject,
+
+	TP_PROTO(char *msg, int throttled),
+
+	TP_ARGS(msg, throttled),
+
+	TP_STRUCT__entry(
+		__string(msg, msg)
+		__field(int, throttled)
+	),
+
+	TP_fast_assign(
+		__assign_str(msg, msg);
+		__entry->throttled = throttled;
+	),
+
+	TP_printk("%s: throttled=%d", __get_str(msg), __entry->throttled)
+);
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/init/Kconfig b/init/Kconfig
index c24b6f7..1f2960a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1086,6 +1086,14 @@ menuconfig CGROUP_SCHED
 	  bandwidth allocation to such task groups. It uses cgroups to group
 	  tasks.
 
+config CFS_IDLE_INJECT
+	bool "Synchronized CFS idle injection"
+	default n
+	help
+	  This option allows users to inject synchronized idle time across all CPUs.
+	  The feature will align idle time such that the entire CPU package can be duty
+	  cycled by going into the deepest/lowest power states.
+
 if CGROUP_SCHED
 config FAIR_GROUP_SCHED
 	bool "Group scheduling for SCHED_OTHER"
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..aa89227 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -30,6 +30,7 @@
 #include <linux/mempolicy.h>
 #include <linux/migrate.h>
 #include <linux/task_work.h>
+#include <linux/suspend.h>
 
 #include <trace/events/sched.h>
 
@@ -114,6 +115,17 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
+/*
+ * Knobs for controlling the percentage of time when idle is forced across all
+ * CPUs. This is a power management feature intended for achieving the deepest
+ * and broadest idle without lowering CPU frequencies to a less optimal level.
+ * No action is taken if CPUs are naturally idle.
+ */
+#ifdef CONFIG_CFS_IDLE_INJECT
+unsigned int sysctl_sched_cfs_idle_inject_pct;
+unsigned int sysctl_sched_cfs_idle_inject_duration = 10UL;
+#endif
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
@@ -2334,7 +2346,9 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		list_add(&se->group_node, &rq->cfs_tasks);
 	}
 #endif
-	cfs_rq->nr_running++;
+
+	if (!cfs_rq->nr_running++ && !cfs_rq->forced_idle)
+		cfs_rq->runnable = true;
 }
 
 static void
@@ -2347,7 +2361,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
 	}
-	cfs_rq->nr_running--;
+
+	if (!--cfs_rq->nr_running && !cfs_rq->forced_idle)
+		cfs_rq->runnable = false;
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -5139,7 +5155,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 
 again:
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (!cfs_rq->nr_running)
+	if (!cfs_rq->runnable)
 		goto idle;
 
 	if (prev->sched_class != &fair_sched_class)
@@ -5218,7 +5234,7 @@ simple:
 	cfs_rq = &rq->cfs;
 #endif
 
-	if (!cfs_rq->nr_running)
+	if (!cfs_rq->runnable)
 		goto idle;
 
 	put_prev_task(rq, prev);
@@ -5237,6 +5253,16 @@ simple:
 	return p;
 
 idle:
+	if ((cfs_rq->forced_idle)) {
+		if (unlikely(local_softirq_pending())) {
+			trace_sched_cfs_idle_inject("softirq pending", 1);
+			cfs_rq->forced_idle = false;
+			cfs_rq->runnable = cfs_rq->nr_running;
+			goto again;
+		}
+		trace_sched_cfs_idle_inject("forced idle", 1);
+		return NULL;
+	}
 	/*
 	 * This is OK, because current is on_cpu, which avoids it being picked
 	 * for load-balance and preemption/IRQs are still disabled avoiding
@@ -8318,3 +8344,350 @@ __init void init_sched_fair_class(void)
 #endif /* SMP */
 
 }
+
+#ifdef CONFIG_CFS_IDLE_INJECT
+static atomic_t idle_inject_active;
+static DEFINE_PER_CPU(struct hrtimer, idle_inject_timer);
+static DEFINE_PER_CPU(bool, idle_injected);
+/* protect injection parameters from runtime changes */
+static DEFINE_SPINLOCK(idle_inject_lock);
+
+/* Track which CPUs are being injected with idle period */
+static unsigned long *idle_inject_cpumask;
+
+/* Default idle injection duration for each period. */
+#define DEFAULT_DURATION_MSECS (10)
+
+static unsigned int duration; /* idle inject duration in msec. */
+static unsigned int inject_interval; /* non-idle interval between injections, in msec. */
+static unsigned int idle_pct; /* percentage of time idle is forced */
+/* starting reference time for all CPUs to align idle period */
+static ktime_t inject_start_time;
+static int prepare_idle_inject(void);
+
+static void throttle_rq(int cpu)
+{
+	unsigned int resched = 0;
+	unsigned long flags;
+	struct rq *rq = cpu_rq(cpu);
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	rq->cfs.forced_idle = true;
+	resched = rq->cfs.runnable;
+	rq->cfs.runnable = false;
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	if (resched)
+		resched_cpu(cpu);
+}
+
+static void unthrottle_rq(int cpu)
+{
+	unsigned int resched = 0;
+	unsigned long flags;
+	struct rq *rq = cpu_rq(cpu);
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	rq->cfs.forced_idle = false;
+	resched = rq->cfs.runnable = !!rq->cfs.nr_running;
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+	if (resched)
+		resched_cpu(cpu);
+}
+
+static enum hrtimer_restart idle_inject_timer_fn(struct hrtimer *hrtimer)
+{
+	int cpu = smp_processor_id();
+	struct hrtimer *hrt = this_cpu_ptr(&idle_inject_timer);
+	ktime_t now, delta, period;
+	bool status;
+
+	now = hrtimer_cb_get_time(hrt);
+
+	status = raw_cpu_read(idle_injected);
+	if (status) {
+		/*
+		 * We were injecting idle in the last phase, let's forward the
+		 * timer to the next period
+		 *
+		 * status: 1             0                1        0
+		 * ____          ____________________           _______
+		 *     |________|                    |_________|
+		 *
+		 *     |duration|      interval      |
+		 *
+		 *              ^ we are here
+		 *                  forward to here: ^
+		 */
+		delta = ktime_sub(now, inject_start_time);
+		period = ktime_add(ms_to_ktime(duration),
+				ms_to_ktime(inject_interval));
+		delta = ktime_roundup(delta, period);
+		hrtimer_set_expires(hrt, ktime_add(delta, inject_start_time));
+	} else {
+		/*
+		 * We were not injecting idle in the last phase, let's forward
+		 * timer after forced idle duration
+		 * ____          ____________________           _______
+		 *     |________|                    |_________|
+		 *
+		 *     |duration|      interval      |
+		 *
+		 *     ^ we are here
+		 *              ^ forward timer to here
+		 */
+		hrtimer_set_expires(hrt, ktime_add(ms_to_ktime(duration), now));
+	}
+	raw_cpu_write(idle_injected, !status);
+	trace_sched_cfs_idle_inject("idle sync timer", !status);
+	if (status)
+		unthrottle_rq(cpu);
+	else
+		throttle_rq(cpu);
+
+	return HRTIMER_RESTART;
+}
+
+static void idle_inject_timer_start(void *info)
+{
+	int cpu = smp_processor_id();
+	struct hrtimer *hrt = this_cpu_ptr(&idle_inject_timer);
+
+	this_cpu_write(idle_injected, true);
+	set_bit(cpu, idle_inject_cpumask);
+	hrtimer_start(hrt, ms_to_ktime(duration), HRTIMER_MODE_ABS_PINNED);
+	hrtimer_set_expires(hrt, *(ktime_t *)info);
+}
+
+static int start_idle_inject(void)
+{
+	int ret;
+	ktime_t now = ktime_get();
+
+	if (!atomic_read(&idle_inject_active)) {
+		/* called once per activation of idle injection */
+		ret = prepare_idle_inject();
+		if (ret)
+			return ret;
+	}
+	/* prevent cpu hotplug */
+	get_online_cpus();
+
+	/* set a future time to let all per-CPU timers expire at the same time */
+	now = ktime_roundup(now, ms_to_ktime(duration));
+
+	/* start one timer per online cpu */
+	inject_start_time = now;
+	on_each_cpu(idle_inject_timer_start, &now, 1);
+	atomic_set(&idle_inject_active, 1);
+
+	put_online_cpus();
+
+	return 0;
+}
+
+static void stop_idle_inject(void)
+{
+	int i;
+	struct hrtimer *hrt;
+
+	if (bitmap_weight(idle_inject_cpumask, num_possible_cpus())) {
+		for_each_set_bit(i, idle_inject_cpumask, num_possible_cpus()) {
+			hrt = &per_cpu(idle_inject_timer, i);
+			hrtimer_cancel(hrt);
+			unthrottle_rq(i);
+		}
+	}
+}
+
+static int idle_inject_cpu_callback(struct notifier_block *nfb,
+				unsigned long action, void *hcpu)
+{
+	unsigned long cpu = (unsigned long)hcpu;
+	struct hrtimer *hrt = &per_cpu(idle_inject_timer, cpu);
+	ktime_t now, delta, period;
+
+	if (!atomic_read(&idle_inject_active))
+		goto exit_ok;
+
+	switch (action) {
+	case CPU_STARTING:
+		raw_cpu_write(idle_injected, true);
+
+		hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+		hrt->function = idle_inject_timer_fn;
+		set_bit(cpu, idle_inject_cpumask);
+
+		now = hrtimer_cb_get_time(hrt);
+		hrtimer_start(hrt, ms_to_ktime(duration),
+			HRTIMER_MODE_ABS_PINNED);
+		/*
+		 * When a new CPU comes online, we need to make sure it aligns
+		 * its phase with the rest of the CPUs. So we set the
+		 * timer to the next period based on the common starting time,
+		 * then start injecting idle time.
+		 */
+		spin_lock_irq(&idle_inject_lock);
+		delta = ktime_sub(now, inject_start_time);
+		period = ktime_add(ms_to_ktime(duration),
+				ms_to_ktime(inject_interval));
+		delta = ktime_roundup(delta, period);
+		spin_unlock_irq(&idle_inject_lock);
+		hrtimer_set_expires(hrt, ktime_add(delta, inject_start_time));
+		break;
+	case CPU_DYING:
+		clear_bit(cpu, idle_inject_cpumask);
+		hrtimer_cancel(hrt);
+		raw_cpu_write(idle_injected, false);
+		unthrottle_rq(cpu);
+		break;
+	default:
+		return NOTIFY_DONE;
+	}
+exit_ok:
+	return NOTIFY_OK;
+}
+
+static int idle_inject_pm_callback(struct notifier_block *self,
+				unsigned long action, void *hcpu)
+{
+	switch (action) {
+	case PM_HIBERNATION_PREPARE:
+	case PM_SUSPEND_PREPARE:
+		if (atomic_read(&idle_inject_active))
+			stop_idle_inject();
+		break;
+	case PM_POST_HIBERNATION:
+	case PM_POST_SUSPEND:
+		printk("POST SUSPEND restart idle injection\n");
+		start_idle_inject();
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block idle_inject_pm_notifier = {
+	.notifier_call = idle_inject_pm_callback,
+};
+
+static struct notifier_block idle_inject_cpu_notifier = {
+	.notifier_call = idle_inject_cpu_callback,
+};
+
+static void end_idle_inject(void) {
+	unregister_hotcpu_notifier(&idle_inject_cpu_notifier);
+	unregister_pm_notifier(&idle_inject_pm_notifier);
+	atomic_set(&idle_inject_active, 0);
+	kfree(idle_inject_cpumask);
+}
+
+static int prepare_idle_inject(void)
+{
+	int retval = 0;
+	int bitmap_size;
+	int cpu;
+	struct hrtimer *hrt;
+
+	bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
+	idle_inject_cpumask = kzalloc(bitmap_size, GFP_KERNEL);
+	if (!idle_inject_cpumask)
+		return -ENOMEM;
+
+	retval = register_pm_notifier(&idle_inject_pm_notifier);
+	if (retval)
+		goto exit_free;
+	retval = register_hotcpu_notifier(&idle_inject_cpu_notifier);
+	if (retval)
+		goto exit_unregister_pm;
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		hrt = &per_cpu(idle_inject_timer, cpu);
+		hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+		hrt->function = idle_inject_timer_fn;
+	}
+	put_online_cpus();
+
+	if (!duration)
+		duration = DEFAULT_DURATION_MSECS;
+
+	return 0;
+exit_unregister_pm:
+	unregister_pm_notifier(&idle_inject_pm_notifier);
+exit_free:
+	kfree(idle_inject_cpumask);
+	return retval;
+}
+
+int proc_sched_cfs_idle_inject_pct_handler(struct ctl_table *table,
+					int write,
+					void __user *buffer,
+					size_t *length,	loff_t *ppos)
+{
+	int ret;
+
+	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (ret)
+		goto out;
+
+	if (idle_pct != sysctl_sched_cfs_idle_inject_pct) {
+		if (!idle_pct)
+			start_idle_inject();
+		else if (!sysctl_sched_cfs_idle_inject_pct) {
+			stop_idle_inject();
+			end_idle_inject();
+		}
+
+		/* recompute injection parameters */
+		spin_lock_irq(&idle_inject_lock);
+		idle_pct = sysctl_sched_cfs_idle_inject_pct;
+		/*
+		 * duration is fixed for each injection period, we adjust
+		 * non idle interval to satisfy the idle percentage set
+		 * by the user. e.g. if duration is 10 and we want 33% idle
+		 * then interval is 20.
+		 * 33% idle
+		 * ____          ___________________          _________
+		 *     |________|                   |________| 33% idle
+		 * ____          ________          _______
+		 *     |________|        |________|  50% idle
+		 *
+		 *     |duration|interval|
+		 */
+		if (idle_pct)
+			inject_interval = (duration * (100 - idle_pct))
+				/ idle_pct;
+
+		spin_unlock_irq(&idle_inject_lock);
+
+	}
+out:
+	return ret;
+}
+
+int proc_sched_cfs_idle_inject_duration_handler(struct ctl_table *table,
+						int write,
+						void __user *buffer,
+						size_t *length,	loff_t *ppos)
+{
+	int ret;
+
+	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (ret)
+		goto out;
+
+	if (duration == sysctl_sched_cfs_idle_inject_duration)
+		goto out;
+	/* recompute injection parameters */
+	spin_lock_irq(&idle_inject_lock);
+	duration = jiffies_to_msecs(sysctl_sched_cfs_idle_inject_duration);
+	if (idle_pct)
+		inject_interval = (duration * (100 - idle_pct)) / idle_pct;
+
+	spin_unlock_irq(&idle_inject_lock);
+out:
+	return ret;
+}
+
+#endif /* CONFIG_CFS_IDLE_INJECT */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6d2a119..0301d54 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -343,7 +343,7 @@ struct cfs_bandwidth { };
 struct cfs_rq {
 	struct load_weight load;
 	unsigned int nr_running, h_nr_running;
-
+	unsigned int runnable, forced_idle;
 	u64 exec_clock;
 	u64 min_vruntime;
 #ifndef CONFIG_64BIT
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e69201d..78c304b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -433,6 +433,26 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &one,
 	},
 #endif
+#ifdef CONFIG_CFS_IDLE_INJECT
+	{
+		.procname	= "sched_cfs_idle_inject_pct",
+		.data		= &sysctl_sched_cfs_idle_inject_pct,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_sched_cfs_idle_inject_pct_handler,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
+	{
+		.procname	= "sched_cfs_idle_inject_duration",
+		.data		= &sysctl_sched_cfs_idle_inject_duration,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_sched_cfs_idle_inject_duration_handler,
+		.extra1		= &four,
+		.extra2		= &one_hundred,
+	},
+#endif
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.procname	= "prove_locking",
-- 
1.9.1



* Re: [RFC PATCH v2 3/3] sched: introduce synchronized idle injection
  2015-11-10  0:21 ` [RFC PATCH v2 3/3] sched: introduce synchronized idle injection Jacob Pan
@ 2015-11-10 13:23   ` Peter Zijlstra
  2015-11-10 14:01     ` Jacob Pan
  2015-11-10 18:41     ` Jacob Pan
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2015-11-10 13:23 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada

On Mon, Nov 09, 2015 at 04:21:23PM -0800, Jacob Pan wrote:
> +++ b/include/trace/events/sched.h

> +/*
> + * Tracepoint for idle injection
> + */
> +TRACE_EVENT(sched_cfs_idle_inject,
> +
> +	TP_PROTO(char *msg, int throttled),
> +
> +	TP_ARGS(msg, throttled),
> +
> +	TP_STRUCT__entry(
> +		__string(msg, msg)
> +		__field(int, throttled)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(msg, msg);
> +		__entry->throttled = throttled;
> +	),
> +
> +	TP_printk("%s: throttled=%d", __get_str(msg), __entry->throttled)
> +);

So I hate tracepoints.. and I'd rather not see them. But at the very
least kill that @msg field and replace it with an enum or so.


> +/*
> + * Knobs for controlling the percentage of time when idle is forced across all
> + * CPUs. This is a power management feature intended for achieving the deepest
> + * and broadest idle without lowering CPU frequencies to a less optimal level.
> + * No action is taken if CPUs are naturally idle.
> + */
> +#ifdef CONFIG_CFS_IDLE_INJECT
> +unsigned int sysctl_sched_cfs_idle_inject_pct;
> +unsigned int sysctl_sched_cfs_idle_inject_duration = 10UL;

since you're playing the ifdef game, you might as well also do:

static inline void cfs_rq_nr_running_inc(struct cfs_rq *cfs_rq)
{
	if (!cfs_rq->nr_running++ && !cfs_rq->forced_idle)
		cfs_rq->runnable = true;
}

static inline bool cfs_rq_runnable(struct cfs_rq *cfs_rq)
{
	return cfs_rq->runnable;
}

#else

static inline void cfs_rq_nr_running_inc(struct cfs_rq *cfs_rq)
{
	cfs_rq->nr_running++;
}

static inline bool cfs_rq_runnable(struct cfs_rq *cfs_rq)
{
	return !!cfs_rq->nr_running;
}

> +#endif
> +
>  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
>  {
>  	lw->weight += inc;
> @@ -2334,7 +2346,9 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  		list_add(&se->group_node, &rq->cfs_tasks);
>  	}
>  #endif
> -	cfs_rq->nr_running++;
> +
> +	if (!cfs_rq->nr_running++ && !cfs_rq->forced_idle)
> +		cfs_rq->runnable = true;

which makes that:
	cfs_rq_nr_running_inc();

>  }
>  
>  static void
> @@ -2347,7 +2361,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
>  		list_del_init(&se->group_node);
>  	}
> -	cfs_rq->nr_running--;
> +
> +	if (!--cfs_rq->nr_running && !cfs_rq->forced_idle)
> +		cfs_rq->runnable = false;

	cfs_rq_nr_running_dec();

>  }
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> @@ -5139,7 +5155,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
>  
>  again:
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -	if (!cfs_rq->nr_running)
> +	if (!cfs_rq->runnable)

	if (!cfs_rq_runnable(cfs_rq))
>  		goto idle;
>  
>  	if (prev->sched_class != &fair_sched_class)

>  idle:
> +	if ((cfs_rq->forced_idle)) {
> +		if (unlikely(local_softirq_pending())) {
> +			trace_sched_cfs_idle_inject("softirq pending", 1);

> +			cfs_rq->forced_idle = false;
> +			cfs_rq->runnable = cfs_rq->nr_running;

maybe:
			__unthrottle_cfs_rq(cfs_rq); ?

> +			goto again;
> +		}
> +		trace_sched_cfs_idle_inject("forced idle", 1);
> +		return NULL;
> +	}
>  	/*
>  	 * This is OK, because current is on_cpu, which avoids it being picked
>  	 * for load-balance and preemption/IRQs are still disabled avoiding
> @@ -8318,3 +8344,350 @@ __init void init_sched_fair_class(void)
>  #endif /* SMP */
>  
>  }
> +
> +#ifdef CONFIG_CFS_IDLE_INJECT

> +static atomic_t idle_inject_active;

You only use atomic_{read,set} on this, therefore atomic_t is pointless.

> +static DEFINE_PER_CPU(struct hrtimer, idle_inject_timer);
> +static DEFINE_PER_CPU(bool, idle_injected);

I tend to prefer to not use bool as a storage class; it's ill defined.

> +/* protect injection parameters from runtime changes */
> +static DEFINE_SPINLOCK(idle_inject_lock);

A global lock, yay :-), I think you want this to be a RAW_SPINLOCK
though. As on -RT this would want to actually run from IRQ context too.

> +static enum hrtimer_restart idle_inject_timer_fn(struct hrtimer *hrtimer)
> +{
> +	struct hrtimer *hrt = this_cpu_ptr(&idle_inject_timer);
> +	int cpu = smp_processor_id();
> +	ktime_t now, delta, period;
> +	bool status;
> +
> +	now = hrtimer_cb_get_time(hrt);

You're not interested in the current time.

> +
> +	status = raw_cpu_read(idle_injected);
> +	if (status) {
> +		/*
> +		 * We were injecting idle in the last phase, let's forward the
> +		 * timer to the next period
> +		 *
> +		 * status: 1             0                1        0
> +		 * ____          ____________________           _______
> +		 *     |________|                    |_________|
> +		 *
> +		 *     |duration|      interval      |
> +		 *
> +		 *              ^ we are here
> +		 *                  forward to here: ^
> +		 */
> +		delta = ktime_sub(now, inject_start_time);
> +		period = ktime_add(ms_to_ktime(duration),
> +				ms_to_ktime(inject_interval));
> +		delta = ktime_roundup(delta, period);
> +		hrtimer_set_expires(hrt, ktime_add(delta, inject_start_time));

This doesn't make any sense. Who cares what the current time is.

> +	} else {
> +		/*
> +		 * We were not injecting idle in the last phase, let's forward
> +		 * timer after forced idle duration
> +		 * ____          ____________________           _______
> +		 *     |________|                    |_________|
> +		 *
> +		 *     |duration|      interval      |
> +		 *
> +		 *     ^ we are here
> +		 *              ^ forward timer to here
> +		 */
> +		hrtimer_set_expires(hrt, ktime_add(ms_to_ktime(duration), now));

Same here, we don't care about the current time. The timer was at the
previous start of injection, just forward it a whole period to find the
next injection slot.

> +	}

It looks like what you want is:

	hrtimer_forward(hrt, period);

unconditionally.

> +	raw_cpu_write(idle_injected, !status);
> +	trace_sched_cfs_idle_inject("idle sync timer", !status);
> +	if (status)
> +		unthrottle_rq(cpu);
> +	else
> +		throttle_rq(cpu);
> +
> +	return HRTIMER_RESTART;
> +}
> +
> +static void idle_inject_timer_start(void *info)
> +{
> +	int cpu = smp_processor_id();
> +	struct hrtimer *hrt = this_cpu_ptr(&idle_inject_timer);
> +
> +	this_cpu_write(idle_injected, true);
> +	set_bit(cpu, idle_inject_cpumask);
> +	hrtimer_start(hrt, ms_to_ktime(duration), HRTIMER_MODE_ABS_PINNED);
> +	hrtimer_set_expires(hrt, *(ktime_t *)info);

This is broken, _first_ set an expiration time, then start the timer.

Now you insert the timer into the RB tree on a previous expiration time,
then you modify the expiration time under it, effectively wrecking the
RB tree.

> +}

> +static void stop_idle_inject(void)
> +{
> +	int i;
> +	struct hrtimer *hrt;
> +
> +	if (bitmap_weight(idle_inject_cpumask, num_possible_cpus())) {

I don't get the point of this bitmap; with the cpu notifier you
basically ensure this is equal to online_mask.

Also, this weight test is pointless, if the bitmap is empty the
for_each_set_bit() should be of equal cost -- and afaict nothing calling
this is performance critical in the first place.

> +		for_each_set_bit(i, idle_inject_cpumask, num_possible_cpus()) {

> +			hrt = &per_cpu(idle_inject_timer, i);
> +			hrtimer_cancel(hrt);
> +			unthrottle_rq(i);
> +		}
> +	}
> +}
> +
> +static int idle_inject_cpu_callback(struct notifier_block *nfb,
> +				unsigned long action, void *hcpu)
> +{
> +	unsigned long cpu = (unsigned long)hcpu;
> +	struct hrtimer *hrt = &per_cpu(idle_inject_timer, cpu);
> +	ktime_t now, delta, period;
> +
> +	if (!atomic_read(&idle_inject_active))
> +		goto exit_ok;

We should never get here if that weren't set, right? I mean you
register/unregister these callbacks around setting that variable.

> +
> +	switch (action) {
> +	case CPU_STARTING:
> +		raw_cpu_write(idle_injected, true);
> +
> +		hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> +		hrt->function = idle_inject_timer_fn;
> +		set_bit(cpu, idle_inject_cpumask);
> +
> +		now = hrtimer_cb_get_time(hrt);
> +		hrtimer_start(hrt, ms_to_ktime(duration),
> +			HRTIMER_MODE_ABS_PINNED);
> +		/*
> +		 * When a new CPU comes online, we need to make sure it aligns
> +		 * its phase with the rest of the CPUs. So we set the
> +		 * timer to the next period based on the common starting time,
> +		 * then start injecting idle time.
> +		 */
> +		spin_lock_irq(&idle_inject_lock);
> +		delta = ktime_sub(now, inject_start_time);
> +		period = ktime_add(ms_to_ktime(duration),
> +				ms_to_ktime(inject_interval));
> +		delta = ktime_roundup(delta, period);
> +		spin_unlock_irq(&idle_inject_lock);
> +		hrtimer_set_expires(hrt, ktime_add(delta, inject_start_time));

Same broken, you cannot call that on a timer you've already started.

> +		break;
> +	case CPU_DYING:
> +		clear_bit(cpu, idle_inject_cpumask);
> +		hrtimer_cancel(hrt);
> +		raw_cpu_write(idle_injected, false);
> +		unthrottle_rq(cpu);
> +		break;
> +	default:
> +		return NOTIFY_DONE;
> +	}
> +exit_ok:
> +	return NOTIFY_OK;
> +}
> +
> +static int idle_inject_pm_callback(struct notifier_block *self,
> +				unsigned long action, void *hcpu)
> +{
> +	switch (action) {
> +	case PM_HIBERNATION_PREPARE:
> +	case PM_SUSPEND_PREPARE:
> +		if (atomic_read(&idle_inject_active))
> +			stop_idle_inject();

As with the above, if that were false, this whole callback would not be
called, seeing how you unregister before actually clearing that
idle_inject_active thing.

> +		break;
> +	case PM_POST_HIBERNATION:
> +	case PM_POST_SUSPEND:
> +		printk("POST SUSPEND restart idle injection\n");

Seems a tad inconsistent, printing here but not when stopping it.

> +		start_idle_inject();
> +		break;
> +	default:
> +		break;
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block idle_inject_pm_notifier = {
> +	.notifier_call = idle_inject_pm_callback,
> +};
> +
> +static struct notifier_block idle_inject_cpu_notifier = {
> +	.notifier_call = idle_inject_cpu_callback,
> +};
> +
> +static void end_idle_inject(void) {
> +	unregister_hotcpu_notifier(&idle_inject_cpu_notifier);
> +	unregister_pm_notifier(&idle_inject_pm_notifier);

As per the above, these callbacks will not happen hereafter, and will
thus never see:

> +	atomic_set(&idle_inject_active, 0);
> +	kfree(idle_inject_cpumask);
> +}
> +
> +static int prepare_idle_inject(void)
> +{
> +	int retval = 0;
> +	int bitmap_size;
> +	int cpu;
> +	struct hrtimer *hrt;
> +
> +	bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);

This is incorrect, you want nr_cpu_ids. There is no guarantee the CPU
space does not contain holes. But seeing as I still don't see the point of
the mask, this might all fix itself by killing it altogether.

> +	idle_inject_cpumask = kzalloc(bitmap_size, GFP_KERNEL);
> +	if (!idle_inject_cpumask)
> +		return -ENOMEM;
> +
> +	retval = register_pm_notifier(&idle_inject_pm_notifier);
> +	if (retval)
> +		goto exit_free;
> +	retval = register_hotcpu_notifier(&idle_inject_cpu_notifier);
> +	if (retval)
> +		goto exit_unregister_pm;
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		hrt = &per_cpu(idle_inject_timer, cpu);
> +		hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> +		hrt->function = idle_inject_timer_fn;
> +	}
> +	put_online_cpus();
> +
> +	if (!duration)
> +		duration = DEFAULT_DURATION_MSECS;
> +
> +	return 0;
> +exit_unregister_pm:
> +	unregister_pm_notifier(&idle_inject_pm_notifier);
> +exit_free:
> +	kfree(idle_inject_cpumask);
> +	return retval;
> +}
> +
> +int proc_sched_cfs_idle_inject_pct_handler(struct ctl_table *table,
> +					int write,
> +					void __user *buffer,
> +					size_t *length,	loff_t *ppos)
> +{
> +	int ret;
> +
> +	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> +	if (ret)
> +		goto out;
> +
> +	if (idle_pct != sysctl_sched_cfs_idle_inject_pct) {
> +		if (!idle_pct)
> +			start_idle_inject();
> +		else if (!sysctl_sched_cfs_idle_inject_pct) {
> +			stop_idle_inject();
> +			end_idle_inject();
> +		}
> +
> +		/* recompute injection parameters */
> +		spin_lock_irq(&idle_inject_lock);
> +		idle_pct = sysctl_sched_cfs_idle_inject_pct;
> +		/*
> +		 * duration is fixed for each injection period, we adjust
> +		 * non idle interval to satisfy the idle percentage set
> +		 * by the user. e.g. if duration is 10 and we want 33% idle
> +		 * then interval is 20.
> +		 * 33% idle
> +		 * ____          ___________________          _________
> +		 *     |________|                   |________| 33% idle
> +		 * ____          ________          _______
> +		 *     |________|        |________|  50% idle
> +		 *
> +		 *     |duration|interval|
> +		 */
> +		if (idle_pct)
> +			inject_interval = (duration * (100 - idle_pct))
> +				/ idle_pct;

This needs {} (or just exceed the 80 char thing).

> +		spin_unlock_irq(&idle_inject_lock);
> +
> +	}
> +out:
> +	return ret;
> +}
> +
> +int proc_sched_cfs_idle_inject_duration_handler(struct ctl_table *table,
> +						int write,
> +						void __user *buffer,
> +						size_t *length,	loff_t *ppos)
> +{
> +	int ret;
> +
> +	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> +	if (ret)
> +		goto out;
> +
> +	if (duration == sysctl_sched_cfs_idle_inject_duration)
> +		goto out;
> +	/* recompute injection parameters */
> +	spin_lock_irq(&idle_inject_lock);
> +	duration = jiffies_to_msecs(sysctl_sched_cfs_idle_inject_duration);
> +	if (idle_pct)
> +		inject_interval = (duration * (100 - idle_pct)) / idle_pct;
> +
> +	spin_unlock_irq(&idle_inject_lock);
> +out:
> +	return ret;
> +}

And since you have proc handlers for both of these, why not convert to
ktime here and avoid the endless ms_to_ktime() calls?

Also, maybe precompute the period, since that is what you really need.


* Re: [RFC PATCH v2 3/3] sched: introduce synchronized idle injection
  2015-11-10 13:23   ` Peter Zijlstra
@ 2015-11-10 14:01     ` Jacob Pan
  2015-11-10 14:58       ` Peter Zijlstra
  2015-11-10 18:41     ` Jacob Pan
  1 sibling, 1 reply; 13+ messages in thread
From: Jacob Pan @ 2015-11-10 14:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada, jacob.jun.pan

On Tue, 10 Nov 2015 14:23:24 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> > +static enum hrtimer_restart idle_inject_timer_fn(struct hrtimer *hrtimer)
> > +{
> > +	struct hrtimer *hrt = this_cpu_ptr(&idle_inject_timer);
> > +	int cpu = smp_processor_id();
> > +	ktime_t now, delta, period;
> > +	bool status;
> > +
> > +	now = hrtimer_cb_get_time(hrt);
> 
> You're not interested in the current time.
> 
> > +
> > +	status = raw_cpu_read(idle_injected);
> > +	if (status) {
> > +		/*
> > +		 * We were injecting idle in the last phase, let's forward the
> > +		 * timer to the next period
> > +		 *
> > +		 * status: 1             0                1        0
> > +		 * ____          ____________________           _______
> > +		 *     |________|                    |_________|
> > +		 *
> > +		 *     |duration|      interval      |
> > +		 *
> > +		 *              ^ we are here
> > +		 *                  forward to here: ^
> > +		 */
> > +		delta = ktime_sub(now, inject_start_time);
> > +		period = ktime_add(ms_to_ktime(duration),
> > +				ms_to_ktime(inject_interval));
> > +		delta = ktime_roundup(delta, period);
> > +		hrtimer_set_expires(hrt, ktime_add(delta, inject_start_time));
> 
> This doesn't make any sense. Who cares what the current time is.
> 
> > +	} else {
> > +		/*
> > +		 * We were not injecting idle in the last phase, let's forward
> > +		 * timer after forced idle duration
> > +		 * ____          ____________________           _______
> > +		 *     |________|                    |_________|
> > +		 *
> > +		 *     |duration|      interval      |
> > +		 *
> > +		 *     ^ we are here
> > +		 *              ^ forward timer to here
> > +		 */
> > +		hrtimer_set_expires(hrt, ktime_add(ms_to_ktime(duration), now));
> 
> Same here, we don't care about the current time. The timer was at the
> previous start of injection, just forward it a whole period to find
> the next injection slot.
> 
> > +	}  
> 
> It looks like what you want is:
> 
> 	hrtimer_forward(hrt, period);
> 
> unconditionally.
In the ideal world yes. But my thinking was that timers may not be so
accurate to deliver interrupts, over the time the timeout error may
accumulate so that eventually timers will be out of sync. That is why
at the beginning of the period I realign the timer based on a common
start time so that timers will never drift off. I use current time and
the common start time to locate the next absolute timeout not the
relative timeout.
Overall, my idea was to make it more robust and handles runtime
parameters (percentage, duration) smoothly.

I will fix the rest comments in the patch.

Thank you,

Jacob


* Re: [RFC PATCH v2 3/3] sched: introduce synchronized idle injection
  2015-11-10 14:01     ` Jacob Pan
@ 2015-11-10 14:58       ` Peter Zijlstra
  2015-11-10 16:28         ` Jacob Pan
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2015-11-10 14:58 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada

On Tue, Nov 10, 2015 at 06:01:16AM -0800, Jacob Pan wrote:
> On Tue, 10 Nov 2015 14:23:24 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:

> > It looks like what you want is:
> > 
> > 	hrtimer_forward(hrt, period);
> > 
> > unconditionally.

> In the ideal world yes. But my thinking was that timers may not be so
> accurate to deliver interrupts, over the time the timeout error may
> accumulate so that eventually timers will be out of sync.

Timers have a global time base. Even if individual deliveries have an
error, there is no accumulated error.



* Re: [RFC PATCH v2 3/3] sched: introduce synchronized idle injection
  2015-11-10 14:58       ` Peter Zijlstra
@ 2015-11-10 16:28         ` Jacob Pan
  2015-11-10 16:36           ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Jacob Pan @ 2015-11-10 16:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada, jacob.jun.pan

On Tue, 10 Nov 2015 15:58:23 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Nov 10, 2015 at 06:01:16AM -0800, Jacob Pan wrote:
> > On Tue, 10 Nov 2015 14:23:24 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > It looks like what you want is:
> > > 
> > > 	hrtimer_forward(hrt, period);
> > > 
> > > unconditionally.
> 
> > In the ideal world yes. But my thinking was that timers may not be
> > so accurate to deliver interrupts, over the time the timeout error
> > may accumulate so that eventually timers will be out of sync.
> 
> Timers have a global time base. Even if individual deliveries have an
> error, there is no accumulated error.
> 
great! I can get rid of the ktime_roundup(). It seems to work with
	now = hrtimer_cb_get_time(hrt);
	if (status)
		hrtimer_forward(hrt, now, ms_to_ktime(inject_interval));
	else
		hrtimer_forward(hrt, now, ms_to_ktime(duration));

The downside is that we need to restart the timers every time if
user were to change injection parameters, i.e. duration and percent.
Or do locking which might be too expensive. In the previous approach, it
will naturally catch up the parameter change.


* Re: [RFC PATCH v2 3/3] sched: introduce synchronized idle injection
  2015-11-10 16:28         ` Jacob Pan
@ 2015-11-10 16:36           ` Peter Zijlstra
  2015-11-10 16:50             ` Jacob Pan
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2015-11-10 16:36 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada

On Tue, Nov 10, 2015 at 08:28:59AM -0800, Jacob Pan wrote:
> On Tue, 10 Nov 2015 15:58:23 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, Nov 10, 2015 at 06:01:16AM -0800, Jacob Pan wrote:
> > > On Tue, 10 Nov 2015 14:23:24 +0100
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > > It looks like what you want is:
> > > > 
> > > > 	hrtimer_forward(hrt, period);
> > > > 
> > > > unconditionally.
> > 
> > > In the ideal world yes. But my thinking was that timers may not be
> > > so accurate to deliver interrupts, over the time the timeout error
> > > may accumulate so that eventually timers will be out of sync.
> > 
> > Timers have a global time base. Even if individual deliveries have an
> > error, there is no accumulated error.
> > 
> great! I can get rid of the ktime_roundup(). It seems to work with
> 	now = hrtimer_cb_get_time(hrt);
> 	if (status)
> 		hrtimer_forward(hrt, now, ms_to_ktime(inject_interval));
> 	else
> 		hrtimer_forward(hrt, now, ms_to_ktime(duration));

We have hrtimer_forward_now() for this ;-)

> The downside is that we need to restart the timers every time if
> user were to change injection parameters, i.e. duration and percent.
> Or do locking which might be too expensive. In the previous approach, it
> will naturally catch up the parameter change.

Why? the timer will fire and observe the new value for reprogramming the
next period. All you need to do is to ensure whole values are
written/read -- ie. avoid load/store tearing.
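
Something like this (untested sketch, all names made up) is what I mean:

	static u64 inject_period_ns;	/* duration + interval, in ns */

	/* the sysctl handler publishes the new period with a single store */
	static void set_inject_period(u64 ns)
	{
		WRITE_ONCE(inject_period_ns, ns);
	}

	static enum hrtimer_restart inject_timer_fn(struct hrtimer *hrt)
	{
		/* sample the period once per firing, no lock needed */
		u64 period = READ_ONCE(inject_period_ns);

		hrtimer_forward_now(hrt, ns_to_ktime(period));
		return HRTIMER_RESTART;
	}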


* Re: [RFC PATCH v2 3/3] sched: introduce synchronized idle injection
  2015-11-10 16:36           ` Peter Zijlstra
@ 2015-11-10 16:50             ` Jacob Pan
  2015-11-10 17:00               ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Jacob Pan @ 2015-11-10 16:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada, jacob.jun.pan

On Tue, 10 Nov 2015 17:36:46 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> > The downside is that we need to restart the timers every time if
> > user were to change injection parameters, i.e. duration and percent.
> > Or do locking which might be too expensive. In the previous
> > approach, it will naturally catch up the parameter change.  
> 
> Why? the timer will fire and observe the new value for reprogramming
> the next period. All you need to do is to ensure whole values are
> written/read -- ie. avoid load/store tearing.
Different per CPU timer may intercept parameter changes at slightly
different time, so there is a race condition such that some CPUs may
catch the period change later by one period, which results in a correct
period change but at a different time, i.e. out of sync.

Jacob


* Re: [RFC PATCH v2 3/3] sched: introduce synchronized idle injection
  2015-11-10 16:50             ` Jacob Pan
@ 2015-11-10 17:00               ` Peter Zijlstra
  2015-11-10 17:14                 ` Jacob Pan
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2015-11-10 17:00 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada

On Tue, Nov 10, 2015 at 08:50:55AM -0800, Jacob Pan wrote:
> On Tue, 10 Nov 2015 17:36:46 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > The downside is that we need to restart the timers every time if
> > > user were to change injection parameters, i.e. duration and percent.
> > > Or do locking which might be too expensive. In the previous
> > > approach, it will naturally catch up the parameter change.  
> > 
> > Why? the timer will fire and observe the new value for reprogramming
> > the next period. All you need to do is to ensure whole values are
> > written/read -- ie. avoid load/store tearing.
> Different per CPU timer may intercept parameter changes at slightly
> different time, so there is a race condition such that some CPUs may
> catch the period change later by one period, which results in a correct
> period change but at a different time, i.e. out of sync.

Ah yes. So if the locking hurts I can come up with a lockless algorithm
for this. Shouldn't be too hard.


* Re: [RFC PATCH v2 3/3] sched: introduce synchronized idle injection
  2015-11-10 17:00               ` Peter Zijlstra
@ 2015-11-10 17:14                 ` Jacob Pan
  0 siblings, 0 replies; 13+ messages in thread
From: Jacob Pan @ 2015-11-10 17:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada, jacob.jun.pan

On Tue, 10 Nov 2015 18:00:10 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> > Different per CPU timer may intercept parameter changes at slightly
> > different time, so there is a race condition such that some CPUs may
> > catch the period change later by one period, which results in a
> > correct period change but at a different time, i.e. out of sync.  
> 
> Ah yes. So if the locking hurts I can come up with a lockless
> algorithm for this. Shouldn't be too hard.
Great! The current patch was designed to be lockless, but it is a little
awkward. My idea was to have a common start time; then we don't need to worry
about CPUs being out of sync, at most they will be off by one period and then
catch up.
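
Untested sketch of what I mean, reusing the names from the patch (the helper
name here is made up):

	/*
	 * Every CPU derives its next expiry from the shared inject_start_time,
	 * so a CPU that picks up a parameter change one period late still lands
	 * on the same absolute boundary at its next firing.
	 */
	static ktime_t next_synced_expiry(struct hrtimer *hrt, ktime_t period)
	{
		ktime_t delta = ktime_sub(hrtimer_cb_get_time(hrt), inject_start_time);

		delta = ktime_roundup(delta, period);
		return ktime_add(inject_start_time, delta);
	}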

Thanks,

Jacob


* Re: [RFC PATCH v2 3/3] sched: introduce synchronized idle injection
  2015-11-10 13:23   ` Peter Zijlstra
  2015-11-10 14:01     ` Jacob Pan
@ 2015-11-10 18:41     ` Jacob Pan
  1 sibling, 0 replies; 13+ messages in thread
From: Jacob Pan @ 2015-11-10 18:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Rafael Wysocki, Len Brown, Andi Kleen, Thomas Gleixner,
	Paul Turner, Tim Chen, Dietmar Eggemann, Eduardo Valentin,
	Punit Agrawal, Srinivas Pandruvada, jacob.jun.pan

On Tue, 10 Nov 2015 14:23:24 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> > +/* protect injection parameters from runtime changes */
> > +static DEFINE_SPINLOCK(idle_inject_lock);  
> 
> A global lock, yay :-), I think you want this to be a RAW_SPINLOCK
> though. As on -RT this would want to actually run from IRQ context
> too.
> 
I am not using this lock in the per-CPU timer handler, so I guess it would
not be in IRQ context?
> > +	if (bitmap_weight(idle_inject_cpumask,
> > num_possible_cpus())) {  
> 
> I don't get the point of this bitmap; with the cpu notifier you
> basically ensure this is equal to online_mask.
> 
> Also, this weight test is pointless, if the bitmap is empty the
> for_each_set_bit() should be of equal cost -- and afaict nothing
> calling this is performance critical in the first place.
Agreed, I will remove that. I was planning on a subset of CPUs but not
for now, e.g. a socket or domain.

