* [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace
@ 2014-02-25 12:33 Henrik Austad
  2014-02-25 12:33 ` [PATCH 1/6] Expose do_timer CPU from tick-common Henrik Austad
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Henrik Austad @ 2014-02-25 12:33 UTC (permalink / raw)
  To: LKML, Thomas Gleixner
  Cc: Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, John Stultz, Paul E. McKenney

From: Henrik Austad <haustad@cisco.com>

Hi!

This is a rework of the previous patch based on the feedback gathered
from the last round. I've split it up a bit, mostly to make it easier to
single out the parts that require more attention (#4 comes to mind).

Being able to read (and possibly force a specific CPU to handle all
do_timer() updates) can be very handy when debugging a system and tuning
for performance. It is not always easy to route interrupts to a specific
core (or away from one, for that matter).

These patches therefore enable userspace to explicitly set which core
handles do_timer(), regardless of NO_HZ_FULL or NO_HZ_IDLE.


Tested on x86_64 (in a VM with 8 emulated cores). Sampling the current
timer CPU every 100 ms using

for i in $(seq 1 300); do
    cat /sys/kernel/timekeeping/current_cpu ;
    sleep 0.1;
done|sort -n|uniq -c

* CONFIG_NO_HZ_FULL:

Forced timer disabled
     38 -1
      6 0
     12 1
      9 2
      4 3
      5 4
    176 5
      4 6
     46 7

sysctl -w kernel.forced_timer_cpu=3
    300 3

* CONFIG_NO_HZ_FULL_ALL:

Forced timer disabled
    300 0
sysctl -w kernel.forced_timer_cpu=2
    300 2



* CONFIG_NO_HZ_IDLE

Forced timer disabled
     38 -1
     35 0
     20 1
     23 2
     69 3
     30 4
     39 5
     22 6
     24 7
sysctl -w kernel.forced_timer_cpu=4
    300 4

* CONFIG_HZ_PERIODIC

Forced timer disabled
    300 0

sysctl -w kernel.forced_timer_cpu=5
    300 5
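
For longer runs, the shell loop above could also be replaced by a small C
sampler. A minimal sketch (assuming the /sys/kernel/timekeeping/current_cpu
file from patch 2 is present):

/* Sample the current timer CPU every 100 ms, 300 times, and print a
 * count per observed value (-1 means "no timer CPU assigned"). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	int counts[65] = { 0 };	/* counts[0] holds the count for -1 */
	char buf[16];
	int i, cpu;

	for (i = 0; i < 300; i++) {
		FILE *f = fopen("/sys/kernel/timekeeping/current_cpu", "r");

		if (!f || !fgets(buf, sizeof(buf), f)) {
			perror("current_cpu");
			return 1;
		}
		fclose(f);
		cpu = atoi(buf);
		if (cpu >= -1 && cpu < 64)
			counts[cpu + 1]++;
		usleep(100 * 1000);	/* 100 ms, as in the shell loop */
	}
	for (i = 0; i < 65; i++)
		if (counts[i])
			printf("%7d %d\n", counts[i], i - 1);
	return 0;
}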


I also took the liberty of adding paulmck, as RCU tests for
tick_do_timer_cpu in tree_plugin.h. I don't expect RCU to blow up, but
better safe than sorry and all.

The series includes:
Patch 1:
  Make tick_do_timer_cpu readable from other parts of the kernel

Patch 2:
  Based on #1, add a simple sysfs interface for reading it from
  userspace. This also lays the foundation for #6.

Patch 3:
  Add sysctl-hook for exposing tick_do_timer_cpu to userspace.

Patch 4:
  Add support for forcing a specific core to handle all do_timer()
  updates. This is probably the patch that requires the most careful
  review.

Patch 5:
  Allow for setting/getting forced_cpu via sysctl

Patch 6:
  Allow for setting/getting forced_cpu via sysfs

Series is based on Linus' current upstream head 7472e0.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: John Stultz <john.stultz@linaro.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


Henrik Austad (6):
  Expose do_timer CPU from tick-common
  Add sysfs RO interface to tick_do_timer_cpu
  Expose do_timer CPU as RO variable to userspace via sysctl
  Force a specific CPU to handle all do_timer() events.
  Expose the forced_timer_cpu to userspace via sysctl as R/W
  Expose tick_set_forced_cpu() via sysfs

 include/linux/clocksource.h |   19 +++++++
 kernel/sysctl.c             |   15 ++++++
 kernel/time/tick-common.c   |  115 +++++++++++++++++++++++++++++++++++++++++++
 kernel/time/tick-internal.h |    5 ++
 kernel/time/tick-sched.c    |   14 +++++-
 kernel/time/timekeeping.c   |  110 +++++++++++++++++++++++++++++++++++++++++
 6 files changed, 277 insertions(+), 1 deletion(-)

-- 
1.7.9.5



* [PATCH 1/6] Expose do_timer CPU from tick-common
  2014-02-25 12:33 [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Henrik Austad
@ 2014-02-25 12:33 ` Henrik Austad
  2014-02-25 12:33 ` [PATCH 2/6] Add sysfs RO interface to tick_do_timer_cpu Henrik Austad
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Henrik Austad @ 2014-02-25 12:33 UTC (permalink / raw)
  To: LKML, Thomas Gleixner
  Cc: Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, John Stultz, Henrik Austad

This allows other parts of the kernel access to the tick_do_timer_cpu
variable in a controlled manner. The value will differ depending on
which mode is compiled in:

* CONFIG_HZ_PERIODIC: One CPU (normally the boot CPU) is responsible for
  the time update. This will never change unless that CPU is removed
  (hotplugging). You will normally see 0 returned every time here.

* CONFIG_NO_HZ_IDLE: the system will disable periodic timer interrupts on
  a CPU if only the idle task is running on it. This means that the timer
  CPU will change often.

* CONFIG_NO_HZ_FULL: any CPU running only a single task can have timer
  interrupts disabled. As with NO_HZ_IDLE, the timer CPU will change
  often.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: John Stultz <john.stultz@linaro.org>
Signed-off-by: Henrik Austad <haustad@cisco.com>
---
 kernel/time/tick-common.c   |   12 ++++++++++++
 kernel/time/tick-internal.h |    1 +
 2 files changed, 13 insertions(+)

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 20b2fe3..1729b4b 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -348,6 +348,18 @@ void tick_handover_do_timer(int *cpup)
 }
 
 /*
+ * Return the current timer-CPU (RO)
+ *
+ * Warning! This value can (and will) change depending on how timer ticks are
+ * handled; the value returned could be outdated the moment it is read by
+ * userspace.
+ */
+int tick_expose_cpu(void)
+{
+	return tick_do_timer_cpu;
+}
+
+/*
  * Shutdown an event device on a given cpu:
  *
  * This is called on a life CPU, when a CPU is dead. So we cannot
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 8329669..5051dbd 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -154,5 +154,6 @@ static inline int tick_device_is_functional(struct clock_event_device *dev)
 
 #endif
 
+extern int tick_expose_cpu(void);
 extern void do_timer(unsigned long ticks);
 extern void update_wall_time(void);
-- 
1.7.9.5



* [PATCH 2/6] Add sysfs RO interface to tick_do_timer_cpu
  2014-02-25 12:33 [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Henrik Austad
  2014-02-25 12:33 ` [PATCH 1/6] Expose do_timer CPU from tick-common Henrik Austad
@ 2014-02-25 12:33 ` Henrik Austad
  2014-02-25 12:33 ` [PATCH 3/6] Expose do_timer CPU as RO variable to userspace via sysctl Henrik Austad
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Henrik Austad @ 2014-02-25 12:33 UTC (permalink / raw)
  To: LKML, Thomas Gleixner
  Cc: Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, John Stultz, Henrik Austad

Most of this is 'extras' needed to get the sysfs interface working.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: John Stultz <john.stultz@linaro.org>

Signed-off-by: Henrik Austad <haustad@cisco.com>
---
 kernel/time/timekeeping.c |   43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 0aa4ce8..f7c6b1f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -22,11 +22,54 @@
 #include <linux/tick.h>
 #include <linux/stop_machine.h>
 #include <linux/pvclock_gtod.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
 
 #include "tick-internal.h"
 #include "ntp_internal.h"
 #include "timekeeping_internal.h"
 
+/*
+ * sysfs interface to timer-cpu
+ */
+static ssize_t current_cpu_show(struct kobject *kobj,
+				struct kobj_attribute *attr,
+				char *buf)
+{
+	return sprintf(buf, "%d\n", tick_expose_cpu());
+}
+
+static struct kobj_attribute current_cpu_attribute =
+	__ATTR_RO(current_cpu);
+
+static struct attribute *timekeeping_attrs[] = {
+	&current_cpu_attribute.attr,
+	NULL,
+};
+static struct attribute_group timekeeping_ag = {
+	.attrs = timekeeping_attrs,
+};
+static struct kobject *timekeeping_kobj;
+
+static __init int timekeeping_sysfs_init(void)
+{
+	int ret = 0;
+
+	timekeeping_kobj = kobject_create_and_add("timekeeping", kernel_kobj);
+	if (!timekeeping_kobj)
+		return -ENOMEM;
+
+	ret = sysfs_create_group(timekeeping_kobj, &timekeeping_ag);
+	if (ret) {
+		pr_err("timekeeping: could not create attribute-group %d\n", ret);
+		kobject_put(timekeeping_kobj);
+	}
+	return ret;
+}
+
+/* need to make sure that kobject and sysfs are initialized before running this */
+late_initcall(timekeeping_sysfs_init);
+
 #define TK_CLEAR_NTP		(1 << 0)
 #define TK_MIRROR		(1 << 1)
 #define TK_CLOCK_WAS_SET	(1 << 2)
-- 
1.7.9.5



* [PATCH 3/6] Expose do_timer CPU as RO variable to userspace via sysctl
  2014-02-25 12:33 [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Henrik Austad
  2014-02-25 12:33 ` [PATCH 1/6] Expose do_timer CPU from tick-common Henrik Austad
  2014-02-25 12:33 ` [PATCH 2/6] Add sysfs RO interface to tick_do_timer_cpu Henrik Austad
@ 2014-02-25 12:33 ` Henrik Austad
  2014-02-25 12:33 ` [PATCH 4/6] Force a specific CPU to handle all do_timer() events Henrik Austad
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Henrik Austad @ 2014-02-25 12:33 UTC (permalink / raw)
  To: LKML, Thomas Gleixner
  Cc: Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, John Stultz

From: Henrik Austad <haustad@cisco.com>

This allows userspace to see which CPU is currently responsible for
handling the do_timer update of the time machinery.

     sysctl kernel.current_timer_cpu
     /proc/sys/kernel/current_timer_cpu

Note that this value can be fleeting if CONFIG_NO_HZ_FULL is enabled. If
not read, no additional overhead is generated in the system.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: John Stultz <john.stultz@linaro.org>
Signed-off-by: Henrik Austad <haustad@cisco.com>
---
 include/linux/clocksource.h |   11 +++++++++++
 kernel/sysctl.c             |    8 ++++++++
 kernel/time/timekeeping.c   |   21 +++++++++++++++++++++
 3 files changed, 40 insertions(+)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 67301a4..cfd39e8 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -8,6 +8,7 @@
 #ifndef _LINUX_CLOCKSOURCE_H
 #define _LINUX_CLOCKSOURCE_H
 
+#include <linux/sysctl.h>
 #include <linux/types.h>
 #include <linux/timex.h>
 #include <linux/time.h>
@@ -358,4 +359,14 @@ static inline void clocksource_of_init(void) {}
 		     .data = (fn == (clocksource_of_init_fn)NULL) ? fn : fn }
 #endif
 
+/*
+ * expose the CPU that handles the timer-tick.
+ */
+extern int expose_tick_do_timer_cpu;
+extern int timekeeping_expose_timer_cpu(struct ctl_table *table,
+					int write,
+					void __user *buffer,
+					size_t *lenp,
+					loff_t *ppos);
+
 #endif /* _LINUX_CLOCKSOURCE_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 49e13e1..a882c9e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -63,6 +63,7 @@
 #include <linux/binfmts.h>
 #include <linux/sched/sysctl.h>
 #include <linux/kexec.h>
+#include <linux/clocksource.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -424,6 +425,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rr_handler,
 	},
+	{
+		.procname  = "current_timer_cpu",
+		.data      = &expose_tick_do_timer_cpu,
+		.maxlen	   = sizeof(int),
+		.mode	   = 0444,
+		.proc_handler	= timekeeping_expose_timer_cpu,
+	},
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
 		.procname	= "sched_autogroup_enabled",
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index f7c6b1f..55428f9 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -30,6 +30,13 @@
 #include "timekeeping_internal.h"
 
 /*
+ * Hold the current value of tick_do_timer_cpu. We could expose this directly,
+ * but keeping a separate copy prevents a casual change of the mode in
+ * sysctl.c from suddenly changing the timer-cpu.
+ */
+int expose_tick_do_timer_cpu;
+
+/*
  * sysfs interface to timer-cpu
  */
 static ssize_t current_cpu_show(struct kobject *kobj,
@@ -1782,3 +1789,17 @@ void xtime_update(unsigned long ticks)
 	write_sequnlock(&jiffies_lock);
 	update_wall_time();
 }
+
+/*
+ * sysctl interface for exposing timer tick CPU
+ */
+int timekeeping_expose_timer_cpu(struct ctl_table *table,
+				 int write,
+				 void __user *buffer,
+				 size_t *lenp,
+				 loff_t *ppos)
+{
+	/* proc_dointvec will update the buffer written to userspace. */
+	expose_tick_do_timer_cpu = tick_expose_cpu();
+	return proc_dointvec(table, write, buffer, lenp, ppos);
+}
-- 
1.7.9.5



* [PATCH 4/6] Force a specific CPU to handle all do_timer() events.
  2014-02-25 12:33 [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Henrik Austad
                   ` (2 preceding siblings ...)
  2014-02-25 12:33 ` [PATCH 3/6] Expose do_timer CPU as RO variable to userspace via sysctl Henrik Austad
@ 2014-02-25 12:33 ` Henrik Austad
  2014-02-25 12:34 ` [PATCH 5/6] Expose the forced_timer_cpu to userspace via sysctl as R/W Henrik Austad
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Henrik Austad @ 2014-02-25 12:33 UTC (permalink / raw)
  To: LKML, Thomas Gleixner
  Cc: Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, John Stultz, Paul E. McKenney

From: Henrik Austad <haustad@cisco.com>

By default, the forced timer CPU is disabled (set to -1) and must be
explicitly set in order to have an effect. The CPU must be online.

If the specified CPU is part of the NO_HZ_FULL set, it is removed from
the set. When the forced tick is either disabled or moved to another CPU,
it will try to re-enable NO_HZ_FULL on the previous CPU.

If the CPU is removed (CPU hotplug), the forced tick will be disabled.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: John Stultz <john.stultz@linaro.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Henrik Austad <haustad@cisco.com>
---
 kernel/time/tick-common.c   |  103 +++++++++++++++++++++++++++++++++++++++++++
 kernel/time/tick-internal.h |    4 ++
 kernel/time/tick-sched.c    |   14 +++++-
 3 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 1729b4b..d4e660a 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -51,6 +51,19 @@ ktime_t tick_period;
 int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT;
 
 /*
+ * tick_do_timer_cpu_forced is the variable responsible for controlling the
+ * forced-timer logic. It has 2 modes:
+ *
+ * - 'Off': -1. Default mode; timer ticks are processed as normal.
+ *
+ * - 'On' : Stores the CPUid of the CPU currently assigned to handle all
+ *          do_timer() events. A CPU given this responsibility will not be
+ *          allowed to enter NO_HZ_IDLE or NO_HZ_FULL mode.
+ */
+int tick_do_timer_cpu_forced __read_mostly = -1;
+
+
+/*
  * Debugging: see timer_list.c
  */
 struct tick_device *tick_get_device(int cpu)
@@ -342,6 +355,12 @@ void tick_handover_do_timer(int *cpup)
 	if (*cpup == tick_do_timer_cpu) {
 		int cpu = cpumask_first(cpu_online_mask);
 
+		/* forced tick was on this CPU; disable forcing before handover */
+		if (tick_do_timer_cpu_forced == tick_do_timer_cpu) {
+			tick_set_forced_cpu(-1);
+			pr_info("Disabled forced timer-tick due to dying CPU\n");
+		}
+
 		tick_do_timer_cpu = (cpu < nr_cpu_ids) ? cpu :
 			TICK_DO_TIMER_NONE;
 	}
@@ -360,6 +379,90 @@ int tick_expose_cpu(void)
 }
 
 /*
+ * Set a forced value for tick_do_timer_cpu
+ *
+ * When not set, forced_tick_do_timer_cpu is -1, otherwise it is
+ * identical to tick_do_timer_cpu.
+ *
+ * When the core is set, this core is not able to drop into NOHZ idle
+ * and NOHZ full mode.
+ */
+int tick_set_forced_cpu(int new_cpu)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	static int nohz_cpu_pre_forced = -1;
+#endif
+	int ret = 0;
+
+	static DEFINE_MUTEX(forced_cpu_mutex); /* must be shared across calls */
+
+	mutex_lock(&forced_cpu_mutex);
+	if (new_cpu == tick_do_timer_cpu_forced)
+		goto out;
+
+	if (new_cpu == -1) {
+#ifdef CONFIG_NO_HZ_FULL
+		if (nohz_cpu_pre_forced != -1) {
+			cpumask_set_cpu(nohz_cpu_pre_forced, tick_nohz_full_mask);
+			nohz_cpu_pre_forced = -1;
+		}
+#endif
+
+		/* let the timer-machinery pick up the fallout */
+		tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+		tick_do_timer_cpu_forced = -1;
+		goto out;
+	}
+
+	/*
+	 * Make sure new_cpu is actually a valid CPU.
+	 */
+	if (!cpu_online(new_cpu)) {
+		ret = -EINVAL;
+		pr_warn("tick_set_forced_cpu(): Suggested CPU not available\n");
+		goto out;
+	}
+
+#ifdef CONFIG_NO_HZ_FULL
+	/*
+	 * NO_HZ_FULL requires some extra care
+	 */
+	if (nohz_cpu_pre_forced != -1) {
+		cpumask_set_cpu(nohz_cpu_pre_forced, tick_nohz_full_mask);
+		nohz_cpu_pre_forced = -1;
+	}
+	if (tick_nohz_full_cpu(new_cpu)) {
+
+		nohz_cpu_pre_forced = new_cpu;
+		cpumask_clear_cpu(new_cpu, tick_nohz_full_mask);
+	}
+#endif
+	tick_do_timer_cpu_forced = new_cpu;
+	tick_do_timer_cpu = new_cpu;
+
+out:
+	mutex_unlock(&forced_cpu_mutex);
+	return ret;
+}
+int tick_get_forced_cpu(void)
+{
+	return tick_do_timer_cpu_forced;
+}
+
+bool forced_timer_can_stop_tick(void)
+{
+	/*
+	 * This can be racy if we get an update of tick_do_timer_cpu_forced
+	 * right in the middle of this call.
+	 *
+	 * Currently only called from tick-sched.c: can_stop_full_tick() and can_stop_idle_tick().
+	 *
+	 * If disabled, tick_do_timer_cpu_forced is -1, which is never a valid CPUid.
+	 */
+	return tick_do_timer_cpu_forced != raw_smp_processor_id();
+}
+
+/*
  * Shutdown an event device on a given cpu:
  *
  * This is called on a life CPU, when a CPU is dead. So we cannot
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 5051dbd..9eed487 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -155,5 +155,9 @@ static inline int tick_device_is_functional(struct clock_event_device *dev)
 #endif
 
 extern int tick_expose_cpu(void);
+int tick_set_forced_cpu(int new_cpu);
+int tick_get_forced_cpu(void);
+bool forced_timer_can_stop_tick(void);
+
 extern void do_timer(unsigned long ticks);
 extern void update_wall_time(void);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f8af69..95e6d7d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -172,6 +172,11 @@ static bool can_stop_full_tick(void)
 		return false;
 	}
 
+	if (!forced_timer_can_stop_tick()) {
+		trace_tick_stop(0, "forced timer-tick running on CPU");
+		return false;
+	}
+
 	/* sched_clock_tick() needs us? */
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 	/*
@@ -704,11 +709,18 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 	 * invoked.
 	 */
 	if (unlikely(!cpu_online(cpu))) {
-		if (cpu == tick_do_timer_cpu)
+		if (cpu == tick_do_timer_cpu) {
 			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+		}
 		return false;
 	}
 
+	/*
+	 * if forced do_timer is set, we cannot stop the tick on this CPU.
+	 */
+	if (unlikely(!forced_timer_can_stop_tick()))
+		return false;
+
 	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) {
 		ts->sleep_length = (ktime_t) { .tv64 = NSEC_PER_SEC/HZ };
 		return false;
-- 
1.7.9.5



* [PATCH 5/6] Expose the forced_timer_cpu to userspace via sysctl as R/W
  2014-02-25 12:33 [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Henrik Austad
                   ` (3 preceding siblings ...)
  2014-02-25 12:33 ` [PATCH 4/6] Force a specific CPU to handle all do_timer() events Henrik Austad
@ 2014-02-25 12:34 ` Henrik Austad
  2014-02-25 12:34 ` [PATCH 6/6] Expose tick_set_forced_cpu() via sysfs Henrik Austad
  2014-02-25 14:19 ` [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Frederic Weisbecker
  6 siblings, 0 replies; 15+ messages in thread
From: Henrik Austad @ 2014-02-25 12:34 UTC (permalink / raw)
  To: LKML, Thomas Gleixner
  Cc: Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, John Stultz, Paul E. McKenney

From: Henrik Austad <haustad@cisco.com>

This allows userspace to set a specific CPU as the only core able
to handle do_timer() updates. For reading:

     sysctl kernel.forced_timer_cpu
     cat /proc/sys/kernel/forced_timer_cpu

and for writing:

     sysctl -w kernel.forced_timer_cpu=2
     /bin/echo 2 > /proc/sys/kernel/forced_timer_cpu
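
The same can be done from C; a rough sketch against the proc path above
(note that the write fails with -EINVAL if the requested CPU is offline):

/* Sketch: force the timekeeper to CPU 2 via procfs and read it back.
 * Assumes the forced_timer_cpu sysctl added by this patch. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/forced_timer_cpu", "w");
	int cur;

	if (!f || fprintf(f, "2\n") < 0 || fclose(f) == EOF) {
		perror("forced_timer_cpu");	/* e.g. CPU 2 offline */
		return 1;
	}
	f = fopen("/proc/sys/kernel/forced_timer_cpu", "r");
	if (!f || fscanf(f, "%d", &cur) != 1) {
		perror("forced_timer_cpu");
		return 1;
	}
	fclose(f);
	printf("forced_timer_cpu = %d\n", cur);
	return 0;
}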

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: John Stultz <john.stultz@linaro.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Henrik Austad <haustad@cisco.com>
---
 include/linux/clocksource.h |    8 ++++++++
 kernel/sysctl.c             |    7 +++++++
 kernel/time/timekeeping.c   |   24 +++++++++++++++++++++++-
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index cfd39e8..4ef47e7 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -361,6 +361,8 @@ static inline void clocksource_of_init(void) {}
 
 /*
  * expose the CPU that handles the timer-tick.
+ *
+ * Both the current as well as the forced value.
  */
 extern int expose_tick_do_timer_cpu;
 extern int timekeeping_expose_timer_cpu(struct ctl_table *table,
@@ -368,5 +370,11 @@ extern int timekeeping_expose_timer_cpu(struct ctl_table *table,
 					void __user *buffer,
 					size_t *lenp,
 					loff_t *ppos);
+extern int expose_tick_forced_timer_cpu;
+int timekeeping_expose_forced_timer_cpu(struct ctl_table *table,
+					int write,
+					void __user *buffer,
+					size_t *lenp,
+					loff_t *ppos);
 
 #endif /* _LINUX_CLOCKSOURCE_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a882c9e..c84c17a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -432,6 +432,13 @@ static struct ctl_table kern_table[] = {
 		.mode	   = 0444,
 		.proc_handler	= timekeeping_expose_timer_cpu,
 	},
+	{
+		.procname  = "forced_timer_cpu",
+		.data      = &expose_tick_forced_timer_cpu,
+		.maxlen	   = sizeof(int),
+		.mode	   = 0644,
+		.proc_handler	= timekeeping_expose_forced_timer_cpu,
+	},
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
 		.procname	= "sched_autogroup_enabled",
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 55428f9..4bdfa11 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -37,6 +37,11 @@
 int expose_tick_do_timer_cpu;
 
 /*
+ * Hold the current value of the forced CPU.
+ */
+int expose_tick_forced_timer_cpu = -1;
+
+/*
  * sysfs interface to timer-cpu
  */
 static ssize_t current_cpu_show(struct kobject *kobj,
@@ -1791,7 +1796,7 @@ void xtime_update(unsigned long ticks)
 }
 
 /*
- * sysctl interface for exposing timer tick CPU
+ * sysctl-interface exposing timer tick CPU
  */
 int timekeeping_expose_timer_cpu(struct ctl_table *table,
 				 int write,
@@ -1803,3 +1808,20 @@ int timekeeping_expose_timer_cpu(struct ctl_table *table,
 	expose_tick_do_timer_cpu = tick_expose_cpu();
 	return proc_dointvec(table, write, buffer, lenp, ppos);
 }
+
+int timekeeping_expose_forced_timer_cpu(struct ctl_table *table,
+					int write,
+					void __user *buffer,
+					size_t *lenp,
+					loff_t *ppos)
+{
+	int ret = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (ret || !write)
+		goto out;
+
+	ret = tick_set_forced_cpu(expose_tick_forced_timer_cpu);
+	WARN_ON(!ret && tick_expose_cpu() != expose_tick_forced_timer_cpu);
+out:
+	expose_tick_forced_timer_cpu = tick_get_forced_cpu();
+	return ret;
+}
-- 
1.7.9.5



* [PATCH 6/6] Expose tick_set_forced_cpu() via sysfs
  2014-02-25 12:33 [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Henrik Austad
                   ` (4 preceding siblings ...)
  2014-02-25 12:34 ` [PATCH 5/6] Expose the forced_timer_cpu to userspace via sysctl as R/W Henrik Austad
@ 2014-02-25 12:34 ` Henrik Austad
  2014-02-25 14:19 ` [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Frederic Weisbecker
  6 siblings, 0 replies; 15+ messages in thread
From: Henrik Austad @ 2014-02-25 12:34 UTC (permalink / raw)
  To: LKML, Thomas Gleixner
  Cc: Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, John Stultz, Paul E. McKenney

From: Henrik Austad <haustad@cisco.com>

This allows userspace to set the forced CPU via sysfs in
/sys/kernel/timekeeping/forced_cpu.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: John Stultz <john.stultz@linaro.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Henrik Austad <haustad@cisco.com>
---
 kernel/time/timekeeping.c |   36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4bdfa11..b148062 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -44,23 +44,47 @@ int expose_tick_forced_timer_cpu = -1;
 /*
  * sysfs interface to timer-cpu
  */
-static ssize_t current_cpu_show(struct kobject *kobj,
-				struct kobj_attribute *attr,
-				char *buf)
-{
-	return sprintf(buf, "%d\n", tick_expose_cpu());
+static ssize_t cpu_show(struct kobject *kobj,
+			struct kobj_attribute *attr,
+			char *buf)
+{
+	int var = -EINVAL;
+	if (strcmp(attr->attr.name, "current_cpu") == 0)
+		var = sprintf(buf, "%d\n", tick_expose_cpu());
+	else if (strcmp(attr->attr.name, "forced_cpu") == 0)
+		var = sprintf(buf, "%d\n", tick_get_forced_cpu());
+	return var;
+}
+
+static ssize_t cpu_store(struct kobject *kobj,
+			struct kobj_attribute *attr,
+			const char *buf,
+			size_t count)
+{
+	int var = 0;
+	int ret;
+
+	if (strcmp(attr->attr.name, "forced_cpu") == 0) {
+		if (sscanf(buf, "%d", &var) != 1)
+			return -EINVAL;
+		ret = tick_set_forced_cpu(var);
+		if (ret) {
+			pr_err("trouble setting forced CPU (%d)\n", ret);
+			return ret;
+		}
+	}
+	return count;
 }
 
 static struct kobj_attribute current_cpu_attribute =
-	__ATTR_RO(current_cpu);
+	__ATTR(current_cpu, 0444, cpu_show, NULL);
+static struct kobj_attribute forced_cpu_attribute =
+	__ATTR(forced_cpu, 0644, cpu_show, cpu_store);
 
 static struct attribute *timekeeping_attrs[] = {
 	&current_cpu_attribute.attr,
+	&forced_cpu_attribute.attr,
 	NULL,
 };
 static struct attribute_group timekeeping_ag = {
 	.attrs = timekeeping_attrs,
 };
+
 static struct kobject *timekeeping_kobj;
 
 static __init int timekeeping_sysfs_init(void)
-- 
1.7.9.5



* Re: [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace
  2014-02-25 12:33 [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Henrik Austad
                   ` (5 preceding siblings ...)
  2014-02-25 12:34 ` [PATCH 6/6] Expose tick_set_forced_cpu() via sysfs Henrik Austad
@ 2014-02-25 14:19 ` Frederic Weisbecker
  2014-02-26  8:16   ` Henrik Austad
  6 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2014-02-25 14:19 UTC (permalink / raw)
  To: Henrik Austad
  Cc: LKML, Thomas Gleixner, Henrik Austad, Thomas Gleixner,
	Peter Zijlstra, John Stultz, Paul E. McKenney

On Tue, Feb 25, 2014 at 01:33:55PM +0100, Henrik Austad wrote:
> From: Henrik Austad <haustad@cisco.com>
> 
> Hi!
> 
> This is a rework of the previous patch based on the feedback gathered
> from the last round. I've split it up a bit, mostly to make it easier to
> single out the parts that require more attention (#4 comes to mind).
> 
> Being able to read (and possibly force a specific CPU to handle all
> do_timer() updates) can be very handy when debugging a system and tuning
> for performance. It is not always easy to route interrupts to a specific
> core (or away from one, for that matter).

It's a bit vague as a reason for the patchset. Do we really need it?

Concerning the read-only part, if I want to know which CPU is handling the
timekeeping, I'd rather use tracing than a sysfs file. I can correlate
timekeeping update traces with other events. Especially as the timekeeping duty
can change hands and move to any CPU all the time. We really don't want to
poll on a sysfs file to get that information. It's not adapted and doesn't
carry any timestamp. It may be useful only if the timekeeping CPU is static.

Now looking at the write part. What kind of usecase do you have in mind?

It's also important to consider that, in the case of NO_HZ_IDLE, if you force
the timekeeping duty to a specific CPU, it won't be able to enter in dynticks
idle mode as long as any other CPU is running. Because those CPUs can make use of
jiffies or gettimeofday() and must have up-to-date values. This involves quite
some complication like using the full system idle detection (CONFIG_NO_HZ_FULL_SYSIDLE)
to avoid races between timekeeper entering dynticks idle mode and other CPUs waking
up from idle. But the worst here is the powersaving issues resulting from the timekeeper
who can't sleep.

These issues are being dealt with in NO_HZ_FULL because we want the timekeeping duty
to be affine to the CPUs that are not full dynticks. But in the case of NO_HZ_IDLE,
I fear it's not going to be desirable.

Thanks.


* Re: [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace
  2014-02-25 14:19 ` [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Frederic Weisbecker
@ 2014-02-26  8:16   ` Henrik Austad
  2014-02-26 13:02     ` Frederic Weisbecker
  0 siblings, 1 reply; 15+ messages in thread
From: Henrik Austad @ 2014-02-26  8:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Paul E. McKenney

On Tue, Feb 25, 2014 at 03:19:09PM +0100, Frederic Weisbecker wrote:
> On Tue, Feb 25, 2014 at 01:33:55PM +0100, Henrik Austad wrote:
> > From: Henrik Austad <haustad@cisco.com>
> > 
> > Hi!
> > 
> > This is a rework of the previous patch based on the feedback gathered
> > from the last round. I've split it up a bit, mostly to make it easier to
> > single out the parts that require more attention (#4 comes to mind).
> > 
> > Being able to read (and possibly force a specific CPU to handle all
> > do_timer() updates) can be very handy when debugging a system and tuning
> > for performance. It is not always easy to route interrupts to a specific
> > core (or away from one, for that matter).
> 
> It's a bit vague as a reason for the patchset. Do we really need it?

One case is to move the timekeeping away from cores I know have 
interrupt-issues (in an embedded setup, it is not always easy to move 
interrupts away).

Another is to remove jitter from cores doing either real-time work or heavy 
worker threads. The timekeeping update is pretty fast, but I do not see any 
reason for letting timekeeping interfere with my workers if it does not 
have to.

> Concerning the read-only part, if I want to know which CPU is handling the
> timekeeping, I'd rather use tracing than a sysfs file. I can correlate
> timekeeping update traces with other events. Especially as the timekeeping duty
> can change hands and move to any CPU all the time. We really don't want to
> poll on a sysfs file to get that information. It's not adapted and doesn't
> carry any timestamp. It may be useful only if the timekeeping CPU is static.

I agree that not having a timestamp will make it useless wrt tracing, 
but that was never the intention. By having a sysfs/sysctl value you can 
quickly determine if the timekeeping is bound to a single core or if it is 
handled everywhere.

Tracing will give you the most accurate result, but that's not always what 
you want as tracing also provides an overhead (both in the kernel as well 
as in the head of the user); using the sysfs/sysctl interface for grabbing 
the CPU does not.

You can also use it to verify that the forced-cpu you just set did in fact 
have the desired effect.

Another approach I was contemplating was to let current_cpu return the 
current mask of CPUs where the timer is running; once you set it via 
forced_cpu, it will narrow down to that particular core. Would that be more 
useful for the RO approach outside TICK_PERIODIC?

> Now looking at the write part. What kind of usecase do you have in mind?

Forcing the timer to run on single core only, and a core of my choosing at 
that.

- Get timekeeping away from cores with bad interrupts (no, I cannot move 
  them).
- Avoid running timekeeping updates on worker-cores.

> It's also important to consider that, in the case of NO_HZ_IDLE, if you force
> the timekeeping duty to a specific CPU, it won't be able to enter in dynticks
> idle mode as long as any other CPU is running. 

Yes, it will in effect be a TICK_PERIODIC core where I can configure which 
core the timekeeping update will happen on.

> Because those CPUs can make use of jiffies or gettimeofday() and must 
> have up-to-date values. This involves quite some complication like using the 
> full system idle detection (CONFIG_NO_HZ_FULL_SYSIDLE) to avoid races 
> between timekeeper entering dynticks idle mode and other CPUs waking up 
> from idle. But the worst here is the powersaving issues resulting from the 
> timekeeper who can't sleep.

Personally, when I force the timer to be bound to a specific CPU, I'm 
pretty happy with the fact that it won't be allowed to turn ticks off. At 
that stage, powersave is the least of my concerns, throughput and/or jitter 
is.

I know that what I'm doing is in effect turning the kernel into a 
somewhat more configurable TICK_PERIODIC kernel (in the sense that I can 
set the timer to run on something other than the boot-cpu).

> These issues are being dealt with in NO_HZ_FULL because we want the 
> timekeeping duty to be affine to the CPUs that are not full dynticks. But 
> in the case of NO_HZ_IDLE, I fear it's not going to be desirable.

Hum? I didn't get that one, what do you mean?

-- 
Henrik Austad


* Re: [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace
  2014-02-26  8:16   ` Henrik Austad
@ 2014-02-26 13:02     ` Frederic Weisbecker
  2014-02-27  8:37       ` Henrik Austad
  0 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2014-02-26 13:02 UTC (permalink / raw)
  To: Henrik Austad
  Cc: LKML, Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Paul E. McKenney

On Wed, Feb 26, 2014 at 09:16:03AM +0100, Henrik Austad wrote:
> On Tue, Feb 25, 2014 at 03:19:09PM +0100, Frederic Weisbecker wrote:
> > On Tue, Feb 25, 2014 at 01:33:55PM +0100, Henrik Austad wrote:
> > > From: Henrik Austad <haustad@cisco.com>
> > > 
> > > Hi!
> > > 
> > > This is a rework of the previous patch based on the feedback gathered
> > > from the last round. I've split it up a bit, mostly to make it easier to
> > > single out the parts that require more attention (#4 comes to mind).
> > > 
> > > Being able to read (and possibly force a specific CPU to handle all
> > > do_timer() updates) can be very handy when debugging a system and tuning
> > > for performance. It is not always easy to route interrupts to a specific
> > > core (or away from one, for that matter).
> > 
> > It's a bit vague as a reason for the patchset. Do we really need it?
> 
> One case is to move the timekeeping away from cores I know have 
> interrupt-issues (in an embedded setup, it is not always easy to move 
> interrupts away).
> 
> Another is to remove jitter from cores doing either real-time work or heavy 
> > worker threads. The timekeeping update is pretty fast, but I do not see any 
> reason for letting timekeeping interfere with my workers if it does not 
> have to.

Ok. I'll get back to that below.
 
> > Concerning the read-only part, if I want to know which CPU is handling the
> > timekeeping, I'd rather use tracing than a sysfs file. I can correlate
> > timekeeping update traces with other events. Especially as the timekeeping duty
> > can change hands and move to any CPU all the time. We really don't want to
> > poll on a sysfs file to get that information. It's not adapted and doesn't
> > carry any timestamp. It may be useful only if the timekeeping CPU is static.
> 
> I agree that not having a timestamp will make it useless wrt tracing, 
> but that was never the intention. By having a sysfs/sysctl value you can 
> quickly determine if the timekeeping is bound to a single core or if it is 
> handled everywhere.
> 
> Tracing will give you the most accurate result, but that's not always what 
> you want as tracing also provides an overhead (both in the kernel as well 
> as in the head of the user); using the sysfs/sysctl interface for grabbing 
> the CPU does not.
> 
> You can also use it to verify that the forced-cpu you just set did in fact 
> have the desired effect.
> 
> Another approach I was contemplating was to let current_cpu return the 
> current mask of CPUs where the timer is running; once you set it via 
> forced_cpu, it will narrow down to that particular core. Would that be more 
> useful for the RO approach outside TICK_PERIODIC?

Ok so this is about checking which CPU the timekeeping is bound to.
But what do you display in the normal case (i.e. when timekeeping is globally affine)?

-1 could be an option but hmm...

Wouldn't it be saner to use a cpumask of the timer affinity instead? This
is the traditional way we affine something in /proc or /sys

> 
> > Now looking at the write part. What kind of usecase do you have in mind?
> 
> Forcing the timer to run on single core only, and a core of my choosing at 
> that.
> 
> - Get timekeeping away from cores with bad interrupts (no, I cannot move 
>   them).
> - Avoid running timekeeping updates on worker-cores.

Ok but what you're moving away is not the tick but the timekeeping duty, which
is only a part of the tick. A significant part but still just a part.

Does this all make sense outside the NO_HZ_FULL case?

> 
> > It's also important to consider that, in the case of NO_HZ_IDLE, if you force
> > the timekeeping duty to a specific CPU, it won't be able to enter in dynticks
> > idle mode as long as any other CPU is running. 
> 
> Yes, it will in effect be a TICK_PERIODIC core where I can configure which 
> core the timekeeping update will happen on.

Ok, I missed that part. So when the timekeeping is affine to a specific CPU,
this CPU is prevented from entering dynticks idle mode?

> 
> > Because those CPUs can make use of jiffies or gettimeofday() and must 
> > have up-to-date values. This involves quite some complication like using the 
> > full system idle detection (CONFIG_NO_HZ_FULL_SYSIDLE) to avoid races 
> > between timekeeper entering dynticks idle mode and other CPUs waking up 
> > from idle. But the worst here is the powersaving issues resulting from the 
> > timekeeper who can't sleep.
> 
> Personally, when I force the timer to be bound to a specific CPU, I'm 
> pretty happy with the fact that it won't be allowed to turn ticks off. At 
> that stage, powersave is the least of my concerns, throughput and/or jitter 
> is.
> 
> I know that what I'm doing is in effect turning the kernel into a 
> somewhat more configurable TICK_PERIODIC kernel (in the sense that I can 
> set the timer to run on something other than the boot-cpu).

I see.

> 
> > These issues are being dealt with in NO_HZ_FULL because we want the 
> > timekeeping duty to be affine to the CPUs that are not full dynticks. But 
> > in the case of NO_HZ_IDLE, I fear it's not going to be desirable.
> 
> Hum? I didn't get that one, what do you mean?

So in NO_HZ_FULL we do something that is very close to what you're doing: the timekeeping
is affine to the boot CPU and it stays periodic whatever happens.

But we start to worry about powersaving. When the whole system is idle, there is
no point in preventing CPU 0 from sleeping. So we are dealing with that by using a
full system idle detection that lets CPU 0 go to sleep when there is strictly nothing
to do. Then when nohz full CPU wakes up from idle, CPU 0 is woken up as well to get back
to its timekeeping duty.


* Re: [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace
  2014-02-26 13:02     ` Frederic Weisbecker
@ 2014-02-27  8:37       ` Henrik Austad
  2014-02-27 13:56         ` Frederic Weisbecker
  0 siblings, 1 reply; 15+ messages in thread
From: Henrik Austad @ 2014-02-27  8:37 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Paul E. McKenney

On Wed, Feb 26, 2014 at 02:02:42PM +0100, Frederic Weisbecker wrote:
> On Wed, Feb 26, 2014 at 09:16:03AM +0100, Henrik Austad wrote:
> > On Tue, Feb 25, 2014 at 03:19:09PM +0100, Frederic Weisbecker wrote:
> > > On Tue, Feb 25, 2014 at 01:33:55PM +0100, Henrik Austad wrote:
> > > > From: Henrik Austad <haustad@cisco.com>
> > > > 
> > > > Hi!
> > > > 
> > > > This is a rework of the previous patch based on the feedback gathered
> > > > from the last round. I've split it up a bit, mostly to make it easier to
> > > > single out the parts that require more attention (#4 comes to mind).
> > > > 
> > > > Being able to read (and possibly force a specific CPU to handle all
> > > > do_timer() updates) can be very handy when debugging a system and tuning
> > > > for performance. It is not always easy to route interrupts to a specific
> > > > core (or away from one, for that matter).
> > > 
> > > It's a bit vague as a reason for the patchset. Do we really need it?
> > 
> > One case is to move the timekeeping away from cores I know have 
> > interrupt-issues (in an embedded setup, it is not always easy to move 
> > interrupts away).
> > 
> > Another is to remove jitter from cores doing either real-time work or heavy 
> > worker threads. The timekeeping update is pretty fast, but I do not see any 
> > reason for letting timekeeping interfere with my workers if it does not 
> > have to.
> 
> Ok. I'll get back to that below.
>  
> > > Concerning the read-only part, if I want to know which CPU is handling the
> > > timekeeping, I'd rather use tracing than a sysfs file. I can correlate
> > > timekeeping update traces with other events. Especially as the timekeeping duty
> > > can change hands and move to any CPU all the time. We really don't want to
> > > poll on a sysfs file to get that information. It's not adapted and doesn't
> > > carry any timestamp. It may be useful only if the timekeeping CPU is static.
> > 
> > I agree that not having a timestamp will make it useless wrt tracing, 
> > but that was never the intention. By having a sysfs/sysctl value you can 
> > quickly determine if the timekeeping is bound to a single core or if it is 
> > handled everywhere.
> > 
> > Tracing will give you the most accurate result, but that's not always what 
> > you want as tracing also provides an overhead (both in the kernel as well 
> > as in the head of the user); using the sysfs/sysctl interface for grabbing 
> > the CPU does not.
> > 
> > You can also use it to verify that the forced-cpu you just set did in fact 
> > have the desired effect.
> > 
> > Another approach I was contemplating was to let current_cpu return the 
> > current mask of CPUs where the timer is running; once you set it via 
> > forced_cpu, it will narrow down to that particular core. Would that be more 
> > useful for the RO approach outside TICK_PERIODIC?
> 
> Ok so this is about checking which CPU the timekeeping is bound to.
> But what do you display in the normal case (i.e. when timekeeping is globally affine)?
> 
> -1 could be an option but hmm...

I don't really like -1, that indicates that it is disabled and could 
confuse people, letting them think that timekeeping is disabled on all 
cores.

> Wouldn't it be saner to use a cpumask of the timer affinity instead? This
> is the traditional way we affine something in /proc or /sys

Yes, that's what I'm starting to think as well, that would make a lot more 
sense when the timer is bounced around.

something like a 'current_cpu_mask' which would return a hex-mask 
of the cores where the timekeeping update _could_ run.

For periodic, that would be a single core (normally boot), and when forced, 
it would return a cpu-mask with only one cpu set. Then the result would be 
a lot more informative for NO_HZ_(IDLE|FULL) as well.

Worth a shot? (completely disjoint from the write-discussion below)
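
Rough sketch of what such a show-handler could look like (untested, and
tick_get_timekeeper_mask() is a made-up helper that would have to come
from tick-common; it is not part of this series):

/* Hypothetical: print the mask of CPUs that may run the timekeeping
 * update, as a hex cpumask, sysfs-style. */
static ssize_t current_cpu_mask_show(struct kobject *kobj,
				     struct kobj_attribute *attr,
				     char *buf)
{
	int len = cpumask_scnprintf(buf, PAGE_SIZE - 2,
				    tick_get_timekeeper_mask());

	buf[len++] = '\n';
	buf[len] = '\0';
	return len;
}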

> > > Now looking at the write part. What kind of usecase do you have in mind?
> > 
> > Forcing the timer to run on single core only, and a core of my choosing at 
> > that.
> > 
> > - Get timekeeping away from cores with bad interrupts (no, I cannot move 
> >   them).
> > - Avoid running timekeeping updates on worker-cores.
> 
> Ok but what you're moving away is not the tick but the timekeeping duty, which
> is only a part of the tick. A significant part but still just a part.

That is certainly true, but that part happens to be of global influence, so 
if I have a core where a driver disables interrupts a lot (or drops into a 
hypervisor, or any other silly thing it really shouldn't be doing), then I 
would like to be able to move the timekeeping updates away from that core. 

The same goes for cores running rt-tasks (>1), I really do not want -any- 
interference at all, and if I can remove the extra jitter from the 
timekeeping, I'm pretty happy to do so.

> Does this all make sense outside the NO_HZ_FULL case?

In my view, it makes sense in the periodic case as well since all 
timekeeping updates then happens on the boot-cpu (unless it is hotunplugged 
that is).

> > 
> > > It's also important to consider that, in the case of NO_HZ_IDLE, if you force
> > > the timekeeping duty to a specific CPU, it won't be able to enter in dynticks
> > > idle mode as long as any other CPU is running. 
> > 
> > Yes, it will in effect be a TICK_PERIODIC core where I can configure which 
> > core the timekeeping update will happen on.
> 
> Ok, I missed that part. So when the timekeeping is affine to a specific CPU,
> this CPU is prevented from entering dynticks idle mode?

That's what I aimed at, and I *think* I managed that. I added a 
forced_timer_can_stop_tick() and let can_stop_full_tick() and 
can_stop_idle_tick() call that. I think that is sufficient, at least I did 
not see that the timer duty was transferred to another core afterwards.

> > > Because those CPUs can make use of jiffies or gettimeofday() and must 
> > > have up-to-date values. This involves quite some complication like using the 
> > > full system idle detection (CONFIG_NO_HZ_FULL_SYSIDLE) to avoid races 
> > > between timekeeper entering dynticks idle mode and other CPUs waking up 
> > > from idle. But the worst here is the powersaving issues resulting from the 
> > > timekeeper who can't sleep.
> > 
> > Personally, when I force the timer to be bound to a specific CPU, I'm 
> > pretty happy with the fact that it won't be allowed to turn ticks off. At 
> > that stage, powersave is the least of my concerns, throughput and/or jitter 
> > is.
> > 
> > I know that what I'm doing is in effect turning the kernel into a 
> > somewhat more configurable TICK_PERIODIC kernel (in the sense that I can 
> > set the timer to run on something other than the boot-cpu).
> 
> I see.
> 
> > 
> > > These issues are being dealt with in NO_HZ_FULL because we want the 
> > > timekeeping duty to be affine to the CPUs that are not full dynticks. But 
> > > in the case of NO_HZ_IDLE, I fear it's not going to be desirable.
> > 
> > Hum? I didn't get that one, what do you mean?
> 
> > So in NO_HZ_FULL we do something that is very close to what you're doing: the timekeeping
> is affine to the boot CPU and it stays periodic whatever happens.
> 
> But we start to worry about powersaving. When the whole system is idle, there is
> > no point in preventing CPU 0 from sleeping. So we are dealing with that by using a
> full system idle detection that lets CPU 0 go to sleep when there is strictly nothing
> to do. Then when nohz full CPU wakes up from idle, CPU 0 is woken up as well to get back
> to its timekeeping duty.

Hmm, I had the impression that when a CPU with timekeeping-duty was sent to 
sleep, it would set tick_do_timer_cpu to TICK_DO_TIMER_NONE, and whenever 
another core would run do_timer() it would see if tick_do_timer_cpu was set 
to TICK_DO_TIMER_NONE and if so, grab it and run with it.

I really don't see how this wakes up CPU0 (but then again, there's probably 
several layers of logic here that I'm missing :)


-- 
Henrik Austad


* Re: [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace
  2014-02-27  8:37       ` Henrik Austad
@ 2014-02-27 13:56         ` Frederic Weisbecker
  2014-02-28  9:40           ` Henrik Austad
  0 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2014-02-27 13:56 UTC (permalink / raw)
  To: Henrik Austad
  Cc: LKML, Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Paul E. McKenney

On Thu, Feb 27, 2014 at 09:37:35AM +0100, Henrik Austad wrote:
> On Wed, Feb 26, 2014 at 02:02:42PM +0100, Frederic Weisbecker wrote:
> > 
> > -1 could be an option but hmm...
> 
> I don't really like -1, that indicates that it is disabled and could 
> > confuse people, letting them think that timekeeping is disabled on all 
> cores.
> 
> > Wouldn't it be saner to use a cpumask of the timer affinity instead? This
> > is the traditional way we affine something in /proc or /sys
> 
> Yes, that's what I'm starting to think as well, that would make a lot more 
> sense when the timer is bounced around.
> 
> something like a 'current_cpu_mask' which would return a hex-mask 
> of the cores where the timekeeping update _could_ run.

Right, or timekeeper_cpumask.

> 
> For periodic, that would be a single core (normally boot), and when forced, 
> it would return a cpu-mask with only one cpu set. Then the result would be 
> a lot more informative for NO_HZ_(IDLE|FULL) as well.

I'd rather suggest that for periodic and NO_HZ_IDLE, this should be equal to cpu_online_mask
by default, because the timekeeping duty can be taken by any CPU.

Then on subsequent writes on this cpumask, this reflects the reduced subset of CPUs that
we allow to take the duty.

In your usecase you're interested in a single CPU to take that duty but the cpumask
allows more flexibility.

Now NO_HZ_FULL is a bit different. For now only CPU 0 can take the duty. This will extend
later to every CPU outside the range of full dynticks.

> 
> Worth a shot? (completely disjoint from the write-discussion below)

I can't judge alone if we really want this patchset. We need the input of timer
maintainers for that.

Assuming we want it then yeah, the cpumask affinity in sysfs/procfs looks like a sane
approach to me. In turn we should also plug it into the RCU full system idle detection:
https://lwn.net/Articles/558284/ so that we can shutdown the reduced set of timekeepers
when there is no other CPU running. I have some patches for it that I can plug
afterward. So no worry for you on that side.

> 
> > > > Now looking at the write part. What kind of usecase do you have in mind?
> > > 
> > > Forcing the timer to run on single core only, and a core of my choosing at 
> > > that.
> > > 
> > > - Get timekeeping away from cores with bad interrupts (no, I cannot move 
> > >   them).
> > > - Avoid running timekeeping updates on worker-cores.
> > 
> > Ok but what you're moving away is not the tick but the timekeeping duty, which
> > is only a part of the tick. A significant part but still just a part.
> 
> That is certainly true, but that part happens to be of global influence, so 
> if I have a core where a driver disables interrupts a lot (or drops into a 
> hypervisor, or any other silly thing it really shouldn't be doing), then I 
> would like to be able to move the timekeeping updates away from that core. 

I don't understand how these things are linked together. If your driver disables
interrupts and you don't want to be disturbed, moving the timekeeping duty doesn't
move the tick itself.

What is disturbing for you in the timekeeping update that is not in
the tick as a whole? Is the delta of cputime added by the jiffies and gtod update
alone a problem?

This sounds surprising but possible. Still I want to be sure that's the exact
problem you're encountering.

> 
> The same goes for cores running rt-tasks (>1), I really do not want -any- 
> interference at all, and if I can remove the extra jitter from the 
> timekeeping, I'm pretty happy to do so.

Then again if you don't want interference at all, NO_HZ_FULL sounds like a better
solution. NO_HZ_FULL implies that the timekeeping is handled by CPUs outside the nohz_full
boot parameter range. If NO_HZ_FULL_ALL then it's CPU 0.

> 
> > Does this all make sense outside the NO_HZ_FULL case?
> 
> In my view, it makes sense in the periodic case as well since all 
> timekeeping updates then happens on the boot-cpu (unless it is hotunplugged 
> that is).

But if we get back to your requirements, you want no interference at all. HZ_PERIODIC
doesn't look like what you want.

> 
> > > 
> > > > It's also important to consider that, in the case of NO_HZ_IDLE, if you force
> > > > the timekeeping duty to a specific CPU, it won't be able to enter in dynticks
> > > > idle mode as long as any other CPU is running. 
> > > 
> > > Yes, it will in effect be a TICK_PERIODIC core where I can configure which 
> > > core the timekeeping update will happen on.
> > 
> > Ok, I missed that part. So when the timekeeping is affine to a specific CPU,
> > this CPU is prevented from entering dynticks idle mode?
> 
> That's what I aimed at, and I *think* I managed that. I added a 
> forced_timer_can_stop_tick() and let can_stop_full_tick() and 
> can_stop_idle_tick() call that. I think that is sufficient, at least I did 
> not see that the timer duty was transferred to another core afterwards.

Ok, I need to look at the details.

> 
> > > > Because those CPUs can make use of jiffies or gettimeofday() and must 
> > > > have up-to-date values. This involves quite some complication like using the 
> > > > full system idle detection (CONFIG_NO_HZ_FULL_SYSIDLE) to avoid races 
> > > > between timekeeper entering dynticks idle mode and other CPUs waking up 
> > > > from idle. But the worst here is the powersaving issues resulting from the 
> > > > timekeeper who can't sleep.
> > > 
> > > Personally, when I force the timer to be bound to a specific CPU, I'm 
> > > pretty happy with the fact that it won't be allowed to turn ticks off. At 
> > > that stage, powersave is the least of my concerns, throughput and/or jitter 
> > > is.
> > > 
> > > I know that what I'm doing is in effect turning the kernel into a 
> > > somewhat more configurable TICK_PERIODIC kernel (in the sense that I can 
> > > set the timer to run on something other than the boot-cpu).
> > 
> > I see.
> > 
> > > 
> > > > These issues are being dealt with in NO_HZ_FULL because we want the 
> > > > timekeeping duty to be affine to the CPUs that are not full dynticks. But 
> > > > in the case of NO_HZ_IDLE, I fear it's not going to be desirable.
> > > 
> > > Hum? I didn't get that one, what do you mean?
> > 
> > So in NO_HZ_FULL we do something that is very close to what you're doing: the timekeeping
> > is affine to the boot CPU and it stays periodic whatever happens.
> > 
> > But we start to worry about powersaving. When the whole system is idle, there is
> > no point in preventing CPU 0 from sleeping. So we are dealing with that by using a
> > full system idle detection that lets CPU 0 go to sleep when there is strictly nothing
> > to do. Then when nohz full CPU wakes up from idle, CPU 0 is woken up as well to get back
> > to its timekeeping duty.
> 
> Hmm, I had the impression that when a CPU with timekeeping-duty was sent to 
> sleep, it would set tick_do_timer_cpu to TICK_DO_TIMER_NONE, and whenever 
> another core would run do_timer() it would see if tick_do_timer_cpu was set 
> to TICK_DO_TIMER_NONE and if so, grab it and run with it.

Yep that's true for periodic and dynticks idle. Not for full dynticks.

> 
> I really don't see how this wakes up CPU0 (but then again, there's probably 
> several layers of logic here that I'm missing :)

By way of an IPI.

Scenario is: CPU 0 handles timekeeping and CPU 1 is full dynticks. Both are running;
then CPU 1 goes to sleep. CPU 0 notices that CPU 1 went to sleep and thus nobody else
needs the timekeeping to be up-to-date. If CPU 0 is idle as well it can go to sleep. So
does it. Then later if CPU 1 wakes up to do something, it sends an IPI to CPU 0 such that
CPU 0 wakes up, notices that CPU 1 is alive and runs the timekeeping update on its behalf.

It's not yet upstream but that's the plan :)


* Re: [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace
  2014-02-27 13:56         ` Frederic Weisbecker
@ 2014-02-28  9:40           ` Henrik Austad
  2014-02-28 23:08             ` Frederic Weisbecker
  0 siblings, 1 reply; 15+ messages in thread
From: Henrik Austad @ 2014-02-28  9:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Paul E. McKenney

On Thu, Feb 27, 2014 at 02:56:20PM +0100, Frederic Weisbecker wrote:
> On Thu, Feb 27, 2014 at 09:37:35AM +0100, Henrik Austad wrote:
> > On Wed, Feb 26, 2014 at 02:02:42PM +0100, Frederic Weisbecker wrote:
> > > 
> > > -1 could be an option but hmm...
> > 
> > I don't really like -1; that indicates that it is disabled and could 
> > confuse people, letting them think that timekeeping is disabled on all 
> > cores.
> > 
> > > Wouldn't it be saner to use a cpumask of the timer affinity instead? 
> > > This is the traditional way we affine something in /proc or /sys
> > 
> > Yes, that's what I'm starting to think as well, that would make a lot 
> > more sense when the timer is bounced around.
> > 
> > something like a 'current_cpu_mask' which would return a hex-mask of 
> > the cores where the timekeeping update _could_ run.
> 
> Right, or timekeeper_cpumask.

Right, paying some attention to a proper, descriptive name is probably a 
good thing.

Also, for future reference, a consistent name for "the core running the 
timekeeping update" would be nice. Per the suggestion above, 'timekeeper' 
seems to be a valid name for the core currently assigned the duty.

All in favour?

> > For periodic, that would be a single core (normally boot), and when 
> > forced, it would return a cpu-mask with only one cpu set. Then the 
> > result would be a lot more informative for NO_HZ_(IDLE|FULL) as well.
> 
> I'd rather suggest that for periodic and NO_HZ_IDLE, this should be equal 
> to cpu_online_mask by default. Because the timekeeping duty can be taken 
> by any CPU.

So basically, we're looking at a few more values here then

# the current (un)restricted mask where the timekeeper can run; this would 
# be the writable mask

     timekeeper_cpumask

# the mask from which you can assign cores to timekeeper_cpumask. For 
# NO_HZ_FULL, I guess this would be all cores except those in the 
# nohz_full-mask.
# read-only

     timekeeper_cpumask_possible

# makes sense only in a periodic system where it does not bounce around like 
# crazy.
# read-only

     timekeeper_current_cpu

Then force_cpu should accept a mask, or a range, of CPUs on which the 
timekeeper should be allowed to run.

     timekeeper_cpumask_forced

Makes sense?
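
(To make the proposal concrete: a hedged sketch of how the writable
timekeeper_cpumask knob could look as a sysfs attribute. Only the
attribute name comes from the proposal above; the timekeeper_cpumask
variable and the tick_set_timekeeper_mask() helper are invented for
illustration.)

#include <linux/cpumask.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

static struct cpumask timekeeper_cpumask;	/* hypothetical */

static ssize_t timekeeper_cpumask_show(struct kobject *kobj,
				       struct kobj_attribute *attr, char *buf)
{
	/* Print the mask in cpulist format, e.g. "0-3,5". */
	int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &timekeeper_cpumask);

	buf[n++] = '\n';
	buf[n] = '\0';
	return n;
}

static ssize_t timekeeper_cpumask_store(struct kobject *kobj,
					struct kobj_attribute *attr,
					const char *buf, size_t count)
{
	cpumask_var_t new_mask;
	int err;

	if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
		return -ENOMEM;

	err = cpulist_parse(buf, new_mask);
	if (!err)
		err = tick_set_timekeeper_mask(new_mask); /* hypothetical */

	free_cpumask_var(new_mask);
	return err ? err : count;
}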

> Then on subsequent writes to this cpumask, this reflects the reduced 
> subset of CPUs that we allow to take the duty.
> 
> In your use case you're interested in having a single CPU take that duty, 
> but the cpumask allows more flexibility.

I agree.

> Now NO_HZ_FULL is a bit different. For now only CPU 0 can take the duty. 
> This will extend later to every CPU outside the range of full dynticks.
> 
> > 
> > Worth a shot? (completely disjoint from the write-discussion below)
> 
> I can't judge alone if we really want this patchset. We need the input of 
> timer maintainers for that.

Sure!

> Assuming we want it then yeah, the cpumask affinity in sysfs/procfs looks 
> like a sane approach to me. In the long term we should also plug it into 
> the RCU full system idle detection: https://lwn.net/Articles/558284/ so 
> that we can shut down the reduced set of timekeepers when there is no 
> other CPU running. I have some patches for it that I can plug in 
> afterward. So no worry for you on that side.

Ok, I'm not going to start to worry about that just yet then, but it does 
sound very interesting! :)

> > > > > Now looking at the write part. What kind of usecase do you have 
> > > > > in mind?
> > > > 
> > > > Forcing the timer to run on a single core only, and a core of my 
> > > > choosing at that.
> > > > 
> > > > - Get timekeeping away from cores with bad interrupts (no, I cannot 
> > > >   move them).
> > > > - Avoid running timekeeping updates on worker-cores.
> > > 
> > > Ok but what you're moving away is not the tick but the timekeeping 
> > > duty, which is only a part of the tick. A significant part but still 
> > > just a part.
> > 
> > That is certainly true, but that part happens to be of global 
> > influence, so if I have a core where a driver disables interrupts a lot 
> > (or drops into a hypervisor, or any other silly thing it really 
> > shouldn't be doing), then I would like to be able to move the 
> > timekeeping updates away from that core.
> 
> I don't understand how these things are linked together. If your driver 
> disables interrupts and you don't want to be disturbed, moving the 
> timekeeping duty doesn't move the tick itself.

True, but it reduces some amount of jitter from the tick itself. It's more 
like damage control than damage prevention.

> What is disturbing for you in the timekeeping update specifically, as 
> opposed to the tick as a whole? Is the delta of cputime added by the 
> jiffies and gtod update alone a problem?

I had some examples where the timekeeper would grab timekeeper.lock and hold 
it for a very long time. The reason was not easy to pinpoint, so one of the 
things I wanted to try was to move the timekeeper around and see what 
happened (yeah, debugging at its best!)

> This sounds surprising but possible. Still, I want to be sure that's the 
> exact problem you're encountering.
> 
> > 
> > The same goes for cores running rt-tasks (>1): I really do not want 
> > -any- interference at all, and if I can remove the extra jitter from 
> > the timekeeping, I'm pretty happy to do so.
> 
> Then again, if you don't want interference at all, NO_HZ_FULL sounds like 
> a better solution. NO_HZ_FULL implies that the timekeeping is handled by 
> CPUs outside the nohz_full boot parameter range. If NO_HZ_FULL_ALL then 
> it's CPU 0.

Aha, thanks for clearing that up!

> > > Does this all make sense outside the NO_HZ_FULL case?
> > 
> > In my view, it makes sense in the periodic case as well since all 
> > timekeeping updates then happen on the boot-cpu (unless it is 
> > hot-unplugged, that is).
> 
> But if we get back to your requirements, you want no interference at all. 
> HZ_PERIODIC doesn't look like what you want.

True, but I also needed a mechanism for moving the timekeeper away from a 
problematic core.

Another benefit of periodic is that it supports skewed timers, and the 
overhead is lower than NO_HZ when you do have timer interrupts.

> > > > > It's also important to consider that, in the case of NO_HZ_IDLE, 
> > > > > if you force the timekeeping duty to a specific CPU, it won't be 
> > > > > able to enter dynticks idle mode as long as any other CPU is 
> > > > > running.
> > > > 
> > > > Yes, it will in effect be a TICK_PERIODIC core where I can 
> > > > configure on which core the timekeeping update will happen.
> > > 
> > > Ok, I missed that part. So when the timekeeping is affine to a 
> > > specific CPU, this CPU is prevented from entering dynticks idle mode?
> > 
> > That's what I aimed at, and I *think* I managed that. I added a 
> > forced_timer_can_stop_tick() and let can_stop_full_tick() and 
> > can_stop_idle_tick() call that. I think that is sufficient; at least I 
> > did not see that the timer duty was transferred to another core 
> > afterwards.
> 
> Ok, I need to look at the details.

Ahem, no, my understanding of the NOHZ-cpumask was flawed. I'm reworking 
this based on feedback as well as the discussion below.

> > > > > Because those CPUs can make use of jiffies or gettimeofday() and 
> > > > > must have up-to-date values. This involves quite some complication, 
> > > > > like using the full system idle detection 
> > > > > (CONFIG_NO_HZ_FULL_SYSIDLE) to avoid races between the timekeeper 
> > > > > entering dynticks idle mode and other CPUs waking up from idle. 
> > > > > But the worst here is the powersaving issue resulting from the 
> > > > > timekeeper that can't sleep.
> > > > 
> > > > Personally, when I force the timer to be bound to a specific CPU, 
> > > > I'm pretty happy with the fact that it won't be allowed to turn 
> > > > ticks off. At that stage, powersave is the least of my concerns, 
> > > > throughput and/or jitter is.
> > > > 
> > > > I know that what I'm doing is in effect turning the kernel into a 
> > > > somewhat more configurable TICK_PERIODIC kernel (in the sense that 
> > > > I can set the timer to run on something other than the boot-cpu).
> > > 
> > > I see.
> > > 
> > > > 
> > > > > These issues are being dealt with in NO_HZ_FULL because we want 
> > > > > the timekeeping duty to be affine to the CPUs that are not full 
> > > > > dynticks. But in the case of NO_HZ_IDLE, I fear it's not going to 
> > > > > be desirable.
> > > > 
> > > > Hum? I didn't get that one, what do you mean?
> > > 
> > > So in NO_HZ_FULL we do something that is very close to what you're 
> > > doing: the timekeeping is affine to the boot CPU and it stays periodic 
> > > whatever happens.
> > > 
> > > But we start to worry about powersaving. When the whole system is 
> > > idle, there is no point in preventing CPU 0 from sleeping. So we are 
> > > dealing with that by using a full system idle detection that lets CPU 
> > > 0 go to sleep when there is strictly nothing to do. Then when a nohz 
> > > full CPU wakes up from idle, CPU 0 is woken up as well to get back to 
> > > its timekeeping duty.
> > 
> > Hmm, I had the impression that when a CPU with timekeeping duty was 
> > sent to sleep, it would set tick_do_timer_cpu to TICK_DO_TIMER_NONE, 
> > and whenever another core would run do_timer() it would see if 
> > tick_do_timer_cpu was set to TICK_DO_TIMER_NONE and, if so, grab it and 
> > run with it.
> 
> Yep that's true for periodic and dynticks idle. Not for full dynticks.
>
> > 
> > I really don't see how this wakes up CPU0 (but then again, there are 
> > probably several layers of logic here that I'm missing :)
> 
> By way of an IPI.
> 
> The scenario is: CPU 0 handles timekeeping and CPU 1 is full dynticks. Both 
> are running, then CPU 1 goes to sleep. CPU 0 notices that CPU 1 went to 
> sleep and thus nobody else needs the timekeeping to be up to date. If CPU 0 
> is idle as well, it can go to sleep, and so it does. Then later, if CPU 1 
> wakes up to do something, it sends an IPI to CPU 0 such that CPU 0 wakes up, 
> notices that CPU 1 is alive, and runs the timekeeping update on its behalf.
> 
> It's not yet upstream but that's the plan :)

So all dynticks-kernel will send an IPI to CPU0 upon wakeup to notify  
that they're no longer sleeping?

Sounds like having an effective way to detect that the entire system is 
idle would be really nice in this scenario! That way, if CPU1 could 
determine that the system was not idle, it would not have to send an IPI 
to CPU0 to wake it up.

-- 
Henrik


* Re: [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace
  2014-02-28  9:40           ` Henrik Austad
@ 2014-02-28 23:08             ` Frederic Weisbecker
  2014-03-03 10:42               ` Henrik Austad
  0 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2014-02-28 23:08 UTC (permalink / raw)
  To: Henrik Austad
  Cc: LKML, Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Paul E. McKenney

On Fri, Feb 28, 2014 at 10:40:35AM +0100, Henrik Austad wrote:
> On Thu, Feb 27, 2014 at 02:56:20PM +0100, Frederic Weisbecker wrote:
> > Right, or timekeeper_cpumask.
> 
> Right, paying some attention to a proper, descriptive name is probably a 
> good thing.
> 
> Also, for future reference, a consistent name for "the core running the 
> timekeeping update" would be nice. Per the suggestion above, 'timekeeper' 
> seems to be a valid name for the core currently assigned the duty.
> 
> All in favour?

Yeah, definitely. I'm already using it in a few of my changelogs.

> 
> > > For periodic, that would be a single core (normally boot), and when 
> > > forced, it would return a cpu-mask with only one cpu set. Then the 
> > > result would be a lot more informative for NO_HZ_(IDLE|FULL) as well.
> > 
> > I'd rather suggest that for periodic and NO_HZ_IDLE, this should be equal 
> > to cpu_online_mask by default. Because the timekeeping duty can be taken 
> > by any CPU.
> 
> So basically, we're looking at a few more values here then

I feel we don't need more than a single cpumask that can do all of that.
Look:

> 
> # the current (un)restricted mask where the timekeeper can run; this would 
> # be the writable mask
> 
>      timekeeper_cpumask

Ok. By default cpu_possible_mask.

> 
> # the mask from which you can assign cores to timekeeper_cpumask. For 
> # NO_HZ_FULL, I guess this would be all cores except those in the 
> # nohz_full-mask.
> # read-only
> 
>      timekeeper_cpumask_possible

Right, but why a different mask here? Just in the case of NO_HZ_FULL,
timekeeper_cpumask is (cpu_possible_mask & ~tick_nohz_full_mask).
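
(In kernel terms, assuming the hypothetical timekeeper_cpumask variable
from the sketch above, that default would be a one-liner:)

	/* timekeeper_cpumask = cpu_possible_mask & ~tick_nohz_full_mask */
	cpumask_andnot(&timekeeper_cpumask, cpu_possible_mask,
		       tick_nohz_full_mask);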

> 
> # makes sense only in a periodic system where it does not bounce around 
> # like crazy.
> # read-only
> 
>      timekeeper_current_cpu

It may not bounce around like crazy, but it can bounce: just offline the
timekeeper and this happens.

Then again, I feel like timekeeper_cpumask is still a good fit here.

> 
> Then force_cpu should accept a mask, or a range, of CPUs on which the 
> timekeeper should be allowed to run.
> 
>      timekeeper_cpumask_forced

That too can be timekeeper_cpumask.

> > Assuming we want it then yeah, the cpumask affinity in sysfs/procfs looks 
> > like a sane approach to me. In the long term we should also plug it into 
> > the RCU full system idle detection: https://lwn.net/Articles/558284/ so 
> > that we can shut down the reduced set of timekeepers when there is no 
> > other CPU running. I have some patches for it that I can plug in 
> > afterward. So no worry for you on that side.
> 
> Ok, I'm not going to start to worry about that just yet then, but it does 
> sound very interesting! :)

And moreover, it's very convenient for me if we do that, since it would
set up all the infrastructure I need to affine the timekeeper and adaptively
run the timekeeper in dynticks idle mode on NO_HZ_FULL.

> > > That is certainly true, but that part happens to be of global 
> > > influence, so if I have a core where a driver disables interrupts a lot 
> > > (or drops into a hypervisor, or any other silly thing it really 
> > > shouldn't be doing), then I would like to be able to move the 
> > > timekeeping updates away from that core.
> > 
> > I don't understand how these things are linked together. If your driver 
> > disables interrupts and you don't want to be disturbed, moving the 
> > timekeeping duty doesn't move the tick itself.
> 
> True, but it reduces some amount of jitter from the tick itself. It's more 
> like damage control than damage prevention.

Ok.

> 
> > What is disturbing for you in the timekeeping update specifically, as 
> > opposed to the tick as a whole? Is the delta of cputime added by the 
> > jiffies and gtod update alone a problem?
> 
> I had some examples where the timekeeper would grab timekeeper.lock and 
> hold it for a very long time. The reason was not easy to pinpoint, so one 
> of the things I wanted to try was to move the timekeeper around and see 
> what happened (yeah, debugging at its best!)

And you got better results?

> > > > Does this all make sense outside the NO_HZ_FULL case?
> > > 
> > > In my view, it makes sense in the periodic case as well since all 
> > > timekeeping updates then happen on the boot-cpu (unless it is 
> > > hot-unplugged, that is).
> > 
> > But if we get back to your requirements, you want no interference at all. 
> > HZ_PERIODIC doesn't look like what you want.
> 
> True, but I also needed a mechanism for moving the timekeeper away from a 
> problematic core.
> 
> Another benefit of periodic is that it supports skewed timers, and the 
> overhead is lower than NO_HZ when you do have timer interrupts.

Yeah that's a point.

> > > That's what I aimed at, and I *think* I managed that. I added a 
> > > forced_timer_can_stop_tick() and let can_stop_full_tick() and 
> > > can_stop_idle_tick() call that. I think that is sufficient; at least I 
> > > did not see that the timer duty was transferred to another core 
> > > afterwards.
> > 
> > Ok, I need to look at the details.
> 
> Ahem, no, my understanding of the NOHZ-cpumask was flawed. I'm reworking 
> this based on feedback as well as the discussion below.

Ok.

> > The scenario is: CPU 0 handles timekeeping and CPU 1 is full dynticks. 
> > Both are running, then CPU 1 goes to sleep. CPU 0 notices that CPU 1 went 
> > to sleep and thus nobody else needs the timekeeping to be up to date. If 
> > CPU 0 is idle as well, it can go to sleep, and so it does. Then later, if 
> > CPU 1 wakes up to do something, it sends an IPI to CPU 0 such that CPU 0 
> > wakes up, notices that CPU 1 is alive, and runs the timekeeping update on 
> > its behalf.
> > 
> > It's not yet upstream but that's the plan :)
> 
> So all dynticks-kernel will send an IPI to CPU0 upon wakeup to notify  
> that they're no longer sleeping?

I guess you meant s/dynticks-kernel/full-dynticks CPU/
So yes that's exactly what happens!

> Sounds like having an effective way to detect that the entire system is 
> idle would be really nice in this scenario! That way, if CPU1 could 
> determine that the system was not idle, it would not have to send an IPI 
> to CPU0 to wake it up.

Exactly! The full system idle detection provides that clever information (taking care
of races and all) that lets us know if the IPI is really needed or if it can be spared,
depending on CPU 0's state.

This all relies on a tricky lockless state machine. Check out CONFIG_NO_HZ_FULL_SYSIDLE
for details, it's an interesting devilry ;)

It's not yet used but we are on the right track: https://lkml.org/lkml/2013/12/17/708
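
(A hedged sketch of the timekeeper-side decision this enables.
rcu_sys_is_idle() is the query provided by CONFIG_NO_HZ_FULL_SYSIDLE and
is meant to be called from the timekeeping CPU; the
timekeeping_can_sleep() wrapper is invented for illustration.)

static bool timekeeping_can_sleep(void)
{
	/*
	 * If every non-timekeeping CPU is idle, nobody needs jiffies or
	 * gettimeofday() to stay up to date, so the timekeeper may stop
	 * its own tick too. A full-dynticks CPU that wakes up later is
	 * then responsible for the IPI that brings the timekeeper back.
	 */
	return rcu_sys_is_idle();
}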

Thanks.


* Re: [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace
  2014-02-28 23:08             ` Frederic Weisbecker
@ 2014-03-03 10:42               ` Henrik Austad
  0 siblings, 0 replies; 15+ messages in thread
From: Henrik Austad @ 2014-03-03 10:42 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Henrik Austad, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Paul E. McKenney

On Sat, Mar 01, 2014 at 12:08:36AM +0100, Frederic Weisbecker wrote:
> On Fri, Feb 28, 2014 at 10:40:35AM +0100, Henrik Austad wrote:
> > On Thu, Feb 27, 2014 at 02:56:20PM +0100, Frederic Weisbecker wrote:
> > > Right, or timekeeper_cpumask.
> > 
> > Right, paying some attention to a proper, descriptive name is probably a 
> > good thing.
> > 
> > Also, for future reference, a consistent name for "the core running the 
> > timekeeping update" would be nice. Per the suggestion above, 'timekeeper' 
> > seems to be a valid name for the core currently assigned the duty.
> > 
> > All in favour?
> 
> Yeah, definitely. I'm already using it in a few of my changelogs.
> 
> > 
> > > > For periodic, that would be a single core (normally boot), and when 
> > > > forced, it would return a cpu-mask with only one cpu set. Then the 
> > > > result would be a lot more informative for NO_HZ_(IDLE|FULL) as well.
> > > 
> > > I'd rather suggest that for periodic and NO_HZ_IDLE, this should be equal 
> > > to cpu_online_mask by default. Because the timekeeping duty can be taken 
> > > by any CPU.
> > 
> > So basically, we're looking at a few more values here then
> 
> I feel we don't need more than a single cpumask that can do all of that.
> Look:
> 
> > 
> > # the current (un)restricted mask where the timekeeper can run; this 
> > # would be the writable mask
> > 
> >      timekeeper_cpumask
> 
> Ok. By default cpu_possible_mask.
>
> > # the mask from which you can assign cores to timekeeper_cpumask. For 
> > # NO_HZ_FULL, I guess this would be all cores except those in the 
> > # nohz_full-mask.
> > # read-only
> > 
> >      timekeeper_cpumask_possible
> 
> Right, but why a different mask here? Just in the case of NO_HZ_FULL,
> timekeeper_cpumask is (cpu_possible_mask & ~tick_nohz_full_mask).

I wanted to distinguish between what's actually set and what's possible to 
set.

> > # makes sense only in a periodic system where it does not bounce around 
> > # like crazy.
> > # read-only
> > 
> >      timekeeper_current_cpu
> 
> It may not bounce around like crazy, but it can bounce: just offline the
> timekeeper and this happens.
> 
> Then again, I feel like timekeeper_cpumask is still a good fit here.

True, this was more for getting an idea about where it is _right_now_. But 
I guess you're right: if you want to know -exactly- where the timekeeper 
is, you need to do some tracing and see where it has been.

> > Then force_cpu should accept a mask, or a range, of CPUs on which the 
> > timekeeper should be allowed to run.
> > 
> >      timekeeper_cpumask_forced
> 
> That too can be timekeeper_cpumask.

Yes, and no.

In a system where you haven't set a forced mask, you could use this to 
read that info out (-1 or a cleared mask returned). Once you set a mask, 
you can get the exact mask the system has used from here.

This is information you can infer, but it makes using the system a bit 
easier.

I know you can deduce most of this based on original state, hardware 
knowledge and how the timekeeper works, but exposing it in cleartext back to 
userland would be beneficial in my view.

Why obscure it more than necessary?

> > > Assuming we want it then yeah, the cpumask affinity in sysfs/procfs 
> > > looks like a sane approach to me. In the long term we should also plug 
> > > it into the RCU full system idle detection: 
> > > https://lwn.net/Articles/558284/ so that we can shut down the reduced 
> > > set of timekeepers when there is no other CPU running. I have some 
> > > patches for it that I can plug in afterward. So no worry for you on 
> > > that side.
> > 
> > Ok, I'm not going to start to worry about that just yet then, but it does 
> > sound very interesting! :)
> 
> And moreover, it's very convenient for me if we do that, since it would
> set up all the infrastructure I need to affine the timekeeper and 
> adaptively run the timekeeper in dynticks idle mode on NO_HZ_FULL.
> 
> > > > That is certainly true, but that part happens to be of global 
> > > > influence, so if I have a core where a driver disables interrupts a lot 
> > > > (or drops into a hypervisor, or any other silly thing it really 
> > > > shouldn't be doing), then I would like to be able to move the 
> > > > timekeeping updates away from that core.
> > > 
> > > I don't understand how these things are linked together. If your driver 
> > > disables interrupts and you don't want to be disturbed, moving the 
> > > timekeeping duty doesn't move the tick itself.
> > 
> > True, but it reduces some amount of jitter from the tick itself. It's 
> > more like damage control than damage prevention.
> 
> Ok.
> 
> > 
> > > What is disturbing for you in the timekeeping update specifically, as 
> > > opposed to the tick as a whole? Is the delta of cputime added by the 
> > > jiffies and gtod update alone a problem?
> > 
> > I had some examples where the timekeeper would grab timekeeper.lock and 
> > hold it for a very long time. The reason was not easy to pinpoint, so one 
> > of the things I wanted to try was to move the timekeeper around and see 
> > what happened (yeah, debugging at its best!)
> 
> And you got better results?

Define "better": I was trying to work around a fault, and then I realized 
that I could use this for a lot more than just fixing that particular
issue.

> > > > > Does this all make sense outside the NO_HZ_FULL case?
> > > > 
> > > > In my view, it makes sense in the periodic case as well since all 
> > > > timekeeping updates then happen on the boot-cpu (unless it is 
> > > > hot-unplugged, that is).
> > > 
> > > But if we get back to your requirements, you want no interference at all. 
> > > HZ_PERIODIC doesn't look like what you want.
> > 
> > True, but I also needed a mechanism for moving the timekeeper away from a 
> > problematic core.
> > 
> > Another benefit of periodic is that it supports skewed timers, and the 
> > overhead is lower than NO_HZ when you do have timer interrupts.
> 
> Yeah that's a point.
> 
> > > > That's what I aimed at, and I *think* I managed that. I added a 
> > > > forced_timer_can_stop_tick() and let can_stop_full_tick() and 
> > > > can_stop_idle_tick() call that. I think that is sufficient; at least 
> > > > I did not see that the timer duty was transferred to another core 
> > > > afterwards.
> > > 
> > > Ok, I need to look at the details.
> > 
> > Ahem, no, my understanding of the NOHZ-cpumask was flawed. I'm reworking 
> > this based on feedback as well as the discussion below.
> 
> Ok.
> 
> > > The scenario is: CPU 0 handles timekeeping and CPU 1 is full dynticks. 
> > > Both are running, then CPU 1 goes to sleep. CPU 0 notices that CPU 1 
> > > went to sleep and thus nobody else needs the timekeeping to be up to 
> > > date. If CPU 0 is idle as well, it can go to sleep, and so it does. 
> > > Then later, if CPU 1 wakes up to do something, it sends an IPI to CPU 0 
> > > such that CPU 0 wakes up, notices that CPU 1 is alive, and runs the 
> > > timekeeping update on its behalf.
> > > 
> > > It's not yet upstream but that's the plan :)
> > 
> > So all dynticks-kernel will send an IPI to CPU0 upon wakeup to notify  
> > that they're no longer sleeping?
> 
> I guess you meant s/dynticks-kernel/full-dynticks CPU/

Yes.

> So yes that's exactly what happens!

Ok, so that gives you a nice way of knowing that the system is *not* idle. 

But how do you know that it -is- idle? If the full-dynticks CPUs are 
running a single task, they're not idle and would require the timekeeper to 
run. How do you ensure that the timekeeper does not suspend?

Perhaps it doesn't matter: if the other cores can detect when CPU#0 is idle 
and send an IPI, then it will probably be woken up pretty fast anyway.

> > Sounds like having an effective way to detect that the entire system is 
> > idle would be really nice in this scenario! That way, if CPU1 could 
> > determine that the system was not idle, it would not have to send an IPI 
> > to CPU0 to wake it up.
> 
> Exactly! The full system idle detection provides that clever information (taking care
> of races and all) that lets us know if the IPI is really needed or if it can be spared,
> depending on CPU 0's state.
> 
> This all relies on a tricky lockless state machine. Check out CONFIG_NO_HZ_FULL_SYSIDLE
> for details, it's an interesting devilry ;)

/me reads 'interesting devilry' and instantly gets a bad feeling. :)
 
> It's not yet used but we are on the right track: https://lkml.org/lkml/2013/12/17/708

-- 
Henrik Austad


end of thread

Thread overview: 15+ messages
2014-02-25 12:33 [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Henrik Austad
2014-02-25 12:33 ` [PATCH 1/6] Expose do_timer CPU from tick-common Henrik Austad
2014-02-25 12:33 ` [PATCH 2/6] Add sysfs RO interface to tick_do_timer_cpu Henrik Austad
2014-02-25 12:33 ` [PATCH 3/6] Expose do_timer CPU as RO variable to userspace via sysctl Henrik Austad
2014-02-25 12:33 ` [PATCH 4/6] Force a specific CPU to handle all do_timer() events Henrik Austad
2014-02-25 12:34 ` [PATCH 5/6] Expose the forced_timer_cpu to userspace via sysctl as R/W Henrik Austad
2014-02-25 12:34 ` [PATCH 6/6] Expose tick_set_forced_cpu() via sysfs Henrik Austad
2014-02-25 14:19 ` [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace Frederic Weisbecker
2014-02-26  8:16   ` Henrik Austad
2014-02-26 13:02     ` Frederic Weisbecker
2014-02-27  8:37       ` Henrik Austad
2014-02-27 13:56         ` Frederic Weisbecker
2014-02-28  9:40           ` Henrik Austad
2014-02-28 23:08             ` Frederic Weisbecker
2014-03-03 10:42               ` Henrik Austad
