* [PATCH 0/7] nohz: Tick dependency mask v3
@ 2015-11-13 14:22 Frederic Weisbecker
  2015-11-13 14:22 ` [PATCH 1/7] atomic: Export fetch_or() Frederic Weisbecker
                   ` (6 more replies)
  0 siblings, 7 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-13 14:22 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter, Ingo Molnar,
	Viresh Kumar, Rik van Riel

This series changes the nohz full infrastructure to let subsystems notify
when they have a tick dependency. Checking whether we can stop the tick is
thus performed in a push rather than a pull model; see previous iterations:
http://lkml.kernel.org/r/1437669735-8786-1-git-send-email-fweisbec@gmail.com
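
As a rough illustration of the push model, here is a minimal user-space sketch
(names are made up for the example, this is not kernel code): subsystems flip
bits in a shared dependency mask when their state changes, so the tick-stop
path only reads one word instead of polling every subsystem.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum { DEP_POSIX_TIMER = 1 << 0, DEP_PERF = 1 << 1, DEP_SCHED = 1 << 2 };

static atomic_uint tick_dependency;

/* Subsystems push their state changes into the mask... */
static void set_dep(unsigned int mask)   { atomic_fetch_or(&tick_dependency, mask); }
static void clear_dep(unsigned int mask) { atomic_fetch_and(&tick_dependency, ~mask); }

/* ...so the tick path only checks a single word. */
static bool can_stop_tick(void) { return atomic_load(&tick_dependency) == 0; }

int main(void)
{
	set_dep(DEP_PERF);
	printf("can stop tick: %d\n", can_stop_tick());		/* 0 */
	clear_dep(DEP_PERF);
	printf("can stop tick: %d\n", can_stop_tick());		/* 1 */
	return 0;
}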

Changes here:

* Reuse fetch_or()
* Change posix cpu timers to per-thread and per-signal dependencies (we may
  want to merge all that to the thread level only) instead of a global dependency.
* The scheduler tick dependency is now evaluated at the CPU level instead
  of the task level. We no longer depend on the current task's policy
  but on all the tasks in the runqueue. That too can be further optimized.
  
Only lightly tested; mostly asking for opinions on the current state.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	timers/core-v5

HEAD: ca9e998d0be7e6ef1ffab888436cbabf5446c865

Thanks,
	Frederic
---

Frederic Weisbecker (7):
      atomic: Export fetch_or()
      nohz: New tick dependency mask
      perf: Migrate perf to use new tick dependency mask model
      sched: Account rr and fifo tasks separately
      sched: Migrate sched to use new tick dependency mask model
      posix-cpu-timers: Migrate to use new tick dependency mask model
      sched-clock: Migrate to use new tick dependency mask model


 include/linux/atomic.h         |  18 ++++++
 include/linux/perf_event.h     |   6 --
 include/linux/posix-timers.h   |   3 -
 include/linux/sched.h          |  11 +++-
 include/linux/tick.h           |  21 ++++++
 kernel/events/core.c           |  22 ++-----
 kernel/sched/clock.c           |   5 ++
 kernel/sched/core.c            |  47 ++++++--------
 kernel/sched/rt.c              |  34 ++++++++++
 kernel/sched/sched.h           |  49 ++++++++++----
 kernel/time/posix-cpu-timers.c |  52 ++++-----------
 kernel/time/tick-sched.c       | 142 +++++++++++++++++++++++++++++++++--------
 kernel/time/tick-sched.h       |   1 +
 13 files changed, 274 insertions(+), 137 deletions(-)


* [PATCH 1/7] atomic: Export fetch_or()
  2015-11-13 14:22 [PATCH 0/7] nohz: Tick dependency mask v3 Frederic Weisbecker
@ 2015-11-13 14:22 ` Frederic Weisbecker
  2015-11-24 15:58   ` Chris Metcalf
  2015-11-13 14:22 ` [PATCH 2/7] nohz: New tick dependency mask Frederic Weisbecker
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-13 14:22 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter, Ingo Molnar,
	Viresh Kumar, Rik van Riel

Export fetch_or() that's implemented and used internally by the
scheduler. We are going to use it for NO_HZ so make it generally
available.
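
For reference, the semantics can be mimicked in user space with C11 atomics
(illustrative only, not the kernel implementation): the mask is ORed in and
the value seen *before* the OR is returned.

#include <stdatomic.h>
#include <stdio.h>

static unsigned long my_fetch_or(atomic_ulong *ptr, unsigned long mask)
{
	unsigned long old = atomic_load(ptr);

	/* Retry if another writer modified *ptr between the load and the CAS. */
	while (!atomic_compare_exchange_weak(ptr, &old, old | mask))
		;
	return old;
}

int main(void)
{
	atomic_ulong flags = 2;
	unsigned long old = my_fetch_or(&flags, 1);

	printf("old=%#lx new=%#lx\n", old, atomic_load(&flags));	/* old=0x2 new=0x3 */
	return 0;
}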

Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/atomic.h | 18 ++++++++++++++++++
 kernel/sched/core.c    | 14 --------------
 2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 00a5763..c3b99f8 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -451,6 +451,24 @@ static inline int atomic_dec_if_positive(atomic_t *v)
 }
 #endif
 
+/**
+ * fetch_or - perform *ptr |= mask and return old value of *ptr
+ * @ptr: pointer to value
+ * @mask: mask to OR on the value
+ *
+ * cmpxchg based fetch_or, macro so it works for different integer types
+ */
+#define fetch_or(ptr, mask)						\
+({	typeof(*(ptr)) __old, __val = *(ptr);				\
+	for (;;) {							\
+		__old = cmpxchg((ptr), __val, __val | (mask));		\
+		if (__old == __val)					\
+			break;						\
+		__val = __old;						\
+	}								\
+	__old;								\
+})
+
 #include <asm-generic/atomic-long.h>
 #ifdef CONFIG_GENERIC_ATOMIC64
 #include <asm-generic/atomic64.h>
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 10a8faa..63d22bd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -453,20 +453,6 @@ static inline void init_hrtick(void)
 }
 #endif	/* CONFIG_SCHED_HRTICK */
 
-/*
- * cmpxchg based fetch_or, macro so it works for different integer types
- */
-#define fetch_or(ptr, val)						\
-({	typeof(*(ptr)) __old, __val = *(ptr);				\
- 	for (;;) {							\
- 		__old = cmpxchg((ptr), __val, __val | (val));		\
- 		if (__old == __val)					\
- 			break;						\
- 		__val = __old;						\
- 	}								\
- 	__old;								\
-})
-
 #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
 /*
  * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
-- 
2.5.3



* [PATCH 2/7] nohz: New tick dependency mask
  2015-11-13 14:22 [PATCH 0/7] nohz: Tick dependency mask v3 Frederic Weisbecker
  2015-11-13 14:22 ` [PATCH 1/7] atomic: Export fetch_or() Frederic Weisbecker
@ 2015-11-13 14:22 ` Frederic Weisbecker
  2015-11-24 16:19   ` Chris Metcalf
  2015-12-01 20:41   ` Peter Zijlstra
  2015-11-13 14:22 ` [PATCH 3/7] perf: Migrate perf to use new tick dependency mask model Frederic Weisbecker
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-13 14:22 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter, Ingo Molnar,
	Viresh Kumar, Rik van Riel

The tick dependency is evaluated on every IRQ. This is a batch of checks
which determine whether it is safe to stop the tick or not. These checks
are often split into many details: posix cpu timers, scheduler, sched clock,
perf events. Each of these is made of smaller details: posix cpu
timers involve checking process wide timers then thread wide timers; perf
involves checking freq events then more per-cpu details.

Checking these details asynchronously every time we update the full
dynticks state brings avoidable overhead and a messy layout.

Let's instead introduce tick dependency masks: one for system wide
dependencies (unstable sched clock), one for CPU wide dependencies (sched,
perf), and task/signal level dependencies. The subsystems are responsible
for setting and clearing their dependencies through a set of APIs that
take care of concurrent dependency mask modifications and kick the targets
to restart the relevant CPU tick whenever needed.

This new dependency engine stays beside the old one until all subsystems
having a tick dependency are converted to it.
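
From a subsystem's point of view, the intended usage of the new APIs looks
roughly like this (the calls are taken from the code below; "cpu" and "task"
stand for whatever CPU number or task_struct pointer the caller has at hand):

	/* System wide dependency: kicks all full dynticks CPUs when first set */
	tick_nohz_set_dep(TICK_CLOCK_UNSTABLE_BIT);
	tick_nohz_clear_dep(TICK_CLOCK_UNSTABLE_BIT);

	/* CPU wide dependency: kicks the target CPU when first set */
	tick_nohz_set_dep_cpu(TICK_PERF_EVENTS_BIT, cpu);
	tick_nohz_clear_dep_cpu(TICK_PERF_EVENTS_BIT, cpu);

	/*
	 * Task/signal level dependency, with a delayed (workqueue) kick
	 * for irqs-disabled callers such as posix cpu timers.
	 */
	__tick_nohz_set_dep_delayed(TICK_POSIX_TIMER_BIT, &task->tick_dependency);
	__tick_nohz_clear_dep(TICK_POSIX_TIMER_BIT, &task->tick_dependency);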

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/sched.h    |   8 +++
 include/linux/tick.h     |  21 ++++++++
 kernel/time/tick-sched.c | 130 +++++++++++++++++++++++++++++++++++++++++++++--
 kernel/time/tick-sched.h |   1 +
 4 files changed, 155 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f87559d..a65782f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -706,6 +706,10 @@ struct signal_struct {
 	/* Earliest-expiration cache. */
 	struct task_cputime cputime_expires;
 
+#ifdef CONFIG_NO_HZ_FULL
+	unsigned long tick_dependency;
+#endif
+
 	struct list_head cpu_timers[3];
 
 	struct pid *tty_old_pgrp;
@@ -1528,6 +1532,10 @@ struct task_struct {
 		VTIME_SYS,
 	} vtime_snap_whence;
 #endif
+
+#ifdef CONFIG_NO_HZ_FULL
+	unsigned long tick_dependency;
+#endif
 	unsigned long nvcsw, nivcsw; /* context switch counts */
 	u64 start_time;		/* monotonic time in nsec */
 	u64 real_start_time;	/* boot based time in nsec */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index e312219..472fd59 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -97,6 +97,18 @@ static inline void tick_broadcast_exit(void)
 	tick_broadcast_oneshot_control(TICK_BROADCAST_EXIT);
 }
 
+enum tick_dependency_bit {
+	TICK_POSIX_TIMER_BIT	= 0,
+	TICK_PERF_EVENTS_BIT	= 1,
+	TICK_SCHED_BIT		= 2,
+	TICK_CLOCK_UNSTABLE_BIT	= 3
+};
+
+#define TICK_POSIX_TIMER_MASK		(1 << TICK_POSIX_TIMER_BIT)
+#define TICK_PERF_EVENTS_MASK		(1 << TICK_PERF_EVENTS_BIT)
+#define TICK_SCHED_MASK			(1 << TICK_SCHED_BIT)
+#define TICK_CLOCK_UNSTABLE_MASK	(1 << TICK_CLOCK_UNSTABLE_BIT)
+
 #ifdef CONFIG_NO_HZ_COMMON
 extern int tick_nohz_tick_stopped(void);
 extern void tick_nohz_idle_enter(void);
@@ -152,6 +164,15 @@ static inline int housekeeping_any_cpu(void)
 	return cpumask_any_and(housekeeping_mask, cpu_online_mask);
 }
 
+extern void __tick_nohz_set_dep_delayed(enum tick_dependency_bit bit,
+					unsigned long *dep);
+extern void __tick_nohz_clear_dep(enum tick_dependency_bit bit,
+				  unsigned long *dep);
+extern void tick_nohz_set_dep(enum tick_dependency_bit bit);
+extern void tick_nohz_clear_dep(enum tick_dependency_bit bit);
+extern void tick_nohz_set_dep_cpu(enum tick_dependency_bit bit, int cpu);
+extern void tick_nohz_clear_dep_cpu(enum tick_dependency_bit bit, int cpu);
+
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7c7ec45..b9ea21d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -156,11 +156,53 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
 cpumask_var_t tick_nohz_full_mask;
 cpumask_var_t housekeeping_mask;
 bool tick_nohz_full_running;
+static unsigned long tick_dependency;
 
-static bool can_stop_full_tick(void)
+static void trace_tick_dependency(unsigned long dep)
+{
+	if (dep & TICK_POSIX_TIMER_MASK) {
+		trace_tick_stop(0, "posix timers running\n");
+		return;
+	}
+
+	if (dep & TICK_PERF_EVENTS_MASK) {
+		trace_tick_stop(0, "perf events running\n");
+		return;
+	}
+
+	if (dep & TICK_SCHED_MASK) {
+		trace_tick_stop(0, "more than 1 task in runqueue\n");
+		return;
+	}
+
+	if (dep & TICK_CLOCK_UNSTABLE_MASK)
+		trace_tick_stop(0, "unstable sched clock\n");
+}
+
+static bool can_stop_full_tick(struct tick_sched *ts)
 {
 	WARN_ON_ONCE(!irqs_disabled());
 
+	if (tick_dependency) {
+		trace_tick_dependency(tick_dependency);
+		return false;
+	}
+
+	if (ts->tick_dependency) {
+		trace_tick_dependency(ts->tick_dependency);
+		return false;
+	}
+
+	if (current->tick_dependency) {
+		trace_tick_dependency(current->tick_dependency);
+		return false;
+	}
+
+	if (current->signal->tick_dependency) {
+		trace_tick_dependency(current->signal->tick_dependency);
+		return false;
+	}
+
 	if (!sched_can_stop_tick()) {
 		trace_tick_stop(0, "more than 1 task in runqueue\n");
 		return false;
@@ -176,9 +218,10 @@ static bool can_stop_full_tick(void)
 		return false;
 	}
 
-	/* sched_clock_tick() needs us? */
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 	/*
+	 * sched_clock_tick() needs us?
+	 *
 	 * TODO: kick full dynticks CPUs when
 	 * sched_clock_stable is set.
 	 */
@@ -253,6 +296,79 @@ void tick_nohz_full_kick_all(void)
 	preempt_enable();
 }
 
+void __tick_nohz_clear_dep(enum tick_dependency_bit bit,
+				       unsigned long *dep)
+{
+	clear_bit(bit, dep);
+}
+
+static void kick_all_work_fn(struct work_struct *work)
+{
+       tick_nohz_full_kick_all();
+}
+static DECLARE_WORK(kick_all_work, kick_all_work_fn);
+
+void __tick_nohz_set_dep_delayed(enum tick_dependency_bit bit, unsigned long *dep)
+{
+	unsigned long prev;
+
+	prev = fetch_or(dep, BIT_MASK(bit));
+	if (!prev) {
+		/*
+		* We need the IPIs to be sent from sane process context.
+		* The posix cpu timers are always set with irqs disabled.
+		*/
+		schedule_work(&kick_all_work);
+	}
+}
+
+/*
+ * Set a global tick dependency. Lets do the wide IPI kick asynchronously
+ * for callers with irqs disabled.
+ */
+void tick_nohz_set_dep(enum tick_dependency_bit bit)
+{
+	unsigned long prev;
+
+	prev = fetch_or(&tick_dependency, BIT_MASK(bit));
+	if (!prev)
+		tick_nohz_full_kick_all();
+}
+
+void tick_nohz_clear_dep(enum tick_dependency_bit bit)
+{
+	__tick_nohz_clear_dep(bit, &tick_dependency);
+}
+
+void tick_nohz_set_dep_cpu(enum tick_dependency_bit bit, int cpu)
+{
+	unsigned long prev;
+	struct tick_sched *ts;
+
+	ts = per_cpu_ptr(&tick_cpu_sched, cpu);
+
+	prev = fetch_or(&ts->tick_dependency, BIT_MASK(bit));
+	if (!prev) {
+		preempt_disable();
+		/* Perf needs local kick that is NMI safe */
+		if (cpu == smp_processor_id()) {
+			tick_nohz_full_kick();
+		} else {
+			/* Remote irq work not NMI-safe */
+			WARN_ON_ONCE(in_nmi());
+			tick_nohz_full_kick_cpu(cpu);
+		}
+		preempt_enable();
+	}
+}
+
+void tick_nohz_clear_dep_cpu(enum tick_dependency_bit bit, int cpu)
+{
+	struct tick_sched *ts = per_cpu_ptr(&tick_cpu_sched, cpu);
+
+	__tick_nohz_clear_dep(bit, &ts->tick_dependency);
+}
+
 /*
  * Re-evaluate the need for the tick as we switch the current task.
  * It might need the tick due to per task/process properties:
@@ -261,15 +377,19 @@ void tick_nohz_full_kick_all(void)
 void __tick_nohz_task_switch(void)
 {
 	unsigned long flags;
+	struct tick_sched *ts;
 
 	local_irq_save(flags);
 
 	if (!tick_nohz_full_cpu(smp_processor_id()))
 		goto out;
 
-	if (tick_nohz_tick_stopped() && !can_stop_full_tick())
-		tick_nohz_full_kick();
+	ts = this_cpu_ptr(&tick_cpu_sched);
 
+	if (ts->tick_stopped) {
+		if (current->tick_dependency || current->signal->tick_dependency)
+			tick_nohz_full_kick();
+	}
 out:
 	local_irq_restore(flags);
 }
@@ -722,7 +842,7 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
 		return;
 
-	if (can_stop_full_tick())
+	if (can_stop_full_tick(ts))
 		tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
 	else if (ts->tick_stopped)
 		tick_nohz_restart_sched_tick(ts, ktime_get());
diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
index a4a8d4e..d327f70 100644
--- a/kernel/time/tick-sched.h
+++ b/kernel/time/tick-sched.h
@@ -60,6 +60,7 @@ struct tick_sched {
 	u64				next_timer;
 	ktime_t				idle_expires;
 	int				do_timer_last;
+	unsigned long			tick_dependency;
 };
 
 extern struct tick_sched *tick_get_tick_sched(int cpu);
-- 
2.5.3



* [PATCH 3/7] perf: Migrate perf to use new tick dependency mask model
  2015-11-13 14:22 [PATCH 0/7] nohz: Tick dependency mask v3 Frederic Weisbecker
  2015-11-13 14:22 ` [PATCH 1/7] atomic: Export fetch_or() Frederic Weisbecker
  2015-11-13 14:22 ` [PATCH 2/7] nohz: New tick dependency mask Frederic Weisbecker
@ 2015-11-13 14:22 ` Frederic Weisbecker
  2015-11-24 16:19   ` Chris Metcalf
  2015-11-13 14:22 ` [PATCH 4/7] sched: Account rr and fifo tasks separately Frederic Weisbecker
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-13 14:22 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter, Ingo Molnar,
	Viresh Kumar, Rik van Riel

Instead of providing asynchronous checks for the nohz subsystem to verify
perf event tick dependency, migrate perf to the new mask.

Perf needs the tick for two situations:

1) Freq events. We could set the tick dependency when those are
installed on a CPU context. But setting a global dependency on top of
the global freq events accounting is much easier. If people want that
to be optimized, we can still refine it at the per-CPU tick dependency
level. This patch doesn't change the current behaviour anyway.

2) Throttled events: this is a per-cpu dependency.
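
For the freq case this boils down to a first/last reference pattern
(condensed from the hunks below):

	/* account_event(): the first freq event in the system sets the dependency */
	if (event->attr.freq) {
		if (atomic_inc_return(&nr_freq_events) == 1)
			tick_nohz_set_dep(TICK_PERF_EVENTS_BIT);
	}

	/* unaccount_event(): the last one going away clears it */
	if (event->attr.freq) {
		if (atomic_dec_and_test(&nr_freq_events))
			tick_nohz_clear_dep(TICK_PERF_EVENTS_BIT);
	}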

Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/perf_event.h |  6 ------
 kernel/events/core.c       | 22 +++++++---------------
 kernel/time/tick-sched.c   |  6 ------
 3 files changed, 7 insertions(+), 27 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 092a0e8..63a377f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1023,12 +1023,6 @@ static inline void perf_event_task_tick(void)				{ }
 static inline int perf_event_release_kernel(struct perf_event *event)	{ return 0; }
 #endif
 
-#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_NO_HZ_FULL)
-extern bool perf_event_can_stop_tick(void);
-#else
-static inline bool perf_event_can_stop_tick(void)			{ return true; }
-#endif
-
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
 extern void perf_restore_debug_store(void);
 #else
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b11756f..1ee70ba 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3099,17 +3099,6 @@ done:
 	return rotate;
 }
 
-#ifdef CONFIG_NO_HZ_FULL
-bool perf_event_can_stop_tick(void)
-{
-	if (atomic_read(&nr_freq_events) ||
-	    __this_cpu_read(perf_throttled_count))
-		return false;
-	else
-		return true;
-}
-#endif
-
 void perf_event_task_tick(void)
 {
 	struct list_head *head = this_cpu_ptr(&active_ctx_list);
@@ -3120,6 +3109,7 @@ void perf_event_task_tick(void)
 
 	__this_cpu_inc(perf_throttled_seq);
 	throttled = __this_cpu_xchg(perf_throttled_count, 0);
+	tick_nohz_clear_dep_cpu(TICK_PERF_EVENTS_BIT, smp_processor_id());
 
 	list_for_each_entry_safe(ctx, tmp, head, active_ctx_list)
 		perf_adjust_freq_unthr_context(ctx, throttled);
@@ -3540,8 +3530,10 @@ static void unaccount_event(struct perf_event *event)
 		atomic_dec(&nr_comm_events);
 	if (event->attr.task)
 		atomic_dec(&nr_task_events);
-	if (event->attr.freq)
-		atomic_dec(&nr_freq_events);
+	if (event->attr.freq) {
+		if (atomic_dec_and_test(&nr_freq_events))
+			tick_nohz_clear_dep(TICK_PERF_EVENTS_BIT);
+	}
 	if (event->attr.context_switch) {
 		static_key_slow_dec_deferred(&perf_sched_events);
 		atomic_dec(&nr_switch_events);
@@ -6307,9 +6299,9 @@ static int __perf_event_overflow(struct perf_event *event,
 		if (unlikely(throttle
 			     && hwc->interrupts >= max_samples_per_tick)) {
 			__this_cpu_inc(perf_throttled_count);
+			tick_nohz_set_dep_cpu(TICK_PERF_EVENTS_BIT, smp_processor_id());
 			hwc->interrupts = MAX_INTERRUPTS;
 			perf_log_throttle(event, 0);
-			tick_nohz_full_kick();
 			ret = 1;
 		}
 	}
@@ -7695,7 +7687,7 @@ static void account_event(struct perf_event *event)
 		atomic_inc(&nr_task_events);
 	if (event->attr.freq) {
 		if (atomic_inc_return(&nr_freq_events) == 1)
-			tick_nohz_full_kick_all();
+			tick_nohz_set_dep(TICK_PERF_EVENTS_BIT);
 	}
 	if (event->attr.context_switch) {
 		atomic_inc(&nr_switch_events);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b9ea21d..9bb19c8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -22,7 +22,6 @@
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
-#include <linux/perf_event.h>
 #include <linux/context_tracking.h>
 
 #include <asm/irq_regs.h>
@@ -213,11 +212,6 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 		return false;
 	}
 
-	if (!perf_event_can_stop_tick()) {
-		trace_tick_stop(0, "perf events running\n");
-		return false;
-	}
-
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 	/*
 	 * sched_clock_tick() needs us?
-- 
2.5.3



* [PATCH 4/7] sched: Account rr and fifo tasks separately
  2015-11-13 14:22 [PATCH 0/7] nohz: Tick dependency mask v3 Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2015-11-13 14:22 ` [PATCH 3/7] perf: Migrate perf to use new tick dependency mask model Frederic Weisbecker
@ 2015-11-13 14:22 ` Frederic Weisbecker
  2015-12-02 12:53   ` Peter Zijlstra
  2015-11-13 14:22 ` [PATCH 5/7] sched: Migrate sched to use new tick dependency mask model Frederic Weisbecker
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-13 14:22 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter, Ingo Molnar,
	Viresh Kumar, Rik van Riel

In order to evaluate tick dependency, we need to account SCHED_RR and
SCHED_FIFO tasks separately as those policies don't have the same
preemption requirements.

We still keep rt_nr_running as a cache to avoid adding up rr_nr_running
and fifo_nr_running all over the place.
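
The invariant being maintained (and asserted with WARN_ON_ONCE() in the hunks
below) is simply:

	rt_rq->rt_nr_running == rt_rq->fifo_nr_running + rt_rq->rr_nr_running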

Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/rt.c    | 34 ++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 36 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d2ea593..0e80458 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1152,12 +1152,43 @@ unsigned int rt_se_nr_running(struct sched_rt_entity *rt_se)
 }
 
 static inline
+unsigned int rt_se_fifo_nr_running(struct sched_rt_entity *rt_se)
+{
+	struct rt_rq *group_rq = group_rt_rq(rt_se);
+	struct task_struct *tsk;
+
+	if (group_rq)
+		return group_rq->fifo_nr_running;
+
+	tsk = rt_task_of(rt_se);
+
+	return (tsk->policy == SCHED_FIFO) ? 1 : 0;
+}
+
+static inline
+unsigned int rt_se_rr_nr_running(struct sched_rt_entity *rt_se)
+{
+	struct rt_rq *group_rq = group_rt_rq(rt_se);
+	struct task_struct *tsk;
+
+	if (group_rq)
+		return group_rq->rr_nr_running;
+
+	tsk = rt_task_of(rt_se);
+
+	return (tsk->policy == SCHED_RR) ? 1 : 0;
+}
+
+static inline
 void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
 	int prio = rt_se_prio(rt_se);
 
 	WARN_ON(!rt_prio(prio));
 	rt_rq->rt_nr_running += rt_se_nr_running(rt_se);
+	rt_rq->fifo_nr_running += rt_se_fifo_nr_running(rt_se);
+	rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);
+	WARN_ON_ONCE(rt_rq->rt_nr_running != rt_rq->fifo_nr_running + rt_rq->rr_nr_running);
 
 	inc_rt_prio(rt_rq, prio);
 	inc_rt_migration(rt_se, rt_rq);
@@ -1170,6 +1201,9 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 	WARN_ON(!rt_prio(rt_se_prio(rt_se)));
 	WARN_ON(!rt_rq->rt_nr_running);
 	rt_rq->rt_nr_running -= rt_se_nr_running(rt_se);
+	rt_rq->fifo_nr_running -= rt_se_fifo_nr_running(rt_se);
+	rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);
+	WARN_ON_ONCE(rt_rq->rt_nr_running != rt_rq->fifo_nr_running + rt_rq->rr_nr_running);
 
 	dec_rt_prio(rt_rq, rt_se_prio(rt_se));
 	dec_rt_migration(rt_se, rt_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6d2a119..cfafbdd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -433,6 +433,8 @@ static inline int rt_bandwidth_enabled(void)
 struct rt_rq {
 	struct rt_prio_array active;
 	unsigned int rt_nr_running;
+	unsigned int fifo_nr_running;
+	unsigned int rr_nr_running;
 #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
 	struct {
 		int curr; /* highest queued rt task prio */
-- 
2.5.3



* [PATCH 5/7] sched: Migrate sched to use new tick dependency mask model
  2015-11-13 14:22 [PATCH 0/7] nohz: Tick dependency mask v3 Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2015-11-13 14:22 ` [PATCH 4/7] sched: Account rr and fifo tasks separately Frederic Weisbecker
@ 2015-11-13 14:22 ` Frederic Weisbecker
  2015-11-13 14:22 ` [PATCH 6/7] posix-cpu-timers: Migrate " Frederic Weisbecker
  2015-11-13 14:22 ` [PATCH 7/7] sched-clock: " Frederic Weisbecker
  6 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-13 14:22 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter, Ingo Molnar,
	Viresh Kumar, Rik van Riel

Instead of providing asynchronous checks for the nohz subsystem to verify
sched tick dependency, migrate sched to the new mask.

Every time a task is enqueued or dequeued, we evaluate the state of the
tick dependency on top of the policy of the tasks in the runqueue, in
order of priority:

SCHED_DEADLINE: Need the tick in order to periodically check for runtime
SCHED_FIFO    : Don't need the tick (no round-robin)
SCHED_RR      : Need the tick if more than 1 task of the same priority
                for round-robin (simplified by checking if there is more
                than one SCHED_RR task, no matter what priority).
SCHED_NORMAL  : Need the tick if more than 1 task for round-robin.

We could optimize that further with one flag per sched policy on the tick
dependency mask and perform only the checks relevant to the policy
concerned by an enqueue/dequeue operation.

Since the checks aren't based on the current task anymore, we could get
rid of the task switch hook but it's still needed for posix cpu
timers.
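
Condensed, the order above maps to the following checks on the runqueue
counters (see the sched_can_stop_tick() rewrite below):

	if (rq->dl.dl_nr_running)
		return false;				/* SCHED_DEADLINE always needs the tick */
	if (rq->rt.fifo_nr_running)
		return true;				/* SCHED_FIFO: no round-robin needed */
	if (rq->rt.rr_nr_running)
		return rq->rt.rr_nr_running == 1;	/* SCHED_RR: stop only if it's the only RR task */
	return rq->cfs.nr_running <= 1;			/* SCHED_NORMAL: stop only if no preemption needed */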

Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/sched.h    |  3 ---
 kernel/sched/core.c      | 33 ++++++++++++++++++---------------
 kernel/sched/sched.h     | 47 +++++++++++++++++++++++++++++++++--------------
 kernel/time/tick-sched.c |  5 -----
 4 files changed, 51 insertions(+), 37 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a65782f..116fa9b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2348,10 +2348,7 @@ static inline void wake_up_nohz_cpu(int cpu) { }
 #endif
 
 #ifdef CONFIG_NO_HZ_FULL
-extern bool sched_can_stop_tick(void);
 extern u64 scheduler_tick_max_deferment(void);
-#else
-static inline bool sched_can_stop_tick(void) { return false; }
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 63d22bd..73957e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -701,31 +701,34 @@ static inline bool got_nohz_idle_kick(void)
 #endif /* CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
-bool sched_can_stop_tick(void)
+bool sched_can_stop_tick(struct rq *rq)
 {
+	/* Deadline tasks, even if single, need the tick */
+	if (rq->dl.dl_nr_running)
+		return false;
+
 	/*
-	 * FIFO realtime policy runs the highest priority task. Other runnable
-	 * tasks are of a lower priority. The scheduler tick does nothing.
+	 * FIFO realtime policy runs the highest priority task (after DEADLINE).
+	 * Other runnable tasks are of a lower priority. The scheduler tick
+	 * isn't needed.
 	 */
-	if (current->policy == SCHED_FIFO)
+	if (rq->rt.fifo_nr_running)
 		return true;
 
 	/*
 	 * Round-robin realtime tasks time slice with other tasks at the same
-	 * realtime priority. Is this task the only one at this priority?
+	 * realtime priority.
 	 */
-	if (current->policy == SCHED_RR) {
-		struct sched_rt_entity *rt_se = &current->rt;
-
-		return rt_se->run_list.prev == rt_se->run_list.next;
+	if (rq->rt.rr_nr_running) {
+		if (rq->rt.rr_nr_running == 1)
+			return true;
+		else
+			return false;
 	}
 
-	/*
-	 * More than one running task need preemption.
-	 * nr_running update is assumed to be visible
-	 * after IPI is sent from wakers.
-	 */
-	if (this_rq()->nr_running > 1)
+
+	/* Normal multitasking need periodic preemption checks */
+	if (rq->cfs.nr_running > 1)
 		return false;
 
 	return true;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cfafbdd..e2946f7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1313,6 +1313,35 @@ unsigned long to_ratio(u64 period, u64 runtime);
 
 extern void init_entity_runnable_average(struct sched_entity *se);
 
+#ifdef CONFIG_NO_HZ_FULL
+extern bool sched_can_stop_tick(struct rq *rq);
+
+/*
+ * Tick may be needed by tasks in the runqueue depending on their policy and
+ * requirements. If tick is needed, lets send the target an IPI to kick it out of
+ * nohz mode if necessary.
+ */
+static inline void sched_update_tick_dependency(struct rq *rq)
+{
+	int cpu;
+
+	if (!tick_nohz_full_enabled())
+		return;
+
+	cpu = cpu_of(rq);
+
+	if (!tick_nohz_full_cpu(rq->cpu))
+		return;
+
+	if (sched_can_stop_tick(rq))
+		tick_nohz_clear_dep_cpu(TICK_SCHED_BIT, cpu);
+	else
+		tick_nohz_set_dep_cpu(TICK_SCHED_BIT, cpu);
+}
+#else
+static inline void sched_update_tick_dependency(struct rq *rq) { }
+#endif
+
 static inline void add_nr_running(struct rq *rq, unsigned count)
 {
 	unsigned prev_nr = rq->nr_running;
@@ -1324,26 +1353,16 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 		if (!rq->rd->overload)
 			rq->rd->overload = true;
 #endif
-
-#ifdef CONFIG_NO_HZ_FULL
-		if (tick_nohz_full_cpu(rq->cpu)) {
-			/*
-			 * Tick is needed if more than one task runs on a CPU.
-			 * Send the target an IPI to kick it out of nohz mode.
-			 *
-			 * We assume that IPI implies full memory barrier and the
-			 * new value of rq->nr_running is visible on reception
-			 * from the target.
-			 */
-			tick_nohz_full_kick_cpu(rq->cpu);
-		}
-#endif
 	}
+
+	sched_update_tick_dependency(rq);
 }
 
 static inline void sub_nr_running(struct rq *rq, unsigned count)
 {
 	rq->nr_running -= count;
+	/* Check if we still need preemption */
+	sched_update_tick_dependency(rq);
 }
 
 static inline void rq_last_tick_reset(struct rq *rq)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9bb19c8..d670729 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -202,11 +202,6 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 		return false;
 	}
 
-	if (!sched_can_stop_tick()) {
-		trace_tick_stop(0, "more than 1 task in runqueue\n");
-		return false;
-	}
-
 	if (!posix_cpu_timers_can_stop_tick(current)) {
 		trace_tick_stop(0, "posix timers running\n");
 		return false;
-- 
2.5.3



* [PATCH 6/7] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-11-13 14:22 [PATCH 0/7] nohz: Tick dependency mask v3 Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2015-11-13 14:22 ` [PATCH 5/7] sched: Migrate sched to use new tick dependency mask model Frederic Weisbecker
@ 2015-11-13 14:22 ` Frederic Weisbecker
  2015-11-13 14:22 ` [PATCH 7/7] sched-clock: " Frederic Weisbecker
  6 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-13 14:22 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter, Ingo Molnar,
	Viresh Kumar, Rik van Riel

Instead of providing asynchronous checks for the nohz subsystem to verify
posix cpu timers tick dependency, migrate the latter to the new mask.

In order to keep track of the running timers and expose the tick
dependency accordingly, we must probe the timers' queuing and dequeuing
on the thread and process lists.

Unfortunately it implies both task and signal level dependencies. We
should be able to further optimize this and merge all that into the task
level dependency, at the cost of a bit of complexity and maybe some overhead.
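
The resulting lifecycle, condensed from the hunks below: arming a timer sets
the bit on the owning task or signal struct (with a delayed kick, since timers
are armed with irqs disabled), and the bit is cleared once no timer is left
pending at that level:

	/* arm_timer(): a per-thread or process-wide timer gets queued */
	__tick_nohz_set_dep_delayed(TICK_POSIX_TIMER_BIT, tick_dep);

	/* check_thread_timers(): all thread timers have expired */
	if (task_cputime_zero(tsk_expires))
		__tick_nohz_clear_dep(TICK_POSIX_TIMER_BIT, &tsk->tick_dependency);

	/* stop_process_timers(): the process-wide cputimer stops */
	__tick_nohz_clear_dep(TICK_POSIX_TIMER_BIT, &sig->tick_dependency);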

Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/posix-timers.h   |  3 ---
 kernel/time/posix-cpu-timers.c | 52 +++++++++---------------------------------
 kernel/time/tick-sched.c       |  5 ----
 3 files changed, 11 insertions(+), 49 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 907f3fd..62d44c1 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -128,9 +128,6 @@ void posix_cpu_timer_schedule(struct k_itimer *timer);
 void run_posix_cpu_timers(struct task_struct *task);
 void posix_cpu_timers_exit(struct task_struct *task);
 void posix_cpu_timers_exit_group(struct task_struct *task);
-
-bool posix_cpu_timers_can_stop_tick(struct task_struct *tsk);
-
 void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
 			   cputime_t *newval, cputime_t *oldval);
 
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index f5e86d2..3751330 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -333,7 +333,6 @@ static int posix_cpu_clock_get(const clockid_t which_clock, struct timespec *tp)
 	return err;
 }
 
-
 /*
  * Validate the clockid_t for a new CPU-clock timer, and initialize the timer.
  * This is called from sys_timer_create() and do_cpu_nanosleep() with the
@@ -474,13 +473,16 @@ static void arm_timer(struct k_itimer *timer)
 	struct task_cputime *cputime_expires;
 	struct cpu_timer_list *const nt = &timer->it.cpu;
 	struct cpu_timer_list *next;
+	unsigned long *tick_dep;
 
 	if (CPUCLOCK_PERTHREAD(timer->it_clock)) {
 		head = p->cpu_timers;
 		cputime_expires = &p->cputime_expires;
+		tick_dep = &p->tick_dependency;
 	} else {
 		head = p->signal->cpu_timers;
 		cputime_expires = &p->signal->cputime_expires;
+		tick_dep = &p->signal->tick_dependency;
 	}
 	head += CPUCLOCK_WHICH(timer->it_clock);
 
@@ -517,6 +519,7 @@ static void arm_timer(struct k_itimer *timer)
 				cputime_expires->sched_exp = exp;
 			break;
 		}
+		__tick_nohz_set_dep_delayed(TICK_POSIX_TIMER_BIT, tick_dep);
 	}
 }
 
@@ -582,39 +585,6 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
 	return 0;
 }
 
-#ifdef CONFIG_NO_HZ_FULL
-static void nohz_kick_work_fn(struct work_struct *work)
-{
-	tick_nohz_full_kick_all();
-}
-
-static DECLARE_WORK(nohz_kick_work, nohz_kick_work_fn);
-
-/*
- * We need the IPIs to be sent from sane process context.
- * The posix cpu timers are always set with irqs disabled.
- */
-static void posix_cpu_timer_kick_nohz(void)
-{
-	if (context_tracking_is_enabled())
-		schedule_work(&nohz_kick_work);
-}
-
-bool posix_cpu_timers_can_stop_tick(struct task_struct *tsk)
-{
-	if (!task_cputime_zero(&tsk->cputime_expires))
-		return false;
-
-	/* Check if cputimer is running. This is accessed without locking. */
-	if (READ_ONCE(tsk->signal->cputimer.running))
-		return false;
-
-	return true;
-}
-#else
-static inline void posix_cpu_timer_kick_nohz(void) { }
-#endif
-
 /*
  * Guts of sys_timer_settime for CPU timers.
  * This is called with the timer locked and interrupts disabled.
@@ -761,8 +731,7 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int timer_flags,
 		sample_to_timespec(timer->it_clock,
 				   old_incr, &old->it_interval);
 	}
-	if (!ret)
-		posix_cpu_timer_kick_nohz();
+
 	return ret;
 }
 
@@ -911,6 +880,8 @@ static void check_thread_timers(struct task_struct *tsk,
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 		}
 	}
+	if (task_cputime_zero(tsk_expires))
+		__tick_nohz_clear_dep(TICK_POSIX_TIMER_BIT, &tsk->tick_dependency);
 }
 
 static inline void stop_process_timers(struct signal_struct *sig)
@@ -919,6 +890,7 @@ static inline void stop_process_timers(struct signal_struct *sig)
 
 	/* Turn off cputimer->running. This is done without locking. */
 	WRITE_ONCE(cputimer->running, false);
+	__tick_nohz_clear_dep(TICK_POSIX_TIMER_BIT, &sig->tick_dependency);
 }
 
 static u32 onecputick;
@@ -1095,8 +1067,6 @@ void posix_cpu_timer_schedule(struct k_itimer *timer)
 	arm_timer(timer);
 	unlock_task_sighand(p, &flags);
 
-	/* Kick full dynticks CPUs in case they need to tick on the new timer */
-	posix_cpu_timer_kick_nohz();
 out:
 	timer->it_overrun_last = timer->it_overrun;
 	timer->it_overrun = -1;
@@ -1270,7 +1240,7 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
 		}
 
 		if (!*newval)
-			goto out;
+			return;
 		*newval += now;
 	}
 
@@ -1288,8 +1258,8 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
 			tsk->signal->cputime_expires.virt_exp = *newval;
 		break;
 	}
-out:
-	posix_cpu_timer_kick_nohz();
+
+	__tick_nohz_set_dep_delayed(TICK_POSIX_TIMER_BIT, &tsk->signal->tick_dependency);
 }
 
 static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d670729..2a1ce82 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -202,11 +202,6 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 		return false;
 	}
 
-	if (!posix_cpu_timers_can_stop_tick(current)) {
-		trace_tick_stop(0, "posix timers running\n");
-		return false;
-	}
-
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 	/*
 	 * sched_clock_tick() needs us?
-- 
2.5.3



* [PATCH 7/7] sched-clock: Migrate to use new tick dependency mask model
  2015-11-13 14:22 [PATCH 0/7] nohz: Tick dependency mask v3 Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2015-11-13 14:22 ` [PATCH 6/7] posix-cpu-timers: Migrate " Frederic Weisbecker
@ 2015-11-13 14:22 ` Frederic Weisbecker
  6 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-13 14:22 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter, Ingo Molnar,
	Viresh Kumar, Rik van Riel

Instead of checking sched_clock_stable from the nohz subsystem to verify
its tick dependency, migrate it to the new mask in order to include it
in the all-in-one check.
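
I.e. the clock code now flips the system wide bit itself (from the hunks
below):

	/* __set_sched_clock_stable() */
	tick_nohz_clear_dep(TICK_CLOCK_UNSTABLE_BIT);

	/* __clear_sched_clock_stable() */
	tick_nohz_set_dep(TICK_CLOCK_UNSTABLE_BIT);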

Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/clock.c     |  5 +++++
 kernel/time/tick-sched.c | 26 ++++++--------------------
 2 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index c0a2051..1389e1a 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -61,6 +61,7 @@
 #include <linux/static_key.h>
 #include <linux/workqueue.h>
 #include <linux/compiler.h>
+#include <linux/tick.h>
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -89,6 +90,8 @@ static void __set_sched_clock_stable(void)
 {
 	if (!sched_clock_stable())
 		static_key_slow_inc(&__sched_clock_stable);
+
+	tick_nohz_clear_dep(TICK_CLOCK_UNSTABLE_BIT);
 }
 
 void set_sched_clock_stable(void)
@@ -108,6 +111,8 @@ static void __clear_sched_clock_stable(struct work_struct *work)
 	/* XXX worry about clock continuity */
 	if (sched_clock_stable())
 		static_key_slow_dec(&__sched_clock_stable);
+
+	tick_nohz_set_dep(TICK_CLOCK_UNSTABLE_BIT);
 }
 
 static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2a1ce82..bc419f84 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -174,8 +174,13 @@ static void trace_tick_dependency(unsigned long dep)
 		return;
 	}
 
-	if (dep & TICK_CLOCK_UNSTABLE_MASK)
+#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
+	if (dep & TICK_CLOCK_UNSTABLE_MASK) {
 		trace_tick_stop(0, "unstable sched clock\n");
+		WARN_ONCE(tick_nohz_full_running,
+			  "NO_HZ FULL will not work with unstable sched clock");
+	}
+#endif
 }
 
 static bool can_stop_full_tick(struct tick_sched *ts)
@@ -202,25 +207,6 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 		return false;
 	}
 
-#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-	/*
-	 * sched_clock_tick() needs us?
-	 *
-	 * TODO: kick full dynticks CPUs when
-	 * sched_clock_stable is set.
-	 */
-	if (!sched_clock_stable()) {
-		trace_tick_stop(0, "unstable sched clock\n");
-		/*
-		 * Don't allow the user to think they can get
-		 * full NO_HZ with this machine.
-		 */
-		WARN_ONCE(tick_nohz_full_running,
-			  "NO_HZ FULL will not work with unstable sched clock");
-		return false;
-	}
-#endif
-
 	return true;
 }
 
-- 
2.5.3



* Re: [PATCH 1/7] atomic: Export fetch_or()
  2015-11-13 14:22 ` [PATCH 1/7] atomic: Export fetch_or() Frederic Weisbecker
@ 2015-11-24 15:58   ` Chris Metcalf
  2015-11-24 21:19     ` Frederic Weisbecker
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Metcalf @ 2015-11-24 15:58 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 11/13/2015 09:22 AM, Frederic Weisbecker wrote:
> Export fetch_or() that's implemented and used internally by the
> scheduler. We are going to use it for NO_HZ so make it generally
> available.
>
> Cc: Christoph Lameter<cl@linux.com>
> Cc: Chris Metcalf<cmetcalf@ezchip.com>
> Cc: Ingo Molnar<mingo@kernel.org>
> Cc: Luiz Capitulino<lcapitulino@redhat.com>
> Cc: Peter Zijlstra<peterz@infradead.org>
> Cc: Rik van Riel<riel@redhat.com>
> Cc: Thomas Gleixner<tglx@linutronix.de>
> Cc: Viresh Kumar<viresh.kumar@linaro.org>
> Signed-off-by: Frederic Weisbecker<fweisbec@gmail.com>
> ---
>   include/linux/atomic.h | 18 ++++++++++++++++++
>   kernel/sched/core.c    | 14 --------------
>   2 files changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/atomic.h b/include/linux/atomic.h
> index 00a5763..c3b99f8 100644
> --- a/include/linux/atomic.h
> +++ b/include/linux/atomic.h
> @@ -451,6 +451,24 @@ static inline int atomic_dec_if_positive(atomic_t *v)
>   }
>   #endif
>   
> +/**
> + * fetch_or - perform *ptr |= mask and return old value of *ptr
> + * @ptr: pointer to value
> + * @mask: mask to OR on the value
> + *
> + * cmpxchg based fetch_or, macro so it works for different integer types
> + */
> +#define fetch_or(ptr, mask)						\
> +({	typeof(*(ptr)) __old, __val = *(ptr);				\
> +	for (;;) {							\
> +		__old = cmpxchg((ptr), __val, __val | (mask));		\
> +		if (__old == __val)					\
> +			break;						\
> +		__val = __old;						\
> +	}								\
> +	__old;								\
> +})
> +
>   #include <asm-generic/atomic-long.h>
>   #ifdef CONFIG_GENERIC_ATOMIC64
>   #include <asm-generic/atomic64.h>

I think this should be guarded by an "#ifndef" like other things in
this file, so architectures can provide their own implementations,
or can use the C11 atomic_fetch_or() for newer compilers.
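
I.e. something along these lines (just wrapping the macro from this patch in
the usual guard, untested):

#ifndef fetch_or
#define fetch_or(ptr, mask)						\
({	typeof(*(ptr)) __old, __val = *(ptr);				\
	for (;;) {							\
		__old = cmpxchg((ptr), __val, __val | (mask));		\
		if (__old == __val)					\
			break;						\
		__val = __old;						\
	}								\
	__old;								\
})
#endif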

Also, I wonder about the nomenclature here: other than cmpxchg
and xchg, all the atomic ops are named "atomic_xxx".  For something
that returns the old value, I'd expect it to be atomic_or_return()
and be otherwise like the existing atomic_or() routine, and thus you'd
specify "atomic_t tick_dependency".

Avoiding all of these issues is probably why fetch_or() is not exported :-)
I made some similar comments last time around:

https://lkml.kernel.org/r/55B2794A.8040707@ezchip.com

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com



* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-11-13 14:22 ` [PATCH 2/7] nohz: New tick dependency mask Frederic Weisbecker
@ 2015-11-24 16:19   ` Chris Metcalf
  2015-11-25 11:32     ` Frederic Weisbecker
  2015-12-01 20:41   ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Chris Metcalf @ 2015-11-24 16:19 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 11/13/2015 09:22 AM, Frederic Weisbecker wrote:
> The tick dependency is evaluated on every IRQ. This is a batch of checks
> which determine whether it is safe to stop the tick or not. These checks
> are often split in many details: posix cpu timers, scheduler, sched clock,
> perf events. Each of which are made of smaller details: posix cpu
> timer involves checking process wide timers then thread wide timers. Perf
> involves checking freq events then more per cpu details.
>
> Checking these details asynchronously every time we update the full
> dynticks state bring avoidable overhead and a messy layout.
>
> Lets introduce instead tick dependency masks: one for system wide
> dependency (unstable sched clock), one for CPU wide dependency (sched,
> perf), and task/signal level dependencies. The subsystems are responsible
> of setting and clearing their dependency through a set of APIs that will
> take care of concurrent dependency mask modifications and kick targets
> to restart the relevant CPU tick whenever needed.
>
> This new dependency engine stays beside the old one until all subsystems
> having a tick dependency are converted to it.
>
>
> +void tick_nohz_set_dep_cpu(enum tick_dependency_bit bit, int cpu)
> +{
> +	unsigned long prev;
> +	struct tick_sched *ts;
> +
> +	ts = per_cpu_ptr(&tick_cpu_sched, cpu);
> +
> +	prev = fetch_or(&ts->tick_dependency, BIT_MASK(bit));
> +	if (!prev) {
> +		preempt_disable();
> +		/* Perf needs local kick that is NMI safe */
> +		if (cpu == smp_processor_id()) {
> +			tick_nohz_full_kick();
> +		} else {
> +			/* Remote irq work not NMI-safe */
> +			WARN_ON_ONCE(in_nmi());

Better to say "if (!WARN_ON_ONCE(in_nmi()))" here instead so
we don't actually try to kick if we are in an NMI?
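
I.e. roughly:

		} else {
			/* Remote irq work not NMI-safe */
			if (!WARN_ON_ONCE(in_nmi()))
				tick_nohz_full_kick_cpu(cpu);
		}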

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com



* Re: [PATCH 3/7] perf: Migrate perf to use new tick dependency mask model
  2015-11-13 14:22 ` [PATCH 3/7] perf: Migrate perf to use new tick dependency mask model Frederic Weisbecker
@ 2015-11-24 16:19   ` Chris Metcalf
  2015-11-25 12:34     ` Frederic Weisbecker
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Metcalf @ 2015-11-24 16:19 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 11/13/2015 09:22 AM, Frederic Weisbecker wrote:
> Instead of providing asynchronous checks for the nohz subsystem to verify
> perf event tick dependency, migrate perf to the new mask.
>
> Perf needs the tick for two situations:
>
> 1) Freq events. We could set the tick dependency when those are
> installed on a CPU context. But setting a global dependency on top of
> the global freq events accounting is much easier. If people want that
> to be optimized, we can still refine that on the per-CPU tick dependency
> level. This patch dooesn't change the current behaviour anyway.
>
> 2) Throttled events: this is a per-cpu dependency.
>
>
> @@ -3540,8 +3530,10 @@ static void unaccount_event(struct perf_event *event)
>   		atomic_dec(&nr_comm_events);
>   	if (event->attr.task)
>   		atomic_dec(&nr_task_events);
> -	if (event->attr.freq)
> -		atomic_dec(&nr_freq_events);
> +	if (event->attr.freq) {
> +		if (atomic_dec_and_test(&nr_freq_events))
> +			tick_nohz_clear_dep(TICK_PERF_EVENTS_BIT);
> +	}
>   	if (event->attr.context_switch) {
>   		static_key_slow_dec_deferred(&perf_sched_events);
>   		atomic_dec(&nr_switch_events);
>
> @@ -7695,7 +7687,7 @@ static void account_event(struct perf_event *event)
>   		atomic_inc(&nr_task_events);
>   	if (event->attr.freq) {
>   		if (atomic_inc_return(&nr_freq_events) == 1)
> -			tick_nohz_full_kick_all();
> +			tick_nohz_set_dep(TICK_PERF_EVENTS_BIT);
>   	}
>   	if (event->attr.context_switch) {
>   		atomic_inc(&nr_switch_events);

It would be helpful to have a comment explaining why these two
can't race with each other, e.g. this race:

[cpu 1]  atomic_dec_and_test
[cpu 2] atomic_inc_return
[cpu 2] tick_nohz_set_dep()
[cpu 1] tick_nohz_clear_dep()

Or perhaps this is a true race condition possibility?

I think we're OK for the sched cases since they're protected under
the rq lock.  I'm not sure about the POSIX cpu timers.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com



* Re: [PATCH 1/7] atomic: Export fetch_or()
  2015-11-24 15:58   ` Chris Metcalf
@ 2015-11-24 21:19     ` Frederic Weisbecker
  2015-11-24 21:48       ` Chris Metcalf
  2015-11-25  9:13       ` Peter Zijlstra
  0 siblings, 2 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-24 21:19 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Tue, Nov 24, 2015 at 10:58:00AM -0500, Chris Metcalf wrote:
> On 11/13/2015 09:22 AM, Frederic Weisbecker wrote:
> >Export fetch_or() that's implemented and used internally by the
> >scheduler. We are going to use it for NO_HZ so make it generally
> >available.
> >
> >Cc: Christoph Lameter<cl@linux.com>
> >Cc: Chris Metcalf<cmetcalf@ezchip.com>
> >Cc: Ingo Molnar<mingo@kernel.org>
> >Cc: Luiz Capitulino<lcapitulino@redhat.com>
> >Cc: Peter Zijlstra<peterz@infradead.org>
> >Cc: Rik van Riel<riel@redhat.com>
> >Cc: Thomas Gleixner<tglx@linutronix.de>
> >Cc: Viresh Kumar<viresh.kumar@linaro.org>
> >Signed-off-by: Frederic Weisbecker<fweisbec@gmail.com>
> >---
> >  include/linux/atomic.h | 18 ++++++++++++++++++
> >  kernel/sched/core.c    | 14 --------------
> >  2 files changed, 18 insertions(+), 14 deletions(-)
> >
> >diff --git a/include/linux/atomic.h b/include/linux/atomic.h
> >index 00a5763..c3b99f8 100644
> >--- a/include/linux/atomic.h
> >+++ b/include/linux/atomic.h
> >@@ -451,6 +451,24 @@ static inline int atomic_dec_if_positive(atomic_t *v)
> >  }
> >  #endif
> >+/**
> >+ * fetch_or - perform *ptr |= mask and return old value of *ptr
> >+ * @ptr: pointer to value
> >+ * @mask: mask to OR on the value
> >+ *
> >+ * cmpxchg based fetch_or, macro so it works for different integer types
> >+ */
> >+#define fetch_or(ptr, mask)						\
> >+({	typeof(*(ptr)) __old, __val = *(ptr);				\
> >+	for (;;) {							\
> >+		__old = cmpxchg((ptr), __val, __val | (mask));		\
> >+		if (__old == __val)					\
> >+			break;						\
> >+		__val = __old;						\
> >+	}								\
> >+	__old;								\
> >+})
> >+
> >  #include <asm-generic/atomic-long.h>
> >  #ifdef CONFIG_GENERIC_ATOMIC64
> >  #include <asm-generic/atomic64.h>
> 
> I think this should be guarded by an "#ifndef" like other things in
> this file, so architectures can provide their own implementations,
> or can use the C11 atomic_fetch_or() for newer compilers.

Right.

> 
> Also, I wonder about the nomenclature here: other than cmpxchg
> and xchg, all the atomic ops are named "atomic_xxx".  For something
> that returns the old value, I'd expect it to be atomic_or_return()
> and be otherwise like the existing atomic_or() routine, and thus you'd
> specify "atomic_t tick_dependency".

I think Peterz needs it to be type-generic, like cmpxchg, such that he can
use it on tsk->thread_info->flags, whose type can vary. But if we happen to
need an atomic_t version, we can also provide an atomic_fetch_or() variant.

Also note that or_return() means that you first do OR and then return the new value.

I remember debating the name a bit with Peterz and the current one now makes
quite some sense to me too :-)

Thanks!


* Re: [PATCH 1/7] atomic: Export fetch_or()
  2015-11-24 21:19     ` Frederic Weisbecker
@ 2015-11-24 21:48       ` Chris Metcalf
  2015-11-30 17:36         ` Frederic Weisbecker
  2015-11-25  9:13       ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Chris Metcalf @ 2015-11-24 21:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 11/24/2015 04:19 PM, Frederic Weisbecker wrote:
>> Also, I wonder about the nomenclature here: other than cmpxchg
>> >and xchg, all the atomic ops are named "atomic_xxx".  For something
>> >that returns the old value, I'd expect it to be atomic_or_return()
>> >and be otherwise like the existing atomic_or() routine, and thus you'd
>> >specify "atomic_t tick_dependency".
> I think Peterz needs it to be type-generic, like cmpxchg, such that he can
> use it on tsk->thread_info->flags which type can vary. But if we happen to
> need an atomic_t version, we can also provide an atomic_fetch_or() version.

Yes, I think my point is that Peter Z's version is what is needed for
the scheduler, but it may not be the thing we want to start providing
to the entire kernel without thinking a little harder about the semantics,
the namespace issues, and whether there is or should be room for
some appropriate family of similar calls.  Just tossing in fetch_or()
because it's easy doesn't necessarily seem like what we should be doing.

> Also note that or_return() means that you first do OR and then return the new value.

Yeah, actually fair point.  I keep forgetting Linux does this "backwards".

I still think we should use an atomic_xxx() name here rather than just
adding things into the namespace willy-nilly.

It's tempting to use atomic_fetch_or() but that actually conflicts with the
C11 names, and remarkably, we haven't done that yet in the kernel,
so we may want to avoid doing so for now.  I can imagine in the not
too distant future we detect C11 compilers and allow using <stdatomic.h>
if possible.  (Obviously some platforms require kernel support or
other tricky things for stdatomic.h, so we couldn't use it everywhere.)

We could use something like gcc's old __sync_fetch_and_XXX vs
__sync_XXX_and_fetch nomenclature and call it atomic_return_or()
to contrast with atomic_or_return().  That avoids clashing with C11
for now and is obviously distinct from atomic_or_return().  I suppose
something like atomic_fetch_and_or() isn't terrible either.

There is boilerplate for building generic atomic functions already in
include/asm-generic/atomic.h and that could be extended.
Unfortunately the atomic64 generic code isn't similarly constructed,
so you can't just provide a default atomic64_return_or() based on that
stuff, or at least, you only can on platforms that use an array of locks
to implement 64-bit atomics.  Ugh.

If we did this and then wanted Peter Z's code to take advantage of it,
in principle we could just have some macrology which would compare
the sizeof(thread_info.flags) to sizeof(int) and sizeof(long) to see which
one to use and then cast to the appropriate atomic_t or atomic64_t.
That's a little painful but not terrible.
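
Something like this, say (entirely hypothetical, assuming an
atomic_return_or()/atomic64_return_or() pair existed):

#define thread_flags_return_or(ti, mask)				\
({									\
	typeof((ti)->flags) __ret;					\
									\
	if (sizeof((ti)->flags) == sizeof(int))				\
		__ret = atomic_return_or((atomic_t *)&(ti)->flags,	\
					 (mask));			\
	else								\
		__ret = atomic64_return_or((atomic64_t *)&(ti)->flags,	\
					   (mask));			\
	__ret;								\
})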

Boy, the whole situation is pretty tangled up, though.

Unless you want to take a big diversion into atomics, I'd be tempted
to leave Peter's macro alone and just write it off as a necessary evil
to handle the fact that thread_info.flags is all kinds of different sizes
and types on different platforms, and definitely never an atomic_t.
Instead just create an inline function atomic_return_or(), or
whatever name you prefer, that operates on an atomic_t, and use
the atomic_t type for your structure field.  It's clearly a win to mark
the data types as being atomic to the extent we can do so, I think.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com



* Re: [PATCH 1/7] atomic: Export fetch_or()
  2015-11-24 21:19     ` Frederic Weisbecker
  2015-11-24 21:48       ` Chris Metcalf
@ 2015-11-25  9:13       ` Peter Zijlstra
  1 sibling, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2015-11-25  9:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, LKML, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Tue, Nov 24, 2015 at 10:19:52PM +0100, Frederic Weisbecker wrote:
> Also note that or_return() means that you first do OR and then return the new value.

Yes, that's useless. OR is an irreversible operator, which means
or_return() loses data: you can never tell whether a bit included in the
mask was already set before.

You really must have fetch_or().
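
For context, that primitive amounts to something like the following
cmpxchg()-based sketch (close to the scheduler-internal macro this series
exports; the final exported form may differ):

/*
 * Sketch: atomically OR @mask into *@ptr and return the value that was
 * there *before* the OR.  Returning the old value is what lets a caller
 * detect the 0 -> non-zero transition; an or_return()-style primitive
 * returning the new value cannot recover that information.
 */
#define fetch_or(ptr, mask)						\
({									\
	typeof(*(ptr)) __old, __val = *(ptr);				\
									\
	for (;;) {							\
		__old = cmpxchg((ptr), __val, __val | (mask));		\
		if (__old == __val)					\
			break;						\
		__val = __old;						\
	}								\
	__old;								\
})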

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-11-24 16:19   ` Chris Metcalf
@ 2015-11-25 11:32     ` Frederic Weisbecker
  0 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-25 11:32 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Tue, Nov 24, 2015 at 11:19:15AM -0500, Chris Metcalf wrote:
> On 11/13/2015 09:22 AM, Frederic Weisbecker wrote:
> >The tick dependency is evaluated on every IRQ. This is a batch of checks
> >which determine whether it is safe to stop the tick or not. These checks
> >are often split in many details: posix cpu timers, scheduler, sched clock,
> >perf events. Each of which are made of smaller details: posix cpu
> >timer involves checking process wide timers then thread wide timers. Perf
> >involves checking freq events then more per cpu details.
> >
> >Checking these details asynchronously every time we update the full
> >dynticks state bring avoidable overhead and a messy layout.
> >
> >Lets introduce instead tick dependency masks: one for system wide
> >dependency (unstable sched clock), one for CPU wide dependency (sched,
> >perf), and task/signal level dependencies. The subsystems are responsible
> >of setting and clearing their dependency through a set of APIs that will
> >take care of concurrent dependency mask modifications and kick targets
> >to restart the relevant CPU tick whenever needed.
> >
> >This new dependency engine stays beside the old one until all subsystems
> >having a tick dependency are converted to it.
> >
> >
> >+void tick_nohz_set_dep_cpu(enum tick_dependency_bit bit, int cpu)
> >+{
> >+	unsigned long prev;
> >+	struct tick_sched *ts;
> >+
> >+	ts = per_cpu_ptr(&tick_cpu_sched, cpu);
> >+
> >+	prev = fetch_or(&ts->tick_dependency, BIT_MASK(bit));
> >+	if (!prev) {
> >+		preempt_disable();
> >+		/* Perf needs local kick that is NMI safe */
> >+		if (cpu == smp_processor_id()) {
> >+			tick_nohz_full_kick();
> >+		} else {
> >+			/* Remote irq work not NMI-safe */
> >+			WARN_ON_ONCE(in_nmi());
> 
> Better to say "if (!WARN_ON_ONCE(in_nmi()))" here instead so
> we don't actually try to kick if we are in an NMI?

Makes sense yeah. I'll fix that.

Thanks.
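
For reference, the quoted hunk with that suggestion folded in would look
roughly like this (the remote-kick call and the closing preempt_enable()
are assumed from context, since the hunk above is truncated):

	preempt_disable();
	/* Perf needs a local kick that is NMI safe */
	if (cpu == smp_processor_id()) {
		tick_nohz_full_kick();
	} else {
		/* Remote irq work is not NMI-safe: warn and skip the kick */
		if (!WARN_ON_ONCE(in_nmi()))
			tick_nohz_full_kick_cpu(cpu);
	}
	preempt_enable();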

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/7] perf: Migrate perf to use new tick dependency mask model
  2015-11-24 16:19   ` Chris Metcalf
@ 2015-11-25 12:34     ` Frederic Weisbecker
  2015-12-02 16:17       ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-25 12:34 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Tue, Nov 24, 2015 at 11:19:33AM -0500, Chris Metcalf wrote:
> On 11/13/2015 09:22 AM, Frederic Weisbecker wrote:
> >Instead of providing asynchronous checks for the nohz subsystem to verify
> >perf event tick dependency, migrate perf to the new mask.
> >
> >Perf needs the tick for two situations:
> >
> >1) Freq events. We could set the tick dependency when those are
> >installed on a CPU context. But setting a global dependency on top of
> >the global freq events accounting is much easier. If people want that
> >to be optimized, we can still refine that on the per-CPU tick dependency
> >level. This patch doesn't change the current behaviour anyway.
> >
> >2) Throttled events: this is a per-cpu dependency.
> >
> >
> >@@ -3540,8 +3530,10 @@ static void unaccount_event(struct perf_event *event)
> >  		atomic_dec(&nr_comm_events);
> >  	if (event->attr.task)
> >  		atomic_dec(&nr_task_events);
> >-	if (event->attr.freq)
> >-		atomic_dec(&nr_freq_events);
> >+	if (event->attr.freq) {
> >+		if (atomic_dec_and_test(&nr_freq_events))
> >+			tick_nohz_clear_dep(TICK_PERF_EVENTS_BIT);
> >+	}
> >  	if (event->attr.context_switch) {
> >  		static_key_slow_dec_deferred(&perf_sched_events);
> >  		atomic_dec(&nr_switch_events);
> >
> >@@ -7695,7 +7687,7 @@ static void account_event(struct perf_event *event)
> >  		atomic_inc(&nr_task_events);
> >  	if (event->attr.freq) {
> >  		if (atomic_inc_return(&nr_freq_events) == 1)
> >-			tick_nohz_full_kick_all();
> >+			tick_nohz_set_dep(TICK_PERF_EVENTS_BIT);
> >  	}
> >  	if (event->attr.context_switch) {
> >  		atomic_inc(&nr_switch_events);
> 
> It would be helpful to have a comment explaining why these two
> can't race with each other, e.g. this race:
> 
> [cpu 1]  atomic_dec_and_test
> [cpu 2] atomic_inc_return
> [cpu 2] tick_nohz_set_dep()
> [cpu 1] tick_nohz_clear_dep()
> 
> Or perhaps this is a true race condition possibility?
> 
> I think we're OK for the sched cases since they're protected under
> the rq lock, I think.  I'm not sure about the POSIX cpu timers.

Hmm, how did I miss that...

So in the case of perf, either we need locking, in which case we may want
to use something like tick_nohz_add_dep() which takes care of counting.
But perf would be the only user.

Another possibility is to set/clear the tick mask at the task level
in event_sched_in/event_sched_out using ctx->nr_freq, which is protected by
ctx->lock. I think I should rather do that.

Concerning the others:

_ sched: we are under the rq lock, like you noticed, we are fine.

_ posix timers: we are under sighand lock, so we are fine too.

_ sched_clock_stable: that one is more obscure. It seems that set_sched_clock_stable()
  and clear_sched_clock_stable() can race on static keys if running concurrently, and
  that would concern tick mask as well.

Thanks.

> 
> -- 
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/7] atomic: Export fetch_or()
  2015-11-24 21:48       ` Chris Metcalf
@ 2015-11-30 17:36         ` Frederic Weisbecker
  2015-11-30 18:17           ` Chris Metcalf
  0 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2015-11-30 17:36 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Tue, Nov 24, 2015 at 04:48:35PM -0500, Chris Metcalf wrote:
> On 11/24/2015 04:19 PM, Frederic Weisbecker wrote:
> >>Also, I wonder about the nomenclature here: other than cmpxchg
> >>>and xchg, all the atomic ops are named "atomic_xxx".  For something
> >>>that returns the old value, I'd expect it to be atomic_or_return()
> >>>and be otherwise like the existing atomic_or() routine, and thus you'd
> >>>specify "atomic_t tick_dependency".
> >I think Peterz needs it to be type-generic, like cmpxchg, such that he can
> >use it on tsk->thread_info->flags which type can vary. But if we happen to
> >need an atomic_t version, we can also provide an atomic_fetch_or() version.
> 
> Yes, I think my point is that Peter Z's version is what is needed for
> the scheduler, but it may not be the thing we want to start providing
> to the entire kernel without thinking a little harder about the semantics,
> the namespace issues, and whether there is or should be room for
> some appropriate family of similar calls.  Just tossing in fetch_or()
> because it's easy doesn't necessarily seem like what we should be doing.
> 
> >Also note that or_return() means that you first do OR and then return the new value.
> 
> Yeah, actually fair point.  I keep forgetting Linux does this "backwards".
> 
> I still think we should use an atomic_xxx() name here rather than just
> adding things into the namespace willy-nilly.
> 
> It's tempting to use atomic_fetch_or() but that actually conflicts with the
> C11 names, and remarkably, we haven't done that yet in the kernel,
> so we may want to avoid doing so for now.  I can imagine in the not
> too distant future we detect C11 compilers and allow using <stdatomic.h>
> if possible.  (Obviously some platforms require kernel support or
> other tricky things for stdatomic.h, so we couldn't use it everywhere.)
> 
> We could use something like gcc's old __sync_fetch_and_XXX vs
> __sync_XXX_and_fetch nomenclature and call it atomic_return_or()
> to contrast with atomic_or_return().  That avoids clashing with C11
> for now and is obviously distinct from atomic_or_return().  I suppose
> something like atomic_fetch_and_or() isn't terrible either.
> 
> There is boilerplate for building generic atomic functions already in
> include/asm-generic/atomic.h and that could be extended.
> Unfortunately the atomic64 generic code isn't similarly constructed,
> so you can't just provide a default atomic64_return_or() based on that
> stuff, or at least, you only can on platforms that use an array of locks
> to implement 64-bit atomics.  Ugh.
> 
> If we did this and then wanted Peter Z's code to take advantage of it,
> in principle we could just have some macrology which would compare
> the sizeof(thread_info.flags) to sizeof(int) and sizeof(long) to see which
> one to use and then cast to the appropriate atomic_t or atomic64_t.
> That's a little painful but not terrible.
> 
> Boy, the whole situation is pretty tangled up, though.
> 
> Unless you want to take a big diversion into atomics, I'd be tempted
> to leave Peter's macro alone and just write it off as necessary evil
> to handle the fact that thread_info.flags is all kinds of different sizes
> and types on different platforms, and definitely never an atomic_t.
> Instead just create an inline function atomic_return_or(), or
> whatever name you prefer, that operates on an atomic_t, and use
> the atomic_t type for your structure field.  It's clearly a win to mark
> the data types as being atomic to the extent we can do so, I think.

I agree that cmpxchg, test_and_set_bit, fetch_or... functions with loose
namespaces aren't the best layout.

But casting thread_info to atomic_t really worries me; I'm not sure the end
result would be correct at all. I'd rather sacrifice namespace sanity than
correctness :-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/7] atomic: Export fetch_or()
  2015-11-30 17:36         ` Frederic Weisbecker
@ 2015-11-30 18:17           ` Chris Metcalf
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Metcalf @ 2015-11-30 18:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

(Resending as plain text.  Not sure what Thunderbird was smoking to make
this message multipart/alternative originally...)

On 11/30/2015 12:36 PM, Frederic Weisbecker wrote:
> On Tue, Nov 24, 2015 at 04:48:35PM -0500, Chris Metcalf wrote:
>> >Unless you want to take a big diversion into atomics, I'd be tempted
>> >to leave Peter's macro alone and just write it off as necessary evil
>> >to handle the fact that thread_info.flags is all kinds of different sizes
>> >and types on different platforms, and definitely never an atomic_t.
>> >Instead just create an inline function atomic_return_or(), or
>> >whatever name you prefer, that operates on an atomic_t, and use
>> >the atomic_t type for your structure field.  It's clearly a win to mark
>> >the data types as being atomic to the extent we can do so, I think.
> I agree that cmpxchg, test_and_set_bit, fetch_or... functions with loose
> namespaces aren't the best layout.
>
> But casting thread_info to atomic_t really worries me; I'm not sure the end
> result would be correct at all. I'd rather sacrifice namespace sanity than
> correctness :-)

Just to be clear, I wasn't suggesting that that was necessarily the way for you to go.
The last four lines of my message quoted above are what I think might be the
best way forward, and don't involve messing with thread_info at all.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-11-13 14:22 ` [PATCH 2/7] nohz: New tick dependency mask Frederic Weisbecker
  2015-11-24 16:19   ` Chris Metcalf
@ 2015-12-01 20:41   ` Peter Zijlstra
  2015-12-01 22:20     ` Frederic Weisbecker
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2015-12-01 20:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Fri, Nov 13, 2015 at 03:22:04PM +0100, Frederic Weisbecker wrote:
> The tick dependency is evaluated on every IRQ. This is a batch of checks
> which determine whether it is safe to stop the tick or not. These checks
> are often split in many details: posix cpu timers, scheduler, sched clock,
> perf events. Each of which are made of smaller details: posix cpu
> timer involves checking process wide timers then thread wide timers. Perf
> involves checking freq events then more per cpu details.
> 
> Checking these details asynchronously every time we update the full
> dynticks state bring avoidable overhead and a messy layout.
> 
> Lets introduce instead tick dependency masks: one for system wide
> dependency (unstable sched clock), one for CPU wide dependency (sched,
> perf), and task/signal level dependencies. The subsystems are responsible
> of setting and clearing their dependency through a set of APIs that will
> take care of concurrent dependency mask modifications and kick targets
> to restart the relevant CPU tick whenever needed.

Maybe better explain why we need the per task and per signal thingy?

> +static void trace_tick_dependency(unsigned long dep)
> +{
> +	if (dep & TICK_POSIX_TIMER_MASK) {
> +		trace_tick_stop(0, "posix timers running\n");
> +		return;
> +	}
> +
> +	if (dep & TICK_PERF_EVENTS_MASK) {
> +		trace_tick_stop(0, "perf events running\n");
> +		return;
> +	}
> +
> +	if (dep & TICK_SCHED_MASK) {
> +		trace_tick_stop(0, "more than 1 task in runqueue\n");
> +		return;
> +	}
> +
> +	if (dep & TICK_CLOCK_UNSTABLE_MASK)
> +		trace_tick_stop(0, "unstable sched clock\n");
> +}

I would suggest ditching the strings and using the 

> +static void kick_all_work_fn(struct work_struct *work)
> +{
> +       tick_nohz_full_kick_all();
> +}
> +static DECLARE_WORK(kick_all_work, kick_all_work_fn);
> +
> +void __tick_nohz_set_dep_delayed(enum tick_dependency_bit bit, unsigned long *dep)
> +{
> +	unsigned long prev;
> +
> +	prev = fetch_or(dep, BIT_MASK(bit));
> +	if (!prev) {
> +		/*
> +		* We need the IPIs to be sent from sane process context.

Why ?

> +		* The posix cpu timers are always set with irqs disabled.
> +		*/
> +		schedule_work(&kick_all_work);
> +	}
> +}
> +
> +/*
> + * Set a global tick dependency. Lets do the wide IPI kick asynchronously
> + * for callers with irqs disabled.

This seems to suggest you can call this with IRQs disabled

> + */
> +void tick_nohz_set_dep(enum tick_dependency_bit bit)
> +{
> +	unsigned long prev;
> +
> +	prev = fetch_or(&tick_dependency, BIT_MASK(bit));
> +	if (!prev)
> +		tick_nohz_full_kick_all();

But that function seems implemented using smp_call_function_many() which
cannot be called with IRQs disabled.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-12-01 20:41   ` Peter Zijlstra
@ 2015-12-01 22:20     ` Frederic Weisbecker
  2015-12-02 10:56       ` Peter Zijlstra
                         ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-12-01 22:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Tue, Dec 01, 2015 at 09:41:09PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 13, 2015 at 03:22:04PM +0100, Frederic Weisbecker wrote:
> > The tick dependency is evaluated on every IRQ. This is a batch of checks
> > which determine whether it is safe to stop the tick or not. These checks
> > are often split in many details: posix cpu timers, scheduler, sched clock,
> > perf events. Each of which are made of smaller details: posix cpu
> > timer involves checking process wide timers then thread wide timers. Perf
> > involves checking freq events then more per cpu details.
> > 
> > Checking these details asynchronously every time we update the full
> > dynticks state bring avoidable overhead and a messy layout.
> > 
> > Lets introduce instead tick dependency masks: one for system wide
> > dependency (unstable sched clock), one for CPU wide dependency (sched,
> > perf), and task/signal level dependencies. The subsystems are responsible
> > of setting and clearing their dependency through a set of APIs that will
> > take care of concurrent dependency mask modifications and kick targets
> > to restart the relevant CPU tick whenever needed.
> 
> Maybe better explain why we need the per task and per signal thingy?

I'll detail that some more in the changelog. The only user of the per task/per signal
tick dependency is posix cpu timer. I first proposed a global tick dependency set
as soon as any posix cpu timer is armed. It simplified everything but some reviewers
complained (eg: some users might want to run posix timers on housekeepers without
bothering full dynticks CPUs). I could remove the per signal dependency by dispatching
it to all threads in the group each time there is an update, but that's the best I can
think of.

> 
> > +static void trace_tick_dependency(unsigned long dep)
> > +{
> > +	if (dep & TICK_POSIX_TIMER_MASK) {
> > +		trace_tick_stop(0, "posix timers running\n");
> > +		return;
> > +	}
> > +
> > +	if (dep & TICK_PERF_EVENTS_MASK) {
> > +		trace_tick_stop(0, "perf events running\n");
> > +		return;
> > +	}
> > +
> > +	if (dep & TICK_SCHED_MASK) {
> > +		trace_tick_stop(0, "more than 1 task in runqueue\n");
> > +		return;
> > +	}
> > +
> > +	if (dep & TICK_CLOCK_UNSTABLE_MASK)
> > +		trace_tick_stop(0, "unstable sched clock\n");
> > +}
> 
> I would suggest ditching the strings and using the

Using a code value instead?

> 
> > +static void kick_all_work_fn(struct work_struct *work)
> > +{
> > +       tick_nohz_full_kick_all();
> > +}
> > +static DECLARE_WORK(kick_all_work, kick_all_work_fn);
> > +
> > +void __tick_nohz_set_dep_delayed(enum tick_dependency_bit bit, unsigned long *dep)
> > +{
> > +	unsigned long prev;
> > +
> > +	prev = fetch_or(dep, BIT_MASK(bit));
> > +	if (!prev) {
> > +		/*
> > +		* We need the IPIs to be sent from sane process context.
> 
> Why ?

Because posix timers code is all called with interrupts disabled and we can't
send IPIs then.

> 
> > +		* The posix cpu timers are always set with irqs disabled.
> > +		*/
> > +		schedule_work(&kick_all_work);
> > +	}
> > +}
> > +
> > +/*
> > + * Set a global tick dependency. Lets do the wide IPI kick asynchronously
> > + * for callers with irqs disabled.
> 
> This seems to suggest you can call this with IRQs disabled

Ah right, that's a misleading comment. We need to use the _delayed() version
when interrupts are disabled.

Thanks.

> 
> > + */
> > +void tick_nohz_set_dep(enum tick_dependency_bit bit)
> > +{
> > +	unsigned long prev;
> > +
> > +	prev = fetch_or(&tick_dependency, BIT_MASK(bit));
> > +	if (!prev)
> > +		tick_nohz_full_kick_all();
> 
> But that function seems implemented using smp_call_function_many() which
> cannot be called with IRQs disabled.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-12-01 22:20     ` Frederic Weisbecker
@ 2015-12-02 10:56       ` Peter Zijlstra
  2015-12-02 14:08         ` Frederic Weisbecker
  2015-12-02 12:45       ` Peter Zijlstra
  2015-12-02 12:48       ` Peter Zijlstra
  2 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2015-12-02 10:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Tue, Dec 01, 2015 at 11:20:28PM +0100, Frederic Weisbecker wrote:
> > > +	prev = fetch_or(dep, BIT_MASK(bit));
> > > +	if (!prev) {
> > > +		/*
> > > +		* We need the IPIs to be sent from sane process context.
> > 
> > Why ?
> 
> Because posix timers code is all called with interrupts disabled and we can't
> send IPIs then.
> 
> > 
> > > +		* The posix cpu timers are always set with irqs disabled.
> > > +		*/
> > > +		schedule_work(&kick_all_work);
> > > +	}
> > > +}
> > > +
> > > +/*
> > > + * Set a global tick dependency. Lets do the wide IPI kick asynchronously
> > > + * for callers with irqs disabled.
> > 
> > This seems to suggest you can call this with IRQs disabled
> 
> Ah right, that's a misleading comment. We need to use the _delayed() version
> when interrupts are disabled.

Why can't you use tick_nohz_full_kick_cpu() for all that, which is
usable from IRQ context and avoid all that delayed muck?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-12-01 22:20     ` Frederic Weisbecker
  2015-12-02 10:56       ` Peter Zijlstra
@ 2015-12-02 12:45       ` Peter Zijlstra
  2015-12-02 14:10         ` Frederic Weisbecker
  2015-12-02 12:48       ` Peter Zijlstra
  2 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2015-12-02 12:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Tue, Dec 01, 2015 at 11:20:28PM +0100, Frederic Weisbecker wrote:

> > > +static void trace_tick_dependency(unsigned long dep)
> > > +{
> > > +	if (dep & TICK_POSIX_TIMER_MASK) {
> > > +		trace_tick_stop(0, "posix timers running\n");
> > > +		return;
> > > +	}
> > > +
> > > +	if (dep & TICK_PERF_EVENTS_MASK) {
> > > +		trace_tick_stop(0, "perf events running\n");
> > > +		return;
> > > +	}
> > > +
> > > +	if (dep & TICK_SCHED_MASK) {
> > > +		trace_tick_stop(0, "more than 1 task in runqueue\n");
> > > +		return;
> > > +	}
> > > +
> > > +	if (dep & TICK_CLOCK_UNSTABLE_MASK)
> > > +		trace_tick_stop(0, "unstable sched clock\n");
> > > +}
> > 
> > I would suggest ditching the strings and using the
> 
> Using a code value instead?

Duh, it seems I forgot to finish that sentence :/

I meant to say use the ftrace enum stuff. So yes, you encode a value and
the ftrace printing muck will generate a string if and when required.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-12-01 22:20     ` Frederic Weisbecker
  2015-12-02 10:56       ` Peter Zijlstra
  2015-12-02 12:45       ` Peter Zijlstra
@ 2015-12-02 12:48       ` Peter Zijlstra
  2015-12-02 14:11         ` Frederic Weisbecker
  2 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2015-12-02 12:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Tue, Dec 01, 2015 at 11:20:28PM +0100, Frederic Weisbecker wrote:
> On Tue, Dec 01, 2015 at 09:41:09PM +0100, Peter Zijlstra wrote:
> > On Fri, Nov 13, 2015 at 03:22:04PM +0100, Frederic Weisbecker wrote:
> > > The tick dependency is evaluated on every IRQ. This is a batch of checks
> > > which determine whether it is safe to stop the tick or not. These checks
> > > are often split in many details: posix cpu timers, scheduler, sched clock,
> > > perf events. Each of which are made of smaller details: posix cpu
> > > timer involves checking process wide timers then thread wide timers. Perf
> > > involves checking freq events then more per cpu details.
> > > 
> > > Checking these details asynchronously every time we update the full
> > > dynticks state bring avoidable overhead and a messy layout.
> > > 
> > > Lets introduce instead tick dependency masks: one for system wide
> > > dependency (unstable sched clock), one for CPU wide dependency (sched,
> > > perf), and task/signal level dependencies. The subsystems are responsible
> > > of setting and clearing their dependency through a set of APIs that will
> > > take care of concurrent dependency mask modifications and kick targets
> > > to restart the relevant CPU tick whenever needed.
> > 
> > Maybe better explain why we need the per task and per signal thingy?
> 
> I'll detail that some more in the changelog. The only user of the per
> task/per signal tick dependency is posix cpu timer. I first proposed
> a global tick dependency set as soon as any posix cpu timer is
> armed. 

> It simplified everything but some reviewers complained (eg:
> some users might want to run posix timers on housekeepers without
> bothering full dynticks CPUs). I could remove the per signal
> dependency by dispatching it to all threads in the group each
> time there is an update, but that's the best I can think of.

Right, I remember some of that. Seems worth preserving these reasons.
Maybe even in code comments near the definition of these things.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/7] sched: Account rr and fifo tasks separately
  2015-11-13 14:22 ` [PATCH 4/7] sched: Account rr and fifo tasks separately Frederic Weisbecker
@ 2015-12-02 12:53   ` Peter Zijlstra
  2015-12-02 14:16     ` Frederic Weisbecker
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2015-12-02 12:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Fri, Nov 13, 2015 at 03:22:06PM +0100, Frederic Weisbecker wrote:
> In order to evaluate tick dependency, we need to account SCHED_RR and
> SCHED_FIFO tasks separately as those policies don't have the same
> preemption requirements.
> 
> We still keep rt_nr_running as a cache to avoid additions between nr_rr
> and nr_fifo all over the place.

In which case you only need one of nr_fifo/nr_rr. Less accounting is
better.

Pick the one you need for the nohz_full condition, and leave the other.
A quick look at sched_can_stop_tick() seems to suggest nr_rr is the
interesting one. nr_rr < 2 should allow stopping the tick.
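
A sketch of the check that suggestion implies (the rr_nr_running field name
is an assumption here, mirroring the existing rt_nr_running):

/*
 * Sketch: only SCHED_RR tasks need the tick to timeslice among their
 * peers, so the tick can stop unless two or more of them are runnable.
 */
static inline bool rt_rr_needs_tick(struct rq *rq)
{
	return rq->rt.rr_nr_running > 1;
}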

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-12-02 10:56       ` Peter Zijlstra
@ 2015-12-02 14:08         ` Frederic Weisbecker
  2015-12-02 15:09           ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2015-12-02 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Wed, Dec 02, 2015 at 11:56:33AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 01, 2015 at 11:20:28PM +0100, Frederic Weisbecker wrote:
> > > > +	prev = fetch_or(dep, BIT_MASK(bit));
> > > > +	if (!prev) {
> > > > +		/*
> > > > +		* We need the IPIs to be sent from sane process context.
> > > 
> > > Why ?
> > 
> > Because posix timers code is all called with interrupts disabled and we can't
> > send IPIs then.
> > 
> > > 
> > > > +		* The posix cpu timers are always set with irqs disabled.
> > > > +		*/
> > > > +		schedule_work(&kick_all_work);
> > > > +	}
> > > > +}
> > > > +
> > > > +/*
> > > > + * Set a global tick dependency. Lets do the wide IPI kick asynchronously
> > > > + * for callers with irqs disabled.
> > > 
> > > This seems to suggest you can call this with IRQs disabled
> > 
> > Ah right, that's a misleading comment. We need to use the _delayed() version
> > when interrupts are disabled.
> 
> Why can't you use tick_nohz_full_kick_cpu() for all that, which is
> usable from IRQ context and avoid all that delayed muck?

Because I need to kick all the CPUs that the task/signal is running on. That's
a bit difficult to do though. I think we had something to try to send an IPI to a task,
but I can't retrieve it. Looks easy to do anyway. But in the signal case I'd need to do
that for all tasks in the group. That sounds like a costly loop.

So I simplify that with a global IPI.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-12-02 12:45       ` Peter Zijlstra
@ 2015-12-02 14:10         ` Frederic Weisbecker
  0 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-12-02 14:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Wed, Dec 02, 2015 at 01:45:31PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 01, 2015 at 11:20:28PM +0100, Frederic Weisbecker wrote:
> 
> > > > +static void trace_tick_dependency(unsigned long dep)
> > > > +{
> > > > +	if (dep & TICK_POSIX_TIMER_MASK) {
> > > > +		trace_tick_stop(0, "posix timers running\n");
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	if (dep & TICK_PERF_EVENTS_MASK) {
> > > > +		trace_tick_stop(0, "perf events running\n");
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	if (dep & TICK_SCHED_MASK) {
> > > > +		trace_tick_stop(0, "more than 1 task in runqueue\n");
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	if (dep & TICK_CLOCK_UNSTABLE_MASK)
> > > > +		trace_tick_stop(0, "unstable sched clock\n");
> > > > +}
> > > 
> > > I would suggest ditching the strings and using the
> > 
> > Using a code value instead?
> 
> Duh, it seems I forgot to finish that sentence :/
> 
> I meant to say use the ftrace enum stuff. So yes, you encode a value and
> the ftrace printing muck will generate a string if and when required.

Right, I first wanted to avoid that because it makes parsing more complicated for
tools, but it probably works fine now with libtraceevents.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-12-02 12:48       ` Peter Zijlstra
@ 2015-12-02 14:11         ` Frederic Weisbecker
  0 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-12-02 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Wed, Dec 02, 2015 at 01:48:06PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 01, 2015 at 11:20:28PM +0100, Frederic Weisbecker wrote:
> > On Tue, Dec 01, 2015 at 09:41:09PM +0100, Peter Zijlstra wrote:
> > > On Fri, Nov 13, 2015 at 03:22:04PM +0100, Frederic Weisbecker wrote:
> > > > The tick dependency is evaluated on every IRQ. This is a batch of checks
> > > > which determine whether it is safe to stop the tick or not. These checks
> > > > are often split in many details: posix cpu timers, scheduler, sched clock,
> > > > perf events. Each of which are made of smaller details: posix cpu
> > > > timer involves checking process wide timers then thread wide timers. Perf
> > > > involves checking freq events then more per cpu details.
> > > > 
> > > > Checking these details asynchronously every time we update the full
> > > > dynticks state bring avoidable overhead and a messy layout.
> > > > 
> > > > Lets introduce instead tick dependency masks: one for system wide
> > > > dependency (unstable sched clock), one for CPU wide dependency (sched,
> > > > perf), and task/signal level dependencies. The subsystems are responsible
> > > > of setting and clearing their dependency through a set of APIs that will
> > > > take care of concurrent dependency mask modifications and kick targets
> > > > to restart the relevant CPU tick whenever needed.
> > > 
> > > Maybe better explain why we need the per task and per signal thingy?
> > 
> > I'll detail that some more in the changelog. The only user of the per
> > task/per signal tick dependency is posix cpu timer. I first proposed
> > a global tick dependency set as soon as any posix cpu timer is
> > armed. 
> 
> > It simplified everything but some reviewers complained (eg:
> > some users might want to run posix timers on housekeepers without
> > bothering full dynticks CPUs). I could remove the per signal
> > dependency by dispatching it to all threads in the group each
> > time there is an update, but that's the best I can think of.
> 
> Right, I remember some of that. Seems worth preserving these reasons.
> Maybe even in code comments near the definition of these things.

Agreed! I'll comment that some more.

Thanks.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/7] sched: Account rr and fifo tasks separately
  2015-12-02 12:53   ` Peter Zijlstra
@ 2015-12-02 14:16     ` Frederic Weisbecker
  0 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-12-02 14:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Wed, Dec 02, 2015 at 01:53:39PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 13, 2015 at 03:22:06PM +0100, Frederic Weisbecker wrote:
> > In order to evaluate tick dependency, we need to account SCHED_RR and
> > SCHED_FIFO tasks separately as those policies don't have the same
> > preemption requirements.
> > 
> > We still keep rt_nr_running as a cache to avoid additions between nr_rr
> > and nr_fifo all over the place.
> 
> In which case you only need one of nr_fifo/nr_rr. Less accounting is
> better.
> 
> Pick the one you need for the nohz_full condition, and leave the other.
> A quick look at sched_can_stop_tick() seems to suggest nr_rr is the
> interesting one. nr_rr < 2 should allow stopping the tick.

Sounds pretty good! I'm going to do that!

Thanks!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-12-02 14:08         ` Frederic Weisbecker
@ 2015-12-02 15:09           ` Peter Zijlstra
  2015-12-08 15:57             ` Frederic Weisbecker
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2015-12-02 15:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Wed, Dec 02, 2015 at 03:08:16PM +0100, Frederic Weisbecker wrote:
> On Wed, Dec 02, 2015 at 11:56:33AM +0100, Peter Zijlstra wrote:
> > On Tue, Dec 01, 2015 at 11:20:28PM +0100, Frederic Weisbecker wrote:
> > > > > +	prev = fetch_or(dep, BIT_MASK(bit));
> > > > > +	if (!prev) {
> > > > > +		/*
> > > > > +		* We need the IPIs to be sent from sane process context.
> > > > 
> > > > Why ?
> > > 
> > > Because posix timers code is all called with interrupts disabled and we can't
> > > send IPIs then.
> > > 
> > > > 
> > > > > +		* The posix cpu timers are always set with irqs disabled.
> > > > > +		*/
> > > > > +		schedule_work(&kick_all_work);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Set a global tick dependency. Lets do the wide IPI kick asynchronously
> > > > > + * for callers with irqs disabled.
> > > > 
> > > > This seems to suggest you can call this with IRQs disabled
> > > 
> > > Ah right, that's a misleading comment. We need to use the _delayed() version
> > > when interrupts are disabled.
> > 
> > Why can't you use tick_nohz_full_kick_cpu() for all that, which is
> > usable from IRQ context and avoid all that delayed muck?
> 
> Because I need to kick all the CPUs that the task/signal is running on. That's
> a bit difficult to do though. I think we had something to try to send an IPI to a task,
> but I can't retrieve it. Looks easy to do anyway. But in the signal case I'd need to do
> that for all tasks in the group. That sounds like a costly loop.
> 
> So I simplify that with a global IPI.

Sure, but what I meant was that:

	for_each_cpu(cpu, mask)
		tick_nohz_full_kick_cpu(cpu);

is IRQ safe, whereas the current thingy is not. Sure, it's a wee bit less
efficient, but do we really care?
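
Concretely, an IRQ-safe kick-all along those lines might look like the sketch
below (the nohz_full mask and the enabled check are taken from the existing
tick-sched code; this is only an illustration, not the eventual patch):

/* Sketch: IRQ-safe "kick all nohz_full CPUs", built on the per-CPU kick. */
static void tick_nohz_full_kick_all_irqsafe(void)
{
	int cpu;

	if (!tick_nohz_full_enabled())
		return;

	preempt_disable();
	for_each_cpu(cpu, tick_nohz_full_mask)
		tick_nohz_full_kick_cpu(cpu);
	preempt_enable();
}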

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/7] perf: Migrate perf to use new tick dependency mask model
  2015-11-25 12:34     ` Frederic Weisbecker
@ 2015-12-02 16:17       ` Peter Zijlstra
  2015-12-02 17:03         ` Frederic Weisbecker
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2015-12-02 16:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, LKML, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Wed, Nov 25, 2015 at 01:34:30PM +0100, Frederic Weisbecker wrote:
> On Tue, Nov 24, 2015 at 11:19:33AM -0500, Chris Metcalf wrote:

> > It would be helpful to have a comment explaining why these two
> > can't race with each other, e.g. this race:
> > 
> > [cpu 1]  atomic_dec_and_test
> > [cpu 2] atomic_inc_return
> > [cpu 2] tick_nohz_set_dep()
> > [cpu 1] tick_nohz_clear_dep()
> > 
> > Or perhaps this is a true race condition possibility?
> > 
> > I think we're OK for the sched cases since they're protected under
> > the rq lock, I think.  I'm not sure about the POSIX cpu timers.
> 
> Hmm, how did I miss that...
> 
> So in the case of perf, either we need locking, in which case we may want
> to use something like tick_nohz_add_dep() which takes care of counting.
> But perf would be the only user.

Right; so you could use something like atomic_dec_and_mutex_lock(), that
would only globally serialize the 0<->1 transitions without incurring
overhead to any other state transitions.
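
As a sketch of that first option (the mutex is hypothetical and only taken
on the boundary transitions):

static DEFINE_MUTEX(nr_freq_lock);	/* hypothetical */

	/* account side: serialize the 0 -> 1 transition against a racing clear */
	if (atomic_inc_return(&nr_freq_events) == 1) {
		mutex_lock(&nr_freq_lock);
		if (atomic_read(&nr_freq_events))
			tick_nohz_set_dep(TICK_PERF_EVENTS_BIT);
		mutex_unlock(&nr_freq_lock);
	}

	/* unaccount side: the mutex is taken only when the count drops to 0 */
	if (atomic_dec_and_mutex_lock(&nr_freq_events, &nr_freq_lock)) {
		tick_nohz_clear_dep(TICK_PERF_EVENTS_BIT);
		mutex_unlock(&nr_freq_lock);
	}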

A possibly even less expensive option might be something along the lines
of:

tick_nohz_update_perf_dep()
{
	static DEFINE_SPINLOCK(lock);
	bool freq;

	spin_lock(&lock);
	freq = !!atomic_read(&nr_freq_events);
	if (freq ^ !!tick_nohz_test_dep(PERF)) {
		if (freq)
			tick_nohz_set_dep(PERF);
		else
			tick_nohz_clear_dep(PERF);
	}
	spin_unlock(&lock);
}


	if (atomic_inc_return(&nr_freq_events) == 1)
		tick_nohz_update_perf_dep();


	if (atomic_dec_return(&nr_freq_events) == 0)
		tick_nohz_update_perf_dep();


That avoids the atomic_add_unless() muckery.

> _ sched_clock_stable: that one is more obscure. It seems that set_sched_clock_stable()
>   and clear_sched_clock_stable() can race on static keys if running concurrently, and
>   that would concern tick mask as well.

All you need to ensure here is that clear wins. Once we clear
sched_clock_stable we should _never_ set it again.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/7] perf: Migrate perf to use new tick dependency mask model
  2015-12-02 16:17       ` Peter Zijlstra
@ 2015-12-02 17:03         ` Frederic Weisbecker
  2015-12-02 17:15           ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2015-12-02 17:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, LKML, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Wed, Dec 02, 2015 at 05:17:58PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 25, 2015 at 01:34:30PM +0100, Frederic Weisbecker wrote:
> > On Tue, Nov 24, 2015 at 11:19:33AM -0500, Chris Metcalf wrote:
> 
> > > It would be helpful to have a comment explaining why these two
> > > can't race with each other, e.g. this race:
> > > 
> > > [cpu 1]  atomic_dec_and_test
> > > [cpu 2] atomic_inc_return
> > > [cpu 2] tick_nohz_set_dep()
> > > [cpu 1] tick_nohz_clear_dep()
> > > 
> > > Or perhaps this is a true race condition possibility?
> > > 
> > > I think we're OK for the sched cases since they're protected under
> > > the rq lock, I think.  I'm not sure about the POSIX cpu timers.
> > 
> > Hmm, how did I miss that...
> > 
> > So in the case of perf, either we need locking, in which case we may want
> > to use something like tick_nohz_add_dep() which takes care of counting.
> > But perf would be the only user.
> 
> Right; so you could use something like atomic_dec_and_mutex_lock(), that
> would only globally serialize the 0<->1 transitions without incurring
> overhead to any other state transitions.
> 
> A possibly even less expensive option might be something along the lines
> of:
> 
> tick_nohz_update_perf_dep()
> {
> 	static DEFINE_SPINLOCK(lock);
> 	bool freq;
> 
> 	spin_lock(&lock);
> 	freq = !!atomic_read(&nr_freq_events);
> 	if (freq ^ !!tick_nohz_test_dep(PERF)) {
> 		if (freq)
> 			tick_nohz_set_dep(PERF);
> 		else
> 			tick_nohz_clear_dep(PERF);
> 	}
> 	spin_unlock(&lock);
> }

Well, doing the inc/dec inside the lock is easier to read :-)

> 
> 
> 	if (atomic_inc_return(&nr_freq_events) == 1)
> 		tick_nohz_update_perf_dep();
> 
> 
> 	if (atomic_dec_return(&nr_freq_events) == 0)
> 		tick_nohz_update_perf_dep();
> 
> 
> That avoids the atomic_add_unless() muckery.

Right, I can do either that, or I can move the dependency to the CPU level
and count nr_freq in the cpu_ctx when any ctx gets scheduled in/out. Then
every time we inc and nr_freq == 1, we set the dependency (all that should
be serialized as it only happens locally).

Which way do you prefer? Arguably, the global dependency approach is probably
simpler (although the above based on _test() and atomic_read() is tricky) and the CPU
dependency is more fine-grained (and we avoid the above tricks). Not sure we need it to
be fine-grained at this level though.

> 
> > _ sched_clock_stable: that one is more obscure. It seems that set_sched_clock_stable()
> >   and clear_sched_clock_stable() can race on static keys if running concurrently, and
> >   that would concern tick mask as well.
> 
> All you need to ensure here is that clear wins. Once we clear
> sched_clock_stable we should _never_ set it again.

Ok, so we'd better make sure that such a sequence:

     set_sched_clock_stable();
     clear_sched_clock_stable();

...is always serialized by callers somehow. If they can't happen in parallel, we are
fine, because the set is sequential whereas the clear does the static key change asynchronously.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/7] perf: Migrate perf to use new tick dependency mask model
  2015-12-02 17:03         ` Frederic Weisbecker
@ 2015-12-02 17:15           ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2015-12-02 17:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, LKML, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Wed, Dec 02, 2015 at 06:03:14PM +0100, Frederic Weisbecker wrote:
> Right, I can do either that, or I can move the dependency to the CPU level
> and count nr_freq in the cpu_ctx when any ctx gets scheduled in/out. Then
> every time we inc and nr_freq == 1, we set the dependency (all that should
> be serialized as it only happens locally).

Doing it per CPU would, as you say, add accounting crap to the context
switch path. Now given that context switches with perf enabled are
already silly expensive that might just fall away into the noise.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] nohz: New tick dependency mask
  2015-12-02 15:09           ` Peter Zijlstra
@ 2015-12-08 15:57             ` Frederic Weisbecker
  0 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2015-12-08 15:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Wed, Dec 02, 2015 at 04:09:08PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 02, 2015 at 03:08:16PM +0100, Frederic Weisbecker wrote:
> > On Wed, Dec 02, 2015 at 11:56:33AM +0100, Peter Zijlstra wrote:
> > > On Tue, Dec 01, 2015 at 11:20:28PM +0100, Frederic Weisbecker wrote:
> > > > > > +	prev = fetch_or(dep, BIT_MASK(bit));
> > > > > > +	if (!prev) {
> > > > > > +		/*
> > > > > > +		* We need the IPIs to be sent from sane process context.
> > > > > 
> > > > > Why ?
> > > > 
> > > > Because posix timers code is all called with interrupts disabled and we can't
> > > > send IPIs then.
> > > > 
> > > > > 
> > > > > > +		* The posix cpu timers are always set with irqs disabled.
> > > > > > +		*/
> > > > > > +		schedule_work(&kick_all_work);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * Set a global tick dependency. Lets do the wide IPI kick asynchronously
> > > > > > + * for callers with irqs disabled.
> > > > > 
> > > > > This seems to suggest you can call this with IRQs disabled
> > > > 
> > > > Ah right, that's a misleading comment. We need to use the _delayed() version
> > > > when interrupts are disabled.
> > > 
> > > Why can't you use tick_nohz_full_kick_cpu() for all that, which is
> > > usable from IRQ context and avoid all that delayed muck?
> > 
> > Because I need to kick all the CPUs that the task/signal is running on. That's
> > a bit difficult to do though. I think we had something to try to send an IPI to a task,
> > but I can't retrieve it. Looks easy to do anyway. But in the signal case I'd need to do
> > that for all tasks in the group. That sounds like a costly loop.
> > 
> > So I simplify that with a global IPI.
> 
> Sure, but what I meant was that:
> 
> 	for_each_cpu(cpu, mask)
> 		tick_nohz_full_kick_cpu(cpu);
> 
> is IRQ safe, whereas the current thingy is not. Sure, it's a wee bit less
> efficient, but do we really care?

Right, this overhead probably doesn't matter much. I'll do that and we'll see if
people complain.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2015-12-08 15:57 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-13 14:22 [PATCH 0/7] nohz: Tick dependency mask v3 Frederic Weisbecker
2015-11-13 14:22 ` [PATCH 1/7] atomic: Export fetch_or() Frederic Weisbecker
2015-11-24 15:58   ` Chris Metcalf
2015-11-24 21:19     ` Frederic Weisbecker
2015-11-24 21:48       ` Chris Metcalf
2015-11-30 17:36         ` Frederic Weisbecker
2015-11-30 18:17           ` Chris Metcalf
2015-11-25  9:13       ` Peter Zijlstra
2015-11-13 14:22 ` [PATCH 2/7] nohz: New tick dependency mask Frederic Weisbecker
2015-11-24 16:19   ` Chris Metcalf
2015-11-25 11:32     ` Frederic Weisbecker
2015-12-01 20:41   ` Peter Zijlstra
2015-12-01 22:20     ` Frederic Weisbecker
2015-12-02 10:56       ` Peter Zijlstra
2015-12-02 14:08         ` Frederic Weisbecker
2015-12-02 15:09           ` Peter Zijlstra
2015-12-08 15:57             ` Frederic Weisbecker
2015-12-02 12:45       ` Peter Zijlstra
2015-12-02 14:10         ` Frederic Weisbecker
2015-12-02 12:48       ` Peter Zijlstra
2015-12-02 14:11         ` Frederic Weisbecker
2015-11-13 14:22 ` [PATCH 3/7] perf: Migrate perf to use new tick dependency mask model Frederic Weisbecker
2015-11-24 16:19   ` Chris Metcalf
2015-11-25 12:34     ` Frederic Weisbecker
2015-12-02 16:17       ` Peter Zijlstra
2015-12-02 17:03         ` Frederic Weisbecker
2015-12-02 17:15           ` Peter Zijlstra
2015-11-13 14:22 ` [PATCH 4/7] sched: Account rr and fifo tasks separately Frederic Weisbecker
2015-12-02 12:53   ` Peter Zijlstra
2015-12-02 14:16     ` Frederic Weisbecker
2015-11-13 14:22 ` [PATCH 5/7] sched: Migrate sched to use new tick dependency mask model Frederic Weisbecker
2015-11-13 14:22 ` [PATCH 6/7] posix-cpu-timers: Migrate " Frederic Weisbecker
2015-11-13 14:22 ` [PATCH 7/7] sched-clock: " Frederic Weisbecker
