* [PATCH v2 0/5] sched: Cleanup and improve polling idle loops
@ 2014-06-04 17:31 Andy Lutomirski
  2014-06-04 17:31 ` [PATCH v2 1/5] cpuidle: Set polling in poll_idle Andy Lutomirski
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Andy Lutomirski @ 2014-06-04 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

This series reduces the number of IPIs on my workload by something like
99%.  It's down from many hundreds per second to very few.

The basic idea behind this series is to make TIF_POLLING_NRFLAG be a
reliable indication that the idle task is polling.  Once that's done,
the rest is reasonably straightforward.
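
To make the waker side concrete, here is a minimal userspace model of the
trick this series relies on: set TIF_NEED_RESCHED with a single atomic RMW
and learn, from the value that RMW returns, whether the target was polling;
only a non-polling target still needs the IPI.  The flag values and the C11
atomics below are stand-ins for the kernel's thread_info bits and fetch_or(),
so treat this as an illustration of the idea rather than the kernel code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative flag values; the real _TIF_* bits are arch-specific. */
#define _TIF_NEED_RESCHED	(1UL << 0)
#define _TIF_POLLING_NRFLAG	(1UL << 1)

/* Stand-in for the idle task's thread_info->flags word. */
static _Atomic unsigned long ti_flags;

/* True means the target was not polling, so an IPI is still needed. */
static bool set_nr_and_not_polling(void)
{
	unsigned long old = atomic_fetch_or(&ti_flags, _TIF_NEED_RESCHED);

	return !(old & _TIF_POLLING_NRFLAG);
}

int main(void)
{
	atomic_store(&ti_flags, _TIF_POLLING_NRFLAG);	/* idle and polling */
	printf("polling idle task: send IPI? %d\n", set_nr_and_not_polling());

	atomic_store(&ti_flags, 0);			/* not polling */
	printf("non-polling task:  send IPI? %d\n", set_nr_and_not_polling());

	return 0;
}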

Patches 1 and 2 are related improvements: patch 1 teaches the cpuidle
polling loop how to poll, and patch 2 adds tracepoints so that avoided
IPIs are visible.  Patch 3 is the main semantic change, patch 4 is
cleanup, and patch 5 is peterz's code, rebased on top of my stuff, and
fixed up a bit.

Changes from v1:

 - Squashed the two idle loop rearrangement patches.
 - Improved comments.

Andy Lutomirski (4):
  cpuidle: Set polling in poll_idle
  sched,trace: Add a tracepoint for IPI-less remote wakeups
  sched,idle: Clear polling before descheduling the idle thread
  sched,idle: Simplify wake_up_idle_cpu

Peter Zijlstra (1):
  sched: Optimize ttwu IPI

 drivers/cpuidle/driver.c     |  7 ++--
 include/trace/events/sched.h | 20 +++++++++++
 kernel/sched/core.c          | 79 +++++++++++++++++++++++++++++---------------
 kernel/sched/idle.c          | 30 ++++++++++++++++-
 kernel/sched/sched.h         |  6 ++++
 5 files changed, 113 insertions(+), 29 deletions(-)

-- 
1.9.3



* [PATCH v2 1/5] cpuidle: Set polling in poll_idle
  2014-06-04 17:31 [PATCH v2 0/5] sched: Cleanup and improve polling idle loops Andy Lutomirski
@ 2014-06-04 17:31 ` Andy Lutomirski
  2014-06-05 14:37   ` [tip:sched/core] " tip-bot for Andy Lutomirski
  2014-06-04 17:31 ` [PATCH v2 2/5] sched,trace: Add a tracepoint for IPI-less remote wakeups Andy Lutomirski
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2014-06-04 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

poll_idle is the archetypal polling idle loop; tell the core idle
code about it.

This avoids pointless IPIs when all of the other cpuidle states are
disabled.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 drivers/cpuidle/driver.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
index 136d6a2..9634f20 100644
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -187,8 +187,11 @@ static int poll_idle(struct cpuidle_device *dev,
 
 	t1 = ktime_get();
 	local_irq_enable();
-	while (!need_resched())
-		cpu_relax();
+	if (!current_set_polling_and_test()) {
+		while (!need_resched())
+			cpu_relax();
+	}
+	current_clr_polling();
 
 	t2 = ktime_get();
 	diff = ktime_to_us(ktime_sub(t2, t1));
-- 
1.9.3



* [PATCH v2 2/5] sched,trace: Add a tracepoint for IPI-less remote wakeups
  2014-06-04 17:31 [PATCH v2 0/5] sched: Cleanup and improve polling idle loops Andy Lutomirski
  2014-06-04 17:31 ` [PATCH v2 1/5] cpuidle: Set polling in poll_idle Andy Lutomirski
@ 2014-06-04 17:31 ` Andy Lutomirski
  2014-06-05 14:37   ` [tip:sched/core] sched, trace: " tip-bot for Andy Lutomirski
  2014-06-04 17:31 ` [PATCH v2 3/5] sched,idle: Clear polling before descheduling the idle thread Andy Lutomirski
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2014-06-04 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

Remote wakeups of polling CPUs are a valuable performance
improvement; add a tracepoint to make it much easier to verify that
they're working.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/trace/events/sched.h | 20 ++++++++++++++++++++
 kernel/sched/core.c          |  4 ++++
 2 files changed, 24 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 67e1bbf..0a68d5a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -530,6 +530,26 @@ TRACE_EVENT(sched_swap_numa,
 			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
 			__entry->dst_cpu, __entry->dst_nid)
 );
+
+/*
+ * Tracepoint for waking a polling cpu without an IPI.
+ */
+TRACE_EVENT(sched_wake_idle_without_ipi,
+
+	TP_PROTO(int cpu),
+
+	TP_ARGS(cpu),
+
+	TP_STRUCT__entry(
+		__field(	int,	cpu	)
+	),
+
+	TP_fast_assign(
+		__entry->cpu	= cpu;
+	),
+
+	TP_printk("cpu=%d", __entry->cpu)
+);
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 321d800..a002be7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -564,6 +564,8 @@ void resched_task(struct task_struct *p)
 
 	if (set_nr_and_not_polling(p))
 		smp_send_reschedule(cpu);
+	else
+		trace_sched_wake_idle_without_ipi(cpu);
 }
 
 void resched_cpu(int cpu)
@@ -647,6 +649,8 @@ static void wake_up_idle_cpu(int cpu)
 	smp_mb();
 	if (!tsk_is_polling(rq->idle))
 		smp_send_reschedule(cpu);
+	else
+		trace_sched_wake_idle_without_ipi(cpu);
 }
 
 static bool wake_up_full_nohz_cpu(int cpu)
-- 
1.9.3



* [PATCH v2 3/5] sched,idle: Clear polling before descheduling the idle thread
  2014-06-04 17:31 [PATCH v2 0/5] sched: Cleanup and improve polling idle loops Andy Lutomirski
  2014-06-04 17:31 ` [PATCH v2 1/5] cpuidle: Set polling in poll_idle Andy Lutomirski
  2014-06-04 17:31 ` [PATCH v2 2/5] sched,trace: Add a tracepoint for IPI-less remote wakeups Andy Lutomirski
@ 2014-06-04 17:31 ` Andy Lutomirski
  2014-06-04 17:36   ` Peter Zijlstra
  2014-06-05 14:37   ` [tip:sched/core] sched/idle: " tip-bot for Andy Lutomirski
  2014-06-04 17:31 ` [PATCH v2 4/5] sched,idle: Simplify wake_up_idle_cpu Andy Lutomirski
  2014-06-04 17:31 ` [PATCH v2 5/5] sched: Optimize ttwu IPI Andy Lutomirski
  4 siblings, 2 replies; 12+ messages in thread
From: Andy Lutomirski @ 2014-06-04 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

Currently, the only real guarantee provided by the polling bit is
that, if you hold rq->lock and the polling bit is set, then you can
set need_resched to force a reschedule.

The only reason the lock is needed is that the idle thread might not
be running at all when setting its need_resched bit, and rq->lock
keeps it pinned.

This is easy to fix: just clear the polling bit before scheduling.
Now the idle thread's polling bit is only ever set when
rq->curr == rq->idle.
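
As an illustration of the resulting exit protocol, here is a minimal
userspace sketch (made-up flag values; the seq_cst C11 RMWs stand in for the
kernel's bitops plus the full barrier the loop issues after clearing the
bit), modelling the idea rather than the scheduler code in the patch below:

#include <stdatomic.h>
#include <stdio.h>

#define NEED_RESCHED	(1UL << 0)
#define POLLING		(1UL << 1)

/* Stand-in for the idle task's thread_info->flags word. */
static _Atomic unsigned long ti_flags;

static void idle_enter(void)
{
	/* From here on, setting NEED_RESCHED is enough to wake us. */
	atomic_fetch_or(&ti_flags, POLLING);
}

static void idle_exit(void)
{
	/*
	 * Clear POLLING before descheduling.  The seq_cst RMW also plays
	 * the role of the full barrier in the real code, so the clear is
	 * visible before we can be descheduled and a set POLLING bit
	 * always implies rq->curr == rq->idle.
	 */
	atomic_fetch_and(&ti_flags, ~POLLING);
	/* schedule_preempt_disabled() would run here */
}

int main(void)
{
	idle_enter();
	printf("idle, polling:   flags=%#lx\n", atomic_load(&ti_flags));
	idle_exit();
	printf("before schedule: flags=%#lx\n", atomic_load(&ti_flags));
	return 0;
}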

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 kernel/sched/idle.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 25b9423..1065347 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -67,6 +67,10 @@ void __weak arch_cpu_idle(void)
  * cpuidle_idle_call - the main idle function
  *
  * NOTE: no locks or semaphores should be used here
+ *
+ * On archs that support TIF_POLLING_NRFLAG, this function is called
+ * with polling set, and it returns with polling set.  If it ever
+ * stops polling, it must clear the polling bit.
  */
 static void cpuidle_idle_call(void)
 {
@@ -175,10 +179,22 @@ exit_idle:
 
 /*
  * Generic idle loop implementation
+ *
+ * Called with polling cleared.
  */
 static void cpu_idle_loop(void)
 {
 	while (1) {
+		/*
+		 * If the arch has a polling bit, we maintain an invariant:
+		 *
+		 * Our polling bit is clear if we're not scheduled (i.e. if
+		 * rq->curr != rq->idle).  This means that, if rq->idle has
+		 * the polling bit set, then setting need_resched is
+		 * guaranteed to cause the cpu to reschedule.
+		 */
+
+		__current_set_polling();
 		tick_nohz_idle_enter();
 
 		while (!need_resched()) {
@@ -218,6 +234,15 @@ static void cpu_idle_loop(void)
 		 */
 		preempt_set_need_resched();
 		tick_nohz_idle_exit();
+		__current_clr_polling();
+
+		/*
+		 * We promise to reschedule if need_resched is set while
+		 * polling is set.  That means that clearing polling
+		 * needs to be visible before rescheduling.
+		 */
+		smp_mb__after_clear_bit();
+
 		schedule_preempt_disabled();
 	}
 }
@@ -239,7 +264,6 @@ void cpu_startup_entry(enum cpuhp_state state)
 	 */
 	boot_init_stack_canary();
 #endif
-	__current_set_polling();
 	arch_cpu_idle_prepare();
 	cpu_idle_loop();
 }
-- 
1.9.3



* [PATCH v2 4/5] sched,idle: Simplify wake_up_idle_cpu
  2014-06-04 17:31 [PATCH v2 0/5] sched: Cleanup and improve polling idle loops Andy Lutomirski
                   ` (2 preceding siblings ...)
  2014-06-04 17:31 ` [PATCH v2 3/5] sched,idle: Clear polling before descheduling the idle thread Andy Lutomirski
@ 2014-06-04 17:31 ` Andy Lutomirski
  2014-06-05 14:37   ` [tip:sched/core] sched/idle: Simplify wake_up_idle_cpu() tip-bot for Andy Lutomirski
  2014-06-04 17:31 ` [PATCH v2 5/5] sched: Optimize ttwu IPI Andy Lutomirski
  4 siblings, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2014-06-04 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

Now that rq->idle's polling bit is a reliable indication that the cpu is
polling, use it.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 kernel/sched/core.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a002be7..f6e4621 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -628,26 +628,7 @@ static void wake_up_idle_cpu(int cpu)
 	if (cpu == smp_processor_id())
 		return;
 
-	/*
-	 * This is safe, as this function is called with the timer
-	 * wheel base lock of (cpu) held. When the CPU is on the way
-	 * to idle and has not yet set rq->curr to idle then it will
-	 * be serialized on the timer wheel base lock and take the new
-	 * timer into account automatically.
-	 */
-	if (rq->curr != rq->idle)
-		return;
-
-	/*
-	 * We can set TIF_RESCHED on the idle task of the other CPU
-	 * lockless. The worst case is that the other CPU runs the
-	 * idle task through an additional NOOP schedule()
-	 */
-	set_tsk_need_resched(rq->idle);
-
-	/* NEED_RESCHED must be visible before we test polling */
-	smp_mb();
-	if (!tsk_is_polling(rq->idle))
+	if (set_nr_and_not_polling(rq->idle))
 		smp_send_reschedule(cpu);
 	else
 		trace_sched_wake_idle_without_ipi(cpu);
-- 
1.9.3



* [PATCH v2 5/5] sched: Optimize ttwu IPI
  2014-06-04 17:31 [PATCH v2 0/5] sched: Cleanup and improve polling idle loops Andy Lutomirski
                   ` (3 preceding siblings ...)
  2014-06-04 17:31 ` [PATCH v2 4/5] sched,idle: Simplify wake_up_idle_cpu Andy Lutomirski
@ 2014-06-04 17:31 ` Andy Lutomirski
  2014-06-05 14:37   ` [tip:sched/core] sched/idle: Optimize try-to-wake-up IPI tip-bot for Peter Zijlstra
  4 siblings, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2014-06-04 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

From: Peter Zijlstra <peterz@infradead.org>

When enqueueing tasks on remote LLC domains, we send an IPI to do the
work 'locally' and avoid bouncing all the cachelines over.

However, when the remote CPU is idle (and polling, say x86 mwait), we
don't need to send an IPI, we can simply kick the TIF word to wake it
up and have the 'idle' loop do the work.

So when _TIF_POLLING_NRFLAG is set, but _TIF_NEED_RESCHED is not (yet)
set, set _TIF_NEED_RESCHED and avoid sending the IPI.

[Edited by Andy Lutomirski, but this is mostly Peter Zijlstra's code.]

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Much-requested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 kernel/sched/core.c  | 54 ++++++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/idle.c  | 10 +++++++---
 kernel/sched/sched.h |  6 ++++++
 3 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f6e4621..936a081 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -519,7 +519,7 @@ static inline void init_hrtick(void)
  	__old;								\
 })
 
-#ifdef TIF_POLLING_NRFLAG
+#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
 /*
  * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
  * this avoids any races wrt polling state changes and thereby avoids
@@ -530,12 +530,44 @@ static bool set_nr_and_not_polling(struct task_struct *p)
 	struct thread_info *ti = task_thread_info(p);
 	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
 }
+
+/*
+ * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
+ *
+ * If this returns true, then the idle task promises to call
+ * sched_ttwu_pending() and reschedule soon.
+ */
+static bool set_nr_if_polling(struct task_struct *p)
+{
+	struct thread_info *ti = task_thread_info(p);
+	typeof(ti->flags) old, val = ACCESS_ONCE(ti->flags);
+
+	for (;;) {
+		if (!(val & _TIF_POLLING_NRFLAG))
+			return false;
+		if (val & _TIF_NEED_RESCHED)
+			return true;
+		old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
+		if (old == val)
+			break;
+		val = old;
+	}
+	return true;
+}
+
 #else
 static bool set_nr_and_not_polling(struct task_struct *p)
 {
 	set_tsk_need_resched(p);
 	return true;
 }
+
+#ifdef CONFIG_SMP
+static bool set_nr_if_polling(struct task_struct *p)
+{
+	return false;
+}
+#endif
 #endif
 
 /*
@@ -1490,13 +1522,17 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+void sched_ttwu_pending(void)
 {
 	struct rq *rq = this_rq();
 	struct llist_node *llist = llist_del_all(&rq->wake_list);
 	struct task_struct *p;
+	unsigned long flags;
 
-	raw_spin_lock(&rq->lock);
+	if (!llist)
+		return;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	while (llist) {
 		p = llist_entry(llist, struct task_struct, wake_entry);
@@ -1504,7 +1540,7 @@ static void sched_ttwu_pending(void)
 		ttwu_do_activate(rq, p, 0);
 	}
 
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
 void scheduler_ipi(void)
@@ -1550,8 +1586,14 @@ void scheduler_ipi(void)
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)
 {
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
-		smp_send_reschedule(cpu);
+	struct rq *rq = cpu_rq(cpu);
+
+	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
+		if (!set_nr_if_polling(rq->idle))
+			smp_send_reschedule(cpu);
+		else
+			trace_sched_wake_idle_without_ipi(cpu);
+	}
 }
 
 bool cpus_share_cache(int this_cpu, int that_cpu)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1065347..1234324 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -12,6 +12,8 @@
 
 #include <trace/events/power.h>
 
+#include "sched.h"
+
 static int __read_mostly cpu_idle_force_poll;
 
 void cpu_idle_poll_ctrl(bool enable)
@@ -237,12 +239,14 @@ static void cpu_idle_loop(void)
 		__current_clr_polling();
 
 		/*
-		 * We promise to reschedule if need_resched is set while
-		 * polling is set.  That means that clearing polling
-		 * needs to be visible before rescheduling.
+		 * We promise to call sched_ttwu_pending and reschedule
+		 * if need_resched is set while polling is set.  That
+		 * means that clearing polling needs to be visible
+		 * before doing these things.
 		 */
 		smp_mb__after_clear_bit();
 
+		sched_ttwu_pending();
 		schedule_preempt_disabled();
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 600e229..99d9e81 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -670,6 +670,8 @@ extern int migrate_swap(struct task_struct *, struct task_struct *);
 
 #ifdef CONFIG_SMP
 
+extern void sched_ttwu_pending(void);
+
 #define rcu_dereference_check_sched_domain(p) \
 	rcu_dereference_check((p), \
 			      lockdep_is_held(&sched_domains_mutex))
@@ -787,6 +789,10 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
 
 extern int group_balance_cpu(struct sched_group *sg);
 
+#else
+
+static inline void sched_ttwu_pending(void) { }
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
-- 
1.9.3



* Re: [PATCH v2 3/5] sched,idle: Clear polling before descheduling the idle thread
  2014-06-04 17:31 ` [PATCH v2 3/5] sched,idle: Clear polling before descheduling the idle thread Andy Lutomirski
@ 2014-06-04 17:36   ` Peter Zijlstra
  2014-06-05 14:37   ` [tip:sched/core] sched/idle: " tip-bot for Andy Lutomirski
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2014-06-04 17:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: umgwanakikbuti, mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel

On Wed, Jun 04, 2014 at 10:31:16AM -0700, Andy Lutomirski wrote:
> @@ -218,6 +234,15 @@ static void cpu_idle_loop(void)
>  		 */
>  		preempt_set_need_resched();
>  		tick_nohz_idle_exit();
> +		__current_clr_polling();
> +
> +		/*
> +		 * We promise to reschedule if need_resched is set while
> +		 * polling is set.  That means that clearing polling
> +		 * needs to be visible before rescheduling.
> +		 */
> +		smp_mb__after_clear_bit();
> +
>  		schedule_preempt_disabled();
>  	}
>  }

I recently renamed those barriers; it's now called:

  smp_mb__after_atomic();

It'll still compile with the old names, and even work, but you'll get
__deprecated warnings and horrid code generation.

I'll fix up when applying these patches, no need to resend.

Thanks!


* [tip:sched/core] cpuidle: Set polling in poll_idle
  2014-06-04 17:31 ` [PATCH v2 1/5] cpuidle: Set polling in poll_idle Andy Lutomirski
@ 2014-06-05 14:37   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Andy Lutomirski @ 2014-06-05 14:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, luto, tglx, rjw,
	daniel.lezcano

Commit-ID:  84c407084137d4e491b07ea5ff8665d19106a5ac
Gitweb:     http://git.kernel.org/tip/84c407084137d4e491b07ea5ff8665d19106a5ac
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Wed, 4 Jun 2014 10:31:14 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 Jun 2014 12:09:49 +0200

cpuidle: Set polling in poll_idle

poll_idle is the archetypal polling idle loop; tell the core idle
code about it.

This avoids pointless IPIs when all of the other cpuidle states are
disabled.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: nicolas.pitre@linaro.org
Cc: umgwanakikbuti@gmail.com
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Link: http://lkml.kernel.org/r/c65ce49615d338bae8fb79df5daffab19353c900.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 drivers/cpuidle/driver.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
index 136d6a2..9634f20 100644
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -187,8 +187,11 @@ static int poll_idle(struct cpuidle_device *dev,
 
 	t1 = ktime_get();
 	local_irq_enable();
-	while (!need_resched())
-		cpu_relax();
+	if (!current_set_polling_and_test()) {
+		while (!need_resched())
+			cpu_relax();
+	}
+	current_clr_polling();
 
 	t2 = ktime_get();
 	diff = ktime_to_us(ktime_sub(t2, t1));


* [tip:sched/core] sched, trace: Add a tracepoint for IPI-less remote wakeups
  2014-06-04 17:31 ` [PATCH v2 2/5] sched,trace: Add a tracepoint for IPI-less remote wakeups Andy Lutomirski
@ 2014-06-05 14:37   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Andy Lutomirski @ 2014-06-05 14:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, luto, fweisbec,
	rostedt, mgorman, dsahern, oleg, tglx

Commit-ID:  dfc68f29ae67f2a6e799b44e6a4eb3417dffbfcd
Gitweb:     http://git.kernel.org/tip/dfc68f29ae67f2a6e799b44e6a4eb3417dffbfcd
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Wed, 4 Jun 2014 10:31:15 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 Jun 2014 12:09:50 +0200

sched, trace: Add a tracepoint for IPI-less remote wakeups

Remote wakeups of polling CPUs are a valuable performance
improvement; add a tracepoint to make it much easier to verify that
they're working.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: umgwanakikbuti@gmail.com
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/16205aee116772aa686814f9b13bccb562108047.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/trace/events/sched.h | 20 ++++++++++++++++++++
 kernel/sched/core.c          |  4 ++++
 2 files changed, 24 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 67e1bbf..0a68d5a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -530,6 +530,26 @@ TRACE_EVENT(sched_swap_numa,
 			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
 			__entry->dst_cpu, __entry->dst_nid)
 );
+
+/*
+ * Tracepoint for waking a polling cpu without an IPI.
+ */
+TRACE_EVENT(sched_wake_idle_without_ipi,
+
+	TP_PROTO(int cpu),
+
+	TP_ARGS(cpu),
+
+	TP_STRUCT__entry(
+		__field(	int,	cpu	)
+	),
+
+	TP_fast_assign(
+		__entry->cpu	= cpu;
+	),
+
+	TP_printk("cpu=%d", __entry->cpu)
+);
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5976ca5..e4c0ddd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -564,6 +564,8 @@ void resched_task(struct task_struct *p)
 
 	if (set_nr_and_not_polling(p))
 		smp_send_reschedule(cpu);
+	else
+		trace_sched_wake_idle_without_ipi(cpu);
 }
 
 void resched_cpu(int cpu)
@@ -647,6 +649,8 @@ static void wake_up_idle_cpu(int cpu)
 	smp_mb();
 	if (!tsk_is_polling(rq->idle))
 		smp_send_reschedule(cpu);
+	else
+		trace_sched_wake_idle_without_ipi(cpu);
 }
 
 static bool wake_up_full_nohz_cpu(int cpu)


* [tip:sched/core] sched/idle: Clear polling before descheduling the idle thread
  2014-06-04 17:31 ` [PATCH v2 3/5] sched,idle: Clear polling before descheduling the idle thread Andy Lutomirski
  2014-06-04 17:36   ` Peter Zijlstra
@ 2014-06-05 14:37   ` tip-bot for Andy Lutomirski
  1 sibling, 0 replies; 12+ messages in thread
From: tip-bot for Andy Lutomirski @ 2014-06-05 14:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, rafael.j.wysocki, luto, tglx

Commit-ID:  82c65d60d64401aedc1006d6572469bbfdf148de
Gitweb:     http://git.kernel.org/tip/82c65d60d64401aedc1006d6572469bbfdf148de
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Wed, 4 Jun 2014 10:31:16 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 Jun 2014 12:09:51 +0200

sched/idle: Clear polling before descheduling the idle thread

Currently, the only real guarantee provided by the polling bit is
that, if you hold rq->lock and the polling bit is set, then you can
set need_resched to force a reschedule.

The only reason the lock is needed is that the idle thread might not
be running at all when setting its need_resched bit, and rq->lock
keeps it pinned.

This is easy to fix: just clear the polling bit before scheduling.
Now the idle thread's polling bit is only ever set when
rq->curr == rq->idle.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: umgwanakikbuti@gmail.com
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/b2059fcb4c613d520cb503b6fad6e47033c7c203.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/idle.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 25b9423..fe4b24b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -67,6 +67,10 @@ void __weak arch_cpu_idle(void)
  * cpuidle_idle_call - the main idle function
  *
  * NOTE: no locks or semaphores should be used here
+ *
+ * On archs that support TIF_POLLING_NRFLAG, this function is called
+ * with polling set, and it returns with polling set.  If it ever
+ * stops polling, it must clear the polling bit.
  */
 static void cpuidle_idle_call(void)
 {
@@ -175,10 +179,22 @@ exit_idle:
 
 /*
  * Generic idle loop implementation
+ *
+ * Called with polling cleared.
  */
 static void cpu_idle_loop(void)
 {
 	while (1) {
+		/*
+		 * If the arch has a polling bit, we maintain an invariant:
+		 *
+		 * Our polling bit is clear if we're not scheduled (i.e. if
+		 * rq->curr != rq->idle).  This means that, if rq->idle has
+		 * the polling bit set, then setting need_resched is
+		 * guaranteed to cause the cpu to reschedule.
+		 */
+
+		__current_set_polling();
 		tick_nohz_idle_enter();
 
 		while (!need_resched()) {
@@ -218,6 +234,15 @@ static void cpu_idle_loop(void)
 		 */
 		preempt_set_need_resched();
 		tick_nohz_idle_exit();
+		__current_clr_polling();
+
+		/*
+		 * We promise to reschedule if need_resched is set while
+		 * polling is set.  That means that clearing polling
+		 * needs to be visible before rescheduling.
+		 */
+		smp_mb__after_atomic();
+
 		schedule_preempt_disabled();
 	}
 }
@@ -239,7 +264,6 @@ void cpu_startup_entry(enum cpuhp_state state)
 	 */
 	boot_init_stack_canary();
 #endif
-	__current_set_polling();
 	arch_cpu_idle_prepare();
 	cpu_idle_loop();
 }


* [tip:sched/core] sched/idle: Simplify wake_up_idle_cpu()
  2014-06-04 17:31 ` [PATCH v2 4/5] sched,idle: Simplify wake_up_idle_cpu Andy Lutomirski
@ 2014-06-05 14:37   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Andy Lutomirski @ 2014-06-05 14:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, rafael.j.wysocki, luto, tglx

Commit-ID:  67b9ca70c3030e832999e8d1cdba2984c7bb5bfc
Gitweb:     http://git.kernel.org/tip/67b9ca70c3030e832999e8d1cdba2984c7bb5bfc
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Wed, 4 Jun 2014 10:31:17 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 Jun 2014 12:09:52 +0200

sched/idle: Simplify wake_up_idle_cpu()

Now that rq->idle's polling bit is a reliable indication that the cpu is
polling, use it.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: umgwanakikbuti@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/922f00761445a830ebb23d058e2ae53956ce2d73.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e4c0ddd..6afbfee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -628,26 +628,7 @@ static void wake_up_idle_cpu(int cpu)
 	if (cpu == smp_processor_id())
 		return;
 
-	/*
-	 * This is safe, as this function is called with the timer
-	 * wheel base lock of (cpu) held. When the CPU is on the way
-	 * to idle and has not yet set rq->curr to idle then it will
-	 * be serialized on the timer wheel base lock and take the new
-	 * timer into account automatically.
-	 */
-	if (rq->curr != rq->idle)
-		return;
-
-	/*
-	 * We can set TIF_RESCHED on the idle task of the other CPU
-	 * lockless. The worst case is that the other CPU runs the
-	 * idle task through an additional NOOP schedule()
-	 */
-	set_tsk_need_resched(rq->idle);
-
-	/* NEED_RESCHED must be visible before we test polling */
-	smp_mb();
-	if (!tsk_is_polling(rq->idle))
+	if (set_nr_and_not_polling(rq->idle))
 		smp_send_reschedule(cpu);
 	else
 		trace_sched_wake_idle_without_ipi(cpu);


* [tip:sched/core] sched/idle: Optimize try-to-wake-up IPI
  2014-06-04 17:31 ` [PATCH v2 5/5] sched: Optimize ttwu IPI Andy Lutomirski
@ 2014-06-05 14:37   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-06-05 14:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, rafael.j.wysocki,
	umgwanakikbuti, luto, tglx

Commit-ID:  e3baac47f0e82c4be632f4f97215bb93bf16b342
Gitweb:     http://git.kernel.org/tip/e3baac47f0e82c4be632f4f97215bb93bf16b342
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 4 Jun 2014 10:31:18 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 Jun 2014 12:09:53 +0200

sched/idle: Optimize try-to-wake-up IPI

[ This series reduces the number of IPIs on Andy's workload by something like
  99%. It's down from many hundreds per second to very few.

  The basic idea behind this series is to make TIF_POLLING_NRFLAG be a
  reliable indication that the idle task is polling.  Once that's done,
  the rest is reasonably straightforward. ]

When enqueueing tasks on remote LLC domains, we send an IPI to do the
work 'locally' and avoid bouncing all the cachelines over.

However, when the remote CPU is idle (and polling, say x86 mwait), we
don't need to send an IPI, we can simply kick the TIF word to wake it
up and have the 'idle' loop do the work.

So when _TIF_POLLING_NRFLAG is set, but _TIF_NEED_RESCHED is not (yet)
set, set _TIF_NEED_RESCHED and avoid sending the IPI.

Much-requested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[Edited by Andy Lutomirski, but this is mostly Peter Zijlstra's code.]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: umgwanakikbuti@gmail.com
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/ce06f8b02e7e337be63e97597fc4b248d3aa6f9b.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 54 ++++++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/idle.c  | 10 +++++++---
 kernel/sched/sched.h |  6 ++++++
 3 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6afbfee..60d4e05 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -519,7 +519,7 @@ static inline void init_hrtick(void)
  	__old;								\
 })
 
-#ifdef TIF_POLLING_NRFLAG
+#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
 /*
  * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
  * this avoids any races wrt polling state changes and thereby avoids
@@ -530,12 +530,44 @@ static bool set_nr_and_not_polling(struct task_struct *p)
 	struct thread_info *ti = task_thread_info(p);
 	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
 }
+
+/*
+ * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
+ *
+ * If this returns true, then the idle task promises to call
+ * sched_ttwu_pending() and reschedule soon.
+ */
+static bool set_nr_if_polling(struct task_struct *p)
+{
+	struct thread_info *ti = task_thread_info(p);
+	typeof(ti->flags) old, val = ACCESS_ONCE(ti->flags);
+
+	for (;;) {
+		if (!(val & _TIF_POLLING_NRFLAG))
+			return false;
+		if (val & _TIF_NEED_RESCHED)
+			return true;
+		old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
+		if (old == val)
+			break;
+		val = old;
+	}
+	return true;
+}
+
 #else
 static bool set_nr_and_not_polling(struct task_struct *p)
 {
 	set_tsk_need_resched(p);
 	return true;
 }
+
+#ifdef CONFIG_SMP
+static bool set_nr_if_polling(struct task_struct *p)
+{
+	return false;
+}
+#endif
 #endif
 
 /*
@@ -1490,13 +1522,17 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+void sched_ttwu_pending(void)
 {
 	struct rq *rq = this_rq();
 	struct llist_node *llist = llist_del_all(&rq->wake_list);
 	struct task_struct *p;
+	unsigned long flags;
 
-	raw_spin_lock(&rq->lock);
+	if (!llist)
+		return;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	while (llist) {
 		p = llist_entry(llist, struct task_struct, wake_entry);
@@ -1504,7 +1540,7 @@ static void sched_ttwu_pending(void)
 		ttwu_do_activate(rq, p, 0);
 	}
 
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
 void scheduler_ipi(void)
@@ -1550,8 +1586,14 @@ void scheduler_ipi(void)
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)
 {
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
-		smp_send_reschedule(cpu);
+	struct rq *rq = cpu_rq(cpu);
+
+	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
+		if (!set_nr_if_polling(rq->idle))
+			smp_send_reschedule(cpu);
+		else
+			trace_sched_wake_idle_without_ipi(cpu);
+	}
 }
 
 bool cpus_share_cache(int this_cpu, int that_cpu)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index fe4b24b..cf009fb 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -12,6 +12,8 @@
 
 #include <trace/events/power.h>
 
+#include "sched.h"
+
 static int __read_mostly cpu_idle_force_poll;
 
 void cpu_idle_poll_ctrl(bool enable)
@@ -237,12 +239,14 @@ static void cpu_idle_loop(void)
 		__current_clr_polling();
 
 		/*
-		 * We promise to reschedule if need_resched is set while
-		 * polling is set.  That means that clearing polling
-		 * needs to be visible before rescheduling.
+		 * We promise to call sched_ttwu_pending and reschedule
+		 * if need_resched is set while polling is set.  That
+		 * means that clearing polling needs to be visible
+		 * before doing these things.
 		 */
 		smp_mb__after_atomic();
 
+		sched_ttwu_pending();
 		schedule_preempt_disabled();
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 956b8ca..2f86361 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -670,6 +670,8 @@ extern int migrate_swap(struct task_struct *, struct task_struct *);
 
 #ifdef CONFIG_SMP
 
+extern void sched_ttwu_pending(void);
+
 #define rcu_dereference_check_sched_domain(p) \
 	rcu_dereference_check((p), \
 			      lockdep_is_held(&sched_domains_mutex))
@@ -787,6 +789,10 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
 
 extern int group_balance_cpu(struct sched_group *sg);
 
+#else
+
+static inline void sched_ttwu_pending(void) { }
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"

