linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] sched: Fix wakeup preemption regression
@ 2016-05-10 17:43 Peter Zijlstra
  2016-05-10 17:43 ` [PATCH 1/3] sched,fair: Move record_wakee() Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 35+ messages in thread
From: Peter Zijlstra @ 2016-05-10 17:43 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: peterz, Pavan Kondeti, Ben Segall, Matt Fleming, Mike Galbraith,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

A recent commit caused an interactivity/starvation issue because we wrecked rq
local wakeup preemption.

These patches rectify this while also (hopefully) keeping the problem that led
to the faulty patch fixed.

Mike, Pavan, could you guys please confirm?


* [PATCH 1/3] sched,fair: Move record_wakee()
  2016-05-10 17:43 [PATCH 0/3] sched: Fix wakeup preemption regression Peter Zijlstra
@ 2016-05-10 17:43 ` Peter Zijlstra
  2016-05-12 10:27   ` Matt Fleming
  2016-05-10 17:43 ` [PATCH 2/3] sched,fair: Fix local starvation Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2016-05-10 17:43 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: peterz, Pavan Kondeti, Ben Segall, Matt Fleming, Mike Galbraith,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

[-- Attachment #1: peterz-sched-frob-record-wakee.patch --]
[-- Type: text/plain, Size: 3982 bytes --]

Since I want to make ->task_woken() conditional on the task getting
migrated, we cannot use it to call record_wakee().

Move it to select_task_rq_fair(), which gets called in almost all the
same conditions. The only exception is if the woken task (@p) is
cpu-bound (as per the nr_cpus_allowed test in select_task_rq()).
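
For readers outside the scheduler, a minimal user-space sketch of the
flip-tracking heuristic being moved (editorial illustration, not kernel code:
jiffies/HZ are stand-ins, and the plain compare ignores the wrap handling the
kernel's time_after() provides):

#include <stdio.h>

#define HZ 1000
static unsigned long jiffies;	/* stand-in for the kernel tick counter */

struct task {
	struct task *last_wakee;
	unsigned int wakee_flips;
	unsigned long wakee_flip_decay_ts;
};

/* Same bookkeeping as record_wakee() in the patch below. */
static void record_wakee_sketch(struct task *waker, struct task *p)
{
	/* Halve the flip count at most once per HZ ticks (~1 second). */
	if (jiffies > waker->wakee_flip_decay_ts + HZ) {
		waker->wakee_flips >>= 1;
		waker->wakee_flip_decay_ts = jiffies;
	}

	/* Each switch to a different wakee counts as one "flip". */
	if (waker->last_wakee != p) {
		waker->last_wakee = p;
		waker->wakee_flips++;
	}
}

int main(void)
{
	struct task waker = { 0 }, a = { 0 }, b = { 0 };
	int i;

	/* An M:N waker alternating between wakees accumulates flips fast. */
	for (i = 0; i < 10; i++, jiffies++)
		record_wakee_sketch(&waker, i & 1 ? &a : &b);
	printf("flips while alternating: %u\n", waker.wakee_flips);	/* 10 */

	/* After a quiet second, the next wakeup halves the count. */
	jiffies += 2 * HZ;
	record_wakee_sketch(&waker, &a);
	printf("flips after decay: %u\n", waker.wakee_flips);		/* 5 */
	return 0;
}

wake_wide() then compares these counters against llc_size to decide spreading
vs consolidating, as the rewritten comment in the patch below describes.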

Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Cc: Andrew Hunter <ahh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   61 ++++++++++++++++++++++++++++------------------------
 1 file changed, 33 insertions(+), 28 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4811,24 +4811,6 @@ static unsigned long cpu_avg_load_per_ta
 	return 0;
 }
 
-static void record_wakee(struct task_struct *p)
-{
-	/*
-	 * Rough decay (wiping) for cost saving, don't worry
-	 * about the boundary, really active task won't care
-	 * about the loss.
-	 */
-	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
-		current->wakee_flips >>= 1;
-		current->wakee_flip_decay_ts = jiffies;
-	}
-
-	if (current->last_wakee != p) {
-		current->last_wakee = p;
-		current->wakee_flips++;
-	}
-}
-
 static void task_waking_fair(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
@@ -4848,7 +4830,6 @@ static void task_waking_fair(struct task
 #endif
 
 	se->vruntime -= min_vruntime;
-	record_wakee(p);
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -4966,17 +4947,39 @@ static long effective_load(struct task_g
 
 #endif
 
+static void record_wakee(struct task_struct *p)
+{
+	/*
+	 * Only decay a single time; tasks that have less than 1 wakeup per
+	 * jiffy will not have built up many flips.
+	 */
+	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
+		current->wakee_flips >>= 1;
+		current->wakee_flip_decay_ts = jiffies;
+	}
+
+	if (current->last_wakee != p) {
+		current->last_wakee = p;
+		current->wakee_flips++;
+	}
+}
+
 /*
  * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ *
  * A waker of many should wake a different task than the one last awakened
- * at a frequency roughly N times higher than one of its wakees.  In order
- * to determine whether we should let the load spread vs consolodating to
- * shared cache, we look for a minimum 'flip' frequency of llc_size in one
- * partner, and a factor of lls_size higher frequency in the other.  With
- * both conditions met, we can be relatively sure that the relationship is
- * non-monogamous, with partner count exceeding socket size.  Waker/wakee
- * being client/server, worker/dispatcher, interrupt source or whatever is
- * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ * at a frequency roughly N times higher than one of its wakees.
+ *
+ * In order to determine whether we should let the load spread vs consolidating
+ * to shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of lls_size higher frequency in the other.
+ *
+ * With both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size.
+ *
+ * Waker/wakee being client/server, worker/dispatcher, interrupt source or
+ * whatever is irrelevant, spread criteria is apparent partner count exceeds
+ * socket size.
  */
 static int wake_wide(struct task_struct *p)
 {
@@ -5281,8 +5284,10 @@ select_task_rq_fair(struct task_struct *
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
-	if (sd_flag & SD_BALANCE_WAKE)
+	if (sd_flag & SD_BALANCE_WAKE) {
+		record_wakee(p);
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+	}
 
 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {


* [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-10 17:43 [PATCH 0/3] sched: Fix wakeup preemption regression Peter Zijlstra
  2016-05-10 17:43 ` [PATCH 1/3] sched,fair: Move record_wakee() Peter Zijlstra
@ 2016-05-10 17:43 ` Peter Zijlstra
  2016-05-10 20:21   ` Ingo Molnar
                     ` (2 more replies)
  2016-05-10 17:43 ` [PATCH 3/3] sched: Kill sched_class::task_waking Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 35+ messages in thread
From: Peter Zijlstra @ 2016-05-10 17:43 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: peterz, Pavan Kondeti, Ben Segall, Matt Fleming,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter, Mike Galbraith

[-- Attachment #1: peterz-sched-fix-local-starvation.patch --]
[-- Type: text/plain, Size: 5203 bytes --]

Mike reported that the recent commit 3a47d5124a95 ("sched/fair: Fix
fairness issue on migration") broke interactivity and the signal
starve test.

The problem is that I assumed ENQUEUE_WAKING was only set when we do a
cross-cpu wakeup (migration), which isn't true. This means we now
destroy the vruntime history of tasks and wakeup-preemption suffers.

Cure this by making my assumption true: only call
sched_class::task_waking() when we do a cross-cpu wakeup. This avoids
the indirect call in the case we do a local wakeup.
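
To see in numbers what "destroy the vruntime history" means, a toy
calculation (editorial sketch, not kernel code; the MIGRATION/WAKEUP comment
block added below spells out the real invariant). The subtract and the add of
min_vruntime must come in pairs:

#include <stdio.h>

typedef unsigned long long u64;

int main(void)
{
	u64 src_min = 1000000, dst_min = 3000000;	/* per-rq min_vruntime */
	u64 vruntime = src_min + 500;			/* sleeper on the source rq */

	/*
	 * Migrated wakeup: subtract the source min_vruntime, add the
	 * destination one; the 500 of history survives the move.
	 */
	u64 migrated = vruntime - src_min + dst_min;	/* 3000500 */

	/*
	 * Local wakeup renormalized anyway (ENQUEUE_WAKING without a
	 * matching subtract): the task eats a whole extra min_vruntime,
	 * lands far in the future, and loses wakeup preemption.
	 */
	u64 local_bug = vruntime + src_min;		/* 2000500, should stay 1000500 */

	printf("migrated %llu, buggy local %llu\n", migrated, local_bug);
	return 0;
}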

Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Cc: Andrew Hunter <ahh@google.com>
Fixes: 3a47d5124a95 ("sched/fair: Fix fairness issue on migration")
Reported-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   29 +++++++++++++++++++++--------
 kernel/sched/fair.c |   41 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 61 insertions(+), 9 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1709,14 +1709,22 @@ static void
 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		 struct pin_cookie cookie)
 {
+	int en_flags = ENQUEUE_WAKEUP;
+
 	lockdep_assert_held(&rq->lock);
 
 #ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
+
+	/*
+	 * If we migrated; we must have called sched_class::task_waking().
+	 */
+	if (wake_flags & WF_MIGRATED)
+		en_flags |= ENQUEUE_WAKING;
 #endif
 
-	ttwu_activate(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
+	ttwu_activate(rq, p, en_flags);
 	ttwu_do_wakeup(rq, p, wake_flags, cookie);
 }
 
@@ -1762,7 +1770,11 @@ void sched_ttwu_pending(void)
 	while (llist) {
 		p = llist_entry(llist, struct task_struct, wake_entry);
 		llist = llist_next(llist);
-		ttwu_do_activate(rq, p, 0, cookie);
+		/*
+		 * See ttwu_queue(); we only call ttwu_queue_remote() when
+		 * its a x-cpu wakeup.
+		 */
+		ttwu_do_activate(rq, p, WF_MIGRATED, cookie);
 	}
 
 	lockdep_unpin_lock(&rq->lock, cookie);
@@ -1849,7 +1861,7 @@ bool cpus_share_cache(int this_cpu, int
 }
 #endif /* CONFIG_SMP */
 
-static void ttwu_queue(struct task_struct *p, int cpu)
+static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 	struct pin_cookie cookie;
@@ -1864,7 +1876,7 @@ static void ttwu_queue(struct task_struc
 
 	raw_spin_lock(&rq->lock);
 	cookie = lockdep_pin_lock(&rq->lock);
-	ttwu_do_activate(rq, p, 0, cookie);
+	ttwu_do_activate(rq, p, wake_flags, cookie);
 	lockdep_unpin_lock(&rq->lock, cookie);
 	raw_spin_unlock(&rq->lock);
 }
@@ -2034,17 +2046,18 @@ try_to_wake_up(struct task_struct *p, un
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
 
-	if (p->sched_class->task_waking)
-		p->sched_class->task_waking(p);
-
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
+
+		if (p->sched_class->task_waking)
+			p->sched_class->task_waking(p);
+
 		set_task_cpu(p, cpu);
 	}
 #endif /* CONFIG_SMP */
 
-	ttwu_queue(p, cpu);
+	ttwu_queue(p, cpu, wake_flags);
 stat:
 	if (schedstat_enabled())
 		ttwu_stat(p, cpu, wake_flags);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3247,6 +3247,37 @@ static inline void check_schedstat_requi
 #endif
 }
 
+
+/*
+ * MIGRATION
+ *
+ *	dequeue
+ *	  update_curr()
+ *	    update_min_vruntime()
+ *	  vruntime -= min_vruntime
+ *
+ *	enqueue
+ *	  update_curr()
+ *	    update_min_vruntime()
+ *	  vruntime += min_vruntime
+ *
+ * this way the vruntime transition between RQs is done when both
+ * min_vruntime are up-to-date.
+ *
+ * WAKEUP (remote)
+ *
+ *	->task_waking_fair()
+ *	  vruntime -= min_vruntime
+ *
+ *	enqueue
+ *	  update_curr()
+ *	    update_min_vruntime()
+ *	  vruntime += min_vruntime
+ *
+ * this way we don't have the most up-to-date min_vruntime on the originating
+ * CPU and an up-to-date min_vruntime on the destination CPU.
+ */
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -3264,7 +3295,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 
 	/*
 	 * Otherwise, renormalise after, such that we're placed at the current
-	 * moment in time, instead of some random moment in the past.
+	 * moment in time, instead of some random moment in the past. Being
+	 * placed in the past could significantly boost this task to the
+	 * fairness detriment of existing tasks.
 	 */
 	if (renorm && !curr)
 		se->vruntime += cfs_rq->min_vruntime;
@@ -4811,6 +4844,12 @@ static unsigned long cpu_avg_load_per_ta
 	return 0;
 }
 
+/*
+ * Called to migrate a waking task; as blocked tasks retain absolute vruntime
+ * the migration needs to deal with this by subtracting the old and adding the
+ * new min_vruntime -- the latter is done by enqueue_entity() when placing
+ * the task on the new runqueue.
+ */
 static void task_waking_fair(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;


* [PATCH 3/3] sched: Kill sched_class::task_waking
  2016-05-10 17:43 [PATCH 0/3] sched: Fix wakeup preemption regression Peter Zijlstra
  2016-05-10 17:43 ` [PATCH 1/3] sched,fair: Move record_wakee() Peter Zijlstra
  2016-05-10 17:43 ` [PATCH 2/3] sched,fair: Fix local starvation Peter Zijlstra
@ 2016-05-10 17:43 ` Peter Zijlstra
  2016-05-11  5:55 ` [PATCH 0/3] sched: Fix wakeup preemption regression Mike Galbraith
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2016-05-10 17:43 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: peterz, Pavan Kondeti, Ben Segall, Matt Fleming, Mike Galbraith,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

[-- Attachment #1: peterz-sched-kill-task_waking.patch --]
[-- Type: text/plain, Size: 5443 bytes --]

With sched_class::task_waking being called only when we do
set_task_cpu(), we can make sched_class::migrate_task_rq() do the work
and eliminate sched_class::task_waking entirely.

Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Cc: Andrew Hunter <ahh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |    9 -------
 kernel/sched/fair.c  |   58 ++++++++++++++++++++++++---------------------------
 kernel/sched/sched.h |    7 ++----
 3 files changed, 32 insertions(+), 42 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1717,11 +1717,8 @@ ttwu_do_activate(struct rq *rq, struct t
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
 
-	/*
-	 * If we migrated; we must have called sched_class::task_waking().
-	 */
 	if (wake_flags & WF_MIGRATED)
-		en_flags |= ENQUEUE_WAKING;
+		en_flags |= ENQUEUE_MIGRATED;
 #endif
 
 	ttwu_activate(rq, p, en_flags);
@@ -2049,10 +2046,6 @@ try_to_wake_up(struct task_struct *p, un
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
-
-		if (p->sched_class->task_waking)
-			p->sched_class->task_waking(p);
-
 		set_task_cpu(p, cpu);
 	}
 #endif /* CONFIG_SMP */
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3282,7 +3282,7 @@ static inline void check_schedstat_requi
  *
  * WAKEUP (remote)
  *
- *	->task_waking_fair()
+ *	->migrate_task_rq_fair() (p->state == TASK_WAKING)
  *	  vruntime -= min_vruntime
  *
  *	enqueue
@@ -3297,7 +3297,7 @@ static inline void check_schedstat_requi
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING);
+	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
 	bool curr = cfs_rq->curr == se;
 
 	/*
@@ -4865,33 +4865,6 @@ static unsigned long cpu_avg_load_per_ta
 	return 0;
 }
 
-/*
- * Called to migrate a waking task; as blocked tasks retain absolute vruntime
- * the migration needs to deal with this by subtracting the old and adding the
- * new min_vruntime -- the latter is done by enqueue_entity() when placing
- * the task on the new runqueue.
- */
-static void task_waking_fair(struct task_struct *p)
-{
-	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 min_vruntime;
-
-#ifndef CONFIG_64BIT
-	u64 min_vruntime_copy;
-
-	do {
-		min_vruntime_copy = cfs_rq->min_vruntime_copy;
-		smp_rmb();
-		min_vruntime = cfs_rq->min_vruntime;
-	} while (min_vruntime != min_vruntime_copy);
-#else
-	min_vruntime = cfs_rq->min_vruntime;
-#endif
-
-	se->vruntime -= min_vruntime;
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * effective_load() calculates the load change as seen from the root_task_group
@@ -5648,6 +5621,32 @@ select_task_rq_fair(struct task_struct *
 static void migrate_task_rq_fair(struct task_struct *p)
 {
 	/*
+	 * As blocked tasks retain absolute vruntime the migration needs to
+	 * deal with this by subtracting the old and adding the new
+	 * min_vruntime -- the latter is done by enqueue_entity() when placing
+	 * the task on the new runqueue.
+	 */
+	if (p->state == TASK_WAKING) {
+		struct sched_entity *se = &p->se;
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		u64 min_vruntime;
+
+#ifndef CONFIG_64BIT
+		u64 min_vruntime_copy;
+
+		do {
+			min_vruntime_copy = cfs_rq->min_vruntime_copy;
+			smp_rmb();
+			min_vruntime = cfs_rq->min_vruntime;
+		} while (min_vruntime != min_vruntime_copy);
+#else
+		min_vruntime = cfs_rq->min_vruntime;
+#endif
+
+		se->vruntime -= min_vruntime;
+	}
+
+	/*
 	 * We are supposed to update the task to "current" time, then its up to date
 	 * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
 	 * what current time is, so simply throw away the out-of-date time. This
@@ -8916,7 +8915,6 @@ const struct sched_class fair_sched_clas
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,
 
-	.task_waking		= task_waking_fair,
 	.task_dead		= task_dead_fair,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1164,7 +1164,7 @@ extern const u32 sched_prio_to_wmult[40]
  *
  * ENQUEUE_HEAD      - place at front of runqueue (tail if not specified)
  * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
- * ENQUEUE_WAKING    - sched_class::task_waking was called
+ * ENQUEUE_MIGRATED  - the task was migrated during wakeup
  *
  */
 
@@ -1179,9 +1179,9 @@ extern const u32 sched_prio_to_wmult[40]
 #define ENQUEUE_HEAD		0x08
 #define ENQUEUE_REPLENISH	0x10
 #ifdef CONFIG_SMP
-#define ENQUEUE_WAKING		0x20
+#define ENQUEUE_MIGRATED	0x20
 #else
-#define ENQUEUE_WAKING		0x00
+#define ENQUEUE_MIGRATED	0x00
 #endif
 
 #define RETRY_TASK		((void *)-1UL)
@@ -1213,7 +1213,6 @@ struct sched_class {
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p);
 
-	void (*task_waking) (struct task_struct *task);
 	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
 
 	void (*set_cpus_allowed)(struct task_struct *p,


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-10 17:43 ` [PATCH 2/3] sched,fair: Fix local starvation Peter Zijlstra
@ 2016-05-10 20:21   ` Ingo Molnar
  2016-05-10 22:23     ` Peter Zijlstra
  2016-05-20 21:24   ` Matt Fleming
  2016-05-21 14:04   ` Mike Galbraith
  2 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2016-05-10 20:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Pavan Kondeti, Ben Segall, Matt Fleming,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter, Mike Galbraith


* Peter Zijlstra <peterz@infradead.org> wrote:

> Mike reported that the recent commit 3a47d5124a95 ("sched/fair: Fix
> fairness issue on migration") broke interactivity and the signal
> starve test.

This looks pretty dangerous for v4.6 - how about simply reverting 3a47d5124a95
and re-trying in v4.7?

Thanks,

	Ingo


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-10 20:21   ` Ingo Molnar
@ 2016-05-10 22:23     ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2016-05-10 22:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Pavan Kondeti, Ben Segall, Matt Fleming,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter, Mike Galbraith

On Tue, May 10, 2016 at 10:21:39PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Mike reported that the recent commit 3a47d5124a95 ("sched/fair: Fix
> > fairness issue on migration") broke interactivity and the signal
> > starve test.
> 
> This looks pretty dangerous for v4.6 - how about simply reverting 3a47d5124a95
> and re-trying in v4.7?

Sure.


* Re: [PATCH 0/3] sched: Fix wakeup preemption regression
  2016-05-10 17:43 [PATCH 0/3] sched: Fix wakeup preemption regression Peter Zijlstra
                   ` (2 preceding siblings ...)
  2016-05-10 17:43 ` [PATCH 3/3] sched: Kill sched_class::task_waking Peter Zijlstra
@ 2016-05-11  5:55 ` Mike Galbraith
  2016-05-12  9:56 ` Pavan Kondeti
  2016-05-12 10:52 ` Matt Fleming
  5 siblings, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2016-05-11  5:55 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, linux-kernel
  Cc: Pavan Kondeti, Ben Segall, Matt Fleming, Morten Rasmussen,
	Paul Turner, Thomas Gleixner, byungchul.park, Andrew Hunter

On Tue, 2016-05-10 at 19:43 +0200, Peter Zijlstra wrote:

> Mike, Pavan, could you guys please confirm?

Plugging the series into master.today, all seems peachy on my end.

	-Mike


* Re: [PATCH 0/3] sched: Fix wakeup preemption regression
  2016-05-10 17:43 [PATCH 0/3] sched: Fix wakeup preemption regression Peter Zijlstra
                   ` (3 preceding siblings ...)
  2016-05-11  5:55 ` [PATCH 0/3] sched: Fix wakeup preemption regression Mike Galbraith
@ 2016-05-12  9:56 ` Pavan Kondeti
  2016-05-12 10:52 ` Matt Fleming
  5 siblings, 0 replies; 35+ messages in thread
From: Pavan Kondeti @ 2016-05-12  9:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Pavan Kondeti, Ben Segall,
	Matt Fleming, Mike Galbraith, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, byungchul.park, Andrew Hunter

Hi Peter,

On Tue, May 10, 2016 at 11:13 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> A recent commit caused an interactivity/starvation issue because we wrecked rq
> local wakeup preemption.
>
> These patches rectify this while also (hopefully) keeping the problem that led
> to the faulty patch fixed.
>
> Mike, Pavan, could you guys please confirm?
>
>

I tested with your latest patches. The original problem (migrated
task's vruntime falling
beyond the min_vruntime) is not reproducible.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


* Re: [PATCH 1/3] sched,fair: Move record_wakee()
  2016-05-10 17:43 ` [PATCH 1/3] sched,fair: Move record_wakee() Peter Zijlstra
@ 2016-05-12 10:27   ` Matt Fleming
  2016-05-12 10:31     ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Matt Fleming @ 2016-05-12 10:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, Pavan Kondeti, Ben Segall, Mike Galbraith,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

On Tue, 10 May, at 07:43:15PM, Peter Zijlstra wrote:
> Since I want to make ->task_woken() conditional on the task getting
> migrated, we cannot use it to call record_wakee().
 
You mean ->task_waking(), right?


* Re: [PATCH 1/3] sched,fair: Move record_wakee()
  2016-05-12 10:27   ` Matt Fleming
@ 2016-05-12 10:31     ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2016-05-12 10:31 UTC (permalink / raw)
  To: Matt Fleming
  Cc: mingo, linux-kernel, Pavan Kondeti, Ben Segall, Mike Galbraith,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

On Thu, May 12, 2016 at 11:27:31AM +0100, Matt Fleming wrote:
> On Tue, 10 May, at 07:43:15PM, Peter Zijlstra wrote:
> > Since I want to make ->task_woken() conditional on the task getting
> > migrated, we cannot use it to call record_wakee().
>  
> You mean ->task_waking(), right?

Uh yes :-) typing be hard.


* Re: [PATCH 0/3] sched: Fix wakeup preemption regression
  2016-05-10 17:43 [PATCH 0/3] sched: Fix wakeup preemption regression Peter Zijlstra
                   ` (4 preceding siblings ...)
  2016-05-12  9:56 ` Pavan Kondeti
@ 2016-05-12 10:52 ` Matt Fleming
  5 siblings, 0 replies; 35+ messages in thread
From: Matt Fleming @ 2016-05-12 10:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, Pavan Kondeti, Ben Segall, Mike Galbraith,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

On Tue, 10 May, at 07:43:14PM, Peter Zijlstra wrote:
> A recent commit caused an interactivity/starvation issue because we wrecked rq
> local wakeup preemption.
> 
> These patches rectify this while also (hopefully) keeping the problem that led
> to the faulty patch fixed.
> 
> Mike, Pavan, could you guys please confirm?
 
FWIW, I took a quick look over these patches and they made sense to
me. (I appreciate the comment block above enqueue_entity())


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-10 17:43 ` [PATCH 2/3] sched,fair: Fix local starvation Peter Zijlstra
  2016-05-10 20:21   ` Ingo Molnar
@ 2016-05-20 21:24   ` Matt Fleming
  2016-05-21 14:04   ` Mike Galbraith
  2 siblings, 0 replies; 35+ messages in thread
From: Matt Fleming @ 2016-05-20 21:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, Pavan Kondeti, Ben Segall, Morten Rasmussen,
	Paul Turner, Thomas Gleixner, byungchul.park, Andrew Hunter,
	Mike Galbraith, Mel Gorman

On Tue, 10 May, at 07:43:16PM, Peter Zijlstra wrote:
> Mike reported that the recent commit 3a47d5124a95 ("sched/fair: Fix
> fairness issue on migration") broke interactivity and the signal
> starve test.
> 
> The problem is that I assumed ENQUEUE_WAKING was only set when we do a
> cross-cpu wakeup (migration), which isn't true. This means we now
> destroy the vruntime history of tasks and wakeup-preemption suffers.
> 
> Cure this by making my assumption true: only call
> sched_class::task_waking() when we do a cross-cpu wakeup. This avoids
> the indirect call in the case we do a local wakeup.
> 
> Cc: Pavan Kondeti <pkondeti@codeaurora.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Matt Fleming <matt@codeblueprint.co.uk>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: byungchul.park@lge.com
> Cc: Andrew Hunter <ahh@google.com>
> Fixes: 3a47d5124a95 ("sched/fair: Fix fairness issue on migration")
> Reported-by: Mike Galbraith <mgalbraith@suse.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c |   29 +++++++++++++++++++++--------
>  kernel/sched/fair.c |   41 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 61 insertions(+), 9 deletions(-)
 
This patch appears to cause a regression for hackbench -pipe of
between ~8% and ~10% with groups >= NR_CPU.

I haven't probed much yet, but it looks like the vruntime of tasks has
gone nuts.


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-10 17:43 ` [PATCH 2/3] sched,fair: Fix local starvation Peter Zijlstra
  2016-05-10 20:21   ` Ingo Molnar
  2016-05-20 21:24   ` Matt Fleming
@ 2016-05-21 14:04   ` Mike Galbraith
  2016-05-21 19:00     ` Mike Galbraith
  2016-05-22  6:50     ` [PATCH 2/3] sched,fair: Fix local starvation Wanpeng Li
  2 siblings, 2 replies; 35+ messages in thread
From: Mike Galbraith @ 2016-05-21 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, linux-kernel
  Cc: Pavan Kondeti, Ben Segall, Matt Fleming, Morten Rasmussen,
	Paul Turner, Thomas Gleixner, byungchul.park, Andrew Hunter

On Tue, 2016-05-10 at 19:43 +0200, Peter Zijlstra wrote:

(Evolution authors must either not do patch review, or use some other
mailer.  Squint hard, this crud really is your patch;)

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
>  
> @@ -1762,7 +1770,11 @@ void sched_ttwu_pending(void)
>  > 	> while (llist) {
>  > 	> 	> p = llist_entry(llist, struct task_struct, wake_entry);
>  > 	> 	> llist = llist_next(llist);
> -> 	> 	> ttwu_do_activate(rq, p, 0, cookie);
> +> 	> 	> /*
> +> 	> 	>  * See ttwu_queue(); we only call ttwu_queue_remote() when
> +> 	> 	>  * its a x-cpu wakeup.
> +> 	> 	>  */
> +> 	> 	> ttwu_do_activate(rq, p, WF_MIGRATED, cookie);

Wakees that were not migrated/normalized eat an unwanted min_vruntime,
and likely take a size XXL latency hit.  Big box running master bled
profusely under heavy load until I turned TTWU_QUEUE off.

	-Mike


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-21 14:04   ` Mike Galbraith
@ 2016-05-21 19:00     ` Mike Galbraith
  2016-05-22  7:00       ` [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity Mike Galbraith
  2016-05-22  6:50     ` [PATCH 2/3] sched,fair: Fix local starvation Wanpeng Li
  1 sibling, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2016-05-21 19:00 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, linux-kernel
  Cc: Pavan Kondeti, Ben Segall, Matt Fleming, Morten Rasmussen,
	Paul Turner, Thomas Gleixner, byungchul.park, Andrew Hunter

On Sat, 2016-05-21 at 16:04 +0200, Mike Galbraith wrote:

> Wakees that were not migrated/normalized eat an unwanted min_vruntime,
> and likely take a size XXL latency hit.  Big box running master bled
> profusely under heavy load until I turned TTWU_QUEUE off.

The below made big box a happy camper again.

sched/fair: Move se->vruntime normalization state into struct sched_entity

Make ->vruntime normalization state explicit.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    1 +
 kernel/sched/fair.c   |   42 ++++++++++++++----------------------------
 3 files changed, 16 insertions(+), 28 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1348,6 +1348,7 @@ struct sched_entity {
 	struct rb_node		run_node;
 	struct list_head	group_node;
 	unsigned int		on_rq;
+	bool			normalized;
 
 	u64			exec_start;
 	u64			sum_exec_runtime;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2259,6 +2259,7 @@ static void __sched_fork(unsigned long c
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.normalized		= true;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3305,14 +3305,13 @@ static inline void check_schedstat_requi
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
 	bool curr = cfs_rq->curr == se;
 
 	/*
 	 * If we're the current task, we must renormalise before calling
 	 * update_curr().
 	 */
-	if (renorm && curr)
+	if (se->normalized && curr)
 		se->vruntime += cfs_rq->min_vruntime;
 
 	update_curr(cfs_rq);
@@ -3323,9 +3322,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * placed in the past could significantly boost this task to the
 	 * fairness detriment of existing tasks.
 	 */
-	if (renorm && !curr)
+	if (se->normalized && !curr)
 		se->vruntime += cfs_rq->min_vruntime;
 
+	se->normalized = false;
+
 	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
@@ -3422,8 +3423,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	 * update can refer to the ->curr item and we need to reflect this
 	 * movement in our normalized position.
 	 */
-	if (!(flags & DEQUEUE_SLEEP))
+	if (!(flags & DEQUEUE_SLEEP)) {
 		se->vruntime -= cfs_rq->min_vruntime;
+		se->normalized = true;
+	}
 
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
@@ -5681,6 +5684,7 @@ static void migrate_task_rq_fair(struct
 #endif
 
 		se->vruntime -= min_vruntime;
+		se->normalized = true;
 	}
 
 	/*
@@ -8591,6 +8595,7 @@ static void task_fork_fair(struct task_s
 	}
 
 	se->vruntime -= cfs_rq->min_vruntime;
+	se->normalized = true;
 
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
@@ -8619,29 +8624,7 @@ prio_changed_fair(struct rq *rq, struct
 
 static inline bool vruntime_normalized(struct task_struct *p)
 {
-	struct sched_entity *se = &p->se;
-
-	/*
-	 * In both the TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING cases,
-	 * the dequeue_entity(.flags=0) will already have normalized the
-	 * vruntime.
-	 */
-	if (p->on_rq)
-		return true;
-
-	/*
-	 * When !on_rq, vruntime of the task has usually NOT been normalized.
-	 * But there are some cases where it has already been normalized:
-	 *
-	 * - A forked child which is waiting for being woken up by
-	 *   wake_up_new_task().
-	 * - A task which has been woken up by try_to_wake_up() and
-	 *   waiting for actually being woken up by sched_ttwu_pending().
-	 */
-	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
-		return true;
-
-	return false;
+	return p->se.normalized;
 }
 
 static void detach_task_cfs_rq(struct task_struct *p)
@@ -8656,6 +8639,7 @@ static void detach_task_cfs_rq(struct ta
 		 */
 		place_entity(cfs_rq, se, 0);
 		se->vruntime -= cfs_rq->min_vruntime;
+		se->normalized = true;
 	}
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
@@ -8678,8 +8662,10 @@ static void attach_task_cfs_rq(struct ta
 	/* Synchronize task with its cfs_rq */
 	attach_entity_load_avg(cfs_rq, se);
 
-	if (!vruntime_normalized(p))
+	if (vruntime_normalized(p)) {
 		se->vruntime += cfs_rq->min_vruntime;
+		se->normalized = false;
+	}
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-21 14:04   ` Mike Galbraith
  2016-05-21 19:00     ` Mike Galbraith
@ 2016-05-22  6:50     ` Wanpeng Li
  2016-05-22  7:15       ` Mike Galbraith
  1 sibling, 1 reply; 35+ messages in thread
From: Wanpeng Li @ 2016-05-22  6:50 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pavan Kondeti,
	Ben Segall, Matt Fleming, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, Byungchul Park, Andrew Hunter

2016-05-21 22:04 GMT+08:00 Mike Galbraith <mgalbraith@suse.de>:
> On Tue, 2016-05-10 at 19:43 +0200, Peter Zijlstra wrote:
>
> (Evolution authors must either not do patch review, or use some other
> mailer.  Squint hard, this crud really is your patch;)
>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>>
>> @@ -1762,7 +1770,11 @@ void sched_ttwu_pending(void)
>>  >    > while (llist) {
>>  >    >       > p = llist_entry(llist, struct task_struct, wake_entry);
>>  >    >       > llist = llist_next(llist);
>> ->    >       > ttwu_do_activate(rq, p, 0, cookie);
>> +>    >       > /*
>> +>    >       >  * See ttwu_queue(); we only call ttwu_queue_remote() when
>> +>    >       >  * its a x-cpu wakeup.
>> +>    >       >  */
>> +>    >       > ttwu_do_activate(rq, p, WF_MIGRATED, cookie);
>
> Wakees that were not migrated/normalized eat an unwanted min_vruntime,

Why were there wakees queued by ttwu_queue_remote() that were not migrated?

Regards,
Wanpeng Li


* [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity
  2016-05-21 19:00     ` Mike Galbraith
@ 2016-05-22  7:00       ` Mike Galbraith
  2016-05-22  9:36         ` Peter Zijlstra
  2016-05-23  9:19         ` Peter Zijlstra
  0 siblings, 2 replies; 35+ messages in thread
From: Mike Galbraith @ 2016-05-22  7:00 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, linux-kernel
  Cc: Pavan Kondeti, Ben Segall, Matt Fleming, Morten Rasmussen,
	Paul Turner, Thomas Gleixner, byungchul.park, Andrew Hunter

On Sat, 2016-05-21 at 21:00 +0200, Mike Galbraith wrote:
> On Sat, 2016-05-21 at 16:04 +0200, Mike Galbraith wrote:
> 
> > Wakees that were not migrated/normalized eat an unwanted min_vruntime,
> > and likely take a size XXL latency hit.  Big box running master bled
> > profusely under heavy load until I turned TTWU_QUEUE off.

May as well make it official and against master.today.  Fly or die
little patchlet.

sched/fair: Move se->vruntime normalization state into struct sched_entity

b5179ac70de ceased globally normalizing wakee vruntime in ttwu(), leaving
sched_ttwu_pending() with the need to know whether each wakee on wake_list
was migrated or not, to pass that on to fair class functions so they can
DTRT wrt vruntime normalization.  Store vruntime normalization state in
struct sched_entity, so fair class functions that need it always have it,
and sched_ttwu_pending() again doesn't need to care whether tasks on the
wake_list have been migrated or not.

Since there are now no consumers of ENQUEUE_MIGRATED, drop it as well. 
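
The point of the flag is that normalization becomes a checkable state
transition instead of something inferred from enqueue flags. A toy
illustration of that property (editorial sketch; names are illustrative,
not taken from the patch):

#include <assert.h>
#include <stdio.h>

struct entity {
	unsigned long long vruntime;
	int normalized;
};

static void normalize(struct entity *se, unsigned long long min_vruntime)
{
	assert(!se->normalized);	/* may only subtract once */
	se->vruntime -= min_vruntime;
	se->normalized = 1;
}

static void renormalize(struct entity *se, unsigned long long min_vruntime)
{
	assert(se->normalized);		/* may only add once */
	se->vruntime += min_vruntime;
	se->normalized = 0;
}

int main(void)
{
	struct entity se = { .vruntime = 1000500, .normalized = 0 };

	normalize(&se, 1000000);	/* leave the source runqueue */
	renormalize(&se, 3000000);	/* join the destination runqueue */
	printf("vruntime = %llu\n", se.vruntime);

	/* A second renormalize(&se, 3000000) would trip the assert; an
	 * unpaired add via sched_ttwu_pending() is exactly the bug this
	 * thread chased. */
	return 0;
}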

master v4.6-8889-gf6c658df6385 virgin
 256     49096  71698.99 MB/sec  warmup   1 sec  latency 1136.488 ms
 256    155009  72862.08 MB/sec  execute   1 sec  latency 3136.900 ms
 256    207430  72628.04 MB/sec  execute   2 sec  latency 4137.001 ms
 256    259635  72442.97 MB/sec  execute   3 sec  latency 5137.105 ms
 256    311905  72371.84 MB/sec  execute   4 sec  latency 6137.214 ms
 256    364210  72564.99 MB/sec  execute   5 sec  latency 7137.323 ms
 256    416551  72598.74 MB/sec  execute   6 sec  latency 5816.895 ms
 256    468824  72601.54 MB/sec  execute   7 sec  latency 6815.386 ms
 256    520996  72621.87 MB/sec  execute   8 sec  latency 7815.499 ms
 256    573113  72608.75 MB/sec  execute   9 sec  latency 8815.609 ms
 256  cleanup  10 sec
   0  cleanup  10 sec

master v4.6-8889-gf6c658df6385 post
 256     51527  75357.55 MB/sec  warmup   1 sec  latency 21.591 ms
 256    157610  73188.06 MB/sec  execute   1 sec  latency 12.985 ms
 256    210089  72809.01 MB/sec  execute   2 sec  latency 11.543 ms
 256    262554  72681.86 MB/sec  execute   3 sec  latency 0.209 ms
 256    315432  72798.65 MB/sec  execute   4 sec  latency 0.206 ms
 256    368162  72963.33 MB/sec  execute   5 sec  latency 8.052 ms
 256    420854  72976.50 MB/sec  execute   6 sec  latency 0.221 ms
 256    473420  72953.76 MB/sec  execute   7 sec  latency 0.198 ms
 256    525859  73011.17 MB/sec  execute   8 sec  latency 2.810 ms
 256    578301  73052.84 MB/sec  execute   9 sec  latency 0.247 ms
 256  cleanup  10 sec
   0  cleanup  10 sec


Fixes: b5179ac70de ("sched/fair: Prepare to fix fairness problems on migration")
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 include/linux/sched.h |    1 
 kernel/sched/core.c   |    6 +----
 kernel/sched/fair.c   |   60 ++++++++++++++++++++------------------------------
 3 files changed, 28 insertions(+), 39 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1319,6 +1319,7 @@ struct sched_entity {
 	struct rb_node		run_node;
 	struct list_head	group_node;
 	unsigned int		on_rq;
+	bool			normalized;
 
 	u64			exec_start;
 	u64			sum_exec_runtime;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1719,9 +1719,6 @@ ttwu_do_activate(struct rq *rq, struct t
 #ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
-
-	if (wake_flags & WF_MIGRATED)
-		en_flags |= ENQUEUE_MIGRATED;
 #endif
 
 	ttwu_activate(rq, p, en_flags);
@@ -1774,7 +1771,7 @@ void sched_ttwu_pending(void)
 		 * See ttwu_queue(); we only call ttwu_queue_remote() when
 		 * its a x-cpu wakeup.
 		 */
-		ttwu_do_activate(rq, p, WF_MIGRATED, cookie);
+		ttwu_do_activate(rq, p, 0, cookie);
 	}
 
 	lockdep_unpin_lock(&rq->lock, cookie);
@@ -2166,6 +2163,7 @@ static void __sched_fork(unsigned long c
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.normalized		= true;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3230,6 +3230,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 
 	/* ensure we never gain time by being placed backwards. */
 	se->vruntime = max_vruntime(se->vruntime, vruntime);
+	se->normalized = false;
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -3285,29 +3286,40 @@ static inline void check_schedstat_requi
  * CPU and an up-to-date min_vruntime on the destination CPU.
  */
 
+static void normalize_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	se->vruntime -= cfs_rq->min_vruntime;
+	se->normalized = true;
+}
+
+static void renormalize_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	se->vruntime += cfs_rq->min_vruntime;
+	se->normalized = false;
+}
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
-	bool curr = cfs_rq->curr == se;
+	bool renorm = se->normalized, curr = cfs_rq->curr == se;
 
 	/*
-	 * If we're the current task, we must renormalise before calling
+	 * If we're the current task, we must renormalize before calling
 	 * update_curr().
 	 */
 	if (renorm && curr)
-		se->vruntime += cfs_rq->min_vruntime;
+		renormalize_entity(cfs_rq, se);
 
 	update_curr(cfs_rq);
 
 	/*
-	 * Otherwise, renormalise after, such that we're placed at the current
+	 * Otherwise, renormalize after, such that we're placed at the current
 	 * moment in time, instead of some random moment in the past. Being
 	 * placed in the past could significantly boost this task to the
 	 * fairness detriment of existing tasks.
 	 */
 	if (renorm && !curr)
-		se->vruntime += cfs_rq->min_vruntime;
+		renormalize_entity(cfs_rq, se);
 
 	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
@@ -3406,7 +3418,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	 * movement in our normalized position.
 	 */
 	if (!(flags & DEQUEUE_SLEEP))
-		se->vruntime -= cfs_rq->min_vruntime;
+		normalize_entity(cfs_rq, se);
 
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
@@ -5408,7 +5420,7 @@ static void migrate_task_rq_fair(struct
 		min_vruntime = cfs_rq->min_vruntime;
 #endif
 
-		se->vruntime -= min_vruntime;
+		normalize_entity(cfs_rq, se);
 	}
 
 	/*
@@ -8319,7 +8331,7 @@ static void task_fork_fair(struct task_s
 		resched_curr(rq);
 	}
 
-	se->vruntime -= cfs_rq->min_vruntime;
+	normalize_entity(cfs_rq, se);
 
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
@@ -8348,29 +8360,7 @@ prio_changed_fair(struct rq *rq, struct
 
 static inline bool vruntime_normalized(struct task_struct *p)
 {
-	struct sched_entity *se = &p->se;
-
-	/*
-	 * In both the TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING cases,
-	 * the dequeue_entity(.flags=0) will already have normalized the
-	 * vruntime.
-	 */
-	if (p->on_rq)
-		return true;
-
-	/*
-	 * When !on_rq, vruntime of the task has usually NOT been normalized.
-	 * But there are some cases where it has already been normalized:
-	 *
-	 * - A forked child which is waiting for being woken up by
-	 *   wake_up_new_task().
-	 * - A task which has been woken up by try_to_wake_up() and
-	 *   waiting for actually being woken up by sched_ttwu_pending().
-	 */
-	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
-		return true;
-
-	return false;
+	return p->se.normalized;
 }
 
 static void detach_task_cfs_rq(struct task_struct *p)
@@ -8384,7 +8374,7 @@ static void detach_task_cfs_rq(struct ta
 		 * cause 'unlimited' sleep bonus.
 		 */
 		place_entity(cfs_rq, se, 0);
-		se->vruntime -= cfs_rq->min_vruntime;
+		normalize_entity(cfs_rq, se);
 	}
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
@@ -8407,8 +8397,8 @@ static void attach_task_cfs_rq(struct ta
 	/* Synchronize task with its cfs_rq */
 	attach_entity_load_avg(cfs_rq, se);
 
-	if (!vruntime_normalized(p))
-		se->vruntime += cfs_rq->min_vruntime;
+	if (vruntime_normalized(p))
+		renormalize_entity(cfs_rq, se);
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-22  6:50     ` [PATCH 2/3] sched,fair: Fix local starvation Wanpeng Li
@ 2016-05-22  7:15       ` Mike Galbraith
  2016-05-22  7:27         ` Wanpeng Li
  0 siblings, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2016-05-22  7:15 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pavan Kondeti,
	Ben Segall, Matt Fleming, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, Byungchul Park, Andrew Hunter

On Sun, 2016-05-22 at 14:50 +0800, Wanpeng Li wrote:
> 2016-05-21 22:04 GMT+08:00 Mike Galbraith <mgalbraith@suse.de>:
> > On Tue, 2016-05-10 at 19:43 +0200, Peter Zijlstra wrote:
> > 
> > (Evolution authors must either not do patch review, or use some other
> > mailer.  Squint hard, this crud really is your patch;)
> > 
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > 
> > > @@ -1762,7 +1770,11 @@ void sched_ttwu_pending(void)
> > >  >    > while (llist) {
> > >  >    >       > p = llist_entry(llist, struct task_struct, wake_entry);
> > >  >    >       > llist = llist_next(llist);
> > > ->    >       > ttwu_do_activate(rq, p, 0, cookie);
> > > +>    >       > /*
> > > +>    >       >  * See ttwu_queue(); we only call ttwu_queue_remote() when
> > > +>    >       >  * its a x-cpu wakeup.
> > > +>    >       >  */
> > > +>    >       > ttwu_do_activate(rq, p, WF_MIGRATED, cookie);
> > 
> > Wakees that were not migrated/normalized eat an unwanted min_vruntime,
> 
> Why were there wakees queued by ttwu_queue_remote() that were not migrated?

Queuing to a remote cache domain implies x-cpu wakeup, but does not
imply migration.
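
A toy model of that disconnect (editorial sketch; the topology helper below
is illustrative, not the kernel's cpus_share_cache()): the remote-queue
choice is a function of cache topology between waker and target CPU, while
WF_MIGRATED is a function of old vs new CPU, and nothing forces them to agree:

#include <stdio.h>
#include <stdbool.h>

#define CPUS_PER_SOCKET 4

static bool cpus_share_cache(int a, int b)
{
	return a / CPUS_PER_SOCKET == b / CPUS_PER_SOCKET;
}

int main(void)
{
	int waker_cpu = 0;
	int prev_cpu = 5;	/* wakee last ran on the other socket */
	int target_cpu = 5;	/* select_task_rq() kept it there */

	bool remote_queue = !cpus_share_cache(waker_cpu, target_cpu);
	bool migrated = prev_cpu != target_cpu;

	/* Prints "remote queue: 1, migrated: 0": a wake_list entry whose
	 * vruntime was never normalized, the case that bled under load. */
	printf("remote queue: %d, migrated: %d\n", remote_queue, migrated);
	return 0;
}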

	-Mike


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-22  7:15       ` Mike Galbraith
@ 2016-05-22  7:27         ` Wanpeng Li
  2016-05-22  7:32           ` Mike Galbraith
  0 siblings, 1 reply; 35+ messages in thread
From: Wanpeng Li @ 2016-05-22  7:27 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pavan Kondeti,
	Ben Segall, Matt Fleming, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, Byungchul Park, Andrew Hunter

2016-05-22 15:15 GMT+08:00 Mike Galbraith <mgalbraith@suse.de>:
> On Sun, 2016-05-22 at 14:50 +0800, Wanpeng Li wrote:
>> 2016-05-21 22:04 GMT+08:00 Mike Galbraith <mgalbraith@suse.de>:
>> > On Tue, 2016-05-10 at 19:43 +0200, Peter Zijlstra wrote:
>> >
>> > (Evolution authors must either not do patch review, or use some other
>> > mailer.  Squint hard, this crud really is your patch;)
>> >
>> > > --- a/kernel/sched/core.c
>> > > +++ b/kernel/sched/core.c
>> > >
>> > > @@ -1762,7 +1770,11 @@ void sched_ttwu_pending(void)
>> > >  >    > while (llist) {
>> > >  >    >       > p = llist_entry(llist, struct task_struct, wake_entry);
>> > >  >    >       > llist = llist_next(llist);
>> > > ->    >       > ttwu_do_activate(rq, p, 0, cookie);
>> > > +>    >       > /*
>> > > +>    >       >  * See ttwu_queue(); we only call ttwu_queue_remote() when
>> > > +>    >       >  * its a x-cpu wakeup.
>> > > +>    >       >  */
>> > > +>    >       > ttwu_do_activate(rq, p, WF_MIGRATED, cookie);
>> >
>> > Wakees that were not migrated/normalized eat an unwanted min_vruntime,
>>
>> Why were there wakees queued by ttwu_queue_remote() that were not migrated?
>
> Queuing to a remote cache domain implies x-cpu wakeup, but does not
> imply migration.

What's the meaning of 'x-cpu wakeup'? ;-)

Regards,
Wanpeng Li


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-22  7:27         ` Wanpeng Li
@ 2016-05-22  7:32           ` Mike Galbraith
  2016-05-22  7:42             ` Wanpeng Li
  0 siblings, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2016-05-22  7:32 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pavan Kondeti,
	Ben Segall, Matt Fleming, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, Byungchul Park, Andrew Hunter

On Sun, 2016-05-22 at 15:27 +0800, Wanpeng Li wrote:

> What's the meaning of 'x-cpu wakeup'? ;-)

Generally, cross CPU, as in waker/wakee reside on different CPUs, but
in this case, it's cross socket wakeup.

	-Mike


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-22  7:32           ` Mike Galbraith
@ 2016-05-22  7:42             ` Wanpeng Li
  2016-05-22  8:04               ` Mike Galbraith
  0 siblings, 1 reply; 35+ messages in thread
From: Wanpeng Li @ 2016-05-22  7:42 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pavan Kondeti,
	Ben Segall, Matt Fleming, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, Byungchul Park, Andrew Hunter

2016-05-22 15:32 GMT+08:00 Mike Galbraith <mgalbraith@suse.de>:
> On Sun, 2016-05-22 at 15:27 +0800, Wanpeng Li wrote:
>
>> What's the meaning of 'x-cpu wakeup'? ;-)
>
> Generally, cross CPU, as in waker/wakee reside on different CPUs, but
> in this case, it's cross socket wakeup.

Do you mean that waking wakees on a remote socket doesn't imply
migration/normalization? Why?

Regards,
Wanpeng Li


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-22  7:42             ` Wanpeng Li
@ 2016-05-22  8:04               ` Mike Galbraith
  2016-05-22  8:24                 ` Wanpeng Li
  0 siblings, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2016-05-22  8:04 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pavan Kondeti,
	Ben Segall, Matt Fleming, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, Byungchul Park, Andrew Hunter

On Sun, 2016-05-22 at 15:42 +0800, Wanpeng Li wrote:
> 2016-05-22 15:32 GMT+08:00 Mike Galbraith <mgalbraith@suse.de>:
> > On Sun, 2016-05-22 at 15:27 +0800, Wanpeng Li wrote:
> > 
> > > What's the meaning of 'x-cpu wakeup'? ;-)
> > 
> > Generally, cross CPU, as in waker/wakee reside on different CPUs,
> > but
> > in this case, it's cross socket wakeup.
> 
> Do you mean that waking wakees on a remote socket doesn't imply
> migration/normalization? Why?

Because the wakee is NOT necessarily migrated simply because it lives
in some remote cache domain.  It was a simple but nasty booboo.

ttwu():
        cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
        if (task_cpu(p) != cpu) {
                wake_flags |= WF_MIGRATED;
                set_task_cpu(p, cpu);

set_task_cpu():
        if (task_cpu(p) != new_cpu) {
                if (p->sched_class->migrate_task_rq)
                        p->sched_class->migrate_task_rq(p);
                                        ^^^^^^^^^^^^^^^^^^

migrate_task_rq_fair() normalizes wakee, those wakees that did not
migrate have NOT been normalized, leaving two flavors of wakee on the
wake_list, with no discriminator.  Store class information internally,
and the x-socket information disconnect evaporates.

	-Mike


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-22  8:04               ` Mike Galbraith
@ 2016-05-22  8:24                 ` Wanpeng Li
  2016-05-22  8:39                   ` Mike Galbraith
  0 siblings, 1 reply; 35+ messages in thread
From: Wanpeng Li @ 2016-05-22  8:24 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pavan Kondeti,
	Ben Segall, Matt Fleming, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, Byungchul Park, Andrew Hunter

2016-05-22 16:04 GMT+08:00 Mike Galbraith <mgalbraith@suse.de>:
> On Sun, 2016-05-22 at 15:42 +0800, Wanpeng Li wrote:
>> 2016-05-22 15:32 GMT+08:00 Mike Galbraith <mgalbraith@suse.de>:
>> > On Sun, 2016-05-22 at 15:27 +0800, Wanpeng Li wrote:
>> >
>> > > What's the meaning of 'x-cpu wakeup'? ;-)
>> >
>> > Generally, cross CPU, as in waker/wakee reside on different CPUs,
>> > but
>> > in this case, it's cross socket wakeup.
>>
>> Do you mean that waking wakees on a remote socket doesn't imply
>> migration/normalization? Why?
>
> Because the wakee is NOT necessarily migrated simply because it lives
> in some remote cache domain.  It was a simple but nasty booboo.
>
> ttwu():
>         cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>         if (task_cpu(p) != cpu) {
>                 wake_flags |= WF_MIGRATED;
>                 set_task_cpu(p, cpu);
>
> set_task_cpu():
>         if (task_cpu(p) != new_cpu) {
>                 if (p->sched_class->migrate_task_rq)
>                         p->sched_class->migrate_task_rq(p);
>                                         ^^^^^^^^^^^^^^^^^^
>
> migrate_task_rq_fair() normalizes wakee, those wakees that did not
> migrate have NOT been normalized, leaving two flavors of wakee on the
> wake_list, with no discriminator.  Store class information internally,
> and the x-socket information disconnect evaporates.

Is it all wakees living on some remote cache domain that will not migrate, or
just the ones which can't find a suitable cpu in the current local
socket (i.e. fail in select_task_rq())?

Regards,
Wanpeng Li


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-22  8:24                 ` Wanpeng Li
@ 2016-05-22  8:39                   ` Mike Galbraith
  2016-05-22  8:50                     ` Wanpeng Li
  0 siblings, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2016-05-22  8:39 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pavan Kondeti,
	Ben Segall, Matt Fleming, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, Byungchul Park, Andrew Hunter

On Sun, 2016-05-22 at 16:24 +0800, Wanpeng Li wrote:

>> Is it all wakees living on some remote cache domain that will not migrate, or
>> just the ones which can't find a suitable cpu in the current local
>> socket (i.e. fail in select_task_rq())?

??  The picture I drew already answered that.

	-Mike


* Re: [PATCH 2/3] sched,fair: Fix local starvation
  2016-05-22  8:39                   ` Mike Galbraith
@ 2016-05-22  8:50                     ` Wanpeng Li
  0 siblings, 0 replies; 35+ messages in thread
From: Wanpeng Li @ 2016-05-22  8:50 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pavan Kondeti,
	Ben Segall, Matt Fleming, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, Byungchul Park, Andrew Hunter

2016-05-22 16:39 GMT+08:00 Mike Galbraith <mgalbraith@suse.de>:
> On Sun, 2016-05-22 at 16:24 +0800, Wanpeng Li wrote:
>
>> Is it all wakees living on some remote cache domain that will not migrate, or
>> just the ones which can't find a suitable cpu in the current local
>> socket (i.e. fail in select_task_rq())?
>
> ??  The picture I drew already answered that.

I see, thanks for your great explanation. :-)

Regards,
Wanpeng Li


* Re: [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity
  2016-05-22  7:00       ` [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity Mike Galbraith
@ 2016-05-22  9:36         ` Peter Zijlstra
  2016-05-22  9:52           ` Mike Galbraith
  2016-05-22 10:33           ` Peter Zijlstra
  2016-05-23  9:19         ` Peter Zijlstra
  1 sibling, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2016-05-22  9:36 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, linux-kernel, Pavan Kondeti, Ben Segall, Matt Fleming,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

On Sun, May 22, 2016 at 09:00:01AM +0200, Mike Galbraith wrote:
> On Sat, 2016-05-21 at 21:00 +0200, Mike Galbraith wrote:
> > On Sat, 2016-05-21 at 16:04 +0200, Mike Galbraith wrote:
> > 
> > > Wakees that were not migrated/normalized eat an unwanted min_vruntime,
> > > and likely take a size XXL latency hit.  Big box running master bled
> > > profusely under heavy load until I turned TTWU_QUEUE off.
> 
> May as well make it official and against master.today.  Fly or die
> little patchlet.
> 
> sched/fair: Move se->vruntime normalization state into struct sched_entity

Yeah, I used to have a patch like this; but for debugging. I don't
particularly like carrying this information other than for verification
because it means we either do too much or too little normalization.

I'll try and have a look on Monday, but I got some real-life things to
sort out first..


* Re: [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity
  2016-05-22  9:36         ` Peter Zijlstra
@ 2016-05-22  9:52           ` Mike Galbraith
  2016-05-22 10:33           ` Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2016-05-22  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, Pavan Kondeti, Ben Segall, Matt Fleming,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

On Sun, 2016-05-22 at 11:36 +0200, Peter Zijlstra wrote:
> On Sun, May 22, 2016 at 09:00:01AM +0200, Mike Galbraith wrote:
> > On Sat, 2016-05-21 at 21:00 +0200, Mike Galbraith wrote:
> > > On Sat, 2016-05-21 at 16:04 +0200, Mike Galbraith wrote:
> > > 
> > > > Wakees that were not migrated/normalized eat an unwanted min_vruntime,
> > > > and likely take a size XXL latency hit.  Big box running master bled
> > > > profusely under heavy load until I turned TTWU_QUEUE off.
> > 
> > May as well make it official and against master.today.  Fly or die
> > little patchlet.
> > 
> > sched/fair: Move se->vruntime normalization state into struct sched_entity
> 
> Yeah, I used to have a patch like this; but for debugging. I don't
> particularly like carrying this information other than for verification
> because it means we either do too much or too little normalization.
> 
> I'll try and have a look on Monday, but I got some real-life things to
> sort out first..

Ok (flush, say hi to the goldfish little patchlet). I don't care how it
gets fixed, only that it does, and yours will likely be prettier :)

	-Mike

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity
  2016-05-22  9:36         ` Peter Zijlstra
  2016-05-22  9:52           ` Mike Galbraith
@ 2016-05-22 10:33           ` Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2016-05-22 10:33 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, linux-kernel, Pavan Kondeti, Ben Segall, Matt Fleming,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

On Sun, May 22, 2016 at 11:36:38AM +0200, Peter Zijlstra wrote:
> > 
> > sched/fair: Move se->vruntime normalization state into struct sched_entity
> 
> Yeah, I used to have a patch like this; but for debugging. I don't
> particularly like carrying this information other than for verification
> because it means we either do too much or too little normalization.
> 
> I'll try and have a look on Monday, but I got some real-life things to
> sort out first..

Ah, I think I see why we might actually need a variable this time...
Brain seems to have worked overtime while pulling weeds in the garden
:-)

In any case, I'll have a look on Monday.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity
  2016-05-22  7:00       ` [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity Mike Galbraith
  2016-05-22  9:36         ` Peter Zijlstra
@ 2016-05-23  9:19         ` Peter Zijlstra
  2016-05-23  9:40           ` Mike Galbraith
  2016-05-25  7:12           ` [tip:sched/urgent] sched/core: Fix remote wakeups tip-bot for Peter Zijlstra
  1 sibling, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2016-05-23  9:19 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, linux-kernel, Pavan Kondeti, Ben Segall, Matt Fleming,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

On Sun, May 22, 2016 at 09:00:01AM +0200, Mike Galbraith wrote:
> On Sat, 2016-05-21 at 21:00 +0200, Mike Galbraith wrote:
> > On Sat, 2016-05-21 at 16:04 +0200, Mike Galbraith wrote:
> > 
> > > Wakees that were not migrated/normalized eat an unwanted min_vruntime,
> > > and likely take a size XXL latency hit.  Big box running master bled
> > > profusely under heavy load until I turned TTWU_QUEUE off.
> 
> May as well make it official and against master.today.  Fly or die
> little patchlet.
> 
> sched/fair: Move se->vruntime normalization state into struct sched_entity

Does this work?

---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 18 +++++++++++-------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1b43b45a22b9..a2001e01b3df 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1534,6 +1534,7 @@ struct task_struct {
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
 	unsigned sched_migrated:1;
+	unsigned sched_remote_wakeup:1;
 	unsigned :0; /* force alignment to the next boundary */
 
 	/* unserialized, strictly 'current' */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 404c0784b1fc..7f2cae4620c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1768,13 +1768,15 @@ void sched_ttwu_pending(void)
 	cookie = lockdep_pin_lock(&rq->lock);
 
 	while (llist) {
+		int wake_flags = 0;
+
 		p = llist_entry(llist, struct task_struct, wake_entry);
 		llist = llist_next(llist);
-		/*
-		 * See ttwu_queue(); we only call ttwu_queue_remote() when
-		 * its a x-cpu wakeup.
-		 */
-		ttwu_do_activate(rq, p, WF_MIGRATED, cookie);
+
+		if (p->sched_remote_wakeup)
+			wake_flags = WF_MIGRATED;
+
+		ttwu_do_activate(rq, p, wake_flags, cookie);
 	}
 
 	lockdep_unpin_lock(&rq->lock, cookie);
@@ -1819,10 +1821,12 @@ void scheduler_ipi(void)
 	irq_exit();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu)
+static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+
 	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))
 			smp_send_reschedule(cpu);
@@ -1869,7 +1873,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 #if defined(CONFIG_SMP)
 	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
 		sched_clock_cpu(cpu); /* sync clocks x-cpu */
-		ttwu_queue_remote(p, cpu);
+		ttwu_queue_remote(p, cpu, wake_flags);
 		return;
 	}
 #endif
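
In outline, the fix snapshots the migration decision on the waking CPU and
replays it on the target CPU; a simplified restatement of the diff above,
where the ordering comment is an assumption about llist_add() internals
rather than something stated in the thread:

	/* waking CPU: ttwu_queue_remote() stores the decision first */
	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
	llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list);	/* then publishes p */

	/* target CPU: sched_ttwu_pending() replays it at activation */
	ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, cookie);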

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity
  2016-05-23  9:19         ` Peter Zijlstra
@ 2016-05-23  9:40           ` Mike Galbraith
  2016-05-23 10:13             ` Wanpeng Li
  2016-05-23 12:28             ` Peter Zijlstra
  2016-05-25  7:12           ` [tip:sched/urgent] sched/core: Fix remote wakeups tip-bot for Peter Zijlstra
  1 sibling, 2 replies; 35+ messages in thread
From: Mike Galbraith @ 2016-05-23  9:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, Pavan Kondeti, Ben Segall, Matt Fleming,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

On Mon, 2016-05-23 at 11:19 +0200, Peter Zijlstra wrote:
> On Sun, May 22, 2016 at 09:00:01AM +0200, Mike Galbraith wrote:
> > On Sat, 2016-05-21 at 21:00 +0200, Mike Galbraith wrote:
> > > On Sat, 2016-05-21 at 16:04 +0200, Mike Galbraith wrote:
> > > 
> > > > Wakees that were not migrated/normalized eat an unwanted min_vruntime,
> > > > and likely take a size XXL latency hit.  Big box running master bled
> > > > profusely under heavy load until I turned TTWU_QUEUE off.
> > 
> > May as well make it official and against master.today.  Fly or die
> > little patchlet.
> > 
> > sched/fair: Move se->vruntime normalization state into struct sched_entity
> 
> Does this work?

Yup, bugs--.  Kinda funny, I considered ~this way first, but thought
you'd nak that approach.. dang, got it back-assward ;-)

> ---
>  include/linux/sched.h |  1 +
>  kernel/sched/core.c   | 18 +++++++++++-------
>  2 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1b43b45a22b9..a2001e01b3df 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1534,6 +1534,7 @@ struct task_struct {
>  	unsigned sched_reset_on_fork:1;
>  	unsigned sched_contributes_to_load:1;
>  	unsigned sched_migrated:1;
> +	unsigned sched_remote_wakeup:1;
>  	unsigned :0; /* force alignment to the next boundary */
>  
>  	/* unserialized, strictly 'current' */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 404c0784b1fc..7f2cae4620c7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1768,13 +1768,15 @@ void sched_ttwu_pending(void)
>  	cookie = lockdep_pin_lock(&rq->lock);
>  
>  	while (llist) {
> +		int wake_flags = 0;
> +
> 		p = llist_entry(llist, struct task_struct, wake_entry);
>  		llist = llist_next(llist);
> -		/*
> -		 * See ttwu_queue(); we only call ttwu_queue_remote() when
> -		 * its a x-cpu wakeup.
> -		 */
> -		ttwu_do_activate(rq, p, WF_MIGRATED, cookie);
> +
> +		if (p->sched_remote_wakeup)
> +			wake_flags = WF_MIGRATED;
> +
> +		ttwu_do_activate(rq, p, wake_flags, cookie);
>  	}
>  
>  	lockdep_unpin_lock(&rq->lock, cookie);
> @@ -1819,10 +1821,12 @@ void scheduler_ipi(void)
>  	irq_exit();
>  }
>  
> -static void ttwu_queue_remote(struct task_struct *p, int cpu)
> +static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>  
> +	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> +
>  	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>  		if (!set_nr_if_polling(rq->idle))
>  			smp_send_reschedule(cpu);
> @@ -1869,7 +1873,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>  #if defined(CONFIG_SMP)
> 	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
>  		sched_clock_cpu(cpu); /* sync clocks x-cpu */
> -		ttwu_queue_remote(p, cpu);
> +		ttwu_queue_remote(p, cpu, wake_flags);
>  		return;
>  	}
>  #endif

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity
  2016-05-23  9:40           ` Mike Galbraith
@ 2016-05-23 10:13             ` Wanpeng Li
  2016-05-23 10:26               ` Mike Galbraith
  2016-05-23 12:28             ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Wanpeng Li @ 2016-05-23 10:13 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pavan Kondeti,
	Ben Segall, Matt Fleming, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, Byungchul Park, Andrew Hunter

2016-05-23 17:40 GMT+08:00 Mike Galbraith <umgwanakikbuti@gmail.com>:
> On Mon, 2016-05-23 at 11:19 +0200, Peter Zijlstra wrote:
>> On Sun, May 22, 2016 at 09:00:01AM +0200, Mike Galbraith wrote:
>> > On Sat, 2016-05-21 at 21:00 +0200, Mike Galbraith wrote:
>> > > On Sat, 2016-05-21 at 16:04 +0200, Mike Galbraith wrote:
>> > >
>> > > > Wakees that were not migrated/normalized eat an unwanted min_vruntime,
>> > > > and likely take a size XXL latency hit.  Big box running master bled
>> > > > profusely under heavy load until I turned TTWU_QUEUE off.
>> >
>> > May as well make it official and against master.today.  Fly or die
>> > little patchlet.
>> >
>> > sched/fair: Move se->vruntime normalization state into struct sched_entity
>>
>> Does this work?
>
> Yup, bugs--.  Kinda funny, I considered ~this way first, but thought
> you'd nak that approach.. dang, got it back-assward ;-)
>

This one is nicer.

Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>

>> ---
>>  include/linux/sched.h |  1 +
>>  kernel/sched/core.c   | 18 +++++++++++-------
>>  2 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 1b43b45a22b9..a2001e01b3df 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1534,6 +1534,7 @@ struct task_struct {
>>       unsigned sched_reset_on_fork:1;
>>       unsigned sched_contributes_to_load:1;
>>       unsigned sched_migrated:1;
>> +     unsigned sched_remote_wakeup:1;
>>       unsigned :0; /* force alignment to the next boundary */
>>
>>       /* unserialized, strictly 'current' */
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 404c0784b1fc..7f2cae4620c7 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1768,13 +1768,15 @@ void sched_ttwu_pending(void)
>>       cookie = lockdep_pin_lock(&rq->lock);
>>
>>       while (llist) {
>> +             int wake_flags = 0;
>> +
>>               p = llist_entry(llist, struct task_struct, wake_entry);
>>               llist = llist_next(llist);
>> -             /*
>> -              * See ttwu_queue(); we only call ttwu_queue_remote() when
>> -              * its a x-cpu wakeup.
>> -              */
>> -             ttwu_do_activate(rq, p, WF_MIGRATED, cookie);
>> +
>> +             if (p->sched_remote_wakeup)
>> +                     wake_flags = WF_MIGRATED;
>> +
>> +             ttwu_do_activate(rq, p, wake_flags, cookie);
>>       }
>>
>>       lockdep_unpin_lock(&rq->lock, cookie);
>> @@ -1819,10 +1821,12 @@ void scheduler_ipi(void)
>>       irq_exit();
>>  }
>>
>> -static void ttwu_queue_remote(struct task_struct *p, int cpu)
>> +static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>>  {
>>       struct rq *rq = cpu_rq(cpu);
>>
>> +     p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
>> +
>>       if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>>               if (!set_nr_if_polling(rq->idle))
>>                       smp_send_reschedule(cpu);
>> @@ -1869,7 +1873,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>>  #if defined(CONFIG_SMP)
>>       if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
>>               sched_clock_cpu(cpu); /* sync clocks x-cpu */
>> -             ttwu_queue_remote(p, cpu);
>> +             ttwu_queue_remote(p, cpu, wake_flags);
>>               return;
>>       }
>>  #endif



-- 
Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity
  2016-05-23 10:13             ` Wanpeng Li
@ 2016-05-23 10:26               ` Mike Galbraith
  0 siblings, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2016-05-23 10:26 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pavan Kondeti,
	Ben Segall, Matt Fleming, Morten Rasmussen, Paul Turner,
	Thomas Gleixner, Byungchul Park, Andrew Hunter

On Mon, 2016-05-23 at 18:13 +0800, Wanpeng Li wrote:
> 2016-05-23 17:40 GMT+08:00 Mike Galbraith <umgwanakikbuti@gmail.com>:
> > On Mon, 2016-05-23 at 11:19 +0200, Peter Zijlstra wrote:
> > > On Sun, May 22, 2016 at 09:00:01AM +0200, Mike Galbraith wrote:
> > > > On Sat, 2016-05-21 at 21:00 +0200, Mike Galbraith wrote:
> > > > > On Sat, 2016-05-21 at 16:04 +0200, Mike Galbraith wrote:
> > > > > 
> > > > > > Wakees that were not migrated/normalized eat an unwanted min_vruntime,
> > > > > > and likely take a size XXL latency hit.  Big box running master bled
> > > > > > profusely under heavy load until I turned TTWU_QUEUE off.
> > > > 
> > > > May as well make it official and against master.today.  Fly or die
> > > > little patchlet.
> > > > 
> > > > sched/fair: Move se->vruntime normalization state into struct sched_entity
> > > 
> > > Does this work?
> > 
> > Yup, bugs--.  Kinda funny, I considered ~this way first, but thought
> > you'd nak that approach.. dang, got it back-assward ;-)
> > 
> 
> This one is nicer.

Well, the wakeup is a remote wakeup whether the flag is set or not, so
in that regard it's not _squeaky_ clean, but it is dinky, works (yay)..
and doesn't invent helpers whose names don't match what they actually do.

	-Mike

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity
  2016-05-23  9:40           ` Mike Galbraith
  2016-05-23 10:13             ` Wanpeng Li
@ 2016-05-23 12:28             ` Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2016-05-23 12:28 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, linux-kernel, Pavan Kondeti, Ben Segall, Matt Fleming,
	Morten Rasmussen, Paul Turner, Thomas Gleixner, byungchul.park,
	Andrew Hunter

On Mon, May 23, 2016 at 11:40:35AM +0200, Mike Galbraith wrote:

> Yup, bugs--.  Kinda funny, I considered ~this way first, but thought
> you'd not that approach.. dang, got it back-assward ;-)

Hehe, so the new flag is in a word we've already written to on this path
and it avoids growing the structure (by an undefined number of bytes,
see gripes about sizeof(bool)).
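
A stand-alone userspace illustration of that sizeof() point (not kernel
code; the layout numbers assume a common LP64 ABI):

#include <stdbool.h>
#include <stdio.h>

struct with_bit  { unsigned a:1; unsigned b:1; unsigned c:1; unsigned :0; };
struct with_bool { unsigned a:1; unsigned b:1; unsigned :0; bool c; };

int main(void)
{
	/* typically prints "4 8": the extra :1 shares the existing
	 * word, while the bool adds a byte plus trailing padding */
	printf("%zu %zu\n", sizeof(struct with_bit), sizeof(struct with_bool));
	return 0;
}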

I am thinking of doing the thing you did as a debug aid -- then again,
if the Google guys would finally get the global vruntime patches sorted,
we can go and kill all this code.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tip:sched/urgent] sched/core: Fix remote wakeups
  2016-05-23  9:19         ` Peter Zijlstra
  2016-05-23  9:40           ` Mike Galbraith
@ 2016-05-25  7:12           ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: tip-bot for Peter Zijlstra @ 2016-05-25  7:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dave.hansen, dvlasenk, ahh, peterz, umgwanakikbuti, matt, oleg,
	brgerst, pjt, bsegall, pkondeti, linux-kernel, paulmck, mingo,
	bp, hpa, morten.rasmussen, fenghua.yu, torvalds, tglx,
	quentin.casasnovas, luto

Commit-ID:  b7e7ade34e6188bee2e3b0d42b51d25137d9e2a5
Gitweb:     http://git.kernel.org/tip/b7e7ade34e6188bee2e3b0d42b51d25137d9e2a5
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 23 May 2016 11:19:07 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 May 2016 08:35:18 +0200

sched/core: Fix remote wakeups

Commit:

  b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")

... introduced a bug: Mike Galbraith found that it caused a
performance regression, while Paul E. McKenney reported lost wakeups
and bisected the problem to this commit.

The reason is that I mis-read ttwu_queue() such that I assumed any
wakeup that got a remote queue must have had the task migrated.

Since this is not so, we need to transfer this information between
queueing the wakeup and actually doing the wakeup. Use a new
task_struct flag (sched_remote_wakeup) for this; we already write to
sched_contributes_to_load in the wakeup path, so this is a hot,
already-modified cacheline.

Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Hunter <ahh@google.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ben Segall <bsegall@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
Link: http://lkml.kernel.org/r/20160523091907.GD15728@worktop.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 18 +++++++++++-------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6cc0df9..e053517 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1533,6 +1533,7 @@ struct task_struct {
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
 	unsigned sched_migrated:1;
+	unsigned sched_remote_wakeup:1;
 	unsigned :0; /* force alignment to the next boundary */
 
 	/* unserialized, strictly 'current' */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 404c078..7f2cae4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1768,13 +1768,15 @@ void sched_ttwu_pending(void)
 	cookie = lockdep_pin_lock(&rq->lock);
 
 	while (llist) {
+		int wake_flags = 0;
+
 		p = llist_entry(llist, struct task_struct, wake_entry);
 		llist = llist_next(llist);
-		/*
-		 * See ttwu_queue(); we only call ttwu_queue_remote() when
-		 * its a x-cpu wakeup.
-		 */
-		ttwu_do_activate(rq, p, WF_MIGRATED, cookie);
+
+		if (p->sched_remote_wakeup)
+			wake_flags = WF_MIGRATED;
+
+		ttwu_do_activate(rq, p, wake_flags, cookie);
 	}
 
 	lockdep_unpin_lock(&rq->lock, cookie);
@@ -1819,10 +1821,12 @@ void scheduler_ipi(void)
 	irq_exit();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu)
+static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+
 	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))
 			smp_send_reschedule(cpu);
@@ -1869,7 +1873,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 #if defined(CONFIG_SMP)
 	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
 		sched_clock_cpu(cpu); /* sync clocks x-cpu */
-		ttwu_queue_remote(p, cpu);
+		ttwu_queue_remote(p, cpu, wake_flags);
 		return;
 	}
 #endif

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity
  2016-05-24 17:04 [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity Paul E. McKenney
@ 2016-05-25 17:49 ` Paul E. McKenney
  0 siblings, 0 replies; 35+ messages in thread
From: Paul E. McKenney @ 2016-05-25 17:49 UTC (permalink / raw)
  To: peterz
  Cc: umgwanakikbuti, mingo, linux-kernel, bsegall, matt,
	morten.rasmussen, pjt, tglx, byungchul.park, ahh

On Tue, May 24, 2016 at 10:04:17AM -0700, Paul E. McKenney wrote:
> On Mon, May 23, 2016 at 11:19:07AM +0200, Peter Zijlstra wrote:
> > On Sun, May 22, 2016 at 09:00:01AM +0200, Mike Galbraith wrote:
> > > On Sat, 2016-05-21 at 21:00 +0200, Mike Galbraith wrote:
> > > > On Sat, 2016-05-21 at 16:04 +0200, Mike Galbraith wrote:
> > > >
> > > > > Wakees that were not migrated/normalized eat an unwanted min_vruntime,
> > > > > and likely take a size XXL latency hit.  Big box running master bled
> > > > > profusely under heavy load until I turned TTWU_QUEUE off.
> > >
> > > May as well make it official and against master.today.  Fly or die
> > > little patchlet.
> > >
> > > sched/fair: Move se->vruntime normalization state into struct sched_entity
> > 
> > Does this work?
> 
> This gets rid of the additional lost wakeups introduced during the
> merge window, thank you!
> 
> The pre-existing low-probability lost wakeups still persist, sad to say.
> Can't have everything, I guess.

However, their behavior has changed.  Previously, the TREE03 rcutorture
scenario saw the most stalls; now TREE07 appears to be the more likely
one.  The corresponding Kconfig fragments are shown below, in case they
give someone ideas about where the problem might be.

My current guess, based on insufficient data, is that TREE07 has about
an 80-90% chance of seeing a lost wakeup during a two-hour run.
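
For reference, a run along the following lines should exercise that
scenario; the path follows the in-tree rcutorture scripts, but the exact
flags are a best-effort sketch rather than a verified invocation:

	tools/testing/selftests/rcutorture/bin/kvm.sh \
		--cpus 16 --duration 120 --configs "TREE07"

where --duration is in minutes, matching the two-hour runs above.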

							Thanx, Paul

------------------------------------------------------------------------
TREE03
------------------------------------------------------------------------
CONFIG_SMP=y
CONFIG_NR_CPUS=16
CONFIG_PREEMPT_NONE=n
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=y
#CHECK#CONFIG_PREEMPT_RCU=y
CONFIG_HZ_PERIODIC=y
CONFIG_NO_HZ_IDLE=n
CONFIG_NO_HZ_FULL=n
CONFIG_RCU_TRACE=y
CONFIG_HOTPLUG_CPU=y
CONFIG_RCU_FANOUT=2
CONFIG_RCU_FANOUT_LEAF=2
CONFIG_RCU_NOCB_CPU=n
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_RCU_BOOST=y
CONFIG_RCU_KTHREAD_PRIO=2
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y

------------------------------------------------------------------------
TREE07
------------------------------------------------------------------------
CONFIG_SMP=y
CONFIG_NR_CPUS=16
CONFIG_CPUMASK_OFFSTACK=y
CONFIG_PREEMPT_NONE=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
#CHECK#CONFIG_TREE_RCU=y
CONFIG_HZ_PERIODIC=n
CONFIG_NO_HZ_IDLE=n
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ_FULL_ALL=n
CONFIG_NO_HZ_FULL_SYSIDLE=y
CONFIG_RCU_FAST_NO_HZ=n
CONFIG_RCU_TRACE=y
CONFIG_HOTPLUG_CPU=y
CONFIG_RCU_FANOUT=2
CONFIG_RCU_FANOUT_LEAF=2
CONFIG_RCU_NOCB_CPU=n
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity
@ 2016-05-24 17:04 Paul E. McKenney
  2016-05-25 17:49 ` Paul E. McKenney
  0 siblings, 1 reply; 35+ messages in thread
From: Paul E. McKenney @ 2016-05-24 17:04 UTC (permalink / raw)
  To: peterz
  Cc: umgwanakikbuti, mingo, linux-kernel, bsegall, matt,
	morten.rasmussen, pjt, tglx, byungchul.park, ahh

On Mon, May 23, 2016 at 11:19:07AM +0200, Peter Zijlstra wrote:
> On Sun, May 22, 2016 at 09:00:01AM +0200, Mike Galbraith wrote:
> > On Sat, 2016-05-21 at 21:00 +0200, Mike Galbraith wrote:
> > > On Sat, 2016-05-21 at 16:04 +0200, Mike Galbraith wrote:
> > >
> > > > Wakees that were not migrated/normalized eat an unwanted min_vruntime,
> > > > and likely take a size XXL latency hit.  Big box running master bled
> > > > profusely under heavy load until I turned TTWU_QUEUE off.
> >
> > May as well make it official and against master.today.  Fly or die
> > little patchlet.
> >
> > sched/fair: Move se->vruntime normalization state into struct sched_entity
> 
> Does this work?

This gets rid of the additional lost wakeups introduced during the
merge window, thank you!

The pre-existing low-probability lost wakeups still persist, sad to say.
Can't have everything, I guess.

Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  include/linux/sched.h |  1 +
>  kernel/sched/core.c   | 18 +++++++++++-------
>  2 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1b43b45a22b9..a2001e01b3df 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1534,6 +1534,7 @@ struct task_struct {
>         unsigned sched_reset_on_fork:1;
>         unsigned sched_contributes_to_load:1;
>         unsigned sched_migrated:1;
> +       unsigned sched_remote_wakeup:1;
>         unsigned :0; /* force alignment to the next boundary */
> 
>         /* unserialized, strictly 'current' */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 404c0784b1fc..7f2cae4620c7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1768,13 +1768,15 @@ void sched_ttwu_pending(void)
>         cookie = lockdep_pin_lock(&rq->lock);
> 
>         while (llist) {
> +               int wake_flags = 0;
> +
>                 p = llist_entry(llist, struct task_struct, wake_entry);
>                 llist = llist_next(llist);
> -               /*
> -                * See ttwu_queue(); we only call ttwu_queue_remote() when
> -                * its a x-cpu wakeup.
> -                */
> -               ttwu_do_activate(rq, p, WF_MIGRATED, cookie);
> +
> +               if (p->sched_remote_wakeup)
> +                       wake_flags = WF_MIGRATED;
> +
> +               ttwu_do_activate(rq, p, wake_flags, cookie);
>         }
> 
>         lockdep_unpin_lock(&rq->lock, cookie);
> @@ -1819,10 +1821,12 @@ void scheduler_ipi(void)
>         irq_exit();
>  }
> 
> -static void ttwu_queue_remote(struct task_struct *p, int cpu)
> +static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>  {
>         struct rq *rq = cpu_rq(cpu);
> 
> +       p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> +
>         if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>                 if (!set_nr_if_polling(rq->idle))
>                         smp_send_reschedule(cpu);
> @@ -1869,7 +1873,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>  #if defined(CONFIG_SMP)
>         if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
>                 sched_clock_cpu(cpu); /* sync clocks x-cpu */
> -               ttwu_queue_remote(p, cpu);
> +               ttwu_queue_remote(p, cpu, wake_flags);
>                 return;
>         }
>  #endif

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2016-05-25 17:49 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-10 17:43 [PATCH 0/3] sched: Fix wakeup preemption regression Peter Zijlstra
2016-05-10 17:43 ` [PATCH 1/3] sched,fair: Move record_wakee() Peter Zijlstra
2016-05-12 10:27   ` Matt Fleming
2016-05-12 10:31     ` Peter Zijlstra
2016-05-10 17:43 ` [PATCH 2/3] sched,fair: Fix local starvation Peter Zijlstra
2016-05-10 20:21   ` Ingo Molnar
2016-05-10 22:23     ` Peter Zijlstra
2016-05-20 21:24   ` Matt Fleming
2016-05-21 14:04   ` Mike Galbraith
2016-05-21 19:00     ` Mike Galbraith
2016-05-22  7:00       ` [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity Mike Galbraith
2016-05-22  9:36         ` Peter Zijlstra
2016-05-22  9:52           ` Mike Galbraith
2016-05-22 10:33           ` Peter Zijlstra
2016-05-23  9:19         ` Peter Zijlstra
2016-05-23  9:40           ` Mike Galbraith
2016-05-23 10:13             ` Wanpeng Li
2016-05-23 10:26               ` Mike Galbraith
2016-05-23 12:28             ` Peter Zijlstra
2016-05-25  7:12           ` [tip:sched/urgent] sched/core: Fix remote wakeups tip-bot for Peter Zijlstra
2016-05-22  6:50     ` [PATCH 2/3] sched,fair: Fix local starvation Wanpeng Li
2016-05-22  7:15       ` Mike Galbraith
2016-05-22  7:27         ` Wanpeng Li
2016-05-22  7:32           ` Mike Galbraith
2016-05-22  7:42             ` Wanpeng Li
2016-05-22  8:04               ` Mike Galbraith
2016-05-22  8:24                 ` Wanpeng Li
2016-05-22  8:39                   ` Mike Galbraith
2016-05-22  8:50                     ` Wanpeng Li
2016-05-10 17:43 ` [PATCH 3/3] sched: Kill sched_class::task_waking Peter Zijlstra
2016-05-11  5:55 ` [PATCH 0/3] sched: Fix wakeup preemption regression Mike Galbraith
2016-05-12  9:56 ` Pavan Kondeti
2016-05-12 10:52 ` Matt Fleming
2016-05-24 17:04 [patch] sched/fair: Move se->vruntime normalization state into struct sched_entity Paul E. McKenney
2016-05-25 17:49 ` Paul E. McKenney
