* [PATCH 0/2] Optimise try_to_wake_up() when wakee is descheduling
@ 2020-05-24 20:29 Mel Gorman
  2020-05-24 20:29 ` [PATCH 1/2] sched: Optimize ttwu() spinning on p->on_cpu Mel Gorman
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Mel Gorman @ 2020-05-24 20:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jirka Hladky, Ingo Molnar, Vincent Guittot, valentin.schneider,
	Hillf Danton, Rik van Riel, LKML, Mel Gorman

The following two patches optimise try_to_wake_up() when the wakee is
descheduling. In a vanilla kernel, there can be excessive time spent
spinning on p->on_rq. This is fine if it's a strictly synchronous wakeup
and the waker is going to sleep but in other cases, the waker spins until
it can do work that can be deferred to the wakee.

The first patch frontloads work that can be done before p->on_rq is
checked.  If it's a wakeup on a CPU that does not share cache then the
wakelist is used instead of spinning. The second patch goes a little
further and uses the wakelist if the wakee is descheduling and is the
only task running on the target CPU.

The performance impact is documented in the changelog of the second patch.
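
For illustration, this is roughly what the fast path in try_to_wake_up()
looks like with both patches applied. It is condensed from the diffs below
rather than quoted literally; ttwu_queue_wakelist() and WF_ON_RQ are only
introduced by the second patch (the first patch uses ttwu_queue_remote()
without the extra flag):

  if (READ_ONCE(p->on_cpu) &&
      ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_RQ))
          goto unlock;    /* queue on the remote wakelist and IPI, no spin */

  /* Otherwise wait for the wakee to finish descheduling, as before. */
  smp_cond_load_acquire(&p->on_cpu, !VAL);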

 kernel/sched/core.c  | 81 ++++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h |  3 +-
 2 files changed, 61 insertions(+), 23 deletions(-)

-- 
2.26.1



* [PATCH 1/2] sched: Optimize ttwu() spinning on p->on_cpu
  2020-05-24 20:29 [PATCH 0/2] Optimise try_to_wake_up() when wakee is descheduling Mel Gorman
@ 2020-05-24 20:29 ` Mel Gorman
  2020-05-25  5:26   ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
  2020-05-24 20:29 ` [PATCH 2/2] sched: Offload wakee task activation if the wakee is descheduling Mel Gorman
  2020-05-25  5:26 ` [PATCH 0/2] Optimise try_to_wake_up() when " Ingo Molnar
  2 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2020-05-24 20:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jirka Hladky, Ingo Molnar, Vincent Guittot, valentin.schneider,
	Hillf Danton, Rik van Riel, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Both Rik and Mel reported seeing ttwu() spend significant time on:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

Attempt to avoid this by queueing the wakeup on the CPU that owns the
p->on_cpu value. This will then allow the ttwu() to complete without
further waiting.

Since we run schedule() with interrupts disabled, the IPI is
guaranteed to happen after p->on_cpu is cleared; this is what makes it
safe to queue early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/core.c | 52 +++++++++++++++++++++++++++------------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..987b8ecf2ee9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2330,7 +2330,7 @@ void scheduler_ipi(void)
 	irq_exit();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -2372,6 +2372,17 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 {
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+{
+	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+		__ttwu_queue_remote(p, cpu, wake_flags);
+		return true;
+	}
+
+	return false;
+}
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -2380,11 +2391,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		ttwu_queue_remote(p, cpu, wake_flags);
+	if (ttwu_queue_remote(p, cpu, wake_flags))
 		return;
-	}
 #endif
 
 	rq_lock(rq, &rf);
@@ -2566,7 +2574,15 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (p->on_rq && ttwu_remote(p, wake_flags))
 		goto unlock;
 
+	if (p->in_iowait) {
+		delayacct_blkio_end(p);
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 #ifdef CONFIG_SMP
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2588,6 +2604,16 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_rmb();
 
+	/*
+	 * If the owning (remote) CPU is still in the middle of schedule() with
+	 * this task as prev, considering queueing p on the remote CPUs wake_list
+	 * which potentially sends an IPI instead of spinning on p->on_cpu to
+	 * let the waker make forward progress. This is safe because IRQs are
+	 * disabled and the IPI will deliver after on_cpu is cleared.
+	 */
+	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+		goto unlock;
+
 	/*
 	 * If the owning (remote) CPU is still in the middle of schedule() with
 	 * this task as prev, wait until its done referencing the task.
@@ -2599,28 +2625,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
-
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		psi_ttwu_dequeue(p);
 		set_task_cpu(p, cpu);
 	}
-
-#else /* CONFIG_SMP */
-
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
-
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);
-- 
2.26.1



* [PATCH 2/2] sched: Offload wakee task activation if the wakee is descheduling
  2020-05-24 20:29 [PATCH 0/2] Optimise try_to_wake_up() when wakee is descheduling Mel Gorman
  2020-05-24 20:29 ` [PATCH 1/2] sched: Optimize ttwu() spinning on p->on_cpu Mel Gorman
@ 2020-05-24 20:29 ` Mel Gorman
  2020-05-25  5:26   ` [tip: sched/core] sched/core: " tip-bot2 for Mel Gorman
  2020-05-27 10:10   ` [PATCH 2/2] sched: " Pavan Kondeti
  2020-05-25  5:26 ` [PATCH 0/2] Optimise try_to_wake_up() when " Ingo Molnar
  2 siblings, 2 replies; 7+ messages in thread
From: Mel Gorman @ 2020-05-24 20:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jirka Hladky, Ingo Molnar, Vincent Guittot, valentin.schneider,
	Hillf Danton, Rik van Riel, LKML, Mel Gorman

The patch "sched: Optimize ttwu() spinning on p->on_cpu" avoids spinning
on p->on_rq when the task is descheduling but only if the wakee is on
a CPU that does not share cache with the waker. This patch offloads the
activation of the wakee to the CPU that is about to go idle if the task
is the only one on the runqueue. This potentially allows the waker task
to continue making progress when the wakeup is not strictly synchronous.
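
For readability, the condition added below (in ttwu_queue_cond()) condenses
to the following; this is taken from the diff in this patch rather than
being new logic:

  /* Use the wakelist (and an IPI if needed) instead of a local activation: */
  if (!cpus_share_cache(smp_processor_id(), cpu))
          return true;    /* previous patch: wakeup crosses the LLC boundary */

  if ((wake_flags & WF_ON_RQ) && cpu_rq(cpu)->nr_running <= 1)
          return true;    /* this patch: wakee descheduling, CPU about to go idle */

  return false;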

This is very obvious with netperf UDP_STREAM running on localhost. The
waker is sending packets as quickly as possible without waiting for any
reply. It frequently wakes the server for the processing of packets and
when netserver is using local memory, it quickly completes the processing
and goes back to idle. The waker often observes that netserver is on_rq
and spins excessively leading to a drop in throughput.

This is a comparison of 5.7-rc6 against "sched: Optimize ttwu() spinning
on p->on_cpu" and against this patch labeled vanilla, optttwu-v1r1 and
localwakelist-v1r2 respectively.

                                  5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                    vanilla           optttwu-v1r1     localwakelist-v1r2
Hmean     send-64         251.49 (   0.00%)      258.05 *   2.61%*      305.59 *  21.51%*
Hmean     send-128        497.86 (   0.00%)      519.89 *   4.43%*      600.25 *  20.57%*
Hmean     send-256        944.90 (   0.00%)      997.45 *   5.56%*     1140.19 *  20.67%*
Hmean     send-1024      3779.03 (   0.00%)     3859.18 *   2.12%*     4518.19 *  19.56%*
Hmean     send-2048      7030.81 (   0.00%)     7315.99 *   4.06%*     8683.01 *  23.50%*
Hmean     send-3312     10847.44 (   0.00%)    11149.43 *   2.78%*    12896.71 *  18.89%*
Hmean     send-4096     13436.19 (   0.00%)    13614.09 (   1.32%)    15041.09 *  11.94%*
Hmean     send-8192     22624.49 (   0.00%)    23265.32 *   2.83%*    24534.96 *   8.44%*
Hmean     send-16384    34441.87 (   0.00%)    36457.15 *   5.85%*    35986.21 *   4.48%*

Note that this benefit is not universal to all wakeups, it only applies
to the case where the waker often spins on p->on_rq.

The impact can be seen from a "perf sched latency" report generated from
a single iteration of one packet size.

 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------

vanilla
  netperf:4337          |  21709.193 ms |     2932 | avg:    0.002 ms | max:    0.041 ms | max at:    112.154512 s
  netserver:4338        |  14629.459 ms |  5146990 | avg:    0.001 ms | max: 1615.864 ms | max at:    140.134496 s

localwakelist-v1r2
  netperf:4339          |  29789.717 ms |     2460 | avg:    0.002 ms | max:    0.059 ms | max at:    138.205389 s
  netserver:4340        |  18858.767 ms |  7279005 | avg:    0.001 ms | max:    0.362 ms | max at:    135.709683 s
 -----------------------------------------------------------------------------------------------------------------

Note that the average wakeup delay is quite small on both the vanilla
kernel and with the two patches applied. However, the vanilla kernel has
significant outliers, with the maximum measured at 1615 milliseconds;
with both patches applied the maximum delay is never worse than 0.362 ms,
even with a much higher rate of context switching.

Similarly, a separate profile of cycles showed that 2.83% of all cycles
were spent in try_to_wake_up(), with almost half of those cycles spent
spinning on p->on_rq. With the two patches, the percentage of cycles
spent in try_to_wake_up() drops to 1.13%.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/core.c  | 41 ++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |  3 ++-
 2 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 987b8ecf2ee9..cdf0da1dcc05 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2330,13 +2330,19 @@ void scheduler_ipi(void)
 	irq_exit();
 }
 
-static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+/*
+ * Queue a task on the target CPUs wake_list and wake the CPU via IPI if
+ * necessary. The wakee CPU on receipt of the IPI will queue the task
+ * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
+ * of the wakeup instead of the waker.
+ */
+static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
+	if (llist_add(&p->wake_entry, &rq->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))
 			smp_send_reschedule(cpu);
 		else
@@ -2373,11 +2379,32 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
 
-static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static inline bool ttwu_queue_cond(int cpu, int wake_flags)
+{
+	/*
+	 * If the CPU does not share cache, then queue the task on the
+	 * remote rqs wakelist to avoid accessing remote data.
+	 */
+	if (!cpus_share_cache(smp_processor_id(), cpu))
+		return true;
+
+	/*
+	 * If the task is descheduling and the only running task on the
+	 * CPU then use the wakelist to offload the task activation to
+	 * the soon-to-be-idle CPU as the current CPU is likely busy.
+	 * nr_running is checked to avoid unnecessary task stacking.
+	 */
+	if ((wake_flags & WF_ON_RQ) && cpu_rq(cpu)->nr_running <= 1)
+		return true;
+
+	return false;
+}
+
+static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
 		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		__ttwu_queue_remote(p, cpu, wake_flags);
+		__ttwu_queue_wakelist(p, cpu, wake_flags);
 		return true;
 	}
 
@@ -2391,7 +2418,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (ttwu_queue_remote(p, cpu, wake_flags))
+	if (ttwu_queue_wakelist(p, cpu, wake_flags))
 		return;
 #endif
 
@@ -2611,7 +2638,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * let the waker make forward progress. This is safe because IRQs are
 	 * disabled and the IPI will deliver after on_cpu is cleared.
 	 */
-	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+	if (READ_ONCE(p->on_cpu) && ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_RQ))
 		goto unlock;
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..06297d1142a0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1688,7 +1688,8 @@ static inline int task_on_rq_migrating(struct task_struct *p)
  */
 #define WF_SYNC			0x01		/* Waker goes to sleep after wakeup */
 #define WF_FORK			0x02		/* Child wakeup after fork */
-#define WF_MIGRATED		0x4		/* Internal use, task got migrated */
+#define WF_MIGRATED		0x04		/* Internal use, task got migrated */
+#define WF_ON_RQ		0x08		/* Wakee is on_rq */
 
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution
-- 
2.26.1



* [tip: sched/core] sched/core: Offload wakee task activation if the wakee is descheduling
  2020-05-24 20:29 ` [PATCH 2/2] sched: Offload wakee task activation if the wakee is descheduling Mel Gorman
@ 2020-05-25  5:26   ` tip-bot2 for Mel Gorman
  2020-05-27 10:10   ` [PATCH 2/2] sched: " Pavan Kondeti
  1 sibling, 0 replies; 7+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-05-25  5:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Jirka Hladky, Vincent Guittot, valentin.schneider, Hillf Danton,
	Rik van Riel, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2ebb17717550607bcd85fb8cf7d24ac870e9d762
Gitweb:        https://git.kernel.org/tip/2ebb17717550607bcd85fb8cf7d24ac870e9d762
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Sun, 24 May 2020 21:29:56 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 25 May 2020 07:04:10 +02:00

sched/core: Offload wakee task activation if the wakee is descheduling

The previous commit:

  c6e7bd7afaeb: ("sched/core: Optimize ttwu() spinning on p->on_cpu")

avoids spinning on p->on_rq when the task is descheduling, but only if the
wakee is on a CPU that does not share cache with the waker.

This patch offloads the activation of the wakee to the CPU that is about to
go idle if the task is the only one on the runqueue. This potentially allows
the waker task to continue making progress when the wakeup is not strictly
synchronous.

This is very obvious with netperf UDP_STREAM running on localhost. The
waker is sending packets as quickly as possible without waiting for any
reply. It frequently wakes the server for the processing of packets and
when netserver is using local memory, it quickly completes the processing
and goes back to idle. The waker often observes that netserver is on_rq
and spins excessively leading to a drop in throughput.

This is a comparison of 5.7-rc6 against "sched: Optimize ttwu() spinning
on p->on_cpu" and against this patch labeled vanilla, optttwu-v1r1 and
localwakelist-v1r2 respectively.

                                  5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                    vanilla           optttwu-v1r1     localwakelist-v1r2
Hmean     send-64         251.49 (   0.00%)      258.05 *   2.61%*      305.59 *  21.51%*
Hmean     send-128        497.86 (   0.00%)      519.89 *   4.43%*      600.25 *  20.57%*
Hmean     send-256        944.90 (   0.00%)      997.45 *   5.56%*     1140.19 *  20.67%*
Hmean     send-1024      3779.03 (   0.00%)     3859.18 *   2.12%*     4518.19 *  19.56%*
Hmean     send-2048      7030.81 (   0.00%)     7315.99 *   4.06%*     8683.01 *  23.50%*
Hmean     send-3312     10847.44 (   0.00%)    11149.43 *   2.78%*    12896.71 *  18.89%*
Hmean     send-4096     13436.19 (   0.00%)    13614.09 (   1.32%)    15041.09 *  11.94%*
Hmean     send-8192     22624.49 (   0.00%)    23265.32 *   2.83%*    24534.96 *   8.44%*
Hmean     send-16384    34441.87 (   0.00%)    36457.15 *   5.85%*    35986.21 *   4.48%*

Note that this benefit is not universal to all wakeups, it only applies
to the case where the waker often spins on p->on_rq.

The impact can be seen from a "perf sched latency" report generated from
a single iteration of one packet size:

   -----------------------------------------------------------------------------------------------------------------
    Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
   -----------------------------------------------------------------------------------------------------------------

  vanilla
    netperf:4337          |  21709.193 ms |     2932 | avg:    0.002 ms | max:    0.041 ms | max at:    112.154512 s
    netserver:4338        |  14629.459 ms |  5146990 | avg:    0.001 ms | max: 1615.864 ms | max at:    140.134496 s

  localwakelist-v1r2
    netperf:4339          |  29789.717 ms |     2460 | avg:    0.002 ms | max:    0.059 ms | max at:    138.205389 s
    netserver:4340        |  18858.767 ms |  7279005 | avg:    0.001 ms | max:    0.362 ms | max at:    135.709683 s
   -----------------------------------------------------------------------------------------------------------------

Note that the average wakeup delay is quite small on both the vanilla
kernel and with the two patches applied. However, the vanilla kernel has
significant outliers, with the maximum measured at 1615 milliseconds;
with both patches applied the maximum delay is never worse than 0.362 ms,
even with a much higher rate of context switching.

Similarly, a separate profile of cycles showed that 2.83% of all cycles
were spent in try_to_wake_up(), with almost half of those cycles spent
spinning on p->on_rq. With the two patches, the percentage of cycles
spent in try_to_wake_up() drops to 1.13%.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: valentin.schneider@arm.com
Cc: Hillf Danton <hdanton@sina.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20200524202956.27665-3-mgorman@techsingularity.net
---
 kernel/sched/core.c  | 39 +++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  3 ++-
 2 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 903c9ee..6130ab1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2312,7 +2312,13 @@ static void wake_csd_func(void *info)
 	sched_ttwu_pending();
 }
 
-static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+/*
+ * Queue a task on the target CPUs wake_list and wake the CPU via IPI if
+ * necessary. The wakee CPU on receipt of the IPI will queue the task
+ * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
+ * of the wakeup instead of the waker.
+ */
+static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -2355,11 +2361,32 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
 
-static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static inline bool ttwu_queue_cond(int cpu, int wake_flags)
+{
+	/*
+	 * If the CPU does not share cache, then queue the task on the
+	 * remote rqs wakelist to avoid accessing remote data.
+	 */
+	if (!cpus_share_cache(smp_processor_id(), cpu))
+		return true;
+
+	/*
+	 * If the task is descheduling and the only running task on the
+	 * CPU then use the wakelist to offload the task activation to
+	 * the soon-to-be-idle CPU as the current CPU is likely busy.
+	 * nr_running is checked to avoid unnecessary task stacking.
+	 */
+	if ((wake_flags & WF_ON_RQ) && cpu_rq(cpu)->nr_running <= 1)
+		return true;
+
+	return false;
+}
+
+static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
 		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		__ttwu_queue_remote(p, cpu, wake_flags);
+		__ttwu_queue_wakelist(p, cpu, wake_flags);
 		return true;
 	}
 
@@ -2373,7 +2400,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (ttwu_queue_remote(p, cpu, wake_flags))
+	if (ttwu_queue_wakelist(p, cpu, wake_flags))
 		return;
 #endif
 
@@ -2593,7 +2620,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * let the waker make forward progress. This is safe because IRQs are
 	 * disabled and the IPI will deliver after on_cpu is cleared.
 	 */
-	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+	if (READ_ONCE(p->on_cpu) && ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_RQ))
 		goto unlock;
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f7ab633..4b32cff 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1685,7 +1685,8 @@ static inline int task_on_rq_migrating(struct task_struct *p)
  */
 #define WF_SYNC			0x01		/* Waker goes to sleep after wakeup */
 #define WF_FORK			0x02		/* Child wakeup after fork */
-#define WF_MIGRATED		0x4		/* Internal use, task got migrated */
+#define WF_MIGRATED		0x04		/* Internal use, task got migrated */
+#define WF_ON_RQ		0x08		/* Wakee is on_rq */
 
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution


* [tip: sched/core] sched/core: Optimize ttwu() spinning on p->on_cpu
  2020-05-24 20:29 ` [PATCH 1/2] sched: Optimize ttwu() spinning on p->on_cpu Mel Gorman
@ 2020-05-25  5:26   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 7+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-25  5:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Mel Gorman, Ingo Molnar, Jirka Hladky, Vincent Guittot,
	valentin.schneider, Hillf Danton, Rik van Riel, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c6e7bd7afaeb3af55ffac122828035f1c01d1d7b
Gitweb:        https://git.kernel.org/tip/c6e7bd7afaeb3af55ffac122828035f1c01d1d7b
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Sun, 24 May 2020 21:29:55 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 25 May 2020 07:01:44 +02:00

sched/core: Optimize ttwu() spinning on p->on_cpu

Both Rik and Mel reported seeing ttwu() spend significant time on:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

Attempt to avoid this by queueing the wakeup on the CPU that owns the
p->on_cpu value. This will then allow the ttwu() to complete without
further waiting.

Since we run schedule() with interrupts disabled, the IPI is
guaranteed to happen after p->on_cpu is cleared; this is what makes it
safe to queue early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: valentin.schneider@arm.com
Cc: Hillf Danton <hdanton@sina.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20200524202956.27665-2-mgorman@techsingularity.net
---
 kernel/sched/core.c | 52 ++++++++++++++++++++++++++------------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa905b6..903c9ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2312,7 +2312,7 @@ static void wake_csd_func(void *info)
 	sched_ttwu_pending();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -2354,6 +2354,17 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 {
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+{
+	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+		__ttwu_queue_remote(p, cpu, wake_flags);
+		return true;
+	}
+
+	return false;
+}
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -2362,11 +2373,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		ttwu_queue_remote(p, cpu, wake_flags);
+	if (ttwu_queue_remote(p, cpu, wake_flags))
 		return;
-	}
 #endif
 
 	rq_lock(rq, &rf);
@@ -2548,7 +2556,15 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (p->on_rq && ttwu_remote(p, wake_flags))
 		goto unlock;
 
+	if (p->in_iowait) {
+		delayacct_blkio_end(p);
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 #ifdef CONFIG_SMP
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2572,6 +2588,16 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 	/*
 	 * If the owning (remote) CPU is still in the middle of schedule() with
+	 * this task as prev, considering queueing p on the remote CPUs wake_list
+	 * which potentially sends an IPI instead of spinning on p->on_cpu to
+	 * let the waker make forward progress. This is safe because IRQs are
+	 * disabled and the IPI will deliver after on_cpu is cleared.
+	 */
+	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+		goto unlock;
+
+	/*
+	 * If the owning (remote) CPU is still in the middle of schedule() with
 	 * this task as prev, wait until its done referencing the task.
 	 *
 	 * Pairs with the smp_store_release() in finish_task().
@@ -2581,28 +2607,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
-
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		psi_ttwu_dequeue(p);
 		set_task_cpu(p, cpu);
 	}
-
-#else /* CONFIG_SMP */
-
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
-
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);


* Re: [PATCH 0/2] Optimise try_to_wake_up() when wakee is descheduling
  2020-05-24 20:29 [PATCH 0/2] Optimise try_to_wake_up() when wakee is descheduling Mel Gorman
  2020-05-24 20:29 ` [PATCH 1/2] sched: Optimize ttwu() spinning on p->on_cpu Mel Gorman
  2020-05-24 20:29 ` [PATCH 2/2] sched: Offload wakee task activation if the wakee is descheduling Mel Gorman
@ 2020-05-25  5:26 ` Ingo Molnar
  2 siblings, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2020-05-25  5:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Jirka Hladky, Vincent Guittot,
	valentin.schneider, Hillf Danton, Rik van Riel, LKML


* Mel Gorman <mgorman@techsingularity.net> wrote:

> The following two patches optimise try_to_wake_up() when the wakee is
> descheduling. In a vanilla kernel, there can be excessive time spent
> spinning on p->on_rq. This is fine if it's a strictly synchronous wakeup
> and the waker is going to sleep but in other cases, the waker spins until
> it can do work that can be deferred to the wakee.
> 
> The first patch frontloads work that can be done before p->on_rq is
> checked.  If it's a wakeup on a CPU that does not share cache then the
> wakelist is used instead of spinning. The second patch goes a little
> further and uses the wakelist if the wakee is descheduling and is the
> only task running on the target CPU.
> 
> The performance impact is documented in the changelog of the second patch.
> 
>  kernel/sched/core.c  | 81 ++++++++++++++++++++++++++++++++------------
>  kernel/sched/sched.h |  3 +-
>  2 files changed, 61 insertions(+), 23 deletions(-)

Thanks, these are really nice improvements; those latency spikes in netperf
runs were always annoying, and I think they've been there forever.

I've applied these two patches to tip:sched/core:

  c6e7bd7afaeb: ("sched/core: Optimize ttwu() spinning on p->on_cpu")
  2ebb17717550: ("sched/core: Offload wakee task activation if the wakee is descheduling")

In 2ebb17717550 I resolved the conflict with a recent cleanup by PeterZ:

  90b5363acd47: ("sched: Clean up scheduler_ipi()")

... which had a drive-by cleanup of ttwu_queue_remote() as well.

Thanks,

	Ingo


* Re: [PATCH 2/2] sched: Offload wakee task activation if the wakee is descheduling
  2020-05-24 20:29 ` [PATCH 2/2] sched: Offload wakee task activation if the wakee is descheduling Mel Gorman
  2020-05-25  5:26   ` [tip: sched/core] sched/core: " tip-bot2 for Mel Gorman
@ 2020-05-27 10:10   ` Pavan Kondeti
  1 sibling, 0 replies; 7+ messages in thread
From: Pavan Kondeti @ 2020-05-27 10:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Jirka Hladky, Ingo Molnar, Vincent Guittot,
	valentin.schneider, Hillf Danton, Rik van Riel, LKML

On Sun, May 24, 2020 at 09:29:56PM +0100, Mel Gorman wrote:
> The patch "sched: Optimize ttwu() spinning on p->on_cpu" avoids spinning
> on p->on_rq when the task is descheduling but only if the wakee is on
> a CPU that does not share cache with the waker. This patch offloads the
> activation of the wakee to the CPU that is about to go idle if the task
> is the only one on the runqueue. This potentially allows the waker task
> to continue making progress when the wakeup is not strictly synchronous.
> 
> This is very obvious with netperf UDP_STREAM running on localhost. The
> waker is sending packets as quickly as possible without waiting for any
> reply. It frequently wakes the server for the processing of packets and
> when netserver is using local memory, it quickly completes the processing
> and goes back to idle. The waker often observes that netserver is on_rq
> and spins excessively leading to a drop in throughput.
> 
> This is a comparison of 5.7-rc6 against "sched: Optimize ttwu() spinning
> on p->on_cpu" and against this patch labeled vanilla, optttwu-v1r1 and
> localwakelist-v1r2 respectively.
> 
>                                   5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
>                                     vanilla           optttwu-v1r1     localwakelist-v1r2
> Hmean     send-64         251.49 (   0.00%)      258.05 *   2.61%*      305.59 *  21.51%*
> Hmean     send-128        497.86 (   0.00%)      519.89 *   4.43%*      600.25 *  20.57%*
> Hmean     send-256        944.90 (   0.00%)      997.45 *   5.56%*     1140.19 *  20.67%*
> Hmean     send-1024      3779.03 (   0.00%)     3859.18 *   2.12%*     4518.19 *  19.56%*
> Hmean     send-2048      7030.81 (   0.00%)     7315.99 *   4.06%*     8683.01 *  23.50%*
> Hmean     send-3312     10847.44 (   0.00%)    11149.43 *   2.78%*    12896.71 *  18.89%*
> Hmean     send-4096     13436.19 (   0.00%)    13614.09 (   1.32%)    15041.09 *  11.94%*
> Hmean     send-8192     22624.49 (   0.00%)    23265.32 *   2.83%*    24534.96 *   8.44%*
> Hmean     send-16384    34441.87 (   0.00%)    36457.15 *   5.85%*    35986.21 *   4.48%*
> 
> Note that this benefit is not universal to all wakeups, it only applies
> to the case where the waker often spins on p->on_rq.
> 
Thanks for the detailed description. I think you meant p->on_cpu here, not
p->on_rq. I know this patch is already included; if possible, please make
this edit here and below.

>  	/*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index db3a57675ccf..06297d1142a0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1688,7 +1688,8 @@ static inline int task_on_rq_migrating(struct task_struct *p)
>   */
>  #define WF_SYNC			0x01		/* Waker goes to sleep after wakeup */
>  #define WF_FORK			0x02		/* Child wakeup after fork */
> -#define WF_MIGRATED		0x4		/* Internal use, task got migrated */
> +#define WF_MIGRATED		0x04		/* Internal use, task got migrated */
> +#define WF_ON_RQ		0x08		/* Wakee is on_rq */
>  
should this be WF_ON_CPU? There is a different check for p->on_rq in
try_to_wake_up.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


