From: Mel Gorman <mgorman@techsingularity.net>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>, Phil Auld <pauld@redhat.com>,
	Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Hillf Danton <hdanton@sina.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Douglas Shakshober <dshaks@redhat.com>,
	Waiman Long <longman@redhat.com>, Joe Mario <jmario@redhat.com>,
	Bill Gray <bgray@redhat.com>,
	riel@surriel.com
Subject: Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
Date: Fri, 22 May 2020 14:28:54 +0100
Message-ID: <20200522132854.GF7167@techsingularity.net>
In-Reply-To: <20200521114132.GI325280@hirez.programming.kicks-ass.net>
On Thu, May 21, 2020 at 01:41:32PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 11:38:16AM +0100, Mel Gorman wrote:
> > IIUC, this patch front-loads as much work as possible before checking if
> > the task is on_rq and then the waker/wakee shares a cache, queue task on
> > the wake_list and otherwise do a direct wakeup.
> >
> > The advantage is that spinning is avoided on p->on_rq when p does not
> > share a cache. The disadvantage is that it may result in tasks being
> > stacked but this should only happen when the domain is overloaded and
> > select_task_rq() is unlikely to find an idle CPU. The load balancer would
> > soon correct the situation anyway.
> >
> > In terms of netperf for my testing, the benefit is marginal because the
> > wakeups are primarily between tasks that share cache. It does trigger as
> > perf indicates that some time is spent in ttwu_queue_remote with this
> > patch, it's just that the overall time spent spinning on p->on_rq is
> > very similar. I'm still waiting on other workloads to complete to see
> > what the impact is.
>
> So it might make sense to play with the exact conditions under which
> we'll attempt this remote queue, if we see a large 'local' p->on_cpu
> spin time, it might make sense to attempt the queue even in this case.
>
> We could for example change it to:
>
> 	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags | WF_ON_CPU))
> 		goto unlock;
>
> and then use that in ttwu_queue_remote() to differentiate between these
> two cases.
>
> #endif /* CONFIG_SMP */
>
> ttwu_queue(p, cpu, wake_flags);
Is something like this on top of your patch what you had in mind?

---8<---

---
 kernel/sched/core.c  | 35 ++++++++++++++++++++++++++---------
 kernel/sched/sched.h |  3 ++-
 2 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 987b8ecf2ee9..435ecf5820ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2330,13 +2330,19 @@ void scheduler_ipi(void)
 	irq_exit();
 }
 
-static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+/*
+ * Queue a task on the target CPUs wake_list and wake the CPU via IPI if
+ * necessary. The wakee CPU on receipt of the IPI will queue the task
+ * via sched_ttwu_wakeup() for activation instead of the waking task
+ * activating and queueing the wakee.
+ */
+static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
+	if (llist_add(&p->wake_entry, &rq->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))
 			smp_send_reschedule(cpu);
 		else
@@ -2373,12 +2379,23 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
 
-static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		__ttwu_queue_remote(p, cpu, wake_flags);
-		return true;
+	if (sched_feat(TTWU_QUEUE)) {
+		/*
+		 * If CPU does not share cache then queue the task on the remote
+		 * rqs wakelist to avoid accessing remote data. Alternatively,
+		 * if the task is descheduling and the only running task on the
+		 * CPU then use the wakelist to offload the task activation to
+		 * the CPU that will soon be idle so the waker can continue.
+		 * nr_running is checked to avoid unnecessary task stacking.
+		 */
+		if (!cpus_share_cache(smp_processor_id(), cpu) ||
+		    ((wake_flags & WF_ON_RQ) && cpu_rq(cpu)->nr_running <= 1)) {
+			sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+			__ttwu_queue_wakelist(p, cpu, wake_flags);
+			return true;
+		}
 	}
 
 	return false;
@@ -2391,7 +2408,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (ttwu_queue_remote(p, cpu, wake_flags))
+	if (ttwu_queue_wakelist(p, cpu, wake_flags))
 		return;
 #endif
 
@@ -2611,7 +2628,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * let the waker make forward progress. This is safe because IRQs are
 	 * disabled and the IPI will deliver after on_cpu is cleared.
 	 */
-	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+	if (READ_ONCE(p->on_cpu) && ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_RQ))
 		goto unlock;
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..06297d1142a0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1688,7 +1688,8 @@ static inline int task_on_rq_migrating(struct task_struct *p)
  */
 #define WF_SYNC 0x01 /* Waker goes to sleep after wakeup */
 #define WF_FORK 0x02 /* Child wakeup after fork */
-#define WF_MIGRATED 0x4 /* Internal use, task got migrated */
+#define WF_MIGRATED 0x04 /* Internal use, task got migrated */
+#define WF_ON_RQ 0x08 /* Wakee is on_rq */
 
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution
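
For anyone who wants to poke at the new condition outside the kernel, the
decision made by ttwu_queue_wakelist() in the patch above can be modelled
roughly as below. This is only a userspace sketch under stated assumptions:
struct wake_ctx, should_use_wakelist() and wakelist_selfcheck() are
hypothetical names that do not exist in the kernel, and the three fields
merely stand in for sched_feat(TTWU_QUEUE), cpus_share_cache() and
cpu_rq(cpu)->nr_running. Only the flag values are taken from the sched.h hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in the sched.h hunk of the patch. */
#define WF_SYNC		0x01	/* Waker goes to sleep after wakeup */
#define WF_FORK		0x02	/* Child wakeup after fork */
#define WF_MIGRATED	0x04	/* Internal use, task got migrated */
#define WF_ON_RQ	0x08	/* Wakee is on_rq */

/*
 * Hypothetical snapshot of the inputs the kernel code reads.
 * This is an illustration only, not a kernel data structure.
 */
struct wake_ctx {
	bool ttwu_queue_feat;		/* models sched_feat(TTWU_QUEUE) */
	bool share_cache;		/* models cpus_share_cache(this_cpu, cpu) */
	unsigned int nr_running;	/* models cpu_rq(cpu)->nr_running */
};

/*
 * Returns true when the wakeup would be queued on the target CPU's
 * wakelist, mirroring the condition in the patch:
 *  - the feature must be enabled, and
 *  - either the CPUs do not share cache, or the wakee is still
 *    descheduling (WF_ON_RQ) and is the only task on that CPU.
 */
static bool should_use_wakelist(const struct wake_ctx *c, int wake_flags)
{
	if (!c->ttwu_queue_feat)
		return false;

	if (!c->share_cache)
		return true;

	return (wake_flags & WF_ON_RQ) && c->nr_running <= 1;
}

/* Walks through the cases discussed in the thread. */
void wakelist_selfcheck(void)
{
	struct wake_ctx c = {
		.ttwu_queue_feat = true,
		.share_cache = true,
		.nr_running = 1,
	};

	/* Shared cache, wakee descheduling, CPU about to go idle: queue. */
	assert(should_use_wakelist(&c, WF_ON_RQ));

	/* Shared cache without WF_ON_RQ: direct wakeup as before. */
	assert(!should_use_wakelist(&c, WF_MIGRATED));

	/* Other tasks are runnable there: avoid stacking, do not queue. */
	c.nr_running = 2;
	assert(!should_use_wakelist(&c, WF_ON_RQ));

	/* No shared cache: queue regardless of the flag. */
	c.share_cache = false;
	assert(should_use_wakelist(&c, 0));
}
```

Calling wakelist_selfcheck() from a small main() exercises the four cases.
The model ignores the IPI and clock sync entirely, so it says nothing about
the cost of either path, only about which one is picked.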