From: Vincent Guittot <vincent.guittot@linaro.org>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Phil Auld <pauld@redhat.com>,
Valentin Schneider <valentin.schneider@arm.com>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
Quentin Perret <quentin.perret@arm.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Morten Rasmussen <Morten.Rasmussen@arm.com>,
Hillf Danton <hdanton@sina.com>, Parth Shah <parth@linux.ibm.com>,
Rik van Riel <riel@surriel.com>
Subject: Re: [PATCH v4 04/11] sched/fair: rework load_balance
Date: Fri, 8 Nov 2019 17:35:01 +0100
Message-ID: <20191108163501.GA26528@linaro.org>
In-Reply-To: <20191031114020.GQ3016@techsingularity.net>
On Thursday 31 Oct 2019 at 11:40:20 (+0000), Mel Gorman wrote:
> On Thu, Oct 31, 2019 at 12:13:09PM +0100, Vincent Guittot wrote:
> > > > > On the last one, spreading tasks evenly across NUMA domains is not
> > > > > necessarily a good idea. If I have 2 tasks running on a 2-socket machine
> > > > > with 24 logical CPUs per socket, it should not automatically mean that
> > > > > one task should move cross-node and I have definitely observed this
> > > > > happening. It's probably bad in terms of locality no matter what but it's
> > > > > especially bad if the 2 tasks happened to be communicating because then
> > > > > load balancing will pull apart the tasks while wake_affine will push
> > > > > them together (and potentially NUMA balancing as well). Note that this
> > > > > also applies for some IO workloads because, depending on the filesystem,
> > > > > the task may be communicating with workqueues (XFS) or a kernel thread
> > > > > (ext4 with jbd2).
> > > >
> > > > This rework doesn't touch the NUMA_BALANCING part and NUMA balancing
> > > > still gives guidance with fbq_classify_group/queue.
> > >
> > > I know the NUMA_BALANCING part is not touched, I'm talking about load
> > > balancing across SD_NUMA domains which happens independently of
> > > NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to
> > > override the load balancer when it moves tasks away from the preferred
> > > node.
> >
> > Yes, this patchset relies on this override for now to prevent moving tasks away.
>
> Fair enough, netperf hits the corner case where it does not work but
> that is also true without your series.
I ran the mmtests netperf tests on my setup. The results are a mix of small
positive and negative differences (see below).

netperf-udp
                               5.3-rc2              5.3-rc2
                                   tip             +rwk+fix
Hmean  send-64        95.06 (   0.00%)     94.12 *  -0.99%*
Hmean  send-128      191.71 (   0.00%)    189.94 *  -0.93%*
Hmean  send-256      379.05 (   0.00%)    370.96 *  -2.14%*
Hmean  send-1024    1485.24 (   0.00%)   1476.64 *  -0.58%*
Hmean  send-2048    2894.80 (   0.00%)   2887.00 *  -0.27%*
Hmean  send-3312    4580.27 (   0.00%)   4555.91 *  -0.53%*
Hmean  send-4096    5592.99 (   0.00%)   5517.31 *  -1.35%*
Hmean  send-8192    9117.00 (   0.00%)   9497.06 *   4.17%*
Hmean  send-16384  15824.59 (   0.00%)  15824.30 *  -0.00%*
Hmean  recv-64        95.06 (   0.00%)     94.08 *  -1.04%*
Hmean  recv-128      191.68 (   0.00%)    189.89 *  -0.93%*
Hmean  recv-256      378.94 (   0.00%)    370.87 *  -2.13%*
Hmean  recv-1024    1485.24 (   0.00%)   1476.20 *  -0.61%*
Hmean  recv-2048    2893.52 (   0.00%)   2885.25 *  -0.29%*
Hmean  recv-3312    4580.27 (   0.00%)   4553.48 *  -0.58%*
Hmean  recv-4096    5592.99 (   0.00%)   5517.27 *  -1.35%*
Hmean  recv-8192    9115.69 (   0.00%)   9495.69 *   4.17%*
Hmean  recv-16384  15824.36 (   0.00%)  15818.36 *  -0.04%*
Stddev send-64         0.15 (   0.00%)      1.17 (-688.29%)
Stddev send-128        1.56 (   0.00%)      1.15 (  25.96%)
Stddev send-256        4.20 (   0.00%)      5.27 ( -25.63%)
Stddev send-1024      20.11 (   0.00%)      5.68 (  71.74%)
Stddev send-2048      11.06 (   0.00%)     21.74 ( -96.50%)
Stddev send-3312      61.10 (   0.00%)     48.03 (  21.39%)
Stddev send-4096      71.84 (   0.00%)     31.99 (  55.46%)
Stddev send-8192     165.14 (   0.00%)    159.99 (   3.12%)
Stddev send-16384     81.30 (   0.00%)    188.65 (-132.05%)
Stddev recv-64         0.15 (   0.00%)      1.15 (-673.42%)
Stddev recv-128        1.58 (   0.00%)      1.14 (  28.27%)
Stddev recv-256        4.29 (   0.00%)      5.19 ( -21.05%)
Stddev recv-1024      20.11 (   0.00%)      5.70 (  71.67%)
Stddev recv-2048      10.43 (   0.00%)     21.41 (-105.22%)
Stddev recv-3312      61.10 (   0.00%)     46.92 (  23.20%)
Stddev recv-4096      71.84 (   0.00%)     31.97 (  55.50%)
Stddev recv-8192     163.90 (   0.00%)    160.88 (   1.84%)
Stddev recv-16384     81.41 (   0.00%)    187.01 (-129.71%)

                    5.3-rc2     5.3-rc2
                        tip    +rwk+fix
Duration User         38.90       39.13
Duration System     1311.29     1311.10
Duration Elapsed    1892.82     1892.86

netperf-tcp
                               5.3-rc2              5.3-rc2
                                   tip             +rwk+fix
Hmean  64            871.30 (   0.00%)    860.90 *  -1.19%*
Hmean  128          1689.39 (   0.00%)   1679.31 *  -0.60%*
Hmean  256          3199.59 (   0.00%)   3241.98 *   1.32%*
Hmean  1024         9390.47 (   0.00%)   9268.47 *  -1.30%*
Hmean  2048        13373.95 (   0.00%)  13395.61 *   0.16%*
Hmean  3312        16701.30 (   0.00%)  17165.96 *   2.78%*
Hmean  4096        15831.03 (   0.00%)  15544.66 *  -1.81%*
Hmean  8192        19720.01 (   0.00%)  20188.60 *   2.38%*
Hmean  16384       23925.90 (   0.00%)  23914.50 *  -0.05%*
Stddev 64              7.38 (   0.00%)      4.23 (  42.67%)
Stddev 128            11.62 (   0.00%)     10.13 (  12.85%)
Stddev 256            34.33 (   0.00%)      7.94 (  76.88%)
Stddev 1024           35.61 (   0.00%)    116.34 (-226.66%)
Stddev 2048          285.30 (   0.00%)     80.50 (  71.78%)
Stddev 3312          304.74 (   0.00%)    449.08 ( -47.36%)
Stddev 4096          668.11 (   0.00%)    569.30 (  14.79%)
Stddev 8192          733.23 (   0.00%)    944.38 ( -28.80%)
Stddev 16384         553.03 (   0.00%)    299.44 (  45.86%)

                    5.3-rc2     5.3-rc2
                        tip    +rwk+fix
Duration User        138.05      140.95
Duration System     1210.60     1208.45
Duration Elapsed    1352.86     1352.90
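
For anyone who wants to re-derive the percentage columns, here is a minimal
sketch. It assumes that Hmean is the harmonic mean of the per-iteration
throughputs, that percentages are relative to the tip baseline, and that the
sign is flipped for Stddev so a positive value means less variation. These
formulas are inferred from the numbers in the tables, not taken from the
mmtests sources, and all helper names are hypothetical. Because the tables
print rounded values, recomputing from them can differ from the printed
percentages in the last digits.

```c
#include <assert.h>
#include <stddef.h>

/* Harmonic mean of n strictly positive samples (assumed meaning of "Hmean"). */
static double hmean(const double *x, size_t n)
{
	double inv_sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		assert(x[i] > 0.0);
		inv_sum += 1.0 / x[i];
	}
	return (double)n / inv_sum;
}

/* Relative gain in percent for a higher-is-better metric such as Hmean. */
static double gain_pct(double base, double test)
{
	assert(base != 0.0);
	return (test - base) / base * 100.0;
}

/* Relative reduction in percent for a lower-is-better metric such as Stddev. */
static double reduction_pct(double base, double test)
{
	assert(base != 0.0);
	return (base - test) / base * 100.0;
}
```

For instance, gain_pct(95.06, 94.12) evaluates to roughly -0.99, which matches
the netperf-udp send-64 row.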
>
> > I agree that additional patches are probably needed to improve load
> > balance at NUMA level and I expect that this rework will make it
> > simpler to add.
> > I just wanted to get the output of some real use cases before defining
> > more NUMA-level specific conditions. Some want to spread across their NUMA
> > nodes but others want to keep everything together. The preferred node
> > and fbq_classify_group were the only sensible metrics to me when I
> > wrote this patchset but changes can be added if they make sense.
> >
>
> That's fair. While it was possible to address the case before your
> series, it was a hatchet job. If the changelog simply notes that some
> special casing may still be required for SD_NUMA but it's outside the
> scope of the series, then I'd be happy. At least there is a good chance
> then if there is follow-up work that it won't be interpreted as an
> attempt to reintroduce hacky heuristics.
>
Would the additional comments below make sense to you regarding the work that
remains to be done for SD_NUMA?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0ad4b21..7e4cb65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6960,11 +6960,34 @@ enum fbq_type { regular, remote, all };
* group. see update_sd_pick_busiest().
*/
enum group_type {
+ /*
+ * The group has spare capacity that can be used to process more work.
+ */
group_has_spare = 0,
+ /*
+ * The group is fully used and the tasks don't compete for more CPU
+ * cycles. Nevertheless, some tasks might wait before running.
+ */
group_fully_busy,
+ /*
+ * One task doesn't fit with CPU's capacity and must be migrated to a
+ * more powerful CPU.
+ */
group_misfit_task,
+ /*
+ * One local CPU with higher capacity is available and the task should be
+ * migrated to it instead of the current CPU.
+ */
group_asym_packing,
+ /*
+ * The tasks' affinity prevents the scheduler from balancing the load
+ * across the system.
+ */
group_imbalanced,
+ /*
+ * The CPU is overloaded and can't provide expected CPU cycles to all
+ * tasks.
+ */
group_overloaded
};
@@ -8563,7 +8586,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
/*
* Try to use spare capacity of local group without overloading it or
- * emptying busiest
+ * emptying busiest.
+ * XXX Spreading tasks across NUMA nodes is not always the best policy
+ * and special care should be taken for SD_NUMA domain level before
+ * spreading the tasks. For now, load_balance() fully relies on
+ * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
*/
if (local->group_type == group_has_spare) {
if (busiest->group_type > group_fully_busy) {
--
2.7.4
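
To make the role of the enum ordering explicit, here is a small userspace
sketch (not kernel code). It copies the enumerator order from the hunk above
and mirrors only the two nested conditions visible in calculate_imbalance();
struct toy_group_stats and toy_spare_capacity_case() are hypothetical names
used for illustration, and what the kernel does inside that branch is
deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Same order as in the patch: the value grows with how busy the group is. */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_asym_packing,
	group_imbalanced,
	group_overloaded
};

/* Hypothetical, heavily reduced stand-in for the per-group statistics. */
struct toy_group_stats {
	enum group_type group_type;
};

/*
 * True when the local group has spare capacity and the busiest group is
 * classified above group_fully_busy. The ">" comparison only works because
 * the enumerators are declared in increasing order of busyness.
 */
static bool toy_spare_capacity_case(const struct toy_group_stats *local,
				    const struct toy_group_stats *busiest)
{
	assert(local && busiest);
	return local->group_type == group_has_spare &&
	       busiest->group_type > group_fully_busy;
}
```

Reordering the enumerators would silently change the result of such
comparisons, which is one reason to document each value next to its
declaration.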
>
> > >
> > > > But the latter could also take advantage of the new type of group. For
> > > > example, what I did in the fix for find_idlest_group : checking
> > > > numa_preferred_nid when the group has capacity and keep the task on
> > > > preferred node if possible. Similar behavior could also be beneficial
> > > > in periodic load_balance case.
> > > >
> > >
> > > And this is the catch -- numa_preferred_nid is not guaranteed to be set at
> > > all. NUMA balancing might be disabled, the task may not have been running
> > > long enough to pick a preferred NID or NUMA balancing might be unable to
> > > pick a preferred NID. The decision to avoid unnecessary migrations across
> > > NUMA domains should be made independently of NUMA balancing. The netperf
> > > configuration from mmtests is great at illustrating the point because it'll
> > > also say what the average local/remote access ratio is. 2 communicating
> > > tasks running on an otherwise idle NUMA machine should not have the load
> > > balancer move the server to one node and the client to another.
> >
> > I'm going to give it a try on my setup to see the results
> >
>
> Thanks.
>
> --
> Mel Gorman
> SUSE Labs
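
As a rough illustration of the point quoted above that numa_preferred_nid may
not be set, here is a userspace sketch (not kernel code) of why a hint based on
the preferred node cannot be the only guard against cross-node migrations.
TOY_NO_NODE is an assumed sentinel standing in for the kernel's "no node"
value, and enum toy_hint and toy_preferred_node_hint() are hypothetical names.

```c
#include <assert.h>

/* Assumed sentinel for "no preferred node recorded". */
#define TOY_NO_NODE (-1)

enum toy_hint {
	TOY_HINT_NONE,	/* no preferred node: the hint gives no guidance */
	TOY_HINT_STAY,	/* the move would leave the preferred node */
	TOY_HINT_ALLOW	/* the move does not leave the preferred node */
};

/* Classify a proposed move from src_nid to dst_nid for one task. */
static enum toy_hint toy_preferred_node_hint(int preferred_nid,
					     int src_nid, int dst_nid)
{
	assert(src_nid >= 0 && dst_nid >= 0);

	if (preferred_nid == TOY_NO_NODE)
		return TOY_HINT_NONE;
	if (src_nid == preferred_nid && dst_nid != preferred_nid)
		return TOY_HINT_STAY;
	return TOY_HINT_ALLOW;
}
```

Whenever the result is TOY_HINT_NONE, for example because NUMA balancing is
disabled or the task is too young, some other criterion has to decide whether
the cross-node move is worthwhile.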