From: Phil Auld <pauld@redhat.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>,
Mel Gorman <mgorman@techsingularity.net>,
linux-kernel <linux-kernel@vger.kernel.org>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Valentin Schneider <valentin.schneider@arm.com>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
Quentin Perret <quentin.perret@arm.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Morten Rasmussen <Morten.Rasmussen@arm.com>,
Hillf Danton <hdanton@sina.com>, Parth Shah <parth@linux.ibm.com>,
Rik van Riel <riel@surriel.com>
Subject: Re: [PATCH v4 00/10] sched/fair: rework the CFS load balance
Date: Thu, 31 Oct 2019 09:57:15 -0400 [thread overview]
Message-ID: <20191031135715.GA5738@pauld.bos.csb> (raw)
In-Reply-To: <CAKfTPtCR93MBPjKhMSyMZJTqVS7YWBPCnk3DmSEq2Q0MVxm1ug@mail.gmail.com>

Hi Vincent,

On Wed, Oct 30, 2019 at 06:25:49PM +0100 Vincent Guittot wrote:
> On Wed, 30 Oct 2019 at 15:39, Phil Auld <pauld@redhat.com> wrote:
> > > That fact that the 4 nodes works well but not the 8 nodes is a bit
> > > surprising except if this means more NUMA level in the sched_domain
> > > topology
> > > Could you give us more details about the sched domain topology ?
> > >
> >
> > The 8-node system has 5 sched domain levels. The 4-node system only
> > has 3.
>
> That's an interesting difference. and your additional tests on a 8
> nodes with 3 level tends to confirm that the number of level make a
> difference
> I need to study a bit more how this can impact the spread of tasks

So I think I understand what my numbers have been showing.
I believe the numa balancing is causing problems.

Here are numbers from the test on 5.4-rc3+ without your series:

echo 1 > /proc/sys/kernel/numa_balancing

lu.C.x_156_GROUP_1 Average 10.87 0.00 0.00 11.49 36.69 34.26 30.59 32.10
lu.C.x_156_GROUP_2 Average 20.15 16.32 9.49 24.91 21.07 20.93 21.63 21.50
lu.C.x_156_GROUP_3 Average 21.27 17.23 11.84 21.80 20.91 20.68 21.11 21.16
lu.C.x_156_GROUP_4 Average 19.44 6.53 8.71 19.72 22.95 23.16 28.85 26.64
lu.C.x_156_GROUP_5 Average 20.59 6.20 11.32 14.63 28.73 30.36 22.20 21.98
lu.C.x_156_NORMAL_1 Average 20.50 19.95 20.40 20.45 18.75 19.35 18.25 18.35
lu.C.x_156_NORMAL_2 Average 17.15 19.04 18.42 18.69 21.35 21.42 20.00 19.92
lu.C.x_156_NORMAL_3 Average 18.00 18.15 17.55 17.60 18.90 18.40 19.90 19.75
lu.C.x_156_NORMAL_4 Average 20.53 20.05 20.21 19.11 19.00 19.47 19.37 18.26
lu.C.x_156_NORMAL_5 Average 18.72 18.78 19.72 18.50 19.67 19.72 21.11 19.78
============156_GROUP========Mop/s===================================
min q1 median q3 max
1564.63 3003.87 3928.23 5411.13 8386.66
============156_GROUP========time====================================
min q1 median q3 max
243.12 376.82 519.06 678.79 1303.18
============156_NORMAL========Mop/s===================================
min q1 median q3 max
13845.6 18013.8 18545.5 19359.9 19647.4
============156_NORMAL========time====================================
min q1 median q3 max
103.78 105.32 109.95 113.19 147.27

(This one above is especially bad... we don't usually see 0.00s, but overall it's
basically on par. It's reflected in the spread of the results.)

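For reference, the five-number summaries in these tables can be generated from the
per-run samples with something along these lines (a sketch; the perf team's tooling
and its exact quantile method may differ slightly):

```python
# Sketch: five-number summary (min, q1, median, q3, max) as shown in the
# tables above. statistics.quantiles with n=4 yields the three quartiles;
# "inclusive" treats the data as the whole population.
from statistics import quantiles

def five_number(samples):
    q1, med, q3 = quantiles(samples, n=4, method="inclusive")
    return min(samples), q1, med, q3, max(samples)

# Illustrative per-run Mop/s values; a real run produces many samples.
runs = [14956.3, 16346.5, 17505.7, 18440.6, 22492.7]
print("min      q1       median   q3       max")
print("  ".join(f"{v:7.1f}" for v in five_number(runs)))
```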

echo 0 > /proc/sys/kernel/numa_balancing

lu.C.x_156_GROUP_1 Average 17.75 19.30 21.20 21.20 20.20 20.80 18.90 16.65
lu.C.x_156_GROUP_2 Average 18.38 19.25 21.00 20.06 20.19 20.31 19.56 17.25
lu.C.x_156_GROUP_3 Average 21.81 21.00 18.38 16.86 20.81 21.48 18.24 17.43
lu.C.x_156_GROUP_4 Average 20.48 20.96 19.61 17.61 17.57 19.74 18.48 21.57
lu.C.x_156_GROUP_5 Average 23.32 21.96 19.16 14.28 21.44 22.56 17.00 16.28
lu.C.x_156_NORMAL_1 Average 19.50 19.83 19.58 19.25 19.58 19.42 19.42 19.42
lu.C.x_156_NORMAL_2 Average 18.90 18.40 20.00 19.80 19.70 19.30 19.80 20.10
lu.C.x_156_NORMAL_3 Average 19.45 19.09 19.91 20.09 19.45 18.73 19.45 19.82
lu.C.x_156_NORMAL_4 Average 19.64 19.27 19.64 19.00 19.82 19.55 19.73 19.36
lu.C.x_156_NORMAL_5 Average 18.75 19.42 20.08 19.67 18.75 19.50 19.92 19.92
============156_GROUP========Mop/s===================================
min q1 median q3 max
14956.3 16346.5 17505.7 18440.6 22492.7
============156_GROUP========time====================================
min q1 median q3 max
90.65 110.57 116.48 124.74 136.33
============156_NORMAL========Mop/s===================================
min q1 median q3 max
29801.3 30739.2 31967.5 32151.3 34036
============156_NORMAL========time====================================
min q1 median q3 max
59.91 63.42 63.78 66.33 68.42

Note there is a significant improvement already. But we are seeing imbalance due to
using weighted load and averages. In this case it's only a 55% slowdown rather than
the 5x. But the overall performance of the benchmark is also much better in both cases.
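Roughly, those figures fall out of the median Mop/s above (one plausible reading of
the numbers; the exact statistic isn't pinned down here):

```python
# numa_balancing=0, without the series: GROUP vs NORMAL median Mop/s.
group, normal = 17505.7, 31967.5
print(f"GROUP achieves {group / normal:.0%} of NORMAL throughput")  # ~55%

# numa_balancing=1, without the series: GROUP collapses to roughly 1/5.
group_nb1, normal_nb1 = 3928.23, 18545.5
print(f"GROUP is {normal_nb1 / group_nb1:.1f}x slower than NORMAL")  # ~5x
```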

Here's the same test, same system, with the full series (lb_v4a as I've been calling it):

echo 1 > /proc/sys/kernel/numa_balancing

lu.C.x_156_GROUP_1 Average 18.59 19.36 19.50 18.86 20.41 20.59 18.27 20.41
lu.C.x_156_GROUP_2 Average 19.52 20.52 20.48 21.17 19.52 19.09 17.70 18.00
lu.C.x_156_GROUP_3 Average 20.58 20.71 20.17 20.50 18.46 19.50 18.58 17.50
lu.C.x_156_GROUP_4 Average 18.95 19.63 19.47 19.84 18.79 19.84 20.84 18.63
lu.C.x_156_GROUP_5 Average 16.85 17.96 19.89 19.15 19.26 20.48 21.70 20.70
lu.C.x_156_NORMAL_1 Average 18.04 18.48 20.00 19.72 20.72 20.48 18.48 20.08
lu.C.x_156_NORMAL_2 Average 18.22 20.56 19.50 19.39 20.67 19.83 18.44 19.39
lu.C.x_156_NORMAL_3 Average 17.72 19.61 19.56 19.17 20.17 19.89 20.78 19.11
lu.C.x_156_NORMAL_4 Average 18.05 19.74 20.21 19.89 20.32 20.26 19.16 18.37
lu.C.x_156_NORMAL_5 Average 18.89 19.95 20.21 20.63 19.84 19.26 19.26 17.95
============156_GROUP========Mop/s===================================
min q1 median q3 max
13460.1 14949 15851.7 16391.4 18993
============156_GROUP========time====================================
min q1 median q3 max
107.35 124.39 128.63 136.4 151.48
============156_NORMAL========Mop/s===================================
min q1 median q3 max
14418.5 18512.4 19049.5 19682 19808.8
============156_NORMAL========time====================================
min q1 median q3 max
102.93 103.6 107.04 110.14 141.42

echo 0 > /proc/sys/kernel/numa_balancing

lu.C.x_156_GROUP_1 Average 19.00 19.33 19.33 19.58 20.08 19.67 19.83 19.17
lu.C.x_156_GROUP_2 Average 18.55 19.91 20.09 19.27 18.82 19.27 19.91 20.18
lu.C.x_156_GROUP_3 Average 18.42 19.08 19.75 19.00 19.50 20.08 20.25 19.92
lu.C.x_156_GROUP_4 Average 18.42 19.83 19.17 19.50 19.58 19.83 19.83 19.83
lu.C.x_156_GROUP_5 Average 19.17 19.42 20.17 19.92 19.25 18.58 19.92 19.58
lu.C.x_156_NORMAL_1 Average 19.25 19.50 19.92 18.92 19.33 19.75 19.58 19.75
lu.C.x_156_NORMAL_2 Average 19.42 19.25 17.83 18.17 19.83 20.50 20.42 20.58
lu.C.x_156_NORMAL_3 Average 18.58 19.33 19.75 18.25 19.42 20.25 20.08 20.33
lu.C.x_156_NORMAL_4 Average 19.00 19.55 19.73 18.73 19.55 20.00 19.64 19.82
lu.C.x_156_NORMAL_5 Average 19.25 19.25 19.50 18.75 19.92 19.58 19.92 19.83
============156_GROUP========Mop/s===================================
min q1 median q3 max
28520.1 29024.2 29042.1 29367.4 31235.2
============156_GROUP========time====================================
min q1 median q3 max
65.28 69.43 70.21 70.25 71.49
============156_NORMAL========Mop/s===================================
min q1 median q3 max
28974.5 29806.5 30237.1 30907.4 31830.1
============156_NORMAL========time====================================
min q1 median q3 max
64.06 65.97 67.43 68.41 70.37

This all now makes sense. Looking at the numa balancing code a bit, you can see
that it still uses load, so it will still be subject to making bogus decisions
based on the weighted load. In this case it has been actively working against the
load balancer because of that.

I think with the three numa levels on this system the numa balancing was able to
win more often. We don't see the same level of this result on systems with only
one SD_NUMA level.
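For anyone wanting to check their own box: with CONFIG_SCHED_DEBUG the per-CPU
domain names are exported (via procfs on kernels of this vintage; later kernels
moved them to debugfs). A small sketch, assuming one of those two paths exists:

```python
# Sketch: count the sched domain levels for a CPU by reading the exported
# topology. Paths are kernel-version dependent; returns [] if neither the
# procfs nor the debugfs location is available on this kernel/config.
from glob import glob
from pathlib import Path

def sched_domain_levels(cpu=0):
    names = []
    for base in (f"/proc/sys/kernel/sched_domain/cpu{cpu}",
                 f"/sys/kernel/debug/sched/domains/cpu{cpu}"):
        for d in sorted(glob(base + "/domain*")):
            name_file = Path(d) / "name"
            if name_file.exists():
                names.append(name_file.read_text().strip())
    return names

levels = sched_domain_levels()
print(f"{len(levels)} levels: {' '.join(levels) or '(not exported)'}")
```

On the 8-node system discussed here this would report 5 levels; on the 4-node
system, 3.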

Following the other part of this thread, I have to add that I'm of the opinion
that the weighted load (which is all we have now, I believe) really should be used
only in extreme cases of overload to deal with fairness. And even then maybe not.

As far as I can see, once the fair group scheduling is involved, that load is
basically a random number between 1 and 1024. It really has no bearing on how
much "load" a task will put on a cpu. Any comparison of that to cpu capacity
is pretty meaningless.
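As a toy illustration (this is not the kernel's actual task_h_load(), which walks
the full cfs_rq hierarchy using PELT averages; the names and the single-level
scaling here are simplifications), scaling a task's weight by its group's share
shows how two equally busy tasks can report wildly different "load":

```python
NICE_0_LOAD = 1024  # weight of a nice-0 task

def h_load(task_weight, group_shares, group_total_weight):
    """Toy one-level model of hierarchical load: the task gets its
    proportional slice of the group's shares, regardless of how much
    CPU it actually uses."""
    return group_shares * task_weight // group_total_weight

# A nice-0 task alone in a default group (shares=1024): load looks like 1024.
print(h_load(NICE_0_LOAD, 1024, NICE_0_LOAD))        # -> 1024
# The same task sharing its group with 15 siblings: load looks like 64.
print(h_load(NICE_0_LOAD, 1024, 16 * NICE_0_LOAD))   # -> 64
# Both tasks may be 100% CPU-bound, yet their "load" differs 16x -- so
# comparing these numbers against cpu capacity tells you very little.
```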

I'm sure there are workloads for which the numa balancing is more important. But
even then I suspect it is making the wrong decisions more often than not. I think
a similar rework may be needed :)

I've asked our perf team to try the full battery of tests with numa balancing
disabled to see what it shows across the board.


Good job on this, and thanks for the time looking at my specific issues.

As far as this series is concerned, and as far as it matters:

Acked-by: Phil Auld <pauld@redhat.com>

Cheers,
Phil

--