From: Phil Auld <pauld@redhat.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>,
Mel Gorman <mgorman@techsingularity.net>,
linux-kernel <linux-kernel@vger.kernel.org>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Valentin Schneider <valentin.schneider@arm.com>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
Quentin Perret <quentin.perret@arm.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Morten Rasmussen <Morten.Rasmussen@arm.com>,
Hillf Danton <hdanton@sina.com>, Parth Shah <parth@linux.ibm.com>,
Rik van Riel <riel@surriel.com>
Subject: Re: [PATCH v4 00/10] sched/fair: rework the CFS load balance
Date: Thu, 31 Oct 2019 09:57:15 -0400 [thread overview]
Message-ID: <20191031135715.GA5738@pauld.bos.csb> (raw)
In-Reply-To: <CAKfTPtCR93MBPjKhMSyMZJTqVS7YWBPCnk3DmSEq2Q0MVxm1ug@mail.gmail.com>

Hi Vincent,

On Wed, Oct 30, 2019 at 06:25:49PM +0100 Vincent Guittot wrote:
> On Wed, 30 Oct 2019 at 15:39, Phil Auld <pauld@redhat.com> wrote:
> > > That fact that the 4 nodes works well but not the 8 nodes is a bit
> > > surprising except if this means more NUMA level in the sched_domain
> > > topology
> > > Could you give us more details about the sched domain topology ?
> > >
> >
> > The 8-node system has 5 sched domain levels. The 4-node system only
> > has 3.
>
> That's an interesting difference. and your additional tests on a 8
> nodes with 3 level tends to confirm that the number of level make a
> difference
> I need to study a bit more how this can impact the spread of tasks

So I think I understand what my numbers have been showing.
I believe the numa balancing is causing problems.

Here are numbers from the test on 5.4-rc3+ without your series:

echo 1 > /proc/sys/kernel/numa_balancing

lu.C.x_156_GROUP_1 Average 10.87 0.00 0.00 11.49 36.69 34.26 30.59 32.10
lu.C.x_156_GROUP_2 Average 20.15 16.32 9.49 24.91 21.07 20.93 21.63 21.50
lu.C.x_156_GROUP_3 Average 21.27 17.23 11.84 21.80 20.91 20.68 21.11 21.16
lu.C.x_156_GROUP_4 Average 19.44 6.53 8.71 19.72 22.95 23.16 28.85 26.64
lu.C.x_156_GROUP_5 Average 20.59 6.20 11.32 14.63 28.73 30.36 22.20 21.98
lu.C.x_156_NORMAL_1 Average 20.50 19.95 20.40 20.45 18.75 19.35 18.25 18.35
lu.C.x_156_NORMAL_2 Average 17.15 19.04 18.42 18.69 21.35 21.42 20.00 19.92
lu.C.x_156_NORMAL_3 Average 18.00 18.15 17.55 17.60 18.90 18.40 19.90 19.75
lu.C.x_156_NORMAL_4 Average 20.53 20.05 20.21 19.11 19.00 19.47 19.37 18.26
lu.C.x_156_NORMAL_5 Average 18.72 18.78 19.72 18.50 19.67 19.72 21.11 19.78
============156_GROUP========Mop/s===================================
min q1 median q3 max
1564.63 3003.87 3928.23 5411.13 8386.66
============156_GROUP========time====================================
min q1 median q3 max
243.12 376.82 519.06 678.79 1303.18
============156_NORMAL========Mop/s===================================
min q1 median q3 max
13845.6 18013.8 18545.5 19359.9 19647.4
============156_NORMAL========time====================================
min q1 median q3 max
103.78 105.32 109.95 113.19 147.27

(This one above is especially bad... we don't usually see 0.00s, but overall it's
basically on par. It's reflected in the spread of the results.)

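For reference, the five-number summaries in these tables can be generated from the
per-run samples with something along these lines (a sketch; the perf team's tooling
and its exact quantile method may differ slightly):

```python
# Sketch: five-number summary (min, q1, median, q3, max) as shown in the
# tables above. statistics.quantiles with n=4 yields the three quartiles;
# "inclusive" treats the data as the whole population.
from statistics import quantiles

def five_number(samples):
    q1, med, q3 = quantiles(samples, n=4, method="inclusive")
    return min(samples), q1, med, q3, max(samples)

# Illustrative per-run Mop/s values; a real run produces many samples.
runs = [14956.3, 16346.5, 17505.7, 18440.6, 22492.7]
print("min      q1       median   q3       max")
print("  ".join(f"{v:7.1f}" for v in five_number(runs)))
```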

echo 0 > /proc/sys/kernel/numa_balancing

lu.C.x_156_GROUP_1 Average 17.75 19.30 21.20 21.20 20.20 20.80 18.90 16.65
lu.C.x_156_GROUP_2 Average 18.38 19.25 21.00 20.06 20.19 20.31 19.56 17.25
lu.C.x_156_GROUP_3 Average 21.81 21.00 18.38 16.86 20.81 21.48 18.24 17.43
lu.C.x_156_GROUP_4 Average 20.48 20.96 19.61 17.61 17.57 19.74 18.48 21.57
lu.C.x_156_GROUP_5 Average 23.32 21.96 19.16 14.28 21.44 22.56 17.00 16.28
lu.C.x_156_NORMAL_1 Average 19.50 19.83 19.58 19.25 19.58 19.42 19.42 19.42
lu.C.x_156_NORMAL_2 Average 18.90 18.40 20.00 19.80 19.70 19.30 19.80 20.10
lu.C.x_156_NORMAL_3 Average 19.45 19.09 19.91 20.09 19.45 18.73 19.45 19.82
lu.C.x_156_NORMAL_4 Average 19.64 19.27 19.64 19.00 19.82 19.55 19.73 19.36
lu.C.x_156_NORMAL_5 Average 18.75 19.42 20.08 19.67 18.75 19.50 19.92 19.92
============156_GROUP========Mop/s===================================
min q1 median q3 max
14956.3 16346.5 17505.7 18440.6 22492.7
============156_GROUP========time====================================
min q1 median q3 max
90.65 110.57 116.48 124.74 136.33
============156_NORMAL========Mop/s===================================
min q1 median q3 max
29801.3 30739.2 31967.5 32151.3 34036
============156_NORMAL========time====================================
min q1 median q3 max
59.91 63.42 63.78 66.33 68.42

Note there is a significant improvement already. But we are seeing imbalance due to
using weighted load and averages. In this case it's only a 55% slowdown rather than
the 5x. But the overall performance of the benchmark is also much better in both cases.
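Roughly, those figures fall out of the median Mop/s above (one plausible reading of
the numbers; the exact statistic isn't pinned down here):

```python
# numa_balancing=0, without the series: GROUP vs NORMAL median Mop/s.
group, normal = 17505.7, 31967.5
print(f"GROUP achieves {group / normal:.0%} of NORMAL throughput")  # ~55%

# numa_balancing=1, without the series: GROUP collapses to roughly 1/5.
group_nb1, normal_nb1 = 3928.23, 18545.5
print(f"GROUP is {normal_nb1 / group_nb1:.1f}x slower than NORMAL")  # ~5x
```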

Here's the same test, same system, with the full series (lb_v4a as I've been calling it):

echo 1 > /proc/sys/kernel/numa_balancing

lu.C.x_156_GROUP_1 Average 18.59 19.36 19.50 18.86 20.41 20.59 18.27 20.41
lu.C.x_156_GROUP_2 Average 19.52 20.52 20.48 21.17 19.52 19.09 17.70 18.00
lu.C.x_156_GROUP_3 Average 20.58 20.71 20.17 20.50 18.46 19.50 18.58 17.50
lu.C.x_156_GROUP_4 Average 18.95 19.63 19.47 19.84 18.79 19.84 20.84 18.63
lu.C.x_156_GROUP_5 Average 16.85 17.96 19.89 19.15 19.26 20.48 21.70 20.70
lu.C.x_156_NORMAL_1 Average 18.04 18.48 20.00 19.72 20.72 20.48 18.48 20.08
lu.C.x_156_NORMAL_2 Average 18.22 20.56 19.50 19.39 20.67 19.83 18.44 19.39
lu.C.x_156_NORMAL_3 Average 17.72 19.61 19.56 19.17 20.17 19.89 20.78 19.11
lu.C.x_156_NORMAL_4 Average 18.05 19.74 20.21 19.89 20.32 20.26 19.16 18.37
lu.C.x_156_NORMAL_5 Average 18.89 19.95 20.21 20.63 19.84 19.26 19.26 17.95
============156_GROUP========Mop/s===================================
min q1 median q3 max
13460.1 14949 15851.7 16391.4 18993
============156_GROUP========time====================================
min q1 median q3 max
107.35 124.39 128.63 136.4 151.48
============156_NORMAL========Mop/s===================================
min q1 median q3 max
14418.5 18512.4 19049.5 19682 19808.8
============156_NORMAL========time====================================
min q1 median q3 max
102.93 103.6 107.04 110.14 141.42

echo 0 > /proc/sys/kernel/numa_balancing

lu.C.x_156_GROUP_1 Average 19.00 19.33 19.33 19.58 20.08 19.67 19.83 19.17
lu.C.x_156_GROUP_2 Average 18.55 19.91 20.09 19.27 18.82 19.27 19.91 20.18
lu.C.x_156_GROUP_3 Average 18.42 19.08 19.75 19.00 19.50 20.08 20.25 19.92
lu.C.x_156_GROUP_4 Average 18.42 19.83 19.17 19.50 19.58 19.83 19.83 19.83
lu.C.x_156_GROUP_5 Average 19.17 19.42 20.17 19.92 19.25 18.58 19.92 19.58
lu.C.x_156_NORMAL_1 Average 19.25 19.50 19.92 18.92 19.33 19.75 19.58 19.75
lu.C.x_156_NORMAL_2 Average 19.42 19.25 17.83 18.17 19.83 20.50 20.42 20.58
lu.C.x_156_NORMAL_3 Average 18.58 19.33 19.75 18.25 19.42 20.25 20.08 20.33
lu.C.x_156_NORMAL_4 Average 19.00 19.55 19.73 18.73 19.55 20.00 19.64 19.82
lu.C.x_156_NORMAL_5 Average 19.25 19.25 19.50 18.75 19.92 19.58 19.92 19.83
============156_GROUP========Mop/s===================================
min q1 median q3 max
28520.1 29024.2 29042.1 29367.4 31235.2
============156_GROUP========time====================================
min q1 median q3 max
65.28 69.43 70.21 70.25 71.49
============156_NORMAL========Mop/s===================================
min q1 median q3 max
28974.5 29806.5 30237.1 30907.4 31830.1
============156_NORMAL========time====================================
min q1 median q3 max
64.06 65.97 67.43 68.41 70.37

This all now makes sense. Looking at the numa balancing code a bit, you can see
that it still uses load, so it will still be subject to making bogus decisions
based on the weighted load. In this case it has been actively working against the
load balancer because of that.

I think with the three numa levels on this system the numa balancing was able to
win more often. We don't see the same level of this result on systems with only
one SD_NUMA level.
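For anyone wanting to check their own box: with CONFIG_SCHED_DEBUG the per-CPU
domain names are exported (via procfs on kernels of this vintage; later kernels
moved them to debugfs). A small sketch, assuming one of those two paths exists:

```python
# Sketch: count the sched domain levels for a CPU by reading the exported
# topology. Paths are kernel-version dependent; returns [] if neither the
# procfs nor the debugfs location is available on this kernel/config.
from glob import glob
from pathlib import Path

def sched_domain_levels(cpu=0):
    names = []
    for base in (f"/proc/sys/kernel/sched_domain/cpu{cpu}",
                 f"/sys/kernel/debug/sched/domains/cpu{cpu}"):
        for d in sorted(glob(base + "/domain*")):
            name_file = Path(d) / "name"
            if name_file.exists():
                names.append(name_file.read_text().strip())
    return names

levels = sched_domain_levels()
print(f"{len(levels)} levels: {' '.join(levels) or '(not exported)'}")
```

On the 8-node system discussed here this would report 5 levels; on the 4-node
system, 3.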

Following the other part of this thread, I have to add that I'm of the opinion
that the weighted load (which is all we have now, I believe) really should be used
only in extreme cases of overload to deal with fairness. And even then maybe not.

As far as I can see, once the fair group scheduling is involved, that load is
basically a random number between 1 and 1024. It really has no bearing on how
much "load" a task will put on a cpu. Any comparison of that to cpu capacity
is pretty meaningless.
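As a toy illustration (this is not the kernel's actual task_h_load(), which walks
the full cfs_rq hierarchy using PELT averages; the names and the single-level
scaling here are simplifications), scaling a task's weight by its group's share
shows how two equally busy tasks can report wildly different "load":

```python
NICE_0_LOAD = 1024  # weight of a nice-0 task

def h_load(task_weight, group_shares, group_total_weight):
    """Toy one-level model of hierarchical load: the task gets its
    proportional slice of the group's shares, regardless of how much
    CPU it actually uses."""
    return group_shares * task_weight // group_total_weight

# A nice-0 task alone in a default group (shares=1024): load looks like 1024.
print(h_load(NICE_0_LOAD, 1024, NICE_0_LOAD))        # -> 1024
# The same task sharing its group with 15 siblings: load looks like 64.
print(h_load(NICE_0_LOAD, 1024, 16 * NICE_0_LOAD))   # -> 64
# Both tasks may be 100% CPU-bound, yet their "load" differs 16x -- so
# comparing these numbers against cpu capacity tells you very little.
```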

I'm sure there are workloads for which the numa balancing is more important. But
even then I suspect it is making the wrong decisions more often than not. I think
a similar rework may be needed :)

I've asked our perf team to try the full battery of tests with numa balancing
disabled to see what it shows across the board.


Good job on this, and thanks for the time looking at my specific issues.

As far as this series is concerned, and as far as it matters:

Acked-by: Phil Auld <pauld@redhat.com>

Cheers,
Phil

--