From: Phil Auld <pauld@redhat.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Quentin Perret <quentin.perret@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Morten Rasmussen <Morten.Rasmussen@arm.com>,
	Hillf Danton <hdanton@sina.com>, Parth Shah <parth@linux.ibm.com>,
	Rik van Riel <riel@surriel.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
Date: Fri, 17 Jan 2020 09:23:59 -0500	[thread overview]
Message-ID: <20200117142358.GA6339@pauld.bos.csb> (raw)
In-Reply-To: <CAKfTPtBznUt20QFzeQBPELcmN6+F=BOx09oSqVMzSGvXF5ByHg@mail.gmail.com>

Hi,

On Fri, Jan 17, 2020 at 02:08:13PM +0100 Vincent Guittot wrote:
> Hi Mel,
> 
> 
> On Thu, 16 Jan 2020 at 17:35, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Tue, Jan 14, 2020 at 10:13:20AM +0000, Mel Gorman wrote:
> > > Changelog since V3
> > > o Allow a fixed imbalance based on a basic comparison with 2 tasks. This turned out to
> > >   be as good or better than allowing an imbalance based on the group weight
> > >   without worrying about potential spillover of the lower scheduler domains.
> > >
> > > Changelog since V2
> > > o Only allow a small imbalance when utilisation is low to address reports that
> > >   higher utilisation workloads were hitting corner cases.
> > >
> > > Changelog since V1
> > > o Alter code flow                                             vincent.guittot
> > > o Use idle CPUs for comparison instead of sum_nr_running      vincent.guittot
> > > o Note that the division is still in place. Without it and taking
> > >   imbalance_adj into account before the cutoff, two NUMA domains
> > >   do not converge as being equally balanced when the number of
> > >   busy tasks equals the size of one domain (50% of the sum).
> > >
> > > The CPU load balancer balances between different domains to spread load
> > > and strives to have equal balance everywhere. Communicating tasks can
> > > migrate so they are topologically close to each other but these decisions
> > > are independent. On a lightly loaded NUMA machine, two communicating tasks
> > > pulled together at wakeup time can be pushed apart by the load balancer.
> > > In isolation, the load balancer decision is fine but it ignores the tasks'
> > > data locality and the wakeup/LB paths continually conflict. NUMA balancing
> > > is also a factor, but it simply conflicts with the load balancer as well.
> > >
> > > This patch allows a fixed degree of imbalance of two tasks to exist
> > > between NUMA domains regardless of utilisation levels. In many cases,
> > > this prevents communicating tasks being pulled apart. It was evaluated
> > > whether the imbalance should be scaled to the domain size. However, no
> > > additional benefit was measured across a range of workloads and machines
> > > and scaling adds the risk that lower domains have to be rebalanced. While
> > > this could change again in the future, such a change should specify the
> > > use case and benefit.
> > >
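(For context, the check described above reduces to a very small rule.
What follows is a minimal standalone sketch of the idea; the struct and
helper below are illustrative stand-ins, not the kernel's actual code.)

#include <stdio.h>

/* Hypothetical stand-in for the balancer's busiest-group statistics. */
struct group_stats {
	unsigned int sum_nr_running;
};

/*
 * Allow a fixed imbalance of two tasks between NUMA domains: if the
 * busiest group runs no more than a communicating pair, report no
 * imbalance so the pair is not pulled apart by the load balancer.
 */
static long numa_imbalance(long imbalance, const struct group_stats *busiest)
{
	unsigned int imbalance_min = 2;

	if (busiest->sum_nr_running <= imbalance_min)
		return 0;
	return imbalance;
}

int main(void)
{
	struct group_stats busiest = { .sum_nr_running = 2 };

	printf("%ld\n", numa_imbalance(1, &busiest));	/* 0: pair left alone */
	busiest.sum_nr_running = 8;
	printf("%ld\n", numa_imbalance(3, &busiest));	/* 3: balance normally */
	return 0;
}
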
> >
> > Any thoughts on whether this is ok for tip or are there suggestions on
> > an alternative approach?
> 
> I have just finished running some tests on my system with your patch
> and I haven't seen any noticeable changes so far, which was a bit
> expected. The tests that I usually run use more than 4 tasks on my
> 2-node system; the only exception is perf sched pipe, and the results
> for this test stay the same with and without your patch. I'm curious
> whether this impacts Phil's tests, which run the LU.c benchmark with
> some cpu-burning tasks.
>

I'm not seeing much meaningful difference with this real v4 versus what I 
posted earlier. It seems to have tightened up the range in some cases, but 
otherwise it's hard to see much difference, which is a good thing. I'll 
put the eye chart below for the curious. 

I'll see if the perf team has time to run a full suite on it. But for this
case it looks fine.


Cheers,
Phil


The lbv4* rows are the unofficial v4 from the email thread; lbv4 is the v4
as actually posted. Otherwise this is the same data as I posted the other
day, which described the test: https://lkml.org/lkml/2020/1/7/840
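
As an aside, each row below is a five-number summary (min, q1, median,
q3, max) over repeated runs. Here is a minimal sketch of computing one
such summary, with hypothetical sample values; the interpolation
convention used for the actual tables may differ:

#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
	double da = *(const double *)a, db = *(const double *)b;

	return (da > db) - (da < db);
}

/* Quantile by linear interpolation between the nearest ranks. */
static double quantile(const double *sorted, int n, double q)
{
	double pos = q * (n - 1);
	int lo = (int)pos;

	if (lo + 1 >= n)
		return sorted[n - 1];
	return sorted[lo] + (pos - lo) * (sorted[lo + 1] - sorted[lo]);
}

int main(void)
{
	/* Hypothetical Mop/s samples from repeated benchmark runs. */
	double s[] = { 28006.3, 29861.1, 31993.1, 33469.3, 34060.7 };
	int n = sizeof(s) / sizeof(s[0]);

	qsort(s, n, sizeof(double), cmp_double);
	printf("min %.1f q1 %.1f median %.1f q3 %.1f max %.1f\n",
	       s[0], quantile(s, n, 0.25), quantile(s, n, 0.5),
	       quantile(s, n, 0.75), s[n - 1]);
	return 0;
}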


GROUP - LU.c and cpu hogs in separate cgroups
Mop/s - Higher is better
============76_GROUP========Mop/s===================================
	min	q1	median	q3	max
5.4.0	 1671.8	 4211.2	 6103.0	 6934.1	 7865.4
5.4.0	 1777.1	 3719.9	 4861.8	 5822.5	13479.6
5.4.0	 2015.3	 2716.2	 5007.1	 6214.5	 9491.7
5.5-rc2	27641.0	30684.7	32091.8	33417.3	38118.1
5.5-rc2	27386.0	29795.2	32484.1	36004.0	37704.3
5.5-rc2	26649.6	29485.0	30379.7	33116.0	36832.8
lbv3	28496.3	29716.0	30634.8	32998.4	40945.2
lbv3	27294.7	29336.4	30186.0	31888.3	35839.1
lbv3	27099.3	29325.3	31680.1	35973.5	39000.0
lbv4*	27936.4	30109.0	31724.8	33150.7	35905.1
lbv4*	26431.0	29355.6	29850.1	32704.4	36060.3
lbv4*	27436.6	29945.9	31076.9	32207.8	35401.5
lbv4	28006.3	29861.1	31993.1	33469.3	34060.7
lbv4	28468.2	30057.7	31606.3	31963.5	35348.5
lbv4	25016.3	28897.5	29274.4	30229.2	36862.7

Runtime - Lower is better
============76_GROUP========time====================================
	min	q1	median	q3	max
5.4.0	259.2	294.92	335.39	484.33	1219.61
5.4.0	151.3	351.1	419.4	551.99	1147.3
5.4.0	214.8	328.16	407.27	751.03	1011.77
5.5-rc2	 53.49	 61.03	 63.56	 66.46	  73.77
5.5-rc2	 54.08	 56.67	 62.78	 68.44	  74.45
5.5-rc2	 55.36	 61.61	 67.14	 69.16	  76.51
lbv3	 49.8	 61.8	 66.59	 68.62	  71.55
lbv3	 56.89	 63.95	 67.55	 69.51	  74.7
lbv3	 52.28	 56.68	 64.38	 69.54	  75.24
lbv4*	 56.79	 61.52	 64.3	 67.73	  72.99
lbv4*	 56.54	 62.36	 68.31	 69.47	  77.14
lbv4*	 57.6	 63.33	 65.64	 68.11	  74.32
lbv4	 59.86	 60.93	 63.74	 68.28	  72.81
lbv4	 57.68	 63.79	 64.52	 67.84	  71.62
lbv4	 55.31	 67.46	 69.66	 70.56	  81.51

NORMAL - LU.c and cpu hogs all in one cgroup
Mop/s - Higher is better
============76_NORMAL========Mop/s===================================
	min	q1	median	q3	max
5.4.0	32912.6	34047.5	36739.4	39124.1	41592.5
5.4.0	29937.7	33060.6	34860.8	39528.8	43328.1
5.4.0	31851.2	34281.1	35284.4	36016.8	38847.4
5.5-rc2	30475.6	32505.1	33977.3	34876	36233.8
5.5-rc2	30657.7	31301.1	32059.4	34396.7	38661.8
5.5-rc2	31022	32247.6	32628.9	33245	38572.3
lbv3	30606.4	32794.4	34258.6	35699	38669.2
lbv3	29722.7	30558.9	32731.2	36412	40752.3
lbv3	30297.7	32568.3	36654.6	38066.2	38988.3
lbv4*	30084.9	31227.5	32312.8	33222.8	36039.7
lbv4*	29875.9	32903.6	33803.1	34519.3	38663.5
lbv4*	27923.3	30631.1	32666.9	33516.7	36663.4
lbv4	30401.4	32559.5	33268.3	35012.9	35953.9
lbv4	29372.5	30677	32081.7	33734.2	39326.8
lbv4	29583.7	30432.5	32542.9	33170.5	34123.1

Runtime - Lower is better
============76_NORMAL========time====================================
	min	q1	median	q3	max
5.4.0	49.02	52.115	55.58	59.89	61.95
5.4.0	47.06	51.615	58.57	61.68	68.11
5.4.0	52.49	56.615	57.795	59.48	64.02
5.5-rc2	56.27	58.47	60.02	62.735	66.91
5.5-rc2	52.74	59.295	63.605	65.145	66.51
5.5-rc2	52.86	61.335	62.495	63.23	65.73
lbv3	52.73	57.12	59.52	62.19	66.62
lbv3	50.03	56.02	62.39	66.725	68.6
lbv3	52.3	53.565	55.65	62.645	67.3
lbv4*	56.58	61.375	63.135	65.3	67.77
lbv4*	52.74	59.07	60.335	61.97	68.25
lbv4*	55.61	60.835	62.42	66.635	73.02
lbv4	56.71	58.235	61.295	62.63	67.07
lbv4	51.85	60.535	63.56	66.54	69.42
lbv4	59.75	61.48	62.655	67	68.92

> >
> > --
> > Mel Gorman
> > SUSE Labs
> 

-- 


Thread overview: 24+ messages
2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
2020-01-16 16:35 ` Mel Gorman
2020-01-17 13:08   ` Vincent Guittot
2020-01-17 14:15     ` Mel Gorman
2020-01-17 14:32       ` Phil Auld
2020-01-17 14:23     ` Phil Auld [this message]
2020-01-17 14:37   ` Valentin Schneider
2020-01-17 13:16 ` Vincent Guittot
2020-01-17 14:26   ` Mel Gorman
2020-01-17 14:29     ` Vincent Guittot
2020-01-17 15:09 ` Vincent Guittot
2020-01-17 15:11   ` Peter Zijlstra
2020-01-17 15:21 ` Phil Auld
2020-01-17 17:56 ` Srikar Dronamraju
2020-01-17 21:58   ` Mel Gorman
2020-01-20  8:09     ` Srikar Dronamraju
2020-01-20  8:33       ` Mel Gorman
2020-01-20 17:27         ` Srikar Dronamraju
2020-01-20 18:21           ` Mel Gorman
2020-01-21  8:55             ` Srikar Dronamraju
2020-01-21  9:11               ` Mel Gorman
2020-01-21 10:42                 ` Peter Zijlstra
2020-01-21  9:59 ` Srikar Dronamraju
2020-01-29 11:32 ` [tip: sched/core] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains tip-bot2 for Mel Gorman
