From: Mel Gorman <mgorman@techsingularity.net>
To: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	pauld@redhat.com, srikar@linux.vnet.ibm.com,
	quentin.perret@arm.com, dietmar.eggemann@arm.com,
	Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com,
	riel@surriel.com, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
Date: Sat, 21 Dec 2019 11:25:03 +0000
Message-ID: <20191221112503.GN3178@techsingularity.net>
In-Reply-To: <d44ae0ff-3bd7-fab1-66d0-71769c078918@arm.com>

On Fri, Dec 20, 2019 at 12:40:02PM +0000, Valentin Schneider wrote:
> > +			 * it is not exactly half as imbalance_adj must be
> > +			 * accounted for or the two domains do not converge
> > +			 * as equally balanced if the number of busy tasks is
> > +			 * roughly the size of one NUMA domain.
> > +			 */
> > +			imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
> > +			if (env->imbalance <= imbalance_adj &&
> 
> I'm confused now, have we set env->imbalance to anything at this point? AIUI
> Vincent's suggestion was to hinge this purely on the weight vs idle_cpus /
> nr_running, IOW not use imbalance:
> 
> if (sd->flags & SD_NUMA) {
> 	imbalance_adj = max(2U, (busiest->group_weight *
> 				 (env->sd->imbalance_pct - 100) / 100) >> 1);
> 	imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
>
> 	if (busiest->idle_cpus >= imbalance_max) {
> 		env->imbalance = 0;
> 		return;
> 	}
> }
>

Ok, I tried this just in case and it does not work out when the number
of CPUs per NUMA node gets very large. The gap between the cut-off
calculated from imbalance_pct and the one based on group_weight >> 1
grows with the node size, so the check tolerates very large imbalances
and caused some severe regressions on a modern AMD machine. The version
I had, which calculates an imbalance first and then decides whether to
ignore it, has consistently been the best approach across multiple
machines and workloads.
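
To put rough numbers on that, here is a standalone sketch (not code
from this thread; it assumes a 128-CPU node and the typical SD_NUMA
imbalance_pct of 125, neither taken from any machine discussed above)
that plugs those values into the check quoted earlier:

#include <stdio.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	/* Assumed values for illustration only. */
	unsigned int group_weight = 128;	/* CPUs in the busiest node */
	unsigned int imbalance_pct = 125;	/* typical SD_NUMA default */

	unsigned int imbalance_adj = max(2U,
		(group_weight * (imbalance_pct - 100) / 100) >> 1);
	unsigned int imbalance_max = (group_weight >> 1) + imbalance_adj;

	/* The suggested check zeroes env->imbalance while the busiest
	 * node still has at least imbalance_max idle CPUs. */
	printf("imbalance_adj = %u, imbalance_max = %u\n",
	       imbalance_adj, imbalance_max);

	/* That means up to group_weight - imbalance_max CPUs can be
	 * busy on one node while the other node is completely idle. */
	printf("tolerated imbalance = %u busy CPUs\n",
	       group_weight - imbalance_max);
	return 0;
}

On those assumed values this prints imbalance_adj = 16 and
imbalance_max = 80, i.e. the check would leave roughly 48 busy CPUs
piled up on one node while the other node sits completely idle, which
is the scale of imbalance being described.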

-- 
Mel Gorman
SUSE Labs
