From: Matt Fleming <matt@codeblueprint.co.uk>
To: "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	"Lendacky, Thomas" <Thomas.Lendacky@amd.com>,
	Borislav Petkov <bp@alien8.de>
Subject: Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC
Date: Mon, 29 Jul 2019 13:16:20 +0100	[thread overview]
Message-ID: <20190729121620.GD6909@codeblueprint.co.uk> (raw)
In-Reply-To: <a8241850-7111-2d93-2330-d28b00797e56@amd.com>

On Thu, 25 Jul, at 04:37:06PM, Suthikulpanit, Suravee wrote:
> 
> I am testing this patch on Linux 5.2, and I actually do not notice a
> difference pre vs post patch.
> 
> Besides the case above, I have also run an experiment with
> a different number of threads across two sockets:
> 
> (Note: I only focus on thread0 of each core.)
> 
> sXnY = Socket X Node Y
> 
>      * s0n0 + s0n1 + s1n0 + s1n1
>      numactl -C 0-15,32-47 ./spinner 32
> 
>      * s0n2 + s0n3 + s1n2 + s1n3
>      numactl -C 16-31,48-63 ./spinner 32
> 
>      * s0 + s1
>      numactl -C 0-63 ./spinner 64
> 
> My observations are:
> 
>      * I still notice improper load balancing on one of the tasks for
>        the first few seconds, before they are load-balanced correctly.
> 
>      * It takes longer to load-balance with a larger number of tasks.
> 
> I wonder if you have tried with a different kernel base?

It was tested with one of the 5.2 -rc kernels.
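
For anyone wanting to reproduce: the 'spinner' source isn't in this
thread, but I take it to be a simple busy-loop program that spawns N
threads -- something like the hypothetical sketch below (build with
gcc -pthread -o spinner spinner.c):

	/* spinner.c -- hypothetical reconstruction of the test program
	 * referenced above; spawns N busy-loop threads and spins forever.
	 * Pin it with numactl, e.g.: numactl -C 0-15,32-47 ./spinner 32
	 */
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>

	static void *spin(void *arg)
	{
		/* volatile keeps the loop from being optimised out */
		volatile unsigned long x = 0;

		for (;;)
			x++;
		return NULL;
	}

	int main(int argc, char **argv)
	{
		int i, nthreads = argc > 1 ? atoi(argv[1]) : 1;
		pthread_t tid;

		for (i = 0; i < nthreads - 1; i++) {
			if (pthread_create(&tid, NULL, spin, NULL)) {
				perror("pthread_create");
				return 1;
			}
		}

		spin(NULL);	/* main thread is the Nth spinner */
		return 0;
	}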

I'll take another look at this behaviour, but for the benefit of LKML
readers, here's the summary I gave before. It's specific to using
cgroups to partition tasks:

    It turns out there's a secondary issue to do with how run queue load
    averages are compared between sched groups.
    
    Load averages for a sched_group (a group within a domain) are
    effectively "scaled" by the number of CPUs in that group. This has a
    direct influence on how quickly load ramps up in a group.
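    
    Concretely, the comparison in find_idlest_group() divides by the
    group's capacity (the same lines the hack below touches); absent RT
    or irq pressure, a group's capacity is roughly nr_cpus *
    SCHED_CAPACITY_SCALE (1024):
    
        avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
                                group->sgc->capacity;
        runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
                                group->sgc->capacity;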
    
    What's happening on my system when running with $(numactl -C
    0-7,32-39) is that the load at the top NUMA sched_domain (domain4)
    is scaled by the capacity of all 64 CPUs -- even though the
    workload can't use all 64 due to scheduler affinity.
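    
    Back-of-the-envelope (ignoring RT/irq capacity reduction):
    domain4's group capacity is ~64 * 1024 = 65536. The 16 affined
    spinners contribute at most ~16 * NICE_0_LOAD = 16384 of load, so
    the scaled value is ~16384 * 1024 / 65536 = 256, versus ~1024 if
    the divisor reflected only the 16 CPUs in the task's mask. The
    group looks roughly 4x less loaded than it really is for this
    workload.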
    
    So because the load balancer thinks there's plenty of room left to run
    tasks, it doesn't balance very well across sockets even with the
    SD_BALANCE_FORK flag.
    
    This super quick and ugly patch, which caps the number of CPUs at 8, gets both
    sockets used by fork() on my system.
    
    ---->8----
    
    diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
    index 40bd1e27b1b7..9444c34d038c 100644
    --- a/kernel/sched/fair.c
    +++ b/kernel/sched/fair.c
    @@ -5791,6 +5791,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
     	int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
     	unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
     				(sd->imbalance_pct-100) / 100;
    +	unsigned long capacity;
     
     	if (sd_flag & SD_BALANCE_WAKE)
     		load_idx = sd->wake_idx;
    @@ -5835,10 +5836,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
     		}
     
     		/* Adjust by relative CPU capacity of the group */
    +		capacity = group->sgc->capacity;
    +
    +		if (capacity > (SCHED_CAPACITY_SCALE * 8))
    +			capacity = SCHED_CAPACITY_SCALE * 8;
    +
     		avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
    -					group->sgc->capacity;
    +					capacity;
     		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
    -					group->sgc->capacity;
    +					capacity;
     
     		if (local_group) {
     			this_runnable_load = runnable_load;
    
    ----8<----
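    
    (The 8 is a hard-coded stand-in for "the number of CPUs the task
    can actually use"; a proper fix would need to derive that from the
    task's affinity mask rather than a magic constant.)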
    
    There's still an issue with the active load balancer kicking in after a few
    seconds, but I suspect that is related to the use of group capacity elsewhere
    in the load balancer code (like update_sg_lb_stats()).


-- 
Matt Fleming
SUSE Performance Team

Thread overview: 7+ messages
2019-07-23 10:48 [PATCH v3] sched/topology: Improve load balancing on AMD EPYC Matt Fleming
2019-07-23 11:42 ` Mel Gorman
2019-07-23 12:00   ` Peter Zijlstra
2019-07-23 13:03     ` Mel Gorman
2019-07-23 14:09       ` Peter Zijlstra
2019-07-25 16:37 ` Suthikulpanit, Suravee
2019-07-29 12:16   ` Matt Fleming [this message]
