From: Matt Fleming <matt@codeblueprint.co.uk>
To: "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	"Lendacky, Thomas" <Thomas.Lendacky@amd.com>,
	Borislav Petkov <bp@alien8.de>
Subject: Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC
Date: Mon, 29 Jul 2019 13:16:20 +0100	[thread overview]
Message-ID: <20190729121620.GD6909@codeblueprint.co.uk> (raw)
In-Reply-To: <a8241850-7111-2d93-2330-d28b00797e56@amd.com>

On Thu, 25 Jul, at 04:37:06PM, Suthikulpanit, Suravee wrote:
> 
> I am testing this patch on Linux 5.2, and I actually do not notice a
> difference pre vs post patch.
> 
> Besides the case above, I have also run an experiment with
> a different number of threads across two sockets:
> 
> (Note: I only focus on thread0 of each core.)
> 
> sXnY = Socket X Node Y
> 
>      * s0n0 + s0n1 + s1n0 + s1n1
>      numactl -C 0-15,32-47 ./spinner 32
> 
>      * s0n2 + s0n3 + s1n2 + s1n3
>      numactl -C 16-31,48-63 ./spinner 32
> 
>      * s0 + s1
>      numactl -C 0-63 ./spinner 64
> 
> My observations are:
> 
>      * I still notice improper load balancing on one of the tasks for
>        the first few seconds, before they are load-balanced correctly.
> 
>      * It takes longer to load-balance with a larger number of tasks.
> 
> I wonder if you have tried with a different kernel base?

It was tested with one of the 5.2 -rc kernels.
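
For anyone wanting to reproduce: the 'spinner' source isn't in this
thread, but I take it to be a simple busy-loop program that spawns N
threads -- something like the hypothetical sketch below (build with
gcc -pthread -o spinner spinner.c):

	/* spinner.c -- hypothetical reconstruction of the test program
	 * referenced above; spawns N busy-loop threads and spins forever.
	 * Pin it with numactl, e.g.: numactl -C 0-15,32-47 ./spinner 32
	 */
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>

	static void *spin(void *arg)
	{
		/* volatile keeps the loop from being optimised out */
		volatile unsigned long x = 0;

		for (;;)
			x++;
		return NULL;
	}

	int main(int argc, char **argv)
	{
		int i, nthreads = argc > 1 ? atoi(argv[1]) : 1;
		pthread_t tid;

		for (i = 0; i < nthreads - 1; i++) {
			if (pthread_create(&tid, NULL, spin, NULL)) {
				perror("pthread_create");
				return 1;
			}
		}

		spin(NULL);	/* main thread is the Nth spinner */
		return 0;
	}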

I'll take another look at this behaviour, but for the benefit of LKML
readers, here's the summary I gave before. It's specific to using
cgroups to partition tasks:

    It turns out there's a secondary issue to do with how run queue load
    averages are compared between sched groups.
    
    Load averages for a sched_group (a group within a domain) are
    effectively "scaled" by the number of CPUs in that group. This has a
    direct influence on how quickly load ramps up in a group.
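    
    Concretely, the comparison in find_idlest_group() divides by the
    group's capacity (the same lines the hack below touches); absent RT
    or irq pressure, a group's capacity is roughly nr_cpus *
    SCHED_CAPACITY_SCALE (1024):
    
        avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
                                group->sgc->capacity;
        runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
                                group->sgc->capacity;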
    
    What's happening on my system when running with $(numactl -C
    0-7,32-39) is that the load at the top NUMA sched_domain (domain4)
    is scaled by the capacity of all 64 CPUs -- even though the
    workload can't use all 64 due to scheduler affinity.
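    
    Back-of-the-envelope (ignoring RT/irq capacity reduction):
    domain4's group capacity is ~64 * 1024 = 65536. The 16 affined
    spinners contribute at most ~16 * NICE_0_LOAD = 16384 of load, so
    the scaled value is ~16384 * 1024 / 65536 = 256, versus ~1024 if
    the divisor reflected only the 16 CPUs in the task's mask. The
    group looks roughly 4x less loaded than it really is for this
    workload.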
    
    So because the load balancer thinks there's plenty of room left to run
    tasks, it doesn't balance very well across sockets even with the
    SD_BALANCE_FORK flag.
    
    This super quick and ugly patch, which caps the number of CPUs at 8, gets both
    sockets used by fork() on my system.
    
    ---->8----
    
    diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
    index 40bd1e27b1b7..9444c34d038c 100644
    --- a/kernel/sched/fair.c
    +++ b/kernel/sched/fair.c
    @@ -5791,6 +5791,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
     	int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
     	unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
     				(sd->imbalance_pct-100) / 100;
    +	unsigned long capacity;
     
     	if (sd_flag & SD_BALANCE_WAKE)
     		load_idx = sd->wake_idx;
    @@ -5835,10 +5836,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
     		}
     
     		/* Adjust by relative CPU capacity of the group */
    +		capacity = group->sgc->capacity;
    +
    +		if (capacity > (SCHED_CAPACITY_SCALE * 8))
    +			capacity = SCHED_CAPACITY_SCALE * 8;
    +
     		avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
    -					group->sgc->capacity;
    +					capacity;
     		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
    -					group->sgc->capacity;
    +					capacity;
     
     		if (local_group) {
     			this_runnable_load = runnable_load;
    
    ----8<----
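    
    (The 8 is a hard-coded stand-in for "the number of CPUs the task
    can actually use"; a proper fix would need to derive that from the
    task's affinity mask rather than a magic constant.)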
    
    There's still an issue with the active load balancer kicking in after a few
    seconds, but I suspect that is related to the use of group capacity elsewhere
    in the load balancer code (like update_sg_lb_stats()).


-- 
Matt Fleming
SUSE Performance Team

Thread overview: 7+ messages
2019-07-23 10:48 [PATCH v3] sched/topology: Improve load balancing on AMD EPYC Matt Fleming
2019-07-23 11:42 ` Mel Gorman
2019-07-23 12:00   ` Peter Zijlstra
2019-07-23 13:03     ` Mel Gorman
2019-07-23 14:09       ` Peter Zijlstra
2019-07-25 16:37 ` Suthikulpanit, Suravee
2019-07-29 12:16   ` Matt Fleming [this message]
