linux-kernel.vger.kernel.org archive mirror
* [PATCH v3] sched/topology: Improve load balancing on AMD EPYC
@ 2019-07-23 10:48 Matt Fleming
  2019-07-23 11:42 ` Mel Gorman
  2019-07-25 16:37 ` Suthikulpanit, Suravee
  0 siblings, 2 replies; 7+ messages in thread
From: Matt Fleming @ 2019-07-23 10:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Matt Fleming, Suthikulpanit, Suravee, Mel Gorman,
	Lendacky, Thomas, Borislav Petkov

SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in

  commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@amd.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Lendacky, Thomas" <Thomas.Lendacky@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
---
 arch/x86/kernel/cpu/amd.c |  3 +++
 include/linux/topology.h  | 14 ++++++++++++++
 kernel/sched/topology.c   |  3 ++-
 mm/khugepaged.c           |  2 +-
 mm/page_alloc.c           |  2 +-
 5 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 8d4e50428b68..d94bf83d5ee6 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
 #include <linux/random.h>
+#include <linux/topology.h>
 #include <asm/processor.h>
 #include <asm/apic.h>
 #include <asm/cacheinfo.h>
@@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);
 
+	node_reclaim_distance = 32;
+
 	/*
 	 * Fix erratum 1076: CPB feature bit not being set in CPUID.
 	 * Always set it, except when running under a hypervisor.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 47a3e3c08036..579522ec446c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,20 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 30
 #endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f751ce0b783e..f684fde00536 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int			sched_domains_curr_level;
 int				sched_max_numa_distance;
 static int			*sched_domains_numa_distance;
 static struct cpumask		***sched_domains_numa_masks;
+int __read_mostly		node_reclaim_distance = RECLAIM_DISTANCE;
 #endif
 
 /*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,
 
 		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
-		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
 				       SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b23215..ccede2425c3f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		if (!khugepaged_node_load[i])
 			continue;
-		if (node_distance(nid, i) > RECLAIM_DISTANCE)
+		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
 	}
 	return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..0d54cd2c43a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3522,7 +3522,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
-				RECLAIM_DISTANCE;
+				node_reclaim_distance;
 }
 #else	/* CONFIG_NUMA */
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC
  2019-07-23 10:48 [PATCH v3] sched/topology: Improve load balancing on AMD EPYC Matt Fleming
@ 2019-07-23 11:42 ` Mel Gorman
  2019-07-23 12:00   ` Peter Zijlstra
  2019-07-25 16:37 ` Suthikulpanit, Suravee
  1 sibling, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2019-07-23 11:42 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Peter Zijlstra, linux-kernel, Suthikulpanit, Suravee, Lendacky,
	Thomas, Borislav Petkov

On Tue, Jul 23, 2019 at 11:48:30AM +0100, Matt Fleming wrote:
> SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> for any sched domains with a NUMA distance greater than 2 hops
> (RECLAIM_DISTANCE). The idea being that it's expensive to balance
> across domains that far apart.
> 
> However, as is rather unfortunately explained in
> 
>   commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
> 
> the value for RECLAIM_DISTANCE is based on node distance tables from
> 2011-era hardware.
> 
> Current AMD EPYC machines have the following NUMA node distances:
> 
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  16  32  32  32  32
>   1:  16  10  16  16  32  32  32  32
>   2:  16  16  10  16  32  32  32  32
>   3:  16  16  16  10  32  32  32  32
>   4:  32  32  32  32  10  16  16  16
>   5:  32  32  32  32  16  10  16  16
>   6:  32  32  32  32  16  16  10  16
>   7:  32  32  32  32  16  16  16  10
> 
> where 2 hops is 32.
> 
> The result is that the scheduler fails to load balance properly across
> NUMA nodes on different sockets -- 2 hops apart.
> 
> For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
> (CPUs 32-39) like so,
> 
>   $ numactl -C 0-7,32-39 ./spinner 16
> 
> causes all threads to fork and remain on node 0 until the active
> balancer kicks in after a few seconds and forcibly moves some threads
> to node 4.
> 
> Override node_reclaim_distance for AMD Zen.
> 
> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
> Cc: "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@amd.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: "Lendacky, Thomas" <Thomas.Lendacky@amd.com>
> Cc: Borislav Petkov <bp@alien8.de>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

The only caveat I can think of is that a future generation of Zen might
take a different magic number than 32 as their remote distance. If or
when this happens, it'll need additional smarts but lacking a crystal
ball, we can cross that bridge when we come to it.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC
  2019-07-23 11:42 ` Mel Gorman
@ 2019-07-23 12:00   ` Peter Zijlstra
  2019-07-23 13:03     ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2019-07-23 12:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matt Fleming, linux-kernel, Suthikulpanit, Suravee, Lendacky,
	Thomas, Borislav Petkov

On Tue, Jul 23, 2019 at 12:42:48PM +0100, Mel Gorman wrote:
> On Tue, Jul 23, 2019 at 11:48:30AM +0100, Matt Fleming wrote:
> > SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> > for any sched domains with a NUMA distance greater than 2 hops
> > (RECLAIM_DISTANCE). The idea being that it's expensive to balance
> > across domains that far apart.
> > 
> > However, as is rather unfortunately explained in
> > 
> >   commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
> > 
> > the value for RECLAIM_DISTANCE is based on node distance tables from
> > 2011-era hardware.
> > 
> > Current AMD EPYC machines have the following NUMA node distances:
> > 
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >   0:  10  16  16  16  32  32  32  32
> >   1:  16  10  16  16  32  32  32  32
> >   2:  16  16  10  16  32  32  32  32
> >   3:  16  16  16  10  32  32  32  32
> >   4:  32  32  32  32  10  16  16  16
> >   5:  32  32  32  32  16  10  16  16
> >   6:  32  32  32  32  16  16  10  16
> >   7:  32  32  32  32  16  16  16  10
> > 
> > where 2 hops is 32.
> > 
> > The result is that the scheduler fails to load balance properly across
> > NUMA nodes on different sockets -- 2 hops apart.
> > 
> > For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
> > (CPUs 32-39) like so,
> > 
> >   $ numactl -C 0-7,32-39 ./spinner 16
> > 
> > causes all threads to fork and remain on node 0 until the active
> > balancer kicks in after a few seconds and forcibly moves some threads
> > to node 4.
> > 
> > Override node_reclaim_distance for AMD Zen.
> > 
> > Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
> > Cc: "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@amd.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: "Lendacky, Thomas" <Thomas.Lendacky@amd.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
> 
> The only caveat I can think of is that a future generation of Zen might
> take a different magic number than 32 as their remote distance. If or
> when this happens, it'll need additional smarts but lacking a crystal
> ball, we can cross that bridge when we come to it.

I just suggested to Matt on IRC we could do something along these lines,
but we can do that later.

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1611,6 +1611,12 @@ void sched_init_numa(void)
 	}
 
 	/*
+	 * Set the reclaim distance at 2 hops instead of at a fixed distance value.
+	 */
+	if (level >= 2)
+		node_reclaim_distance = sched_domains_numa_distance[2];
+
+	/*
 	 * 'level' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance


* Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC
  2019-07-23 12:00   ` Peter Zijlstra
@ 2019-07-23 13:03     ` Mel Gorman
  2019-07-23 14:09       ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2019-07-23 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matt Fleming, linux-kernel, Suthikulpanit, Suravee, Lendacky,
	Thomas, Borislav Petkov

On Tue, Jul 23, 2019 at 02:00:30PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 23, 2019 at 12:42:48PM +0100, Mel Gorman wrote:
> > On Tue, Jul 23, 2019 at 11:48:30AM +0100, Matt Fleming wrote:
> > > Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
> > > Cc: "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@amd.com>
> > > Cc: Mel Gorman <mgorman@techsingularity.net>
> > > Cc: "Lendacky, Thomas" <Thomas.Lendacky@amd.com>
> > > Cc: Borislav Petkov <bp@alien8.de>
> > 
> > Acked-by: Mel Gorman <mgorman@techsingularity.net>
> > 
> > The only caveat I can think of is that a future generation of Zen might
> > take a different magic number than 32 as their remote distance. If or
> > when this happens, it'll need additional smarts but lacking a crystal
> > ball, we can cross that bridge when we come to it.
> 
> I just suggested to Matt on IRC we could do something along these lines,
> but we can do that later.
> 

That would seem fair but I do think it's something that could be done
later (maybe 1 release away?) to avoid a false bisection to this patch by
accident. I don't *think* there are any machines out there with a 1-hop
distance of 14 but if there is, your patch would make a difference to
MM behaviour.  In the same context, it might make sense to rename the
value to something reflective of the fact that "reclaim distance" affects
scheduler placement. No good name springs to mind at the moment.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC
  2019-07-23 13:03     ` Mel Gorman
@ 2019-07-23 14:09       ` Peter Zijlstra
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2019-07-23 14:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matt Fleming, linux-kernel, Suthikulpanit, Suravee, Lendacky,
	Thomas, Borislav Petkov

On Tue, Jul 23, 2019 at 02:03:21PM +0100, Mel Gorman wrote:
> On Tue, Jul 23, 2019 at 02:00:30PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 23, 2019 at 12:42:48PM +0100, Mel Gorman wrote:
> > > On Tue, Jul 23, 2019 at 11:48:30AM +0100, Matt Fleming wrote:
> > > > Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
> > > > Cc: "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@amd.com>
> > > > Cc: Mel Gorman <mgorman@techsingularity.net>
> > > > Cc: "Lendacky, Thomas" <Thomas.Lendacky@amd.com>
> > > > Cc: Borislav Petkov <bp@alien8.de>
> > > 
> > > Acked-by: Mel Gorman <mgorman@techsingularity.net>
> > > 
> > > The only caveat I can think of is that a future generation of Zen might
> > > take a different magic number than 32 as their remote distance. If or
> > > when this happens, it'll need additional smarts but lacking a crystal
> > > ball, we can cross that bridge when we come to it.
> > 
> > I just suggested to Matt on IRC we could do something along these lines,
> > but we can do that later.
> > 
> 
> That would seem fair but I do think it's something that could be done
> later (maybe 1 release away?) to avoid a false bisection to this patch by
> accident.

Quite agreed; changing reclaim_distance like that will affect a lot of
hardware, while the current patch limits the impact to just AMD-Zen
based bits.

> I don't *think* there are any machines out there with a 1-hop
> distance of 14 but if there is, your patch would make a difference to
> MM behaviour.  In the same context, it might make sense to rename the
> value to something reflective of the fact that "reclaim distance" affects
> scheduler placement. No good name springs to mind at the moment.

Yeah, naming sucks. Let us paint this bicycle shed blue!


* Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC
  2019-07-23 10:48 [PATCH v3] sched/topology: Improve load balancing on AMD EPYC Matt Fleming
  2019-07-23 11:42 ` Mel Gorman
@ 2019-07-25 16:37 ` Suthikulpanit, Suravee
  2019-07-29 12:16   ` Matt Fleming
  1 sibling, 1 reply; 7+ messages in thread
From: Suthikulpanit, Suravee @ 2019-07-25 16:37 UTC (permalink / raw)
  To: Matt Fleming, Peter Zijlstra
  Cc: linux-kernel, Mel Gorman, Lendacky, Thomas, Borislav Petkov

Matt,

On 7/23/2019 5:48 AM, Matt Fleming wrote:
> SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> for any sched domains with a NUMA distance greater than 2 hops
> (RECLAIM_DISTANCE). The idea being that it's expensive to balance
> across domains that far apart.
> 
> However, as is rather unfortunately explained in
> 
>    commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
> 
> the value for RECLAIM_DISTANCE is based on node distance tables from
> 2011-era hardware.
> 
> Current AMD EPYC machines have the following NUMA node distances:
> 
> node distances:
> node   0   1   2   3   4   5   6   7
>    0:  10  16  16  16  32  32  32  32
>    1:  16  10  16  16  32  32  32  32
>    2:  16  16  10  16  32  32  32  32
>    3:  16  16  16  10  32  32  32  32
>    4:  32  32  32  32  10  16  16  16
>    5:  32  32  32  32  16  10  16  16
>    6:  32  32  32  32  16  16  10  16
>    7:  32  32  32  32  16  16  16  10
> 
> where 2 hops is 32.
> 
> The result is that the scheduler fails to load balance properly across
> NUMA nodes on different sockets -- 2 hops apart.
> 
> For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
> (CPUs 32-39) like so,
> 
>    $ numactl -C 0-7,32-39 ./spinner 16
> 
> causes all threads to fork and remain on node 0 until the active
> balancer kicks in after a few seconds and forcibly moves some threads
> to node 4.

I am testing this patch on Linux 5.2, and I actually do not
notice a difference pre vs post patch.

Besides the case above, I have also run an experiment with
a different number of threads across two sockets:

(Note: I only focus on thread0 of each core.)

sXnY = Socket X Node Y

     * s0n0 + s0n1 + s1n0 + s1n1
     numactl -C 0-15,32-47 ./spinner 32

     * s0n2 + s0n3 + s1n2 + s1n3
     numactl -C 16-31,48-63 ./spinner 32

     * s0 + s1
     numactl -C 0-63 ./spinner 64

My observations are:

     * I still notice improper load balancing on some of the tasks initially
       for a few seconds before they are load-balanced correctly.

     * It is taking longer to load balance with a larger number of tasks.

I wonder if you have tried with a different kernel base?

Regards,
Suravee


* Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC
  2019-07-25 16:37 ` Suthikulpanit, Suravee
@ 2019-07-29 12:16   ` Matt Fleming
  0 siblings, 0 replies; 7+ messages in thread
From: Matt Fleming @ 2019-07-29 12:16 UTC (permalink / raw)
  To: Suthikulpanit, Suravee
  Cc: Peter Zijlstra, linux-kernel, Mel Gorman, Lendacky, Thomas,
	Borislav Petkov

On Thu, 25 Jul, at 04:37:06PM, Suthikulpanit, Suravee wrote:
> 
> I am testing this patch on Linux 5.2, and I actually do not
> notice a difference pre vs post patch.
> 
> Besides the case above, I have also run an experiment with
> a different number of threads across two sockets:
> 
> (Note: I only focus on thread0 of each core.)
> 
> sXnY = Socket X Node Y
> 
>      * s0n0 + s0n1 + s1n0 + s1n1
>      numactl -C 0-15,32-47 ./spinner 32
> 
>      * s0n2 + s0n3 + s1n2 + s1n3
>      numactl -C 16-31,48-63 ./spinner 32
> 
>      * s0 + s1
>      numactl -C 0-63 ./spinner 64
> 
> My observations are:
> 
>      * I still notice improper load balancing on some of the tasks initially
>        for a few seconds before they are load-balanced correctly.
> 
>      * It is taking longer to load balance with a larger number of tasks.
> 
> I wonder if you have tried with a different kernel base?

It was tested with one of the 5.2 -rc kernels.

I'll take another look at this behaviour, but for the benefit of LKML
readers, here's the summary I gave before. It's specific to using
cgroups to partition tasks:

    It turns out there's a secondary issue to do with how run queue load
    averages are compared between sched groups.
    
    Load averages for a sched_group (a group within a domain) are
    effectively "scaled" by the number of CPUs in that group. This has a
    direct influence on how quickly load ramps up in a group.
    
    What's happening on my system when running with $(numactl -C
    0-7,32-39) is that the load for the top NUMA sched_domain (domain4) is
    scaling the load by 64 CPUs -- even though the workload can't use all
    64 due to scheduler affinity.
    
    So because the load balancer thinks there's plenty of room left to run
    tasks, it doesn't balance very well across sockets even with the
    SD_BALANCE_FORK flag.
    
    This super quick and ugly patch, which caps the number of CPUs at 8, gets both
    sockets used by fork() on my system.
    
    ---->8----
    
    diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
    index 40bd1e27b1b7..9444c34d038c 100644
    --- a/kernel/sched/fair.c
    +++ b/kernel/sched/fair.c
    @@ -5791,6 +5791,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
     	int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
     	unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
     				(sd->imbalance_pct-100) / 100;
    +	unsigned long capacity;
     
     	if (sd_flag & SD_BALANCE_WAKE)
     		load_idx = sd->wake_idx;
    @@ -5835,10 +5836,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
     		}
     
     		/* Adjust by relative CPU capacity of the group */
    +		capacity = group->sgc->capacity;
    +
    +		if (capacity > (SCHED_CAPACITY_SCALE * 8))
    +			capacity = SCHED_CAPACITY_SCALE * 8;
    +
     		avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
    -					group->sgc->capacity;
    +					capacity;
     		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
    -					group->sgc->capacity;
    +					capacity;
     
     		if (local_group) {
     			this_runnable_load = runnable_load;
    
    ----8<----
    
    There's still an issue with the active load balancer kicking in after a few
    seconds, but I suspect that is related to the use of group capacity elsewhere
    in the load balancer code (like update_sg_lb_stats()).


-- 
Matt Fleming
SUSE Performance Team

