linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] sched/topology: Improve load balancing on AMD EPYC
@ 2019-06-05 15:59 Matt Fleming
  2019-06-05 18:00 ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Matt Fleming @ 2019-06-05 15:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Matt Fleming, Suthikulpanit, Suravee, Mel Gorman,
	Lendacky, Thomas, Borislav Petkov

SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in

  commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Update the code in sd_init() to account for modern node distances, and
maintaining backward-compatible behaviour by respecting
RECLAIM_DISTANCE for distances more than 2 hops.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@amd.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Lendacky, Thomas" <Thomas.Lendacky@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
---
 kernel/sched/topology.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f53f89df837d..0eea395f7c6b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1410,7 +1410,18 @@ sd_init(struct sched_domain_topology_level *tl,
 
 		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
-		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+
+		/*
+		 * Strip the following flags for sched domains with a NUMA
+		 * distance greater than the historical 2-hops value
+		 * (RECLAIM_DISTANCE) and where tl->numa_level confirms it
+		 * really is more than 2 hops.
+		 *
+		 * Respecting RECLAIM_DISTANCE means we maintain
+		 * backwards-compatible behaviour.
+		 */
+		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE &&
+		    tl->numa_level > 3) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
 				       SD_WAKE_AFFINE);
-- 
2.13.7


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC
  2019-06-05 15:59 [PATCH] sched/topology: Improve load balancing on AMD EPYC Matt Fleming
@ 2019-06-05 18:00 ` Peter Zijlstra
  2019-06-10 21:26   ` Matt Fleming
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2019-06-05 18:00 UTC (permalink / raw)
  To: Matt Fleming
  Cc: linux-kernel, Suthikulpanit, Suravee, Mel Gorman, Lendacky,
	Thomas, Borislav Petkov

On Wed, Jun 05, 2019 at 04:59:22PM +0100, Matt Fleming wrote:
> SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> for any sched domains with a NUMA distance greater than 2 hops
> (RECLAIM_DISTANCE). The idea being that it's expensive to balance
> across domains that far apart.
> 
> However, as is rather unfortunately explained in
> 
>   commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
> 
> the value for RECLAIM_DISTANCE is based on node distance tables from
> 2011-era hardware.
> 
> Current AMD EPYC machines have the following NUMA node distances:
> 
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  16  32  32  32  32
>   1:  16  10  16  16  32  32  32  32
>   2:  16  16  10  16  32  32  32  32
>   3:  16  16  16  10  32  32  32  32
>   4:  32  32  32  32  10  16  16  16
>   5:  32  32  32  32  16  10  16  16
>   6:  32  32  32  32  16  16  10  16
>   7:  32  32  32  32  16  16  16  10
> 
> where 2 hops is 32.
> 
> The result is that the scheduler fails to load balance properly across
> NUMA nodes on different sockets -- 2 hops apart.
> 

> Update the code in sd_init() to account for modern node distances, and
> maintaining backward-compatible behaviour by respecting
> RECLAIM_DISTANCE for distances more than 2 hops.

And then we had two magic values :/

Should we not 'fix' RECLAIM_DISTANCE for EPYC or something? Because
surely, if we want to load-balance agressively over 30, then so too
should we do node_reclaim() I'm thikning.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC
  2019-06-05 18:00 ` Peter Zijlstra
@ 2019-06-10 21:26   ` Matt Fleming
  2019-06-11 17:22     ` Lendacky, Thomas
  0 siblings, 1 reply; 11+ messages in thread
From: Matt Fleming @ 2019-06-10 21:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Suthikulpanit, Suravee, Mel Gorman, Lendacky,
	Thomas, Borislav Petkov

On Wed, 05 Jun, at 08:00:35PM, Peter Zijlstra wrote:
> 
> And then we had two magic values :/
> 
> Should we not 'fix' RECLAIM_DISTANCE for EPYC or something? Because
> surely, if we want to load-balance agressively over 30, then so too
> should we do node_reclaim() I'm thikning.

Yeah we can fix it just for EPYC, Mel suggested that approach originally.

Suravee, Tom, what's the best way to detect these EPYC machines that need to
override RECLAIM_DISTANCE?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC
  2019-06-10 21:26   ` Matt Fleming
@ 2019-06-11 17:22     ` Lendacky, Thomas
  2019-06-18 10:43       ` Matt Fleming
  0 siblings, 1 reply; 11+ messages in thread
From: Lendacky, Thomas @ 2019-06-11 17:22 UTC (permalink / raw)
  To: Matt Fleming, Peter Zijlstra
  Cc: linux-kernel, Suthikulpanit, Suravee, Mel Gorman, Borislav Petkov

On 6/10/19 4:26 PM, Matt Fleming wrote:
> On Wed, 05 Jun, at 08:00:35PM, Peter Zijlstra wrote:
>>
>> And then we had two magic values :/
>>
>> Should we not 'fix' RECLAIM_DISTANCE for EPYC or something? Because
>> surely, if we want to load-balance agressively over 30, then so too
>> should we do node_reclaim() I'm thikning.
> 
> Yeah we can fix it just for EPYC, Mel suggested that approach originally.
> 
> Suravee, Tom, what's the best way to detect these EPYC machines that need to
> override RECLAIM_DISTANCE?

You should be able to do it based on the family. There's an init_amd_zn()
function in arch/x86/kernel/cpu/amd.c.  You can add something there or,
since init_amd_zn() sets X86_FEATURE_ZEN, you could check for that if you
prefer putting it in a different location.

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC
  2019-06-11 17:22     ` Lendacky, Thomas
@ 2019-06-18 10:43       ` Matt Fleming
  2019-06-18 12:33         ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Matt Fleming @ 2019-06-18 10:43 UTC (permalink / raw)
  To: Lendacky, Thomas
  Cc: Peter Zijlstra, linux-kernel, Suthikulpanit, Suravee, Mel Gorman,
	Borislav Petkov

On Tue, 11 Jun, at 05:22:21PM, Lendacky, Thomas wrote:
> On 6/10/19 4:26 PM, Matt Fleming wrote:
> > On Wed, 05 Jun, at 08:00:35PM, Peter Zijlstra wrote:
> >>
> >> And then we had two magic values :/
> >>
> >> Should we not 'fix' RECLAIM_DISTANCE for EPYC or something? Because
> >> surely, if we want to load-balance agressively over 30, then so too
> >> should we do node_reclaim() I'm thikning.
> > 
> > Yeah we can fix it just for EPYC, Mel suggested that approach originally.
> > 
> > Suravee, Tom, what's the best way to detect these EPYC machines that need to
> > override RECLAIM_DISTANCE?
> 
> You should be able to do it based on the family. There's an init_amd_zn()
> function in arch/x86/kernel/cpu/amd.c.  You can add something there or,
> since init_amd_zn() sets X86_FEATURE_ZEN, you could check for that if you
> prefer putting it in a different location.

This works for me under all my tests. Thoughts?

--->8---

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 80a405c2048a..4db4e9e7654b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
 #include <linux/random.h>
+#include <linux/topology.h>
 #include <asm/processor.h>
 #include <asm/apic.h>
 #include <asm/cacheinfo.h>
@@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);
 
+	node_reclaim_distance = 32;
+
 	/* Fix erratum 1076: CPB feature bit not being set in CPUID. */
 	if (!cpu_has(c, X86_FEATURE_CPB))
 		set_cpu_cap(c, X86_FEATURE_CPB);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e1ee4b..74b484354ac9 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,9 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 30
 #endif
+
+extern int __read_mostly node_reclaim_distance;
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f53f89df837d..20f0f8f6792b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int			sched_domains_curr_level;
 int				sched_max_numa_distance;
 static int			*sched_domains_numa_distance;
 static struct cpumask		***sched_domains_numa_masks;
+int __read_mostly		node_reclaim_distance = RECLAIM_DISTANCE;
 #endif
 
 /*
@@ -1410,7 +1411,7 @@ sd_init(struct sched_domain_topology_level *tl,
 
 		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
-		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
 				       SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a335f7c1fac4..67f5f09d70ed 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		if (!khugepaged_node_load[i])
 			continue;
-		if (node_distance(nid, i) > RECLAIM_DISTANCE)
+		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
 	}
 	return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8abe0af..8ccaaf3a47f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3450,7 +3450,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
-				RECLAIM_DISTANCE;
+				node_reclaim_distance;
 }
 #else	/* CONFIG_NUMA */
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC
  2019-06-18 10:43       ` Matt Fleming
@ 2019-06-18 12:33         ` Peter Zijlstra
  2019-06-19 21:34           ` Matt Fleming
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2019-06-18 12:33 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Lendacky, Thomas, linux-kernel, Suthikulpanit, Suravee,
	Mel Gorman, Borislav Petkov

On Tue, Jun 18, 2019 at 11:43:19AM +0100, Matt Fleming wrote:
> This works for me under all my tests. Thoughts?
> 
> --->8---
> 
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 80a405c2048a..4db4e9e7654b 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -8,6 +8,7 @@
>  #include <linux/sched.h>
>  #include <linux/sched/clock.h>
>  #include <linux/random.h>
> +#include <linux/topology.h>
>  #include <asm/processor.h>
>  #include <asm/apic.h>
>  #include <asm/cacheinfo.h>
> @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
>  {
>  	set_cpu_cap(c, X86_FEATURE_ZEN);
>  

I'm thinking this deserves a comment. Traditionally the SLIT table held
relative memory latency. So where the identity is 10, 16 would indicate
1.6 times local latency and 32 would be 3.2 times local.

Now, even very early on BIOS monkeys went about their business and put
in random values in an attempt to 'tune' the system based on how
$random-os behaved, which is all sorts of fu^Wwrong.

Now, I suppose my question is; is that 32 Zen puts in an actual relative
memory latency metric, or a random value we somehow have to deal with.
And can we pretty please describe the whole sordid story behind this
'tunable' somewhere?

> +	node_reclaim_distance = 32;
> +
>  	/* Fix erratum 1076: CPB feature bit not being set in CPUID. */
>  	if (!cpu_has(c, X86_FEATURE_CPB))
>  		set_cpu_cap(c, X86_FEATURE_CPB);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC
  2019-06-18 12:33         ` Peter Zijlstra
@ 2019-06-19 21:34           ` Matt Fleming
  2019-06-24 14:24             ` Mel Gorman
  0 siblings, 1 reply; 11+ messages in thread
From: Matt Fleming @ 2019-06-19 21:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lendacky, Thomas, linux-kernel, Suthikulpanit, Suravee,
	Mel Gorman, Borislav Petkov

On Tue, 18 Jun, at 02:33:18PM, Peter Zijlstra wrote:
> On Tue, Jun 18, 2019 at 11:43:19AM +0100, Matt Fleming wrote:
> > This works for me under all my tests. Thoughts?
> > 
> > --->8---
> > 
> > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> > index 80a405c2048a..4db4e9e7654b 100644
> > --- a/arch/x86/kernel/cpu/amd.c
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -8,6 +8,7 @@
> >  #include <linux/sched.h>
> >  #include <linux/sched/clock.h>
> >  #include <linux/random.h>
> > +#include <linux/topology.h>
> >  #include <asm/processor.h>
> >  #include <asm/apic.h>
> >  #include <asm/cacheinfo.h>
> > @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
> >  {
> >  	set_cpu_cap(c, X86_FEATURE_ZEN);
> >  
> 
> I'm thinking this deserves a comment. Traditionally the SLIT table held
> relative memory latency. So where the identity is 10, 16 would indicate
> 1.6 times local latency and 32 would be 3.2 times local.
> 
> Now, even very early on BIOS monkeys went about their business and put
> in random values in an attempt to 'tune' the system based on how
> $random-os behaved, which is all sorts of fu^Wwrong.
> 
> Now, I suppose my question is; is that 32 Zen puts in an actual relative
> memory latency metric, or a random value we somehow have to deal with.
> And can we pretty please describe the whole sordid story behind this
> 'tunable' somewhere?

This is one for the AMD folks. I don't know if the memory latency
really is 3.2 times or not, only that that's the value in all the Zen
machines I have access to. Even this 2-socket one:

node distances:
node   0   1 
  0:  10  32 
  1:  32  10 

Tom, Suravee?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC
  2019-06-19 21:34           ` Matt Fleming
@ 2019-06-24 14:24             ` Mel Gorman
  2019-06-26 21:18               ` Suthikulpanit, Suravee
  0 siblings, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2019-06-24 14:24 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Peter Zijlstra, Lendacky, Thomas, linux-kernel, Suthikulpanit,
	Suravee, Borislav Petkov

On Wed, Jun 19, 2019 at 10:34:37PM +0100, Matt Fleming wrote:
> On Tue, 18 Jun, at 02:33:18PM, Peter Zijlstra wrote:
> > On Tue, Jun 18, 2019 at 11:43:19AM +0100, Matt Fleming wrote:
> > > This works for me under all my tests. Thoughts?
> > > 
> > > --->8---
> > > 
> > > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> > > index 80a405c2048a..4db4e9e7654b 100644
> > > --- a/arch/x86/kernel/cpu/amd.c
> > > +++ b/arch/x86/kernel/cpu/amd.c
> > > @@ -8,6 +8,7 @@
> > >  #include <linux/sched.h>
> > >  #include <linux/sched/clock.h>
> > >  #include <linux/random.h>
> > > +#include <linux/topology.h>
> > >  #include <asm/processor.h>
> > >  #include <asm/apic.h>
> > >  #include <asm/cacheinfo.h>
> > > @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
> > >  {
> > >  	set_cpu_cap(c, X86_FEATURE_ZEN);
> > >  
> > 
> > I'm thinking this deserves a comment. Traditionally the SLIT table held
> > relative memory latency. So where the identity is 10, 16 would indicate
> > 1.6 times local latency and 32 would be 3.2 times local.
> > 
> > Now, even very early on BIOS monkeys went about their business and put
> > in random values in an attempt to 'tune' the system based on how
> > $random-os behaved, which is all sorts of fu^Wwrong.
> > 
> > Now, I suppose my question is; is that 32 Zen puts in an actual relative
> > memory latency metric, or a random value we somehow have to deal with.
> > And can we pretty please describe the whole sordid story behind this
> > 'tunable' somewhere?
> 
> This is one for the AMD folks. I don't know if the memory latency
> really is 3.2 times or not, only that that's the value in all the Zen
> machines I have access to. Even this 2-socket one:
> 
> node distances:
> node   0   1 
>   0:  10  32 
>   1:  32  10 
> 
> Tom, Suravee?

Do not consider this an authorative response but based on what I know
of the physical topology, it is not unreasonable to use 32 in the SLIT
table. There is a small latency when accessing another die on the same
socket (details are generation specific). It's not quite a local access
but it's not as much as a traditional remote access either (hence 16 being
the base unit for another die to hint that it's not quite local but not
quite remote either). 32 is based on accessing a die on a remote socket
based on the expected performance and latency of the interconnect.

To the best of my knowledge, the magic numbers are reflective of the real
topology and not just a gamification of the numbers for a random OS. If
anything, the fact that there is a load balancing issue on Linux would
indicate that they were not picking random numbers for Linux at least :P

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC
  2019-06-24 14:24             ` Mel Gorman
@ 2019-06-26 21:18               ` Suthikulpanit, Suravee
  2019-06-28 15:15                 ` Matt Fleming
  0 siblings, 1 reply; 11+ messages in thread
From: Suthikulpanit, Suravee @ 2019-06-26 21:18 UTC (permalink / raw)
  To: Mel Gorman, Matt Fleming
  Cc: Peter Zijlstra, Lendacky, Thomas, linux-kernel, Borislav Petkov

On 6/24/19 9:24 AM, Mel Gorman wrote:
> On Wed, Jun 19, 2019 at 10:34:37PM +0100, Matt Fleming wrote:
>> On Tue, 18 Jun, at 02:33:18PM, Peter Zijlstra wrote:
>>> On Tue, Jun 18, 2019 at 11:43:19AM +0100, Matt Fleming wrote:
>>>> This works for me under all my tests. Thoughts?
>>>>
>>>> --->8---
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
>>>> index 80a405c2048a..4db4e9e7654b 100644
>>>> --- a/arch/x86/kernel/cpu/amd.c
>>>> +++ b/arch/x86/kernel/cpu/amd.c
>>>> @@ -8,6 +8,7 @@
>>>>   #include <linux/sched.h>
>>>>   #include <linux/sched/clock.h>
>>>>   #include <linux/random.h>
>>>> +#include <linux/topology.h>
>>>>   #include <asm/processor.h>
>>>>   #include <asm/apic.h>
>>>>   #include <asm/cacheinfo.h>
>>>> @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
>>>>   {
>>>>   	set_cpu_cap(c, X86_FEATURE_ZEN);
>>>>   
>>>
>>> I'm thinking this deserves a comment. Traditionally the SLIT table held
>>> relative memory latency. So where the identity is 10, 16 would indicate
>>> 1.6 times local latency and 32 would be 3.2 times local.
>>>
>>> Now, even very early on BIOS monkeys went about their business and put
>>> in random values in an attempt to 'tune' the system based on how
>>> $random-os behaved, which is all sorts of fu^Wwrong.
>>>
>>> Now, I suppose my question is; is that 32 Zen puts in an actual relative
>>> memory latency metric, or a random value we somehow have to deal with.
>>> And can we pretty please describe the whole sordid story behind this
>>> 'tunable' somewhere?
>>
>> This is one for the AMD folks. I don't know if the memory latency
>> really is 3.2 times or not, only that that's the value in all the Zen
>> machines I have access to. Even this 2-socket one:
>>
>> node distances:
>> node   0   1
>>    0:  10  32
>>    1:  32  10
>>
>> Tom, Suravee?
> 
> Do not consider this an authorative response but based on what I know
> of the physical topology, it is not unreasonable to use 32 in the SLIT
> table. There is a small latency when accessing another die on the same
> socket (details are generation specific). It's not quite a local access
> but it's not as much as a traditional remote access either (hence 16 being
> the base unit for another die to hint that it's not quite local but not
> quite remote either). 32 is based on accessing a die on a remote socket
> based on the expected performance and latency of the interconnect.
> 
> To the best of my knowledge, the magic numbers are reflective of the real
> topology and not just a gamification of the numbers for a random OS. If
> anything, the fact that there is a load balancing issue on Linux would
> indicate that they were not picking random numbers for Linux at least :P
> 

We use 16 to designate 1-hop latency (for different node within the same socket).
For across-socket access, since the latency is greater, we set the latency to 32
(twice the latency of 1-hop) not aware of the RECLAIM_DISTANCE at the time.

At this point, it might not be possible to change the SLIT values on
existing platforms out in the field. So, introducing the AMD family17h
quirk as Matt suggested would be a more feasible approach.

Going forward, we will make sure that this would not exceed the standard
RECLAIM_DISTANCE (30).

Thanks,
Suravee

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC
  2019-06-26 21:18               ` Suthikulpanit, Suravee
@ 2019-06-28 15:15                 ` Matt Fleming
  2019-07-22 14:11                   ` Suthikulpanit, Suravee
  0 siblings, 1 reply; 11+ messages in thread
From: Matt Fleming @ 2019-06-28 15:15 UTC (permalink / raw)
  To: Suthikulpanit, Suravee
  Cc: Mel Gorman, Peter Zijlstra, Lendacky, Thomas, linux-kernel,
	Borislav Petkov

On Wed, 26 Jun, at 09:18:01PM, Suthikulpanit, Suravee wrote:
> 
> We use 16 to designate 1-hop latency (for different node within the same socket).
> For across-socket access, since the latency is greater, we set the latency to 32
> (twice the latency of 1-hop) not aware of the RECLAIM_DISTANCE at the time.
 
I guess the question is: Is the memory latency of a remote node 1 hop
away 1.6x the local node latency?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC
  2019-06-28 15:15                 ` Matt Fleming
@ 2019-07-22 14:11                   ` Suthikulpanit, Suravee
  0 siblings, 0 replies; 11+ messages in thread
From: Suthikulpanit, Suravee @ 2019-07-22 14:11 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Mel Gorman, Peter Zijlstra, Lendacky, Thomas, linux-kernel,
	Borislav Petkov

Matt

On 6/28/2019 10:15 AM, Matt Fleming wrote:
> On Wed, 26 Jun, at 09:18:01PM, Suthikulpanit, Suravee wrote:
>>
>> We use 16 to designate 1-hop latency (for different node within the same socket).
>> For across-socket access, since the latency is greater, we set the latency to 32
>> (twice the latency of 1-hop) not aware of the RECLAIM_DISTANCE at the time.
>   
> I guess the question is: Is the memory latency of a remote node 1 hop
> away 1.6x the local node latency?
> 

Yes, remote node 1 hop is 1.6x the local node latency for AMD EPYC.

Suravee

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-07-22 14:11 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-05 15:59 [PATCH] sched/topology: Improve load balancing on AMD EPYC Matt Fleming
2019-06-05 18:00 ` Peter Zijlstra
2019-06-10 21:26   ` Matt Fleming
2019-06-11 17:22     ` Lendacky, Thomas
2019-06-18 10:43       ` Matt Fleming
2019-06-18 12:33         ` Peter Zijlstra
2019-06-19 21:34           ` Matt Fleming
2019-06-24 14:24             ` Mel Gorman
2019-06-26 21:18               ` Suthikulpanit, Suravee
2019-06-28 15:15                 ` Matt Fleming
2019-07-22 14:11                   ` Suthikulpanit, Suravee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).