* [PATCH v2] sched/topology: Introduce NUMA identity node sched domain
@ 2017-08-11  7:41 Suravee Suthikulpanit
  2017-08-14 17:43 ` Borislav Petkov
  0 siblings, 1 reply; 4+ messages in thread
From: Suravee Suthikulpanit @ 2017-08-11  7:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, bp, Suravee Suthikulpanit, stable

On AMD Family 17h-based (EPYC) systems, a NUMA node can contain
up to 8 cores (16 threads) with the following topology.

             ----------------------------
         C0  | T0 T1 |    ||    | T0 T1 | C4
             --------|    ||    |--------
         C1  | T0 T1 | L3 || L3 | T0 T1 | C5
             --------|    ||    |--------
         C2  | T0 T1 | #0 || #1 | T0 T1 | C6
             --------|    ||    |--------
         C3  | T0 T1 |    ||    | T0 T1 | C7
             ----------------------------

Here, there are 2 last-level (L3) caches per NUMA node. A socket can
contain upto 4 NUMA nodes, and a system can support upto 2 sockets.
With full system configuration, current scheduler creates 4 sched
domains:

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NUMA      (span a socket: 4 nodes)
  domain3 NUMA      (span a system: 8 nodes)
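
As an illustrative sketch (not kernel code; linear CPU numbering is an
assumption), the span widths at each level for CPU 0 on the fully
populated 128-CPU system look like this. Note that there is no 16-CPU
level corresponding to a single NUMA node:

```python
# Sketch of what each sched domain level spans for CPU 0 on the fully
# populated system (2 sockets x 4 nodes x 8 cores x 2 threads = 128 CPUs).
# The 16-CPU span of one NUMA node has no corresponding level.
NCPUS = 128
spans = {
    "SMT (core)":    2,    # 2 hardware threads per core
    "MC (L3)":       8,    # 4 cores x 2 threads share one L3
    "NUMA (socket)": 64,   # 4 nodes x 16 CPUs
    "NUMA (system)": 128,  # 8 nodes
}
for name, width in spans.items():
    print(f"{name:14} -> cpus 0-{width - 1}")
```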

Note that there is no domain representing the cpus spanning a NUMA node.
With this hierarchy of sched domains, the scheduler does not balance
properly in the following cases:

Case 1:
When running 8 tasks, a properly balanced system should
schedule one task per NUMA node. The current scheduler does
not achieve this.

Case 2:
Sometimes threads are scheduled onto the same cpu while other
cpus are idle, which results in run-to-run inconsistency. For example:

  taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
                          --cpu-max-prime=100000 run

Total execution time ranges from 25.1s to 33.5s depending on thread
placement, where 25.1s is achieved when all 8 threads are balanced
properly across the 8 cpus.

Introduce a NUMA identity node sched domain, based on how the ACPI
SRAT/SLIT tables define a NUMA node. This results in the following
hierarchy of sched domains on the same system described above:

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NODE      (span a NUMA node)
  domain3 NUMA      (span a socket: 4 nodes)
  domain4 NUMA      (span a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.
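
A minimal model (not kernel code; the SLIT distances are assumptions
chosen to match the topology above) of the unique-distance table that
sched_init_numa() builds after this patch, with the identity distance
node_distance(i, i) now occupying level 0:

```python
# Illustrative model of the unique-distance levels after this patch:
# the identity distance (LOCAL_DISTANCE, 10 in the ACPI SLIT) occupies
# level 0 and gives the NODE sched domain its own level.
LOCAL_DISTANCE = 10

# Assumed SLIT for an 8-node, 2-socket system: 10 = same node,
# 16 = same socket, 32 = remote socket (numbers are illustrative).
node_distance = [
    [10 if i == j else (16 if i // 4 == j // 4 else 32) for j in range(8)]
    for i in range(8)
]

def numa_levels(dist):
    """Unique distances in ascending order, identity distance first."""
    return sorted({d for row in dist for d in row})

print(numa_levels(node_distance))  # [10, 16, 32]
```

The three resulting levels map to NODE (a NUMA node), NUMA (a socket),
and NUMA (the system), with SMT and MC below them.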

Cc: stable@vger.kernel.org
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
Changes from V1 (https://lkml.org/lkml/2017/8/10/540)
  * Update commit message to include performance number.
  * Change from NUMA_IDEN to NODE.
  * Fix code styling and update comments.

 kernel/sched/topology.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79895ae..2dd5b11 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1335,6 +1335,10 @@ void sched_init_numa(void)
 	if (!sched_domains_numa_distance)
 		return;
 
+	/* Includes NUMA identity node at level 0. */
+	sched_domains_numa_distance[level++] = curr_distance;
+	sched_domains_numa_levels = level;
+
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 	 * unique distances in the node_distance() table.
@@ -1382,8 +1386,7 @@ void sched_init_numa(void)
 		return;
 
 	/*
-	 * 'level' contains the number of unique distances, excluding the
-	 * identity distance node_distance(i,i).
+	 * 'level' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance
 	 * numbers.
@@ -1445,9 +1448,26 @@ void sched_init_numa(void)
 		tl[i] = sched_domain_topology[i];
 
 	/*
+	 * Do not set up the NUMA node level if it has the same cpumask
+	 * as the sched domain at the previous level. This is the case
+	 * for systems where:
+	 *  - LLC == NODE : the LLC (MC) sched domain spans a NUMA node.
+	 *  - DIE == NODE : the DIE sched domain spans a NUMA node.
+	 *
+	 * Assume all NUMA nodes are identical, so only check node 0.
+	 */
+	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
+		tl[i++] = (struct sched_domain_topology_level){
+			.mask = sd_numa_mask,
+			.numa_level = 0,
+			SD_INIT_NAME(NODE)
+		};
+	}
+
+	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
-	for (j = 0; j < level; i++, j++) {
+	for (j = 1; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
 			.sd_flags = cpu_numa_flags,
-- 
2.7.4


* Re: [PATCH v2] sched/topology: Introduce NUMA identity node sched domain
  2017-08-11  7:41 [PATCH v2] sched/topology: Introduce NUMA identity node sched domain Suravee Suthikulpanit
@ 2017-08-14 17:43 ` Borislav Petkov
  2017-08-24  1:13   ` Suravee Suthikulpanit
  0 siblings, 1 reply; 4+ messages in thread
From: Borislav Petkov @ 2017-08-14 17:43 UTC (permalink / raw)
  To: Suravee Suthikulpanit; +Cc: linux-kernel, mingo, peterz

Dropping stable@ from CC.

When using git send-email, make sure to exclude stable@ from the list of
recipients as this is not how we send a patch to stable.

On Fri, Aug 11, 2017 at 02:41:15AM -0500, Suravee Suthikulpanit wrote:
> On AMD Family 17h-based (EPYC) systems, a NUMA node can contain
> up to 8 cores (16 threads) with the following topology.
> 
>              ----------------------------
>          C0  | T0 T1 |    ||    | T0 T1 | C4
>              --------|    ||    |--------
>          C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>              --------|    ||    |--------
>          C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>              --------|    ||    |--------
>          C3  | T0 T1 |    ||    | T0 T1 | C7
>              ----------------------------
> 
> Here, there are 2 last-level (L3) caches per NUMA node. A socket can
> contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
> In a fully populated configuration, the current scheduler creates 4
> sched domains:
> 
>   domain0 SMT       (span a core)
>   domain1 MC        (span a last-level-cache)
>   domain2 NUMA      (span a socket: 4 nodes)
>   domain3 NUMA      (span a system: 8 nodes)
> 
> Note that there is no domain representing the cpus spanning a NUMA node.

...

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79895ae..2dd5b11 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1335,6 +1335,10 @@ void sched_init_numa(void)
>  	if (!sched_domains_numa_distance)
>  		return;
>  
> +	/* Includes NUMA identity node at level 0. */
> +	sched_domains_numa_distance[level++] = curr_distance;
> +	sched_domains_numa_levels = level;
> +
>  	/*
>  	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>  	 * unique distances in the node_distance() table.
> @@ -1382,8 +1386,7 @@ void sched_init_numa(void)
>  		return;
>  
>  	/*
> -	 * 'level' contains the number of unique distances, excluding the
> -	 * identity distance node_distance(i,i).

I'm still unclear as to why we were excluding this identity distance
until now, and how that change would affect existing systems.

Also, you do use the term "NUMA" pretty loosely in the text - please
take care to explain precisely what kind of node you mean: physical,
logical, ... Don't be afraid to be too verbose.

Thanks.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* Re: [PATCH v2] sched/topology: Introduce NUMA identity node sched domain
  2017-08-14 17:43 ` Borislav Petkov
@ 2017-08-24  1:13   ` Suravee Suthikulpanit
  2017-08-24  8:37     ` Borislav Petkov
  0 siblings, 1 reply; 4+ messages in thread
From: Suravee Suthikulpanit @ 2017-08-24  1:13 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-kernel, mingo, peterz

Boris,

Sorry for the late reply. I missed this email earlier.

On 8/15/17 00:43, Borislav Petkov wrote:
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 79895ae..2dd5b11 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1335,6 +1335,10 @@ void sched_init_numa(void)
>>  	if (!sched_domains_numa_distance)
>>  		return;
>>
>> +	/* Includes NUMA identity node at level 0. */
>> +	sched_domains_numa_distance[level++] = curr_distance;
>> +	sched_domains_numa_levels = level;
>> +
>>  	/*
>>  	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>>  	 * unique distances in the node_distance() table.
>> @@ -1382,8 +1386,7 @@ void sched_init_numa(void)
>>  		return;
>>
>>  	/*
>> -	 * 'level' contains the number of unique distances, excluding the
>> -	 * identity distance node_distance(i,i).
> I'm still unclear as to why we were excluding this identity distance
> until now, and how that change would affect existing systems.

One reason I can think of is that, on older systems (at least for AMD), the
L3 sched domain level has the same cpumask as the node level. This is not
the case for Zen.
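
A sketch (hypothetical masks, not kernel code) of the cpumask_equal()
guard the patch adds, which is what keeps the NODE level from appearing
on such older systems:

```python
# The NODE topology level is inserted only when the NUMA node's cpumask
# differs from the previous level's (e.g. the LLC/MC) cpumask.
def topology_levels(llc_mask, node_mask):
    levels = ["SMT", "MC"]
    if node_mask != llc_mask:      # cpumask_equal() in sched_init_numa()
        levels.append("NODE")
    levels.append("NUMA")          # further NUMA levels follow as needed
    return levels

# Older AMD parts: one L3 spans the whole node -> NODE is redundant.
print(topology_levels(frozenset(range(8)), frozenset(range(8))))
# -> ['SMT', 'MC', 'NUMA']
# Zen: two L3s per node -> NODE appears between MC and NUMA.
print(topology_levels(frozenset(range(8)), frozenset(range(16))))
# -> ['SMT', 'MC', 'NODE', 'NUMA']
```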

> Also, you do use the term "NUMA" pretty loosely in the text - please
> take care to explain precisely what kind of node you mean: physical,
> logical, ... Don't be afraid to be too verbose.
>
> Thanks.

Mostly, I am referring to the logical NUMA node (as described by the ACPI
SRAT/SLIT tables), which can change based on how the BIOS configures the
system. I will update this in V3.

Thanks,
Suravee


* Re: [PATCH v2] sched/topology: Introduce NUMA identity node sched domain
  2017-08-24  1:13   ` Suravee Suthikulpanit
@ 2017-08-24  8:37     ` Borislav Petkov
  0 siblings, 0 replies; 4+ messages in thread
From: Borislav Petkov @ 2017-08-24  8:37 UTC (permalink / raw)
  To: Suravee Suthikulpanit; +Cc: linux-kernel, mingo, peterz

On Thu, Aug 24, 2017 at 08:13:23AM +0700, Suravee Suthikulpanit wrote:
> One reason I can think of is that, on older systems (at least for AMD),
> the L3 sched domain level has the same cpumask as the node level. This
> is not the case for Zen.

So please write that in the commit message. You're changing code which
runs on older systems so the impact (or the absence of it) has to be
clear.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

