* [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain
@ 2017-08-10 15:20 Suravee Suthikulpanit
  2017-08-10 16:41 ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Suravee Suthikulpanit @ 2017-08-10 15:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, bp, Suravee Suthikulpanit

On AMD Family 17h-based (EPYC) systems, a NUMA node can contain
up to 8 cores (16 threads) with the following topology.

             ----------------------------
         C0  | T0 T1 |    ||    | T0 T1 | C4
             --------|    ||    |--------
         C1  | T0 T1 | L3 || L3 | T0 T1 | C5
             --------|    ||    |--------
         C2  | T0 T1 | #0 || #1 | T0 T1 | C6
             --------|    ||    |--------
         C3  | T0 T1 |    ||    | T0 T1 | C7
             ----------------------------

Here, there are 2 last-level (L3) caches per NUMA node. A socket can
contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
With a fully populated system, the current scheduler creates 4 sched
domains:

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NUMA      (span a socket: 4 nodes)
  domain3 NUMA      (span a system: 8 nodes)

Note that there is no domain to represent the cpus spanning a NUMA node.
With this hierarchy of sched domains, the scheduler does not balance
properly in the following cases:

Case 1:
When running 8 tasks, a properly balanced system should
schedule one task per NUMA node. This is not the case with
the current scheduler.

Case 2:
When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
a properly balanced system should schedule the 8 threads on 8 cpus
(e.g. T0 of C0-C7).  However, the current scheduler would schedule
some threads on the same cpu while other cpus sit idle.

Introduce a NUMA identity node sched domain, based on how the
SRAT/SLIT tables define a NUMA node. This results in the following
hierarchy of sched domains on the same system described above.

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NUMA_IDEN (span a NUMA node)
  domain3 NUMA      (span a socket: 4 nodes)
  domain4 NUMA      (span a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
 kernel/sched/topology.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79895ae..c57df98 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1335,6 +1335,10 @@ void sched_init_numa(void)
 	if (!sched_domains_numa_distance)
 		return;
 
+	/* Includes NUMA identity node at level 0. */
+	sched_domains_numa_distance[level++] = curr_distance;
+	sched_domains_numa_levels = level;
+
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 	 * unique distances in the node_distance() table.
@@ -1382,8 +1386,7 @@ void sched_init_numa(void)
 		return;
 
 	/*
-	 * 'level' contains the number of unique distances, excluding the
-	 * identity distance node_distance(i,i).
+	 * 'level' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance
 	 * numbers.
@@ -1445,9 +1448,24 @@ void sched_init_numa(void)
 		tl[i] = sched_domain_topology[i];
 
 	/*
+	 * Ignore the NUMA identity level if it has the same cpumask
+	 * as previous level. This is the case for:
+	 *   - System with last-level-cache (MC) sched domain span a NUMA node.
+	 *   - System with DIE sched domain span a NUMA node.
+	 *
+	 * Assume all NUMA nodes are identical, so only check node 0.
+	 */
+	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
+		tl[i++] = (struct sched_domain_topology_level){
+			.mask = sd_numa_mask,
+			.numa_level = 0,
+			SD_INIT_NAME(NUMA_IDEN)
+		};
+
+	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
-	for (j = 0; j < level; i++, j++) {
+	for (j = 1; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
 			.sd_flags = cpu_numa_flags,
-- 
2.7.4


* Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain
  2017-08-10 15:20 [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain Suravee Suthikulpanit
@ 2017-08-10 16:41 ` Peter Zijlstra
  2017-08-11  4:57   ` Suravee Suthikulpanit
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2017-08-10 16:41 UTC (permalink / raw)
  To: Suravee Suthikulpanit; +Cc: linux-kernel, mingo, bp

On Thu, Aug 10, 2017 at 10:20:52AM -0500, Suravee Suthikulpanit wrote:
> On AMD Family 17h-based (EPYC) systems, a NUMA node can contain
> up to 8 cores (16 threads) with the following topology.
> 
>              ----------------------------
>          C0  | T0 T1 |    ||    | T0 T1 | C4
>              --------|    ||    |--------
>          C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>              --------|    ||    |--------
>          C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>              --------|    ||    |--------
>          C3  | T0 T1 |    ||    | T0 T1 | C7
>              ----------------------------
> 
> Here, there are 2 last-level (L3) caches per NUMA node. A socket can
> contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
> With a fully populated system, the current scheduler creates 4 sched
> domains:
> 
>   domain0 SMT       (span a core)
>   domain1 MC        (span a last-level-cache)

Right, so traditionally we'd have the DIE level do that, but because
x86_has_numa_in_package we don't generate that, right?

>   domain2 NUMA      (span a socket: 4 nodes)
>   domain3 NUMA      (span a system: 8 nodes)
> 
> Note that there is no domain to represent the cpus spanning a NUMA node.
> With this hierarchy of sched domains, the scheduler does not balance
> properly in the following cases:
> 
> Case1:
> When running 8 tasks, a properly balanced system should
> schedule a task per NUMA node. This is not the case for
> the current scheduler.
> 
> Case2:
> When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
> a properly balanced system should schedule 8 threads on 8 cpus
> (e.g. T0 of C0-C7).  However, current scheduler would schedule
> some threads on the same cpu, while others are idle.

Sure.. could you amend with a few actual performance numbers?

> Introduce a NUMA identity node sched domain, based on how the
> SRAT/SLIT tables define a NUMA node. This results in the following
> hierarchy of sched domains on the same system described above.
> 
>   domain0 SMT       (span a core)
>   domain1 MC        (span a last-level-cache)
>   domain2 NUMA_IDEN (span a NUMA node)

Hate that name though..

>   domain3 NUMA      (span a socket: 4 nodes)
>   domain4 NUMA      (span a system: 8 nodes)
> 
> This fixes the improper load balancing cases mentioned above.
> 
> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> ---
>  kernel/sched/topology.c | 24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79895ae..c57df98 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1335,6 +1335,10 @@ void sched_init_numa(void)
>  	if (!sched_domains_numa_distance)
>  		return;
>  
> +	/* Includes NUMA identity node at level 0. */
> +	sched_domains_numa_distance[level++] = curr_distance;
> +	sched_domains_numa_levels = level;
> +
>  	/*
>  	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>  	 * unique distances in the node_distance() table.
> @@ -1382,8 +1386,7 @@ void sched_init_numa(void)
>  		return;
>  
>  	/*
> -	 * 'level' contains the number of unique distances, excluding the
> -	 * identity distance node_distance(i,i).
> +	 * 'level' contains the number of unique distances
>  	 *
>  	 * The sched_domains_numa_distance[] array includes the actual distance
>  	 * numbers.
> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>  		tl[i] = sched_domain_topology[i];
>  
>  	/*
> +	 * Ignore the NUMA identity level if it has the same cpumask
> +	 * as previous level. This is the case for:
> +	 *   - System with last-level-cache (MC) sched domain span a NUMA node.
> +	 *   - System with DIE sched domain span a NUMA node.
> +	 *
> +	 * Assume all NUMA nodes are identical, so only check node 0.
> +	 */
> +	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
> +		tl[i++] = (struct sched_domain_topology_level){
> +			.mask = sd_numa_mask,
> +			.numa_level = 0,
> +			SD_INIT_NAME(NUMA_IDEN)

Shall we make that:

			SD_INIT_NAME(NODE),

instead?

> +		};

This misses a set of '{}'. While C doesn't require it, our coding style
warrants blocks around any multi-line statement.

So what you've forgotten to mention is that for those systems where the
LLC == NODE this now superfluous level gets removed by the degenerate
code. Have you verified that does the right thing?

> +
> +	/*
>  	 * .. and append 'j' levels of NUMA goodness.
>  	 */
> -	for (j = 0; j < level; i++, j++) {
> +	for (j = 1; j < level; i++, j++) {
>  		tl[i] = (struct sched_domain_topology_level){
>  			.mask = sd_numa_mask,
>  			.sd_flags = cpu_numa_flags,
> -- 
> 2.7.4
> 


* Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain
  2017-08-10 16:41 ` Peter Zijlstra
@ 2017-08-11  4:57   ` Suravee Suthikulpanit
  2017-08-11  5:58     ` Suravee Suthikulpanit
  0 siblings, 1 reply; 8+ messages in thread
From: Suravee Suthikulpanit @ 2017-08-11  4:57 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, bp

On 8/10/17 23:41, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 10:20:52AM -0500, Suravee Suthikulpanit wrote:
>> On AMD Family 17h-based (EPYC) systems, a NUMA node can contain
>> up to 8 cores (16 threads) with the following topology.
>>
>>              ----------------------------
>>          C0  | T0 T1 |    ||    | T0 T1 | C4
>>              --------|    ||    |--------
>>          C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>>              --------|    ||    |--------
>>          C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>>              --------|    ||    |--------
>>          C3  | T0 T1 |    ||    | T0 T1 | C7
>>              ----------------------------
>>
>> Here, there are 2 last-level (L3) caches per NUMA node. A socket can
>> contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
>> With a fully populated system, the current scheduler creates 4 sched
>> domains:
>>
>>   domain0 SMT       (span a core)
>>   domain1 MC        (span a last-level-cache)
>
> Right, so traditionally we'd have the DIE level do that, but because
> x86_has_numa_in_package we don't generate that, right?

That's correct.

>
>>   domain2 NUMA      (span a socket: 4 nodes)
>>   domain3 NUMA      (span a system: 8 nodes)
>>
>> Note that there is no domain to represent the cpus spanning a NUMA node.
>> With this hierarchy of sched domains, the scheduler does not balance
>> properly in the following cases:
>>
>> Case1:
>> When running 8 tasks, a properly balanced system should
>> schedule a task per NUMA node. This is not the case for
>> the current scheduler.
>>
>> Case2:
>> When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
>> a properly balanced system should schedule 8 threads on 8 cpus
>> (e.g. T0 of C0-C7).  However, current scheduler would schedule
>> some threads on the same cpu, while others are idle.
>
> Sure.. could you amend with a few actual performance numbers?

Sure.

>> [...]
>> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>>  		tl[i] = sched_domain_topology[i];
>>
>>  	/*
>> +	 * Ignore the NUMA identity level if it has the same cpumask
>> +	 * as previous level. This is the case for:
>> +	 *   - System with last-level-cache (MC) sched domain span a NUMA node.
>> +	 *   - System with DIE sched domain span a NUMA node.
>> +	 *
>> +	 * Assume all NUMA nodes are identical, so only check node 0.
>> +	 */
>> +	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
>> +		tl[i++] = (struct sched_domain_topology_level){
>> +			.mask = sd_numa_mask,
>> +			.numa_level = 0,
>> +			SD_INIT_NAME(NUMA_IDEN)
>
> Shall we make that:
>
> 			SD_INIT_NAME(NODE),
>
> instead?

Sounds good.

>> +		};
>
> This misses a set of '{}'. While C doesn't require it, our coding style
> warrants blocks around any multi-line statement.
>
> So what you've forgotten to mention is that for those systems where the
> LLC == NODE this now superfluous level gets removed by the degenerate
> code. Have you verified that does the right thing?

Let me check with that one and get back.

Thanks,
Suravee


* Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain
  2017-08-11  4:57   ` Suravee Suthikulpanit
@ 2017-08-11  5:58     ` Suravee Suthikulpanit
  2017-08-11  9:15       ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Suravee Suthikulpanit @ 2017-08-11  5:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, bp



On 8/11/17 11:57, Suravee Suthikulpanit wrote:
>
>>> [...]
>>> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>>>          tl[i] = sched_domain_topology[i];
>>>
>>>      /*
>>> +     * Ignore the NUMA identity level if it has the same cpumask
>>> +     * as previous level. This is the case for:
>>> +     *   - System with last-level-cache (MC) sched domain span a NUMA node.
>>> +     *   - System with DIE sched domain span a NUMA node.
>>> +     *
>>> +     * Assume all NUMA nodes are identical, so only check node 0.
>>> +     */
>>> +    if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
>>> +        tl[i++] = (struct sched_domain_topology_level){
>>> +            .mask = sd_numa_mask,
>>> +            .numa_level = 0,
>>> +            SD_INIT_NAME(NODE)
>>> +        };
>>
>> So what you've forgotten to mention is that for those systems where the
>> LLC == NODE this now superfluous level gets removed by the degenerate
>> code. Have you verified that does the right thing?
>
> Let me check with that one and get back.

Actually, it is not removed by the degenerate code. That is what this logic is 
for. It checks for LLC == NODE or DIE == NODE before setting up the NODE sched 
level. I can update the comment. This has also been tested on a system w/ LLC == NODE.

Thanks,
Suravee


* Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain
  2017-08-11  5:58     ` Suravee Suthikulpanit
@ 2017-08-11  9:15       ` Peter Zijlstra
  2017-08-14  7:44         ` Suravee Suthikulpanit
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2017-08-11  9:15 UTC (permalink / raw)
  To: Suravee Suthikulpanit; +Cc: linux-kernel, mingo, bp

On Fri, Aug 11, 2017 at 12:58:22PM +0700, Suravee Suthikulpanit wrote:
> 
> 
> On 8/11/17 11:57, Suravee Suthikulpanit wrote:
> > 
> > > > [...]
> > > > @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
> > > >          tl[i] = sched_domain_topology[i];
> > > > 
> > > >      /*
> > > > +     * Ignore the NUMA identity level if it has the same cpumask
> > > > +     * as previous level. This is the case for:
> > > > +     *   - System with last-level-cache (MC) sched domain span a NUMA node.
> > > > +     *   - System with DIE sched domain span a NUMA node.
> > > > +     *
> > > > +     * Assume all NUMA nodes are identical, so only check node 0.
> > > > +     */
> > > > +    if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
> > > > +        tl[i++] = (struct sched_domain_topology_level){
> > > > +            .mask = sd_numa_mask,
> > > > +            .numa_level = 0,
> > > > +            SD_INIT_NAME(NODE)
> > > > +        };
> > > 
> > > So what you've forgotten to mention is that for those systems where the
> > > LLC == NODE this now superfluous level gets removed by the degenerate
> > > code. Have you verified that does the right thing?
> > 
> > Let me check with that one and get back.
> 
> Actually, it is not removed by the degenerate code. That is what this logic
> is for. It checks for LLC == NODE or DIE == NODE before setting up the NODE
> sched level. I can update the comment. This has also been tested on a system
> w/ LLC == NODE.

Why does the degenerate code fail to remove things?


* Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain
  2017-08-11  9:15       ` Peter Zijlstra
@ 2017-08-14  7:44         ` Suravee Suthikulpanit
  2017-08-24  1:14           ` Suravee Suthikulpanit
  2017-10-02 12:43           ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Suravee Suthikulpanit @ 2017-08-14  7:44 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, bp



On 8/11/17 16:15, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 12:58:22PM +0700, Suravee Suthikulpanit wrote:
>>
>>
>> On 8/11/17 11:57, Suravee Suthikulpanit wrote:
>>>
>>>>> [...]
>>>>> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>>>>>          tl[i] = sched_domain_topology[i];
>>>>>
>>>>>      /*
>>>>> +     * Ignore the NUMA identity level if it has the same cpumask
>>>>> +     * as previous level. This is the case for:
>>>>> +     *   - System with last-level-cache (MC) sched domain span a NUMA node.
>>>>> +     *   - System with DIE sched domain span a NUMA node.
>>>>> +     *
>>>>> +     * Assume all NUMA nodes are identical, so only check node 0.
>>>>> +     */
>>>>> +    if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
>>>>> +        tl[i++] = (struct sched_domain_topology_level){
>>>>> +            .mask = sd_numa_mask,
>>>>> +            .numa_level = 0,
>>>>> +            SD_INIT_NAME(NODE)
>>>>> +        };
>>>>
>>>> So what you've forgotten to mention is that for those systems where the
>>>> LLC == NODE this now superfluous level gets removed by the degenerate
>>>> code. Have you verified that does the right thing?
>>>
>>> Let me check with that one and get back.
>>
>> Actually, it is not removed by the degenerate code. That is what this logic
>> is for. It checks for LLC == NODE or DIE == NODE before setting up the NODE
>> sched level. I can update the comment. This has also been tested on a system
>> w/ LLC == NODE.
>
> Why does the degenerate code fail to remove things?
>

Sorry for the confusion. Actually, the degenerate code does remove the duplicate 
NODE sched-domain.

The logic above takes a different approach. Instead of depending on the 
degenerate code during cpu_attach_domain() at a later time, it excludes the 
NODE sched-domain during sched_init_numa().  The difference is that, without 
the !cpumask_equal() check, the MC sched-domain would now have the 
SD_PREFER_SIBLING flag set by the degenerate code, since the flag gets 
transferred down from the NODE to the MC sched-domain. Would this be the 
preferred behavior for the MC sched-domain?

Regards,
Suravee


* Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain
  2017-08-14  7:44         ` Suravee Suthikulpanit
@ 2017-08-24  1:14           ` Suravee Suthikulpanit
  2017-10-02 12:43           ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Suravee Suthikulpanit @ 2017-08-24  1:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, bp

Hi Peter,

On 8/14/17 14:44, Suravee Suthikulpanit wrote:
>
>
> On 8/11/17 16:15, Peter Zijlstra wrote:
>> On Fri, Aug 11, 2017 at 12:58:22PM +0700, Suravee Suthikulpanit wrote:
>>>
>>>
>>> On 8/11/17 11:57, Suravee Suthikulpanit wrote:
>>>>
>>>>>> [...]
>>>>>> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>>>>>>          tl[i] = sched_domain_topology[i];
>>>>>>
>>>>>>      /*
>>>>>> +     * Ignore the NUMA identity level if it has the same cpumask
>>>>>> +     * as previous level. This is the case for:
>>>>>> +     *   - System with last-level-cache (MC) sched domain span a NUMA node.
>>>>>> +     *   - System with DIE sched domain span a NUMA node.
>>>>>> +     *
>>>>>> +     * Assume all NUMA nodes are identical, so only check node 0.
>>>>>> +     */
>>>>>> +    if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
>>>>>> +        tl[i++] = (struct sched_domain_topology_level){
>>>>>> +            .mask = sd_numa_mask,
>>>>>> +            .numa_level = 0,
>>>>>> +            SD_INIT_NAME(NODE)
>>>>>> +        };
>>>>>
>>>>> So what you've forgotten to mention is that for those systems where the
>>>>> LLC == NODE this now superfluous level gets removed by the degenerate
>>>>> code. Have you verified that does the right thing?
>>>>
>>>> Let me check with that one and get back.
>>>
>>> Actually, it is not removed by the degenerate code. That is what this logic
>>> is for. It checks for LCC == NODE or DIE == NODE before setting up the NODE
>>> sched level. I can update the comment. This has also been tested on system
>>> w/ LLC == NODE.
>>
>> Why does the degenerate code fail to remove things?
>>
>
> Sorry for the confusion. Actually, the degenerate code does remove the duplicate
> NODE sched-domain.
>
> The logic above takes a different approach. Instead of depending on the
> degenerate code during cpu_attach_domain() at a later time, it excludes the
> NODE sched-domain during sched_init_numa().  The difference is that, without
> the !cpumask_equal() check, the MC sched-domain would now have the
> SD_PREFER_SIBLING flag set by the degenerate code, since the flag gets
> transferred down from the NODE to the MC sched-domain. Would this be the
> preferred behavior for the MC sched-domain?
>
> Regards,
> Suravee

Any feedback on this part?

Thanks,
Suravee


* Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain
  2017-08-14  7:44         ` Suravee Suthikulpanit
  2017-08-24  1:14           ` Suravee Suthikulpanit
@ 2017-10-02 12:43           ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2017-10-02 12:43 UTC (permalink / raw)
  To: Suravee Suthikulpanit; +Cc: linux-kernel, mingo, bp


Sorry, lost track of this thing..

On Mon, Aug 14, 2017 at 02:44:59PM +0700, Suravee Suthikulpanit wrote:

> > Why does the degenerate code fail to remove things?
> > 
> 
> Sorry for the confusion. Actually, the degenerate code does remove the duplicate
> NODE sched-domain.
> 
> The difference is that, without the !cpumask_equal() check, the MC
> sched-domain would now have the SD_PREFER_SIBLING flag set by the
> degenerate code, since the flag gets transferred down from the NODE to
> the MC sched-domain. Would this be the preferred behavior for the MC
> sched-domain?

Right, but the 'normal' x86_topology setup will have SD_PREFER_SIBLING
set on the MC domain by virtue of merging the MC and CPU domains. So the
first layer of NUMA spreads.

Now, that appears not to be the case for x86_numa_in_package_topology,
since that doesn't do the CPU domain, but your patch would effectively
'fix' that if it were to not have that conditional. But it would keep
the PREFER_SIBLING at the NODE level, MC would not get it.

But I'm wondering if we're not doing this all a bit backwards. Having
SD_PREFER_SIBLING on the L3 makes sense in that it will spread workload
across the machine and maximize cache utilization. But we don't set it
directly, instead we inherit it through merging.

So commit 8e7fbcbc22c1 ("sched: Remove stale power aware scheduling
remnants and dysfunctional knobs") made a fail by not preserving the
SD_PREFER_SIBLING for the !power_saving case on both CPU and MC.

Then commit 6956dc568f34 ("sched/numa: Add SD_PERFER_SIBLING to CPU
domain") adds it back to the CPU but not MC.

So I'm thinking it makes sense to restore the explicit MC one and make your
NODE domain unconditional.

That would make behaviour consistent between the various modes;
something like the below. I'll go try and update your Changelog so we
can get this merged.

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1157,6 +1157,7 @@ sd_init(struct sched_domain_topology_lev
 		sd->smt_gain = 1178; /* ~15% */
 
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 117;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
@@ -1331,6 +1332,10 @@ void sched_init_numa(void)
 	if (!sched_domains_numa_distance)
 		return;
 
+	/* Includes NUMA identity node at level 0. */
+	sched_domains_numa_distance[level++] = curr_distance;
+	sched_domains_numa_levels = level;
+
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 	 * unique distances in the node_distance() table.
@@ -1378,8 +1383,7 @@ void sched_init_numa(void)
 		return;
 
 	/*
-	 * 'level' contains the number of unique distances, excluding the
-	 * identity distance node_distance(i,i).
+	 * 'level' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance
 	 * numbers.
@@ -1441,9 +1445,18 @@ void sched_init_numa(void)
 		tl[i] = sched_domain_topology[i];
 
 	/*
+	 * Add the NUMA identity distance, aka single NODE.
+	 */
+	tl[i++] = (struct sched_domain_topology_level){
+		.mask = sd_numa_mask,
+		.numa_level = 0,
+		SD_INIT_NAME(NODE)
+	};
+
+	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
-	for (j = 0; j < level; i++, j++) {
+	for (j = 1; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
 			.sd_flags = cpu_numa_flags,

