* [PATCH] sched/topology: Assert non-NUMA topology masks don't (partially) overlap
@ 2020-01-15 16:09 Valentin Schneider
  2020-01-16 10:44 ` Peter Zijlstra
  2020-01-17 10:08 ` [tip: sched/core] " tip-bot2 for Valentin Schneider
  0 siblings, 2 replies; 6+ messages in thread
From: Valentin Schneider @ 2020-01-15 16:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: sudeep.holla, prime.zeng, dietmar.eggemann, morten.rasmussen,
	peterz, mingo

topology.c::get_group() relies on the assumption that non-NUMA domains do
not partially overlap. Zeng Tao pointed out in [1] that such topology
descriptions, while completely bogus, can end up being exposed to the
scheduler.

In his example (8 CPUs, 2-node system), we end up with:
  MC span for CPU3 == 3-7
  MC span for CPU4 == 4-7

The first pass through get_group(3, sdd@MC) will result in the following
sched_group list:

  3 -> 4 -> 5 -> 6 -> 7
  ^                  /
   `----------------'

And a later pass through get_group(4, sdd@MC) will "corrupt" that to:

  3 -> 4 -> 5 -> 6 -> 7
       ^             /
	`-----------'

which will completely break things like 'while (sg != sd->groups)' when
using CPU3's base sched_domain.
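The list corruption above can be modeled in a few lines of Python (an illustrative sketch of get_group()'s linking, not the kernel code; `link_span` and the dict-of-next-pointers representation are invented for this example):

```python
# Hypothetical model of get_group() linking: each CPU's sched_group is a
# node in a circular singly-linked list covering that CPU's span, with
# groups[c] acting as CPU c's "next" pointer.
def link_span(groups, span):
    """(Re)link the groups of 'span' into a circular list, as a
    get_group() pass over that span would."""
    for a, b in zip(span, span[1:]):
        groups[a] = b               # a's next pointer
    groups[span[-1]] = span[0]      # close the circle

groups = {}
link_span(groups, [3, 4, 5, 6, 7])  # first pass: CPU3's MC span
assert groups[7] == 3               # 7 -> 3: list closed on CPU3

link_span(groups, [4, 5, 6, 7])     # later pass: CPU4's partially
assert groups[7] == 4               # overlapping span re-links 7 -> 4

# A walk starting from CPU3's group now falls into the 4->5->6->7
# cycle and never gets back to 3; cap the iterations to show it.
sg, steps = groups[3], 0
while sg != 3 and steps < 20:
    sg = groups[sg]
    steps += 1
assert steps == 20                  # loop never returned to 3
```

Starting from CPU3's group, the walk stays in the 4->5->6->7 cycle forever, which is exactly the 'while (sg != sd->groups)' breakage described above.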


There already are some architecture-specific checks in place, such as
x86/kernel/smpboot.c::topology_sane(), but this is something we can detect
in the core scheduler, so it seems worthwhile to do so.

Warn and abort the construction of the sched domains if such a broken
topology description is detected. Note that this is somewhat
expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
gated under SCHED_DEBUG if deemed necessary.
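The invariant being asserted (any two masks at a non-NUMA level must be either equal or disjoint) can be modeled in plain Python (a hypothetical sketch; `span_sane` and the set-based masks are illustrative, the real check below operates on kernel cpumasks):

```python
# Illustrative model of the equal-or-disjoint invariant, with topology
# spans represented as Python sets keyed by CPU number.
def span_sane(spans):
    """Return False if any two spans partially overlap, i.e. they
    intersect without being equal."""
    for i in spans:
        for j in spans:
            if i == j:
                continue
            if spans[i] != spans[j] and spans[i] & spans[j]:
                return False
    return True

# Sane MC level: spans equal within a cluster, disjoint across clusters.
assert span_sane({0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3}})

# The broken topology from above: CPU3 spans 3-7, CPU4 spans 4-7 -
# they intersect but are not equal, so the level is rejected.
assert not span_sane({3: set(range(3, 8)), 4: set(range(4, 8))})
```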

Testing
=======

Dietmar managed to reproduce this using the following qemu incantation:

  $ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
  -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
  cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
  node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1

alongside the following drivers/base/arch_topology.c hack (AIUI this
wouldn't be needed if '-smp cores=X,sockets=Y' worked with qemu):

8<---
@@ -465,6 +465,9 @@ void update_siblings_masks(unsigned int cpuid)
 		if (cpuid_topo->package_id != cpu_topo->package_id)
 			continue;

+		if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
+			continue;
+
 		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
 		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

8<---

[1]: https://lkml.kernel.org/r/1577088979-8545-1-git-send-email-prime.zeng@hisilicon.com

Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
A "less intrusive" alternative is to assert the sd->groups list doesn't get
re-written, which is a symptom of such bogus topologies. I've briefly
tested this, you can have a look at it here:

  http://www.linux-arm.org/git?p=linux-vs.git;a=commit;h=e0ead72137332cbd3d69c9055ab29e6ffae5b37b

fetchable via:
  git fetch git://linux-arm.org/linux-vs mainline/topology/non_numa_overlap_sg
---
 kernel/sched/topology.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6ec1e595b1d4..dfb64c08a407 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1879,6 +1879,42 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 	return sd;
 }
 
+/*
+ * Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
+ * any two given CPUs at this (non-NUMA) topology level.
+ */
+static bool topology_span_sane(struct sched_domain_topology_level *tl,
+			      const struct cpumask *cpu_map, int cpu)
+{
+	int i;
+
+	/* NUMA levels are allowed to overlap */
+	if (tl->flags & SDTL_OVERLAP)
+		return true;
+
+	/*
+	 * Non-NUMA levels cannot partially overlap - they must be either
+	 * completely equal or completely disjoint. Otherwise we can end up
+	 * breaking the sched_group lists - i.e. a later get_group() pass
+	 * breaks the linking done for an earlier span.
+	 */
+	for_each_cpu(i, cpu_map) {
+		if (i == cpu)
+			continue;
+		/*
+		 * We should 'and' all those masks with 'cpu_map' to exactly
+		 * match the topology we're about to build, but that can only
+		 * remove CPUs, which only lessens our ability to detect
+		 * overlaps
+		 */
+		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
+		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * Find the sched_domain_topology_level where all CPU capacities are visible
  * for all CPUs.
@@ -1975,6 +2011,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 				has_asym = true;
 			}
 
+			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
+				goto error;
+
 			sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
 
 			if (tl == sched_domain_topology)
-- 
2.24.0



* Re: [PATCH] sched/topology: Assert non-NUMA topology masks don't (partially) overlap
  2020-01-15 16:09 [PATCH] sched/topology: Assert non-NUMA topology masks don't (partially) overlap Valentin Schneider
@ 2020-01-16 10:44 ` Peter Zijlstra
  2020-01-16 10:48   ` Valentin Schneider
  2020-01-16 15:19   ` Peter Zijlstra
  2020-01-17 10:08 ` [tip: sched/core] " tip-bot2 for Valentin Schneider
  1 sibling, 2 replies; 6+ messages in thread
From: Peter Zijlstra @ 2020-01-16 10:44 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, sudeep.holla, prime.zeng, dietmar.eggemann,
	morten.rasmussen, mingo

On Wed, Jan 15, 2020 at 04:09:15PM +0000, Valentin Schneider wrote:
> A "less intrusive" alternative is to assert the sd->groups list doesn't get
> re-written, which is a symptom of such bogus topologies. I've briefly
> tested this, you can have a look at it here:
> 
>   http://www.linux-arm.org/git?p=linux-vs.git;a=commit;h=e0ead72137332cbd3d69c9055ab29e6ffae5b37b

Something like that might still make sense. Can't never be too careful,
right ;-)

>  kernel/sched/topology.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 6ec1e595b1d4..dfb64c08a407 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1879,6 +1879,42 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>  	return sd;
>  }
>  
> +/*
> + * Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
> + * any two given CPUs at this (non-NUMA) topology level.
> + */
> +static bool topology_span_sane(struct sched_domain_topology_level *tl,
> +			      const struct cpumask *cpu_map, int cpu)
> +{
> +	int i;
> +
> +	/* NUMA levels are allowed to overlap */
> +	if (tl->flags & SDTL_OVERLAP)
> +		return true;
> +
> +	/*
> +	 * Non-NUMA levels cannot partially overlap - they must be either
> +	 * completely equal or completely disjoint. Otherwise we can end up
> +	 * breaking the sched_group lists - i.e. a later get_group() pass
> +	 * breaks the linking done for an earlier span.
> +	 */
> +	for_each_cpu(i, cpu_map) {
> +		if (i == cpu)
> +			continue;
> +		/*
> +		 * We should 'and' all those masks with 'cpu_map' to exactly
> +		 * match the topology we're about to build, but that can only
> +		 * remove CPUs, which only lessens our ability to detect
> +		 * overlaps
> +		 */
> +		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> +		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
>  /*
>   * Find the sched_domain_topology_level where all CPU capacities are visible
>   * for all CPUs.
> @@ -1975,6 +2011,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  				has_asym = true;
>  			}
>  
> +			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
> +				goto error;
> +
>  			sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
>  
>  			if (tl == sched_domain_topology)

This is O(nr_cpus), but then, that function already is, so I don't see a
problem with this.

I'll take it, thanks!


* Re: [PATCH] sched/topology: Assert non-NUMA topology masks don't (partially) overlap
  2020-01-16 10:44 ` Peter Zijlstra
@ 2020-01-16 10:48   ` Valentin Schneider
  2020-01-16 15:19   ` Peter Zijlstra
  1 sibling, 0 replies; 6+ messages in thread
From: Valentin Schneider @ 2020-01-16 10:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, sudeep.holla, prime.zeng, dietmar.eggemann,
	morten.rasmussen, mingo



On 16/01/2020 10:44, Peter Zijlstra wrote:
> On Wed, Jan 15, 2020 at 04:09:15PM +0000, Valentin Schneider wrote:
>> A "less intrusive" alternative is to assert the sd->groups list doesn't get
>> re-written, which is a symptom of such bogus topologies. I've briefly
>> tested this, you can have a look at it here:
>>
>>   http://www.linux-arm.org/git?p=linux-vs.git;a=commit;h=e0ead72137332cbd3d69c9055ab29e6ffae5b37b
> 
> Something like that might still make sense. Can't never be too careful,
> right ;-)
> 

True. I'll ponder it and see if I can't make it a bit neater.



* Re: [PATCH] sched/topology: Assert non-NUMA topology masks don't (partially) overlap
  2020-01-16 10:44 ` Peter Zijlstra
  2020-01-16 10:48   ` Valentin Schneider
@ 2020-01-16 15:19   ` Peter Zijlstra
  2020-01-16 15:30     ` Valentin Schneider
  1 sibling, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2020-01-16 15:19 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, sudeep.holla, prime.zeng, dietmar.eggemann,
	morten.rasmussen, mingo

On Thu, Jan 16, 2020 at 11:44:28AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 15, 2020 at 04:09:15PM +0000, Valentin Schneider wrote:
> > @@ -1975,6 +2011,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  				has_asym = true;
> >  			}
> >  
> > +			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
> > +				goto error;
> > +
> >  			sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
> >  
> >  			if (tl == sched_domain_topology)
> 
> This is O(nr_cpus), but then, that function already is, so I don't see a
> problem with this.

Clearly I meant to write O(nr_cpus^2), there's a bunch of nested
for_each_cpu() in there.

> I'll take it, thanks!


* Re: [PATCH] sched/topology: Assert non-NUMA topology masks don't (partially) overlap
  2020-01-16 15:19   ` Peter Zijlstra
@ 2020-01-16 15:30     ` Valentin Schneider
  0 siblings, 0 replies; 6+ messages in thread
From: Valentin Schneider @ 2020-01-16 15:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, sudeep.holla, prime.zeng, dietmar.eggemann,
	morten.rasmussen, mingo



On 16/01/2020 15:19, Peter Zijlstra wrote:
> On Thu, Jan 16, 2020 at 11:44:28AM +0100, Peter Zijlstra wrote:
>> On Wed, Jan 15, 2020 at 04:09:15PM +0000, Valentin Schneider wrote:
>>> @@ -1975,6 +2011,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>>  				has_asym = true;
>>>  			}
>>>  
>>> +			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
>>> +				goto error;
>>> +
>>>  			sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
>>>  
>>>  			if (tl == sched_domain_topology)
>>
>> This is O(nr_cpus), but then, that function already is, so I don't see a
>> problem with this.
> 
> Clearly I meant to write O(nr_cpus^2), there's a bunch of nested
> for_each_cpu() in there.
> 

Hm, so the sanity check is O(topo_levels * nr_cpus²), and I think that
roughly matches the complexity of building the sched_groups (I say
roughly because you go over fewer CPUs as you go up the topo levels).

As you said the important bit is that we don't make build_sched_domains()
noticeably worse, which I think is the case...
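To make the O(topo_levels * nr_cpus²) estimate concrete, here is a toy count of the pairwise mask comparisons (illustrative numbers only; the real iteration is over cpu_map at each level):

```python
# Rough count of mask comparisons done by the sanity check:
# build_sched_domains() calls topology_span_sane() once per CPU per
# non-NUMA level, and each call compares that CPU's mask against every
# other CPU in cpu_map.
def span_check_comparisons(levels, cpus):
    return levels * cpus * (cpus - 1)

# e.g. 8 CPUs with 3 non-NUMA levels (say SMT/MC/DIE): a few hundred
# comparisons, negligible next to building the domains themselves.
assert span_check_comparisons(3, 8) == 168
```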

>> I'll take it, thanks!


* [tip: sched/core] sched/topology: Assert non-NUMA topology masks don't (partially) overlap
  2020-01-15 16:09 [PATCH] sched/topology: Assert non-NUMA topology masks don't (partially) overlap Valentin Schneider
  2020-01-16 10:44 ` Peter Zijlstra
@ 2020-01-17 10:08 ` tip-bot2 for Valentin Schneider
  1 sibling, 0 replies; 6+ messages in thread
From: tip-bot2 for Valentin Schneider @ 2020-01-17 10:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Zeng Tao, Valentin Schneider, Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     ccf74128d66ce937876184ad55db2e0276af08d3
Gitweb:        https://git.kernel.org/tip/ccf74128d66ce937876184ad55db2e0276af08d3
Author:        Valentin Schneider <valentin.schneider@arm.com>
AuthorDate:    Wed, 15 Jan 2020 16:09:15 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 17 Jan 2020 10:19:23 +01:00

sched/topology: Assert non-NUMA topology masks don't (partially) overlap

Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
