linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
@ 2018-10-24  3:02 Srikar Dronamraju
  2018-10-24  8:56 ` Mel Gorman
  2018-10-24 10:03 ` Peter Zijlstra
  0 siblings, 2 replies; 15+ messages in thread
From: Srikar Dronamraju @ 2018-10-24  3:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Yi Wang, zhong.weidong, Yi Liu,
	Srikar Dronamraju, Frederic Weisbecker, Thomas Gleixner

The load balancer and the NUMA balancer are not supposed to work on
isolcpus.

Currently, when setting sched affinity, there is no check on whether the
requested cpumask mixes isolcpus and housekeeping CPUs.

If the user passes a mix of isolcpus and housekeeping CPUs, the NUMA
balancer can pick an isolcpu to schedule on. With this change, if a
combination of isolcpus and housekeeping CPUs is provided, we restrict
ourselves to the housekeeping CPUs.

For example: System with 32 CPUs
$ grep -o "isolcpus=[,,1-9]*" /proc/cmdline
isolcpus=1,5,9,13
$ grep -i cpus_allowed /proc/$$/status
Cpus_allowed:   ffffdddd
Cpus_allowed_list:      0,2-4,6-8,10-12,14-31

Running "perf bench numa mem --no-data_rand_walk -p 4 -t 8 -G 0 -P 3072
-T 0 -l 50 -c -s 1000", which calls sched_setaffinity with all CPUs in
the system.

Without patch
------------
$ for i in $(pgrep -f perf); do  grep -i cpus_allowed_list  /proc/$i/task/*/status ; done | head -n 10
Cpus_allowed_list:      0,2-4,6-8,10-12,14-31
/proc/2107/task/2107/status:Cpus_allowed_list:  0-31
/proc/2107/task/2196/status:Cpus_allowed_list:  0-31
/proc/2107/task/2197/status:Cpus_allowed_list:  0-31
/proc/2107/task/2198/status:Cpus_allowed_list:  0-31
/proc/2107/task/2199/status:Cpus_allowed_list:  0-31
/proc/2107/task/2200/status:Cpus_allowed_list:  0-31
/proc/2107/task/2201/status:Cpus_allowed_list:  0-31
/proc/2107/task/2202/status:Cpus_allowed_list:  0-31
/proc/2107/task/2203/status:Cpus_allowed_list:  0-31

With patch
----------
$ for i in $(pgrep -f perf); do  grep -i cpus_allowed_list  /proc/$i/task/*/status ; done | head -n 10
Cpus_allowed_list:      0,2-4,6-8,10-12,14-31
/proc/18591/task/18591/status:Cpus_allowed_list:        0,2-4,6-8,10-12,14-31
/proc/18591/task/18603/status:Cpus_allowed_list:        0,2-4,6-8,10-12,14-31
/proc/18591/task/18604/status:Cpus_allowed_list:        0,2-4,6-8,10-12,14-31
/proc/18591/task/18605/status:Cpus_allowed_list:        0,2-4,6-8,10-12,14-31
/proc/18591/task/18606/status:Cpus_allowed_list:        0,2-4,6-8,10-12,14-31
/proc/18591/task/18607/status:Cpus_allowed_list:        0,2-4,6-8,10-12,14-31
/proc/18591/task/18608/status:Cpus_allowed_list:        0,2-4,6-8,10-12,14-31
/proc/18591/task/18609/status:Cpus_allowed_list:        0,2-4,6-8,10-12,14-31
/proc/18591/task/18610/status:Cpus_allowed_list:        0,2-4,6-8,10-12,14-31

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
 constification of hk_mask (reported by kbuild test robot)

 kernel/sched/core.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ad97f3b..54e7207 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4734,6 +4734,7 @@ static int sched_read_attr(struct sched_attr __user *uattr,
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
 	cpumask_var_t cpus_allowed, new_mask;
+	const struct cpumask *hk_mask;
 	struct task_struct *p;
 	int retval;
 
@@ -4778,6 +4779,19 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 
 	cpuset_cpus_allowed(p, cpus_allowed);
 	cpumask_and(new_mask, in_mask, cpus_allowed);
+	hk_mask = housekeeping_cpumask(HK_FLAG_DOMAIN);
+
+	/*
+	 * If the cpumask provided has CPUs that are part of isolated and
+	 * housekeeping_cpumask, then restrict it to just the CPUs that
+	 * are part of the housekeeping_cpumask.
+	 */
+	if (!cpumask_subset(new_mask, hk_mask) &&
+			cpumask_intersects(new_mask, hk_mask)) {
+		pr_info("pid %d: Mix of isolcpus and non-isolcpus provided\n",
+			       p->pid);
+		cpumask_and(new_mask, new_mask, hk_mask);
+	}
 
 	/*
 	 * Since bandwidth control happens on root_domain basis,
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-24  3:02 [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs Srikar Dronamraju
@ 2018-10-24  8:56 ` Mel Gorman
  2018-10-24  9:46   ` Srikar Dronamraju
  2018-10-24 10:03 ` Peter Zijlstra
  1 sibling, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2018-10-24  8:56 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

On Wed, Oct 24, 2018 at 08:32:49AM +0530, Srikar Dronamraju wrote:
> The load balancer and the NUMA balancer are not supposed to work on
> isolcpus.
> 
> Currently, when setting sched affinity, there is no check on whether the
> requested cpumask mixes isolcpus and housekeeping CPUs.
> 
> If the user passes a mix of isolcpus and housekeeping CPUs, the NUMA
> balancer can pick an isolcpu to schedule on. With this change, if a
> combination of isolcpus and housekeeping CPUs is provided, we restrict
> ourselves to the housekeeping CPUs.
> 
> For example: System with 32 CPUs
> $ grep -o "isolcpus=[,,1-9]*" /proc/cmdline
> isolcpus=1,5,9,13
> $ grep -i cpus_allowed /proc/$$/status
> Cpus_allowed:   ffffdddd
> Cpus_allowed_list:      0,2-4,6-8,10-12,14-31
> 
> Running "perf bench numa mem --no-data_rand_walk -p 4 -t 8 -G 0 -P 3072
> -T 0 -l 50 -c -s 1000", which calls sched_setaffinity with all CPUs in
> the system.
> 

Forgive my naivety, but is it wrong for a process to bind to both isolated
CPUs and housekeeping CPUs? It would certainly be a bit odd because the
application is asking for some protection but no guarantees are given
and the application is not made aware via an error code that there is a
problem. Asking the application to parse dmesg hoping to find the right
error message is going to be fragile.

Would it be more appropriate to fail sched_setaffinity when there is a
mix of isolated and housekeeping CPUs? In that case, an info message in
dmesg may be appropriate as it'll likely be a once-off configuration
error that's obvious due to an application failure. Alternatively,
should NUMA balancing ignore isolated CPUs? The latter seems unusual as
the application has specified a mask that allows those CPUs and it's not
clear why NUMA balancing should ignore them. If anything, an application
that wants to avoid all interference should also be using memory policies
to bind to nodes so it behaves predictably with respect to access latencies
(presumably if an application cannot tolerate kernel threads interfering
then it also cannot tolerate remote access latencies) or disabling NUMA
balancing entirely to avoid incurring minor faults.

Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-24  8:56 ` Mel Gorman
@ 2018-10-24  9:46   ` Srikar Dronamraju
  2018-10-24 10:15     ` Peter Zijlstra
  2018-10-24 10:31     ` Mel Gorman
  0 siblings, 2 replies; 15+ messages in thread
From: Srikar Dronamraju @ 2018-10-24  9:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

* Mel Gorman <mgorman@techsingularity.net> [2018-10-24 09:56:36]:

> On Wed, Oct 24, 2018 at 08:32:49AM +0530, Srikar Dronamraju wrote:
> It would certainly be a bit odd because the
> application is asking for some protection but no guarantees are given
> and the application is not made aware via an error code that there is a
> problem. Asking the application to parse dmesg hoping to find the right
> error message is going to be fragile.

It's actually a good question: what should we be doing if a mix of
isolcpus and housekeeping (aka non-isolcpus) CPUs is given in the mask?

Right now, as you pointed out, there is no easy way for the application
to know which CPUs are the non-isolcpus when setting its affinity. The
cpuset effective_cpus and cpus_allowed files will both contain isolcpus
too.

> Would it be more appropriate to fail sched_setaffinity when there is a
> mix of isolated and housekeeping CPUs? In that case, an info message in
> dmesg may be appropriate as it'll likely be a once-off configuration
> error that's obvious due to an application failure.

I was looking at the option of returning an error in that case. However,
our own perf bench numa can't handle it, and we can be sure there will
be other tools coded the same way that may start failing. So I am not
sure that's a good option. Previously they would *seem* to be working.

That's why I have taken the option of correcting it within the kernel.

> Alternatively,
> should NUMA balancing ignore isolated CPUs? The latter seems unusual as
> the application has specified a mask that allows those CPUs and it's not
> clear why NUMA balancing should ignore them. If anything, an application
> that wants to avoid all interference should also be using memory policies
> to bind to nodes so it behaves predictably with respect to access latencies
> (presumably if an application cannot tolerate kernel threads interfering
> then it also cannot tolerate remote access latencies) or disabling NUMA
> balancing entirely to avoid incurring minor faults.
> 

The numa balancing is doing the right thing because if the application has
the mask specified, then we should allow the application to run.

The problem happens when the unbound application starts to run on an
isolcpu: regular load balancing will no longer work. If another task is
bound to the isolcpu, and other cpus in the node are free, this task
will still contend with the task bound to the isolcpu (unless the numa
affinity of the unbound thread changes).

Also, isolcpus are supposed to run only those tasks that are bound to
them, so we are kind of breaking that rule.

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-24  3:02 [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs Srikar Dronamraju
  2018-10-24  8:56 ` Mel Gorman
@ 2018-10-24 10:03 ` Peter Zijlstra
  2018-10-24 10:30   ` Srikar Dronamraju
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2018-10-24 10:03 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

On Wed, Oct 24, 2018 at 08:32:49AM +0530, Srikar Dronamraju wrote:
> The load balancer and the NUMA balancer are not supposed to work on
> isolcpus.
> 
> Currently, when setting sched affinity, there is no check on whether the
> requested cpumask mixes isolcpus and housekeeping CPUs.
> 
> If the user passes a mix of isolcpus and housekeeping CPUs, the NUMA
> balancer can pick an isolcpu to schedule on. With this change, if a
> combination of isolcpus and housekeeping CPUs is provided, we restrict
> ourselves to the housekeeping CPUs.

I'm still not liking this much. This adds more special cases for
isolcpus. Also, I don't believe in correcting silly users; give 'em rope
and show them how to tie the knot.

Where does the numa balancer pick the 'wrong' CPU?

task_numa_migrate() checks to see if the task is currently part of a
SD_NUMA domain, otherwise it doesn't do anything. This means your
housekeeping mask spans multiple nodes to begin with, right?

But after that we seem to ignore the sched domains entirely;
task_numa_find_cpu() only tests cpus_allowed.

It appears to me the for_each_online_node() iteration in
task_numa_migrate() needs an additional test to see whether the selected
node has any CPUs in the relevant sched_domain _at_all_.

A little something like the below -- except we also need to do something
about cpus_active_mask. Not been near a compiler.

---
 kernel/sched/fair.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee271bb661cc..287ef7f0203b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1497,6 +1497,8 @@ struct task_numa_env {
 	struct task_struct *best_task;
 	long best_imp;
 	int best_cpu;
+
+	cpumask_var_t cpus;
 };
 
 static void task_numa_assign(struct task_numa_env *env,
@@ -1704,7 +1706,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	 */
 	maymove = !load_too_imbalanced(src_load, dst_load, env);
 
-	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
+	for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), env->cpus) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, &env->p->cpus_allowed))
 			continue;
@@ -1734,6 +1736,9 @@ static int task_numa_migrate(struct task_struct *p)
 	int nid, ret, dist;
 	long taskimp, groupimp;
 
+	if (!alloc_cpumask_var(&env.cpus, GFP_KERNEL))
+		return -ENOMEM;
+
 	/*
 	 * Pick the lowest SD_NUMA domain, as that would have the smallest
 	 * imbalance and would be the first to start moving tasks about.
@@ -1744,20 +1749,23 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
-		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
-	rcu_read_unlock();
-
 	/*
 	 * Cpusets can break the scheduler domain tree into smaller
 	 * balance domains, some of which do not cross NUMA boundaries.
 	 * Tasks that are "trapped" in such domains cannot be migrated
 	 * elsewhere, so there is no point in (re)trying.
 	 */
-	if (unlikely(!sd)) {
+	if (!sd) {
 		sched_setnuma(p, task_node(p));
-		return -EINVAL;
+		rcu_read_unlock();
+		ret = -EINVAL;
+		goto out;
 	}
+	env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+	while (sd->parent)
+		sd = sd->parent;
+	cpumask_copy(env.cpus, sched_domain_span(sd));
+	rcu_read_unlock();
 
 	env.dst_nid = p->numa_preferred_nid;
 	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
@@ -1783,6 +1791,9 @@ static int task_numa_migrate(struct task_struct *p)
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
+			if (!cpumask_intersects(cpumask_of_node(nid), env.cpus))
+				continue;
+
 			dist = node_distance(env.src_nid, env.dst_nid);
 			if (sched_numa_topology_type == NUMA_BACKPLANE &&
 						dist != env.dist) {
@@ -1822,8 +1833,10 @@ static int task_numa_migrate(struct task_struct *p)
 	}
 
 	/* No better CPU than the current one was found. */
-	if (env.best_cpu == -1)
-		return -EAGAIN;
+	if (env.best_cpu == -1) {
+		ret = -EAGAIN;
+		goto out;
+	}
 
 	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
@@ -1831,7 +1844,7 @@ static int task_numa_migrate(struct task_struct *p)
 		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
-		return ret;
+		goto out;
 	}
 
 	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
@@ -1840,6 +1853,9 @@ static int task_numa_migrate(struct task_struct *p)
 	if (ret != 0)
 		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
 	put_task_struct(env.best_task);
+
+out:
+	free_cpumask_var(&env.cpus);
 	return ret;
 }
 


* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-24  9:46   ` Srikar Dronamraju
@ 2018-10-24 10:15     ` Peter Zijlstra
  2018-10-24 10:41       ` Srikar Dronamraju
  2018-10-24 10:31     ` Mel Gorman
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2018-10-24 10:15 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, LKML, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

On Wed, Oct 24, 2018 at 03:16:46PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2018-10-24 09:56:36]:
> 
> > On Wed, Oct 24, 2018 at 08:32:49AM +0530, Srikar Dronamraju wrote:
> > It would certainly be a bit odd because the
> > application is asking for some protection but no guarantees are given
> > and the application is not made aware via an error code that there is a
> > problem. Asking the application to parse dmesg hoping to find the right
> > error message is going to be fragile.
> 
> It's actually a good question: what should we be doing if a mix of
> isolcpus and housekeeping (aka non-isolcpus) CPUs is given in the mask?
> 
> Right now, as you pointed out, there is no easy way for the application
> to know which CPUs are the non-isolcpus when setting its affinity. The
> cpuset effective_cpus and cpus_allowed files will both contain isolcpus
> too.

The easy option is to not use isolcpus :-) It is a horrifically bad
interface.


* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-24 10:03 ` Peter Zijlstra
@ 2018-10-24 10:30   ` Srikar Dronamraju
  2018-10-25  0:07     ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Srikar Dronamraju @ 2018-10-24 10:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

* Peter Zijlstra <peterz@infradead.org> [2018-10-24 12:03:23]:

> It appears to me the for_each_online_node() iteration in
> task_numa_migrate() needs an additional test to see whether the selected
> node has any CPUs in the relevant sched_domain _at_all_.
> 

Yes, this should work.
Yi Wang does this extra check a little differently.

http://lkml.kernel.org/r/1540177516-38613-1-git-send-email-wang.yi59@zte.com.cn

However, the last time I posted, you didn't like that approach.
http://lkml.kernel.org/r/20170406073659.y6ubqriyshax4v4m@hirez.programming.kicks-ass.net

Further, I would think the number of times we call sched_setaffinity
would be far less than task_numa_migrate(). In the regular case, where
we never have isolcpus, we would be adding this extra check.

Also, what does it mean for a task to have its cpu affinity set to a mix
of isolcpus and non-isolcpus?

Also, should we be updating update_numa_stats() accordingly? While the
problem may not be apparent there, we are counting the load and compute
capacity of isolcpus.

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-24  9:46   ` Srikar Dronamraju
  2018-10-24 10:15     ` Peter Zijlstra
@ 2018-10-24 10:31     ` Mel Gorman
  1 sibling, 0 replies; 15+ messages in thread
From: Mel Gorman @ 2018-10-24 10:31 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

On Wed, Oct 24, 2018 at 03:16:46PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2018-10-24 09:56:36]:
> 
> > On Wed, Oct 24, 2018 at 08:32:49AM +0530, Srikar Dronamraju wrote:
> > It would certainly be a bit odd because the
> > application is asking for some protection but no guarantees are given
> > and the application is not made aware via an error code that there is a
> > problem. Asking the application to parse dmesg hoping to find the right
> > error message is going to be fragile.
> 
> It's actually a good question: what should we be doing if a mix of
> isolcpus and housekeeping (aka non-isolcpus) CPUs is given in the mask?
> 
> Right now, as you pointed out, there is no easy way for the application
> to know which CPUs are the non-isolcpus when setting its affinity. The
> cpuset effective_cpus and cpus_allowed files will both contain isolcpus
> too.
> 

While it's tangential to your patch, that sounds like a fairly big hole
in the API. Now, sure enough, it just puts the burden on the
administrator to figure it out, but the silent "appears to work" is
unfortunate. It's a question of whether application owners or sysadmins
are disciplined enough to ensure that the kernel command line and
scheduler affinities always match properly.

> > Would it be more appropriate to fail sched_setaffinity when there is a
> > mix of isolated and housekeeping CPUs? In that case, an info message in
> > dmesg may be appropriate as it'll likely be a once-off configuration
> > error that's obvious due to an application failure.
> 
> I was looking at the option of returning an error in that case. However,
> our own perf bench numa can't handle it, and we can be sure there will
> be other tools coded the same way that may start failing. So I am not
> sure that's a good option. Previously they would *seem* to be working.
> 

I'm going to throw it out there -- should we do it anyway as it's a
sensible configuration and see what breaks? At the end of the day, it
would be a revert if we get caught by the "you never break userspace"
rule and get overruled on the defense "what the application asks for is
only being partially granted and it is vulnerable to interference when it
thinks the kernel has promised isolation". It could even be a patch on top
of the one you have so that even if it does get reverted, there still is
a warning left over. Of course, it would instantly break the perf bench
numa case and open the question on whether that should be fixed too.

> That's why I have taken the option of correcting it within the kernel.
> 

FWIW, I'm not going to nak the patch on that basis as the warning is
beneficial in itself and an improvement of the current state. There are
others on the cc that have a greater awareness of how much interference
an isolated process is willing to tolerate and what sort of promises
we've given them in the past.

> > Alternatively,
> > should NUMA balancing ignore isolated CPUs? The latter seems unusual as
> > the application has specified a mask that allows those CPUs and it's not
> > clear why NUMA balancing should ignore them. If anything, an application
> > that wants to avoid all interference should also be using memory policies
> > to bind to nodes so it behaves predictably with respect to access latencies
> > (presumably if an application cannot tolerate kernel threads interfering
> > then it also cannot tolerate remote access latencies) or disabling NUMA
> > balancing entirely to avoid incurring minor faults.
> > 
> 
> The numa balancing is doing the right thing because if the application has
> the mask specified, then we should allow the application to run.
> 

This is what I thought.

> The problem happens when the unbound application starts to run on an
> isolcpu: regular load balancing will no longer work. If another task is
> bound to the isolcpu, and other cpus in the node are free, this task
> will still contend with the task bound to the isolcpu (unless the numa
> affinity of the unbound thread changes).
> 

Which is bad.

> Also, isolcpus are supposed to run only those tasks that are bound to
> them, so we are kind of breaking that rule.
> 

Indeed, so the warning has use in itself, and it's a question of whether
others think we should be more strict about misleading (or broken, if
it's critical enough) configurations.

As the patch at least highlights a problem and my suggestions are additional
work that may or may not be useful;

Acked-by: Mel Gorman <mgorman@techsingularity.net>

But further opinions on whether there should be a patch on top that are
more strict would be welcome.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-24 10:15     ` Peter Zijlstra
@ 2018-10-24 10:41       ` Srikar Dronamraju
  2018-10-24 11:21         ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Srikar Dronamraju @ 2018-10-24 10:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, LKML, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

* Peter Zijlstra <peterz@infradead.org> [2018-10-24 12:15:08]:

> On Wed, Oct 24, 2018 at 03:16:46PM +0530, Srikar Dronamraju wrote:
> > * Mel Gorman <mgorman@techsingularity.net> [2018-10-24 09:56:36]:
> > 
> > > On Wed, Oct 24, 2018 at 08:32:49AM +0530, Srikar Dronamraju wrote:
> > > It would certainly be a bit odd because the
> > > application is asking for some protection but no guarantees are given
> > > and the application is not made aware via an error code that there is a
> > > problem. Asking the application to parse dmesg hoping to find the right
> > > error message is going to be fragile.
> > 
> > It's actually a good question: what should we be doing if a mix of
> > isolcpus and housekeeping (aka non-isolcpus) CPUs is given in the mask?
> > 
> > Right now, as you pointed out, there is no easy way for the application
> > to know which CPUs are the non-isolcpus when setting its affinity. The
> > cpuset effective_cpus and cpus_allowed files will both contain isolcpus
> > too.
> 
> The easy option is to not use isolcpus :-) It is a horrifically bad
> interface.

Agreed, but that's something that's been exposed a long time back.
Do we have an option to remove it?  Hopefully nobody is using it.

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-24 10:41       ` Srikar Dronamraju
@ 2018-10-24 11:21         ` Mel Gorman
  0 siblings, 0 replies; 15+ messages in thread
From: Mel Gorman @ 2018-10-24 11:21 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

On Wed, Oct 24, 2018 at 04:11:24PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2018-10-24 12:15:08]:
> 
> > On Wed, Oct 24, 2018 at 03:16:46PM +0530, Srikar Dronamraju wrote:
> > > * Mel Gorman <mgorman@techsingularity.net> [2018-10-24 09:56:36]:
> > > 
> > > > On Wed, Oct 24, 2018 at 08:32:49AM +0530, Srikar Dronamraju wrote:
> > > > It would certainly be a bit odd because the
> > > > application is asking for some protection but no guarantees are given
> > > > and the application is not made aware via an error code that there is a
> > > > problem. Asking the application to parse dmesg hoping to find the right
> > > > error message is going to be fragile.
> > > 
> > > It's actually a good question: what should we be doing if a mix of
> > > isolcpus and housekeeping (aka non-isolcpus) CPUs is given in the mask?
> > > 
> > > Right now, as you pointed out, there is no easy way for the application
> > > to know which CPUs are the non-isolcpus when setting its affinity. The
> > > cpuset effective_cpus and cpus_allowed files will both contain isolcpus
> > > too.
> > 
> > The easy option is to not use isolcpus :-) It is a horrifically bad
> > interface.
> 
> Agreed, but that's something that's been exposed a long time back.
> Do we have an option to remove it?  Hopefully nobody is using it.
> 

I occasionally see bugs asking questions about interference from the kernel
when isolcpus are in use. The last one was related to a timer interrupt
every HZ (not a mainline kernel) but still, it's some evidence that it
has users :(

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-24 10:30   ` Srikar Dronamraju
@ 2018-10-25  0:07     ` Peter Zijlstra
  2018-10-25 17:30       ` Srikar Dronamraju
  2018-10-25 18:23       ` Srikar Dronamraju
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2018-10-25  0:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

On Wed, Oct 24, 2018 at 04:00:02PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2018-10-24 12:03:23]:
> 
> > It appears to me the for_each_online_node() iteration in
> > task_numa_migrate() needs an additional test to see whether the selected
> > node has any CPUs in the relevant sched_domain _at_all_.
> > 
> 
> Yes, this should work.
> Yi Wang does this extra check a little differently.
> 
> http://lkml.kernel.org/r/1540177516-38613-1-git-send-email-wang.yi59@zte.com.cn

That's completely broken. Nothing in the numa balancing path uses that
variable and afaict preemption is actually enabled where that's used, so
using that per-cpu variable at all is broken.

> However the last time I had posted you didn't like that approach.
> http://lkml.kernel.org/r/20170406073659.y6ubqriyshax4v4m@hirez.programming.kicks-ass.net

You again checked against isolated_map there, which is not the immediate
problem.

Both of you are fixing symptoms, not the cause.

> Further, I would think the number of times we call sched_setaffinity
> would be far less than task_numa_migrate(). In the regular case, where
> we never have isolcpus, we would be adding this extra check.

But it doesn't solve the problem.

You can create multiple partitions with cpusets but still have an
unbound task in the root cgroup. That would suffer the exact same
problems.

Thing is, load-balancing, of any kind, should respect sched_domains, and
currently numa balancing barely looks at it.

The proposed patch puts the minimal constraints on the numa balancer to
respect sched_domains; but doesn't yet correctly deal with hotplug.

isolcpus is just one case that goes wrong.


* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-25  0:07     ` Peter Zijlstra
@ 2018-10-25 17:30       ` Srikar Dronamraju
  2018-10-26  8:41         ` Peter Zijlstra
  2018-10-25 18:23       ` Srikar Dronamraju
  1 sibling, 1 reply; 15+ messages in thread
From: Srikar Dronamraju @ 2018-10-25 17:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

> 
> That's completely broken. Nothing in the numa balancing path uses that
> variable and afaict preemption is actually enabled where that's used, so
> using that per-cpu variable at all is broken.
> 

I can demonstrate that even without numa balancing, there is
inconsistent behaviour with isolcpus on.

> 
> Both of you are fixing symptoms, not the cause.
> 

Okay.

> But it doesn't solve the problem.
> 
> You can create multiple partitions with cpusets but still have an
> unbound task in the root cgroup. That would suffer the exact same
> problems.
> 
> Thing is, load-balancing, of any kind, should respect sched_domains, and
> currently numa balancing barely looks at it.

Agreed that we should have looked at sched_domains. However, I still
believe we can't have task->cpus_allowed with a mix of isolcpus and
non-isolcpus. Won't it lead to inconsistent behaviour?

> 
> The proposed patch puts the minimal constraints on the numa balancer to
> respect sched_domains; but doesn't yet correctly deal with hotplug.

I was also thinking about hotplug. Also, your proposed patch, and even
my proposed patch, don't seem to work well with the scenario below.

# cat /sys/devices/system/cpu/possible
0-31
# cat /sys/devices/system/cpu/isolated
1,5,9,13
# cat hist.sh
echo 0 > /proc/sys/kernel/numa_balancing
cd /sys/fs/cgroup/cpuset
mkdir -p student
cp cpuset.mems student/
cd student
echo "0-31" > cpuset.cpus
echo $$ > cgroup.procs 
echo "1-8" > cpuset.cpus
/home/srikar/work/ebizzy-0.3/ebizzy -S 1000 &
PID=$!
sleep 10
pidstat -p $! -t |tail -n +3 |head -n 10
pidstat -p $$ -t |tail -n +3
pkill ebizzy
#
# ./hist.sh
10:35:21  IST   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
10:35:21  IST     0      2645         -    8.70    0.01    0.00    8.71     1  ebizzy
10:35:21  IST     0         -      2645    0.01    0.00    0.00    0.01     1  |__ebizzy
10:35:21  IST     0         -      2647    0.14    0.00    0.00    0.14     1  |__ebizzy
10:35:21  IST     0         -      2648    0.13    0.00    0.00    0.13     1  |__ebizzy
10:35:21  IST     0         -      2649    0.13    0.00    0.00    0.13     1  |__ebizzy
10:35:21  IST     0         -      2650    0.13    0.00    0.00    0.13     1  |__ebizzy
10:35:21  IST     0         -      2651    0.13    0.00    0.00    0.13     1  |__ebizzy
10:35:21  IST     0         -      2652    0.13    0.00    0.00    0.13     1  |__ebizzy
10:35:21  IST     0         -      2653    0.13    0.00    0.00    0.13     1  |__ebizzy
10:35:23  IST   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
10:35:23  IST     0      2642         -    0.00    0.00    0.00    0.00     1  hist.sh
10:35:23  IST     0         -      2642    0.00    0.00    0.00    0.00     1  |__hist.sh
#

Note that all the ebizzy threads and the bash task that started them are on
CPU 1. This happens when the cpuset starts with an isolcpu: all tasks in that
cpuset may end up running only on that CPU. With a smaller cpuset, ebizzy
always runs on CPU 1. However, if I enlarge the cpuset, the chance of ebizzy
spreading increases, but it does not always spread.
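For anyone wanting to script checks like the ones above, here is a small
helper (my own sketch, not a kernel interface) that parses the standard
kernel cpulist format used by Cpus_allowed_list in /proc/<pid>/status and
by /sys/devices/system/cpu/isolated, e.g. "0,2-4,6-8,10-12,14-31":

```python
def parse_cpulist(s):
    """Parse a kernel cpulist string ("1,5,9,13" or "0,2-4,14-31")
    into a set of CPU numbers."""
    cpus = set()
    for part in s.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# The two lists from this thread: isolated CPUs, and the allowed list
# from /proc/$$/status on the 32-CPU test system.
assert parse_cpulist("1,5,9,13") == {1, 5, 9, 13}
assert parse_cpulist("0,2-4,6-8,10-12,14-31") == \
    set(range(32)) - {1, 5, 9, 13}
```

With that, checking whether a task's allowed mask intersects the isolated
set is just `parse_cpulist(allowed) & parse_cpulist(isolated)`.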

I have only tried this on a powerpc KVM guest, but I don't think it has
anything to do with the arch or the guest/host setup.

I have something that seems to help out. Will post soon.

> isolcpus is just one case that goes wrong.
Apart from isolcpus, are there other cases that we need to worry about?

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-25  0:07     ` Peter Zijlstra
  2018-10-25 17:30       ` Srikar Dronamraju
@ 2018-10-25 18:23       ` Srikar Dronamraju
  2018-10-26  8:42         ` Peter Zijlstra
  1 sibling, 1 reply; 15+ messages in thread
From: Srikar Dronamraju @ 2018-10-25 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

> 
> You can create multiple partitions with cpusets but still have an
> unbound task in the root cgroup. That would suffer the exact same
> problems.
> 

I probably don't understand this. Even if the child cgroups have
cpu_exclusive set or sched_load_balance reset, the tasks in the root cgroup
still have access to all CPUs in the system. Right?

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-25 17:30       ` Srikar Dronamraju
@ 2018-10-26  8:41         ` Peter Zijlstra
  2018-10-26  9:30           ` Srikar Dronamraju
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2018-10-26  8:41 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

On Thu, Oct 25, 2018 at 11:00:58PM +0530, Srikar Dronamraju wrote:
> > But it doesn't solve the problem.
> > 
> > You can create multiple partitions with cpusets but still have an
> > unbound task in the root cgroup. That would suffer the exact same
> > problems.
> > 
> > Thing is, load-balancing, of any kind, should respect sched_domains, and
> > currently numa balancing barely looks at it.
> 
> Agreed that we should have looked at sched_domains. However, I still believe
> we can't have task->cpus_allowed with a mix of isolcpus and non-isolcpus.
> Won't it lead to inconsistent behaviour?

We currently can; and it depends on what you call inconsistent.

> > The proposed patch puts the minimal constraints on the numa balancer to
> > respect sched_domains; but doesn't yet correctly deal with hotplug.
> 
> I was also thinking about hotplug. Also your proposed patch and even my
> proposed patch don't seem to work well with the below scenario.
> 
> # cat /sys/devices/system/cpu/possible
> 0-31
> # cat /sys/devices/system/cpu/isolated
> 1,5,9,13
> # cat hist.sh
> echo 0 > /proc/sys/kernel/numa_balancing
> cd /sys/fs/cgroup/cpuset
> mkdir -p student
> cp cpuset.mems student/
> cd student
> echo "0-31" > cpuset.cpus
> echo $$ > cgroup.procs 
> echo "1-8" > cpuset.cpus
> /home/srikar/work/ebizzy-0.3/ebizzy -S 1000 &
> PID=$!
> sleep 10
> pidstat -p $! -t |tail -n +3 |head -n 10
> pidstat -p $$ -t |tail -n +3
> pkill ebizzy
> #
> # ./hist.sh
> 10:35:21  IST   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
> 10:35:21  IST     0      2645         -    8.70    0.01    0.00    8.71     1  ebizzy
> 10:35:21  IST     0         -      2645    0.01    0.00    0.00    0.01     1  |__ebizzy
> 10:35:21  IST     0         -      2647    0.14    0.00    0.00    0.14     1  |__ebizzy
> 10:35:21  IST     0         -      2648    0.13    0.00    0.00    0.13     1  |__ebizzy
> 10:35:21  IST     0         -      2649    0.13    0.00    0.00    0.13     1  |__ebizzy
> 10:35:21  IST     0         -      2650    0.13    0.00    0.00    0.13     1  |__ebizzy
> 10:35:21  IST     0         -      2651    0.13    0.00    0.00    0.13     1  |__ebizzy
> 10:35:21  IST     0         -      2652    0.13    0.00    0.00    0.13     1  |__ebizzy
> 10:35:21  IST     0         -      2653    0.13    0.00    0.00    0.13     1  |__ebizzy
> 10:35:23  IST   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
> 10:35:23  IST     0      2642         -    0.00    0.00    0.00    0.00     1  hist.sh
> 10:35:23  IST     0         -      2642    0.00    0.00    0.00    0.00     1  |__hist.sh
> #

That's correct and specified behaviour.

cpu1 has no sched domain (or the singleton domain, which is the same)
and thus its tasks will not be migrated.

If there is a problem, it is with cpuset and its interaction with
isolcpus. I still have to look at the last version of cpuset-v2, but
iirc that was better in this regard.

I'm still saying we should kill isolcpus entirely; it's rubbish, and always
has been.


* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-25 18:23       ` Srikar Dronamraju
@ 2018-10-26  8:42         ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2018-10-26  8:42 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

On Thu, Oct 25, 2018 at 11:53:17PM +0530, Srikar Dronamraju wrote:
> > 
> > You can create multiple partitions with cpusets but still have an
> > unbound task in the root cgroup. That would suffer the exact same
> > problems.
> > 
> 
> I probably don't understand this. Even if the child cgroups have
> cpu_exclusive set or sched_load_balance reset, the tasks in the root cgroup
> still have access to all CPUs in the system. Right?

The crucial word being 'access'. The root group will not impose a limit
on the affinity of tasks. So you can indeed have affinities that span
load balancing domains.

And yes, that leads to 'interesting' but specified behaviour.


* Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
  2018-10-26  8:41         ` Peter Zijlstra
@ 2018-10-26  9:30           ` Srikar Dronamraju
  0 siblings, 0 replies; 15+ messages in thread
From: Srikar Dronamraju @ 2018-10-26  9:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Yi Wang,
	zhong.weidong, Yi Liu, Frederic Weisbecker, Thomas Gleixner

* Peter Zijlstra <peterz@infradead.org> [2018-10-26 10:41:05]:

> > #
> > # ./hist.sh
> > 10:35:21  IST   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
> > 10:35:21  IST     0      2645         -    8.70    0.01    0.00    8.71     1  ebizzy
> > 10:35:21  IST     0         -      2645    0.01    0.00    0.00    0.01     1  |__ebizzy
> > 10:35:21  IST     0         -      2647    0.14    0.00    0.00    0.14     1  |__ebizzy
> > 10:35:21  IST     0         -      2648    0.13    0.00    0.00    0.13     1  |__ebizzy
> > 10:35:21  IST     0         -      2649    0.13    0.00    0.00    0.13     1  |__ebizzy
> > 10:35:21  IST     0         -      2650    0.13    0.00    0.00    0.13     1  |__ebizzy
> > 10:35:21  IST     0         -      2651    0.13    0.00    0.00    0.13     1  |__ebizzy
> > 10:35:21  IST     0         -      2652    0.13    0.00    0.00    0.13     1  |__ebizzy
> > 10:35:21  IST     0         -      2653    0.13    0.00    0.00    0.13     1  |__ebizzy
> > 10:35:23  IST   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
> > 10:35:23  IST     0      2642         -    0.00    0.00    0.00    0.00     1  hist.sh
> > 10:35:23  IST     0         -      2642    0.00    0.00    0.00    0.00     1  |__hist.sh
> > #
> 
> That's correct and specified behaviour.
> 
> cpu1 has no sched domain (or the singleton domain, which is the same)
> and thus its tasks will not be migrated.
> 

As I mentioned, if every run behaved the same, it would have been fine.
However, if we increase the cpuset from 1-8 to 1-31, we see spreading more
often, but not always.

./hist.sh
02:59:10  IST   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
02:59:10  IST     0      2244         -  100.00    0.13    0.00  100.00    30  ebizzy
02:59:10  IST     0         -      2244    0.00    0.02    0.00    0.02    30  |__ebizzy
02:59:10  IST     0         -      2246    4.37    0.00    0.00    4.37    21  |__ebizzy
02:59:10  IST     0         -      2247    2.14    0.00    0.00    2.14     8  |__ebizzy
02:59:10  IST     0         -      2248    3.13    0.00    0.00    3.13    28  |__ebizzy
02:59:10  IST     0         -      2249    5.57    0.00    0.00    5.57    18  |__ebizzy
02:59:10  IST     0         -      2250    4.29    0.00    0.00    4.29    11  |__ebizzy
02:59:10  IST     0         -      2251    5.32    0.00    0.00    5.32    19  |__ebizzy
02:59:10  IST     0         -      2252    5.62    0.00    0.00    5.62    17  |__ebizzy
02:59:11  IST   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
02:59:11  IST     0      2241         -    0.00    0.00    0.00    0.00    29  hist.sh
02:59:11  IST     0         -      2241    0.00    0.00    0.00    0.00    29  |__hist.sh


> If there is a problem, it is with cpuset and its interaction with
> isolcpus. I still have to look at the last version of cpuset-v2, but
> iirc that was better in this regard.

Okay.

-- 
Thanks and Regards
Srikar Dronamraju



end of thread, other threads:[~2018-10-26  9:31 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-24  3:02 [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs Srikar Dronamraju
2018-10-24  8:56 ` Mel Gorman
2018-10-24  9:46   ` Srikar Dronamraju
2018-10-24 10:15     ` Peter Zijlstra
2018-10-24 10:41       ` Srikar Dronamraju
2018-10-24 11:21         ` Mel Gorman
2018-10-24 10:31     ` Mel Gorman
2018-10-24 10:03 ` Peter Zijlstra
2018-10-24 10:30   ` Srikar Dronamraju
2018-10-25  0:07     ` Peter Zijlstra
2018-10-25 17:30       ` Srikar Dronamraju
2018-10-26  8:41         ` Peter Zijlstra
2018-10-26  9:30           ` Srikar Dronamraju
2018-10-25 18:23       ` Srikar Dronamraju
2018-10-26  8:42         ` Peter Zijlstra
