* [PATCH] sched: Fix numabalancing to work with isolated cpus
@ 2017-04-04 17:27 Srikar Dronamraju
  2017-04-04 18:56 ` Rik van Riel
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Srikar Dronamraju @ 2017-04-04 17:27 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju

When performing load balancing, numabalancing only looks at
task->cpus_allowed to see if the task can run on the target cpu. If the
isolcpus kernel parameter is set, then the isolated cpus will not be
part of the task->cpus_allowed mask.

For example: (On a Power 8 box running in smt 1 mode)

isolcpus=56,64,72,80,88

Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
/proc/20996/task/20996/status:Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
/proc/20996/task/20997/status:Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
/proc/20996/task/20998/status:Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175

Note: offline cpus are excluded in cpus_allowed_list.

However, a task might call sched_setaffinity() with a mask that includes
all possible cpus in the system, including the isolated cpus.

For example:
perf bench numa mem --no-data_rand_walk -p 4 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 1000
calls sched_setaffinity(), which resets the cpus_allowed mask.

Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168

The isolated cpus are part of the cpus allowed list. In the above case,
numabalancing ends up scheduling some of these tasks on isolated cpus.

To avoid this, check for isolated cpus before choosing a target cpu.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f045a35..f853dc0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1666,6 +1666,10 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, &env->p->cpus_allowed))
 			continue;
 
+		/* Skip isolated cpus */
+		if (cpumask_test_cpu(cpu, cpu_isolated_map))
+			continue;
+
 		env->dst_cpu = cpu;
 		task_numa_compare(env, taskimp, groupimp);
 	}
-- 
1.8.3.1
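The intent of the one-line check in the patch above can be sketched in
userspace (a minimal Python model, not kernel code; the cpu numbers and
the numa_target_cpus() helper are illustrative only):

```python
# Minimal model of the target-cpu filter task_numa_find_cpu() applies
# with this patch: a cpu is a valid NUMA-balancing target only if it is
# in the task's allowed mask AND not in the isolated map.
def numa_target_cpus(node_cpus, cpus_allowed, isolated):
    """Return the cpus in node_cpus that NUMA balancing may pick."""
    return [cpu for cpu in node_cpus
            if cpu in cpus_allowed and cpu not in isolated]

# Mirrors the report: isolcpus=56,64,72,80,88 and a task whose affinity
# was reset by sched_setaffinity() to every 8th cpu (0,8,...,168).
isolated = {56, 64, 72, 80, 88}
allowed = set(range(0, 176, 8))
node = list(range(48, 96, 8))   # a hypothetical node's cpus: 48,56,...,88

print(numa_target_cpus(node, allowed, isolated))   # -> [48]
# Without the isolated-map check, 56/64/72/80/88 would also be targets.
```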

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-04 17:27 [PATCH] sched: Fix numabalancing to work with isolated cpus Srikar Dronamraju
@ 2017-04-04 18:56 ` Rik van Riel
  2017-04-04 20:37 ` Mel Gorman
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2017-04-04 18:56 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra; +Cc: LKML, Mel Gorman

On Tue, 2017-04-04 at 22:57 +0530, Srikar Dronamraju wrote:
> 
> The isolated cpus are part of the cpus allowed list. In the above
> case,
> numabalancing ends up scheduling some of these tasks on isolated
> cpus.
> 
> To avoid this, please check for isolated cpus before choosing a
> target
> cpu.
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Reviewed-by: Rik van Riel <riel@redhat.com>


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-04 17:27 [PATCH] sched: Fix numabalancing to work with isolated cpus Srikar Dronamraju
  2017-04-04 18:56 ` Rik van Riel
@ 2017-04-04 20:37 ` Mel Gorman
  2017-04-05  1:50   ` Srikar Dronamraju
  2017-04-05 12:57 ` Michal Hocko
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2017-04-04 20:37 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel

On Tue, Apr 04, 2017 at 10:57:28PM +0530, Srikar Dronamraju wrote:
> When performing load balancing, numabalancing only looks at
> task->cpus_allowed to see if the task can run on the target cpu. If
> isolcpus kernel parameter is set, then isolated cpus will not be part of
> mask task->cpus_allowed.
> 
> For example: (On a Power 8 box running in smt 1 mode)
> 
> isolcpus=56,64,72,80,88
> 
> Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> /proc/20996/task/20996/status:Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> /proc/20996/task/20997/status:Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> /proc/20996/task/20998/status:Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> 
> Note: offline cpus are excluded in cpus_allowed_list.
> 
> However a task might call sched_setaffinity() that includes all possible
> cpus in the system including the isolated cpus.
> 
> For example:
> perf bench numa mem --no-data_rand_walk -p 4 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 1000
> would call sched_setaffinity that resets the cpus_allowed mask.
> 
> Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> 
> The isolated cpus are part of the cpus allowed list. In the above case,
> numabalancing ends up scheduling some of these tasks on isolated cpus.
> 
> To avoid this, please check for isolated cpus before choosing a target
> cpu.
> 

Hmm, would this also prevent a task running inside a cgroup that is
allowed access to isolated CPUs from balancing? I severely doubt it
matters because if a process is isolated from interference then it
follows that automatic NUMA balancing should not be involved. If
anything the protection should be absolute but either way;

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-04 20:37 ` Mel Gorman
@ 2017-04-05  1:50   ` Srikar Dronamraju
  2017-04-05  8:09     ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Srikar Dronamraju @ 2017-04-05  1:50 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel

> > 
> > To avoid this, please check for isolated cpus before choosing a target
> > cpu.
> > 
> 
> Hmm, would this also prevent a task running inside a cgroup that is
> allowed accessed to isolated CPUs from balancing? I severely doubt it

Scheduler doesn't do any kind of load balancing for isolated cpus.

# grep -o "isolcpus=.*" /proc/cmdline 
isolcpus=56,64,72,80,88
# taskset -c "56,64,72,80" ebizzy -S 100 -t 44

(on another terminal)
# pgrep ebizzy
10437
# cat /proc/10437/status |grep Cpus_allowed_list
Cpus_allowed_list:      56,64,72,80
#

But all the tasks would only run on cpu 56, even if it is fully
overloaded.  Cpus 64, 72 and 80 would be completely idle.  So on
isolated cpus, it is the user who is in full control of scheduling
tasks on the cpus.

> matters because if a process is isolated from interference then it
> follows that automatic NUMA balancing should not be involved. If

Yes, as an extension of the above, numa balancing should not be
involved.

> anything the protection should be absolute but either way;
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
> 

Thanks.


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-05  1:50   ` Srikar Dronamraju
@ 2017-04-05  8:09     ` Mel Gorman
  0 siblings, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2017-04-05  8:09 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel

On Wed, Apr 05, 2017 at 07:20:06AM +0530, Srikar Dronamraju wrote:
> > > 
> > > To avoid this, please check for isolated cpus before choosing a target
> > > cpu.
> > > 
> > 
> > Hmm, would this also prevent a task running inside a cgroup that is
> > allowed accessed to isolated CPUs from balancing? I severely doubt it
> 
> Scheduler doesn't do any kind of load balancing for isolated cpus.
> 

I was referring specifically to numa balancing, not load balancing, but
I should have been clearer.

> > matters because if a process is isolated from interference then it
> > follows that automatic NUMA balancing should not be involved. If
> 
> Yes, as an extension of the above, numa balancing should not be
> involved.
> 

If anything, it arguably is a more sensible fix. If tasks on isolated CPUs
should not be interfered with, then that should include the NUMA scanner
running in task context. If the PTEs are not updated then the faults are
not incurred, which would be a much larger saving in overhead overall. There
would be a potential corner case where two threads in the same address
space run in separate cpusets, but that would be somewhat of an odd corner case.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-04 17:27 [PATCH] sched: Fix numabalancing to work with isolated cpus Srikar Dronamraju
  2017-04-04 18:56 ` Rik van Riel
  2017-04-04 20:37 ` Mel Gorman
@ 2017-04-05 12:57 ` Michal Hocko
  2017-04-05 15:22   ` Srikar Dronamraju
  2017-04-06  7:36 ` Mike Galbraith
  2017-04-06  7:36 ` Peter Zijlstra
  4 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-04-05 12:57 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel

On Tue 04-04-17 22:57:28, Srikar Dronamraju wrote:
[...]
> For example:
> perf bench numa mem --no-data_rand_walk -p 4 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 1000
> would call sched_setaffinity that resets the cpus_allowed mask.
> 
> Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> 
> The isolated cpus are part of the cpus allowed list. In the above case,
> numabalancing ends up scheduling some of these tasks on isolated cpus.

Why is this bad? If the task is allowed to run on isolated CPUs then why
shouldn't its numa balancing be allowed the same? The changelog
describes what but doesn't explain _why_ this change is needed/useful.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-05 12:57 ` Michal Hocko
@ 2017-04-05 15:22   ` Srikar Dronamraju
  2017-04-05 16:44     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Srikar Dronamraju @ 2017-04-05 15:22 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel

* Michal Hocko <mhocko@kernel.org> [2017-04-05 14:57:43]:

> On Tue 04-04-17 22:57:28, Srikar Dronamraju wrote:
> [...]
> > For example:
> > perf bench numa mem --no-data_rand_walk -p 4 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 1000
> > would call sched_setaffinity that resets the cpus_allowed mask.
> > 
> > Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> > Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > 
> > The isolated cpus are part of the cpus allowed list. In the above case,
> > numabalancing ends up scheduling some of these tasks on isolated cpus.
> 
> Why is this bad? If the task is allowed to run on isolated CPUs then why

1. kernel-parameters.txt describes isolcpus as "Isolate CPUs from the
general scheduler." So the expectation that numabalancing can schedule
tasks on them is wrong.

2. If numabalancing was disabled, the task would never run on the
isolated CPUs.

3. With the faulty behaviour, it was observed that tasks scheduled on
the isolated cpus might end up taking more time, because they never get
a chance to move back to a node which has local memory.

4. The isolated cpus may be idle at that point, but actual work may be
scheduled on the isolcpus later (when numabalancing has already
scheduled work onto them). Since the scheduler doesn't do any balancing
on isolcpus, even if they are overloaded while the rest of the system is
completely free, the isolcpus stay overloaded.

> shouldn't its numa balancing be allowed the same? The changelog
> describes what but doesn't explain _why_ this change is needed/useful.


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-05 15:22   ` Srikar Dronamraju
@ 2017-04-05 16:44     ` Michal Hocko
  2017-04-06  7:19       ` Srikar Dronamraju
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-04-05 16:44 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel

On Wed 05-04-17 20:52:15, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@kernel.org> [2017-04-05 14:57:43]:
> 
> > On Tue 04-04-17 22:57:28, Srikar Dronamraju wrote:
> > [...]
> > > For example:
> > > perf bench numa mem --no-data_rand_walk -p 4 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 1000
> > > would call sched_setaffinity that resets the cpus_allowed mask.
> > > 
> > > Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> > > Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > > Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > > Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > > Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > > 
> > > The isolated cpus are part of the cpus allowed list. In the above case,
> > > numabalancing ends up scheduling some of these tasks on isolated cpus.
> > 
> > Why is this bad? If the task is allowed to run on isolated CPUs then why
> 
> 1. kernel-parameters.txt states: isolcpus as "Isolate CPUs from the
> general scheduler." So the expectation that numabalancing can schedule
> tasks on it is wrong.

Right, but if the task is allowed to run on isolated cpus, then the numa
balancing for this task should be allowed to run on those cpus, no?
Say your application were bound _only_ to isolated cpus. Should that
imply no numa balancing at all?

> 2. If numabalancing was disabled, the task would never run on the
> isolated CPUs.

I am confused. I thought you said "However a task might call
sched_setaffinity() that includes all possible cpus in the system
including the isolated cpus." So the task is allowed to run there.
Or am I missing something?

> 3. With the faulty behaviour, it was observed that tasks scheduled on
> the isolated cpus might end up taking more time, because they never get
> a chance to move back to a node which has local memory.

I am not sure I understand.

> 4. The isolated cpus may be idle at that point, but actual work may be
> scheduled on isolcpus later (when numabalancing had already scheduled
> work on to it.) Since scheduler doesnt do any balancing on isolcpus even
> if they are overloaded and the system is completely free, the isolcpus
> stay overloaded.

Please note that I do not claim the patch is wrong. I am still not sure
myself, but the changelog is missing the most important information:
"why the change is the right thing".
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-05 16:44     ` Michal Hocko
@ 2017-04-06  7:19       ` Srikar Dronamraju
  2017-04-06  7:34         ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Srikar Dronamraju @ 2017-04-06  7:19 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel

> > > > The isolated cpus are part of the cpus allowed list. In the above case,
> > > > numabalancing ends up scheduling some of these tasks on isolated cpus.
> > > 
> > > Why is this bad? If the task is allowed to run on isolated CPUs then why
> > 
> > 1. kernel-parameters.txt states: isolcpus as "Isolate CPUs from the
> > general scheduler." So the expectation that numabalancing can schedule
> > tasks on it is wrong.
> 
> Right but if the task is allowed to run on isolated cpus then the numa
> balancing for this taks should be allowed to run on those cpus, no?

No numabalancing or any other scheduler balancing should be looking at
tasks that are bound to isolated cpus.

A similar example to the one I gave in my reply to Mel.

Let's consider a 2-node, 24-core system with 12 cores in each node:
cores 0-11 in one node and cores 12-23 in the other. Let's also disable
smt/hyperthreading and enable isolcpus for cores 6-11,12-17. Let's run
a 48-thread ebizzy workload and give it a cpu list of, say, 11,12-17
using taskset.

Now all 48 ebizzy threads will only run on core 11. They will never
spread to other cores, whether in the same node (including the isolated
cpus in the same node) or in the other node. I.e. whether numabalancing
is running or not, and whether my fix is around or not, all threads will
be confined to core 11, even though cpus_allowed is 11,12-17.
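The pile-up described above can be sketched with a little set arithmetic
(a hypothetical userspace model, not scheduler code; the cpu numbers
match the example):

```python
# Why every ebizzy thread stays on core 11: the regular load balancer
# only operates on non-isolated cpus, so its candidate set is the
# task's affinity mask minus the isolated map.
isolated = set(range(6, 18))            # isolcpus=6-11,12-17
affinity = {11} | set(range(12, 18))    # taskset -c 11,12-17

candidates = affinity - isolated        # cpus the balancer may use
print(sorted(candidates))               # -> [] (empty: no balancing)
# With an empty candidate set the threads are never spread out; they
# remain on the cpu where they were first placed (core 11 here).
```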

> Say your application would be bound _only_ to isolated cpus. Should that
> imply no numa balancing at all?

Yes, it implies no numa balancing.

> 
> > 2. If numabalancing was disabled, the task would never run on the
> > isolated CPUs.
> 
> I am confused. I thought you said "However a task might call
> sched_setaffinity() that includes all possible cpus in the system
> including the isolated cpus." So the task is allowed to run there.
> Or am I missing something?
> 

Peter, Rik, Ingo can correct me here.

I feel most programs that call sched_setaffinity, including perf bench,
are written with the assumption that they are never run with isolcpus.
If they were written with isolcpus in mind, they would have either
looked at sched_getaffinity and modified the mask, or they would have
explicitly looked for the isolated map and derived the mask from it.
Just because the program calls sched_setaffinity with all cpus,
including the isolcpus, we can't assume that the user wanted us to do
scheduling on top of isolated CPUs. That would break isolcpus.

> > 3. With the faulty behaviour, it was observed that tasks scheduled on
> > the isolated cpus might end up taking more time, because they never get
> > a chance to move back to a node which has local memory.
> 
> I am not sure I understand.

Let's say we ran a 48-thread perf bench (2 processes/12 threads each) on
a 24-cpu/2-node/12-cores-per-node system with 4 isolated cpus, all in
one node. The assumption here is that each process would eventually end
up consolidating in one node. However, let's say one of the perf
threads was moved to an isolated cpu on the wrong node, i.e. it ends up
accessing remote memory. (Why does it end up on the wrong node? Because
when we compare numa faults, this thread's fault count may be low at
present, and it should have moved back to the right node eventually.)
However, once it gets scheduled on an isolated cpu, it won't be picked
for moving out. So it ends up taking far more time to finish. I have
observed this in practice.

> 
> > 4. The isolated cpus may be idle at that point, but actual work may be
> > scheduled on isolcpus later (when numabalancing had already scheduled
> > work on to it.) Since scheduler doesnt do any balancing on isolcpus even
> > if they are overloaded and the system is completely free, the isolcpus
> > stay overloaded.
> 
> Please note that I do not claim the patch is wrong. I am still not sure
> myself but the chagelog is missing the most important information "why
> the change is the right thing".

 I am open to editing the changelog; I assumed that the isolcpus kernel
 parameter was clear that no scheduling algorithms can interfere with
 isolcpus. Would stating this in the changelog clarify to you that this
 change is right?


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-06  7:19       ` Srikar Dronamraju
@ 2017-04-06  7:34         ` Michal Hocko
  2017-04-06  9:23           ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-04-06  7:34 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel

On Thu 06-04-17 12:49:50, Srikar Dronamraju wrote:
> > > > > The isolated cpus are part of the cpus allowed list. In the above case,
> > > > > numabalancing ends up scheduling some of these tasks on isolated cpus.
> > > > 
> > > > Why is this bad? If the task is allowed to run on isolated CPUs then why
> > > 
> > > 1. kernel-parameters.txt states: isolcpus as "Isolate CPUs from the
> > > general scheduler." So the expectation that numabalancing can schedule
> > > tasks on it is wrong.
> > 
> > Right but if the task is allowed to run on isolated cpus then the numa
> > balancing for this taks should be allowed to run on those cpus, no?
> 
> No numabalancing or any other scheduler balancing should be looking at
> tasks that are bound to isolated cpus.

Is this documented anywhere? My understanding of isolcpus is that it
makes sure that nothing _outside_ of the dedicated workload interferes.
But why the dedicated workload shouldn't be numa balanced is not clear
to me at all.

> Similar example that I gave in my reply to Mel.
> 
> Lets consider 2 node, 24 core with 12 cores in each node.
> Cores 0-11 in Node 1 and cores 12-23 in the other node.
> Lets also disable smt/hyperthreading, enable isolcpus from core
> 6-11,12-17.  Lets run 48 thread ebizzy workload and give it a cpu list
> of say 11,12-17 using taskset.
> 
> Now all the 48 ebizzy threads will only run on core 11. It will never
> spread to other cores even in the same node(or in the same node/but
> isolated cpus) or to the different nodes. i.e even if numabalancing is
> running or not, even if my fix is around or not, all threads will be
> confined to core 11, even though the cpus_allowed is 11,12-17.

Isn't that a bug in the isolcpus implementation? It is certainly
unexpected behavior, I would say. Is this documented anywhere?

> > Say your application would be bound _only_ to isolated cpus. Should that
> > imply no numa balancing at all?
> 
> Yes, it implies no numa balancing.
> 
> > 
> > > 2. If numabalancing was disabled, the task would never run on the
> > > isolated CPUs.
> > 
> > I am confused. I thought you said "However a task might call
> > sched_setaffinity() that includes all possible cpus in the system
> > including the isolated cpus." So the task is allowed to run there.
> > Or am I missing something?
> > 
> 
> Peter, Rik, Ingo can correct me here.
> 
> I feel most programs that call sched_setaffinity including perf bench
> are written with an assumption that they are never run with isolcpus.

Isn't sched_setaffinity the only way to actually make it possible to
run on isolcpus?

> > Please note that I do not claim the patch is wrong. I am still not sure
> > myself but the chagelog is missing the most important information "why
> > the change is the right thing".
> 
>  I am open to editing the changelog, I assumed that isolcpus kernel
>  parameter was clear that no scheduling algorithms can interfere with
>  isolcpus. Would stating this in the changelog clarify to you that this
>  change is right?

I would really like to see it confirmed by the scheduler maintainers and
documented properly as well. What you are claiming here is rather
surprising to my understanding of what isolcpus actually is.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-04 17:27 [PATCH] sched: Fix numabalancing to work with isolated cpus Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2017-04-05 12:57 ` Michal Hocko
@ 2017-04-06  7:36 ` Mike Galbraith
  2017-04-06  7:36 ` Peter Zijlstra
  4 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2017-04-06  7:36 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel

On Tue, 2017-04-04 at 22:57 +0530, Srikar Dronamraju wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f045a35..f853dc0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1666,6 +1666,10 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		if (!cpumask_test_cpu(cpu, &env->p->cpus_allowed))
>  			continue;
>  
> +		/* Skip isolated cpus */
> +		if (cpumask_test_cpu(cpu, cpu_isolated_map))
> +			continue;

Methinks that should check for !cpu_rq(cpu)->sd, so the thing doesn't
screw up cpuset isolation either.
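A toy model of the distinction being drawn here (hypothetical cpu
numbers, userspace Python rather than kernel code): cpu_isolated_map
only covers boot-time isolcpus, while a NULL rq->sd also covers cpus
detached from the sched domains by cpuset partitioning.

```python
# Assume cpus 56,64 were isolated via isolcpus= and cpus 72,80 via an
# isolated cpuset partition; none of them is attached to a sched domain.
bootarg_isolated = {56, 64}
cpuset_isolated = {72, 80}

def has_sched_domain(cpu):
    # Models "cpu_rq(cpu)->sd != NULL" for this toy topology.
    return cpu not in (bootarg_isolated | cpuset_isolated)

skipped_by_patch = {c for c in range(96) if c in bootarg_isolated}
skipped_by_sd_check = {c for c in range(96) if not has_sched_domain(c)}

# The sd-based check additionally skips the cpuset-isolated cpus:
print(sorted(skipped_by_sd_check - skipped_by_patch))   # -> [72, 80]
```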

	-Mike


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-04 17:27 [PATCH] sched: Fix numabalancing to work with isolated cpus Srikar Dronamraju
                   ` (3 preceding siblings ...)
  2017-04-06  7:36 ` Mike Galbraith
@ 2017-04-06  7:36 ` Peter Zijlstra
  4 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2017-04-06  7:36 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel

On Tue, Apr 04, 2017 at 10:57:28PM +0530, Srikar Dronamraju wrote:
> When performing load balancing, numabalancing only looks at
> task->cpus_allowed to see if the task can run on the target cpu. If
> isolcpus kernel parameter is set, then isolated cpus will not be part of
> mask task->cpus_allowed.
> 
> For example: (On a Power 8 box running in smt 1 mode)
> 
> isolcpus=56,64,72,80,88
> 
> Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> /proc/20996/task/20996/status:Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> /proc/20996/task/20997/status:Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> /proc/20996/task/20998/status:Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> 
> Note: offline cpus are excluded in cpus_allowed_list.
> 
> However a task might call sched_setaffinity() that includes all possible
> cpus in the system including the isolated cpus.
> 
> For example:
> perf bench numa mem --no-data_rand_walk -p 4 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 1000
> would call sched_setaffinity that resets the cpus_allowed mask.
> 
> Cpus_allowed_list:	0-55,57-63,65-71,73-79,81-87,89-175
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list:	0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> 
> The isolated cpus are part of the cpus allowed list. In the above case,
> numabalancing ends up scheduling some of these tasks on isolated cpus.
> 
> To avoid this, please check for isolated cpus before choosing a target
> cpu.

Is there anything stopping the numa balancer taking tasks off an
isolated CPU?

Its been too long since I've looked at the NUMA bits; but from a quick
reading we mostly completely ignore the sched domain stuff.

That means there's likely to be more holes here, and just plugging them
as we find them doesn't appear to be the best approach.

For example, if we use cpusets to partition the scheduler but somehow
leave a task in the root group, the numa balancer looks like it will
happily migrate tasks between the partitions.

So please try and fix the bigger problem, then I think this one will go
away as well.


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-06  7:34         ` Michal Hocko
@ 2017-04-06  9:23           ` Peter Zijlstra
  2017-04-06 10:13             ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2017-04-06  9:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Srikar Dronamraju, Ingo Molnar, LKML, Mel Gorman, Rik van Riel

On Thu, Apr 06, 2017 at 09:34:36AM +0200, Michal Hocko wrote:
> On Thu 06-04-17 12:49:50, Srikar Dronamraju wrote:

> > Similar example that I gave in my reply to Mel.
> > 
> > Lets consider 2 node, 24 core with 12 cores in each node.
> > Cores 0-11 in Node 1 and cores 12-23 in the other node.
> > Lets also disable smt/hyperthreading, enable isolcpus from core
> > 6-11,12-17.  Lets run 48 thread ebizzy workload and give it a cpu list
> > of say 11,12-17 using taskset.
> > 
> > Now all the 48 ebizzy threads will only run on core 11. It will never
> > spread to other cores even in the same node(or in the same node/but
> > isolated cpus) or to the different nodes. i.e even if numabalancing is
> > running or not, even if my fix is around or not, all threads will be
> > confined to core 11, even though the cpus_allowed is 11,12-17.

Argh, why such a convoluted example :-(

> Isn't that a bug in isolcpus implementation? It is certainly an
> unexpected behavior I would say. Is this documented anywhere?

Without engaging the brain too much to decipher the example, it does
look right: isolcpus will have no balancing.

> Isn't sched_setaffinity the only way how to actually make it possible to
> run on isolcpus?

Think so. Personally, I hate isolcpus and never use it.

> I would really like to see it confirmed by the scheduler maintainers and
> documented properly as well. What you are claiming here is rather
> surprising to my understanding of what isolcpus acutally is.

isolcpus gets you a set of fully partitioned CPUs. What's surprising
about that?


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-06  9:23           ` Peter Zijlstra
@ 2017-04-06 10:13             ` Michal Hocko
  2017-04-06 10:29               ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-04-06 10:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, LKML, Mel Gorman, Rik van Riel

On Thu 06-04-17 11:23:29, Peter Zijlstra wrote:
> On Thu, Apr 06, 2017 at 09:34:36AM +0200, Michal Hocko wrote:
[...]
> > I would really like to see it confirmed by the scheduler maintainers and
> > documented properly as well. What you are claiming here is rather
> > surprising to my understanding of what isolcpus acutally is.
> 
> isolcpus gets you a set of fully partitioned CPUs. What's surprising
> about that?

Well, I thought that all isolated cpus simply form their own scheduling
domain which is isolated from the general workload on the system
(kthreads, softirqs etc...).

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-06 10:13             ` Michal Hocko
@ 2017-04-06 10:29               ` Peter Zijlstra
  2017-04-06 10:42                 ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2017-04-06 10:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Srikar Dronamraju, Ingo Molnar, LKML, Mel Gorman, Rik van Riel

On Thu, Apr 06, 2017 at 12:13:49PM +0200, Michal Hocko wrote:
> On Thu 06-04-17 11:23:29, Peter Zijlstra wrote:
> > On Thu, Apr 06, 2017 at 09:34:36AM +0200, Michal Hocko wrote:
> [...]
> > > I would really like to see it confirmed by the scheduler maintainers and
> > > documented properly as well. What you are claiming here is rather
> > > surprising to my understanding of what isolcpus actually is.
> > 
> > isolcpus gets you a set of fully partitioned CPUs. What's surprising
> > about that?
> 
> Well, I thought that all isolated cpus simply form their own scheduling
> domain which is isolated from the general workload on the system
> (kthreads, softirqs etc...).

No, they all form their own 1 cpu partition.


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-06 10:29               ` Peter Zijlstra
@ 2017-04-06 10:42                 ` Michal Hocko
  2017-04-06 10:47                   ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-04-06 10:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, LKML, Mel Gorman, Rik van Riel

On Thu 06-04-17 12:29:57, Peter Zijlstra wrote:
> On Thu, Apr 06, 2017 at 12:13:49PM +0200, Michal Hocko wrote:
> > On Thu 06-04-17 11:23:29, Peter Zijlstra wrote:
> > > On Thu, Apr 06, 2017 at 09:34:36AM +0200, Michal Hocko wrote:
> > [...]
> > > > I would really like to see it confirmed by the scheduler maintainers and
> > > > documented properly as well. What you are claiming here is rather
> > > > surprising to my understanding of what isolcpus actually is.
> > > 
> > > isolcpus gets you a set of fully partitioned CPUs. What's surprising
> > > about that?
> > 
> > Well, I thought that all isolated cpus simply form their own scheduling
> > domain which is isolated from the general workload on the system
> > (kthreads, softirqs etc...).
> 
> No, they all form their own 1 cpu partition.

Is this something dictated by use cases which rely on isolcpus, or did
nobody bother to implement one scheduling domain?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-06 10:42                 ` Michal Hocko
@ 2017-04-06 10:47                   ` Peter Zijlstra
  2017-04-06 13:44                     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2017-04-06 10:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Srikar Dronamraju, Ingo Molnar, LKML, Mel Gorman, Rik van Riel

On Thu, Apr 06, 2017 at 12:42:04PM +0200, Michal Hocko wrote:

> Is this something dictated by use cases which rely on isolcpus, or did
> nobody bother to implement one scheduling domain?

It's from the original use-case, I suspect. It was done very much on
purpose.

If you want bigger partitions use cpusets. Or rather, like I've been
saying for many years, use cpusets for everything and kill isolcpus.

The problem with that last part is cgroups: you then need to make a !root
cgroup the 'default' cgroup in order to represent that.

I really should've done this before all that cgroup crap came along; too
late now :/
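For reference, the cpuset route alluded to above looks roughly like this. This is a hypothetical sketch using the legacy cgroup-v1 cpuset interface; the mount point, the "rt" group name and the CPU numbers are illustrative (reusing the isolated CPUs from the changelog), not part of the patch, and the commands must run as root.

```shell
# Mount the v1 cpuset controller (skip if already mounted).
mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

# Carve the would-be isolated CPUs out into an exclusive partition.
mkdir /sys/fs/cgroup/cpuset/rt
echo 56,64,72,80,88 > /sys/fs/cgroup/cpuset/rt/cpuset.cpus
echo 0              > /sys/fs/cgroup/cpuset/rt/cpuset.mems
echo 1              > /sys/fs/cgroup/cpuset/rt/cpuset.cpu_exclusive

# Stop the scheduler from balancing across the partition boundary.
echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

# Move the latency-sensitive task in.
echo "$PID" > /sys/fs/cgroup/cpuset/rt/tasks
```

Unlike isolcpus, such a partition can span several CPUs and can be reconfigured at runtime.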


* Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
  2017-04-06 10:47                   ` Peter Zijlstra
@ 2017-04-06 13:44                     ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-04-06 13:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, LKML, Mel Gorman, Rik van Riel

On Thu 06-04-17 12:47:57, Peter Zijlstra wrote:
> On Thu, Apr 06, 2017 at 12:42:04PM +0200, Michal Hocko wrote:
> 
> > Is this something dictated by use cases which rely on isolcpus, or did
> > nobody bother to implement one scheduling domain?
> 
> It's from the original use-case, I suspect. It was done very much on
> purpose.

OK, fair enough.
 
> If you want bigger partitions use cpusets. Or rather, like I've been
> saying for many years, use cpusets for everything and kill isolcpus.

I thought that the only point of isolcpus is to provide an exclusive set
of cpus on which only selected userspace will ever run, mostly to shield
it off from random system activity. But maybe this is just my
misinterpretation of what this feature offers.

Anyway, I am shifting this off-topic, so I will not interfere further. Thanks
for the clarification.
-- 
Michal Hocko
SUSE Labs


end of thread, other threads:[~2017-04-06 13:44 UTC | newest]

Thread overview: 18+ messages
-- links below jump to the message on this page --
2017-04-04 17:27 [PATCH] sched: Fix numabalancing to work with isolated cpus Srikar Dronamraju
2017-04-04 18:56 ` Rik van Riel
2017-04-04 20:37 ` Mel Gorman
2017-04-05  1:50   ` Srikar Dronamraju
2017-04-05  8:09     ` Mel Gorman
2017-04-05 12:57 ` Michal Hocko
2017-04-05 15:22   ` Srikar Dronamraju
2017-04-05 16:44     ` Michal Hocko
2017-04-06  7:19       ` Srikar Dronamraju
2017-04-06  7:34         ` Michal Hocko
2017-04-06  9:23           ` Peter Zijlstra
2017-04-06 10:13             ` Michal Hocko
2017-04-06 10:29               ` Peter Zijlstra
2017-04-06 10:42                 ` Michal Hocko
2017-04-06 10:47                   ` Peter Zijlstra
2017-04-06 13:44                     ` Michal Hocko
2017-04-06  7:36 ` Mike Galbraith
2017-04-06  7:36 ` Peter Zijlstra
