linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 0/2] cpuset: Enable cpuset controller in default hierarchy
@ 2018-03-21 16:21 Waiman Long
  2018-03-21 16:21 ` [PATCH v6 1/2] " Waiman Long
  2018-03-21 16:21 ` [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2 Waiman Long
  0 siblings, 2 replies; 14+ messages in thread
From: Waiman Long @ 2018-03-21 16:21 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto, efault,
	torvalds, Roman Gushchin, Waiman Long

v6:
 - Hide cpuset control knobs in root cgroup.
 - Rename effective_cpus and effective_mems to cpus.effective and
   mems.effective respectively.
 - Remove cpuset.flags and add cpuset.sched_load_balance instead,
   as the behavior of sched_load_balance has changed and it is no
   longer a simple flag.
 - Update cgroup-v2.txt accordingly.

v5:
 - Add patch 2 to provide the cpuset.flags control knob for the
   sched_load_balance flag, which should be the only feature that is
   essential as a replacement for the "isolcpus" kernel boot parameter.

v4:
 - Further minimize the feature set by removing the flags control knob.

v3:
 - Further trim the additional features down to just memory_migrate.
 - Update Documentation/cgroup-v2.txt.

The purpose of this patchset is to provide a minimal set of cpuset
features for cgroup v2. That minimal set includes cpus, mems,
cpus.effective, mems.effective and sched_load_balance. The last one is
needed to support use cases similar to the "isolcpus" kernel parameter.

This patchset does not exclude the possibility of adding more flags
and features in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their
effective counterparts.

Patch 2 adds sched_load_balance, whose behavior changes in v2 to become
hierarchical and to include an implicit !cpu_exclusive.
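
For illustration, here is a minimal usage sketch of the proposed
interface, assuming a kernel with this series applied, cgroup2 mounted
at /sys/fs/cgroup and an example child cgroup named "rt" (the mount
point and cgroup name are illustrative):

  # Enable the cpuset controller for children of the root cgroup.
  echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/rt
  # Constrain the cgroup to CPUs 2-3 and memory node 0.
  echo 2-3 > /sys/fs/cgroup/rt/cpuset.cpus
  echo 0 > /sys/fs/cgroup/rt/cpuset.mems
  cat /sys/fs/cgroup/rt/cpuset.cpus.effective
  # Turn off load balancing for this cpuset, similar to "isolcpus".
  echo 0 > /sys/fs/cgroup/rt/cpuset.sched_load_balance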

Waiman Long (2):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Add cpuset.sched_load_balance to v2

 Documentation/cgroup-v2.txt | 112 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/cgroup/cpuset.c      | 104 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 201 insertions(+), 15 deletions(-)

-- 
1.8.3.1


* [PATCH v6 1/2] cpuset: Enable cpuset controller in default hierarchy
  2018-03-21 16:21 [PATCH v6 0/2] cpuset: Enable cpuset controller in default hierarchy Waiman Long
@ 2018-03-21 16:21 ` Waiman Long
  2018-03-21 16:21 ` [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2 Waiman Long
  1 sibling, 0 replies; 14+ messages in thread
From: Waiman Long @ 2018-03-21 16:21 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto, efault,
	torvalds, Roman Gushchin, Waiman Long

Now that thread mode has been merged in 4.14, it is time to enable
cpuset to be used in the default hierarchy (cgroup v2), as it is
clearly threaded.

The cpuset controller has experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit CPUs and memory nodes, there are a bunch of
additional features that can be controlled from userspace. Some of
these features are of doubtful usefulness and may not be actively used.

This patch enables the cpuset controller in the default hierarchy with
a minimal set of features, namely just cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy later on if there is a real user need for them.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to the
memory controller or maybe to the cpu controller instead of keeping
them in cpuset.
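
As a quick sketch of how the new files behave (assuming cgroup2 mounted
at /sys/fs/cgroup with the cpuset controller enabled through
cgroup.subtree_control, an 8-CPU machine and an example cgroup named
"child"; all of these are illustrative):

  mkdir /sys/fs/cgroup/child
  # An empty "cpuset.cpus" means: inherit from the nearest non-empty
  # ancestor, which is reflected in the read-only effective file.
  cat /sys/fs/cgroup/child/cpuset.cpus             # empty
  cat /sys/fs/cgroup/child/cpuset.cpus.effective   # e.g. 0-7
  # After an explicit assignment the effective value follows it,
  # further trimmed by CPU hotplug when CPUs go offline.
  echo 0-3 > /sys/fs/cgroup/child/cpuset.cpus
  cat /sys/fs/cgroup/child/cpuset.cpus.effective   # 0-3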

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt | 90 ++++++++++++++++++++++++++++++++++++++++++---
 kernel/cgroup/cpuset.c      | 48 ++++++++++++++++++++++--
 2 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..ed8ec66 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
        5-3-2. Writeback
      5-4. PID
        5-4-1. PID Interface Files
-     5-5. Device
-     5-6. RDMA
-       5-6-1. RDMA Interface Files
-     5-7. Misc
-       5-7-1. perf_event
+     5-5. Cpuset
+       5-5-1. Cpuset Interface Files
+     5-6. Device
+     5-7. RDMA
+       5-7-1. RDMA Interface Files
+     5-8. Misc
+       5-8-1. perf_event
      5-N. Non-normative information
        5-N-1. CPU controller root cgroup process behaviour
        5-N-2. IO controller root cgroup process behaviour
@@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+	A read-write multiple values file which exists on non-root
+	cgroups.
+
+	It lists the CPUs allowed to be used by tasks within this
+	cgroup.  The CPU numbers are comma-separated numbers or
+	ranges.  For example:
+
+	  # cat cpuset.cpus
+	  0-4,6,8-10
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.cpus" or all the available CPUs if none is found.
+
+	The value of "cpuset.cpus" stays constant until the next update
+	and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+	A read-only multiple values file which exists on non-root
+	cgroups.
+
+	It lists the onlined CPUs that are actually allowed to be
+	used by tasks within the current cgroup.  If "cpuset.cpus"
+	is empty, it shows all the CPUs from the parent cgroup that
+	will be available to be used by this cgroup.  Otherwise, it is
+	a subset of "cpuset.cpus".  Its value will be affected by CPU
+	hotplug events.
+
+  cpuset.mems
+	A read-write multiple values file which exists on non-root
+	cgroups.
+
+	It lists the memory nodes allowed to be used by tasks within
+	this cgroup.  The memory node numbers are comma-separated
+	numbers or ranges.  For example:
+
+	  # cat cpuset.mems
+	  0-1,3
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.mems" or all the available memory nodes if none
+	is found.
+
+	The value of "cpuset.mems" stays constant until the next update
+	and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+	A read-only multiple values file which exists on non-root
+	cgroups.
+
+	It lists the onlined memory nodes that are actually allowed to
+	be used by tasks within the current cgroup.  If "cpuset.mems"
+	is empty, it shows all the memory nodes from the parent cgroup
+	that will be available to be used by this cgroup.  Otherwise,
+	it is a subset of "cpuset.mems".  Its value will be affected
+	by memory nodes hotplug events.
+
+
 Device controller
 -----------------
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b42037e..419b758 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1823,12 +1823,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	return 0;
 }
 
-
 /*
  * for the common functions, 'private' gives the type of file
  */
 
-static struct cftype files[] = {
+static struct cftype legacy_files[] = {
 	{
 		.name = "cpus",
 		.seq_show = cpuset_common_seq_show,
@@ -1931,6 +1930,47 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 };
 
 /*
+ * This is currently a minimal set for the default hierarchy. It can be
+ * expanded later on by migrating more features and control files from v1.
+ */
+static struct cftype dfl_files[] = {
+	{
+		.name = "cpus",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * NR_CPUS),
+		.private = FILE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * MAX_NUMNODES),
+		.private = FILE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "cpus.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{ }	/* terminate */
+};
+
+
+/*
  *	cpuset_css_alloc - allocate a cpuset css
  *	cgrp:	control group that the new cpuset will be part of
  */
@@ -2104,8 +2144,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 	.post_attach	= cpuset_post_attach,
 	.bind		= cpuset_bind,
 	.fork		= cpuset_fork,
-	.legacy_cftypes	= files,
+	.legacy_cftypes	= legacy_files,
+	.dfl_cftypes	= dfl_files,
 	.early_init	= true,
+	.threaded	= true,
 };
 
 /**
-- 
1.8.3.1


* [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-21 16:21 [PATCH v6 0/2] cpuset: Enable cpuset controller in default hierarchy Waiman Long
  2018-03-21 16:21 ` [PATCH v6 1/2] " Waiman Long
@ 2018-03-21 16:21 ` Waiman Long
  2018-03-22  8:41   ` Juri Lelli
  1 sibling, 1 reply; 14+ messages in thread
From: Waiman Long @ 2018-03-21 16:21 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto, efault,
	torvalds, Roman Gushchin, Waiman Long

The sched_load_balance flag is needed to enable CPU isolation similar
to what can be done with the "isolcpus" kernel boot parameter.

The sched_load_balance flag implies an implicit !cpu_exclusive as
it doesn't make sense to have an isolated CPU being load-balanced in
another cpuset.

For v2, this flag is hierarchical and is inherited by child cpusets. It
is not allowed to have this flag turned off in a parent cpuset but
turned on in a child cpuset.

This flag is set by the parent and is not delegatable.
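
A short sketch of these rules, assuming cgroup2 mounted at
/sys/fs/cgroup with the cpuset controller enabled and two nested example
cgroups "parent" and "parent/child" (names and paths are illustrative):

  # The flag can only be turned off bottom-up: all child cpusets must
  # already be off, and this cpuset's CPUs must not appear in any
  # sibling's cpuset.cpus.
  echo 0 > /sys/fs/cgroup/parent/child/cpuset.sched_load_balance
  echo 0 > /sys/fs/cgroup/parent/cpuset.sched_load_balance
  # Turning the child back on is rejected while the parent is still off.
  echo 1 > /sys/fs/cgroup/parent/child/cpuset.sched_load_balance  # fails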

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt | 22 ++++++++++++++++++
 kernel/cgroup/cpuset.c      | 56 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index ed8ec66..c970bd7 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1514,6 +1514,28 @@ Cpuset Interface Files
 	it is a subset of "cpuset.mems".  Its value will be affected
 	by memory nodes hotplug events.
 
+  cpuset.sched_load_balance
+	A read-write single value file which exists on non-root cgroups.
+	The default is "1" (on), and the other possible value is "0"
+	(off).
+
+	When it is on, tasks within this cpuset will be load-balanced
+	by the kernel scheduler.  Tasks will be moved from CPUs with
+	high load to other CPUs within the same cpuset with less load
+	periodically.
+
+	When it is off, there will be no load balancing among CPUs on
+	this cgroup.  Tasks will stay in the CPUs they are running on
+	and will not be moved to other CPUs.
+
+	This flag is hierarchical and is inherited by child cpusets. It
+	can be turned off only when the CPUs in this cpuset aren't
+	listed in the cpuset.cpus of other sibling cgroups, and all
+	the child cpusets, if present, have this flag turned off.
+
+	Once it is off, it cannot be turned back on as long as the
+	parent cgroup still has this flag in the off state.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 419b758..d675c4f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -407,15 +407,22 @@ static void cpuset_update_task_spread_flag(struct cpuset *cs,
  *
  * One cpuset is a subset of another if all its allowed CPUs and
  * Memory Nodes are a subset of the other, and its exclusive flags
- * are only set if the other's are set.  Call holding cpuset_mutex.
+ * are only set if the other's are set (on legacy hierarchy) or
+ * its sched_load_balance flag is only set if the other is set
+ * (on default hierarchy).  Caller holding cpuset_mutex.
  */
 
 static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
 {
-	return	cpumask_subset(p->cpus_allowed, q->cpus_allowed) &&
-		nodes_subset(p->mems_allowed, q->mems_allowed) &&
-		is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
-		is_mem_exclusive(p) <= is_mem_exclusive(q);
+	if (!cpumask_subset(p->cpus_allowed, q->cpus_allowed) ||
+	    !nodes_subset(p->mems_allowed, q->mems_allowed))
+		return false;
+
+	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys))
+		return is_sched_load_balance(p) <= is_sched_load_balance(q);
+	else
+		return is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
+		       is_mem_exclusive(p) <= is_mem_exclusive(q);
 }
 
 /**
@@ -498,7 +505,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 
 	par = parent_cs(cur);
 
-	/* On legacy hiearchy, we must be a subset of our parent cpuset. */
+	/* On legacy hierarchy, we must be a subset of our parent cpuset. */
 	ret = -EACCES;
 	if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
 		goto out;
@@ -1327,6 +1334,19 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	else
 		clear_bit(bit, &trialcs->flags);
 
+	/*
+	 * On default hierarchy, turning off sched_load_balance flag implies
+	 * an implicit cpu_exclusive. Turning on sched_load_balance will
+	 * clear the cpu_exclusive flag.
+	 */
+	if ((bit == CS_SCHED_LOAD_BALANCE) &&
+	    cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+		if (turning_on)
+			clear_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+		else
+			set_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+	}
+
 	err = validate_change(cs, trialcs);
 	if (err < 0)
 		goto out;
@@ -1966,6 +1986,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "sched_load_balance",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_SCHED_LOAD_BALANCE,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
 	{ }	/* terminate */
 };
 
@@ -1991,7 +2019,21 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	if (!alloc_cpumask_var(&cs->effective_cpus, GFP_KERNEL))
 		goto free_cpus;
 
-	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+	/*
+	 * On default hierarchy, inherit parent's CS_SCHED_LOAD_BALANCE and
+	 * CS_CPU_EXCLUSIVE flag.
+	 */
+	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+		struct cpuset *parent = css_cs(parent_css);
+
+		if (test_bit(CS_SCHED_LOAD_BALANCE, &parent->flags))
+			set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+		else
+			set_bit(CS_CPU_EXCLUSIVE, &cs->flags);
+	} else {
+		set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+	}
+
 	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
 	cpumask_clear(cs->effective_cpus);
-- 
1.8.3.1


* Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-21 16:21 ` [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2 Waiman Long
@ 2018-03-22  8:41   ` Juri Lelli
  2018-03-22 21:50     ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Juri Lelli @ 2018-03-22  8:41 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, efault, torvalds, Roman Gushchin

Hi Waiman,

On 21/03/18 12:21, Waiman Long wrote:
> The sched_load_balance flag is needed to enable CPU isolation similar
> to what can be done with the "isolcpus" kernel boot parameter.
> 
> The sched_load_balance flag implies an implicit !cpu_exclusive as
> it doesn't make sense to have an isolated CPU being load-balanced in
> another cpuset.
> 
> For v2, this flag is hierarchical and is inherited by child cpusets. It
> is not allowed to have this flag turn off in a parent cpuset, but on
> in a child cpuset.
> 
> This flag is set by the parent and is not delegatable.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  Documentation/cgroup-v2.txt | 22 ++++++++++++++++++
>  kernel/cgroup/cpuset.c      | 56 +++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 71 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index ed8ec66..c970bd7 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1514,6 +1514,28 @@ Cpuset Interface Files
>  	it is a subset of "cpuset.mems".  Its value will be affected
>  	by memory nodes hotplug events.
>  
> +  cpuset.sched_load_balance
> +	A read-write single value file which exists on non-root cgroups.
> +	The default is "1" (on), and the other possible value is "0"
> +	(off).
> +
> +	When it is on, tasks within this cpuset will be load-balanced
> +	by the kernel scheduler.  Tasks will be moved from CPUs with
> +	high load to other CPUs within the same cpuset with less load
> +	periodically.
> +
> +	When it is off, there will be no load balancing among CPUs on
> +	this cgroup.  Tasks will stay in the CPUs they are running on
> +	and will not be moved to other CPUs.
> +
> +	This flag is hierarchical and is inherited by child cpusets. It
> +	can be turned off only when the CPUs in this cpuset aren't
> +	listed in the cpuset.cpus of other sibling cgroups, and all
> +	the child cpusets, if present, have this flag turned off.
> +
> +	Once it is off, it cannot be turned back on as long as the
> +	parent cgroup still has this flag in the off state.
> +

I'm afraid that this will not work for SCHED_DEADLINE (at least as it
is implemented today). As you can see in Documentation [1], the only
way a user has to perform partitioned/clustered scheduling is to create
subsets of exclusive cpusets and then assign deadline tasks to them. The
other thing to take into account here is that a root_domain is created
for each exclusive set and we use such root_domains to keep information
about admitted bandwidth and speed up load balancing decisions (there is
a max heap tracking deadlines of active tasks on each root_domain).

Now, AFAIR distinct root_domain(s) are created when the parent group has
sched_load_balance disabled and cpu_exclusive set (in cgroup v1, that
is). So, what we normally do is create, say, cpu_exclusive groups for
the different clusters and then disable sched_load_balance at root level
(so that each cluster gets its own root_domain). Also,
sched_load_balance is enabled in the child groups (as load balancing
inside the clusters is what we actually need :).
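
Concretely, a sketch of that kind of v1 setup (assuming a 4-CPU machine,
a cpuset-only v1 mount at /sys/fs/cgroup/cpuset and example group names;
all of these are illustrative):

  # Disable load balancing at the root level.
  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
  # One exclusive set per cluster; each gets its own root_domain and
  # keeps load balancing enabled internally (the default).
  mkdir /sys/fs/cgroup/cpuset/cluster0 /sys/fs/cgroup/cpuset/cluster1
  echo 0-1 > /sys/fs/cgroup/cpuset/cluster0/cpuset.cpus
  echo 2-3 > /sys/fs/cgroup/cpuset/cluster1/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cluster0/cpuset.mems
  echo 0 > /sys/fs/cgroup/cpuset/cluster1/cpuset.mems
  echo 1 > /sys/fs/cgroup/cpuset/cluster0/cpuset.cpu_exclusive
  echo 1 > /sys/fs/cgroup/cpuset/cluster1/cpuset.cpu_exclusive
  # DEADLINE tasks are then attached to one of the clusters.
  echo $PID > /sys/fs/cgroup/cpuset/cluster0/tasks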

IIUC your proposal this will not be permitted with cgroup v2 because
sched_load_balance won't be present at root level and children groups
won't be able to set sched_load_balance back to 1 if that was set to 0
in some parent. Is that true?

Look, the way things work today is most probably not perfect (just to
say one thing, we need to disable load balancing for all classes at root
level just because DEADLINE wants to set restricted affinities for its
tasks :/), and we could probably think about how to change how this all
works. So, let's first make sure I understand correctly what you are
proposing (and its implications). :)

Best,

- Juri

[1] https://elixir.bootlin.com/linux/v4.16-rc6/source/Documentation/scheduler/sched-deadline.txt#L640


* Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-22  8:41   ` Juri Lelli
@ 2018-03-22 21:50     ` Waiman Long
  2018-03-23  7:59       ` Juri Lelli
  0 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2018-03-22 21:50 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, efault, torvalds, Roman Gushchin

On 03/22/2018 04:41 AM, Juri Lelli wrote:
> Hi Waiman,
>
> On 21/03/18 12:21, Waiman Long wrote:
>> The sched_load_balance flag is needed to enable CPU isolation similar
>> to what can be done with the "isolcpus" kernel boot parameter.
>>
>> The sched_load_balance flag implies an implicit !cpu_exclusive as
>> it doesn't make sense to have an isolated CPU being load-balanced in
>> another cpuset.
>>
>> For v2, this flag is hierarchical and is inherited by child cpusets. It
>> is not allowed to have this flag turn off in a parent cpuset, but on
>> in a child cpuset.
>>
>> This flag is set by the parent and is not delegatable.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  Documentation/cgroup-v2.txt | 22 ++++++++++++++++++
>>  kernel/cgroup/cpuset.c      | 56 +++++++++++++++++++++++++++++++++++++++------
>>  2 files changed, 71 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index ed8ec66..c970bd7 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1514,6 +1514,28 @@ Cpuset Interface Files
>>  	it is a subset of "cpuset.mems".  Its value will be affected
>>  	by memory nodes hotplug events.
>>  
>> +  cpuset.sched_load_balance
>> +	A read-write single value file which exists on non-root cgroups.
>> +	The default is "1" (on), and the other possible value is "0"
>> +	(off).
>> +
>> +	When it is on, tasks within this cpuset will be load-balanced
>> +	by the kernel scheduler.  Tasks will be moved from CPUs with
>> +	high load to other CPUs within the same cpuset with less load
>> +	periodically.
>> +
>> +	When it is off, there will be no load balancing among CPUs on
>> +	this cgroup.  Tasks will stay in the CPUs they are running on
>> +	and will not be moved to other CPUs.
>> +
>> +	This flag is hierarchical and is inherited by child cpusets. It
>> +	can be turned off only when the CPUs in this cpuset aren't
>> +	listed in the cpuset.cpus of other sibling cgroups, and all
>> +	the child cpusets, if present, have this flag turned off.
>> +
>> +	Once it is off, it cannot be turned back on as long as the
>> +	parent cgroup still has this flag in the off state.
>> +
> I'm afraid that this will not work for SCHED_DEADLINE (at least for how
> it is implemented today). As you can see in Documentation [1] the only
> way a user has to perform partitioned/clustered scheduling is to create
> subset of exclusive cpusets and then assign deadline tasks to them. The
> other thing to take into account here is that a root_domain is created
> for each exclusive set and we use such root_domain to keep information
> about admitted bandwidth and speed up load balancing decisions (there is
> a max heap tracking deadlines of active tasks on each root_domain).
> Now, AFAIR distinct root_domain(s) are created when parent group has
> sched_load_balance disabled and cpus_exclusive set (in cgroup v1 that
> is). So, what we normally do is create, say, cpus_exclusive groups for
> the different clusters and then disable sched_load_balance at root level
> (so that each cluster gets its own root_domain). Also,
> sched_load_balance is enabled in children groups (as load balancing
> inside clusters is what we actually needed :).

That looks like an undocumented side effect to me. I would rather see an
explicit control file that enables root_domain and breaks it free from
cpu_exclusive && !sched_load_balance, e.g. sched_root_domain(?).

> IIUC your proposal this will not be permitted with cgroup v2 because
> sched_load_balance won't be present at root level and children groups
> won't be able to set sched_load_balance back to 1 if that was set to 0
> in some parent. Is that true?

Yes, that is the current plan.

> Look, the way things work today is most probably not perfect (just to
> say one thing, we need to disable load balancing for all classes at root
> level just because DEADLINE wants to set restricted affinities to his
> tasks :/) and we could probably think on how to change how this all
> work. So, let's first see if IIUC what you are proposing (and its
> implications). :)
>
Cgroup v2 is supposed to allow us to have a fresh start to rethink what
is a more sane way of partitioning resources without worrying about
backward compatibility. So I think it is time to design a new way for
deadline tasks to work with cpuset v2.

Cheers,
Longman


* Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-22 21:50     ` Waiman Long
@ 2018-03-23  7:59       ` Juri Lelli
  2018-03-23 18:44         ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Juri Lelli @ 2018-03-23  7:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, efault, torvalds, Roman Gushchin

On 22/03/18 17:50, Waiman Long wrote:
> On 03/22/2018 04:41 AM, Juri Lelli wrote:
> > On 21/03/18 12:21, Waiman Long wrote:

[...]

> >> +  cpuset.sched_load_balance
> >> +	A read-write single value file which exists on non-root cgroups.
> >> +	The default is "1" (on), and the other possible value is "0"
> >> +	(off).
> >> +
> >> +	When it is on, tasks within this cpuset will be load-balanced
> >> +	by the kernel scheduler.  Tasks will be moved from CPUs with
> >> +	high load to other CPUs within the same cpuset with less load
> >> +	periodically.
> >> +
> >> +	When it is off, there will be no load balancing among CPUs on
> >> +	this cgroup.  Tasks will stay in the CPUs they are running on
> >> +	and will not be moved to other CPUs.
> >> +
> >> +	This flag is hierarchical and is inherited by child cpusets. It
> >> +	can be turned off only when the CPUs in this cpuset aren't
> >> +	listed in the cpuset.cpus of other sibling cgroups, and all
> >> +	the child cpusets, if present, have this flag turned off.
> >> +
> >> +	Once it is off, it cannot be turned back on as long as the
> >> +	parent cgroup still has this flag in the off state.
> >> +
> > I'm afraid that this will not work for SCHED_DEADLINE (at least for how
> > it is implemented today). As you can see in Documentation [1] the only
> > way a user has to perform partitioned/clustered scheduling is to create
> > subset of exclusive cpusets and then assign deadline tasks to them. The
> > other thing to take into account here is that a root_domain is created
> > for each exclusive set and we use such root_domain to keep information
> > about admitted bandwidth and speed up load balancing decisions (there is
> > a max heap tracking deadlines of active tasks on each root_domain).
> > Now, AFAIR distinct root_domain(s) are created when parent group has
> > sched_load_balance disabled and cpus_exclusive set (in cgroup v1 that
> > is). So, what we normally do is create, say, cpus_exclusive groups for
> > the different clusters and then disable sched_load_balance at root level
> > (so that each cluster gets its own root_domain). Also,
> > sched_load_balance is enabled in children groups (as load balancing
> > inside clusters is what we actually needed :).
> 
> That looks like an undocumented side effect to me. I would rather see an
> explicit control file that enable root_domain and break it free from
> cpu_exclusive && !sched_load_balance, e.g. sched_root_domain(?).

Mmm, it actually makes some sort of sense to me that as long as parent
groups can't load balance (because !sched_load_balance) and this group
can't have CPUs overlapping with some other group (because
cpu_exclusive) a data structure (root_domain) is created to handle load
balancing for this isolated subsystem. I agree that it should be better
documented, though.

> > IIUC your proposal this will not be permitted with cgroup v2 because
> > sched_load_balance won't be present at root level and children groups
> > won't be able to set sched_load_balance back to 1 if that was set to 0
> > in some parent. Is that true?
> 
> Yes, that is the current plan.

OK, thanks for confirming. Can you tell me again, however, why you think
we need to remove sched_load_balance from the root level? Won't we end
up having tasks put on isolated sets?

Also, I guess child groups with more than one CPU will need to be
able to load balance across their CPUs, no matter what their parent
group does?

Thanks,

- Juri


* Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-23  7:59       ` Juri Lelli
@ 2018-03-23 18:44         ` Waiman Long
  2018-03-26 12:47           ` Juri Lelli
  0 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2018-03-23 18:44 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, efault, torvalds, Roman Gushchin

On 03/23/2018 03:59 AM, Juri Lelli wrote:
> On 22/03/18 17:50, Waiman Long wrote:
>> On 03/22/2018 04:41 AM, Juri Lelli wrote:
>>> On 21/03/18 12:21, Waiman Long wrote:
> [...]
>
>>>> +  cpuset.sched_load_balance
>>>> +	A read-write single value file which exists on non-root cgroups.
>>>> +	The default is "1" (on), and the other possible value is "0"
>>>> +	(off).
>>>> +
>>>> +	When it is on, tasks within this cpuset will be load-balanced
>>>> +	by the kernel scheduler.  Tasks will be moved from CPUs with
>>>> +	high load to other CPUs within the same cpuset with less load
>>>> +	periodically.
>>>> +
>>>> +	When it is off, there will be no load balancing among CPUs on
>>>> +	this cgroup.  Tasks will stay in the CPUs they are running on
>>>> +	and will not be moved to other CPUs.
>>>> +
>>>> +	This flag is hierarchical and is inherited by child cpusets. It
>>>> +	can be turned off only when the CPUs in this cpuset aren't
>>>> +	listed in the cpuset.cpus of other sibling cgroups, and all
>>>> +	the child cpusets, if present, have this flag turned off.
>>>> +
>>>> +	Once it is off, it cannot be turned back on as long as the
>>>> +	parent cgroup still has this flag in the off state.
>>>> +
>>> I'm afraid that this will not work for SCHED_DEADLINE (at least for how
>>> it is implemented today). As you can see in Documentation [1] the only
>>> way a user has to perform partitioned/clustered scheduling is to create
>>> subset of exclusive cpusets and then assign deadline tasks to them. The
>>> other thing to take into account here is that a root_domain is created
>>> for each exclusive set and we use such root_domain to keep information
>>> about admitted bandwidth and speed up load balancing decisions (there is
>>> a max heap tracking deadlines of active tasks on each root_domain).
>>> Now, AFAIR distinct root_domain(s) are created when parent group has
>>> sched_load_balance disabled and cpus_exclusive set (in cgroup v1 that
>>> is). So, what we normally do is create, say, cpus_exclusive groups for
>>> the different clusters and then disable sched_load_balance at root level
>>> (so that each cluster gets its own root_domain). Also,
>>> sched_load_balance is enabled in children groups (as load balancing
>>> inside clusters is what we actually needed :).
>> That looks like an undocumented side effect to me. I would rather see an
>> explicit control file that enable root_domain and break it free from
>> cpu_exclusive && !sched_load_balance, e.g. sched_root_domain(?).
> Mmm, it actually makes some sort of sense to me that as long as parent
> groups can't load balance (because !sched_load_balance) and this group
> can't have CPUs overlapping with some other group (because
> cpu_exclusive) a data structure (root_domain) is created to handle load
> balancing for this isolated subsystem. I agree that it should be better
> documented, though.

Yes, this need to be documented.

>>> IIUC your proposal this will not be permitted with cgroup v2 because
>>> sched_load_balance won't be present at root level and children groups
>>> won't be able to set sched_load_balance back to 1 if that was set to 0
>>> in some parent. Is that true?
>> Yes, that is the current plan.
> OK, thanks for confirming. Can you tell again however why do you think
> we need to remove sched_load_balance from root level? Won't we end up
> having tasks put on isolated sets?

The root cgroup is special in that it owns all the resources in the
system. We generally don't want restrictions to be put on the root
cgroup. A child cgroup has to be created to have constraints put on it.
In fact, most of the controller files don't show up in the v2 cgroup
root at all.
An isolated cgroup has to be put under root, e.g.

      Root
     /    \
isolated  balanced
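
In cgroup v2 terms, and assuming this patch set, that layout would look
roughly like the sketch below (mount point, CPU numbers and the choice
of CPUs 2-3 for the isolated set are illustrative):

  echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/isolated /sys/fs/cgroup/balanced
  # Disjoint CPU lists, so turning balancing off in "isolated" is allowed.
  echo 2-3 > /sys/fs/cgroup/isolated/cpuset.cpus
  echo 0-1 > /sys/fs/cgroup/balanced/cpuset.cpus
  echo 0 > /sys/fs/cgroup/isolated/cpuset.sched_load_balance
  # The application that manages its own CPU placement goes here.
  echo $APP_PID > /sys/fs/cgroup/isolated/cgroup.procs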

>
> Also, I guess children groups with more than one CPU will need to be
> able to load balance across their CPUs, no matter what their parent
> group does?

The purpose of an isolated cpuset is to have a dedicated set of CPUs to
be used by a certain application that makes its own scheduling decisions
by placing tasks explicitly on specific CPUs. It just doesn't make sense
to have a CPU in an isolated cpuset participate in load balancing in
another cpuset. If one wants load balancing in a child cpuset, the
parent cpuset should have load balancing turned on as well.

As I look into the code, it seems like the root domain is probably
somewhat associated with cpu_exclusive only. Whether sched_load_balance
is set doesn't really matter.  I will need to look further into the
conditions under which a new root domain is created.

BTW, there is always a default root domain for the root. So you don't
really need to do anything special for the root cpuset to make one.

Cheers,
Longman


* Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-23 18:44         ` Waiman Long
@ 2018-03-26 12:47           ` Juri Lelli
  2018-03-26 20:28             ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Juri Lelli @ 2018-03-26 12:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, efault, torvalds, Roman Gushchin

On 23/03/18 14:44, Waiman Long wrote:
> On 03/23/2018 03:59 AM, Juri Lelli wrote:

[...]

> > OK, thanks for confirming. Can you tell again however why do you think
> > we need to remove sched_load_balance from root level? Won't we end up
> > having tasks put on isolated sets?
> 
> The root cgroup is special that it owns all the resources in the system.
> We generally don't want restriction be put on the root cgroup. A child
> cgroup has to be created to have constraints put on it. In fact, most of
> the controller files don't show up in the v2 cgroup root at all.
> 
> An isolated cgroup has to be put under root, e.g.
> 
>       Root
>      /    \
> isolated  balanced
> 
> >
> > Also, I guess children groups with more than one CPU will need to be
> > able to load balance across their CPUs, no matter what their parent
> > group does?
> 
> The purpose of an isolated cpuset is to have a dedicated set of CPUs to
> be used by a certain application that makes its own scheduling decision
> by placing tasks explicitly on specific CPUs. It just doesn't make sense
> to have a CPU in an isolated cpuset to participated in load balancing in
> another cpuset. If one want load balancing in a child cpuset, the parent
> cpuset should have load balancing turned on as well.

Isolated with CPUs overlapping some other cpuset makes little sense, I
agree. What I have in mind however is an isolated set of CPUs that don't
overlap with any other cpuset (as your balanced set above). In this case
I think it makes sense to let the sysadmin decide whether "automatic"
load balancing has to be performed (by the scheduler) or no load
balancing at all has to take place.

Further extending your example:

             Root [0-3]
	     /        \
        group1 [0-1] group2 [2-3]

Why should we prevent load balancing from being disabled at root level
(so that, for example, tasks still residing in the root group are not
freely migrated around, potentially disturbing both sub-groups)?

Then one can decide that group1 is a "userspace managed" group (no load
balancing takes place) and group2 is balanced by the scheduler.

And this is not DEADLINE specific, IMHO.
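
For reference, with today's v1 knobs that configuration would be
expressed roughly as follows (paths and group names are illustrative):

  # Root keeps all CPUs but stops balancing across the whole machine.
  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
  mkdir /sys/fs/cgroup/cpuset/group1 /sys/fs/cgroup/cpuset/group2
  echo 0-1 > /sys/fs/cgroup/cpuset/group1/cpuset.cpus
  echo 2-3 > /sys/fs/cgroup/cpuset/group2/cpuset.cpus
  # group1 is "userspace managed": no balancing inside it either.
  echo 0 > /sys/fs/cgroup/cpuset/group1/cpuset.sched_load_balance
  # group2 keeps the default (balancing on), so the scheduler balances
  # within CPUs 2-3 only.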

> As I look into the code, it seems like root domain is probably somewhat
> associated with cpu_exclusive only. Whether sched_load_balance is set
> doesn't really matter.  I will need to look further on the conditions
> where a new root domain is created.

I checked again myself (sched domains code is always a maze :) and I
believe that the sched_load_balance flag indeed controls domain (sched
and root) creation and configuration. Changing the flag triggers a
potential rebuild, and separate sched/root domains are generated if
subgroups have non-overlapping cpumasks.  cpu_exclusive only enforces
this latter condition.

Best,

- Juri


* Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-26 12:47           ` Juri Lelli
@ 2018-03-26 20:28             ` Waiman Long
  2018-03-27  6:17               ` Juri Lelli
                                 ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Waiman Long @ 2018-03-26 20:28 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, efault, torvalds, Roman Gushchin

On 03/26/2018 08:47 AM, Juri Lelli wrote:
> On 23/03/18 14:44, Waiman Long wrote:
>> On 03/23/2018 03:59 AM, Juri Lelli wrote:
> [...]
>
>>> OK, thanks for confirming. Can you tell again however why do you think
>>> we need to remove sched_load_balance from root level? Won't we end up
>>> having tasks put on isolated sets?
>> The root cgroup is special that it owns all the resources in the system.
>> We generally don't want restriction be put on the root cgroup. A child
>> cgroup has to be created to have constraints put on it. In fact, most of
>> the controller files don't show up in the v2 cgroup root at all.
>>
>> An isolated cgroup has to be put under root, e.g.
>>
>>       Root
>>      /    \
>> isolated  balanced
>>
>>> Also, I guess children groups with more than one CPU will need to be
>>> able to load balance across their CPUs, no matter what their parent
>>> group does?
>> The purpose of an isolated cpuset is to have a dedicated set of CPUs to
>> be used by a certain application that makes its own scheduling decision
>> by placing tasks explicitly on specific CPUs. It just doesn't make sense
>> to have a CPU in an isolated cpuset to participated in load balancing in
>> another cpuset. If one want load balancing in a child cpuset, the parent
>> cpuset should have load balancing turned on as well.
> Isolated with CPUs overlapping some other cpuset makes little sense, I
> agree. What I have in mind however is an isolated set of CPUs that don't
> overlap with any other cpuset (as your balanced set above). In this case
> I think it makes sense to let the sys admin decide if "automatic" load
> balancing has to be performed (by the scheduler) or no load balacing at
> all has to take place?
>
> Further extending your example:
>
>              Root [0-3]
> 	     /        \
>         group1 [0-1] group2[2-3]
>
> Why should we prevent load balancing to be disabled at root level (so
> that for example tasks still residing in root group are not freely
> migrated around, potentially disturbing both sub-groups)?
>
> Then one can decide that group1 is a "userspace managed" group (no load
> balancing takes place) and group2 is balanced by the scheduler.
>
> And this is not DEADLINE specific, IMHO.
>
>> As I look into the code, it seems like root domain is probably somewhat
>> associated with cpu_exclusive only. Whether sched_load_balance is set
>> doesn't really matter.  I will need to look further on the conditions
>> where a new root domain is created.
> I checked again myself (sched domains code is always a maze :) and I
> believe that sched_load_balance flag indeed controls domains (sched and
> root) creation and configuration . Changing the flag triggers potential
> rebuild and separed sched/root domains are generated if subgroups have
> non overlapping cpumasks.  cpu_exclusive only enforces this latter
> condition.

Right, I ran some tests and figured out that to have a root_domain at
the child cgroup level, we do need to disable load balancing at the root
cgroup level and enable it in child cgroups that are mutually disjoint
in their cpu lists. The cpu_exclusive flag isn't really needed.

I am not against doing that at the root cgroup, but it is kind of weird
in terms of semantics. If we disable load balancing in the root cgroup
but enable it in child cgroups, what does that mean for the processes
that are still in the root cgroup?

The sched_load_balance flag isn't something that is passed to the
scheduler. It only affects the CPU topology of the system. So I
suspect that a process in the root cgroup will be load balanced among
the CPUs of one of the child cgroups. That doesn't look right unless
we enforce that no process can be in the root cgroup in this case.

Real cpu isolation will then require that we disable load balancing at
root, and enable load balancing in child cgroups that only contain CPUs
outside of the isolated CPU list. Again, it is still possible that some
tasks in the root cgroup, if present, may be using some of the isolated
CPUs.

Maybe we can have a different root-level flag, say,
sched_partition_domain, that is equivalent to !sched_load_balance.
However, I am still not sure if we should enforce that no task should be
in the root cgroup when the flag is set.

Tejun and Peter, what are your thoughts on this?

Cheers,
Longman

 


* Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-26 20:28             ` Waiman Long
@ 2018-03-27  6:17               ` Juri Lelli
  2018-03-27  6:54               ` Mike Galbraith
  2018-03-27 14:02               ` Tejun Heo
  2 siblings, 0 replies; 14+ messages in thread
From: Juri Lelli @ 2018-03-27  6:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, efault, torvalds, Roman Gushchin

On 26/03/18 16:28, Waiman Long wrote:
> On 03/26/2018 08:47 AM, Juri Lelli wrote:
> > On 23/03/18 14:44, Waiman Long wrote:
> >> On 03/23/2018 03:59 AM, Juri Lelli wrote:
> > [...]
> >
> >>> OK, thanks for confirming. Can you tell again however why do you think
> >>> we need to remove sched_load_balance from root level? Won't we end up
> >>> having tasks put on isolated sets?
> >> The root cgroup is special that it owns all the resources in the system.
> >> We generally don't want restriction be put on the root cgroup. A child
> >> cgroup has to be created to have constraints put on it. In fact, most of
> >> the controller files don't show up in the v2 cgroup root at all.
> >>
> >> An isolated cgroup has to be put under root, e.g.
> >>
> >>       Root
> >>      /    \
> >> isolated  balanced
> >>
> >>> Also, I guess children groups with more than one CPU will need to be
> >>> able to load balance across their CPUs, no matter what their parent
> >>> group does?
> >> The purpose of an isolated cpuset is to have a dedicated set of CPUs to
> >> be used by a certain application that makes its own scheduling decision
> >> by placing tasks explicitly on specific CPUs. It just doesn't make sense
> >> to have a CPU in an isolated cpuset to participated in load balancing in
> >> another cpuset. If one want load balancing in a child cpuset, the parent
> >> cpuset should have load balancing turned on as well.
> > Isolated with CPUs overlapping some other cpuset makes little sense, I
> > agree. What I have in mind however is an isolated set of CPUs that don't
> > overlap with any other cpuset (as your balanced set above). In this case
> > I think it makes sense to let the sys admin decide if "automatic" load
> > balancing has to be performed (by the scheduler) or no load balacing at
> > all has to take place?
> >
> > Further extending your example:
> >
> >              Root [0-3]
> > 	     /        \
> >         group1 [0-1] group2[2-3]
> >
> > Why should we prevent load balancing to be disabled at root level (so
> > that for example tasks still residing in root group are not freely
> > migrated around, potentially disturbing both sub-groups)?
> >
> > Then one can decide that group1 is a "userspace managed" group (no load
> > balancing takes place) and group2 is balanced by the scheduler.
> >
> > And this is not DEADLINE specific, IMHO.
> >
> >> As I look into the code, it seems like root domain is probably somewhat
> >> associated with cpu_exclusive only. Whether sched_load_balance is set
> >> doesn't really matter.  I will need to look further on the conditions
> >> where a new root domain is created.
> > I checked again myself (sched domains code is always a maze :) and I
> > believe that sched_load_balance flag indeed controls domains (sched and
> > root) creation and configuration . Changing the flag triggers potential
> > rebuild and separed sched/root domains are generated if subgroups have
> > non overlapping cpumasks.  cpu_exclusive only enforces this latter
> > condition.
> 
> Right, I ran some tests and figured out that to have root_domain in the
> child cgroup level, we do need to disable load balancing at the root
> cgroup level and enabling it in child cgroups that are mutually disjoint
> in their cpu lists. The cpu_exclusive flag isn't really needed.

It seems to make little sense at root level indeed. 

> I am not against doing that at the root cgroup, but it is kind of weird
> in term of semantics. If we disable load balancing in the root cgroup,
> but enabling it at child cgroups, what does that mean to the processes
> that are still in the root cgroup?

It might be up to the different scheduling classes I guess. See more on
this below.

> The sched_load_balance flag isn't something that is passed to the
> scheduler. It only only affects the CPU topology of the system. So I
> suspect that a process in the root cgroup will be load balanced among
> the CPUs in the one of the child cgroups. That doesn't look right unless
> we enforce that no process can be in the root cgroup in this case.
> 
> Real cpu isolation will then require that we disable load balancing at
> root, and enable load balancing in child cgroups that only contain CPUs
> outside of the isolated CPU list. Again, it is still possible that some
> tasks in the root cgroup, if present, may be using some of the isolated
> CPUs.

So, for DEADLINE this is currently a problem. We know that this is
broken (and Mathieu has already proposed patches to fix it [1]). What we
want, I think, is to deny setting a task to DEADLINE if its current
affinity could overlap some exclusive set (root domain as per above), as
for example in your case if the task is residing in the root group.
Since DEADLINE bases load balancing on root domains, once those have
been correctly created, tasks shouldn't be able to escape. And if no
task can reside at the root level once sched_load_balance has been
disabled, it seems we won't have the problem you fear.

RT looks similar in this sense (load balancing using root domain info),
but no admission control is performed, so I guess we could fall into
your problematic situation.

FAIR uses a mix of sched domain and root domain information to perform
load balancing, so once tasks are divided among the configured sets all
should work OK, but again there might still be some tasks left in the
root group. :/ I'm not sure what happens to those w.r.t. load balancing.

> Maybe we can have a different root level flag, say,
> sched_partition_domain that is equivalent to !sched_load_balnace.
> However, I am still not sure if we should enforce that no task should be
> in the root cgroup when the flag is set.
> 
> Tejun and Peter, what are your thoughts on this?

Let's see what they think. :)

Thanks for the discussion!

Best,

- Juri

[1] https://marc.info/?l=linux-kernel&m=151855397701977&w=2


* Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-26 20:28             ` Waiman Long
  2018-03-27  6:17               ` Juri Lelli
@ 2018-03-27  6:54               ` Mike Galbraith
  2018-03-27 14:02               ` Tejun Heo
  2 siblings, 0 replies; 14+ messages in thread
From: Mike Galbraith @ 2018-03-27  6:54 UTC (permalink / raw)
  To: Waiman Long, Juri Lelli
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, torvalds, Roman Gushchin

On Mon, 2018-03-26 at 16:28 -0400, Waiman Long wrote:
> 
> The sched_load_balance flag isn't something that is passed to the
> scheduler. It only only affects the CPU topology of the system. So I
> suspect that a process in the root cgroup will be load balanced among
> the CPUs in the one of the child cgroups. 

Yes, among CPUs that remain part of topology (and intersect affinity).

> That doesn't look right unless
> we enforce that no process can be in the root cgroup in this case.

caveat: quite a few kthreads are nailed to the floor of root.

	-Mike


* Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-26 20:28             ` Waiman Long
  2018-03-27  6:17               ` Juri Lelli
  2018-03-27  6:54               ` Mike Galbraith
@ 2018-03-27 14:02               ` Tejun Heo
  2018-03-27 14:23                 ` Waiman Long
  2 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2018-03-27 14:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Juri Lelli, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, efault, torvalds, Roman Gushchin

Hello,

On Mon, Mar 26, 2018 at 04:28:49PM -0400, Waiman Long wrote:
> Maybe we can have a different root level flag, say,
> sched_partition_domain that is equivalent to !sched_load_balnace.
> However, I am still not sure if we should enforce that no task should be
> in the root cgroup when the flag is set.
> 
> Tejun and Peter, what are your thoughts on this?

I haven't looked into the other issues too much but we for sure cannot
empty the root cgroup.

Thanks.

-- 
tejun


* Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-27 14:02               ` Tejun Heo
@ 2018-03-27 14:23                 ` Waiman Long
  2018-03-28  6:57                   ` Mike Galbraith
  0 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2018-03-27 14:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Juri Lelli, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, efault, torvalds, Roman Gushchin

On 03/27/2018 10:02 AM, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 26, 2018 at 04:28:49PM -0400, Waiman Long wrote:
>> Maybe we can have a different root level flag, say,
>> sched_partition_domain that is equivalent to !sched_load_balnace.
>> However, I am still not sure if we should enforce that no task should be
>> in the root cgroup when the flag is set.
>>
>> Tejun and Peter, what are your thoughts on this?
> I haven't looked into the other issues too much but we for sure cannot
> empty the root cgroup.
>
> Thanks.
>
Now, I have a different idea. How about we add a special root-only knob,
say, "cpuset.cpus.isolated", that contains the list of CPUs that are
still owned by root but do not participate in load balancing? All the
tasks in the root are load-balanced among the remaining CPUs.

A child can then be created that holds some or all of the CPUs in the
isolated set. It will then have a separate root domain if load balancing
is on, or be an isolated cpuset if load balancing is off.
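
Purely as an illustration of the idea (the knob below does not exist
anywhere yet, so the name, paths and CPU numbers are all hypothetical):

  # Reserve CPUs 2-3 at the root; root tasks get balanced on 0-1 only.
  echo 2-3 > /sys/fs/cgroup/cpuset.cpus.isolated   # hypothetical knob
  # A child can then take over the reserved CPUs, either with its own
  # root domain (load balancing on) or fully isolated (balancing off).
  mkdir /sys/fs/cgroup/rt
  echo 2-3 > /sys/fs/cgroup/rt/cpuset.cpus
  echo 0 > /sys/fs/cgroup/rt/cpuset.sched_load_balance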

Will that idea work?

Cheers,
Longman


* Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
  2018-03-27 14:23                 ` Waiman Long
@ 2018-03-28  6:57                   ` Mike Galbraith
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Galbraith @ 2018-03-28  6:57 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo
  Cc: Juri Lelli, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, torvalds, Roman Gushchin

On Tue, 2018-03-27 at 10:23 -0400, Waiman Long wrote:
> On 03/27/2018 10:02 AM, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, Mar 26, 2018 at 04:28:49PM -0400, Waiman Long wrote:
> >> Maybe we can have a different root level flag, say,
> >> sched_partition_domain that is equivalent to !sched_load_balnace.
> >> However, I am still not sure if we should enforce that no task should be
> >> in the root cgroup when the flag is set.
> >>
> >> Tejun and Peter, what are your thoughts on this?
> > I haven't looked into the other issues too much but we for sure cannot
> > empty the root cgroup.
> >
> > Thanks.
> >
> Now, I have a different idea. How about we add a special root-only knob,
> say, "cpuset.cpus.isolated" that contains the list of CPUs that are
> still owned by root, but not participated in load balancing. All the
> tasks in the root are load-balanced among the remaining CPUs.
> 
> A child can then be created that hold some or all the CPUs in the
> isolated set. It will then have a separate root domain if load balancing
> is on, or an isolated cpuset if load balancing is off.
> 
> Will that idea work?

Hrm.  Sounds very much like the typical v1 setup today..

              root
               /\
          peons  vips

...with v2 root effectively shrinking to become the v1 "peons" set
*rd/sd/sd_llc wise only* when you poke /cpuset.cpus.isolated, but still
actually spanning all CPUs.  True?

If so, a user would also still have to create a real "peons" subset as
in v1 and migrate everything not nailed to the floor into it for
containment, else any task can be placed or place itself anywhere in
the box, or merely wake to find itself sitting on its previous, but
now vip turf, CPU.

	-Mike



Thread overview: 14+ messages
2018-03-21 16:21 [PATCH v6 0/2] cpuset: Enable cpuset controller in default hierarchy Waiman Long
2018-03-21 16:21 ` [PATCH v6 1/2] " Waiman Long
2018-03-21 16:21 ` [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2 Waiman Long
2018-03-22  8:41   ` Juri Lelli
2018-03-22 21:50     ` Waiman Long
2018-03-23  7:59       ` Juri Lelli
2018-03-23 18:44         ` Waiman Long
2018-03-26 12:47           ` Juri Lelli
2018-03-26 20:28             ` Waiman Long
2018-03-27  6:17               ` Juri Lelli
2018-03-27  6:54               ` Mike Galbraith
2018-03-27 14:02               ` Tejun Heo
2018-03-27 14:23                 ` Waiman Long
2018-03-28  6:57                   ` Mike Galbraith
