linux-kernel.vger.kernel.org archive mirror
* [PATCH v11 0/9] cpuset: Enable cpuset controller in default hierarchy
@ 2018-06-24  7:30 Waiman Long
  2018-06-24  7:30 ` [PATCH v11 1/9] " Waiman Long
                   ` (8 more replies)
  0 siblings, 9 replies; 45+ messages in thread
From: Waiman Long @ 2018-06-24  7:30 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

v11:
 - Change the "domain_root" name to "partition" as suggested by Peter and
   update the documentation and code accordingly.
 - Remove the dying cgroup check in update_reserved_cpus() as the check
   may not be needed after all.
 - Document the effect of losing CPU affinity after offlining all the
   cpus in a partition.
 - There are no other major code changes in this version.

v10:
 - Remove the cpuset.sched.load_balance patch for now as it may not
   be that useful.
 - Break the large patch 2 into smaller patches to make them a bit
   easier to review.
 - Test and fix issues related to changing "cpuset.cpus" and cpu
   online/offline in a domain root.
 - Rename isolated_cpus to reserved_cpus as this cpumask holds CPUs
   reserved for child sched domains.
 - Rework the scheduling domain debug printing code in the last patch.
 - Document update to the newly moved
   Documentation/admin-guide/cgroup-v2.rst.

v9:
 - Rename cpuset.sched.domain to cpuset.sched.domain_root to better
   identify its purpose as the root of a new scheduling domain or
   partition.
 - Clarify in the document the purpose of domain_root and
   load_balance. Using domain_root is the only way to create a new
   partition.
 - Fix a lockdep warning in update_isolated_cpumask() function.
 - Add a new patch to eliminate call to generate_sched_domains() for
   v2 when a change in cpu list does not touch a domain_root.

v8:
 - Remove cpuset.cpus.isolated and add a new cpuset.sched.domain flag
   and rework the code accordingly.

v8 patch : https://lkml.org/lkml/2018/5/17/939
v9 patch : https://lkml.org/lkml/2018/5/29/507
v10 patch: https://lkml.org/lkml/2018/6/18/3

The purpose of this patchset is to provide a basic set of cpuset control
files for cgroup v2. This basic set includes the non-root "cpus",
"mems" and "sched.partition". The "cpus.effective" and "mems.effective"
will appear in all cpuset-enabled cgroups.

The new control file that is unique to v2 is "sched.partition". It
is a boolean flag file that designates whether a cgroup is the root of
a new scheduling domain or partition with its own unique set of CPUs
that is disjoint, from the scheduling perspective, from those of other
partitions. The root cgroup is always a partition root. Multiple levels
of partitions are supported with some limitations, so a container
partition root can behave like a real root.

When a partition root cgroup is removed, its list of exclusive
CPUs will be returned to the parent's cpus.effective automatically.

A container root can be a partition root with sub-partitions
created underneath it. One difference from the real root is that the
"cpuset.sched.partition" flag isn't present in the real root, but is
present in a container root. The same is true for other cpuset control
files as well as those from other controllers. This is a general issue
that is not addressed in this patchset.

This patchset does not exclude the possibility of adding more features
in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their effective
counterparts.

Patch 2 adds a new "sched.partition" control file for setting up multiple
scheduling domains or partitions. A partition root implies cpu_exclusive.

Patch 3 handles the proper deletion of a partition root cgroup by turning
off the partition flag automatically before deletion.

Patch 4 allows "cpuset.cpus" of a partition root cgroup to be changed
subject to certain constraints.

Patch 5 makes the hotplug code deal with partition root properly.

Patch 6 updates the scheduling domain generation code to work with
the new domain root feature.

Patch 7 exposes cpus.effective and mems.effective to the root cgroup,
as enabling child scheduling domains will take CPUs away from the root
cgroup. It is therefore useful to monitor what CPUs are left there.

Patch 8 eliminates the need to rebuild sched domains for v2 if cpu list
changes occur to non-domain root cpusets only.

Patch 9 enables printing of debug information about scheduling domain
generation. This patch is optional and is mainly for testing purposes;
it may not need to be merged.

Waiman Long (9):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Add new v2 cpuset.sched.partition flag
  cpuset: Simulate auto-off of sched.partition at cgroup removal
  cpuset: Allow changes to cpus in a partition root
  cpuset: Make sure that partition flag work properly with CPU hotplug
  cpuset: Make generate_sched_domains() recognize reserved_cpus
  cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  cpuset: Don't rebuild sched domains if cpu changes in non-partition
    root
  cpuset: Allow reporting of sched domain generation info

 Documentation/admin-guide/cgroup-v2.rst | 161 ++++++++++++-
 kernel/cgroup/cpuset.c                  | 397 ++++++++++++++++++++++++++++++--
 2 files changed, 537 insertions(+), 21 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v11 1/9] cpuset: Enable cpuset controller in default hierarchy
  2018-06-24  7:30 [PATCH v11 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
@ 2018-06-24  7:30 ` Waiman Long
  2018-06-24  7:30 ` [PATCH v11 2/9] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-06-24  7:30 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller has experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to the
memory controller or perhaps to the cpu controller instead of keeping
them in cpuset.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 109 ++++++++++++++++++++++++++++++--
 kernel/cgroup/cpuset.c                  |  48 +++++++++++++-
 2 files changed, 149 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8a2c52d..fbc30b6 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
        5-3-2. Writeback
      5-4. PID
        5-4-1. PID Interface Files
-     5-5. Device
-     5-6. RDMA
-       5-6-1. RDMA Interface Files
-     5-7. Misc
-       5-7-1. perf_event
+     5-5. Cpuset
+       5-5-1. Cpuset Interface Files
+     5-6. Device
+     5-7. RDMA
+       5-7-1. RDMA Interface Files
+     5-8. Misc
+       5-8-1. perf_event
      5-N. Non-normative information
        5-N-1. CPU controller root cgroup process behaviour
        5-N-2. IO controller root cgroup process behaviour
@@ -1486,6 +1488,103 @@ through fork() or clone(). These will return -EAGAIN if the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the requested CPUs to be used by tasks within this
+	cgroup.  The actual list of CPUs to be granted, however, is
+	subjected to constraints imposed by its parent and can differ
+	from the requested CPUs.
+
+	The CPU numbers are comma-separated numbers or ranges.
+	For example:
+
+	  # cat cpuset.cpus
+	  0-4,6,8-10
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.cpus" or all the available CPUs if none is found.
+
+	The value of "cpuset.cpus" stays constant until the next update
+	and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined CPUs that are actually granted to this
+	cgroup by its parent.  These CPUs are allowed to be used by
+	tasks within the current cgroup.
+
+	If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
+	all the CPUs from the parent cgroup that can be available to
+	be used by this cgroup.  Otherwise, it should be a subset of
+	"cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
+	can be granted.  In this case, it will be treated just like an
+	empty "cpuset.cpus".
+
+	Its value will be affected by CPU hotplug events.
+
+  cpuset.mems
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the requested memory nodes to be used by tasks within
+	this cgroup.  The actual list of memory nodes granted, however,
+	is subjected to constraints imposed by its parent and can differ
+	from the requested memory nodes.
+
+	The memory node numbers are comma-separated numbers or ranges.
+	For example:
+
+	  # cat cpuset.mems
+	  0-1,3
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.mems" or all the available memory nodes if none
+	is found.
+
+	The value of "cpuset.mems" stays constant until the next update
+	and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined memory nodes that are actually granted to
+	this cgroup by its parent. These memory nodes are allowed to
+	be used by tasks within the current cgroup.
+
+	If "cpuset.mems" is empty, it shows all the memory nodes from the
+	parent cgroup that will be available to be used by this cgroup.
+	Otherwise, it should be a subset of "cpuset.mems" unless none of
+	the memory nodes listed in "cpuset.mems" can be granted.  In this
+	case, it will be treated just like an empty "cpuset.mems".
+
+	Its value will be affected by memory nodes hotplug events.
+
+
 Device controller
 -----------------
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 266f10c..2b5c447 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1824,12 +1824,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	return 0;
 }
 
-
 /*
  * for the common functions, 'private' gives the type of file
  */
 
-static struct cftype files[] = {
+static struct cftype legacy_files[] = {
 	{
 		.name = "cpus",
 		.seq_show = cpuset_common_seq_show,
@@ -1932,6 +1931,47 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 };
 
 /*
+ * This is currently a minimal set for the default hierarchy. It can be
+ * expanded later on by migrating more features and control files from v1.
+ */
+static struct cftype dfl_files[] = {
+	{
+		.name = "cpus",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * NR_CPUS),
+		.private = FILE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * MAX_NUMNODES),
+		.private = FILE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "cpus.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{ }	/* terminate */
+};
+
+
+/*
  *	cpuset_css_alloc - allocate a cpuset css
  *	cgrp:	control group that the new cpuset will be part of
  */
@@ -2105,8 +2145,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 	.post_attach	= cpuset_post_attach,
 	.bind		= cpuset_bind,
 	.fork		= cpuset_fork,
-	.legacy_cftypes	= files,
+	.legacy_cftypes	= legacy_files,
+	.dfl_cftypes	= dfl_files,
 	.early_init	= true,
+	.threaded	= true,
 };
 
 /**
-- 
1.8.3.1



* [PATCH v11 2/9] cpuset: Add new v2 cpuset.sched.partition flag
  2018-06-24  7:30 [PATCH v11 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
  2018-06-24  7:30 ` [PATCH v11 1/9] " Waiman Long
@ 2018-06-24  7:30 ` Waiman Long
  2018-06-24  7:30 ` [PATCH v11 3/9] cpuset: Simulate auto-off of sched.partition at cgroup removal Waiman Long
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-06-24  7:30 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

A new cpuset.sched.partition boolean flag is added to cpuset v2.
This new flag, if set, indicates that the cgroup is the root of a
new scheduling domain or partition that includes itself and all its
descendants except those that are scheduling domain roots themselves
and their descendants.

With this new flag, one can directly create as many partitions as
necessary without ever using the v1 trick of turning off load balancing
in specific cpusets to create partitions as a side effect.

This new flag is owned by the parent and will cause the CPUs in the
cpuset to be removed from the effective CPUs of its parent.

This is implemented internally by adding a new reserved_cpus mask that
holds the CPUs belonging to child scheduling domain cpusets so that:

	reserved_cpus | effective_cpus = cpus_allowed
	reserved_cpus & effective_cpus = 0

This new flag can only be turned on in a cpuset if its parent is a
partition root itself. The state of this flag cannot be changed if the
cpuset has children.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  33 +++++
 kernel/cgroup/cpuset.c                  | 209 +++++++++++++++++++++++++++++++-
 2 files changed, 239 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index fbc30b6..24a7133 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1584,6 +1584,39 @@ Cpuset Interface Files
 
 	Its value will be affected by memory nodes hotplug events.
 
+  cpuset.sched.partition
+	A read-write single value file which exists on non-root
+	cpuset-enabled cgroups.  It is a binary value flag that accepts
+	either "0" (off) or "1" (on).  This flag is set and owned by the
+        parent cgroup.
+
+	If set, it indicates that the current cgroup is the root of a
+	new partition or scheduling domain that comprises itself and
+	all its descendants except those that are separate partition
+        roots themselves and their descendants.  The root cgroup is
+	always a partition root.
+
+	There are constraints on where this flag can be set.  It can
+	only be set in a cgroup if all the following conditions are true.
+
+	1) The "cpuset.cpus" is not empty and the list of CPUs is
+	   exclusive, i.e. they are not shared by any of its siblings.
+	2) The "cpuset.cpus" is also a proper subset of the parent's
+	   "cpuset.cpus.effective".
+	3) The parent cgroup is a partition root.
+	4) There are no child cgroups with cpuset enabled.  This is for
+	   eliminating corner cases that have to be handled if such a
+	   condition is allowed.
+
+	Setting this flag will take the CPUs away from the effective
+	CPUs of the parent cgroup.  That is why this flag has to be set
+        and owned by the parent.  Once it is set, this flag cannot be
+	cleared if there are any child cgroups with cpuset enabled.
+
+	A parent partition root cgroup cannot distribute all its CPUs to
+        its child partition root cgroups.  There must be at least one cpu
+        left in the parent partition root cgroup.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2b5c447..5989bff 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -109,6 +109,9 @@ struct cpuset {
 	cpumask_var_t effective_cpus;
 	nodemask_t effective_mems;
 
+	/* CPUs reserved for child partitions */
+	cpumask_var_t reserved_cpus;
+
 	/*
 	 * This is old Memory Nodes tasks took on.
 	 *
@@ -134,6 +137,9 @@ struct cpuset {
 
 	/* for custom sched domain */
 	int relax_domain_level;
+
+	/* number of CPUs in reserved_cpus */
+	int nr_reserved;
 };
 
 static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
@@ -175,6 +181,7 @@ static inline bool task_has_mempolicy(struct task_struct *task)
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_PARTITION_ROOT,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -203,6 +210,11 @@ static inline int is_sched_load_balance(const struct cpuset *cs)
 	return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 }
 
+static inline int is_partition_root(const struct cpuset *cs)
+{
+	return test_bit(CS_PARTITION_ROOT, &cs->flags);
+}
+
 static inline int is_memory_migrate(const struct cpuset *cs)
 {
 	return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
@@ -220,7 +232,7 @@ static inline int is_spread_slab(const struct cpuset *cs)
 
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
-		  (1 << CS_MEM_EXCLUSIVE)),
+		  (1 << CS_MEM_EXCLUSIVE) | (1 << CS_PARTITION_ROOT)),
 };
 
 /**
@@ -881,6 +893,27 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	css_task_iter_end(&it);
 }
 
+/**
+ * compute_effective_cpumask - Compute the effective cpumask of the cpuset
+ * @new_cpus: the temp variable for the new effective_cpus mask
+ * @cs: the cpuset that needs to recompute the new effective_cpus mask
+ * @parent: the parent cpuset
+ *
+ * If the parent has reserved CPUs, include them in the list of allowable
+ * CPUs in computing the new effective_cpus mask.
+ */
+static void compute_effective_cpumask(struct cpumask *new_cpus,
+				      struct cpuset *cs, struct cpuset *parent)
+{
+	if (parent->nr_reserved) {
+		cpumask_or(new_cpus, parent->effective_cpus,
+			   parent->reserved_cpus);
+		cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
+	} else {
+		cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
+	}
+}
+
 /*
  * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree
  * @cs: the cpuset to consider
@@ -903,7 +936,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
 
-		cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
+		compute_effective_cpumask(new_cpus, cp, parent);
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
@@ -949,6 +982,130 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 }
 
 /**
+ * update_reserved_cpumask - update the reserved_cpus mask of parent cpuset
+ * @cpuset:  The cpuset that requests CPU reservation
+ * @delmask: The old reserved cpumask to be removed from the parent
+ * @addmask: The new reserved cpumask to be added to the parent
+ * Return: 0 if successful, an error code otherwise
+ *
+ * Changes to the reserved CPUs are not allowed if any of CPUs changing
+ * state are in any of the child cpusets of the parent except the requesting
+ * child.
+ *
+ * If the sched.partition flag changes, either the delmask (0=>1) or the
+ * addmask (1=>0) will be NULL.
+ *
+ * Called with cpuset_mutex held.
+ */
+static int update_reserved_cpumask(struct cpuset *cpuset,
+	struct cpumask *delmask, struct cpumask *addmask)
+{
+	int retval;
+	struct cpuset *parent = parent_cs(cpuset);
+	struct cpuset *sibling;
+	struct cgroup_subsys_state *pos_css;
+	int old_count = parent->nr_reserved;
+
+	/*
+	 * The parent must be a partition root.
+	 * The new cpumask, if present, must not be empty.
+	 */
+	if (!is_partition_root(parent) ||
+	   (addmask && cpumask_empty(addmask)))
+		return -EINVAL;
+
+	/*
+	 * The delmask, if present, must be a subset of parent's reserved
+	 * CPUs.
+	 */
+	if (delmask && !cpumask_empty(delmask) && (!parent->nr_reserved ||
+		       !cpumask_subset(delmask, parent->reserved_cpus))) {
+		WARN_ON_ONCE(1);
+		return -EINVAL;
+	}
+
+	/*
+	 * A sched.partition state change is not allowed if there are
+	 * online children.
+	 */
+	if (css_has_online_children(&cpuset->css))
+		return -EBUSY;
+
+	if (!old_count) {
+		if (!zalloc_cpumask_var(&parent->reserved_cpus, GFP_KERNEL)) {
+			retval = -ENOMEM;
+			goto out;
+		}
+		old_count = 1;
+	}
+
+	retval = -EBUSY;
+
+	/*
+	 * The cpus to be added must be a proper subset of the parent's
+	 * effective_cpus mask but not in the reserved_cpus mask.
+	 */
+	if (addmask) {
+		if (!cpumask_subset(addmask, parent->effective_cpus) ||
+		     cpumask_equal(addmask, parent->effective_cpus))
+			goto out;
+		if (parent->nr_reserved &&
+		    cpumask_intersects(parent->reserved_cpus, addmask))
+			goto out;
+	}
+
+	/*
+	 * Check if any CPUs in addmask or delmask are in the effective_cpus
+	 * of a sibling cpuset. The implied cpu_exclusive of a partition
+	 * root will ensure there are no overlap in cpus_allowed.
+	 */
+	rcu_read_lock();
+	cpuset_for_each_child(sibling, pos_css, parent) {
+		if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
+			continue;
+		if (addmask &&
+		    cpumask_intersects(sibling->effective_cpus, addmask))
+			goto out_unlock;
+		if (delmask &&
+		    cpumask_intersects(sibling->effective_cpus, delmask))
+			goto out_unlock;
+	}
+	rcu_read_unlock();
+
+	/*
+	 * Change the reserved CPU list.
+	 * Newly added reserved CPUs will be removed from effective_cpus
+	 * and newly deleted ones will be added back if they are online.
+	 */
+	spin_lock_irq(&callback_lock);
+	if (addmask) {
+		cpumask_or(parent->reserved_cpus,
+			   parent->reserved_cpus, addmask);
+		cpumask_andnot(parent->effective_cpus,
+			       parent->effective_cpus, addmask);
+	}
+	if (delmask) {
+		cpumask_andnot(parent->reserved_cpus,
+			       parent->reserved_cpus, delmask);
+		cpumask_or(parent->effective_cpus,
+			   parent->effective_cpus, delmask);
+	}
+
+	parent->nr_reserved = cpumask_weight(parent->reserved_cpus);
+	spin_unlock_irq(&callback_lock);
+	retval = 0;
+out:
+	if (old_count && !parent->nr_reserved)
+		free_cpumask_var(parent->reserved_cpus);
+
+	return retval;
+
+out_unlock:
+	rcu_read_unlock();
+	goto out;
+}
+
+/**
  * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
  * @cs: the cpuset to consider
  * @trialcs: trial cpuset
@@ -989,6 +1146,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
+	if (is_partition_root(cs))
+		return -EBUSY;
+
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
 	spin_unlock_irq(&callback_lock);
@@ -1317,6 +1477,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	struct cpuset *trialcs;
 	int balance_flag_changed;
 	int spread_flag_changed;
+	int domain_flag_changed;
 	int err;
 
 	trialcs = alloc_trial_cpuset(cs);
@@ -1328,6 +1489,18 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	else
 		clear_bit(bit, &trialcs->flags);
 
+	/*
+	 *  Turning on the sched.partition flag (default hierarchy only)
+	 *  implies an implicit cpu_exclusive. Turning off sched.partition
+	 *  will clear the cpu_exclusive flag.
+	 */
+	if (bit == CS_PARTITION_ROOT) {
+		if (turning_on)
+			set_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+		else
+			clear_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+	}
+
 	err = validate_change(cs, trialcs);
 	if (err < 0)
 		goto out;
@@ -1338,11 +1511,27 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
+	domain_flag_changed = (is_partition_root(cs) !=
+			       is_partition_root(trialcs));
+
+	if (domain_flag_changed) {
+		err = turning_on
+		    ? update_reserved_cpumask(cs, NULL, cs->cpus_allowed)
+		    : update_reserved_cpumask(cs, cs->cpus_allowed, NULL);
+		if (err < 0)
+			goto out;
+		/*
+		 * At this point, the state has been changed.
+		 * So we can't back out with error anymore.
+		 */
+	}
+
 	spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
 	spin_unlock_irq(&callback_lock);
 
-	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
+	if (!cpumask_empty(trialcs->cpus_allowed) &&
+	   (balance_flag_changed || domain_flag_changed))
 		rebuild_sched_domains_locked();
 
 	if (spread_flag_changed)
@@ -1597,6 +1786,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	FILE_MEM_EXCLUSIVE,
 	FILE_MEM_HARDWALL,
 	FILE_SCHED_LOAD_BALANCE,
+	FILE_PARTITION_ROOT,
 	FILE_SCHED_RELAX_DOMAIN_LEVEL,
 	FILE_MEMORY_PRESSURE_ENABLED,
 	FILE_MEMORY_PRESSURE,
@@ -1630,6 +1820,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	case FILE_SCHED_LOAD_BALANCE:
 		retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val);
 		break;
+	case FILE_PARTITION_ROOT:
+		retval = update_flag(CS_PARTITION_ROOT, cs, val);
+		break;
 	case FILE_MEMORY_MIGRATE:
 		retval = update_flag(CS_MEMORY_MIGRATE, cs, val);
 		break;
@@ -1791,6 +1984,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 		return is_mem_hardwall(cs);
 	case FILE_SCHED_LOAD_BALANCE:
 		return is_sched_load_balance(cs);
+	case FILE_PARTITION_ROOT:
+		return is_partition_root(cs);
 	case FILE_MEMORY_MIGRATE:
 		return is_memory_migrate(cs);
 	case FILE_MEMORY_PRESSURE_ENABLED:
@@ -1967,6 +2162,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "sched.partition",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_PARTITION_ROOT,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
 	{ }	/* terminate */
 };
 
-- 
1.8.3.1



* [PATCH v11 3/9] cpuset: Simulate auto-off of sched.partition at cgroup removal
  2018-06-24  7:30 [PATCH v11 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
  2018-06-24  7:30 ` [PATCH v11 1/9] " Waiman Long
  2018-06-24  7:30 ` [PATCH v11 2/9] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
@ 2018-06-24  7:30 ` Waiman Long
  2018-06-24  7:30 ` [PATCH v11 4/9] cpuset: Allow changes to cpus in a partition root Waiman Long
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-06-24  7:30 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Making a cgroup a partition root will reserve cpu resource at its parent.
So when a partition root cgroup is destroyed, we need to free the
reserved cpus at its parent. This is now done by doing an auto-off of
the sched.partition flag in the offlining phase when a partition root
cgroup is being removed.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5989bff..e902f54 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2278,7 +2278,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 /*
  * If the cpuset being removed has its flag 'sched_load_balance'
  * enabled, then simulate turning sched_load_balance off, which
- * will call rebuild_sched_domains_locked().
+ * will call rebuild_sched_domains_locked(). That is not needed
+ * in the default hierarchy where only changes in the partition
+ * flag will cause repartitioning.
+ *
+ * If the cpuset has the 'sched.partition' flag enabled, simulate
+ * turning 'sched.partition' off.
  */
 
 static void cpuset_css_offline(struct cgroup_subsys_state *css)
@@ -2287,7 +2292,18 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 
 	mutex_lock(&cpuset_mutex);
 
-	if (is_sched_load_balance(cs))
+	/*
+	 * A WARN_ON_ONCE() check after calling update_flag() to make
+	 * sure that the operation succeeds.
+	 */
+	if (is_partition_root(cs)) {
+		int ret = update_flag(CS_PARTITION_ROOT, cs, 0);
+
+		WARN_ON_ONCE(ret);
+	}
+
+	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+	    is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
 	cpuset_dec();
-- 
1.8.3.1



* [PATCH v11 4/9] cpuset: Allow changes to cpus in a partition root
  2018-06-24  7:30 [PATCH v11 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (2 preceding siblings ...)
  2018-06-24  7:30 ` [PATCH v11 3/9] cpuset: Simulate auto-off of sched.partition at cgroup removal Waiman Long
@ 2018-06-24  7:30 ` Waiman Long
  2018-06-24  7:30 ` [PATCH v11 5/9] cpuset: Make sure that partition flag work properly with CPU hotplug Waiman Long
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-06-24  7:30 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

The previous patch introduces a new partition flag, but doesn't allow
changes to "cpuset.cpus" once the flag is on. That may be too
restrictive in some use cases, so this restriction is now relaxed to
allow changes to the "cpuset.cpus" file subject to the following
constraints:

 1) The new set of cpus must still be exclusive.
 2) Newly added cpus must be a subset of the parent effective_cpus.
 3) None of the deleted cpus can be one of those allocated to child
    partition roots, if present.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  9 ++++
 kernel/cgroup/cpuset.c                  | 80 ++++++++++++++++++++++++++-------
 2 files changed, 72 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 24a7133..5f3170f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1617,6 +1617,15 @@ Cpuset Interface Files
         its child partition root cgroups.  There must be at least one cpu
         left in the parent partition root cgroup.
 
+	In a partition root, changes to "cpuset.cpus" are allowed as long
+	as the first condition above as well as the following two
+	additional conditions are true.
+
+	1) Any added CPUs must be a proper subset of the parent's
+	   "cpuset.cpus.effective".
+	2) No CPU that has been distributed to child partition roots
+	   is deleted.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e902f54..3a646a9 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -957,6 +957,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 
 		spin_lock_irq(&callback_lock);
 		cpumask_copy(cp->effective_cpus, new_cpus);
+		if (cp->nr_reserved)
+			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
+				       cp->reserved_cpus);
 		spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
@@ -984,23 +987,25 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 /**
  * update_reserved_cpumask - update the reserved_cpus mask of parent cpuset
  * @cpuset:  The cpuset that requests CPU reservation
- * @delmask: The old reserved cpumask to be removed from the parent
- * @addmask: The new reserved cpumask to be added to the parent
+ * @oldmask: The old reserved cpumask to be removed from the parent
+ * @newmask: The new reserved cpumask to be added to the parent
  * Return: 0 if successful, an error code otherwise
  *
  * Changes to the reserved CPUs are not allowed if any of CPUs changing
  * state are in any of the child cpusets of the parent except the requesting
  * child.
  *
- * If the sched.partition flag changes, either the delmask (0=>1) or the
- * addmask (1=>0) will be NULL.
+ * If the sched.partition flag changes, either the oldmask (0=>1) or the
+ * newmask (1=>0) will be NULL.
  *
  * Called with cpuset_mutex held.
  */
 static int update_reserved_cpumask(struct cpuset *cpuset,
-	struct cpumask *delmask, struct cpumask *addmask)
+	struct cpumask *oldmask, struct cpumask *newmask)
 {
 	int retval;
+	int adding, deleting;
+	cpumask_var_t addmask, delmask;
 	struct cpuset *parent = parent_cs(cpuset);
 	struct cpuset *sibling;
 	struct cgroup_subsys_state *pos_css;
@@ -1011,15 +1016,15 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	 * The new cpumask, if present, must not be empty.
 	 */
 	if (!is_partition_root(parent) ||
-	   (addmask && cpumask_empty(addmask)))
+	   (newmask && cpumask_empty(newmask)))
 		return -EINVAL;
 
 	/*
-	 * The delmask, if present, must be a subset of parent's reserved
+	 * The oldmask, if present, must be a subset of parent's reserved
 	 * CPUs.
 	 */
-	if (delmask && !cpumask_empty(delmask) && (!parent->nr_reserved ||
-		       !cpumask_subset(delmask, parent->reserved_cpus))) {
+	if (oldmask && !cpumask_empty(oldmask) && (!parent->nr_reserved ||
+		       !cpumask_subset(oldmask, parent->reserved_cpus))) {
 		WARN_ON_ONCE(1);
 		return -EINVAL;
 	}
@@ -1028,9 +1033,16 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	 * A sched.partition state change is not allowed if there are
 	 * online children.
 	 */
-	if (css_has_online_children(&cpuset->css))
+	if ((!oldmask || !newmask) && css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
+	if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
+		return -ENOMEM;
+	if (!zalloc_cpumask_var(&delmask, GFP_KERNEL)) {
+		free_cpumask_var(addmask);
+		return -ENOMEM;
+	}
+
 	if (!old_count) {
 		if (!zalloc_cpumask_var(&parent->reserved_cpus, GFP_KERNEL)) {
 			retval = -ENOMEM;
@@ -1040,12 +1052,29 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	}
 
 	retval = -EBUSY;
+	adding = deleting = false;
+	/*
+	 * addmask = newmask & ~oldmask
+	 * delmask = oldmask & ~newmask
+	 */
+	if (oldmask && newmask) {
+		adding   = cpumask_andnot(addmask, newmask, oldmask);
+		deleting = cpumask_andnot(delmask, oldmask, newmask);
+		if (!adding && !deleting)
+			goto out_ok;
+	} else if (newmask) {
+		adding = true;
+		cpumask_copy(addmask, newmask);
+	} else if (oldmask) {
+		deleting = true;
+		cpumask_copy(delmask, oldmask);
+	}
 
 	/*
 	 * The cpus to be added must be a proper subset of the parent's
 	 * effective_cpus mask but not in the reserved_cpus mask.
 	 */
-	if (addmask) {
+	if (adding) {
 		if (!cpumask_subset(addmask, parent->effective_cpus) ||
 		     cpumask_equal(addmask, parent->effective_cpus))
 			goto out;
@@ -1055,6 +1084,15 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	}
 
 	/*
+	 * For cpu changes in a partition root, cpu deletion isn't allowed
+	 * if any of the deleted CPUs is in reserved_cpus (distributed
+	 * to child partition roots).
+	 */
+	if (oldmask && newmask && cpuset->nr_reserved && deleting &&
+	    cpumask_intersects(delmask, cpuset->reserved_cpus))
+		goto out;
+
+	/*
 	 * Check if any CPUs in addmask or delmask are in the effective_cpus
 	 * of a sibling cpuset. The implied cpu_exclusive of a partition
 	 * root will ensure there are no overlap in cpus_allowed.
@@ -1063,10 +1101,10 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	cpuset_for_each_child(sibling, pos_css, parent) {
 		if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
 			continue;
-		if (addmask &&
+		if (adding &&
 		    cpumask_intersects(sibling->effective_cpus, addmask))
 			goto out_unlock;
-		if (delmask &&
+		if (deleting &&
 		    cpumask_intersects(sibling->effective_cpus, delmask))
 			goto out_unlock;
 	}
@@ -1078,13 +1116,13 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	 * and newly deleted ones will be added back if they are online.
 	 */
 	spin_lock_irq(&callback_lock);
-	if (addmask) {
+	if (adding) {
 		cpumask_or(parent->reserved_cpus,
 			   parent->reserved_cpus, addmask);
 		cpumask_andnot(parent->effective_cpus,
 			       parent->effective_cpus, addmask);
 	}
-	if (delmask) {
+	if (deleting) {
 		cpumask_andnot(parent->reserved_cpus,
 			       parent->reserved_cpus, delmask);
 		cpumask_or(parent->effective_cpus,
@@ -1093,8 +1131,12 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 
 	parent->nr_reserved = cpumask_weight(parent->reserved_cpus);
 	spin_unlock_irq(&callback_lock);
+
+out_ok:
 	retval = 0;
 out:
+	free_cpumask_var(addmask);
+	free_cpumask_var(delmask);
 	if (old_count && !parent->nr_reserved)
 		free_cpumask_var(parent->reserved_cpus);
 
@@ -1146,8 +1188,12 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
-	if (is_partition_root(cs))
-		return -EBUSY;
+	if (is_partition_root(cs)) {
+		retval = update_reserved_cpumask(cs, cs->cpus_allowed,
+						 trialcs->cpus_allowed);
+		if (retval < 0)
+			return retval;
+	}
 
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v11 5/9] cpuset: Make sure that partition flag work properly with CPU hotplug
  2018-06-24  7:30 [PATCH v11 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (3 preceding siblings ...)
  2018-06-24  7:30 ` [PATCH v11 4/9] cpuset: Allow changes to cpus in a partition root Waiman Long
@ 2018-06-24  7:30 ` Waiman Long
  2018-06-24  7:30 ` [PATCH v11 6/9] cpuset: Make generate_sched_domains() recognize reserved_cpus Waiman Long
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-06-24  7:30 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

When there is a cpu hotplug event (CPU online or offline), the partitions
may need to be reconfigured and regenerated. So code is added to the
hotplug functions to make them work with the new reserved_cpus mask and
compute the right effective_cpus for each of the affected cpusets.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++
 kernel/cgroup/cpuset.c                  | 26 ++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 5f3170f..f4efa40 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1626,6 +1626,16 @@ Cpuset Interface Files
 	2) No CPU that has been distributed to child partition roots
 	   is deleted.
 
+	When all the CPUs allocated to a partition are offlined, the
+	partition will be temporarily gone and all the tasks in it will
+	be migrated to another one that belongs to the parent of the
+	partition root.  This is a destructive operation and all the
+	existing CPU affinities that are narrower than the cpuset itself
+	will be lost.
+
+	When any of those offlined CPUs is onlined again, a new partition
+	will be re-created and the tasks will be migrated back.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3a646a9..e118dd8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -900,7 +900,8 @@ static void update_tasks_cpumask(struct cpuset *cs)
  * @parent: the parent cpuset
  *
  * If the parent has reserved CPUs, include them in the list of allowable
- * CPUs in computing the new effective_cpus mask.
+ * CPUs in computing the new effective_cpus mask. The cpu_active_mask is
+ * used to mask off cpus that are to be offlined.
  */
 static void compute_effective_cpumask(struct cpumask *new_cpus,
 				      struct cpuset *cs, struct cpuset *parent)
@@ -909,6 +910,7 @@ static void compute_effective_cpumask(struct cpumask *new_cpus,
 		cpumask_or(new_cpus, parent->effective_cpus,
 			   parent->reserved_cpus);
 		cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
+		cpumask_and(new_cpus, new_cpus, cpu_active_mask);
 	} else {
 		cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
 	}
@@ -2562,9 +2564,17 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
 		goto retry;
 	}
 
-	cpumask_and(&new_cpus, cs->cpus_allowed, parent_cs(cs)->effective_cpus);
+	compute_effective_cpumask(&new_cpus, cs, parent_cs(cs));
 	nodes_and(new_mems, cs->mems_allowed, parent_cs(cs)->effective_mems);
 
+	if (cs->nr_reserved) {
+		/*
+		 * Some of the CPUs may have been distributed to child
+		 * partition roots. So we need to skip those when computing
+		 * the real effective cpus.
+		 */
+		cpumask_andnot(&new_cpus, &new_cpus, cs->reserved_cpus);
+	}
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
 	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
 
@@ -2614,6 +2624,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	cpumask_copy(&new_cpus, cpu_active_mask);
 	new_mems = node_states[N_MEMORY];
 
+	/*
+	 * If reserved_cpus is populated, it is likely that the check below
+	 * will produce a false positive on cpus_updated when the cpu list
+	 * isn't changed. It is extra work, but it is better to be safe.
+	 */
 	cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
 	mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
 
@@ -2622,6 +2637,13 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+		/*
+		 * Make sure that the reserved cpus aren't in the
+		 * effective cpus.
+		 */
+		if (top_cpuset.nr_reserved)
+			cpumask_andnot(&new_cpus, &new_cpus,
+					top_cpuset.reserved_cpus);
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
 		spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v11 6/9] cpuset: Make generate_sched_domains() recognize reserved_cpus
  2018-06-24  7:30 [PATCH v11 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (4 preceding siblings ...)
  2018-06-24  7:30 ` [PATCH v11 5/9] cpuset: Make sure that partition flag work properly with CPU hotplug Waiman Long
@ 2018-06-24  7:30 ` Waiman Long
  2018-06-24  7:30 ` [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-06-24  7:30 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

The generate_sched_domains() function is modified to make it work
correctly with the newly introduced reserved_cpus mask when generating
sched domains.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e118dd8..e545848 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	int ndoms = 0;		/* number of sched domains in result */
 	int nslot;		/* next empty doms[] struct cpumask slot */
 	struct cgroup_subsys_state *pos_css;
+	bool root_load_balance = is_sched_load_balance(&top_cpuset);
 
 	doms = NULL;
 	dattr = NULL;
 	csa = NULL;
 
 	/* Special case for the 99% of systems with one, full, sched domain */
-	if (is_sched_load_balance(&top_cpuset)) {
+	if (root_load_balance && !top_cpuset.nr_reserved) {
 		ndoms = 1;
 		doms = alloc_sched_domains(ndoms);
 		if (!doms)
@@ -701,6 +702,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	csn = 0;
 
 	rcu_read_lock();
+	if (root_load_balance)
+		csa[csn++] = &top_cpuset;
 	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
 		if (cp == &top_cpuset)
 			continue;
@@ -711,6 +714,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		 * parent's cpus, so just skip them, and then we call
 		 * update_domain_attr_tree() to calc relax_domain_level of
 		 * the corresponding sched domain.
+		 *
+		 * If root is load-balancing, we can skip @cp if it
+		 * is a subset of the root's effective_cpus.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    !(is_sched_load_balance(cp) &&
@@ -718,11 +724,16 @@ static int generate_sched_domains(cpumask_var_t **domains,
 					 housekeeping_cpumask(HK_FLAG_DOMAIN))))
 			continue;
 
+		if (root_load_balance &&
+		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
+			continue;
+
 		if (is_sched_load_balance(cp))
 			csa[csn++] = cp;
 
-		/* skip @cp's subtree */
-		pos_css = css_rightmost_descendant(pos_css);
+		/* skip @cp's subtree if not a partition root */
+		if (!is_partition_root(cp))
+			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
 
@@ -850,7 +861,12 @@ static void rebuild_sched_domains_locked(void)
 	 * passing doms with offlined cpu to partition_sched_domains().
 	 * Anyways, hotplug work item will rebuild sched domains.
 	 */
-	if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+	if (!top_cpuset.nr_reserved &&
+	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+		goto out;
+
+	if (top_cpuset.nr_reserved &&
+	   !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
 		goto out;
 
 	/* Generate domain masks and attrs */
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-06-24  7:30 [PATCH v11 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (5 preceding siblings ...)
  2018-06-24  7:30 ` [PATCH v11 6/9] cpuset: Make generate_sched_domains() recognize reserved_cpus Waiman Long
@ 2018-06-24  7:30 ` Waiman Long
  2018-07-02 16:53   ` Tejun Heo
  2018-06-24  7:30 ` [PATCH v11 8/9] cpuset: Don't rebuild sched domains if cpu changes in non-partition root Waiman Long
  2018-06-24  7:30 ` [PATCH v11 9/9] cpuset: Allow reporting of sched domain generation info Waiman Long
  8 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2018-06-24  7:30 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Because of the fact that setting the "cpuset.sched.partition" in
a direct child of root can remove CPUs from the root's effective CPU
list, it makes sense to know what CPUs are left in the root cgroup for
scheduling purpose. So the "cpuset.cpus.effective" control file is now
exposed in the v2 cgroup root.

For consistency, the "cpuset.mems.effective" control file is exposed
as well.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 4 ++--
 kernel/cgroup/cpuset.c                  | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f4efa40..f7cde15 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1529,7 +1529,7 @@ Cpuset Interface Files
 	and won't be affected by any CPU hotplug events.
 
   cpuset.cpus.effective
-	A read-only multiple values file which exists on non-root
+	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
 
 	It lists the onlined CPUs that are actually granted to this
@@ -1569,7 +1569,7 @@ Cpuset Interface Files
 	and won't be affected by any memory nodes hotplug events.
 
   cpuset.mems.effective
-	A read-only multiple values file which exists on non-root
+	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
 
 	It lists the onlined memory nodes that are actually granted to
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e545848..9097e9d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2216,14 +2216,12 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.name = "cpus.effective",
 		.seq_show = cpuset_common_seq_show,
 		.private = FILE_EFFECTIVE_CPULIST,
-		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
 	{
 		.name = "mems.effective",
 		.seq_show = cpuset_common_seq_show,
 		.private = FILE_EFFECTIVE_MEMLIST,
-		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
 	{
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v11 8/9] cpuset: Don't rebuild sched domains if cpu changes in non-partition root
  2018-06-24  7:30 [PATCH v11 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (6 preceding siblings ...)
  2018-06-24  7:30 ` [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
@ 2018-06-24  7:30 ` Waiman Long
  2018-06-24  7:30 ` [PATCH v11 9/9] cpuset: Allow reporting of sched domain generation info Waiman Long
  8 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-06-24  7:30 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

With cpuset v1, any changes to the list of allowable CPUs in a cpuset
may cause changes in the sched domain configuration depending on the
load balancing states and the cpu lists of its parent and its children.

With cpuset v2 (on the default hierarchy), there are more restrictions
on how the load balancing state of a cpuset can change. As a result,
only changes made in a partition root will cause possible changes to
the corresponding sched domain. CPU changes to any of the non-partition
root cpusets will not cause changes in the sched domain configuration.
Therefore, we don't need to call rebuild_sched_domains_locked() for
changes in a non-partition root cpuset, saving precious cpu cycles.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9097e9d..07a5575 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -986,11 +986,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 		update_tasks_cpumask(cp);
 
 		/*
-		 * If the effective cpumask of any non-empty cpuset is changed,
-		 * we need to rebuild sched domains.
+		 * On legacy hierarchy, if the effective cpumask of any non-
+		 * empty cpuset is changed, we need to rebuild sched domains.
+		 * On default hierarchy, the cpuset needs to be a partition
+		 * root as well.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
-		    is_sched_load_balance(cp))
+		    is_sched_load_balance(cp) &&
+		   (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
+		    is_partition_root(cp)))
 			need_rebuild_sched_domains = true;
 
 		rcu_read_lock();
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v11 9/9] cpuset: Allow reporting of sched domain generation info
  2018-06-24  7:30 [PATCH v11 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (7 preceding siblings ...)
  2018-06-24  7:30 ` [PATCH v11 8/9] cpuset: Don't rebuild sched domains if cpu changes in non-partition root Waiman Long
@ 2018-06-24  7:30 ` Waiman Long
  2018-07-19 13:54   ` Peter Zijlstra
  8 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2018-06-24  7:30 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

This patch enables us to report sched domain generation information.

If DYNAMIC_DEBUG is enabled, issuing the following command

  echo "file cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control

and setting loglevel to 8 will allow the kernel to show what scheduling
domain changes are being made.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 07a5575..62b7e61 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -836,6 +836,23 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	return ndoms;
 }
 
+#ifdef CONFIG_DEBUG_KERNEL
+static inline void debug_print_domains(cpumask_var_t *doms, int ndoms)
+{
+	int i;
+	char buf[200];
+	char *ptr, *end = buf + sizeof(buf) - 1;
+
+	for (i = 0, ptr = buf, *end = '\0'; i < ndoms; i++)
+		ptr += snprintf(ptr, end - ptr, "dom%d=%*pbl ", i,
+				cpumask_pr_args(doms[i]));
+
+	pr_debug("Generated %d domains: %s\n", ndoms, buf);
+}
+#else
+static inline void debug_print_domains(cpumask_var_t *doms, int ndoms) { }
+#endif
+
 /*
  * Rebuild scheduler domains.
  *
@@ -871,6 +888,7 @@ static void rebuild_sched_domains_locked(void)
 
 	/* Generate domain masks and attrs */
 	ndoms = generate_sched_domains(&doms, &attr);
+	debug_print_domains(doms, ndoms);
 
 	/* Have scheduler rebuild the domains */
 	partition_sched_domains(ndoms, doms, attr);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-06-24  7:30 ` [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
@ 2018-07-02 16:53   ` Tejun Heo
  2018-07-03  0:41     ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2018-07-02 16:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello, Waiman.

On Sun, Jun 24, 2018 at 03:30:38PM +0800, Waiman Long wrote:
> Because of the fact that setting the "cpuset.sched.partition" in
> a direct child of root can remove CPUs from the root's effective CPU
> list, it makes sense to know what CPUs are left in the root cgroup for
> scheduling purpose. So the "cpuset.cpus.effective" control file is now
> exposed in the v2 cgroup root.

So, effective changing when enabling partition on a child feels wrong
to me.  It's supposed to contain what's actually allowed to the cgroup
from its parent and that shouldn't change regardless of how those
resources are used.  It's still given to the cgroup from its parent.

It's a bit different because the way partition behaves is different
from other resource knobs in that it locks away those cpus so that
they can't be taken back.

What do people think about restricting partition to the first level
children for now at least?  That way we aren't locked into the special
semantics and we can figure out how to do this down the hierarchy later.
Given that we ignore the regular cpuset settings when the set goes
empty (which also is a special condition which only exists for cpuset)
and inherits the parent's, I think the consistent thing to do is doing
the same for partition - if it can't be satisfied, ignore it, but
maybe there is a better way.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-02 16:53   ` Tejun Heo
@ 2018-07-03  0:41     ` Waiman Long
  2018-07-03 15:58       ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2018-07-03  0:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/03/2018 12:53 AM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Sun, Jun 24, 2018 at 03:30:38PM +0800, Waiman Long wrote:
>> Because of the fact that setting the "cpuset.sched.partition" in
>> a direct child of root can remove CPUs from the root's effective CPU
>> list, it makes sense to know what CPUs are left in the root cgroup for
>> scheduling purpose. So the "cpuset.cpus.effective" control file is now
>> exposed in the v2 cgroup root.
> So, effective changing when enabling partition on a child feels wrong
> to me.  It's supposed to contain what's actually allowed to the cgroup
> from its parent and that shouldn't change regardless of how those
> resources are used.  It's still given to the cgroup from its parent.

Another way to work around this issue is to expose the reserved_cpus in
the parent for holding CPUs that can be taken by a child partition. That
will require adding one more cpuset file for those cgroups that are
partition roots.

> It's a bit different because the way partition behaves is different
> from other resource konbs in that it locks away those cpus so that
> they can't be taken back.
>
> What do people think about restricting partition to the first level
> children for now at least?  That way we aren't locked into the special
> semantics and we can figure out how to do this down the hierarchy later.
> Given that we ignore the regular cpuset settings when the set goes
> empty (which also is a special condition which only exists for cpuset)
> and inherits the parent's, I think the consistent thing to do is doing
> the same for partition - if it can't be satisfied, ignore it, but
> maybe there is a better way.

I don't mind restricting that to the first level children for now. That
does restrict where we can put the container root if we want a separate
partition for a container. Let's hear if others have any objection about
that.

Cheers,
Longman




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-03  0:41     ` Waiman Long
@ 2018-07-03 15:58       ` Tejun Heo
  2018-07-06 20:32         ` Waiman Long
  2018-07-19 13:52         ` Peter Zijlstra
  0 siblings, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2018-07-03 15:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello, Waiman.

On Tue, Jul 03, 2018 at 08:41:31AM +0800, Waiman Long wrote:
> > So, effective changing when enabling partition on a child feels wrong
> > to me.  It's supposed to contain what's actually allowed to the cgroup
> > from its parent and that shouldn't change regardless of how those
> > resources are used.  It's still given to the cgroup from its parent.
> 
> Another way to work around this issue is to expose the reserved_cpus in
> the parent for holding CPUs that can be taken by a child partition. That
> will require adding one more cpuset file for those cgroups that are
> partition roots.

Yeah, that should work.

> I don't mind restricting that to the first level children for now. That
> does restrict where we can put the container root if we want a separate
> partition for a container. Let's hear if others have any objection about
> that.

As currently implemented, partitioning locks away the cpus, which should
be a system level decision, not container level, so it makes sense to
me that it is only available to system root.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-03 15:58       ` Tejun Heo
@ 2018-07-06 20:32         ` Waiman Long
  2018-07-10 15:23           ` Waiman Long
  2018-07-19 13:52         ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Waiman Long @ 2018-07-06 20:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/03/2018 11:58 AM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Tue, Jul 03, 2018 at 08:41:31AM +0800, Waiman Long wrote:
>>> So, effective changing when enabling partition on a child feels wrong
>>> to me.  It's supposed to contain what's actually allowed to the cgroup
>>> from its parent and that shouldn't change regardless of how those
>>> resources are used.  It's still given to the cgroup from its parent.
>> Another way to work around this issue is to expose the reserved_cpus in
>> the parent for holding CPUs that can be taken by a child partition. That
>> will require adding one more cpuset file for those cgroups that are
>> partition roots.
> Yeah, that should work.
>

Thinking about it a bit more, that approach will make creating a
partition a multi-step process:

1) Reserve the CPUs in reserved_cpus.
2) Enable sched.partition.
3) Write the CPU list into cpus.

There are also more exception cases that need to be handled. The current
approach, on the other hands, is much simpler and easier to understand
and use.
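As a sketch, the multi-step flow would look something like this. The
cpuset.reserved_cpus file name is hypothetical (it is only a proposal
here), and a scratch directory stands in for a real cgroup v2 mount so
the sequence can be shown harmlessly:

```shell
# Hypothetical sketch of the proposed three-step flow; none of these
# files exist in the kernel today under these names. A temp directory
# plays the role of the cgroup v2 hierarchy.
root=$(mktemp -d)
mkdir -p "$root/child"
echo "2-3" > "$root/cpuset.reserved_cpus"       # 1) reserve CPUs in the parent
echo 1 > "$root/child/cpuset.sched.partition"   # 2) enable sched.partition
echo "2-3" > "$root/child/cpuset.cpus"          # 3) write the CPU list into cpus
cat "$root/child/cpuset.cpus"
```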

>> I don't mind restricting that to the first level children for now. That
>> does restrict where we can put the container root if we want a separate
>> partition for a container. Let's hear if others have any objection about
>> that.
> As currently implemented, partitioning locks away the cpus, which should
> be a system level decision, not container level, so it makes sense to
> me that it is only available to system root.

So my preference is to allow partitioning only on the first level children
of the root for the time being. I think it should cover most of the use
cases. I will update the patchset to reflect that.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-06 20:32         ` Waiman Long
@ 2018-07-10 15:23           ` Waiman Long
  2018-07-18 15:21             ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2018-07-10 15:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/06/2018 04:32 PM, Waiman Long wrote:
> On 07/03/2018 11:58 AM, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Tue, Jul 03, 2018 at 08:41:31AM +0800, Waiman Long wrote:
>>>> So, effective changing when enabling partition on a child feels wrong
>>>> to me.  It's supposed to contain what's actually allowed to the cgroup
>>>> from its parent and that shouldn't change regardless of how those
>>>> resources are used.  It's still given to the cgroup from its parent.
>>> Another way to work around this issue is to expose the reserved_cpus in
>>> the parent for holding CPUs that can be taken by a child partition. That
>>> will require adding one more cpuset file for those cgroups that are
>>> partition roots.
>> Yeah, that should work.
>>
> Thinking about it a bit more, that approach will make creating a
> partition a multi-step process:
>
> 1) Reserve the CPUs in reserved_cpus.
> 2) Enable sched.partition.
> 3) Write the CPU list into cpus.
>
> There are also more exception cases that need to be handled. The current
> approach, on the other hand, is much simpler and easier to understand
> and use.
>
>>> I don't mind restricting that to the first level children for now. That
>>> does restrict where we can put the container root if we want a separate
>>> partition for a container. Let's hear if others have any objection about
>>> that.
>> As currently implemented, partitioning locks away the cpus, which should
>> be a system level decision, not container level, so it makes sense to
>> me that it is only available to system root.
> So my preference is to allow partitioning only on the first level children
> of the root for the time being. I think it should cover most of the use
> cases. I will update the patchset to reflect that.
>
> Cheers,
> Longman
>
Below is the incremental patch that allows partitioning only on the first
level children of the root. Please let me know your thoughts on that.

Thanks,
Longman

-------------[ Cut here ]-------------------------

From 5a41209da94385efff87d79f6523265c710cbea5 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Tue, 10 Jul 2018 10:23:16 -0400
Subject: [PATCH v11 10/10] cpuset: Restrict sched.partition to first level
 children of root only

Enabling partition on a v2 cpuset has the side effect of affecting the
effective CPUs of its parent, which is currently unique to partitioning.
As we are not sure about the repercussions of enabling that globally,
we are now restricting the enabling of sched.partition to the first
level children of the root cgroup in the default hierarchy.

This is done by removing the "cpuset.sched.partition" control file
on cgroups that are not the first level children of the root. A new
show_cfile function pointer is added to the cftype structure. If it
is defined, it will be called to return a boolean value to determine
if the corresponding control file should show up in the cgroup. It
provides a more flexible mechanism to determine the visibility of the
control file than a simple CFTYPE_* flag can do.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 10 +++++-----
 include/linux/cgroup-defs.h             |  9 +++++++++
 kernel/cgroup/cgroup.c                  |  4 ++++
 kernel/cgroup/cpuset.c                  | 10 ++++++++++
 4 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f7cde15..cf7cd88 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1585,10 +1585,10 @@ Cpuset Interface Files
     Its value will be affected by memory nodes hotplug events.
 
   cpuset.sched.partition
-    A read-write single value file which exists on non-root
-    cpuset-enabled cgroups.  It is a binary value flag that accepts
-    either "0" (off) or "1" (on).  This flag is set and owned by the
-        parent cgroup.
+    A read-write single value file which exists on the first level
+    children of the root cgroup.  It is a binary value flag that
+    accepts either "0" (off) or "1" (on).  This flag is set and
+    owned by the parent cgroup.
 
     If set, it indicates that the current cgroup is the root of a
     new partition or scheduling domain that comprises itself and
@@ -1603,7 +1603,7 @@ Cpuset Interface Files
        exclusive, i.e. they are not shared by any of its siblings.
     2) The "cpuset.cpus" is also a proper subset of the parent's
        "cpuset.cpus.effective".
-    3) The parent cgroup is a partition root.
+    3) The parent cgroup is the root cgroup.
     4) There is no child cgroups with cpuset enabled.  This is for
        eliminating corner cases that have to be handled if such a
        condition is allowed.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index c0e68f9..be79487 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -565,6 +565,15 @@ struct cftype {
     ssize_t (*write)(struct kernfs_open_file *of,
              char *buf, size_t nbytes, loff_t off);
 
+    /*
+     * show_cfile(), if defined, will return a boolean value to
+     * determine if the control file should show up in the cgroup.
+     * It provides more flexibility in deciding where the control
+     * file should appear than simple criteria like on-root or
+     * not-on-root.
+     */
+    bool (*show_cfile)(struct cgroup_subsys_state *css);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
     struct lock_class_key    lockdep_key;
 #endif
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 077370b..0afdab8 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3612,6 +3612,10 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css,
         if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgroup_parent(cgrp))
             continue;
 
+        /* Should the control file show up in the cgroup */
+        if (cft->show_cfile && !cft->show_cfile(css))
+            continue;
+
         if (is_add) {
             ret = cgroup_add_file(css, cgrp, cft);
             if (ret) {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 62b7e61..2f85c1e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2106,6 +2106,15 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 }
 
 /*
+ * The sched.partition control file should only show up in the first
+ * level children of the root cgroup.
+ */
+static bool cpuset_show_partition(struct cgroup_subsys_state *css)
+{
+    return parent_cs(css_cs(css)) == &top_cpuset;
+}
+
+/*
  * for the common functions, 'private' gives the type of file
  */
 
@@ -2250,6 +2259,7 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
         .name = "sched.partition",
         .read_u64 = cpuset_read_u64,
         .write_u64 = cpuset_write_u64,
+        .show_cfile = cpuset_show_partition,
         .private = FILE_PARTITION_ROOT,
         .flags = CFTYPE_NOT_ON_ROOT,
     },
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-10 15:23           ` Waiman Long
@ 2018-07-18 15:21             ` Waiman Long
  2018-07-18 15:31               ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2018-07-18 15:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/10/2018 11:23 AM, Waiman Long wrote:
> On 07/06/2018 04:32 PM, Waiman Long wrote:
>> On 07/03/2018 11:58 AM, Tejun Heo wrote:
>>> Hello, Waiman.
>>>
>>> On Tue, Jul 03, 2018 at 08:41:31AM +0800, Waiman Long wrote:
>>>>> So, effective changing when enabling partition on a child feels wrong
>>>>> to me.  It's supposed to contain what's actually allowed to the cgroup
>>>>> from its parent and that shouldn't change regardless of how those
>>>>> resources are used.  It's still given to the cgroup from its parent.
>>>> Another way to work around this issue is to expose the reserved_cpus in
>>>> the parent for holding CPUs that can be taken by a child partition. That
>>>> will require adding one more cpuset file for those cgroups that are
>>>> partition roots.
>>> Yeah, that should work.
>>>
>> Thinking about it a bit more, that approach will make creating a
>> partition a multi-step process:
>>
>> 1) Reserve the CPUs in reserved_cpus.
>> 2) Enable sched.partition.
>> 3) Write the CPU list into cpus.
>>
>> There are also more exception cases that need to be handled. The current
>> approach, on the other hand, is much simpler and easier to understand
>> and use.
>>
>>>> I don't mind restricting that to the first level children for now. That
>>>> does restrict where we can put the container root if we want a separate
>>>> partition for a container. Let's hear if others have any objection about
>>>> that.
>>> As currently implemented, partitioning locks away the cpus, which should
>>> be a system level decision, not container level, so it makes sense to
>>> me that it is only available to system root.
>> So my preference is to allow partitioning only on the first level children
>> of the root for the time being. I think it should cover most of the use
>> cases. I will update the patchset to reflect that.
>>
>> Cheers,
>> Longman
>>
> Below is the incremental patch that allows partitioning only on the first
> level children of the root. Please let me know your thoughts on that.
>
> Thanks,
> Longman
>
> -------------[ Cut here ]-------------------------
>
> From 5a41209da94385efff87d79f6523265c710cbea5 Mon Sep 17 00:00:00 2001
> From: Waiman Long <longman@redhat.com>
> Date: Tue, 10 Jul 2018 10:23:16 -0400
> Subject: [PATCH v11 10/10] cpuset: Restrict sched.partition to first level
>  children of root only
>
> Enabling partition on a v2 cpuset has the side effect of affecting the
> effective CPUs of its parent, which is currently unique to partitioning.
> As we are not sure about the repercussions of enabling that globally,
> we are now restricting the enabling of sched.partition to the first
> level children of the root cgroup in the default hierarchy.
>
> This is done by removing the "cpuset.sched.partition" control file
> on cgroups that are not the first level children of the root. A new
> show_cfile function pointer is added to the cftype structure. If it
> is defined, it will be called to return a boolean value to determine
> if the corresponding control file should show up in the cgroup. It
> provides a more flexible mechanism to determine the visibility of the
> control file than a simple CFTYPE_* flag can do.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 10 +++++-----
>  include/linux/cgroup-defs.h             |  9 +++++++++
>  kernel/cgroup/cgroup.c                  |  4 ++++
>  kernel/cgroup/cpuset.c                  | 10 ++++++++++
>  4 files changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index f7cde15..cf7cd88 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1585,10 +1585,10 @@ Cpuset Interface Files
>      Its value will be affected by memory nodes hotplug events.
>  
>    cpuset.sched.partition
> -    A read-write single value file which exists on non-root
> -    cpuset-enabled cgroups.  It is a binary value flag that accepts
> -    either "0" (off) or "1" (on).  This flag is set and owned by the
> -        parent cgroup.
> +    A read-write single value file which exists on the first level
> +    children of the root cgroup.  It is a binary value flag that
> +    accepts either "0" (off) or "1" (on).  This flag is set and
> +    owned by the parent cgroup.
>  
>      If set, it indicates that the current cgroup is the root of a
>      new partition or scheduling domain that comprises itself and
> @@ -1603,7 +1603,7 @@ Cpuset Interface Files
>         exclusive, i.e. they are not shared by any of its siblings.
>      2) The "cpuset.cpus" is also a proper subset of the parent's
>         "cpuset.cpus.effective".
> -    3) The parent cgroup is a partition root.
> +    3) The parent cgroup is the root cgroup.
>      4) There is no child cgroups with cpuset enabled.  This is for
>         eliminating corner cases that have to be handled if such a
>         condition is allowed.
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index c0e68f9..be79487 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -565,6 +565,15 @@ struct cftype {
>      ssize_t (*write)(struct kernfs_open_file *of,
>               char *buf, size_t nbytes, loff_t off);
>  
> +    /*
> +     * show_cfile(), if defined, will return a boolean value to
> +     * determine if the control file should show up in the cgroup.
> +     * It provides more flexibility in deciding where the control
> +     * file should appear than simple criteria like on-root or
> +     * not-on-root.
> +     */
> +    bool (*show_cfile)(struct cgroup_subsys_state *css);
> +
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>      struct lock_class_key    lockdep_key;
>  #endif
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 077370b..0afdab8 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3612,6 +3612,10 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css,
>          if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgroup_parent(cgrp))
>              continue;
>  
> +        /* Should the control file show up in the cgroup */
> +        if (cft->show_cfile && !cft->show_cfile(css))
> +            continue;
> +
>          if (is_add) {
>              ret = cgroup_add_file(css, cgrp, cft);
>              if (ret) {
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 62b7e61..2f85c1e 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2106,6 +2106,15 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
>  }
>  
>  /*
> + * The sched.partition control file should only show up in the first
> + * level children of the root cgroup.
> + */
> +static bool cpuset_show_partition(struct cgroup_subsys_state *css)
> +{
> +    return parent_cs(css_cs(css)) == &top_cpuset;
> +}
> +
> +/*
>   * for the common functions, 'private' gives the type of file
>   */
>  
> @@ -2250,6 +2259,7 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
>          .name = "sched.partition",
>          .read_u64 = cpuset_read_u64,
>          .write_u64 = cpuset_write_u64,
> +        .show_cfile = cpuset_show_partition,
>          .private = FILE_PARTITION_ROOT,
>          .flags = CFTYPE_NOT_ON_ROOT,
>      },

Tejun,

What do you think about this patch?

Are there any other issues you have with this patchset?

Thanks,
Longman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-18 15:21             ` Waiman Long
@ 2018-07-18 15:31               ` Tejun Heo
  2018-07-18 21:12                 ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2018-07-18 15:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Wed, Jul 18, 2018 at 11:21:18AM -0400, Waiman Long wrote:
> What do you think about this patch?
> 
> Are there any other issues you have with this patchset?

Looks good to me.  If others are okay with it, I think we're all good.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-18 15:31               ` Tejun Heo
@ 2018-07-18 21:12                 ` Waiman Long
  0 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-07-18 21:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/18/2018 11:31 AM, Tejun Heo wrote:
> On Wed, Jul 18, 2018 at 11:21:18AM -0400, Waiman Long wrote:
>> What do you think about this patch?
>>
>> Are there any other issues you have with this patchset?
> Looks good to me.  If others are okay with it, I think we're all good.
>
> Thanks.
>
Peter,

Do you have any major issues with this cpuset patchset as it stands today?

Thanks,
Longman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-03 15:58       ` Tejun Heo
  2018-07-06 20:32         ` Waiman Long
@ 2018-07-19 13:52         ` Peter Zijlstra
  2018-07-19 14:04           ` Waiman Long
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-07-19 13:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Tue, Jul 03, 2018 at 08:58:23AM -0700, Tejun Heo wrote:
> Hello, Waiman.
> 
> On Tue, Jul 03, 2018 at 08:41:31AM +0800, Waiman Long wrote:
> > > So, effective changing when enabling partition on a child feels wrong
> > > to me.  It's supposed to contain what's actually allowed to the cgroup
> > > from its parent and that shouldn't change regardless of how those
> > > resources are used.  It's still given to the cgroup from its parent.
> > 
> > Another way to work around this issue is to expose the reserved_cpus in
> > the parent for holding CPUs that can be taken by a child partition. That
> > will require adding one more cpuset file for those cgroups that are
> > partition roots.
> 
> Yeah, that should work.
> 
> > I don't mind restricting that to the first level children for now. That
> > does restrict where we can put the container root if we want a separate
> > partition for a container. Let's hear if others have any objection about
> > that.
> 
> As currently implemented, partitioning locks away the cpus, which should
> be a system level decision, not container level, so it makes sense to
> me that it is only available to system root.

I'm terribly confused, what?!

Why would a container not be allowed to create partitions for its
various RT workloads?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 9/9] cpuset: Allow reporting of sched domain generation info
  2018-06-24  7:30 ` [PATCH v11 9/9] cpuset: Allow reporting of sched domain generation info Waiman Long
@ 2018-07-19 13:54   ` Peter Zijlstra
  2018-07-19 13:56     ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-07-19 13:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Sun, Jun 24, 2018 at 03:30:40PM +0800, Waiman Long wrote:
> This patch enables us to report sched domain generation information.
> 
> If DYNAMIC_DEBUG is enabled, issuing the following command
> 
>   echo "file cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control
> 
> and setting loglevel to 8 will allow the kernel to show what scheduling
> domain changes are being made.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 07a5575..62b7e61 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -836,6 +836,23 @@ static int generate_sched_domains(cpumask_var_t **domains,
>  	return ndoms;
>  }
>  
> +#ifdef CONFIG_DEBUG_KERNEL
> +static inline void debug_print_domains(cpumask_var_t *doms, int ndoms)
> +{
> +	int i;
> +	char buf[200];
> +	char *ptr, *end = buf + sizeof(buf) - 1;
> +
> +	for (i = 0, ptr = buf, *end = '\0'; i < ndoms; i++)
> +		ptr += snprintf(ptr, end - ptr, "dom%d=%*pbl ", i,
> +				cpumask_pr_args(doms[i]));
> +
> +	pr_debug("Generated %d domains: %s\n", ndoms, buf);
> +}
> +#else
> +static inline void debug_print_domains(cpumask_var_t *doms, int ndoms) { }
> +#endif

I still hate this patch... it prints information already available in
sched_debug and has that ridiculous on-stack buffer.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 9/9] cpuset: Allow reporting of sched domain generation info
  2018-07-19 13:54   ` Peter Zijlstra
@ 2018-07-19 13:56     ` Waiman Long
  0 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-07-19 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/19/2018 09:54 AM, Peter Zijlstra wrote:
> On Sun, Jun 24, 2018 at 03:30:40PM +0800, Waiman Long wrote:
>> This patch enables us to report sched domain generation information.
>>
>> If DYNAMIC_DEBUG is enabled, issuing the following command
>>
>>   echo "file cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control
>>
>> and setting loglevel to 8 will allow the kernel to show what scheduling
>> domain changes are being made.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/cgroup/cpuset.c | 18 ++++++++++++++++++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 07a5575..62b7e61 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -836,6 +836,23 @@ static int generate_sched_domains(cpumask_var_t **domains,
>>  	return ndoms;
>>  }
>>  
>> +#ifdef CONFIG_DEBUG_KERNEL
>> +static inline void debug_print_domains(cpumask_var_t *doms, int ndoms)
>> +{
>> +	int i;
>> +	char buf[200];
>> +	char *ptr, *end = buf + sizeof(buf) - 1;
>> +
>> +	for (i = 0, ptr = buf, *end = '\0'; i < ndoms; i++)
>> +		ptr += snprintf(ptr, end - ptr, "dom%d=%*pbl ", i,
>> +				cpumask_pr_args(doms[i]));
>> +
>> +	pr_debug("Generated %d domains: %s\n", ndoms, buf);
>> +}
>> +#else
>> +static inline void debug_print_domains(cpumask_var_t *doms, int ndoms) { }
>> +#endif
> I still hate this patch.. it prints information already available in
> sched_debug and has that rediculous on-stack buffer.

This patch is just for my own testing purpose. We can certainly drop that.

Tejun, please drop this one.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-19 13:52         ` Peter Zijlstra
@ 2018-07-19 14:04           ` Waiman Long
  2018-07-19 15:30             ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2018-07-19 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Ingo Molnar, cgroups, linux-kernel,
	linux-doc, kernel-team, pjt, luto, Mike Galbraith, torvalds,
	Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/19/2018 09:52 AM, Peter Zijlstra wrote:
> On Tue, Jul 03, 2018 at 08:58:23AM -0700, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Tue, Jul 03, 2018 at 08:41:31AM +0800, Waiman Long wrote:
>>>> So, effective changing when enabling partition on a child feels wrong
>>>> to me.  It's supposed to contain what's actually allowed to the cgroup
>>>> from its parent and that shouldn't change regardless of how those
>>>> resources are used.  It's still given to the cgroup from its parent.
>>> Another way to work around this issue is to expose the reserved_cpus in
>>> the parent for holding CPUs that can be taken by a child partition. That
>>> will require adding one more cpuset file for those cgroups that are
>>> partition roots.
>> Yeah, that should work.
>>
>>> I don't mind restricting that to the first level children for now. That
>>> does restrict where we can put the container root if we want a separate
>>> partition for a container. Let's hear if others have any objection about
>>> that.
>> As currently implemented, partitioning locks away the cpus, which should
>> be a system level decision, not container level, so it makes sense to
>> me that it is only available to system root.
> I'm terribly confused, what?!
>
> Why would a container not be allowed to create partitions for its
> various RT workloads?

As far as I understand, Tejun has some concern that the way
partitioning works is inconsistent with how other resources are being
managed by cgroup v2 controllers. I added an incremental patch to
temporarily disable the creation of partitions below the first level
children to buy us time so that we can reach a compromise later on what
to do. We can always add features, but taking away features after they
are made available will be hard.

I am fine either way. It is up to you and Tejun to figure out what
should be made available to the users.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-19 14:04           ` Waiman Long
@ 2018-07-19 15:30             ` Tejun Heo
  2018-07-19 15:52               ` Waiman Long
  2018-07-20 11:29               ` Peter Zijlstra
  0 siblings, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2018-07-19 15:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello,

On Thu, Jul 19, 2018 at 10:04:54AM -0400, Waiman Long wrote:
> > Why would a container not be allowed to create partitions for its
> > various RT workloads?
> 
> As far as I understand, Tejun has some concern about the way that
> partitioning works is inconsistent with how other resources are being
> managed by cgroup v2 controllers. I adds an incremental patch to
> temporarily disable the creation of partition below the first level
> children to buy us time so that we can reach a compromise later on what
> to do. We can always add features, but taking away features after they
> are made available will be hard.
> 
> I am fine either way. It is up to you and Tejun to figure out what
> should be made available to the users.

So, the main thing is that putting a cpu into a partition locks away
the cpu from its ancestors.  That's a system level operation which
isn't delegatable.  If we want to allow partitioning in subtrees, the
parent should still be able to take away partitioned cpus too, even if that
means ignoring descendants' configurations, which btw is exactly what
cpuset does for non-partition configs.

I don't think this would be technically too challenging to implement,
but unless there are immediate use cases for it, we can start simpler
& restricted.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-19 15:30             ` Tejun Heo
@ 2018-07-19 15:52               ` Waiman Long
  2018-07-19 16:52                 ` Tejun Heo
  2018-07-20 11:29               ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Waiman Long @ 2018-07-19 15:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/19/2018 11:30 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 19, 2018 at 10:04:54AM -0400, Waiman Long wrote:
>>> Why would a container not be allowed to create partitions for its
>>> various RT workloads?
>> As far as I understand, Tejun has some concern that the way
>> partitioning works is inconsistent with how other resources are being
>> managed by cgroup v2 controllers. I added an incremental patch to
>> temporarily disable the creation of partitions below the first level
>> children to buy us time so that we can reach a compromise later on what
>> to do. We can always add features, but taking away features after they
>> are made available will be hard.
>>
>> I am fine either way. It is up to you and Tejun to figure out what
>> should be made available to the users.
> So, the main thing is that putting a cpu into a partition locks away
> the cpu from its ancestors.  That's a system level operation which
> isn't delegatable.  If we want to allow partitioning in subtrees, the
> parent should still be able to take away partitioned cpus too, even if that
> means ignoring descendants' configurations, which btw is exactly what
> cpuset does for non-partition configs.
>
> I don't think this would be technically too challenging to implement,
> but unless there are immediate use cases for it, we can start simpler
> & restricted.
>
> Thanks.
>
BTW, the way partitions are currently implemented, a child cannot be a
partition root unless its parent is a partition root itself. That is to
prevent turning on a partition from affecting ancestors further up the
hierarchy than just the parent. So in the case of a container, it cannot
allocate sub-partitions underneath itself unless it is a partition
itself. Will that solve your concern?
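The rule just described, that a child can become a partition root only
if its parent already is one, can be modeled as a toy sketch (plain
Python, not kernel code; the class and method names are invented for
illustration):

```python
# Toy model of the invariant: a cpuset may become a partition root only
# if its parent is already one. The top cpuset (parent is None) is
# treated as a partition root. Illustrative only, not the kernel API.
class Cpuset:
    def __init__(self, parent=None):
        self.parent = parent
        # The top cpuset is the root of the whole scheduling domain.
        self.is_partition_root = parent is None

    def enable_partition(self):
        # Refuse unless the parent is itself a partition root, so that
        # enabling a partition never affects ancestors beyond the parent.
        if not self.parent.is_partition_root:
            raise PermissionError("parent is not a partition root")
        self.is_partition_root = True

top = Cpuset()
child = Cpuset(parent=top)
grandchild = Cpuset(parent=child)

child.enable_partition()        # allowed: parent is the top cpuset
grandchild.enable_partition()   # allowed only because child is now a root
```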

Thanks,
Longman



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-19 15:52               ` Waiman Long
@ 2018-07-19 16:52                 ` Tejun Heo
  2018-07-19 17:22                   ` Waiman Long
  2018-07-20 11:31                   ` Peter Zijlstra
  0 siblings, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2018-07-19 16:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello,

On Thu, Jul 19, 2018 at 11:52:46AM -0400, Waiman Long wrote:
> BTW, the way partitions are currently implemented, a child cannot be a
> partition root unless its parent is a partition root itself. That is to
> prevent turning on a partition from affecting ancestors further up the
> hierarchy than just the parent. So in the case of a container, it cannot
> allocate sub-partitions underneath itself unless it is a partition
> itself. Will that solve your concern?

Hmm... so a given ancestor must be able to both

1. control which cpus are moved into a partition in all of its
   subtree.

2. take away any given cpu from its subtree.

Right now, I don't think it's achieving either, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-19 16:52                 ` Tejun Heo
@ 2018-07-19 17:22                   ` Waiman Long
  2018-07-19 17:25                     ` Tejun Heo
  2018-07-20 11:32                     ` Peter Zijlstra
  2018-07-20 11:31                   ` Peter Zijlstra
  1 sibling, 2 replies; 45+ messages in thread
From: Waiman Long @ 2018-07-19 17:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/19/2018 12:52 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 19, 2018 at 11:52:46AM -0400, Waiman Long wrote:
>> BTW, the way partitions are currently implemented, a child cannot be a
>> partition root unless its parent is a partition root itself. That is to
>> keep turning on a partition from affecting ancestors further up the
>> hierarchy than just the parent. So in the case of a container, it
>> cannot allocate sub-partitions underneath it unless it is a partition
>> itself. Will that solve your concern?
> Hmm... so a given ancestor must be able to both
>
> 1. control which cpus are moved into a partition in all of its
>    subtree.

Not at the moment. I do have another idea of adding a "cpus.subpart"
file, for example, to indicate what CPUs a child sub-partition can use
in its partition. This control file only serves as a statement of
intent to grant; it won't affect anything else. Before the partition
flag can be turned on, the kernel will have to check that the cpu list
of the child is also a subset of its parent's cpus.subpart as a
precondition. Does that serve the purpose of letting a parent have a
say in what CPUs can be included in a child partition?

This change will require adding a new cpumask to the cpuset structure,
code to handle the mask, and an additional check in the partition
enable code. That will have minimal impact on the current code.

> 2. take away any given cpu from its subtree.
>
> Right now, I don't think it's achieving either, right?

The only way to take away the CPUs from a sub-partition is to turn off
the partition flag in the child provided that it has no sub-partitions
underneath it.

Do you want a way at the parent level to take CPUs away from child
partitions? The "cpus.subpart" file can probably be used for this
purpose as well, but we have to decide what taking CPUs away from a child
partition means. Does that mean automatically turning off the partition
flag in the children if there is no CPU left in the partition? There are
some implementation details that need to be fleshed out. I would prefer
not to do this, as it will complicate the code without much benefit
that I can see.
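
The precondition could look something like the following sketch. Note
that "cpus.subpart" is only a proposal in this mail, not part of the
posted patchset, and all names here are invented for illustration:

```python
# Sketch of the proposed "cpus.subpart" grant: the file only expresses
# the parent's intent; turning on a child's partition flag additionally
# requires child.cpus to be a subset of the parent's grant mask.

class Cpuset:
    def __init__(self, name, cpus, parent=None):
        self.name = name
        self.cpus = set(cpus)            # cpuset.cpus
        self.cpus_subpart = set()        # hypothetical cpus.subpart grant
        self.parent = parent
        self.is_partition = parent is None

    def set_partition(self, on):
        if on and self.parent is not None:
            if not self.parent.is_partition:
                raise PermissionError("parent is not a partition root")
            # New precondition: the parent must have granted these CPUs.
            if not self.cpus <= self.parent.cpus_subpart:
                raise PermissionError("cpus not granted via cpus.subpart")
        self.is_partition = on


root = Cpuset("/", range(8))
child = Cpuset("/rt", {0, 1}, parent=root)

try:
    child.set_partition(True)            # rejected: no grant yet
except PermissionError as e:
    print(e)

root.cpus_subpart = {0, 1, 2}            # parent expresses intent to grant
child.set_partition(True)                # now satisfies the precondition
```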

Cheers,
Longman



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-19 17:22                   ` Waiman Long
@ 2018-07-19 17:25                     ` Tejun Heo
  2018-07-19 17:38                       ` Waiman Long
  2018-07-20 11:32                     ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2018-07-19 17:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello, Waiman.

On Thu, Jul 19, 2018 at 01:22:36PM -0400, Waiman Long wrote:
...
> Do you want a way at the parent level to take CPUs away from child
> partitions? The "cpus.subpart" file can probably be used for this
> purpose as well, but we have to decide what taking CPUs away from a child
> partition means. Does that mean automatically turning off the partition
> flag in the children if there is no CPU left in the partition? There are

Yeah, I think so.  That's what we do when cpuset.cpus or mems go empty
- ignoring the config.

> some implementation details that need to be fleshed out. I would prefer
> not to do this, as it will complicate the code without much benefit
> that I can see.

So, given how long this has been dragging along and it isn't yet super
clear to me why this needs to be fully hierarchical, I'd actually
prefer just restricting it to the first level children.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-19 17:25                     ` Tejun Heo
@ 2018-07-19 17:38                       ` Waiman Long
  0 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-07-19 17:38 UTC (permalink / raw)
  To: Tejun Heo, Peter Zijlstra
  Cc: Li Zefan, Johannes Weiner, Ingo Molnar, cgroups, linux-kernel,
	linux-doc, kernel-team, pjt, luto, Mike Galbraith, torvalds,
	Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/19/2018 01:25 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Thu, Jul 19, 2018 at 01:22:36PM -0400, Waiman Long wrote:
> ...
>> Do you want a way at the parent level to take CPUs away from child
>> partitions? The "cpus.subpart" file can probably be used for this
>> purpose as well, but we have to decide what taking CPUs away from a child
>> partition means. Does that mean automatically turning off the partition
>> flag in the children if there is no CPU left in the partition? There are
> Yeah, I think so.  That's what we do when cpuset.cpus or mems go empty
> - ignoring the config.
>
>> some implementation details that need to be fleshed out. I would prefer
>> not to do this, as it will complicate the code without much benefit
>> that I can see.
> So, given how long this has been dragging along and it isn't yet super
> clear to me why this needs to be fully hierarchical, I'd actually
> prefer just restricting it to the first level children.
>
> Thanks.
>
Yes, I would prefer to check in the v11 cpuset patchset as it is today,
including the incremental patch that limits partitions to first-level
children, minus the debug patch. I do need to get cpuset v2 enabled ASAP.

This will give me more time to work on extending partitions to be
hierarchical if the need arises, as well as on other cpuset v2 features.
We usually can't get everything right in a single step.

Peter, are you OK with that? I can promise you that I will work to make
this happen at a later date if you really want it.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-19 15:30             ` Tejun Heo
  2018-07-19 15:52               ` Waiman Long
@ 2018-07-20 11:29               ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-07-20 11:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Thu, Jul 19, 2018 at 08:30:45AM -0700, Tejun Heo wrote:
> On Thu, Jul 19, 2018 at 10:04:54AM -0400, Waiman Long wrote:
> > > Why would a container not be allowed to create partitions for its
> > > various RT workloads?
> > 
> > As far as I understand, Tejun has some concern that the way
> > partitioning works is inconsistent with how other resources are being
> > managed by cgroup v2 controllers. I added an incremental patch to
> > temporarily disable the creation of partition below the first level
> > children to buy us time so that we can reach a compromise later on what
> > to do. We can always add features, but taking away features after they
> > are made available will be hard.
> > 
> > I am fine either way. It is up to you and Tejun to figure out what
> > should be made available to the users.
> 
> So, the main thing is that putting a cpu into a partition locks away
> the cpu from its ancestors.  That's a system level operation which
> isn't delegatable. 

If I understood things right, the partition file is actually owned by
the parent. So only the parent can flip that flag. In case of a
container, the filesystem namespace capturing the cgroup would cause
that file to be effectively r/o.

So effectively the partition flag is part of the parent's control. The
parent takes the CPUs away to give them to the child cgroup. The child
itself cannot take or give here.

This is perhaps a little unorthodox, but it delegates just fine. Because
if a container finds .partition == 1, it knows it too can create (sub)
partitions.

> If we want to allow partitioning in subtrees, the
> parent still be able to take away partitioned cpus too even if that
> means ignoring descendants' configurations, which btw is exactly what
> cpuset does for non-partition configs.

I don't see why it would not be able to take away CPUs. But in case of
partitions this really is heinous behaviour of the parent.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-19 16:52                 ` Tejun Heo
  2018-07-19 17:22                   ` Waiman Long
@ 2018-07-20 11:31                   ` Peter Zijlstra
  2018-07-20 11:45                     ` Tejun Heo
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-07-20 11:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Thu, Jul 19, 2018 at 09:52:01AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Thu, Jul 19, 2018 at 11:52:46AM -0400, Waiman Long wrote:
> > BTW, the way the partition is currently implemented right now is that a
> > child cannot be a partition root unless its parent is a partition root
> > itself. That is to avoid turning on partition to affect ancestors
> > further up the hierarchy than just the parent. So in the case of a
> > container, it cannot allocate sub-partitions underneath it unless it is
> > a partition itself. Will that solve your concern?
> 
> Hmm... so a given ancestor must be able to both
> 
> 1. control which cpus are moved into a partition in all of its
>    subtree.

By virtue of the partition file being owned by the parent, this is
already achieved, no?

> 2. take away any given cpu from its subtree.

I really hate this obsession of yours and doubly so for partitions. But
why would this currently not be allowed?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-19 17:22                   ` Waiman Long
  2018-07-19 17:25                     ` Tejun Heo
@ 2018-07-20 11:32                     ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-07-20 11:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Thu, Jul 19, 2018 at 01:22:36PM -0400, Waiman Long wrote:
> > 1. control which cpus are moved into a partition in all of its
> >    subtree.
> 
> Not at the moment. I do have another idea of adding a "cpus.subpart"
> file, for example, to indicate what CPUs a child sub-partition can use
> in its partition.

ARGH, why more files; this is turning into a right mess to use.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 11:31                   ` Peter Zijlstra
@ 2018-07-20 11:45                     ` Tejun Heo
  2018-07-20 12:04                       ` Tejun Heo
  2018-07-20 15:44                       ` Peter Zijlstra
  0 siblings, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2018-07-20 11:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello,

On Fri, Jul 20, 2018 at 01:31:21PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 19, 2018 at 09:52:01AM -0700, Tejun Heo wrote:
> > On Thu, Jul 19, 2018 at 11:52:46AM -0400, Waiman Long wrote:
> > > BTW, the way the partition is currently implemented right now is that a
> > > child cannot be a partition root unless its parent is a partition root
> > > itself. That is to avoid turning on partition to affect ancestors
> > > further up the hierarchy than just the parent. So in the case of a
> > > container, it cannot allocate sub-partitions underneath it unless it is
> > > a partition itself. Will that solve your concern?
> > 
> > Hmm... so a given ancestor must be able to both
> > 
> > 1. control which cpus are moved into a partition in all of its
> >    subtree.
> 
> By virtue of the partition file being owned by the parent, this is
> already achieved, no?

The currently proposed implementation is somewhere in the middle.  It
kinda gets there by restricting a partition to be a child of another
partition, which may be okay but it does make the whole delegation
mechanism less useful.

> > 2. take away any given cpu from its subtree.
> 
> I really hate this obsession of yours and doubly so for partitions. But
> why would this currently not be allowed?

Well, sorry that you hate it.  It's a fundamental architectural
constraint.  If it can't satisfy that, it shouldn't be in cgroup.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 11:45                     ` Tejun Heo
@ 2018-07-20 12:04                       ` Tejun Heo
  2018-07-20 15:44                       ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2018-07-20 12:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello,

On Fri, Jul 20, 2018 at 04:45:49AM -0700, Tejun Heo wrote:
> > > 2. take away any given cpu from its subtree.
> > 
> > I really hate this obsession of yours and doubly so for partitions. But
> > why would this currently not be allowed?
> 
> Well, sorry that you hate it.  It's a fundamental architectural
> constraint.  If it can't satisfy that, it shouldn't be in cgroup.

Just in case it wasn't clear from the previous exchanges, basic
architectural designs like this are what make things like delegation
actually useful across all controllers.  If we allow a descendant in
the hierarchy to restrict what its ancestors can do, it becomes really
painful operationally.

For now, I'd suggest just doing it at the first child level and moving
on unless there are clear use cases where nested partitions are
useful, which I find unlikely given the nature of the operation -
partitions don't really have anything to do with one another.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 11:45                     ` Tejun Heo
  2018-07-20 12:04                       ` Tejun Heo
@ 2018-07-20 15:44                       ` Peter Zijlstra
  2018-07-20 15:56                         ` Tejun Heo
  2018-07-20 15:57                         ` Waiman Long
  1 sibling, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-07-20 15:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Fri, Jul 20, 2018 at 04:45:49AM -0700, Tejun Heo wrote:

> > > Hmm... so a given ancestor must be able to both
> > > 
> > > 1. control which cpus are moved into a partition in all of its
> > >    subtree.
> > 
> > By virtue of the partition file being owned by the parent, this is
> > already achieved, no?
> 
> The currently proposed implementation is somewhere in the middle.  It
> kinda gets there by restricting a partition to be a child of another
> partition, which may be okay but it does make the whole delegation
> mechanism less useful.

So the implementation does not set ownership of the 'partition' file to
that of the parent directory? Because _that_ is what I understood from
Waiman (many versions ago). And that _does_ allow delegation to work
nicely.

> > > 2. take away any given cpu from its subtree.
> > 
> > I really hate this obsession of yours and doubly so for partitions. But
> > why would this currently not be allowed?
> 
> Well, sorry that you hate it.  It's a fundamental architectural
> > constraint.  If it can't satisfy that, it shouldn't be in cgroup.

So is hierarchical behaviour; but you seem willing to forgo that.

Still, the question was, how is this (despicable or not) behaviour not
allowed by the current implementation?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 15:44                       ` Peter Zijlstra
@ 2018-07-20 15:56                         ` Tejun Heo
  2018-07-20 16:19                           ` Waiman Long
  2018-07-20 16:25                           ` Peter Zijlstra
  2018-07-20 15:57                         ` Waiman Long
  1 sibling, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2018-07-20 15:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello, Peter.

On Fri, Jul 20, 2018 at 05:44:54PM +0200, Peter Zijlstra wrote:
> > The currently proposed implementation is somewhere in the middle.  It
> > kinda gets there by restricting a partition to be a child of another
> > partition, which may be okay but it does make the whole delegation
> > mechanism less useful.
> 
> So the implementation does not set ownership of the 'partition' file to
> that of the parent directory? Because _that_ is what I understood from
> Waiman (many versions ago). And that _does_ allow delegation to work
> nicely.

So, that part isn't the problem.  The problem is that if we allow
partitioning nested further away from the ancestor, the descendant would
be able to take away resources from the removed ancestor.  Waiman
avoids this problem by only allowing partitions to nest, but then it's
kinda weird cuz it's just a separate tree.

> > > > 2. take away any given cpu from its subtree.
> > > 
> > > I really hate this obsession of yours and doubly so for partitions. But
> > > why would this currently not be allowed?
> > 
> > Well, sorry that you hate it.  It's a fundamental architectural
> > constraint.  If it can't satisfy that, it shouldn't be in cgroup.
> 
> So is hierarchical behaviour; but you seem willing to forgo that.

Well, we do have root or first-level only things.  Things just need to
be properly delegatable when they're hierarchical (and things should
be hierarchical only when that actually means something).  While not
desirable, considering the history, accommodating this specific usage
that way seems acceptable to me.

> Still, the question was, how is this (despicable or not) behaviour not
> allowed by the current implementation?

I'm not quite sure what you're trying to ask.  Can you elaborate?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 15:44                       ` Peter Zijlstra
  2018-07-20 15:56                         ` Tejun Heo
@ 2018-07-20 15:57                         ` Waiman Long
  1 sibling, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-07-20 15:57 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Ingo Molnar, cgroups, linux-kernel,
	linux-doc, kernel-team, pjt, luto, Mike Galbraith, torvalds,
	Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/20/2018 11:44 AM, Peter Zijlstra wrote:
> On Fri, Jul 20, 2018 at 04:45:49AM -0700, Tejun Heo wrote:
>
>>>> Hmm... so a given ancestor must be able to both
>>>>
>>>> 1. control which cpus are moved into a partition in all of its
>>>>    subtree.
>>> By virtue of the partition file being owned by the parent, this is
>>> already achieved, no?
>> The currently proposed implementation is somewhere in the middle.  It
>> kinda gets there by restricting a partition to be a child of another
>> partition, which may be okay but it does make the whole delegation
>> mechanism less useful.
> So the implementation does not set ownership of the 'partition' file to
> that of the parent directory? Because _that_ is what I understood from
> Waiman (many versions ago). And that _does_ allow delegation to work
> nicely.
>
>>>> 2. take away any given cpu from its subtree.
>>> I really hate this obsession of yours and doubly so for partitions. But
>>> why would this currently not be allowed?
>> Well, sorry that you hate it.  It's a fundamental architectural
>> constraint.  If it can't satisfy that, it shouldn't be in cgroup.
> So is hierarchical behaviour; but you seem willing to forgo that.
>
> Still, the question was, how is this (despicable or not) behaviour not
> allowed by the current implementation?

The taking CPUs away part is not functioning yet in the current
patchset. It is certainly doable. I just need more time to work on that.

The current patchset is fine if partition is restricted to the first
level children as CPU online/offline is properly handled by the
patchset. It is in the non-root level that taking CPUs away from a
partition can be problematic.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 15:56                         ` Tejun Heo
@ 2018-07-20 16:19                           ` Waiman Long
  2018-07-20 16:37                             ` Peter Zijlstra
  2018-07-20 16:25                           ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Waiman Long @ 2018-07-20 16:19 UTC (permalink / raw)
  To: Tejun Heo, Peter Zijlstra
  Cc: Li Zefan, Johannes Weiner, Ingo Molnar, cgroups, linux-kernel,
	linux-doc, kernel-team, pjt, luto, Mike Galbraith, torvalds,
	Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/20/2018 11:56 AM, Tejun Heo wrote:
> Hello, Peter.
>
> On Fri, Jul 20, 2018 at 05:44:54PM +0200, Peter Zijlstra wrote:
>>> The currently proposed implementation is somewhere in the middle.  It
>>> kinda gets there by restricting a partition to be a child of another
>>> partition, which may be okay but it does make the whole delegation
>>> mechanism less useful.
>> So the implementation does not set ownership of the 'partition' file to
>> that of the parent directory? Because _that_ is what I understood from
>> Waiman (many versions ago). And that _does_ allow delegation to work
>> nicely.
> So, that part isn't the problem.  The problem is that if we allow
> partitioning nested further away from the ancestor, the descendant would
> be able to take away resources from the removed ancestor.  Waiman
> avoids this problem by only allowing partitions to nest, but then it's
> kinda weird cuz it's just a separate tree.

The rationale behind the current design is that a parent is allowed to
create a child partition if it fully owns all the cpus it has. That is
true only if it is a partition itself. IOW, making a cpuset a partition
won't affect any ancestors other than the parent.


>>>>> 2. take away any given cpu from its subtree.
>>>> I really hate this obsession of yours and doubly so for partitions. But
>>>> why would this currently not be allowed?
>>> Well, sorry that you hate it.  It's a fundamental architectural
>>> constraint.  If it can't satisfy that, it shouldn't be in cgroup.
>> So is hierarchical behaviour; but you seem willing to forgo that.
> Well, we do have root or first-level only things.  Things just need to
> be properly delegatable when they're hierarchical (and things should
> be hierarchical only when that actually means something).  While not
> desirable, considering the history, accomodating this specific usage
> that way seems acceptable to me.

I am not against the idea of making it hierarchical eventually. I am
just hoping to get things going by merging the patchset in its current
form and then we can make it hierarchical in a followup patch.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 15:56                         ` Tejun Heo
  2018-07-20 16:19                           ` Waiman Long
@ 2018-07-20 16:25                           ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-07-20 16:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Fri, Jul 20, 2018 at 08:56:13AM -0700, Tejun Heo wrote:
> Hello, Peter.
> 
> On Fri, Jul 20, 2018 at 05:44:54PM +0200, Peter Zijlstra wrote:
> > > The currently proposed implementation is somewhere in the middle.  It
> > > kinda gets there by restricting a partition to be a child of another
> > > partition, which may be okay but it does make the whole delegation
> > > mechanism less useful.
> > 
> > So the implementation does not set ownership of the 'partition' file to
> > that of the parent directory? Because _that_ is what I understood from
> > Waiman (many versions ago). And that _does_ allow delegation to work
> > nicely.
> 
> So, that part isn't the problem.  The problem is that if we allow
> partitioning nested further away from the ancestor, the descendant would
> be able to take away resources from the removed ancestor.

If the partition file is owned by the parent, the descendant cannot take
away anything. It can only read the file, not actually write to it.

To be explicit, if we have:

	/.			(uid = root)
	/A/.			(uid = user1)
	/A/cpuset.partition	(uid = root)

so ./*/cpuset.partition is always owned by .

> Waiman avoids this problem by only allowing partitions to nest, but
> then it's kinda weird cuz it's just a separate tree.

The only real constraint is that the parent must have exclusive
'ownership' of the CPUs. Being a partition mandates that to be so. There
is no other way to distinguish exclusive / shared CPUs.

But I don't see how any of this is related to delegation per se. It is a
prerequisite for creating nested partitions, which in turn is a
prerequisite for delegation of partitions.

> > > > > 2. take away any given cpu from its subtree.

> > Still, the question was, how is this (despicable or not) behaviour not
> > allowed by the current implementation?
> 
> I'm not quite sure what you're trying to ask.  Can you elaborate?

We can always offline the CPU, so how exactly can the ancestor not take
back its CPU?

The parent can simply shrink its cpuset.cpus, no? Then the child gets
to recompute its setup and it might find its sub-partition just became
smaller or empty -- just like hotplug.
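
That recovery path can be sketched as follows. This is a toy model of
the semantics being argued for here, with invented names, not the
posted implementation: the parent shrinks its cpus, children recompute
theirs, and a child partition that ends up empty loses its flag:

```python
# When the parent shrinks its cpuset.cpus, each descendant recomputes
# its effective CPUs, and an emptied child partition has its partition
# flag cleared -- the same recovery path as a CPU going offline.

class Cpuset:
    def __init__(self, name, cpus, parent=None):
        self.name = name
        self.cpus = set(cpus)
        self.children = []
        self.is_partition = parent is None
        if parent is not None:
            parent.children.append(self)

    def shrink_cpus(self, new_cpus):
        self.cpus = set(new_cpus)
        for child in self.children:
            # A child keeps only CPUs still present in the parent.
            child.shrink_cpus(child.cpus & self.cpus)
        if self.is_partition and not self.cpus:
            self.is_partition = False    # empty partition loses its flag


root = Cpuset("/", range(8))
a = Cpuset("/A", {0, 1, 2, 3}, parent=root)
a.is_partition = True
b = Cpuset("/A/B", {0, 1}, parent=a)
b.is_partition = True

root.shrink_cpus({2, 3, 4, 5, 6, 7})     # take CPUs 0 and 1 back
print(a.cpus)                            # {2, 3}
print(b.cpus, b.is_partition)            # b lost its CPUs and its flag
```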



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 16:19                           ` Waiman Long
@ 2018-07-20 16:37                             ` Peter Zijlstra
  2018-07-20 17:09                               ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-07-20 16:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Fri, Jul 20, 2018 at 12:19:29PM -0400, Waiman Long wrote:
> I am not against the idea of making it hierarchical eventually. I am
> just hoping to get things going by merging the patchset in its current
> form and then we can make it hierarchical in a followup patch.

Where's the rush? Why can't we do this right in one go?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 16:37                             ` Peter Zijlstra
@ 2018-07-20 17:09                               ` Waiman Long
  2018-07-20 17:41                                 ` Tejun Heo
  2018-07-27 21:21                                 ` Waiman Long
  0 siblings, 2 replies; 45+ messages in thread
From: Waiman Long @ 2018-07-20 17:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/20/2018 12:37 PM, Peter Zijlstra wrote:
> On Fri, Jul 20, 2018 at 12:19:29PM -0400, Waiman Long wrote:
>> I am not against the idea of making it hierarchical eventually. I am
>> just hoping to get things going by merging the patchset in its current
>> form and then we can make it hierarchical in a followup patch.
> Where's the rush? Why can't we do this right in one go?

For me, the rush comes from RHEL8 as it is a goal to have a fully
functioning cgroup v2 in that release.

I also believe that most of the use cases of partition can be satisfied
with partitions at the first level children. Getting hierarchical
partition right may drag on for half a year, maybe, given our history
with the cpu v2 controller. No matter what we do to enable hierarchical
partition in the future, the current model of using a partition flag is
intuitive enough that it won't be changed at least for the first level
children.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 17:09                               ` Waiman Long
@ 2018-07-20 17:41                                 ` Tejun Heo
  2018-08-13 17:56                                   ` Waiman Long
  2018-07-27 21:21                                 ` Waiman Long
  1 sibling, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2018-07-20 17:41 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello,

On Fri, Jul 20, 2018 at 01:09:23PM -0400, Waiman Long wrote:
> On 07/20/2018 12:37 PM, Peter Zijlstra wrote:
> > On Fri, Jul 20, 2018 at 12:19:29PM -0400, Waiman Long wrote:
> >> I am not against the idea of making it hierarchical eventually. I am
> >> just hoping to get things going by merging the patchset in its current
> >> form and then we can make it hierarchical in a followup patch.
> > Where's the rush? Why can't we do this right in one go?
> 
> For me, the rush comes from RHEL8, as it is a goal to have a fully
> functioning cgroup v2 in that release.
> 
> I also believe that most of the use cases of partitions can be satisfied
> with partitions at the first-level children. Getting hierarchical
> partitioning right may drag on for half a year, maybe, given our history
> with the cpu v2 controller. No matter what we do to enable hierarchical
> partitioning in the future, the current model of using a partition flag
> is intuitive enough that it won't be changed, at least for the
> first-level children.

I'm fully with Waiman here.  There are people wanting to use it, and
the part most people want isn't controversial at all.  I don't see
what'd be gained by further delaying the whole thing.  If the
first-level partition thing isn't acceptable to everyone, we can strip
it down even further.  We can get .cpus and .mems merged first, which
is what most people want anyway.

Thanks.

-- 
tejun


* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 17:09                               ` Waiman Long
  2018-07-20 17:41                                 ` Tejun Heo
@ 2018-07-27 21:21                                 ` Waiman Long
  1 sibling, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-07-27 21:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/20/2018 01:09 PM, Waiman Long wrote:
> On 07/20/2018 12:37 PM, Peter Zijlstra wrote:
>> On Fri, Jul 20, 2018 at 12:19:29PM -0400, Waiman Long wrote:
>>> I am not against the idea of making it hierarchical eventually. I am
>>> just hoping to get things going by merging the patchset in its current
>>> form and then we can make it hierarchical in a followup patch.
>> Where's the rush? Why can't we do this right in one go?
> For me, the rush comes from RHEL8, as it is a goal to have a fully
> functioning cgroup v2 in that release.
>
> I also believe that most of the use cases of partitions can be satisfied
> with partitions at the first-level children. Getting hierarchical
> partitioning right may drag on for half a year, maybe, given our history
> with the cpu v2 controller. No matter what we do to enable hierarchical
> partitioning in the future, the current model of using a partition flag
> is intuitive enough that it won't be changed, at least for the
> first-level children.

Peter, are you OK with that, or do you still want hierarchical
partitioning done before submission?

Thanks,
Longman



* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-07-20 17:41                                 ` Tejun Heo
@ 2018-08-13 17:56                                   ` Waiman Long
  2018-08-17 15:59                                     ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2018-08-13 17:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 07/20/2018 01:41 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Jul 20, 2018 at 01:09:23PM -0400, Waiman Long wrote:
>> On 07/20/2018 12:37 PM, Peter Zijlstra wrote:
>>> On Fri, Jul 20, 2018 at 12:19:29PM -0400, Waiman Long wrote:
>>>> I am not against the idea of making it hierarchical eventually. I am
>>>> just hoping to get things going by merging the patchset in its current
>>>> form and then we can make it hierarchical in a followup patch.
>>> Where's the rush? Why can't we do this right in one go?
>> For me, the rush comes from RHEL8, as it is a goal to have a fully
>> functioning cgroup v2 in that release.
>>
>> I also believe that most of the use cases of partitions can be satisfied
>> with partitions at the first-level children. Getting hierarchical
>> partitioning right may drag on for half a year, maybe, given our history
>> with the cpu v2 controller. No matter what we do to enable hierarchical
>> partitioning in the future, the current model of using a partition flag
>> is intuitive enough that it won't be changed, at least for the
>> first-level children.
> I'm fully with Waiman here.  There are people wanting to use it, and
> the part most people want isn't controversial at all.  I don't see
> what'd be gained by further delaying the whole thing.  If the
> first-level partition thing isn't acceptable to everyone, we can strip
> it down even further.  We can get .cpus and .mems merged first, which
> is what most people want anyway.

BTW, I am trying to support hierarchical partitioning. The first thing
that I want to support is allowing CPUs to be removed freely from a
partition root. It turns out that the following existing code in
validate_change() will prevent the removal from happening when it
touches any CPUs that are used in child cpusets:

        /* Each of our child cpusets must be a subset of us */
        ret = -EBUSY;
        cpuset_for_each_child(c, css, cur)
                if (!is_cpuset_subset(c, trial))
                        goto out;

So this is not a new restriction after all.
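To make the subset rule concrete, here is a small userspace sketch of
what validate_change() checks, using a plain bitmask as a hypothetical
stand-in for the kernel's cpumask type (the names below are
illustrative, not the actual kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU; a simplified stand-in for the kernel's cpumask. */
typedef uint64_t cpumask_t;

/*
 * Mirrors the is_cpuset_subset() rule enforced above: a child
 * cpuset's CPUs must always be a subset of its parent's, so a parent
 * cannot shrink cpuset.cpus below what its children are using.
 */
static bool is_subset(cpumask_t child, cpumask_t parent)
{
	/* true iff no bit of 'child' falls outside 'parent' */
	return (child & ~parent) == 0;
}
```

For example, shrinking a parent from CPUs 0-2 (mask 0x7) to CPUs 0-1
(mask 0x3) would be rejected with -EBUSY if any child still holds
CPU 2.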

The following restrictions are still imposed on a partition root wrt
allowable changes in cpuset.cpus:
1) cpuset.cpus cannot be set to "". There must be at least one CPU there.
2) Adding cpus that are not in the parent's cpuset.cpus (as well as
cpuset.cpus.effective), or that would take all of the parent's effective
cpus away, is not allowed.

So are these limitations acceptable?
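As a sketch of what those two restrictions amount to (with
illustrative names and plain bitmasks, not the kernel implementation),
a write to cpuset.cpus on a partition root would be validated roughly
like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU; a simplified stand-in for the kernel's cpumask. */
typedef uint64_t cpumask_t;

/* Hypothetical check for the two partition-root restrictions above. */
static bool partition_cpus_change_ok(cpumask_t new_cpus,
				     cpumask_t parent_cpus,
				     cpumask_t parent_effective)
{
	/* 1) the partition root must keep at least one CPU */
	if (new_cpus == 0)
		return false;
	/* 2a) every CPU must come from the parent's cpus/effective cpus */
	if (new_cpus & ~(parent_cpus & parent_effective))
		return false;
	/* 2b) the change must not take all of the parent's effective CPUs */
	if ((parent_effective & ~new_cpus) == 0)
		return false;
	return true;
}
```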

The easiest way to remove those restrictions is to forcefully turn off
the cpuset.sched.partition flag in the cpuset, as well as in any
sub-partitions, when the user tries to do that. With that change, there
will be no new restrictions on what you can do with cpuset.cpus.

What is your opinion on the best way forward wrt supporting hierarchical
partitioning?

Thanks,
Longman



* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-08-13 17:56                                   ` Waiman Long
@ 2018-08-17 15:59                                     ` Tejun Heo
  2018-08-18  1:03                                       ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2018-08-17 15:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello, Waiman.

On Mon, Aug 13, 2018 at 01:56:15PM -0400, Waiman Long wrote:
> The following restrictions are still imposed on a partition root wrt
> allowable changes in cpuset.cpus:
> 1) cpuset.cpus cannot be set to "". There must be at least one CPU there.
> 2) Adding cpus that are not in the parent's cpuset.cpus (as well as
> cpuset.cpus.effective), or that would take all of the parent's effective
> cpus away, is not allowed.
> 
> So are these limitations acceptable?
> 
> The easiest way to remove those restrictions is to forcefully turn off
> the cpuset.sched.partition flag in the cpuset, as well as in any
> sub-partitions, when the user tries to do that. With that change, there
> will be no new restrictions on what you can do with cpuset.cpus.

I think that'd be the right thing to do.  That's what we do with
cpuset.cpus and cpuset.mems too.  If there's nothing available, we
ignore the config and show the settings which are currently in effect.

Thanks.

-- 
tejun


* Re: [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-08-17 15:59                                     ` Tejun Heo
@ 2018-08-18  1:03                                       ` Waiman Long
  0 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2018-08-18  1:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 08/17/2018 11:59 AM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, Aug 13, 2018 at 01:56:15PM -0400, Waiman Long wrote:
>> The following restrictions are still imposed on a partition root wrt
>> allowable changes in cpuset.cpus:
>> 1) cpuset.cpus cannot be set to "". There must be at least one CPU there.
>> 2) Adding cpus that are not in the parent's cpuset.cpus (as well as
>> cpuset.cpus.effective), or that would take all of the parent's effective
>> cpus away, is not allowed.
>>
>> So are these limitations acceptable?
>>
>> The easiest way to remove those restrictions is to forcefully turn off
>> the cpuset.sched.partition flag in the cpuset, as well as in any
>> sub-partitions, when the user tries to do that. With that change, there
>> will be no new restrictions on what you can do with cpuset.cpus.
> I think that'd be the right thing to do.  That's what we do with
> cpuset.cpus and cpuset.mems too.  If there's nothing available, we
> ignore the config and show the settings which are currently in effect.
>
> Thanks.
>
Thanks! I will send out an updated patch after the 4.19 merge window closes.

-Longman



end of thread, other threads:[~2018-08-18  1:03 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-24  7:30 [PATCH v11 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
2018-06-24  7:30 ` [PATCH v11 1/9] " Waiman Long
2018-06-24  7:30 ` [PATCH v11 2/9] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
2018-06-24  7:30 ` [PATCH v11 3/9] cpuset: Simulate auto-off of sched.partition at cgroup removal Waiman Long
2018-06-24  7:30 ` [PATCH v11 4/9] cpuset: Allow changes to cpus in a partition root Waiman Long
2018-06-24  7:30 ` [PATCH v11 5/9] cpuset: Make sure that partition flag work properly with CPU hotplug Waiman Long
2018-06-24  7:30 ` [PATCH v11 6/9] cpuset: Make generate_sched_domains() recognize reserved_cpus Waiman Long
2018-06-24  7:30 ` [PATCH v11 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
2018-07-02 16:53   ` Tejun Heo
2018-07-03  0:41     ` Waiman Long
2018-07-03 15:58       ` Tejun Heo
2018-07-06 20:32         ` Waiman Long
2018-07-10 15:23           ` Waiman Long
2018-07-18 15:21             ` Waiman Long
2018-07-18 15:31               ` Tejun Heo
2018-07-18 21:12                 ` Waiman Long
2018-07-19 13:52         ` Peter Zijlstra
2018-07-19 14:04           ` Waiman Long
2018-07-19 15:30             ` Tejun Heo
2018-07-19 15:52               ` Waiman Long
2018-07-19 16:52                 ` Tejun Heo
2018-07-19 17:22                   ` Waiman Long
2018-07-19 17:25                     ` Tejun Heo
2018-07-19 17:38                       ` Waiman Long
2018-07-20 11:32                     ` Peter Zijlstra
2018-07-20 11:31                   ` Peter Zijlstra
2018-07-20 11:45                     ` Tejun Heo
2018-07-20 12:04                       ` Tejun Heo
2018-07-20 15:44                       ` Peter Zijlstra
2018-07-20 15:56                         ` Tejun Heo
2018-07-20 16:19                           ` Waiman Long
2018-07-20 16:37                             ` Peter Zijlstra
2018-07-20 17:09                               ` Waiman Long
2018-07-20 17:41                                 ` Tejun Heo
2018-08-13 17:56                                   ` Waiman Long
2018-08-17 15:59                                     ` Tejun Heo
2018-08-18  1:03                                       ` Waiman Long
2018-07-27 21:21                                 ` Waiman Long
2018-07-20 16:25                           ` Peter Zijlstra
2018-07-20 15:57                         ` Waiman Long
2018-07-20 11:29               ` Peter Zijlstra
2018-06-24  7:30 ` [PATCH v11 8/9] cpuset: Don't rebuild sched domains if cpu changes in non-partition root Waiman Long
2018-06-24  7:30 ` [PATCH v11 9/9] cpuset: Allow reporting of sched domain generation info Waiman Long
2018-07-19 13:54   ` Peter Zijlstra
2018-07-19 13:56     ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).