linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v12 0/9] cpuset: Enable cpuset controller in default hierarchy
@ 2018-08-27 14:41 Waiman Long
  2018-08-27 14:41 ` [PATCH v12 1/9] " Waiman Long
                   ` (8 more replies)
  0 siblings, 9 replies; 16+ messages in thread
From: Waiman Long @ 2018-08-27 14:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

v12:
 - Take out the debugging patch to print partitions.
 - Add a patch to force turning off partition flag if newly modified CPU
   list doesn't meet the requirement of being a partition root.
 - Remove some unneeded checking code in update_reserved_cpumask().

v11:
 - Change the "domain_root" name to "partition" as suggested by Peter and
   update the documentation and code accordingly.
 - Remove the dying cgroup check in update_reserved_cpus() as the check
   may not be needed after all.
 - Document the effect of losing CPU affinity after offling all the cpus
   in a partition.
 - There is no other major code changes in this version.

v10:
 - Remove the cpuset.sched.load_balance patch for now as it may not
   be that useful.
 - Break the large patch 2 into smaller patches to make them a bit
   easier to review.
 - Test and fix issues related to changing "cpuset.cpus" and cpu
   online/offline in a domain root.
 - Rename isolated_cpus to reserved_cpus as this cpumask holds CPUs
   reserved for child sched domains.
 - Rework the scheduling domain debug printing code in the last patch.
 - Document update to the newly moved
   Documentation/admin-guide/cgroup-v2.rst.

v9:
 - Rename cpuset.sched.domain to cpuset.sched.domain_root to better
   identify its purpose as the root of a new scheduling domain or
   partition.
 - Clarify in the document about the purpose of domain_root and
   load_balance. Using domain_root is th only way to create new
   partition.
 - Fix a lockdep warning in update_isolated_cpumask() function.
 - Add a new patch to eliminate call to generate_sched_domains() for
   v2 when a change in cpu list does not touch a domain_root.

 v9 patch: https://lkml.org/lkml/2018/5/29/507
v10 patch: https://lkml.org/lkml/2018/6/18/3
v11 patch: https://lkml.org/lkml/2018/6/24/30

The purpose of this patchset is to provide a basic set of cpuset control
files for cgroup v2. This basic set includes the non-root "cpus",
"mems" and "sched.partition". The "cpus.effective" and "mems.effective"
will appear in all cpuset-enabled cgroups.

The new control file that is unique to v2 is "sched.partition". It
is a boolean flag file that designates if a cgroup is the root of a
new scheduling domain or partition with its own set of unique list of
CPUs from scheduling perspective disjointed from other partitions. The
root cgroup is always a partition root. Multiple levels of partitions
are supported with some limitations. So a container partition root can
behave like a real root.

When a partition root cgroup is removed, its list of exclusive
CPUs will be returned to the parent's cpus.effective automatically.

A container root can be a partition root with sub-partitions
created underneath it. One difference from the real root is that the
"cpuset.sched.partition" flag isn't present in the real root, but is
present in a container root. This is also true for other cpuset control
files as well as those from the other controllers. This is a general
issue that is not going to be addressed here in this patchset.

This patchset does not exclude the possibility of adding more features
in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their effective
counterparts.

Patch 2 adds a new "sched.partition" control file for setting up multiple
scheduling domains or partitions. A partition root implies cpu_exclusive.

Patch 3 handles the proper deletion of a partition root cgroup by turning
off the partition flag automatically before deletion.

Patch 4 allows "cpuset.cpus" of a partition root cgroup to be changed
subject to certain constraints.

Patch 5 makes the hotplug code deal with partition root properly.

Patch 6 updates the scheduling domain genaration code to work with
the new domain root feature.

Patch 7 exposes cpus.effective and mems.effective to the root cgroup as
enabling child scheduling domains will take CPUs away from the root cgroup.
So it will be nice to monitor what CPUs are left there.

Patch 8 eliminates the need to rebuild sched domains for v2 if cpu list
changes occur to non-domain root cpusets only.

Patch 9 enables the forced turnning off of partition flag if changes
made to "cpuset.cpus" makes a partition root cpuset not qualified to
be a partition root anymore. Forced clearing of the partition flag,
though allowed, is not recommended and a warning will be printed.

With the addition of forced turning off of partition flag, any changes
made to the "cpuset.cpus" allowable on a non-partition root cpuset
can be made to a parition root with the exception that the implied
cpu_exclusive nature of a partition root will forbid adding cpus that
have been allocated to its siblings.

Waiman Long (9):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Add new v2 cpuset.sched.partition flag
  cpuset: Simulate auto-off of sched.partition at cgroup removal
  cpuset: Allow changes to cpus in a partition root
  cpuset: Make sure that partition flag work properly with CPU hotplug
  cpuset: Make generate_sched_domains() recognize reserved_cpus
  cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  cpuset: Don't rebuild sched domains if cpu changes in non-partition
    root
  cpuset: Support forced turning off of partition flag

 Documentation/admin-guide/cgroup-v2.rst | 174 ++++++++++++-
 kernel/cgroup/cpuset.c                  | 441 ++++++++++++++++++++++++++++++--
 2 files changed, 593 insertions(+), 22 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v12 1/9] cpuset: Enable cpuset controller in default hierarchy
  2018-08-27 14:41 [PATCH v12 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
@ 2018-08-27 14:41 ` Waiman Long
  2018-08-27 14:41 ` [PATCH v12 2/9] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2018-08-27 14:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hiearachy, it may make more sense
to move some of those additional cpuset features, if desired, to
memory controller or may be to the cpu controller instead of staying
with cpuset.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 109 ++++++++++++++++++++++++++++++--
 kernel/cgroup/cpuset.c                  |  48 +++++++++++++-
 2 files changed, 149 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 1746131..a09b7f7 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -56,11 +56,13 @@ v1 is available under Documentation/cgroup-v1/.
          5-3-3-2. IO Latency Interface Files
      5-4. PID
        5-4-1. PID Interface Files
-     5-5. Device
-     5-6. RDMA
-       5-6-1. RDMA Interface Files
-     5-7. Misc
-       5-7-1. perf_event
+     5-5. Cpuset
+       5.5-1. Cpuset Interface Files
+     5-6. Device
+     5-7. RDMA
+       5-7-1. RDMA Interface Files
+     5-8. Misc
+       5-8-1. perf_event
      5-N. Non-normative information
        5-N-1. CPU controller root cgroup process behaviour
        5-N-2. IO controller root cgroup process behaviour
@@ -1570,6 +1572,103 @@ through fork() or clone(). These will return -EAGAIN if the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the requested CPUs to be used by tasks within this
+	cgroup.  The actual list of CPUs to be granted, however, is
+	subjected to constraints imposed by its parent and can differ
+	from the requested CPUs.
+
+	The CPU numbers are comma-separated numbers or ranges.
+	For example:
+
+	  # cat cpuset.cpus
+	  0-4,6,8-10
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.cpus" or all the available CPUs if none is found.
+
+	The value of "cpuset.cpus" stays constant until the next update
+	and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined CPUs that are actually granted to this
+	cgroup by its parent.  These CPUs are allowed to be used by
+	tasks within the current cgroup.
+
+	If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
+	all the CPUs from the parent cgroup that can be available to
+	be used by this cgroup.  Otherwise, it should be a subset of
+	"cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
+	can be granted.  In this case, it will be treated just like an
+	empty "cpuset.cpus".
+
+	Its value will be affected by CPU hotplug events.
+
+  cpuset.mems
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the requested memory nodes to be used by tasks within
+	this cgroup.  The actual list of memory nodes granted, however,
+	is subjected to constraints imposed by its parent and can differ
+	from the requested memory nodes.
+
+	The memory node numbers are comma-separated numbers or ranges.
+	For example:
+
+	  # cat cpuset.mems
+	  0-1,3
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.mems" or all the available memory nodes if none
+	is found.
+
+	The value of "cpuset.mems" stays constant until the next update
+	and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined memory nodes that are actually granted to
+	this cgroup by its parent. These memory nodes are allowed to
+	be used by tasks within the current cgroup.
+
+	If "cpuset.mems" is empty, it shows all the memory nodes from the
+	parent cgroup that will be available to be used by this cgroup.
+	Otherwise, it should be a subset of "cpuset.mems" unless none of
+	the memory nodes listed in "cpuset.mems" can be granted.  In this
+	case, it will be treated just like an empty "cpuset.mems".
+
+	Its value will be affected by memory nodes hotplug events.
+
+
 Device controller
 -----------------
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 266f10c..2b5c447 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1824,12 +1824,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	return 0;
 }
 
-
 /*
  * for the common functions, 'private' gives the type of file
  */
 
-static struct cftype files[] = {
+static struct cftype legacy_files[] = {
 	{
 		.name = "cpus",
 		.seq_show = cpuset_common_seq_show,
@@ -1932,6 +1931,47 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 };
 
 /*
+ * This is currently a minimal set for the default hierarchy. It can be
+ * expanded later on by migrating more features and control files from v1.
+ */
+static struct cftype dfl_files[] = {
+	{
+		.name = "cpus",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * NR_CPUS),
+		.private = FILE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * MAX_NUMNODES),
+		.private = FILE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "cpus.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{ }	/* terminate */
+};
+
+
+/*
  *	cpuset_css_alloc - allocate a cpuset css
  *	cgrp:	control group that the new cpuset will be part of
  */
@@ -2105,8 +2145,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 	.post_attach	= cpuset_post_attach,
 	.bind		= cpuset_bind,
 	.fork		= cpuset_fork,
-	.legacy_cftypes	= files,
+	.legacy_cftypes	= legacy_files,
+	.dfl_cftypes	= dfl_files,
 	.early_init	= true,
+	.threaded	= true,
 };
 
 /**
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v12 2/9] cpuset: Add new v2 cpuset.sched.partition flag
  2018-08-27 14:41 [PATCH v12 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
  2018-08-27 14:41 ` [PATCH v12 1/9] " Waiman Long
@ 2018-08-27 14:41 ` Waiman Long
  2018-08-27 14:41 ` [PATCH v12 3/9] cpuset: Simulate auto-off of sched.partition at cgroup removal Waiman Long
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2018-08-27 14:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

A new cpuset.sched.partition boolean flag is added to cpuset v2.
This new flag, if set, indicates that the cgroup is the root of a
new scheduling domain or partition that includes itself and all its
descendants except those that are scheduling domain roots themselves
and their descendants.

With this new flag, one can directly create as many partitions as
necessary without ever using the v1 trick of turning off load balancing
in specific cpusets to create partitions as a side effect.

This new flag is owned by the parent and will cause the CPUs in the
cpuset to be removed from the effective CPUs of its parent.

This is implemented internally by adding a new reserved_cpus mask that
holds the CPUs belonging to child scheduling domain cpusets so that:

	reserved_cpus | effective_cpus = cpus_allowed
	reserved_cpus & effective_cpus = 0

This new flag can only be turned on in a cpuset if its parent is a
partition root itself. The state of this flag cannot be changed if the
cpuset has children.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  33 ++++++
 kernel/cgroup/cpuset.c                  | 184 +++++++++++++++++++++++++++++++-
 2 files changed, 214 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index a09b7f7..5e33d25 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1668,6 +1668,39 @@ Cpuset Interface Files
 
 	Its value will be affected by memory nodes hotplug events.
 
+  cpuset.sched.partition
+	A read-write single value file which exists on non-root
+	cpuset-enabled cgroups.  It is a binary value flag that accepts
+	either "0" (off) or "1" (on).  This flag is set and owned by the
+        parent cgroup.
+
+	If set, it indicates that the current cgroup is the root of a
+	new partition or scheduling domain that comprises itself and
+	all its descendants except those that are separate partition
+        roots themselves and their descendants.  The root cgroup is
+	always a partition root.
+
+	There are constraints on where this flag can be set.  It can
+	only be set in a cgroup if all the following conditions are true.
+
+	1) The "cpuset.cpus" is not empty and the list of CPUs are
+	   exclusive, i.e. they are not shared by any of its siblings.
+	2) The "cpuset.cpus" is also a proper subset of the parent's
+	   "cpuset.cpus.effective".
+	3) The parent cgroup is a partition root.
+	4) There is no child cgroups with cpuset enabled.  This is for
+	   eliminating corner cases that have to be handled if such a
+	   condition is allowed.
+
+	Setting this flag will take the CPUs away from the effective
+	CPUs of the parent cgroup.  That is why this flag has to be set
+	and owned by the parent.  Once it is set, this flag cannot be
+	cleared if there are any child cgroups with cpuset enabled.
+
+	A parent partition root cgroup cannot distribute all its CPUs to
+        its child partition root cgroups.  There must be at least one cpu
+        left in the parent partition root cgroup.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2b5c447..fdaa051 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -109,6 +109,9 @@ struct cpuset {
 	cpumask_var_t effective_cpus;
 	nodemask_t effective_mems;
 
+	/* CPUs reserved for child partitions */
+	cpumask_var_t reserved_cpus;
+
 	/*
 	 * This is old Memory Nodes tasks took on.
 	 *
@@ -134,6 +137,9 @@ struct cpuset {
 
 	/* for custom sched domain */
 	int relax_domain_level;
+
+	/* number of CPUs in reserved_cpus */
+	int nr_reserved;
 };
 
 static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
@@ -175,6 +181,7 @@ static inline bool task_has_mempolicy(struct task_struct *task)
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_PARTITION_ROOT,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -203,6 +210,11 @@ static inline int is_sched_load_balance(const struct cpuset *cs)
 	return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 }
 
+static inline int is_partition_root(const struct cpuset *cs)
+{
+	return test_bit(CS_PARTITION_ROOT, &cs->flags);
+}
+
 static inline int is_memory_migrate(const struct cpuset *cs)
 {
 	return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
@@ -220,7 +232,7 @@ static inline int is_spread_slab(const struct cpuset *cs)
 
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
-		  (1 << CS_MEM_EXCLUSIVE)),
+		  (1 << CS_MEM_EXCLUSIVE) | (1 << CS_PARTITION_ROOT)),
 };
 
 /**
@@ -881,6 +893,27 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	css_task_iter_end(&it);
 }
 
+/**
+ * compute_effective_cpumask - Compute the effective cpumask of the cpuset
+ * @new_cpus: the temp variable for the new effective_cpus mask
+ * @cs: the cpuset the need to recompute the new effective_cpus mask
+ * @parent: the parent cpuset
+ *
+ * If the parent has reserved CPUs, include them in the list of allowable
+ * CPUs in computing the new effective_cpus mask.
+ */
+static void compute_effective_cpumask(struct cpumask *new_cpus,
+				      struct cpuset *cs, struct cpuset *parent)
+{
+	if (parent->nr_reserved) {
+		cpumask_or(new_cpus, parent->effective_cpus,
+			   parent->reserved_cpus);
+		cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
+	} else {
+		cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
+	}
+}
+
 /*
  * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree
  * @cs: the cpuset to consider
@@ -903,7 +936,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
 
-		cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
+		compute_effective_cpumask(new_cpus, cp, parent);
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
@@ -949,6 +982,105 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 }
 
 /**
+ * update_reserved_cpumask - update the reserved_cpus mask of parent cpuset
+ * @cpuset:  The cpuset that requests CPU reservation
+ * @delmask: The old reserved cpumask to be removed from the parent
+ * @addmask: The new reserved cpumask to be added to the parent
+ * Return: 0 if successful, an error code otherwise
+ *
+ * Removing CPUs from a partition root may not be allowed by the
+ * invalidate_change() function as it will check to make sure that the set
+ * of CPUs in a cpuset is always a superset of those in its child cpusets,
+ * if preset.
+ *
+ * Adding CPUs to "cpuset.cpus" is generally allowed. However, if the
+ * addition causes the cpuset to exceed the capability offered by its
+ * parent, that addition will not be allowed.
+ *
+ * Because of the implicit cpu exclusive nature of a partition root,
+ * cpumask changes tht violates the cpu exclusivity rule will not be
+ * permitted.
+ *
+ * If the sched.partition flag changes, either the delmask (0=>1) or the
+ * addmask (1=>0) will be NULL.
+ *
+ * Called with cpuset_mutex held.
+ */
+static int update_reserved_cpumask(struct cpuset *cpuset,
+	struct cpumask *delmask, struct cpumask *addmask)
+{
+	int retval;
+	struct cpuset *parent = parent_cs(cpuset);
+	int old_count = parent->nr_reserved;
+
+	/*
+	 * The parent must be a partition root.
+	 * The new cpumask, if present, must not be empty.
+	 */
+	if (!is_partition_root(parent) ||
+	   (addmask && cpumask_empty(addmask)))
+		return -EINVAL;
+
+	/*
+	 * A sched.partition state change is not allowed if there are
+	 * online children.
+	 */
+	if (css_has_online_children(&cpuset->css))
+		return -EBUSY;
+
+	if (!old_count) {
+		if (!zalloc_cpumask_var(&parent->reserved_cpus, GFP_KERNEL)) {
+			retval = -ENOMEM;
+			goto out;
+		}
+		old_count = 1;
+	}
+
+	retval = -EBUSY;
+
+	/*
+	 * The cpus to be added must be a proper subset of the parent's
+	 * effective_cpus mask but not in the reserved_cpus mask.
+	 */
+	if (addmask) {
+		if (!cpumask_subset(addmask, parent->effective_cpus) ||
+		     cpumask_equal(addmask, parent->effective_cpus))
+			goto out;
+		if (parent->nr_reserved &&
+		    cpumask_intersects(parent->reserved_cpus, addmask))
+			goto out;
+	}
+
+	/*
+	 * Change the reserved CPU list.
+	 * Newly added reserved CPUs will be removed from effective_cpus
+	 * and newly deleted ones will be added back if they are online.
+	 */
+	spin_lock_irq(&callback_lock);
+	if (addmask) {
+		cpumask_or(parent->reserved_cpus,
+			   parent->reserved_cpus, addmask);
+		cpumask_andnot(parent->effective_cpus,
+			       parent->effective_cpus, addmask);
+	}
+	if (delmask) {
+		cpumask_andnot(parent->reserved_cpus,
+			       parent->reserved_cpus, delmask);
+		cpumask_or(parent->effective_cpus,
+			   parent->effective_cpus, delmask);
+	}
+
+	parent->nr_reserved = cpumask_weight(parent->reserved_cpus);
+	spin_unlock_irq(&callback_lock);
+	retval = 0;
+out:
+	if (old_count && !parent->nr_reserved)
+		free_cpumask_var(parent->reserved_cpus);
+
+	return retval;
+}
+
+/**
  * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
  * @cs: the cpuset to consider
  * @trialcs: trial cpuset
@@ -989,6 +1121,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
+	if (is_partition_root(cs))
+		return -EBUSY;
+
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
 	spin_unlock_irq(&callback_lock);
@@ -1317,6 +1452,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	struct cpuset *trialcs;
 	int balance_flag_changed;
 	int spread_flag_changed;
+	int partition_flag_changed;
 	int err;
 
 	trialcs = alloc_trial_cpuset(cs);
@@ -1328,6 +1464,18 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	else
 		clear_bit(bit, &trialcs->flags);
 
+	/*
+	 *  Turning on sched.partition flag (default hierarchy only) implies
+	 *  an implicit cpu_exclusive. Turning off sched.partition will clear
+	 *  the cpu_exclusive flag.
+	 */
+	if (bit == CS_PARTITION_ROOT) {
+		if (turning_on)
+			set_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+		else
+			clear_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+	}
+
 	err = validate_change(cs, trialcs);
 	if (err < 0)
 		goto out;
@@ -1338,11 +1486,27 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
+	partition_flag_changed = (is_partition_root(cs) !=
+				  is_partition_root(trialcs));
+
+	if (partition_flag_changed) {
+		err = turning_on
+		    ? update_reserved_cpumask(cs, NULL, cs->cpus_allowed)
+		    : update_reserved_cpumask(cs, cs->cpus_allowed, NULL);
+		if (err < 0)
+			goto out;
+		/*
+		 * At this point, the state has been changed.
+		 * So we can't back out with error anymore.
+		 */
+	}
+
 	spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
 	spin_unlock_irq(&callback_lock);
 
-	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
+	if (!cpumask_empty(trialcs->cpus_allowed) &&
+	   (balance_flag_changed || partition_flag_changed))
 		rebuild_sched_domains_locked();
 
 	if (spread_flag_changed)
@@ -1597,6 +1761,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	FILE_MEM_EXCLUSIVE,
 	FILE_MEM_HARDWALL,
 	FILE_SCHED_LOAD_BALANCE,
+	FILE_PARTITION_ROOT,
 	FILE_SCHED_RELAX_DOMAIN_LEVEL,
 	FILE_MEMORY_PRESSURE_ENABLED,
 	FILE_MEMORY_PRESSURE,
@@ -1630,6 +1795,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	case FILE_SCHED_LOAD_BALANCE:
 		retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val);
 		break;
+	case FILE_PARTITION_ROOT:
+		retval = update_flag(CS_PARTITION_ROOT, cs, val);
+		break;
 	case FILE_MEMORY_MIGRATE:
 		retval = update_flag(CS_MEMORY_MIGRATE, cs, val);
 		break;
@@ -1791,6 +1959,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 		return is_mem_hardwall(cs);
 	case FILE_SCHED_LOAD_BALANCE:
 		return is_sched_load_balance(cs);
+	case FILE_PARTITION_ROOT:
+		return is_partition_root(cs);
 	case FILE_MEMORY_MIGRATE:
 		return is_memory_migrate(cs);
 	case FILE_MEMORY_PRESSURE_ENABLED:
@@ -1967,6 +2137,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "sched.partition",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_PARTITION_ROOT,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
 	{ }	/* terminate */
 };
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v12 3/9] cpuset: Simulate auto-off of sched.partition at cgroup removal
  2018-08-27 14:41 [PATCH v12 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
  2018-08-27 14:41 ` [PATCH v12 1/9] " Waiman Long
  2018-08-27 14:41 ` [PATCH v12 2/9] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
@ 2018-08-27 14:41 ` Waiman Long
  2018-08-27 14:41 ` [PATCH v12 4/9] cpuset: Allow changes to cpus in a partition root Waiman Long
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2018-08-27 14:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Making a cgroup a partition root will reserve cpu resource at its parent.
So when a partition root cgroup is destroyed, we need to free the
reserved cpus at its parent. This is now done by doing an auto-off of
the sched.partition flag in the offlining phase when a partition root
cgroup is being removed.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index fdaa051..d9e821c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2253,7 +2253,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 /*
  * If the cpuset being removed has its flag 'sched_load_balance'
  * enabled, then simulate turning sched_load_balance off, which
- * will call rebuild_sched_domains_locked().
+ * will call rebuild_sched_domains_locked(). That is not needed
+ * in the default hierarchy where only changes in partition
+ * will cause repartitioning.
+ *
+ * If the cpuset has the 'sched.partition' flag enabled, simulate
+ * turning 'sched.partition" off.
  */
 
 static void cpuset_css_offline(struct cgroup_subsys_state *css)
@@ -2262,7 +2267,18 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 
 	mutex_lock(&cpuset_mutex);
 
-	if (is_sched_load_balance(cs))
+	/*
+	 * A WARN_ON_ONCE() check after calling update_flag() to make
+	 * sure that the operation succceeds without failure.
+	 */
+	if (is_partition_root(cs)) {
+		int ret = update_flag(CS_PARTITION_ROOT, cs, 0);
+
+		WARN_ON_ONCE(ret);
+	}
+
+	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+	    is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
 	cpuset_dec();
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v12 4/9] cpuset: Allow changes to cpus in a partition root
  2018-08-27 14:41 [PATCH v12 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (2 preceding siblings ...)
  2018-08-27 14:41 ` [PATCH v12 3/9] cpuset: Simulate auto-off of sched.partition at cgroup removal Waiman Long
@ 2018-08-27 14:41 ` Waiman Long
  2018-08-27 14:41 ` [PATCH v12 5/9] cpuset: Make sure that partition flag work properly with CPU hotplug Waiman Long
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2018-08-27 14:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

The previous patch introduces a new partition flag, but won't allow
changes made to "cpuset.cpus" once the flag is on. That may be too
restrictive in some use cases. So this restiction is now relaxed to
allow changes made to the "cpuset.cpus" file with some constraints:

 1) The new set of cpus must still be exclusive.
 2) Newly added cpus must be a subset of the parent effective_cpus.
 3) None of the deleted cpus can be one of those allocated to a child
    partition roots, if present.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  9 +++++
 kernel/cgroup/cpuset.c                  | 61 ++++++++++++++++++++++++++-------
 2 files changed, 58 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 5e33d25..a5fc335 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1701,6 +1701,15 @@ Cpuset Interface Files
         its child partition root cgroups.  There must be at least one cpu
         left in the parent partition root cgroup.
 
+	In a partition root, changes to "cpuset.cpus" is allowed as long
+	as the first condition above as well as the following two
+	additional conditions are true.
+
+	1) Any added CPUs must be a proper subset of the parent's
+	   "cpuset.cpus.effective".
+	2) No CPU that has been distributed to child partition roots is
+	   is deleted.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d9e821c..56e0931 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -957,6 +957,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 
 		spin_lock_irq(&callback_lock);
 		cpumask_copy(cp->effective_cpus, new_cpus);
+		if (cp->nr_reserved)
+			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
+				       cp->reserved_cpus);
 		spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
@@ -984,8 +987,8 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 /**
  * update_reserved_cpumask - update the reserved_cpus mask of parent cpuset
  * @cpuset:  The cpuset that requests CPU reservation
- * @delmask: The old reserved cpumask to be removed from the parent
- * @addmask: The new reserved cpumask to be added to the parent
+ * @oldmask: The old reserved cpumask to be removed from the parent
+ * @newmask: The new reserved cpumask to be added to the parent
  * Return: 0 if successful, an error code otherwise
  *
  * Removing CPUs from a partition root may not be allowed by the
@@ -1001,15 +1004,17 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
  * cpumask changes tht violates the cpu exclusivity rule will not be
  * permitted.
  *
- * If the sched.partition flag changes, either the delmask (0=>1) or the
- * addmask (1=>0) will be NULL.
+ * If the sched.partition flag changes, either the oldmask (0=>1) or the
+ * newmask (1=>0) will be NULL.
  *
  * Called with cpuset_mutex held.
  */
 static int update_reserved_cpumask(struct cpuset *cpuset,
-	struct cpumask *delmask, struct cpumask *addmask)
+	struct cpumask *oldmask, struct cpumask *newmask)
 {
 	int retval;
+	int adding, deleting;
+	cpumask_var_t addmask, delmask;
 	struct cpuset *parent = parent_cs(cpuset);
 	int old_count = parent->nr_reserved;
 
@@ -1018,16 +1023,23 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	 * The new cpumask, if present, must not be empty.
 	 */
 	if (!is_partition_root(parent) ||
-	   (addmask && cpumask_empty(addmask)))
+	   (newmask && cpumask_empty(newmask)))
 		return -EINVAL;
 
 	/*
 	 * A sched.partition state change is not allowed if there are
 	 * online children.
 	 */
-	if (css_has_online_children(&cpuset->css))
+	if ((!oldmask || !newmask) && css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
+	if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
+		return -ENOMEM;
+	if (!zalloc_cpumask_var(&delmask, GFP_KERNEL)) {
+		free_cpumask_var(addmask);
+		return -ENOMEM;
+	}
+
 	if (!old_count) {
 		if (!zalloc_cpumask_var(&parent->reserved_cpus, GFP_KERNEL)) {
 			retval = -ENOMEM;
@@ -1037,12 +1049,29 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	}
 
 	retval = -EBUSY;
+	adding = deleting = false;
+	/*
+	 * addmask = newmask & ~oldmask
+	 * delmask = oldmask & ~newmask
+	 */
+	if (oldmask && newmask) {
+		adding   = cpumask_andnot(addmask, newmask, oldmask);
+		deleting = cpumask_andnot(delmask, oldmask, newmask);
+		if (!adding && !deleting)
+			goto out_ok;
+	} else if (newmask) {
+		adding = true;
+		cpumask_copy(addmask, newmask);
+	} else if (oldmask) {
+		deleting = true;
+		cpumask_copy(delmask, oldmask);
+	}
 
 	/*
 	 * The cpus to be added must be a proper subset of the parent's
 	 * effective_cpus mask but not in the reserved_cpus mask.
 	 */
-	if (addmask) {
+	if (adding) {
 		if (!cpumask_subset(addmask, parent->effective_cpus) ||
 		     cpumask_equal(addmask, parent->effective_cpus))
 			goto out;
@@ -1057,13 +1086,13 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	 * and newly deleted ones will be added back if they are online.
 	 */
 	spin_lock_irq(&callback_lock);
-	if (addmask) {
+	if (adding) {
 		cpumask_or(parent->reserved_cpus,
 			   parent->reserved_cpus, addmask);
 		cpumask_andnot(parent->effective_cpus,
 			       parent->effective_cpus, addmask);
 	}
-	if (delmask) {
+	if (deleting) {
 		cpumask_andnot(parent->reserved_cpus,
 			       parent->reserved_cpus, delmask);
 		cpumask_or(parent->effective_cpus,
@@ -1072,8 +1101,12 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 
 	parent->nr_reserved = cpumask_weight(parent->reserved_cpus);
 	spin_unlock_irq(&callback_lock);
+
+out_ok:
 	retval = 0;
 out:
+	free_cpumask_var(addmask);
+	free_cpumask_var(delmask);
 	if (old_count && !parent->nr_reserved)
 		free_cpumask_var(parent->reserved_cpus);
 
@@ -1121,8 +1154,12 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
-	if (is_partition_root(cs))
-		return -EBUSY;
+	if (is_partition_root(cs)) {
+		retval = update_reserved_cpumask(cs, cs->cpus_allowed,
+						 trialcs->cpus_allowed);
+		if (retval < 0)
+			return retval;
+	}
 
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v12 5/9] cpuset: Make sure that partition flag work properly with CPU hotplug
  2018-08-27 14:41 [PATCH v12 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (3 preceding siblings ...)
  2018-08-27 14:41 ` [PATCH v12 4/9] cpuset: Allow changes to cpus in a partition root Waiman Long
@ 2018-08-27 14:41 ` Waiman Long
  2018-08-27 14:41 ` [PATCH v12 6/9] cpuset: Make generate_sched_domains() recognize reserved_cpus Waiman Long
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2018-08-27 14:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

When there is a cpu hotplug event (CPU online or offline), the partitions
may need to be reconfigured and regenerated. So code is added to
the hotplug functions to make them work with new reserved_cpus mask to
compute the right effective_cpus for each of the affected cpusets.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++
 kernel/cgroup/cpuset.c                  | 26 ++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index a5fc335..234fb18 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1710,6 +1710,16 @@ Cpuset Interface Files
 	2) No CPU that has been distributed to child partition roots is
 	   is deleted.
 
+	When all the CPUs allocated to a partition are offlined, the
+	partition will be temporaily gone and all the tasks in it will
+	be migrated to another one that belongs to the parent of the
+	partition root.  This is a destructive operation and all the
+	existing CPU affinity that is narrower than the cpuset itself
+	will be lost.
+
+	When any of those offlined CPUs is onlined again, a new partition
+	will be re-created and the tasks will be migrated back.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 56e0931..5b1e64d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -900,7 +900,8 @@ static void update_tasks_cpumask(struct cpuset *cs)
  * @parent: the parent cpuset
  *
  * If the parent has reserved CPUs, include them in the list of allowable
- * CPUs in computing the new effective_cpus mask.
+ * CPUs in computing the new effective_cpus mask. The cpu_active_mask is
+ * used to mask off cpus that are to be offlined.
  */
 static void compute_effective_cpumask(struct cpumask *new_cpus,
 				      struct cpuset *cs, struct cpuset *parent)
@@ -909,6 +910,7 @@ static void compute_effective_cpumask(struct cpumask *new_cpus,
 		cpumask_or(new_cpus, parent->effective_cpus,
 			   parent->reserved_cpus);
 		cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
+		cpumask_and(new_cpus, new_cpus, cpu_active_mask);
 	} else {
 		cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
 	}
@@ -2528,9 +2530,17 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
 		goto retry;
 	}
 
-	cpumask_and(&new_cpus, cs->cpus_allowed, parent_cs(cs)->effective_cpus);
+	compute_effective_cpumask(&new_cpus, cs, parent_cs(cs));
 	nodes_and(new_mems, cs->mems_allowed, parent_cs(cs)->effective_mems);
 
+	if (cs->nr_reserved) {
+		/*
+		 * Some of the CPUs may have been distributed to child
+		 * partition roots. So we need skip those when computing the
+		 * real effective cpus.
+		 */
+		cpumask_andnot(&new_cpus, &new_cpus, cs->reserved_cpus);
+	}
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
 	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
 
@@ -2580,6 +2590,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	cpumask_copy(&new_cpus, cpu_active_mask);
 	new_mems = node_states[N_MEMORY];
 
+	/*
+	 * If reserved_cpus is populated, it is likely that the check below
+	 * will produce a false positive on cpus_updated when the cpu list
+	 * isn't changed. It is extra work, but it is better to be safe.
+	 */
 	cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
 	mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
 
@@ -2588,6 +2603,13 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+		/*
+		 * Make sure that the reserved cpus aren't in the
+		 * effective cpus.
+		 */
+		if (top_cpuset.nr_reserved)
+			cpumask_andnot(&new_cpus, &new_cpus,
+					top_cpuset.reserved_cpus);
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
 		spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v12 6/9] cpuset: Make generate_sched_domains() recognize reserved_cpus
  2018-08-27 14:41 [PATCH v12 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (4 preceding siblings ...)
  2018-08-27 14:41 ` [PATCH v12 5/9] cpuset: Make sure that partition flag work properly with CPU hotplug Waiman Long
@ 2018-08-27 14:41 ` Waiman Long
  2018-08-27 14:41 ` [PATCH v12 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2018-08-27 14:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

The generate_sched_domains() function is modified to make it work
correctly with the newly introduced reserved_cpus mask for schedule
domains generation.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5b1e64d..286b851 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	int ndoms = 0;		/* number of sched domains in result */
 	int nslot;		/* next empty doms[] struct cpumask slot */
 	struct cgroup_subsys_state *pos_css;
+	bool root_load_balance = is_sched_load_balance(&top_cpuset);
 
 	doms = NULL;
 	dattr = NULL;
 	csa = NULL;
 
 	/* Special case for the 99% of systems with one, full, sched domain */
-	if (is_sched_load_balance(&top_cpuset)) {
+	if (root_load_balance && !top_cpuset.nr_reserved) {
 		ndoms = 1;
 		doms = alloc_sched_domains(ndoms);
 		if (!doms)
@@ -701,6 +702,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	csn = 0;
 
 	rcu_read_lock();
+	if (root_load_balance)
+		csa[csn++] = &top_cpuset;
 	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
 		if (cp == &top_cpuset)
 			continue;
@@ -711,6 +714,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		 * parent's cpus, so just skip them, and then we call
 		 * update_domain_attr_tree() to calc relax_domain_level of
 		 * the corresponding sched domain.
+		 *
+		 * If root is load-balancing, we can skip @cp if it
+		 * is a subset of the root's effective_cpus.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    !(is_sched_load_balance(cp) &&
@@ -718,11 +724,16 @@ static int generate_sched_domains(cpumask_var_t **domains,
 					 housekeeping_cpumask(HK_FLAG_DOMAIN))))
 			continue;
 
+		if (root_load_balance &&
+		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
+			continue;
+
 		if (is_sched_load_balance(cp))
 			csa[csn++] = cp;
 
-		/* skip @cp's subtree */
-		pos_css = css_rightmost_descendant(pos_css);
+		/* skip @cp's subtree if not a partition root */
+		if (!is_partition_root(cp))
+			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
 
@@ -850,7 +861,12 @@ static void rebuild_sched_domains_locked(void)
 	 * passing doms with offlined cpu to partition_sched_domains().
 	 * Anyways, hotplug work item will rebuild sched domains.
 	 */
-	if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+	if (!top_cpuset.nr_reserved &&
+	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+		goto out;
+
+	if (top_cpuset.nr_reserved &&
+	   !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
 		goto out;
 
 	/* Generate domain masks and attrs */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v12 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-08-27 14:41 [PATCH v12 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (5 preceding siblings ...)
  2018-08-27 14:41 ` [PATCH v12 6/9] cpuset: Make generate_sched_domains() recognize reserved_cpus Waiman Long
@ 2018-08-27 14:41 ` Waiman Long
  2018-08-27 14:41 ` [PATCH v12 8/9] cpuset: Don't rebuild sched domains if cpu changes in non-partition root Waiman Long
  2018-08-27 14:41 ` [PATCH v12 9/9] cpuset: Support forced turning off of partition flag Waiman Long
  8 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2018-08-27 14:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Because of the fact that setting the "cpuset.sched.partition" in
a direct child of root can remove CPUs from the root's effective CPU
list, it makes sense to know what CPUs are left in the root cgroup for
scheduling purpose. So the "cpuset.cpus.effective" control file is now
exposed in the v2 cgroup root.

For consistency, the "cpuset.mems.effective" control file is exposed
as well.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 4 ++--
 kernel/cgroup/cpuset.c                  | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 234fb18..655e54e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1613,7 +1613,7 @@ Cpuset Interface Files
 	and won't be affected by any CPU hotplug events.
 
   cpuset.cpus.effective
-	A read-only multiple values file which exists on non-root
+	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
 
 	It lists the onlined CPUs that are actually granted to this
@@ -1653,7 +1653,7 @@ Cpuset Interface Files
 	and won't be affected by any memory nodes hotplug events.
 
   cpuset.mems.effective
-	A read-only multiple values file which exists on non-root
+	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
 
 	It lists the onlined memory nodes that are actually granted to
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 286b851..0e41b62 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2182,14 +2182,12 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.name = "cpus.effective",
 		.seq_show = cpuset_common_seq_show,
 		.private = FILE_EFFECTIVE_CPULIST,
-		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
 	{
 		.name = "mems.effective",
 		.seq_show = cpuset_common_seq_show,
 		.private = FILE_EFFECTIVE_MEMLIST,
-		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
 	{
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v12 8/9] cpuset: Don't rebuild sched domains if cpu changes in non-partition root
  2018-08-27 14:41 [PATCH v12 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (6 preceding siblings ...)
  2018-08-27 14:41 ` [PATCH v12 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
@ 2018-08-27 14:41 ` Waiman Long
  2018-08-27 14:41 ` [PATCH v12 9/9] cpuset: Support forced turning off of partition flag Waiman Long
  8 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2018-08-27 14:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

With the cpuset v1, any changes to the list of allowable CPUs in a cpuset
may cause changes in the sched domain configuration depending on the
load balancing states and the cpu lists of its parent and its children.

With cpuset v2 (on default hierarchy), there are more restrictions
on how the load balancing state of a cpuset can change. As a result,
only changes made in a partition root will cause possible changes to
the corresponding sched domain. CPU changes to any of the non-partition
root cpusets will not cause changes in the sched domain configuration.
As a result, we don't need to call rebuild_sched_domains_locked()
for changes in a non-partition root cpuset saving precious cpu cycles.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0e41b62..d8970b4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -986,11 +986,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 		update_tasks_cpumask(cp);
 
 		/*
-		 * If the effective cpumask of any non-empty cpuset is changed,
-		 * we need to rebuild sched domains.
+		 * On legacy hierarchy, if the effective cpumask of any non-
+		 * empty cpuset is changed, we need to rebuild sched domains.
+		 * On default hierarchy, the cpuset needs to be a partition
+		 * root as well.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
-		    is_sched_load_balance(cp))
+		    is_sched_load_balance(cp) &&
+		   (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
+		    is_partition_root(cp)))
 			need_rebuild_sched_domains = true;
 
 		rcu_read_lock();
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v12 9/9] cpuset: Support forced turning off of partition flag
  2018-08-27 14:41 [PATCH v12 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (7 preceding siblings ...)
  2018-08-27 14:41 ` [PATCH v12 8/9] cpuset: Don't rebuild sched domains if cpu changes in non-partition root Waiman Long
@ 2018-08-27 14:41 ` Waiman Long
  2018-08-27 16:40   ` Tejun Heo
  8 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2018-08-27 14:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Cpuset allows arbitrary modification of cpu list in "cpuset.cpus"
even if the requested CPUs won't be granted in "cpuset.cpus.effective"
as restricted by its parent. However, the validate_change() function
will inhibit removal of CPUs that have been used in child cpusets.

Being a partition root, however, limits the kind of cpu list
modification that is allowed. Adding CPUs is not allowed if the new
CPUs are not in the parent's effective cpu list that can be put into
"cpuset.cpus.reserved". In addition, a child partition cannot exhaust
all the parent's effective CPUs.

Because of the implicit cpu exclusive nature of the partition root,
cpu changes that break that cpu exclusivity will not be allowed. Other
changes that break the conditions of being a partition root is generally
allowed. However, the partition flag of the cpuset as well those of
the descendant partitions will be forcefully turned off.

Removing CPUs from a partition root is generally allowed as long as
there is at least one CPU left with no conflicts with child cpusets,
if present. If all the CPUs are removed, the partition flag will be
forced off as well.

The partition flag clearing code is being extracted out from
update_reserved_cpumask() into a new clear_partition_flag() function
so that the new code can be called recursively in the case of forced
turning off with minimal stack footprint.

Sched domains have to be rebuilt whenever a forced turning off of the
partition flag happens.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  45 +++++++----
 kernel/cgroup/cpuset.c                  | 136 +++++++++++++++++++++++++++-----
 2 files changed, 144 insertions(+), 37 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 655e54e..1f63ccf 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1695,27 +1695,40 @@ Cpuset Interface Files
 	Setting this flag will take the CPUs away from the effective
 	CPUs of the parent cgroup.  That is why this flag has to be set
 	and owned by the parent.  Once it is set, this flag cannot be
-	cleared if there are any child cgroups with cpuset enabled.
+	explicitly cleared if there are any child cgroups with cpuset
+	enabled.
 
 	A parent partition root cgroup cannot distribute all its CPUs to
         its child partition root cgroups.  There must be at least one cpu
         left in the parent partition root cgroup.
 
-	In a partition root, changes to "cpuset.cpus" is allowed as long
-	as the first condition above as well as the following two
-	additional conditions are true.
-
-	1) Any added CPUs must be a proper subset of the parent's
-	   "cpuset.cpus.effective".
-	2) No CPU that has been distributed to child partition roots is
-	   is deleted.
-
-	When all the CPUs allocated to a partition are offlined, the
-	partition will be temporaily gone and all the tasks in it will
-	be migrated to another one that belongs to the parent of the
-	partition root.  This is a destructive operation and all the
-	existing CPU affinity that is narrower than the cpuset itself
-	will be lost.
+	In a partition root, removing CPUs from "cpuset.cpus" is allowed
+	as long as none of the removed CPUs are used by any of the
+	child cpusets, if defined.  However, if the CPU removal cause
+	its effective CPU list to become empty, the kernel will have
+	no choice but to forcefully turn off the partition flag of the
+	current cpuset as well as any descendant partitions underneath it.
+	This is a destructive operation and the partition states will
+	not be restored even when the CPUs are added back later on.
+
+	Adding CPUs to "cpuset.cpus" of a partition root is generally
+	allowed.  Because of the cpu exclusivity nature of a partition
+	root, CPU changes that break the cpu exclusivity will not be
+	permitted.  For other CPU changes that break either one of the
+	first three conditions of being a partition root listed above,
+	it will cause the same forced turning off of the partition flag
+	as discussed before.
+
+	The act of forcefully clearing the partition flag by making
+	changes to "cpuset.cpus" is generally not recommended.	A warning
+	message will be printed when that happens.
+
+	CPU offlining is handled differently as it won't cause a forced
+	turning off the partition flag. When all the CPUs allocated to
+	a partition are offlined, the partition will be temporaily gone
+	and all the tasks in it will be migrated to the parent partition.
+	This is a destructive operation and all the existing CPU affinity
+	that is narrower than the cpuset itself will be lost.
 
 	When any of those offlined CPUs is onlined again, a new partition
 	will be re-created and the tasks will be migrated back.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d8970b4..5f2e942 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -313,6 +313,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
 static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
 
 /*
+ * Sched domains need to be rebuilt after forced off of partition flag.
+ */
+static bool force_rebuild_sched_domains;
+
+/*
  * Cgroup v2 behavior is used when on default hierarchy or the
  * cgroup_v2_mode flag is set.
  */
@@ -1002,8 +1007,77 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 	}
 	rcu_read_unlock();
 
-	if (need_rebuild_sched_domains)
+	if (need_rebuild_sched_domains || force_rebuild_sched_domains) {
 		rebuild_sched_domains_locked();
+		force_rebuild_sched_domains = false;
+	}
+}
+
+/**
+ * clear_partition_flag - Clear the partition flag of cpuset
+ * @cpuset: The cpuset to be cleared
+ * @forced: The forced turning off flag
+ * Return: 0 if successful, an error code otherwise
+ *
+ * Handles the turning off the sched.partition flag either when explicitly
+ * cleared by user or implicitly turned off by removing CPUs.
+ *
+ * Setting of partition flag is handled by update_reserved_cpumask().
+ * Called with cpuset_mutex held.
+ */
+static int clear_partition_flag(struct cpuset *cpuset, bool forced)
+{
+	struct cpuset *parent = parent_cs(cpuset);
+
+	WARN_ON_ONCE(!is_partition_root(cpuset));
+
+	/*
+	 * Normal partition flag clearing isn't allowed if sub-partition
+	 * is present.
+	 */
+	if (!forced && cpuset->nr_reserved)
+		return -EBUSY;
+
+	if (forced && cpuset->nr_reserved) {
+		struct cpuset *child;
+		struct cgroup_subsys_state *pos_css;
+
+		/*
+		 * Recursively call clear_partition_flag() if necessary.
+		 */
+		rcu_read_lock();
+		cpuset_for_each_child(child, pos_css, cpuset) {
+			if (is_partition_root(child))
+				clear_partition_flag(child, true);
+		}
+		rcu_read_unlock();
+		WARN_ON_ONCE(cpuset->nr_reserved);
+	}
+
+	if (forced) {
+		/* Forced clearing isn't recommended */
+		pr_warn("cpuset: sched.partition flag of ");
+		pr_cont_cgroup_name(cpuset->css.cgroup);
+		pr_cont(" is turned off!\n");
+		clear_bit(CS_PARTITION_ROOT, &cpuset->flags);
+		clear_bit(CS_CPU_EXCLUSIVE, &cpuset->flags);
+		force_rebuild_sched_domains = true;
+	}
+
+	/*
+	 * Remove cpus_allowed of current cpuset from parent's reserved_cpus.
+	 */
+	spin_lock_irq(&callback_lock);
+	cpumask_andnot(parent->reserved_cpus,
+		       parent->reserved_cpus, cpuset->cpus_allowed);
+	cpumask_or(parent->effective_cpus,
+		   parent->effective_cpus, cpuset->effective_cpus);
+	parent->nr_reserved = cpumask_weight(parent->reserved_cpus);
+	spin_unlock_irq(&callback_lock);
+
+	if (!parent->nr_reserved)
+		free_cpumask_var(parent->reserved_cpus);
+	return 0;
 }
 
 /**
@@ -1019,15 +1093,17 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
  * if preset.
  *
  * Adding CPUs to "cpuset.cpus" is generally allowed. However, if the
- * addition causes the cpuset to exceed the capability offered by its
- * parent, that addition will not be allowed.
+ * addition or removal causes the cpuset to exceed the capability offered
+ * by its parent or its constraint of being a partition root, the cpu
+ * list change will cause a forced turning-off of the partition flag.
  *
  * Because of the implicit cpu exclusive nature of a partition root,
- * cpumask changes tht violates the cpu exclusivity rule will not be
- * permitted.
+ * cpumask changes that violates the cpu exclusivity rule will not be
+ * permitted. One will have to turn off the partition flag before
+ * making the CPU changes.
  *
- * If the sched.partition flag changes, either the oldmask (0=>1) or the
- * newmask (1=>0) will be NULL.
+ * If the sched.partition flag is being set, the oldmask will be NULL.
+ * The newmask will never be NULL.
  *
  * Called with cpuset_mutex held.
  */
@@ -1042,17 +1118,25 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 
 	/*
 	 * The parent must be a partition root.
-	 * The new cpumask, if present, must not be empty.
 	 */
-	if (!is_partition_root(parent) ||
-	   (newmask && cpumask_empty(newmask)))
+	if (!is_partition_root(parent))
 		return -EINVAL;
 
 	/*
-	 * A sched.partition state change is not allowed if there are
+	 * If the newmask is empty or it is the same as the reserved_cpus,
+	 * we will have to turn off the partition flag.
+	 */
+	if (cpumask_empty(newmask) || (cpuset->nr_reserved &&
+	    cpumask_equal(newmask, cpuset->reserved_cpus))) {
+		clear_partition_flag(cpuset, true);
+		return 0;
+	}
+
+	/*
+	 * Turning on sched.partition is not allowed if there are
 	 * online children.
 	 */
-	if ((!oldmask || !newmask) && css_has_online_children(&cpuset->css))
+	if (!oldmask && css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
 	if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
@@ -1076,30 +1160,40 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	 * addmask = newmask & ~oldmask
 	 * delmask = oldmask & ~newmask
 	 */
-	if (oldmask && newmask) {
+	if (oldmask) {
 		adding   = cpumask_andnot(addmask, newmask, oldmask);
 		deleting = cpumask_andnot(delmask, oldmask, newmask);
 		if (!adding && !deleting)
 			goto out_ok;
-	} else if (newmask) {
+	} else {
 		adding = true;
 		cpumask_copy(addmask, newmask);
-	} else if (oldmask) {
-		deleting = true;
-		cpumask_copy(delmask, oldmask);
 	}
 
 	/*
-	 * The cpus to be added must be a proper subset of the parent's
+	 * The cpus to be added should be a proper subset of the parent's
 	 * effective_cpus mask but not in the reserved_cpus mask.
 	 */
 	if (adding) {
+		bool error = false;
+
 		if (!cpumask_subset(addmask, parent->effective_cpus) ||
 		     cpumask_equal(addmask, parent->effective_cpus))
-			goto out;
+			error = true;
 		if (parent->nr_reserved &&
 		    cpumask_intersects(parent->reserved_cpus, addmask))
-			goto out;
+			error = true;
+		/*
+		 * Error condition isn't allowed when turning on the flag.
+		 * An error condition when changing cpu list will cause
+		 * a forced turning-off of the partition flag.
+		 */
+		if (error) {
+			if (!oldmask)
+				goto out;
+			clear_partition_flag(cpuset, true);
+			return 0;
+		}
 	}
 
 	/*
@@ -1551,7 +1645,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	if (partition_flag_changed) {
 		err = turning_on
 		    ? update_reserved_cpumask(cs, NULL, cs->cpus_allowed)
-		    : update_reserved_cpumask(cs, cs->cpus_allowed, NULL);
+		    : clear_partition_flag(cs, false);
 		if (err < 0)
 			goto out;
 		/*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v12 9/9] cpuset: Support forced turning off of partition flag
  2018-08-27 14:41 ` [PATCH v12 9/9] cpuset: Support forced turning off of partition flag Waiman Long
@ 2018-08-27 16:40   ` Tejun Heo
  2018-08-27 17:50     ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2018-08-27 16:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello, Waiman.

On Mon, Aug 27, 2018 at 10:41:24AM -0400, Waiman Long wrote:
> Cpuset allows arbitrary modification of cpu list in "cpuset.cpus"
> even if the requested CPUs won't be granted in "cpuset.cpus.effective"
> as restricted by its parent. However, the validate_change() function
> will inhibit removal of CPUs that have been used in child cpusets.
> 
> Being a partition root, however, limits the kind of cpu list
> modification that is allowed. Adding CPUs is not allowed if the new
> CPUs are not in the parent's effective cpu list that can be put into
> "cpuset.cpus.reserved". In addition, a child partition cannot exhaust
> all the parent's effective CPUs.
> 
> Because of the implicit cpu exclusive nature of the partition root,
> cpu changes that break that cpu exclusivity will not be allowed. Other
> changes that break the conditions of being a partition root is generally
> allowed. However, the partition flag of the cpuset as well those of
> the descendant partitions will be forcefully turned off.

First of all, thanks a lot for your persistence.  I'm not necessarily
against the flag being forced off but wonder whether the same
config/effective approach can be used as in .cpus and .mems.  Can you
elaborate a bit on the choice here?

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v12 9/9] cpuset: Support forced turning off of partition flag
  2018-08-27 16:40   ` Tejun Heo
@ 2018-08-27 17:50     ` Waiman Long
  2018-09-06 21:20       ` Waiman Long
  2018-10-02 20:06       ` Tejun Heo
  0 siblings, 2 replies; 16+ messages in thread
From: Waiman Long @ 2018-08-27 17:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 08/27/2018 12:40 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, Aug 27, 2018 at 10:41:24AM -0400, Waiman Long wrote:
>> Cpuset allows arbitrary modification of cpu list in "cpuset.cpus"
>> even if the requested CPUs won't be granted in "cpuset.cpus.effective"
>> as restricted by its parent. However, the validate_change() function
>> will inhibit removal of CPUs that have been used in child cpusets.
>>
>> Being a partition root, however, limits the kind of cpu list
>> modification that is allowed. Adding CPUs is not allowed if the new
>> CPUs are not in the parent's effective cpu list that can be put into
>> "cpuset.cpus.reserved". In addition, a child partition cannot exhaust
>> all the parent's effective CPUs.
>>
>> Because of the implicit cpu exclusive nature of the partition root,
>> cpu changes that break that cpu exclusivity will not be allowed. Other
>> changes that break the conditions of being a partition root is generally
>> allowed. However, the partition flag of the cpuset as well those of
>> the descendant partitions will be forcefully turned off.
> First of all, thanks a lot for your persistence.  I'm not necessarily
> against the flag being forced off but wonder whether the same
> config/effective approach can be used as in .cpus and .mems.  Can you
> elaborate a bit on the choice here?
>
> Thank you.

My current code has explicitly assumed the following relationship for
partition root.

    cpus_allowed = effective_cpus + reserved_cpus

Also effective_cpus cannot be empty. Specifically, cpus_allowed has to
be equal to effective_cpus before a cpuset can be made a partition root.

Any changes that break the above conditions will turn off the partition
flag forcefully. The only exception is cpu offlining where cpus_allowed
> effective_cpus + reserved_cpus can happen.

One reason for doing so is because reserved_cpus is hidden. So the main
way to infer that is to do cpus_allowed - effective_cpus.

It is probably doable to make cpus_allowed >= effective_cpus +
reserved_cpus in general, but we may need to expose reserved_cpus as a
read-only file, for instance. There may also be other complications that
we will need to take care of if this is supported. My current preference
is to not doing that unless there is compelling reason to do so.

Cheers,
Longman




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v12 9/9] cpuset: Support forced turning off of partition flag
  2018-08-27 17:50     ` Waiman Long
@ 2018-09-06 21:20       ` Waiman Long
  2018-09-24 15:47         ` Waiman Long
  2018-10-02 20:06       ` Tejun Heo
  1 sibling, 1 reply; 16+ messages in thread
From: Waiman Long @ 2018-09-06 21:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 08/27/2018 01:50 PM, Waiman Long wrote:
> On 08/27/2018 12:40 PM, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Mon, Aug 27, 2018 at 10:41:24AM -0400, Waiman Long wrote:
>>> Cpuset allows arbitrary modification of cpu list in "cpuset.cpus"
>>> even if the requested CPUs won't be granted in "cpuset.cpus.effective"
>>> as restricted by its parent. However, the validate_change() function
>>> will inhibit removal of CPUs that have been used in child cpusets.
>>>
>>> Being a partition root, however, limits the kind of cpu list
>>> modification that is allowed. Adding CPUs is not allowed if the new
>>> CPUs are not in the parent's effective cpu list that can be put into
>>> "cpuset.cpus.reserved". In addition, a child partition cannot exhaust
>>> all the parent's effective CPUs.
>>>
>>> Because of the implicit cpu exclusive nature of the partition root,
>>> cpu changes that break that cpu exclusivity will not be allowed. Other
>>> changes that break the conditions of being a partition root is generally
>>> allowed. However, the partition flag of the cpuset as well those of
>>> the descendant partitions will be forcefully turned off.
>> First of all, thanks a lot for your persistence.  I'm not necessarily
>> against the flag being forced off but wonder whether the same
>> config/effective approach can be used as in .cpus and .mems.  Can you
>> elaborate a bit on the choice here?
>>
>> Thank you.
> My current code has explicitly assumed the following relationship for
> partition root.
>
>     cpus_allowed = effective_cpus + reserved_cpus
>
> Also effective_cpus cannot be empty. Specifically, cpus_allowed has to
> be equal to effective_cpus before a cpuset can be made a partition root.
>
> Any changes that break the above conditions will turn off the partition
> flag forcefully. The only exception is cpu offlining where cpus_allowed
>> effective_cpus + reserved_cpus can happen.
> One reason for doing so is because reserved_cpus is hidden. So the main
> way to infer that is to do cpus_allowed - effective_cpus.
>
> It is probably doable to make cpus_allowed >= effective_cpus +
> reserved_cpus in general, but we may need to expose reserved_cpus as a
> read-only file, for instance. There may also be other complications that
> we will need to take care of if this is supported. My current preference
> is to not doing that unless there is compelling reason to do so.
>
> Cheers,
> Longman

For the current patch, one of conditions to allow users to turn on the
partition flag is cpus_allowed = effective_cpus. Once the partition flag
is on, the twos have to be the same too except with cpu offlining. This
patch is added to allow arbitrary modification to the cpuset.cpus file
except those that have already been forbidden in the existing code. When
the modification violates the rules of being a partition root, the flag
will be turned off or we will have to deny the modification request.

I am admitting that it is an easy way out in term of programming effort
as I am not sure how much effort is needed to enable arbitrary
modification to cpuset.cpus without force disabling and whether such an
effort is even worthwhile to do.

We can certainly relax the current restrictions in the future.  Any
suggestion to improve the patchset is welcome. I would really like to
see this merged in 4.20 (or maybe 5.0) kernel.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v12 9/9] cpuset: Support forced turning off of partition flag
  2018-09-06 21:20       ` Waiman Long
@ 2018-09-24 15:47         ` Waiman Long
  0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2018-09-24 15:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 09/06/2018 05:20 PM, Waiman Long wrote:
> On 08/27/2018 01:50 PM, Waiman Long wrote:
>> On 08/27/2018 12:40 PM, Tejun Heo wrote:
>>> Hello, Waiman.
>>>
>>> On Mon, Aug 27, 2018 at 10:41:24AM -0400, Waiman Long wrote:
>>>> Cpuset allows arbitrary modification of cpu list in "cpuset.cpus"
>>>> even if the requested CPUs won't be granted in "cpuset.cpus.effective"
>>>> as restricted by its parent. However, the validate_change() function
>>>> will inhibit removal of CPUs that have been used in child cpusets.
>>>>
>>>> Being a partition root, however, limits the kind of cpu list
>>>> modification that is allowed. Adding CPUs is not allowed if the new
>>>> CPUs are not in the parent's effective cpu list that can be put into
>>>> "cpuset.cpus.reserved". In addition, a child partition cannot exhaust
>>>> all the parent's effective CPUs.
>>>>
>>>> Because of the implicit cpu exclusive nature of the partition root,
>>>> cpu changes that break that cpu exclusivity will not be allowed. Other
>>>> changes that break the conditions of being a partition root is generally
>>>> allowed. However, the partition flag of the cpuset as well those of
>>>> the descendant partitions will be forcefully turned off.
>>> First of all, thanks a lot for your persistence.  I'm not necessarily
>>> against the flag being forced off but wonder whether the same
>>> config/effective approach can be used as in .cpus and .mems.  Can you
>>> elaborate a bit on the choice here?
>>>
>>> Thank you.
>> My current code has explicitly assumed the following relationship for
>> partition root.
>>
>>     cpus_allowed = effective_cpus + reserved_cpus
>>
>> Also effective_cpus cannot be empty. Specifically, cpus_allowed has to
>> be equal to effective_cpus before a cpuset can be made a partition root.
>>
>> Any changes that break the above conditions will turn off the partition
>> flag forcefully. The only exception is cpu offlining where cpus_allowed
>>> effective_cpus + reserved_cpus can happen.
>> One reason for doing so is because reserved_cpus is hidden. So the main
>> way to infer that is to do cpus_allowed - effective_cpus.
>>
>> It is probably doable to make cpus_allowed >= effective_cpus +
>> reserved_cpus in general, but we may need to expose reserved_cpus as a
>> read-only file, for instance. There may also be other complications that
>> we will need to take care of if this is supported. My current preference
>> is to not doing that unless there is compelling reason to do so.
>>
>> Cheers,
>> Longman
> For the current patch, one of conditions to allow users to turn on the
> partition flag is cpus_allowed = effective_cpus. Once the partition flag
> is on, the twos have to be the same too except with cpu offlining. This
> patch is added to allow arbitrary modification to the cpuset.cpus file
> except those that have already been forbidden in the existing code. When
> the modification violates the rules of being a partition root, the flag
> will be turned off or we will have to deny the modification request.
>
> I am admitting that it is an easy way out in term of programming effort
> as I am not sure how much effort is needed to enable arbitrary
> modification to cpuset.cpus without force disabling and whether such an
> effort is even worthwhile to do.
>
> We can certainly relax the current restrictions in the future.  Any
> suggestion to improve the patchset is welcome. I would really like to
> see this merged in 4.20 (or maybe 5.0) kernel.
>
> Cheers,
> Longman
>
Tejun,

Sorry for bothering you again.

I would like to know your assessment of this patchset as to whether it
is ready or not.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v12 9/9] cpuset: Support forced turning off of partition flag
  2018-08-27 17:50     ` Waiman Long
  2018-09-06 21:20       ` Waiman Long
@ 2018-10-02 20:06       ` Tejun Heo
  2018-10-02 20:44         ` Waiman Long
  1 sibling, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2018-10-02 20:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello, Waiman.

My apologies for the delay.

On Mon, Aug 27, 2018 at 01:50:18PM -0400, Waiman Long wrote:
> My current code has explicitly assumed the following relationship for
> partition root.
> 
>     cpus_allowed = effective_cpus + reserved_cpus
> 
> Also effective_cpus cannot be empty. Specifically, cpus_allowed has to
> be equal to effective_cpus before a cpuset can be made a partition root.
> 
> Any changes that break the above conditions will turn off the partition
> flag forcefully. The only exception is cpu offlining where cpus_allowed
> > effective_cpus + reserved_cpus can happen.
> 
> One reason for doing so is because reserved_cpus is hidden. So the main
> way to infer that is to do cpus_allowed - effective_cpus.
> 
> It is probably doable to make cpus_allowed >= effective_cpus +
> reserved_cpus in general, but we may need to expose reserved_cpus as a
> read-only file, for instance. There may also be other complications that
> we will need to take care of if this is supported. My current preference
> is to not doing that unless there is compelling reason to do so.

So, if we're gonna make this hierarchical, I think it probably would
be better to go in all the way.  It's kinda weird to mix the two
approaches - the normal cpuset operation following the usual
convention (it'd be really great to fix the removal part too) and
parition code doing something else.

I think adding another interface file should be fine here.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v12 9/9] cpuset: Support forced turning off of partition flag
  2018-10-02 20:06       ` Tejun Heo
@ 2018-10-02 20:44         ` Waiman Long
  0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2018-10-02 20:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 10/02/2018 04:06 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> My apologies for the delay.
>
> On Mon, Aug 27, 2018 at 01:50:18PM -0400, Waiman Long wrote:
>> My current code has explicitly assumed the following relationship for
>> partition root.
>>
>>     cpus_allowed = effective_cpus + reserved_cpus
>>
>> Also effective_cpus cannot be empty. Specifically, cpus_allowed has to
>> be equal to effective_cpus before a cpuset can be made a partition root.
>>
>> Any changes that break the above conditions will turn off the partition
>> flag forcefully. The only exception is cpu offlining where cpus_allowed
>>> effective_cpus + reserved_cpus can happen.
>> One reason for doing so is because reserved_cpus is hidden. So the main
>> way to infer that is to do cpus_allowed - effective_cpus.
>>
>> It is probably doable to make cpus_allowed >= effective_cpus +
>> reserved_cpus in general, but we may need to expose reserved_cpus as a
>> read-only file, for instance. There may also be other complications that
>> we will need to take care of if this is supported. My current preference
>> is to not doing that unless there is compelling reason to do so.
> So, if we're gonna make this hierarchical, I think it probably would
> be better to go in all the way.  It's kinda weird to mix the two
> approaches - the normal cpuset operation following the usual
> convention (it'd be really great to fix the removal part too) and
> parition code doing something else.
>
> I think adding another interface file should be fine here.
>
> Thanks.
>
OK, I will revise the patch to make it work without the removal part.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-10-02 20:44 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-27 14:41 [PATCH v12 0/9] cpuset: Enable cpuset controller in default hierarchy Waiman Long
2018-08-27 14:41 ` [PATCH v12 1/9] " Waiman Long
2018-08-27 14:41 ` [PATCH v12 2/9] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
2018-08-27 14:41 ` [PATCH v12 3/9] cpuset: Simulate auto-off of sched.partition at cgroup removal Waiman Long
2018-08-27 14:41 ` [PATCH v12 4/9] cpuset: Allow changes to cpus in a partition root Waiman Long
2018-08-27 14:41 ` [PATCH v12 5/9] cpuset: Make sure that partition flag work properly with CPU hotplug Waiman Long
2018-08-27 14:41 ` [PATCH v12 6/9] cpuset: Make generate_sched_domains() recognize reserved_cpus Waiman Long
2018-08-27 14:41 ` [PATCH v12 7/9] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
2018-08-27 14:41 ` [PATCH v12 8/9] cpuset: Don't rebuild sched domains if cpu changes in non-partition root Waiman Long
2018-08-27 14:41 ` [PATCH v12 9/9] cpuset: Support forced turning off of partition flag Waiman Long
2018-08-27 16:40   ` Tejun Heo
2018-08-27 17:50     ` Waiman Long
2018-09-06 21:20       ` Waiman Long
2018-09-24 15:47         ` Waiman Long
2018-10-02 20:06       ` Tejun Heo
2018-10-02 20:44         ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).