linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v14 00/12] Enable cpuset controller in default hierarchy
@ 2018-10-15 20:29 Waiman Long
  2018-10-15 20:29 ` [PATCH v14 01/12] cpuset: " Waiman Long
                   ` (12 more replies)
  0 siblings, 13 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

v14:
 - Fix a bug about cpumask handling in patch 4 by using
   CONFIG_CPUMASK_OFFSTACK #ifdef block.
 - Add patch 12 to show descriptive name when reading
   cpuset.sched.partition (suggested by Tejun).
 - Change the function prototype of free_cpumasks() to match that of
   alloc_cpumasks() (suggested by Tom Hromatka).

v13:
 - A major rewrite of the partition code so that there will be no
   auto-turning off anymore. Instead, the partition root can enter into
   an error state that can be restored back to a partition root later on.
 - Patches 1 and 9 are the same as previous version, the rests are
   either new or substantially revised.

v12:
 - Take out the debugging patch to print partitions.
 - Add a patch to force turning off partition flag if newly modified CPU
   list doesn't meet the requirement of being a partition root.
 - Remove some unneeded checking code in update_reserved_cpumask().

v11:
 - Change the "domain_root" name to "partition" as suggested by Peter and
   update the documentation and code accordingly.
 - Remove the dying cgroup check in update_reserved_cpus() as the check
   may not be needed after all.
 - Document the effect of losing CPU affinity after offling all the cpus
   in a partition.
 - There is no other major code changes in this version.

v11 patch: https://lkml.org/lkml/2018/6/24/30
v12 patch: https://lkml.org/lkml/2018/8/27/423
v13 patch: https://lkml.org/lkml/2018/10/12/861

The purpose of this patchset is to provide a basic set of cpuset control
files for cgroup v2. This basic set includes the non-root "cpus",
"mems" and "sched.partition". The "cpus.effective" and "mems.effective"
will appear in all cpuset-enabled cgroups.

The new control file that is unique to v2 is "sched.partition". It is
a tristate flag file that designates if a cgroup is the root of a new
scheduling domain or partition with its own set of unique list of CPUs
from scheduling perspective disjointed from other partitions. An user
can write only "1" or "0" into this file to turn on and off partition root.
Depending on circumstances, a partition root may become erroneous and has
a flag value of -1. However, if condition becomes favorable again, it can
be changed back to a partition root automatically.

The root cgroup is always a partition root. Multiple levels of partitions
are supported with some limitations. So a container partition root can
behave like a real root.

When a partition root cgroup is removed, its list of exclusive CPUs
will be returned back to the parent's cpus.effective automatically.

A container root can be a partition root with sub-partitions
created underneath it. One difference from the real root is that the
"cpuset.sched.partition" flag isn't present in the real root, but is
present in a container root. This is also true for other cpuset control
files as well as those from the other controllers. This is a general
issue that is not going to be addressed here in this patchset.

This patchset does not exclude the possibility of adding more features
in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their effective
counterparts.

Patch 2 defines new data structures to support partitioning.

Patch 3 simplifies the allocation and freeing of cpumasks in the cpuset
code and prepares for use by subsequent patches.

Patch 4 adds a new "sched.partition" control file for setting up multiple
scheduling domains or partitions. A partition root implies cpu_exclusive.

Patch 5 makes new "sched.partition" file to have a new error value of -1
which indicates that the partition root enters into an erroneous state
where some of the constraints of a partition root (like cpu_exclusive)
will still hold but it is not a real partition root anymore. This allows
the cpuset to change back to a partition root later on automatically
if the conditions become favorable again.

Patch 6 adds tracking of the number of cpusets that use the parent's
effective_cpus in order to make sure that those cpusets will be properly
updated if their parents effective cpus changes because of changes in
sibling partitions.

Patch 7 makes the hotplug code deal with partition root properly.

Patch 8 updates the scheduling domain genaration code to work with
the new partition feature.

Patch 9 exposes cpus.effective and mems.effective to the root cgroup
as enabling child partitions will take CPUs away from the root cgroup.
So it will be nice to monitor what CPUs are left there.

Patch 10 updates the cgroup v2 documentation file with information
about the new "sched.partition" file.

Patch 11 adds a new read-only "cpus.subpartitions" file that list the
CPUs in the subparts_cpus mask in the cpuset data structure when the
command line option "cgroup_debug" is specified. This is mostly used
for debugging and verification purposes.

Patch 12 changes the output of reading cpuset.sched.partition from
integer to a descriptive text similar to what cgroup.type is doing.

A test script with various cpuset configurations was run on both regular
and debug kernels with this patchset applied to verify that the cpusets
behaved appropriate without unexpected error.

Waiman Long (12):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Define data structures to support scheduling partition
  cpuset: Simply allocation and freeing of cpumasks
  cpuset: Add new v2 cpuset.sched.partition flag
  cpuset: Add an error state to cpuset.sched.partition
  cpuset: Track cpusets that use parent's effective_cpus
  cpuset: Make CPU hotplug work with partition
  cpuset: Make generate_sched_domains() work with partition
  cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  cpuset: Add documentation about the new "cpuset.sched.partition" flag
  cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug
  cpuset: Show descriptive text when reading cpuset.sched.partition

 Documentation/admin-guide/cgroup-v2.rst | 175 ++++-
 include/linux/cgroup-defs.h             |   1 +
 kernel/cgroup/cgroup-internal.h         |   2 +
 kernel/cgroup/cgroup.c                  |  14 +-
 kernel/cgroup/cpuset.c                  | 909 ++++++++++++++++++++++--
 kernel/cgroup/debug.c                   |   4 +-
 6 files changed, 1030 insertions(+), 75 deletions(-)

-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 01/12] cpuset: Enable cpuset controller in default hierarchy
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-10-15 20:29 ` [PATCH v14 02/12] cpuset: Define data structures to support scheduling partition Waiman Long
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hiearachy, it may make more sense
to move some of those additional cpuset features, if desired, to
memory controller or may be to the cpu controller instead of staying
with cpuset.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 109 ++++++++++++++++++++++--
 kernel/cgroup/cpuset.c                  |  48 ++++++++++-
 2 files changed, 149 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 184193bcb262..0465c231e2fb 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -56,11 +56,13 @@ v1 is available under Documentation/cgroup-v1/.
          5-3-3-2. IO Latency Interface Files
      5-4. PID
        5-4-1. PID Interface Files
-     5-5. Device
-     5-6. RDMA
-       5-6-1. RDMA Interface Files
-     5-7. Misc
-       5-7-1. perf_event
+     5-5. Cpuset
+       5.5-1. Cpuset Interface Files
+     5-6. Device
+     5-7. RDMA
+       5-7-1. RDMA Interface Files
+     5-8. Misc
+       5-8-1. perf_event
      5-N. Non-normative information
        5-N-1. CPU controller root cgroup process behaviour
        5-N-2. IO controller root cgroup process behaviour
@@ -1588,6 +1590,103 @@ through fork() or clone(). These will return -EAGAIN if the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the requested CPUs to be used by tasks within this
+	cgroup.  The actual list of CPUs to be granted, however, is
+	subjected to constraints imposed by its parent and can differ
+	from the requested CPUs.
+
+	The CPU numbers are comma-separated numbers or ranges.
+	For example:
+
+	  # cat cpuset.cpus
+	  0-4,6,8-10
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.cpus" or all the available CPUs if none is found.
+
+	The value of "cpuset.cpus" stays constant until the next update
+	and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined CPUs that are actually granted to this
+	cgroup by its parent.  These CPUs are allowed to be used by
+	tasks within the current cgroup.
+
+	If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
+	all the CPUs from the parent cgroup that can be available to
+	be used by this cgroup.  Otherwise, it should be a subset of
+	"cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
+	can be granted.  In this case, it will be treated just like an
+	empty "cpuset.cpus".
+
+	Its value will be affected by CPU hotplug events.
+
+  cpuset.mems
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the requested memory nodes to be used by tasks within
+	this cgroup.  The actual list of memory nodes granted, however,
+	is subjected to constraints imposed by its parent and can differ
+	from the requested memory nodes.
+
+	The memory node numbers are comma-separated numbers or ranges.
+	For example:
+
+	  # cat cpuset.mems
+	  0-1,3
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.mems" or all the available memory nodes if none
+	is found.
+
+	The value of "cpuset.mems" stays constant until the next update
+	and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined memory nodes that are actually granted to
+	this cgroup by its parent. These memory nodes are allowed to
+	be used by tasks within the current cgroup.
+
+	If "cpuset.mems" is empty, it shows all the memory nodes from the
+	parent cgroup that will be available to be used by this cgroup.
+	Otherwise, it should be a subset of "cpuset.mems" unless none of
+	the memory nodes listed in "cpuset.mems" can be granted.  In this
+	case, it will be treated just like an empty "cpuset.mems".
+
+	Its value will be affected by memory nodes hotplug events.
+
+
 Device controller
 -----------------
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 266f10cb7222..2b5c4477d969 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1824,12 +1824,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	return 0;
 }
 
-
 /*
  * for the common functions, 'private' gives the type of file
  */
 
-static struct cftype files[] = {
+static struct cftype legacy_files[] = {
 	{
 		.name = "cpus",
 		.seq_show = cpuset_common_seq_show,
@@ -1931,6 +1930,47 @@ static struct cftype files[] = {
 	{ }	/* terminate */
 };
 
+/*
+ * This is currently a minimal set for the default hierarchy. It can be
+ * expanded later on by migrating more features and control files from v1.
+ */
+static struct cftype dfl_files[] = {
+	{
+		.name = "cpus",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * NR_CPUS),
+		.private = FILE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * MAX_NUMNODES),
+		.private = FILE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "cpus.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{ }	/* terminate */
+};
+
+
 /*
  *	cpuset_css_alloc - allocate a cpuset css
  *	cgrp:	control group that the new cpuset will be part of
@@ -2105,8 +2145,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 	.post_attach	= cpuset_post_attach,
 	.bind		= cpuset_bind,
 	.fork		= cpuset_fork,
-	.legacy_cftypes	= files,
+	.legacy_cftypes	= legacy_files,
+	.dfl_cftypes	= dfl_files,
 	.early_init	= true,
+	.threaded	= true,
 };
 
 /**
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 02/12] cpuset: Define data structures to support scheduling partition
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
  2018-10-15 20:29 ` [PATCH v14 01/12] cpuset: " Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-10-15 20:29 ` [PATCH v14 03/12] cpuset: Simply allocation and freeing of cpumasks Waiman Long
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

From a cpuset point of view, a scheduling partition is a group of
cpusets with their own set of exclusive CPUs that are not shared by
other tasks outside the scheduling partition.

In the legacy hierarchy, scheduling partitions are supported indirectly
via the right use of the load balancing and the exclusive CPUs flag
which is not intuitive and can be hard to use.

To fully support the concept of scheduling partitions in the default
hierarchy, we need to add some new field into the cpuset structure as
well as a new tmpmasks structure that is used to pre-allocate cpumasks
at the top level cpuset functions to avoid memory allocation in inner
functions as memory allocation failure in those inner functions may
cause a cpuset to have inconsistent states.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2b5c4477d969..29a2bdc671fd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -109,6 +109,13 @@ struct cpuset {
 	cpumask_var_t effective_cpus;
 	nodemask_t effective_mems;
 
+	/*
+	 * CPUs allocated to child sub-partitions (default hierarchy only)
+	 * - CPUs granted by the parent = effective_cpus U subparts_cpus
+	 * - effective_cpus and subparts_cpus are mutually exclusive.
+	 */
+	cpumask_var_t subparts_cpus;
+
 	/*
 	 * This is old Memory Nodes tasks took on.
 	 *
@@ -134,6 +141,30 @@ struct cpuset {
 
 	/* for custom sched domain */
 	int relax_domain_level;
+
+	/* number of CPUs in subparts_cpus */
+	int nr_subparts_cpus;
+
+	/* partition root state */
+	int partition_root_state;
+};
+
+/*
+ * Partition root states:
+ *
+ *   0 - not a partition root
+ *   1 - partition root
+ */
+#define PRS_DISABLED		0
+#define PRS_ENABLED		1
+
+/*
+ * Temporary cpumasks for working with partitions that are passed among
+ * functions to avoid memory allocation in inner functions.
+ */
+struct tmpmasks {
+	cpumask_var_t addmask, delmask;	/* For partition root */
+	cpumask_var_t new_cpus;		/* For update_cpumasks_hier() */
 };
 
 static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
@@ -218,9 +249,15 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_partition_root(const struct cpuset *cs)
+{
+	return cs->partition_root_state;
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
 		  (1 << CS_MEM_EXCLUSIVE)),
+	.partition_root_state = PRS_ENABLED,
 };
 
 /**
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 03/12] cpuset: Simply allocation and freeing of cpumasks
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
  2018-10-15 20:29 ` [PATCH v14 01/12] cpuset: " Waiman Long
  2018-10-15 20:29 ` [PATCH v14 02/12] cpuset: Define data structures to support scheduling partition Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-10-19 15:28   ` Tom Hromatka
  2018-10-15 20:29 ` [PATCH v14 04/12] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

The previous commit introduces a new subparts_cpus mask into the cpuset
data structure and a new tmpmasks structure.  Managing the allocation
and freeing of those cpumasks is becoming more complex.

So a number of helper functions are added to simplify and streamline
the management of those cpumasks. To make it simple, all the cpumasks
are now pre-cleared on allocation.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 110 ++++++++++++++++++++++++++++-------------
 1 file changed, 77 insertions(+), 33 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 29a2bdc671fd..2ac9437ce8f1 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -455,6 +455,65 @@ static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
 		is_mem_exclusive(p) <= is_mem_exclusive(q);
 }
 
+/**
+ * alloc_cpumasks - allocate three cpumasks for cpuset
+ * @cs:  the cpuset that have cpumasks to be allocated.
+ * @tmp: the tmpmasks structure pointer
+ * Return: 0 if successful, -ENOMEM otherwise.
+ *
+ * Only one of the two input arguments should be non-NULL.
+ */
+static inline int alloc_cpumasks(struct cpuset *cs, struct tmpmasks *tmp)
+{
+	cpumask_var_t *pmask1, *pmask2, *pmask3;
+
+	if (cs) {
+		pmask1 = &cs->cpus_allowed;
+		pmask2 = &cs->effective_cpus;
+		pmask3 = &cs->subparts_cpus;
+	} else {
+		pmask1 = &tmp->new_cpus;
+		pmask2 = &tmp->addmask;
+		pmask3 = &tmp->delmask;
+	}
+
+	if (!zalloc_cpumask_var(pmask1, GFP_KERNEL))
+		return -ENOMEM;
+
+	if (!zalloc_cpumask_var(pmask2, GFP_KERNEL))
+		goto free_one;
+
+	if (!zalloc_cpumask_var(pmask3, GFP_KERNEL))
+		goto free_two;
+
+	return 0;
+
+free_two:
+	free_cpumask_var(*pmask2);
+free_one:
+	free_cpumask_var(*pmask1);
+	return -ENOMEM;
+}
+
+/**
+ * free_cpumasks - free cpumasks in a tmpmasks structure
+ * @cs:  the cpuset that have cpumasks to be free.
+ * @tmp: the tmpmasks structure pointer
+ */
+static inline void free_cpumasks(struct cpuset *cs, struct tmpmasks *tmp)
+{
+	if (cs) {
+		free_cpumask_var(cs->cpus_allowed);
+		free_cpumask_var(cs->effective_cpus);
+		free_cpumask_var(cs->subparts_cpus);
+	}
+	if (tmp) {
+		free_cpumask_var(tmp->new_cpus);
+		free_cpumask_var(tmp->addmask);
+		free_cpumask_var(tmp->delmask);
+	}
+}
+
 /**
  * alloc_trial_cpuset - allocate a trial cpuset
  * @cs: the cpuset that the trial cpuset duplicates
@@ -467,31 +526,24 @@ static struct cpuset *alloc_trial_cpuset(struct cpuset *cs)
 	if (!trial)
 		return NULL;
 
-	if (!alloc_cpumask_var(&trial->cpus_allowed, GFP_KERNEL))
-		goto free_cs;
-	if (!alloc_cpumask_var(&trial->effective_cpus, GFP_KERNEL))
-		goto free_cpus;
+	if (alloc_cpumasks(trial, NULL)) {
+		kfree(trial);
+		return NULL;
+	}
 
 	cpumask_copy(trial->cpus_allowed, cs->cpus_allowed);
 	cpumask_copy(trial->effective_cpus, cs->effective_cpus);
 	return trial;
-
-free_cpus:
-	free_cpumask_var(trial->cpus_allowed);
-free_cs:
-	kfree(trial);
-	return NULL;
 }
 
 /**
- * free_trial_cpuset - free the trial cpuset
- * @trial: the trial cpuset to be freed
+ * free_cpuset - free the cpuset
+ * @cs: the cpuset to be freed
  */
-static void free_trial_cpuset(struct cpuset *trial)
+static inline void free_cpuset(struct cpuset *cs)
 {
-	free_cpumask_var(trial->effective_cpus);
-	free_cpumask_var(trial->cpus_allowed);
-	kfree(trial);
+	free_cpumasks(cs, NULL);
+	kfree(cs);
 }
 
 /*
@@ -1385,7 +1437,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	if (spread_flag_changed)
 		update_tasks_flags(cs);
 out:
-	free_trial_cpuset(trialcs);
+	free_cpuset(trialcs);
 	return err;
 }
 
@@ -1769,7 +1821,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 		break;
 	}
 
-	free_trial_cpuset(trialcs);
+	free_cpuset(trialcs);
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	kernfs_unbreak_active_protection(of->kn);
@@ -2024,26 +2076,19 @@ cpuset_css_alloc(struct cgroup_subsys_state *parent_css)
 	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
 	if (!cs)
 		return ERR_PTR(-ENOMEM);
-	if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
-		goto free_cs;
-	if (!alloc_cpumask_var(&cs->effective_cpus, GFP_KERNEL))
-		goto free_cpus;
+
+	if (alloc_cpumasks(cs, NULL)) {
+		kfree(cs);
+		return ERR_PTR(-ENOMEM);
+	}
 
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
-	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
-	cpumask_clear(cs->effective_cpus);
 	nodes_clear(cs->effective_mems);
 	fmeter_init(&cs->fmeter);
 	cs->relax_domain_level = -1;
 
 	return &cs->css;
-
-free_cpus:
-	free_cpumask_var(cs->cpus_allowed);
-free_cs:
-	kfree(cs);
-	return ERR_PTR(-ENOMEM);
 }
 
 static int cpuset_css_online(struct cgroup_subsys_state *css)
@@ -2134,9 +2179,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);
 
-	free_cpumask_var(cs->effective_cpus);
-	free_cpumask_var(cs->cpus_allowed);
-	kfree(cs);
+	free_cpuset(cs);
 }
 
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
@@ -2200,6 +2243,7 @@ int __init cpuset_init(void)
 
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL));
 
 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 04/12] cpuset: Add new v2 cpuset.sched.partition flag
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
                   ` (2 preceding siblings ...)
  2018-10-15 20:29 ` [PATCH v14 03/12] cpuset: Simply allocation and freeing of cpumasks Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-11-06 11:35   ` Peter Zijlstra
  2018-10-15 20:29 ` [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition Waiman Long
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

A new cpuset.sched.partition boolean flag is added to cpuset v2.
This new flag, if set, indicates that the cgroup is the root of a
new scheduling domain or partition that includes itself and all its
descendants except those that are scheduling domain roots themselves
and their descendants.

With this new flag, one can directly create as many partitions as
necessary without ever using the v1 trick of turning off load balancing
in specific cpusets to create partitions as a side effect.

This new flag is owned by the parent and will cause the CPUs in the
cpuset to be removed from the effective CPUs of its parent.

This is implemented internally by adding a new subparts_cpus mask that
holds the CPUs belonging to child partitions so that:

        subparts_cpus | effective_cpus = cpus_allowed
        subparts_cpus & effective_cpus = 0

This new flag can only be turned on in a cpuset if its parent is a
partition root itself. The state of this flag cannot be changed if the
cpuset has children.

Once turned on, further changes to "cpuset.cpus" is allowed as long
as there is at least one CPU left that can be granted from the parent
and a child partition root cannot use up all the CPUs in the parent's
effective_cpus.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 365 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 352 insertions(+), 13 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2ac9437ce8f1..a648a2ae851c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -970,10 +970,191 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	css_task_iter_end(&it);
 }
 
+/**
+ * compute_effective_cpumask - Compute the effective cpumask of the cpuset
+ * @new_cpus: the temp variable for the new effective_cpus mask
+ * @cs: the cpuset the need to recompute the new effective_cpus mask
+ * @parent: the parent cpuset
+ *
+ * If the parent has subpartition CPUs, include them in the list of
+ * allowable CPUs in computing the new effective_cpus mask.
+ */
+static void compute_effective_cpumask(struct cpumask *new_cpus,
+				      struct cpuset *cs, struct cpuset *parent)
+{
+	if (parent->nr_subparts_cpus) {
+		cpumask_or(new_cpus, parent->effective_cpus,
+			   parent->subparts_cpus);
+		cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
+	} else {
+		cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
+	}
+}
+
+/*
+ * Commands for update_parent_subparts_cpumask
+ */
+enum subparts_cmd {
+	partcmd_enable,		/* Enable partition root	 */
+	partcmd_disable,	/* Disable partition root	 */
+	partcmd_update,		/* Update parent's subparts_cpus */
+};
+
+/**
+ * update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset
+ * @cpuset:  The cpuset that requests change in partition root state
+ * @cmd:     Partition root state change command
+ * @newmask: Optional new cpumask for partcmd_update
+ * @tmp:     Temporary addmask and delmask
+ * Return:   0, 1 or an error code
+ *
+ * For partcmd_enable, the cpuset is being transformed from a non-partition
+ * root to a partition root. The cpus_allowed mask of the given cpuset will
+ * be put into parent's subparts_cpus and taken away from parent's
+ * effective_cpus. The function will return 0 if all the CPUs listed in
+ * cpus_allowed can be granted or an error code will be returned.
+ *
+ * For partcmd_disable, the cpuset is being transofrmed from a partition
+ * root back to a non-partition root. any CPUs in cpus_allowed that are in
+ * parent's subparts_cpus will be taken away from that cpumask and put back
+ * into parent's effective_cpus. 0 should always be returned.
+ *
+ * For partcmd_update, if the optional newmask is specified, the cpu
+ * list is to be changed from cpus_allowed to newmask. Otherwise,
+ * cpus_allowed is assumed to remain the same.  The function will return
+ * 1 if changes to parent's subparts_cpus and effective_cpus happen or 0
+ * otherwise. In case of error, an error code will be returned.
+ *
+ * The partcmd_enable and partcmd_disable commands are used by
+ * update_prstate(). The partcmd_update command is used by
+ * update_cpumasks_hier() with newmask NULL and update_cpumask() with
+ * newmask set.
+ *
+ * The checking is more strict when enabling partition root than the
+ * other two commands.
+ *
+ * Because of the implicit cpu exclusive nature of a partition root,
+ * cpumask changes that violates the cpu exclusivity rule will not be
+ * permitted when checked by validate_change(). The validate_change()
+ * function will also prevent any changes to the cpu list if it is not
+ * a superset of children's cpu lists.
+ *
+ * Called with cpuset_mutex held.
+ */
+static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
+					  struct cpumask *newmask,
+					  struct tmpmasks *tmp)
+{
+	struct cpuset *parent = parent_cs(cpuset);
+	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
+	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
+
+	/*
+	 * The parent must be a partition root.
+	 * The new cpumask, if present, or the current cpus_allowed must
+	 * not be empty.
+	 */
+	if (!is_partition_root(parent) ||
+	   (newmask && cpumask_empty(newmask)) ||
+	   (!newmask && cpumask_empty(cpuset->cpus_allowed)))
+		return -EINVAL;
+
+	/*
+	 * Enabling/disabling partition root is not allowed if there are
+	 * online children.
+	 */
+	if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
+		return -EBUSY;
+
+	/*
+	 * Enabling partition root is not allowed if not all the CPUs
+	 * can be granted from parent's effective_cpus or at least one
+	 * CPU will be left after that.
+	 */
+	if ((cmd == partcmd_enable) &&
+	   (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
+	     cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
+		return -EINVAL;
+
+	/*
+	 * A cpumask update cannot make parent's effective_cpus become empty.
+	 */
+	adding = deleting = false;
+	if (cmd == partcmd_enable) {
+		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
+		adding = true;
+	} else if (cmd == partcmd_disable) {
+		deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
+				       parent->subparts_cpus);
+	} else if (newmask) {
+		/*
+		 * partcmd_update with newmask:
+		 *
+		 * delmask = cpus_allowed & ~newmask & parent->subparts_cpus
+		 * addmask = newmask & parent->effective_cpus
+		 *		     & ~parent->subparts_cpus
+		 */
+		cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask);
+		deleting = cpumask_and(tmp->delmask, tmp->delmask,
+				       parent->subparts_cpus);
+
+		cpumask_and(tmp->addmask, newmask, parent->effective_cpus);
+		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
+					parent->subparts_cpus);
+		/*
+		 * Return error if the new effective_cpus could become empty.
+		 */
+		if (adding && !deleting &&
+		    cpumask_equal(parent->effective_cpus, tmp->addmask))
+			return -EINVAL;
+	} else {
+		/*
+		 * partcmd_update w/o newmask:
+		 *
+		 * addmask = cpus_allowed & parent->effectiveb_cpus
+		 *
+		 * Note that parent's subparts_cpus may have been
+		 * pre-shrunk in case the CPUs granted to the parent
+		 * by the grandparent changes. So no deletion is needed.
+		 */
+		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
+				     parent->effective_cpus);
+		if (cpumask_equal(tmp->addmask, parent->effective_cpus))
+			return -EINVAL;
+	}
+
+	if (!adding && !deleting)
+		return 0;
+
+	/*
+	 * Change the parent's subparts_cpus.
+	 * Newly added CPUs will be removed from effective_cpus and
+	 * newly deleted ones will be added back to effective_cpus.
+	 */
+	spin_lock_irq(&callback_lock);
+	if (adding) {
+		cpumask_or(parent->subparts_cpus,
+			   parent->subparts_cpus, tmp->addmask);
+		cpumask_andnot(parent->effective_cpus,
+			       parent->effective_cpus, tmp->addmask);
+	}
+	if (deleting) {
+		cpumask_andnot(parent->subparts_cpus,
+			       parent->subparts_cpus, tmp->delmask);
+		cpumask_or(parent->effective_cpus,
+			   parent->effective_cpus, tmp->delmask);
+	}
+
+	parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus);
+	spin_unlock_irq(&callback_lock);
+
+	return cmd == partcmd_update;
+}
+
 /*
  * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree
- * @cs: the cpuset to consider
- * @new_cpus: temp variable for calculating new effective_cpus
+ * @cs:  the cpuset to consider
+ * @tmp: temp variables for calculating effective_cpus & partition setup
  *
  * When congifured cpumask is changed, the effective cpumasks of this cpuset
  * and all its descendants need to be updated.
@@ -982,7 +1163,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
  *
  * Called with cpuset_mutex held
  */
-static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
+static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 {
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
@@ -991,28 +1172,61 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 	rcu_read_lock();
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
+		bool cs_empty;
 
-		cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
+		compute_effective_cpumask(tmp->new_cpus, cp, parent);
+		cs_empty = cpumask_empty(tmp->new_cpus);
+
+		/*
+		 * A partition root cannot have empty effective_cpus
+		 */
+		WARN_ON_ONCE(cs_empty && is_partition_root(cp));
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
 		 * parent, which is guaranteed to have some CPUs.
 		 */
-		if (is_in_v2_mode() && cpumask_empty(new_cpus))
-			cpumask_copy(new_cpus, parent->effective_cpus);
+		if (is_in_v2_mode() && cs_empty)
+			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
 
-		/* Skip the whole subtree if the cpumask remains the same. */
-		if (cpumask_equal(new_cpus, cp->effective_cpus)) {
+		/*
+		 * Skip the whole subtree if the cpumask remains the same
+		 * and has no partition root state.
+		 */
+		if (!is_partition_root(cp) &&
+		    cpumask_equal(tmp->new_cpus, cp->effective_cpus)) {
 			pos_css = css_rightmost_descendant(pos_css);
 			continue;
 		}
 
+		/*
+		 * update_parent_subparts_cpumask() should have been called
+		 * for cs already in update_cpumask(). We should also call
+		 * update_tasks_cpumask() again for tasks in the parent
+		 * cpuset if the parent's subparts_cpus changes.
+		 */
+		if ((cp != cs) && cp->partition_root_state &&
+		    update_parent_subparts_cpumask(cp, partcmd_update,
+						   NULL, tmp)) {
+			if (parent != &top_cpuset)
+				update_tasks_cpumask(parent);
+		}
+
 		if (!css_tryget_online(&cp->css))
 			continue;
 		rcu_read_unlock();
 
 		spin_lock_irq(&callback_lock);
-		cpumask_copy(cp->effective_cpus, new_cpus);
+
+		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+		if (cp->nr_subparts_cpus) {
+			/*
+			 * Make sure that effective_cpus & subparts_cpus
+			 * are mutually exclusive.
+			 */
+			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
+				       cp->subparts_cpus);
+		}
 		spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
@@ -1047,6 +1261,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 			  const char *buf)
 {
 	int retval;
+	struct tmpmasks tmp;
 
 	/* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */
 	if (cs == &top_cpuset)
@@ -1078,12 +1293,39 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
+#ifdef CONFIG_CPUMASK_OFFSTACK
+	/*
+	 * Use the cpumasks in trialcs for tmpmasks when they are pointers
+	 * to allocated cpumasks.
+	 */
+	tmp.addmask  = trialcs->subparts_cpus;
+	tmp.delmask  = trialcs->effective_cpus;
+	tmp.new_cpus = trialcs->cpus_allowed;
+#endif
+
+	if (cs->partition_root_state) {
+		/* Cpumask of a partition root cannot be empty */
+		if (cpumask_empty(trialcs->cpus_allowed))
+			return -EINVAL;
+		if (update_parent_subparts_cpumask(cs, partcmd_update,
+					trialcs->cpus_allowed, &tmp) < 0)
+			return -EINVAL;
+	}
+
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
+
+	/*
+	 * Make sure that subparts_cpus is a subset of cpus_allowed.
+	 */
+	if (cs->nr_subparts_cpus) {
+		cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus,
+			       cs->cpus_allowed);
+		cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
+	}
 	spin_unlock_irq(&callback_lock);
 
-	/* use trialcs->cpus_allowed as a temp variable */
-	update_cpumasks_hier(cs, trialcs->cpus_allowed);
+	update_cpumasks_hier(cs, &tmp);
 	return 0;
 }
 
@@ -1441,6 +1683,80 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	return err;
 }
 
+/*
+ * update_prstate - update partititon_root_state
+ * cs:	the cpuset to update
+ * val: 0 - disabled, 1 - enabled
+ *
+ * Call with cpuset_mutex held.
+ */
+static int update_prstate(struct cpuset *cs, int val)
+{
+	int err;
+	struct cpuset *parent = parent_cs(cs);
+	struct tmpmasks tmp;
+
+	if ((val != 0) && (val != 1))
+		return -EINVAL;
+	if (val == cs->partition_root_state)
+		return 0;
+
+	/*
+	 * Cannot force a partial or erroneous partition root to a full
+	 * partition root.
+	 */
+	if (val && cs->partition_root_state)
+		return -EINVAL;
+
+	if (alloc_cpumasks(NULL, &tmp))
+		return -ENOMEM;
+
+	err = -EINVAL;
+	if (!cs->partition_root_state) {
+		/*
+		 * Turning on partition root requires setting the
+		 * CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed
+		 * cannot be NULL.
+		 */
+		if (cpumask_empty(cs->cpus_allowed))
+			goto out;
+
+		err = update_flag(CS_CPU_EXCLUSIVE, cs, 1);
+		if (err)
+			goto out;
+
+		err = update_parent_subparts_cpumask(cs, partcmd_enable,
+						     NULL, &tmp);
+		if (err) {
+			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+			goto out;
+		}
+		cs->partition_root_state = PRS_ENABLED;
+	} else {
+		err = update_parent_subparts_cpumask(cs, partcmd_disable,
+						     NULL, &tmp);
+		if (err)
+			goto out;
+
+		cs->partition_root_state = 0;
+
+		/* Turning off CS_CPU_EXCLUSIVE will not return error */
+		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+	}
+
+	/*
+	 * Update cpumask of parent's tasks except when it is the top
+	 * cpuset as some system daemons cannot be mapped to other CPUs.
+	 */
+	if (parent != &top_cpuset)
+		update_tasks_cpumask(parent);
+
+	rebuild_sched_domains_locked();
+out:
+	free_cpumasks(NULL, &tmp);
+	return err;
+}
+
 /*
  * Frequency meter - How fast is some event occurring?
  *
@@ -1686,6 +2002,7 @@ typedef enum {
 	FILE_MEM_EXCLUSIVE,
 	FILE_MEM_HARDWALL,
 	FILE_SCHED_LOAD_BALANCE,
+	FILE_PARTITION_ROOT,
 	FILE_SCHED_RELAX_DOMAIN_LEVEL,
 	FILE_MEMORY_PRESSURE_ENABLED,
 	FILE_MEMORY_PRESSURE,
@@ -1755,6 +2072,9 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
 		retval = update_relax_domain_level(cs, val);
 		break;
+	case FILE_PARTITION_ROOT:
+		retval = update_prstate(cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1905,6 +2225,8 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	switch (type) {
 	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
 		return cs->relax_domain_level;
+	case FILE_PARTITION_ROOT:
+		return cs->partition_root_state;
 	default:
 		BUG();
 	}
@@ -2056,6 +2378,14 @@ static struct cftype dfl_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "sched.partition",
+		.read_s64 = cpuset_read_s64,
+		.write_s64 = cpuset_write_s64,
+		.private = FILE_PARTITION_ROOT,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
 	{ }	/* terminate */
 };
 
@@ -2157,7 +2487,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 /*
  * If the cpuset being removed has its flag 'sched_load_balance'
  * enabled, then simulate turning sched_load_balance off, which
- * will call rebuild_sched_domains_locked().
+ * will call rebuild_sched_domains_locked(). That is not needed
+ * in the default hierarchy where only changes in partition
+ * will cause repartitioning.
+ *
+ * If the cpuset has the 'sched.partition' flag enabled, simulate
+ * turning 'sched.partition" off.
  */
 
 static void cpuset_css_offline(struct cgroup_subsys_state *css)
@@ -2166,7 +2501,11 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 
 	mutex_lock(&cpuset_mutex);
 
-	if (is_sched_load_balance(cs))
+	if (is_partition_root(cs))
+		update_prstate(cs, 0);
+
+	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+	    is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
 	cpuset_dec();
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
                   ` (3 preceding siblings ...)
  2018-10-15 20:29 ` [PATCH v14 04/12] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-11-06 11:37   ` Peter Zijlstra
                     ` (2 more replies)
  2018-10-15 20:29 ` [PATCH v14 06/12] cpuset: Track cpusets that use parent's effective_cpus Waiman Long
                   ` (7 subsequent siblings)
  12 siblings, 3 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

Handling error returned by update_parent_subparts_cpumask() in
update_cpumasks_hier() is problematic as the states may become
inconsistent. To avoid that and increase flexibility in handling other
error cases, a new error state (-1) is added to the partition_root_state
flag. This new error state is set internally and user cannot write this
value to "cpuset.sched.partition".

In this error state, the partition root becomes an erroneous one. It is
no longer a real partition root, but the CS_CPU_EXCLUSIVE flag will
still be set as it can be changed back to a real one if appropriate
change happens later on.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 146 ++++++++++++++++++++++++++++++++++-------
 1 file changed, 123 insertions(+), 23 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a648a2ae851c..a1f6fdb6be02 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -153,10 +153,19 @@ struct cpuset {
  * Partition root states:
  *
  *   0 - not a partition root
+ *
  *   1 - partition root
+ *
+ *  -1 - erroneous partition root
+ *       None of the cpus in cpus_allowed can be put into the parent's
+ *       subparts_cpus. In this case, the cpuset is not a real partition
+ *       root anymore.  However, the CPU_EXCLUSIVE bit will still be set
+ *       and the cpuset can be restored back to a partition root if the
+ *       parent cpuset can give more CPUs back to this child cpuset.
  */
 #define PRS_DISABLED		0
 #define PRS_ENABLED		1
+#define PRS_ERROR		-1
 
 /*
  * Temporary cpumasks for working with partitions that are passed among
@@ -251,7 +260,7 @@ static inline int is_spread_slab(const struct cpuset *cs)
 
 static inline int is_partition_root(const struct cpuset *cs)
 {
-	return cs->partition_root_state;
+	return cs->partition_root_state > 0;
 }
 
 static struct cpuset top_cpuset = {
@@ -1021,9 +1030,12 @@ enum subparts_cmd {
  *
  * For partcmd_update, if the optional newmask is specified, the cpu
  * list is to be changed from cpus_allowed to newmask. Otherwise,
- * cpus_allowed is assumed to remain the same.  The function will return
- * 1 if changes to parent's subparts_cpus and effective_cpus happen or 0
- * otherwise. In case of error, an error code will be returned.
+ * cpus_allowed is assumed to remain the same. The cpuset should either
+ * be a partition root or an erroneous partition root. The partition root
+ * state may change if newmask is NULL and none of the requested CPUs can
+ * be granted by the parent. The function will return 1 if changes to
+ * parent's subparts_cpus and effective_cpus happen or 0 otherwise.
+ * Error code should only be returned when newmask is non-NULL.
  *
  * The partcmd_enable and partcmd_disable commands are used by
  * update_prstate(). The partcmd_update command is used by
@@ -1048,6 +1060,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	struct cpuset *parent = parent_cs(cpuset);
 	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
+	bool part_error = false;	/* Partition error? */
 
 	/*
 	 * The parent must be a partition root.
@@ -1114,13 +1127,48 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		 * addmask = cpus_allowed & parent->effectiveb_cpus
 		 *
 		 * Note that parent's subparts_cpus may have been
-		 * pre-shrunk in case the CPUs granted to the parent
-		 * by the grandparent changes. So no deletion is needed.
+		 * pre-shrunk in case there is a change in the cpu list.
+		 * So no deletion is needed.
 		 */
 		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
 				     parent->effective_cpus);
-		if (cpumask_equal(tmp->addmask, parent->effective_cpus))
-			return -EINVAL;
+		part_error = cpumask_equal(tmp->addmask,
+					   parent->effective_cpus);
+	}
+
+	if (cmd == partcmd_update) {
+		int prev_prs = cpuset->partition_root_state;
+
+		/*
+		 * Check for possible transition between PRS_ENABLED
+		 * and PRS_ERROR.
+		 */
+		switch (cpuset->partition_root_state) {
+		case PRS_ENABLED:
+			if (part_error)
+				cpuset->partition_root_state = PRS_ERROR;
+			break;
+		case PRS_ERROR:
+			if (!part_error)
+				cpuset->partition_root_state = PRS_ENABLED;
+			break;
+		}
+		/*
+		 * Set part_error if previously in erroneous state.
+		 */
+		part_error = (prev_prs == PRS_ERROR);
+	}
+
+	if (!part_error && (cpuset->partition_root_state == PRS_ERROR))
+		return 0;	/* Nothing need to be done */
+
+	if (cpuset->partition_root_state == PRS_ERROR) {
+		/*
+		 * Remove all its cpus from parent's subparts_cpus.
+		 */
+		adding = false;
+		deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
+				       parent->subparts_cpus);
 	}
 
 	if (!adding && !deleting)
@@ -1172,28 +1220,21 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 	rcu_read_lock();
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
-		bool cs_empty;
 
 		compute_effective_cpumask(tmp->new_cpus, cp, parent);
-		cs_empty = cpumask_empty(tmp->new_cpus);
-
-		/*
-		 * A partition root cannot have empty effective_cpus
-		 */
-		WARN_ON_ONCE(cs_empty && is_partition_root(cp));
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
 		 * parent, which is guaranteed to have some CPUs.
 		 */
-		if (is_in_v2_mode() && cs_empty)
+		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus))
 			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
 
 		/*
 		 * Skip the whole subtree if the cpumask remains the same
 		 * and has no partition root state.
 		 */
-		if (!is_partition_root(cp) &&
+		if (!cp->partition_root_state &&
 		    cpumask_equal(tmp->new_cpus, cp->effective_cpus)) {
 			pos_css = css_rightmost_descendant(pos_css);
 			continue;
@@ -1205,11 +1246,39 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		 * update_tasks_cpumask() again for tasks in the parent
 		 * cpuset if the parent's subparts_cpus changes.
 		 */
-		if ((cp != cs) && cp->partition_root_state &&
-		    update_parent_subparts_cpumask(cp, partcmd_update,
-						   NULL, tmp)) {
-			if (parent != &top_cpuset)
-				update_tasks_cpumask(parent);
+		if ((cp != cs) && cp->partition_root_state) {
+			switch (parent->partition_root_state) {
+			case PRS_DISABLED:
+				/*
+				 * If parent is not a partition root or an
+				 * erroneous partition root, clear the state
+				 * state and the CS_CPU_EXCLUSIVE flag.
+				 */
+				WARN_ON_ONCE(cp->partition_root_state
+					     != PRS_ERROR);
+				cp->partition_root_state = 0;
+				spin_lock_irq(&callback_lock);
+				clear_bit(CS_CPU_EXCLUSIVE, &cp->flags);
+				spin_unlock_irq(&callback_lock);
+				break;
+
+			case PRS_ENABLED:
+				if (update_parent_subparts_cpumask(cp,
+					partcmd_update, NULL, tmp))
+					update_tasks_cpumask(parent);
+				break;
+
+			case PRS_ERROR:
+				/*
+				 * When parent is erroneous, it has to be too.
+				 */
+				cp->partition_root_state = PRS_ERROR;
+				if (cp->nr_subparts_cpus) {
+					cp->nr_subparts_cpus = 0;
+					cpumask_clear(cp->subparts_cpus);
+				}
+				break;
+			}
 		}
 
 		if (!css_tryget_online(&cp->css))
@@ -1219,13 +1288,33 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		spin_lock_irq(&callback_lock);
 
 		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-		if (cp->nr_subparts_cpus) {
+		if (cp->nr_subparts_cpus &&
+		   (cp->partition_root_state != PRS_ENABLED)) {
+			cp->nr_subparts_cpus = 0;
+			cpumask_clear(cp->subparts_cpus);
+		} else if (cp->nr_subparts_cpus) {
 			/*
 			 * Make sure that effective_cpus & subparts_cpus
 			 * are mutually exclusive.
+			 *
+			 * In the unlikely event that effective_cpus
+			 * becomes empty. we clear cp->nr_subparts_cpus and
+			 * let its child partition roots to compete for
+			 * CPUs again.
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
+			if (cpumask_empty(cp->effective_cpus)) {
+				cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+				cpumask_clear(cp->subparts_cpus);
+				cp->nr_subparts_cpus = 0;
+			} else if (!cpumask_subset(cp->subparts_cpus,
+						   tmp->new_cpus)) {
+				cpumask_andnot(cp->subparts_cpus,
+					cp->subparts_cpus, tmp->new_cpus);
+				cp->nr_subparts_cpus
+					= cpumask_weight(cp->subparts_cpus);
+			}
 		}
 		spin_unlock_irq(&callback_lock);
 
@@ -1733,6 +1822,17 @@ static int update_prstate(struct cpuset *cs, int val)
 		}
 		cs->partition_root_state = PRS_ENABLED;
 	} else {
+		/*
+		 * Turning off partition root will clear the
+		 * CS_CPU_EXCLUSIVE bit.
+		 */
+		if (cs->partition_root_state == PRS_ERROR) {
+			cs->partition_root_state = 0;
+			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+			err = 0;
+			goto out;
+		}
+
 		err = update_parent_subparts_cpumask(cs, partcmd_disable,
 						     NULL, &tmp);
 		if (err)
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 06/12] cpuset: Track cpusets that use parent's effective_cpus
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
                   ` (4 preceding siblings ...)
  2018-10-15 20:29 ` [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-10-15 20:29 ` [PATCH v14 07/12] cpuset: Make CPU hotplug work with partition Waiman Long
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

In the default hierarchy, a cpuset will use the parent's effective_cpus
if none of the requested CPUs can be granted from the parent. That can
be a problem if a parent is a partition root with children partition
roots. Changes to a parent's effective_cpus list due to changes in a
child partition root may not be properly reflected in a child cpuset
that use parent's effective_cpus because the cpu_exclusive rule of a
partition root will not guard against that.

In order to avoid the mismatch, two new tracking variables are added to
the cpuset structure to track if a cpuset uses parent's effective_cpus
and the number of children cpusets that use its effective_cpus. So
whenever cpumask changes are made to a parent, it will also check to
see if it has other children cpusets that use its effective_cpus and
call update_cpumasks_hier() if that is the case.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 71 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 70 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a1f6fdb6be02..596e0b16fb96 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -147,6 +147,14 @@ struct cpuset {
 
 	/* partition root state */
 	int partition_root_state;
+
+	/*
+	 * Default hierarchy only:
+	 * use_parent_ecpus - set if using parent's effective_cpus
+	 * child_ecpus_count - # of children with use_parent_ecpus set
+	 */
+	int use_parent_ecpus;
+	int child_ecpus_count;
 };
 
 /*
@@ -1227,8 +1235,17 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		 * If it becomes empty, inherit the effective mask of the
 		 * parent, which is guaranteed to have some CPUs.
 		 */
-		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus))
+		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
 			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
+			if (!cp->use_parent_ecpus) {
+				cp->use_parent_ecpus = true;
+				parent->child_ecpus_count++;
+			}
+		} else if (cp->use_parent_ecpus) {
+			cp->use_parent_ecpus = false;
+			WARN_ON_ONCE(!parent->child_ecpus_count);
+			parent->child_ecpus_count--;
+		}
 
 		/*
 		 * Skip the whole subtree if the cpumask remains the same
@@ -1340,6 +1357,35 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		rebuild_sched_domains_locked();
 }
 
+/**
+ * update_sibling_cpumasks - Update siblings cpumasks
+ * @parent:  Parent cpuset
+ * @cs:      Current cpuset
+ * @tmp:     Temp variables
+ */
+static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
+				    struct tmpmasks *tmp)
+{
+	struct cpuset *sibling;
+	struct cgroup_subsys_state *pos_css;
+
+	/*
+	 * Check all its siblings and call update_cpumasks_hier()
+	 * if their use_parent_ecpus flag is set in order for them
+	 * to use the right effective_cpus value.
+	 */
+	rcu_read_lock();
+	cpuset_for_each_child(sibling, pos_css, parent) {
+		if (sibling == cs)
+			continue;
+		if (!sibling->use_parent_ecpus)
+			continue;
+
+		update_cpumasks_hier(sibling, tmp);
+	}
+	rcu_read_unlock();
+}
+
 /**
  * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
  * @cs: the cpuset to consider
@@ -1415,6 +1461,17 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	spin_unlock_irq(&callback_lock);
 
 	update_cpumasks_hier(cs, &tmp);
+
+	if (cs->partition_root_state) {
+		struct cpuset *parent = parent_cs(cs);
+
+		/*
+		 * For partition root, update the cpumasks of sibling
+		 * cpusets if they use parent's effective_cpus.
+		 */
+		if (parent->child_ecpus_count)
+			update_sibling_cpumasks(parent, cs, &tmp);
+	}
 	return 0;
 }
 
@@ -1851,6 +1908,9 @@ static int update_prstate(struct cpuset *cs, int val)
 	if (parent != &top_cpuset)
 		update_tasks_cpumask(parent);
 
+	if (parent->child_ecpus_count)
+		update_sibling_cpumasks(parent, cs, &tmp);
+
 	rebuild_sched_domains_locked();
 out:
 	free_cpumasks(NULL, &tmp);
@@ -2545,6 +2605,8 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (is_in_v2_mode()) {
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
 		cs->effective_mems = parent->effective_mems;
+		cs->use_parent_ecpus = true;
+		parent->child_ecpus_count++;
 	}
 	spin_unlock_irq(&callback_lock);
 
@@ -2608,6 +2670,13 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	    is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
+	if (cs->use_parent_ecpus) {
+		struct cpuset *parent = parent_cs(cs);
+
+		cs->use_parent_ecpus = false;
+		parent->child_ecpus_count--;
+	}
+
 	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 07/12] cpuset: Make CPU hotplug work with partition
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
                   ` (5 preceding siblings ...)
  2018-10-15 20:29 ` [PATCH v14 06/12] cpuset: Track cpusets that use parent's effective_cpus Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-10-15 20:29 ` [PATCH v14 08/12] cpuset: Make generate_sched_domains() " Waiman Long
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

When there is a cpu hotplug event (CPU online or offline), the partitions
may need to be reconfigured and regenerated. So code is added to the
hotplug functions to make them work with new subparts_cpus mask to
compute the right effective_cpus for each of the affected cpusets.
It may also change the state of a partition root from real one to an
erroneous one or vice versa.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 131 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 116 insertions(+), 15 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 596e0b16fb96..2e9fab21ea57 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -113,6 +113,9 @@ struct cpuset {
 	 * CPUs allocated to child sub-partitions (default hierarchy only)
 	 * - CPUs granted by the parent = effective_cpus U subparts_cpus
 	 * - effective_cpus and subparts_cpus are mutually exclusive.
+	 *
+	 * effective_cpus contains only onlined CPUs, but subparts_cpus
+	 * may have offlined ones.
 	 */
 	cpumask_var_t subparts_cpus;
 
@@ -994,7 +997,9 @@ static void update_tasks_cpumask(struct cpuset *cs)
  * @parent: the parent cpuset
  *
  * If the parent has subpartition CPUs, include them in the list of
- * allowable CPUs in computing the new effective_cpus mask.
+ * allowable CPUs in computing the new effective_cpus mask. Since offlined
+ * CPUs are not removed from subparts_cpus, we have to use cpu_active_mask
+ * to mask those out.
  */
 static void compute_effective_cpumask(struct cpumask *new_cpus,
 				      struct cpuset *cs, struct cpuset *parent)
@@ -1003,6 +1008,7 @@ static void compute_effective_cpumask(struct cpumask *new_cpus,
 		cpumask_or(new_cpus, parent->effective_cpus,
 			   parent->subparts_cpus);
 		cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
+		cpumask_and(new_cpus, new_cpus, cpu_active_mask);
 	} else {
 		cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
 	}
@@ -1125,9 +1131,20 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		/*
 		 * Return error if the new effective_cpus could become empty.
 		 */
-		if (adding && !deleting &&
-		    cpumask_equal(parent->effective_cpus, tmp->addmask))
-			return -EINVAL;
+		if (adding &&
+		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
+			if (!deleting)
+				return -EINVAL;
+			/*
+			 * As some of the CPUs in subparts_cpus might have
+			 * been offlined, we need to compute the real delmask
+			 * to confirm that.
+			 */
+			if (!cpumask_and(tmp->addmask, tmp->delmask,
+					 cpu_active_mask))
+				return -EINVAL;
+			cpumask_copy(tmp->addmask, parent->effective_cpus);
+		}
 	} else {
 		/*
 		 * partcmd_update w/o newmask:
@@ -1197,6 +1214,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	if (deleting) {
 		cpumask_andnot(parent->subparts_cpus,
 			       parent->subparts_cpus, tmp->delmask);
+		/*
+		 * Some of the CPUs in subparts_cpus might have been offlined.
+		 */
+		cpumask_and(tmp->delmask, tmp->delmask, cpu_active_mask);
 		cpumask_or(parent->effective_cpus,
 			   parent->effective_cpus, tmp->delmask);
 	}
@@ -2858,20 +2879,29 @@ hotplug_update_tasks(struct cpuset *cs,
 		update_tasks_nodemask(cs);
 }
 
+static bool force_rebuild;
+
+void cpuset_force_rebuild(void)
+{
+	force_rebuild = true;
+}
+
 /**
  * cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug
  * @cs: cpuset in interest
+ * @tmp: the tmpmasks structure pointer
  *
  * Compare @cs's cpu and mem masks against top_cpuset and if some have gone
  * offline, update @cs accordingly.  If @cs ends up with no CPU or memory,
  * all its tasks are moved to the nearest ancestor with both resources.
  */
-static void cpuset_hotplug_update_tasks(struct cpuset *cs)
+static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 {
 	static cpumask_t new_cpus;
 	static nodemask_t new_mems;
 	bool cpus_updated;
 	bool mems_updated;
+	struct cpuset *parent;
 retry:
 	wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
 
@@ -2886,9 +2916,60 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
 		goto retry;
 	}
 
-	cpumask_and(&new_cpus, cs->cpus_allowed, parent_cs(cs)->effective_cpus);
-	nodes_and(new_mems, cs->mems_allowed, parent_cs(cs)->effective_mems);
+	parent =  parent_cs(cs);
+	compute_effective_cpumask(&new_cpus, cs, parent);
+	nodes_and(new_mems, cs->mems_allowed, parent->effective_mems);
+
+	if (cs->nr_subparts_cpus)
+		/*
+		 * Make sure that CPUs allocated to child partitions
+		 * do not show up in effective_cpus.
+		 */
+		cpumask_andnot(&new_cpus, &new_cpus, cs->subparts_cpus);
+
+	if (!tmp || !cs->partition_root_state)
+		goto update_tasks;
+
+	/*
+	 * In the unlikely event that a partition root has empty
+	 * effective_cpus or its parent becomes erroneous, we have to
+	 * transition it to the erroneous state.
+	 */
+	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
+	   (parent->partition_root_state == PRS_ERROR))) {
+		if (cs->nr_subparts_cpus) {
+			cs->nr_subparts_cpus = 0;
+			cpumask_clear(cs->subparts_cpus);
+			compute_effective_cpumask(&new_cpus, cs, parent);
+		}
 
+		/*
+		 * If the effective_cpus is empty because the child
+		 * partitions take away all the CPUs, we can keep
+		 * the current partition and let the child partitions
+		 * fight for available CPUs.
+		 */
+		if ((parent->partition_root_state == PRS_ERROR) ||
+		     cpumask_empty(&new_cpus)) {
+			update_parent_subparts_cpumask(cs, partcmd_disable,
+						       NULL, tmp);
+			cs->partition_root_state = PRS_ERROR;
+		}
+		cpuset_force_rebuild();
+	}
+
+	/*
+	 * On the other hand, an erroneous partition root may be transitioned
+	 * back to a regular one or a partition root with no CPU allocated
+	 * from the parent may change to erroneous.
+	 */
+	if (is_partition_root(parent) &&
+	   ((cs->partition_root_state == PRS_ERROR) ||
+	    !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
+	     update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
+		cpuset_force_rebuild();
+
+update_tasks:
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
 	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
 
@@ -2902,13 +2983,6 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
 	mutex_unlock(&cpuset_mutex);
 }
 
-static bool force_rebuild;
-
-void cpuset_force_rebuild(void)
-{
-	force_rebuild = true;
-}
-
 /**
  * cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset
  *
@@ -2931,6 +3005,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	static nodemask_t new_mems;
 	bool cpus_updated, mems_updated;
 	bool on_dfl = is_in_v2_mode();
+	struct tmpmasks tmp, *ptmp = NULL;
+
+	if (on_dfl && !alloc_cpumasks(NULL, &tmp))
+		ptmp = &tmp;
 
 	mutex_lock(&cpuset_mutex);
 
@@ -2938,6 +3016,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	cpumask_copy(&new_cpus, cpu_active_mask);
 	new_mems = node_states[N_MEMORY];
 
+	/*
+	 * If subparts_cpus is populated, it is likely that the check below
+	 * will produce a false positive on cpus_updated when the cpu list
+	 * isn't changed. It is extra work, but it is better to be safe.
+	 */
 	cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
 	mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
 
@@ -2946,6 +3029,22 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+		/*
+		 * Make sure that CPUs allocated to child partitions
+		 * do not show up in effective_cpus. If no CPU is left,
+		 * we clear the subparts_cpus & let the child partitions
+		 * fight for the CPUs again.
+		 */
+		if (top_cpuset.nr_subparts_cpus) {
+			if (cpumask_subset(&new_cpus,
+					   top_cpuset.subparts_cpus)) {
+				top_cpuset.nr_subparts_cpus = 0;
+				cpumask_clear(top_cpuset.subparts_cpus);
+			} else {
+				cpumask_andnot(&new_cpus, &new_cpus,
+					       top_cpuset.subparts_cpus);
+			}
+		}
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
 		spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
@@ -2974,7 +3073,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 				continue;
 			rcu_read_unlock();
 
-			cpuset_hotplug_update_tasks(cs);
+			cpuset_hotplug_update_tasks(cs, ptmp);
 
 			rcu_read_lock();
 			css_put(&cs->css);
@@ -2987,6 +3086,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		force_rebuild = false;
 		rebuild_sched_domains();
 	}
+
+	free_cpumasks(NULL, ptmp);
 }
 
 void cpuset_update_active_cpus(void)
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 08/12] cpuset: Make generate_sched_domains() work with partition
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
                   ` (6 preceding siblings ...)
  2018-10-15 20:29 ` [PATCH v14 07/12] cpuset: Make CPU hotplug work with partition Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-10-15 20:29 ` [PATCH v14 09/12] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

The generate_sched_domains() function is modified to make it work
correctly with the newly introduced subparts_cpus mask for scheduling
domains generation.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2e9fab21ea57..53dbd473bd7e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -769,13 +769,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	int ndoms = 0;		/* number of sched domains in result */
 	int nslot;		/* next empty doms[] struct cpumask slot */
 	struct cgroup_subsys_state *pos_css;
+	bool root_load_balance = is_sched_load_balance(&top_cpuset);
 
 	doms = NULL;
 	dattr = NULL;
 	csa = NULL;
 
 	/* Special case for the 99% of systems with one, full, sched domain */
-	if (is_sched_load_balance(&top_cpuset)) {
+	if (root_load_balance && !top_cpuset.nr_subparts_cpus) {
 		ndoms = 1;
 		doms = alloc_sched_domains(ndoms);
 		if (!doms)
@@ -798,6 +799,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	csn = 0;
 
 	rcu_read_lock();
+	if (root_load_balance)
+		csa[csn++] = &top_cpuset;
 	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
 		if (cp == &top_cpuset)
 			continue;
@@ -808,6 +811,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		 * parent's cpus, so just skip them, and then we call
 		 * update_domain_attr_tree() to calc relax_domain_level of
 		 * the corresponding sched domain.
+		 *
+		 * If root is load-balancing, we can skip @cp if it
+		 * is a subset of the root's effective_cpus.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    !(is_sched_load_balance(cp) &&
@@ -815,11 +821,16 @@ static int generate_sched_domains(cpumask_var_t **domains,
 					 housekeeping_cpumask(HK_FLAG_DOMAIN))))
 			continue;
 
+		if (root_load_balance &&
+		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
+			continue;
+
 		if (is_sched_load_balance(cp))
 			csa[csn++] = cp;
 
-		/* skip @cp's subtree */
-		pos_css = css_rightmost_descendant(pos_css);
+		/* skip @cp's subtree if not a partition root */
+		if (!is_partition_root(cp))
+			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
 
@@ -947,7 +958,12 @@ static void rebuild_sched_domains_locked(void)
 	 * passing doms with offlined cpu to partition_sched_domains().
 	 * Anyways, hotplug work item will rebuild sched domains.
 	 */
-	if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+	if (!top_cpuset.nr_subparts_cpus &&
+	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+		goto out;
+
+	if (top_cpuset.nr_subparts_cpus &&
+	   !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
 		goto out;
 
 	/* Generate domain masks and attrs */
@@ -1362,11 +1378,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		update_tasks_cpumask(cp);
 
 		/*
-		 * If the effective cpumask of any non-empty cpuset is changed,
-		 * we need to rebuild sched domains.
+		 * On legacy hierarchy, if the effective cpumask of any non-
+		 * empty cpuset is changed, we need to rebuild sched domains.
+		 * On default hierarchy, the cpuset needs to be a partition
+		 * root as well.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
-		    is_sched_load_balance(cp))
+		    is_sched_load_balance(cp) &&
+		   (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
+		    is_partition_root(cp)))
 			need_rebuild_sched_domains = true;
 
 		rcu_read_lock();
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 09/12] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
                   ` (7 preceding siblings ...)
  2018-10-15 20:29 ` [PATCH v14 08/12] cpuset: Make generate_sched_domains() " Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-10-15 20:29 ` [PATCH v14 10/12] cpuset: Add documentation about the new "cpuset.sched.partition" flag Waiman Long
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

Because of the fact that setting the "cpuset.sched.partition" in
a direct child of root can remove CPUs from the root's effective CPU
list, it makes sense to know what CPUs are left in the root cgroup for
scheduling purpose. So the "cpuset.cpus.effective" control file is now
exposed in the v2 cgroup root.

For consistency, the "cpuset.mems.effective" control file is exposed
as well.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 4 ++--
 kernel/cgroup/cpuset.c                  | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 0465c231e2fb..533e85cb851b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1631,7 +1631,7 @@ Cpuset Interface Files
 	and won't be affected by any CPU hotplug events.
 
   cpuset.cpus.effective
-	A read-only multiple values file which exists on non-root
+	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
 
 	It lists the onlined CPUs that are actually granted to this
@@ -1671,7 +1671,7 @@ Cpuset Interface Files
 	and won't be affected by any memory nodes hotplug events.
 
   cpuset.mems.effective
-	A read-only multiple values file which exists on non-root
+	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
 
 	It lists the onlined memory nodes that are actually granted to
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 53dbd473bd7e..b9832e355e36 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2569,14 +2569,12 @@ static struct cftype dfl_files[] = {
 		.name = "cpus.effective",
 		.seq_show = cpuset_common_seq_show,
 		.private = FILE_EFFECTIVE_CPULIST,
-		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
 	{
 		.name = "mems.effective",
 		.seq_show = cpuset_common_seq_show,
 		.private = FILE_EFFECTIVE_MEMLIST,
-		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
 	{
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 10/12] cpuset: Add documentation about the new "cpuset.sched.partition" flag
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
                   ` (8 preceding siblings ...)
  2018-10-15 20:29 ` [PATCH v14 09/12] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-11-06 11:50   ` Peter Zijlstra
  2018-10-15 20:29 ` [PATCH v14 11/12] cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug Waiman Long
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

The cgroup-v2.rst file is updated to document the purpose of the new
"cpuset.sched.partition" flag and how its usage.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 66 +++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 533e85cb851b..178cda473a26 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1686,6 +1686,72 @@ Cpuset Interface Files
 
 	Its value will be affected by memory nodes hotplug events.
 
+  cpuset.sched.partition
+	A read-write single value file which exists on non-root
+	cpuset-enabled cgroups.  It accepts either "0" (off) or "1"
+	(on) when written to.  This flag is set and owned by the
+	parent cgroup.
+
+	If set, it indicates that the current cgroup is the root of a
+	new partition or scheduling domain that comprises itself and
+	all its descendants except those that are separate partition
+	roots themselves and their descendants.  The root cgroup is
+	always a partition root.
+
+	There are constraints on where this flag can be set.  It can
+	only be set in a cgroup if all the following conditions are true.
+
+	1) The "cpuset.cpus" is not empty and the list of CPUs are
+	   exclusive, i.e. they are not shared by any of its siblings.
+	2) The parent cgroup is a partition root.
+	3) The "cpuset.cpus" is also a proper subset of the parent's
+	   "cpuset.cpus.effective".
+	4) There is no child cgroups with cpuset enabled.  This is for
+	   eliminating corner cases that have to be handled if such a
+	   condition is allowed.
+
+	Setting this flag will take the CPUs away from the effective
+	CPUs of the parent cgroup.  Once it is set, this flag cannot
+	be cleared if there are any child cgroups with cpuset enabled.
+
+	A parent partition cannot distribute all its CPUs to its
+	child partitions.  There must be at least one cpu left in the
+	parent partition.
+
+	Once becoming a partition root, changes to "cpuset.cpus" is
+	generally allowed as long as the first condition above is true,
+	the change will not take away all the CPUs from the parent
+	partition and the new "cpuset.cpus" value is a superset of its
+	children's "cpuset.cpus" values.
+
+	Sometimes, external factors like changes to ancestors'
+	"cpuset.cpus" or cpu hotplug can cause the state of the partition
+	root to change.  On read, the "cpuset.sched.partition" file
+	can show the following values.
+
+	"0"  Not a partition root
+	"1"  Partition root
+	"-1" Erroneous partition root
+
+	It is a partition root if the first 2 partition root conditions
+	above are true and at least one CPU from "cpuset.cpus" is
+	granted by the parent cgroup.
+
+	A partition root can become an erroneous partition root if none
+	of CPUs requested in "cpuset.cpus" can be granted by the parent
+	cgroup or the parent cgroup is no longer a partition root.
+	In this case, it is not a real partition even though the
+	restriction of the first partition root condition above will
+	still apply.  All the tasks in the cgroup will be migrated to
+	the nearest ancestor partition.
+
+	An erroneous partition root can be transitioned back to a real
+	partition root if at least one of the requested CPUs can now be
+	granted by its parent.	In this case, the tasks will be migrated
+	back to the newly created partition.  Clearing the partition
+	flag of an erroneous partition root is always allowed even if
+	child cpusets are present.
+
 
 Device controller
 -----------------
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 11/12] cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
                   ` (9 preceding siblings ...)
  2018-10-15 20:29 ` [PATCH v14 10/12] cpuset: Add documentation about the new "cpuset.sched.partition" flag Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-10-15 20:29 ` [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition Waiman Long
  2018-11-05 16:36 ` [PATCH v14 00/12] Enable cpuset controller in default hierarchy Tejun Heo
  12 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

For debugging purpose, it will be useful to expose the content of the
subparts_cpus as a read-only file to see if the code work correctly.
However, subparts_cpus will not be used at all in most use cases. So
adding a new cpuset file that clutters the cgroup directory may not be
desirable.  This is now being done by using the hidden "cgroup_debug"
kernel command line option to expose a new "cpuset.cpus.subpartitions"
file.

That option was originally used by the debug controller to expose
itself when configured into the kernel. This is now extended to set an
internal flag used by cgroup_addrm_files(). A new CFTYPE_DEBUG flag
can now be used to specify that a cgroup file should only be created
when the "cgroup_debug" option is specified.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cgroup-defs.h     |  1 +
 kernel/cgroup/cgroup-internal.h |  2 ++
 kernel/cgroup/cgroup.c          | 14 +++++++++++++-
 kernel/cgroup/cpuset.c          | 11 +++++++++++
 kernel/cgroup/debug.c           |  4 +---
 5 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ff20b677fb9f..94c7a78dd563 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -91,6 +91,7 @@ enum {
 
 	CFTYPE_NO_PREFIX	= (1 << 3),	/* (DON'T USE FOR NEW FILES) no subsys prefix */
 	CFTYPE_WORLD_WRITABLE	= (1 << 4),	/* (DON'T USE FOR NEW FILES) S_IWUGO */
+	CFTYPE_DEBUG		= (1 << 5),	/* create when cgroup_debug */
 
 	/* internal flags, do not use outside cgroup core proper */
 	__CFTYPE_ONLY_ON_DFL	= (1 << 16),	/* only on default hierarchy */
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 75568fcf2180..c950864016e2 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -11,6 +11,8 @@
 #define TRACE_CGROUP_PATH_LEN 1024
 extern spinlock_t trace_cgroup_path_lock;
 extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
+extern bool cgroup_debug;
+extern void __init enable_debug_cgroup(void);
 
 /*
  * cgroup_path() takes a spin lock. It is good practice not to take
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index aae10baf1902..067e48aaa7a3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -85,6 +85,7 @@ EXPORT_SYMBOL_GPL(css_set_lock);
 
 DEFINE_SPINLOCK(trace_cgroup_path_lock);
 char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
+bool cgroup_debug __read_mostly;
 
 /*
  * Protects cgroup_idr and css_idr so that IDs can be released without
@@ -3616,7 +3617,8 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 			continue;
 		if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgroup_parent(cgrp))
 			continue;
-
+		if ((cft->flags & CFTYPE_DEBUG) && !cgroup_debug)
+			continue;
 		if (is_add) {
 			ret = cgroup_add_file(css, cgrp, cft);
 			if (ret) {
@@ -5695,6 +5697,16 @@ static int __init cgroup_disable(char *str)
 }
 __setup("cgroup_disable=", cgroup_disable);
 
+void __init __weak enable_debug_cgroup(void) { }
+
+static int __init enable_cgroup_debug(char *str)
+{
+	cgroup_debug = true;
+	enable_debug_cgroup();
+	return 1;
+}
+__setup("cgroup_debug", enable_cgroup_debug);
+
 /**
  * css_tryget_online_from_dir - get corresponding css from a cgroup dentry
  * @dentry: directory dentry of interest
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b9832e355e36..34d47e43c98a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2199,6 +2199,7 @@ typedef enum {
 	FILE_MEMLIST,
 	FILE_EFFECTIVE_CPULIST,
 	FILE_EFFECTIVE_MEMLIST,
+	FILE_SUBPARTS_CPULIST,
 	FILE_CPU_EXCLUSIVE,
 	FILE_MEM_EXCLUSIVE,
 	FILE_MEM_HARDWALL,
@@ -2380,6 +2381,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	case FILE_EFFECTIVE_MEMLIST:
 		seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems));
 		break;
+	case FILE_SUBPARTS_CPULIST:
+		seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->subparts_cpus));
+		break;
 	default:
 		ret = -EINVAL;
 	}
@@ -2585,6 +2589,13 @@ static struct cftype dfl_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "cpus.subpartitions",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_SUBPARTS_CPULIST,
+		.flags = CFTYPE_DEBUG,
+	},
+
 	{ }	/* terminate */
 };
 
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index 9caeda610249..5f1b87330bee 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -373,11 +373,9 @@ struct cgroup_subsys debug_cgrp_subsys = {
  * On v2, debug is an implicit controller enabled by "cgroup_debug" boot
  * parameter.
  */
-static int __init enable_cgroup_debug(char *str)
+void __init enable_debug_cgroup(void)
 {
 	debug_cgrp_subsys.dfl_cftypes = debug_files;
 	debug_cgrp_subsys.implicit_on_dfl = true;
 	debug_cgrp_subsys.threaded = true;
-	return 1;
 }
-__setup("cgroup_debug", enable_cgroup_debug);
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
                   ` (10 preceding siblings ...)
  2018-10-15 20:29 ` [PATCH v14 11/12] cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug Waiman Long
@ 2018-10-15 20:29 ` Waiman Long
  2018-10-17 15:08   ` Tejun Heo
  2018-11-06 11:52   ` Peter Zijlstra
  2018-11-05 16:36 ` [PATCH v14 00/12] Enable cpuset controller in default hierarchy Tejun Heo
  12 siblings, 2 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-15 20:29 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Tom Hromatka, Waiman Long

Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on
read. A person who is not familiar with the partition code may not
understand what they mean.

In order to make cpuset.sched.partition more user-friendly, it will
now display the following descriptive text on read:

  "normal" - A normal cpuset, not a partition root
  "partition" - A partition root
  "partition invalid" - An invalid partition root

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 12 ++++++------
 kernel/cgroup/cpuset.c                  | 22 +++++++++++++++++++---
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 178cda473a26..d9cc79ceb9aa 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1729,15 +1729,15 @@ Cpuset Interface Files
 	root to change.  On read, the "cpuset.sched.partition" file
 	can show the following values.
 
-	"0"  Not a partition root
-	"1"  Partition root
-	"-1" Erroneous partition root
+	"normal" - Normal cpuset, not a partition root
+	"partition" - Partition root
+	"partition invalid" - Invalid partition root
 
 	It is a partition root if the first 2 partition root conditions
 	above are true and at least one CPU from "cpuset.cpus" is
 	granted by the parent cgroup.
 
-	A partition root can become an erroneous partition root if none
+	A partition root can become an invalid partition root if none
 	of CPUs requested in "cpuset.cpus" can be granted by the parent
 	cgroup or the parent cgroup is no longer a partition root.
 	In this case, it is not a real partition even though the
@@ -1745,11 +1745,11 @@ Cpuset Interface Files
 	still apply.  All the tasks in the cgroup will be migrated to
 	the nearest ancestor partition.
 
-	An erroneous partition root can be transitioned back to a real
+	An invalid partition root can be transitioned back to a real
 	partition root if at least one of the requested CPUs can now be
 	granted by its parent.	In this case, the tasks will be migrated
 	back to the newly created partition.  Clearing the partition
-	flag of an erroneous partition root is always allowed even if
+	flag of an invalid partition root is always allowed even if
 	child cpusets are present.
 
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 34d47e43c98a..fe6d6366bfa7 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2430,8 +2430,6 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	switch (type) {
 	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
 		return cs->relax_domain_level;
-	case FILE_PARTITION_ROOT:
-		return cs->partition_root_state;
 	default:
 		BUG();
 	}
@@ -2440,6 +2438,24 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	return 0;
 }
 
+static int sched_partition_show(struct seq_file *seq, void *v)
+{
+	struct cpuset *cs = css_cs(seq_css(seq));
+
+	switch (cs->partition_root_state) {
+	case PRS_ENABLED:
+		seq_puts(seq, "partition\n");
+		break;
+	case PRS_DISABLED:
+		seq_puts(seq, "normal\n");
+		break;
+	case PRS_ERROR:
+		seq_puts(seq, "partition invalid\n");
+		break;
+	}
+	return 0;
+}
+
 /*
  * for the common functions, 'private' gives the type of file
  */
@@ -2583,7 +2599,7 @@ static struct cftype dfl_files[] = {
 
 	{
 		.name = "sched.partition",
-		.read_s64 = cpuset_read_s64,
+		.seq_show = sched_partition_show,
 		.write_s64 = cpuset_write_s64,
 		.private = FILE_PARTITION_ROOT,
 		.flags = CFTYPE_NOT_ON_ROOT,
-- 
2.18.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition
  2018-10-15 20:29 ` [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition Waiman Long
@ 2018-10-17 15:08   ` Tejun Heo
  2018-10-17 15:20     ` Waiman Long
  2018-10-19 18:56     ` Waiman Long
  2018-11-06 11:52   ` Peter Zijlstra
  1 sibling, 2 replies; 39+ messages in thread
From: Tejun Heo @ 2018-10-17 15:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Mon, Oct 15, 2018 at 04:29:37PM -0400, Waiman Long wrote:
> Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on
> read. A person who is not familiar with the partition code may not
> understand what they mean.
> 
> In order to make cpuset.sched.partition more user-friendly, it will
> now display the following descriptive text on read:
> 
>   "normal" - A normal cpuset, not a partition root
>   "partition" - A partition root
>   "partition invalid" - An invalid partition root
> 
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Waiman Long <longman@redhat.com>

Can you also make this consistent when writing to the file?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition
  2018-10-17 15:08   ` Tejun Heo
@ 2018-10-17 15:20     ` Waiman Long
  2018-10-19 18:56     ` Waiman Long
  1 sibling, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-17 15:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On 10/17/2018 11:08 AM, Tejun Heo wrote:
> On Mon, Oct 15, 2018 at 04:29:37PM -0400, Waiman Long wrote:
>> Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on
>> read. A person who is not familiar with the partition code may not
>> understand what they mean.
>>
>> In order to make cpuset.sched.partition more user-friendly, it will
>> now display the following descriptive text on read:
>>
>>   "normal" - A normal cpuset, not a partition root
>>   "partition" - A partition root
>>   "partition invalid" - An invalid partition root
>>
>> Suggested-by: Tejun Heo <tj@kernel.org>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Can you also make this consistent when writing to the file?
>
> Thanks.
>
Yes, sure.

-Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 03/12] cpuset: Simply allocation and freeing of cpumasks
  2018-10-15 20:29 ` [PATCH v14 03/12] cpuset: Simply allocation and freeing of cpumasks Waiman Long
@ 2018-10-19 15:28   ` Tom Hromatka
  0 siblings, 0 replies; 39+ messages in thread
From: Tom Hromatka @ 2018-10-19 15:28 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Li Zefan, Johannes Weiner,
	Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi



On 10/15/2018 02:29 PM, Waiman Long wrote:
> The previous commit introduces a new subparts_cpus mask into the cpuset
> data structure and a new tmpmasks structure.  Managing the allocation
> and freeing of those cpumasks is becoming more complex.
>
> So a number of helper functions are added to simplify and streamline
> the management of those cpumasks. To make it simple, all the cpumasks
> are now pre-cleared on allocation.
>
> Signed-off-by: Waiman Long <longman@redhat.com>

Thanks!

Reviewed-by: Tom Hromatka <tom.hromatka@oracle.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition
  2018-10-17 15:08   ` Tejun Heo
  2018-10-17 15:20     ` Waiman Long
@ 2018-10-19 18:56     ` Waiman Long
  2018-10-19 19:24       ` Tejun Heo
  1 sibling, 1 reply; 39+ messages in thread
From: Waiman Long @ 2018-10-19 18:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

[-- Attachment #1: Type: text/plain, Size: 779 bytes --]

On 10/17/2018 11:08 AM, Tejun Heo wrote:
> On Mon, Oct 15, 2018 at 04:29:37PM -0400, Waiman Long wrote:
>> Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on
>> read. A person who is not familiar with the partition code may not
>> understand what they mean.
>>
>> In order to make cpuset.sched.partition more user-friendly, it will
>> now display the following descriptive text on read:
>>
>>   "normal" - A normal cpuset, not a partition root
>>   "partition" - A partition root
>>   "partition invalid" - An invalid partition root
>>
>> Suggested-by: Tejun Heo <tj@kernel.org>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Can you also make this consistent when writing to the file?
>
> Thanks.
>
How about the attached patch instead?

Cheers,
Longman


[-- Attachment #2: v14-0012-cpuset-Use-descriptive-text-when-reading-writing.patch --]
[-- Type: text/x-patch, Size: 7676 bytes --]

From ab96b5d735102b0167a5e92f0aee92caa18a18ae Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Fri, 19 Oct 2018 11:46:41 -0400
Subject: [PATCH v14 12/12] cpuset: Use descriptive text when reading/writing
 cpuset.sched.partition

Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on
read. A person who is not familiar with the partition code may not
understand what they mean.

In order to make cpuset.sched.partition more user-friendly, it will
now display the following descriptive text on read:

  "normal" - A normal cpuset, not a partition root
  "partition" - A partition root
  "partition invalid" - An invalid partition root

The cpuset.sched.partition file will now also accept "normal" and
"partition" besides 1 and 0 as valid input values. The "partition
invalid" value is internal only and cannot be written to the file.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 50 +++++++++++++++-------------
 kernel/cgroup/cpuset.c                  | 58 +++++++++++++++++++++++++++++----
 2 files changed, 79 insertions(+), 29 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 178cda4..072f09d 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1688,18 +1688,23 @@ Cpuset Interface Files
 
   cpuset.sched.partition
 	A read-write single value file which exists on non-root
-	cpuset-enabled cgroups.  It accepts either "0" (off) or "1"
-	(on) when written to.  This flag is set and owned by the
+	cpuset-enabled cgroups.  This file is set and owned by the
 	parent cgroup.
 
-	If set, it indicates that the current cgroup is the root of a
-	new partition or scheduling domain that comprises itself and
+        It accepts only the following input values when written to.
+
+        "normal" or "0" - normal cpuset, not a partition root
+        "partition" or "1" - partition root
+
+	When set to a partition root, the current cgroup is the root of
+	a new partition or scheduling domain that comprises itself and
 	all its descendants except those that are separate partition
 	roots themselves and their descendants.  The root cgroup is
 	always a partition root.
 
-	There are constraints on where this flag can be set.  It can
-	only be set in a cgroup if all the following conditions are true.
+	There are constraints on where a partition root can be set.
+	It can only be set in a cgroup if all the following conditions
+	are true.
 
 	1) The "cpuset.cpus" is not empty and the list of CPUs are
 	   exclusive, i.e. they are not shared by any of its siblings.
@@ -1710,9 +1715,10 @@ Cpuset Interface Files
 	   eliminating corner cases that have to be handled if such a
 	   condition is allowed.
 
-	Setting this flag will take the CPUs away from the effective
-	CPUs of the parent cgroup.  Once it is set, this flag cannot
-	be cleared if there are any child cgroups with cpuset enabled.
+	Setting it to partition root will take the CPUs away from the
+	effective CPUs of the parent cgroup.  Once it is set, this
+	file cannot be reverted back to "normal" if there are any child
+	cgroups with cpuset enabled.
 
 	A parent partition cannot distribute all its CPUs to its
 	child partitions.  There must be at least one cpu left in the
@@ -1729,28 +1735,28 @@ Cpuset Interface Files
 	root to change.  On read, the "cpuset.sched.partition" file
 	can show the following values.
 
-	"0"  Not a partition root
-	"1"  Partition root
-	"-1" Erroneous partition root
+	"normal" - Normal cpuset, not a partition root
+	"partition" - Partition root
+	"partition invalid" - Invalid partition root
 
 	It is a partition root if the first 2 partition root conditions
 	above are true and at least one CPU from "cpuset.cpus" is
 	granted by the parent cgroup.
 
-	A partition root can become an erroneous partition root if none
-	of CPUs requested in "cpuset.cpus" can be granted by the parent
-	cgroup or the parent cgroup is no longer a partition root.
-	In this case, it is not a real partition even though the
+	A partition root can become an invalid partition root if none
+	of the CPUs requested in "cpuset.cpus" can be granted by the
+	parent cgroup or the parent cgroup is no longer a partition
+	root.  In this case, it is not a real partition even though the
 	restriction of the first partition root condition above will
 	still apply.  All the tasks in the cgroup will be migrated to
 	the nearest ancestor partition.
 
-	An erroneous partition root can be transitioned back to a real
-	partition root if at least one of the requested CPUs can now be
-	granted by its parent.	In this case, the tasks will be migrated
-	back to the newly created partition.  Clearing the partition
-	flag of an erroneous partition root is always allowed even if
-	child cpusets are present.
+	An invalid partition root can be transitioned back to a real
+	partition root if at least one of the requested CPUs can now
+	be granted by its parent.  In this case, the tasks will be
+	migrated back to the newly created partition.  Converting an
+	invalid partition root back to "normal" is always allowed even
+	if child cpusets are present.
 
 
 Device controller
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 34d47e4..d356205 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2274,9 +2274,6 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
 		retval = update_relax_domain_level(cs, val);
 		break;
-	case FILE_PARTITION_ROOT:
-		retval = update_prstate(cs, val);
-		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -2430,8 +2427,6 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	switch (type) {
 	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
 		return cs->relax_domain_level;
-	case FILE_PARTITION_ROOT:
-		return cs->partition_root_state;
 	default:
 		BUG();
 	}
@@ -2440,6 +2435,55 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	return 0;
 }
 
+static int sched_partition_show(struct seq_file *seq, void *v)
+{
+	struct cpuset *cs = css_cs(seq_css(seq));
+
+	switch (cs->partition_root_state) {
+	case PRS_ENABLED:
+		seq_puts(seq, "partition\n");
+		break;
+	case PRS_DISABLED:
+		seq_puts(seq, "normal\n");
+		break;
+	case PRS_ERROR:
+		seq_puts(seq, "partition invalid\n");
+		break;
+	}
+	return 0;
+}
+
+static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
+				     size_t nbytes, loff_t off)
+{
+	struct cpuset *cs = css_cs(of_css(of));
+	int val;
+	int retval = -ENODEV;
+
+	buf = strstrip(buf);
+
+	/*
+	 * Convert "partition"/"1" to 1, and convert "normal"/"0" to 0.
+	 */
+	if (!strcmp(buf, "partition") || !strcmp(buf, "1"))
+		val = PRS_ENABLED;
+	else if (!strcmp(buf, "normal") || !strcmp(buf, "0"))
+		val = PRS_DISABLED;
+	else
+		return -EINVAL;
+
+	css_get(&cs->css);
+	mutex_lock(&cpuset_mutex);
+	if (!is_cpuset_online(cs))
+		goto out_unlock;
+
+	retval = update_prstate(cs, val);
+out_unlock:
+	mutex_unlock(&cpuset_mutex);
+	css_put(&cs->css);
+	return retval ?: nbytes;
+}
+
 /*
  * for the common functions, 'private' gives the type of file
  */
@@ -2583,8 +2627,8 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 
 	{
 		.name = "sched.partition",
-		.read_s64 = cpuset_read_s64,
-		.write_s64 = cpuset_write_s64,
+		.seq_show = sched_partition_show,
+		.write = sched_partition_write,
 		.private = FILE_PARTITION_ROOT,
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition
  2018-10-19 18:56     ` Waiman Long
@ 2018-10-19 19:24       ` Tejun Heo
  2018-10-19 19:32         ` Waiman Long
  2018-11-02 14:34         ` Waiman Long
  0 siblings, 2 replies; 39+ messages in thread
From: Tejun Heo @ 2018-10-19 19:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Fri, Oct 19, 2018 at 02:56:13PM -0400, Waiman Long wrote:
> On 10/17/2018 11:08 AM, Tejun Heo wrote:
> > On Mon, Oct 15, 2018 at 04:29:37PM -0400, Waiman Long wrote:
> >> Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on
> >> read. A person who is not familiar with the partition code may not
> >> understand what they mean.
> >>
> >> In order to make cpuset.sched.partition more user-friendly, it will
> >> now display the following descriptive text on read:
> >>
> >>   "normal" - A normal cpuset, not a partition root
> >>   "partition" - A partition root
> >>   "partition invalid" - An invalid partition root
> >>
> >> Suggested-by: Tejun Heo <tj@kernel.org>
> >> Signed-off-by: Waiman Long <longman@redhat.com>
> > Can you also make this consistent when writing to the file?
> >
> > Thanks.
> >
> How about the attached patch instead?

Looks good to me.  Once Peter is okay with it, I'll roll it into
cgroup devel branch.  As v4.19 is almost done, I think it makes more
sense to target the next merge window (v4.21).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition
  2018-10-19 19:24       ` Tejun Heo
@ 2018-10-19 19:32         ` Waiman Long
  2018-11-02 14:34         ` Waiman Long
  1 sibling, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-10-19 19:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On 10/19/2018 03:24 PM, Tejun Heo wrote:
> On Fri, Oct 19, 2018 at 02:56:13PM -0400, Waiman Long wrote:
>> On 10/17/2018 11:08 AM, Tejun Heo wrote:
>>> On Mon, Oct 15, 2018 at 04:29:37PM -0400, Waiman Long wrote:
>>>> Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on
>>>> read. A person who is not familiar with the partition code may not
>>>> understand what they mean.
>>>>
>>>> In order to make cpuset.sched.partition more user-friendly, it will
>>>> now display the following descriptive text on read:
>>>>
>>>>   "normal" - A normal cpuset, not a partition root
>>>>   "partition" - A partition root
>>>>   "partition invalid" - An invalid partition root
>>>>
>>>> Suggested-by: Tejun Heo <tj@kernel.org>
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> Can you also make this consistent when writing to the file?
>>>
>>> Thanks.
>>>
>> How about the attached patch instead?
> Looks good to me.  Once Peter is okay with it, I'll roll it into
> cgroup devel branch.  As v4.19 is almost done, I think it makes more
> sense to target the next merge window (v4.21).
>
> Thanks.
>
v4.21 is fine. Thanks for valuable help and suggestions.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition
  2018-10-19 19:24       ` Tejun Heo
  2018-10-19 19:32         ` Waiman Long
@ 2018-11-02 14:34         ` Waiman Long
  1 sibling, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-11-02 14:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On 10/19/2018 03:24 PM, Tejun Heo wrote:
> On Fri, Oct 19, 2018 at 02:56:13PM -0400, Waiman Long wrote:
>> On 10/17/2018 11:08 AM, Tejun Heo wrote:
>>> On Mon, Oct 15, 2018 at 04:29:37PM -0400, Waiman Long wrote:
>>>> Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on
>>>> read. A person who is not familiar with the partition code may not
>>>> understand what they mean.
>>>>
>>>> In order to make cpuset.sched.partition more user-friendly, it will
>>>> now display the following descriptive text on read:
>>>>
>>>>   "normal" - A normal cpuset, not a partition root
>>>>   "partition" - A partition root
>>>>   "partition invalid" - An invalid partition root
>>>>
>>>> Suggested-by: Tejun Heo <tj@kernel.org>
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> Can you also make this consistent when writing to the file?
>>>
>>> Thanks.
>>>
>> How about the attached patch instead?
> Looks good to me.  Once Peter is okay with it, I'll roll it into
> cgroup devel branch.  As v4.19 is almost done, I think it makes more
> sense to target the next merge window (v4.21).
>
> Thanks.
>
Peter, are you OK with the current cpuset v2 patch which does allow
hierarchical partitions as you have requested?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 00/12] Enable cpuset controller in default hierarchy
  2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
                   ` (11 preceding siblings ...)
  2018-10-15 20:29 ` [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition Waiman Long
@ 2018-11-05 16:36 ` Tejun Heo
  2018-11-05 16:57   ` Peter Zijlstra
  2018-11-06 11:53   ` Peter Zijlstra
  12 siblings, 2 replies; 39+ messages in thread
From: Tejun Heo @ 2018-11-05 16:36 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

Hello,

So, this looks good to me.  Peter, I'm gonna roll the series into
cgroup/for-4.21-cpuset.  Please holler if you have any objections /
comments.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 00/12] Enable cpuset controller in default hierarchy
  2018-11-05 16:36 ` [PATCH v14 00/12] Enable cpuset controller in default hierarchy Tejun Heo
@ 2018-11-05 16:57   ` Peter Zijlstra
  2018-11-06 11:53   ` Peter Zijlstra
  1 sibling, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2018-11-05 16:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Mon, Nov 05, 2018 at 08:36:56AM -0800, Tejun Heo wrote:
> Hello,
> 
> So, this looks good to me.  Peter, I'm gonna roll the series into
> cgroup/for-4.21-cpuset.  Please holler if you have any objections /
> comments.

I'll try and have a look before LPC. Thanks!

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 04/12] cpuset: Add new v2 cpuset.sched.partition flag
  2018-10-15 20:29 ` [PATCH v14 04/12] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
@ 2018-11-06 11:35   ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2018-11-06 11:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Mon, Oct 15, 2018 at 04:29:29PM -0400, Waiman Long wrote:
> + * Called with cpuset_mutex held.
> + */
> +static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
> +					  struct cpumask *newmask,
> +					  struct tmpmasks *tmp)
> +{
> +	struct cpuset *parent = parent_cs(cpuset);
> +	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
> +	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
> +

I much prefer doing away with that "called with * held" comment crud and
add:

	lockdep_assert_held(&cpuset_mutex);



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition
  2018-10-15 20:29 ` [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition Waiman Long
@ 2018-11-06 11:37   ` Peter Zijlstra
  2018-11-06 14:17     ` Waiman Long
  2018-11-06 11:40   ` Peter Zijlstra
  2018-11-06 11:40   ` Peter Zijlstra
  2 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2018-11-06 11:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Mon, Oct 15, 2018 at 04:29:30PM -0400, Waiman Long wrote:
> Handling error returned by update_parent_subparts_cpumask() in
> update_cpumasks_hier() is problematic as the states may become
> inconsistent. To avoid that and increase flexibility in handling other
> error cases, a new error state (-1) is added to the partition_root_state
> flag. This new error state is set internally and user cannot write this
> value to "cpuset.sched.partition".
> 
> In this error state, the partition root becomes an erroneous one. It is
> no longer a real partition root, but the CS_CPU_EXCLUSIVE flag will
> still be set as it can be changed back to a real one if appropriate
> change happens later on.

I feel this Changelog should be much more explicit about the reasons
this -1 state can happen. As is, I've got no clue.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition
  2018-10-15 20:29 ` [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition Waiman Long
  2018-11-06 11:37   ` Peter Zijlstra
@ 2018-11-06 11:40   ` Peter Zijlstra
  2018-11-07 23:13     ` Waiman Long
  2018-11-06 11:40   ` Peter Zijlstra
  2 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2018-11-06 11:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Mon, Oct 15, 2018 at 04:29:30PM -0400, Waiman Long wrote:
> +				spin_lock_irq(&callback_lock);
> +				clear_bit(CS_CPU_EXCLUSIVE, &cp->flags);
> +				spin_unlock_irq(&callback_lock);

A single atomic bitop wrapped in a spinlock..  maybe it is correct, but
it looks suspicous.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition
  2018-10-15 20:29 ` [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition Waiman Long
  2018-11-06 11:37   ` Peter Zijlstra
  2018-11-06 11:40   ` Peter Zijlstra
@ 2018-11-06 11:40   ` Peter Zijlstra
  2 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2018-11-06 11:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Mon, Oct 15, 2018 at 04:29:30PM -0400, Waiman Long wrote:
> +			case PRS_ENABLED:
> +				if (update_parent_subparts_cpumask(cp,
> +					partcmd_update, NULL, tmp))
> +					update_tasks_cpumask(parent);
> +				break;

That code indenting had me go WTF a few times.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 10/12] cpuset: Add documentation about the new "cpuset.sched.partition" flag
  2018-10-15 20:29 ` [PATCH v14 10/12] cpuset: Add documentation about the new "cpuset.sched.partition" flag Waiman Long
@ 2018-11-06 11:50   ` Peter Zijlstra
  2018-11-06 14:09     ` Waiman Long
  2018-11-07 22:58     ` Waiman Long
  0 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2018-11-06 11:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Mon, Oct 15, 2018 at 04:29:35PM -0400, Waiman Long wrote:
> The cgroup-v2.rst file is updated to document the purpose of the new
> "cpuset.sched.partition" flag and how its usage.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 66 +++++++++++++++++++++++++
>  1 file changed, 66 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 533e85cb851b..178cda473a26 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1686,6 +1686,72 @@ Cpuset Interface Files
>  
>  	Its value will be affected by memory nodes hotplug events.
>  
> +  cpuset.sched.partition
> +	A read-write single value file which exists on non-root
> +	cpuset-enabled cgroups.  It accepts either "0" (off) or "1"
> +	(on) when written to.  

> +     This flag is set and owned by the
> +	parent cgroup.

What does that mean? The parent cgroup doesn't 'set' anything at all.
The user will.

> +
> +	If set, it indicates that the current cgroup is the root of a
> +	new partition or scheduling domain that comprises itself and
> +	all its descendants except those that are separate partition
> +	roots themselves and their descendants.  The root cgroup is
> +	always a partition root.
> +
> +	There are constraints on where this flag can be set.  It can
> +	only be set in a cgroup if all the following conditions are true.
> +
> +	1) The "cpuset.cpus" is not empty and the list of CPUs are
> +	   exclusive, i.e. they are not shared by any of its siblings.
> +	2) The parent cgroup is a partition root.
> +	3) The "cpuset.cpus" is also a proper subset of the parent's
> +	   "cpuset.cpus.effective".
> +	4) There is no child cgroups with cpuset enabled.  This is for
> +	   eliminating corner cases that have to be handled if such a
> +	   condition is allowed.
> +
> +	Setting this flag will take the CPUs away from the effective
> +	CPUs of the parent cgroup.  Once it is set, this flag cannot
> +	be cleared if there are any child cgroups with cpuset enabled.
> +
> +	A parent partition cannot distribute all its CPUs to its
> +	child partitions.  There must be at least one cpu left in the
> +	parent partition.
> +
> +	Once becoming a partition root, changes to "cpuset.cpus" is
> +	generally allowed as long as the first condition above is true,
> +	the change will not take away all the CPUs from the parent
> +	partition and the new "cpuset.cpus" value is a superset of its
> +	children's "cpuset.cpus" values.

> +	Sometimes, external factors like changes to ancestors'
> +	"cpuset.cpus" or cpu hotplug can cause the state of the partition
> +	root to change.  On read, the "cpuset.sched.partition" file
> +	can show the following values.

Are those the only conditions under which that -1 can happen? Parent
taking away CPUs it previously granted and hotplug?

> +
> +	"0"  Not a partition root
> +	"1"  Partition root
> +	"-1" Erroneous partition root
> +
> +	It is a partition root if the first 2 partition root conditions
> +	above are true and at least one CPU from "cpuset.cpus" is
> +	granted by the parent cgroup.
> +
> +	A partition root can become an erroneous partition root if none
> +	of CPUs requested in "cpuset.cpus" can be granted by the parent
> +	cgroup or the parent cgroup is no longer a partition root.
> +	In this case, it is not a real partition even though the
> +	restriction of the first partition root condition above will
> +	still apply.  All the tasks in the cgroup will be migrated to
> +	the nearest ancestor partition.

Effectively or actual? Actual migrating tasks out of the cgroup is
irreversible.

> +	An erroneous partition root can be transitioned back to a real
> +	partition root if at least one of the requested CPUs can now be
> +	granted by its parent.	In this case, the tasks will be migrated
> +	back to the newly created partition.  Clearing the partition
> +	flag of an erroneous partition root is always allowed even if
> +	child cpusets are present.

So you need to clarify the above point (I think it is effectively),
because otherwise you don't know which tasks to put back.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition
  2018-10-15 20:29 ` [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition Waiman Long
  2018-10-17 15:08   ` Tejun Heo
@ 2018-11-06 11:52   ` Peter Zijlstra
  1 sibling, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2018-11-06 11:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Mon, Oct 15, 2018 at 04:29:37PM -0400, Waiman Long wrote:
> Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on
> read. A person who is not familiar with the partition code may not
> understand what they mean.
> 
> In order to make cpuset.sched.partition more user-friendly, it will
> now display the following descriptive text on read:
> 
>   "normal" - A normal cpuset, not a partition root
>   "partition" - A partition root
>   "partition invalid" - An invalid partition root

As a person who is familiar with it: "normal" doesn't make sense.

It either is or is not a parition. normal doesn't some into it.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 00/12] Enable cpuset controller in default hierarchy
  2018-11-05 16:36 ` [PATCH v14 00/12] Enable cpuset controller in default hierarchy Tejun Heo
  2018-11-05 16:57   ` Peter Zijlstra
@ 2018-11-06 11:53   ` Peter Zijlstra
  2018-11-06 11:55     ` Peter Zijlstra
  1 sibling, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2018-11-06 11:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Mon, Nov 05, 2018 at 08:36:56AM -0800, Tejun Heo wrote:
> Hello,
> 
> So, this looks good to me.  Peter, I'm gonna roll the series into
> cgroup/for-4.21-cpuset.  Please holler if you have any objections /
> comments.

Mostly nits, except I think the interface for cpuset.sched.partition
needs to find another term for "not a parition", "normal" really doesn't
work for me.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 00/12] Enable cpuset controller in default hierarchy
  2018-11-06 11:53   ` Peter Zijlstra
@ 2018-11-06 11:55     ` Peter Zijlstra
  2018-11-06 14:06       ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2018-11-06 11:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Tue, Nov 06, 2018 at 12:53:35PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 05, 2018 at 08:36:56AM -0800, Tejun Heo wrote:
> > Hello,
> > 
> > So, this looks good to me.  Peter, I'm gonna roll the series into
> > cgroup/for-4.21-cpuset.  Please holler if you have any objections /
> > comments.
> 
> Mostly nits, except I think the interface for cpuset.sched.partition
> needs to find another term for "not a parition", "normal" really doesn't
> work for me.

cpuset.sched.partition: "true", "false", "error"

makes more sense to me.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 00/12] Enable cpuset controller in default hierarchy
  2018-11-06 11:55     ` Peter Zijlstra
@ 2018-11-06 14:06       ` Waiman Long
       [not found]         ` <CAOS58YPye=7Ga+y-ujFsgHqo6vdVnjykmON1z+UjNQLvvM_g4w@mail.gmail.com>
  2018-11-07 21:32         ` Tejun Heo
  0 siblings, 2 replies; 39+ messages in thread
From: Waiman Long @ 2018-11-06 14:06 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Ingo Molnar, cgroups, linux-kernel,
	linux-doc, kernel-team, pjt, luto, Mike Galbraith, torvalds,
	Roman Gushchin, Juri Lelli, Patrick Bellasi, Tom Hromatka

On 11/06/2018 06:55 AM, Peter Zijlstra wrote:
> On Tue, Nov 06, 2018 at 12:53:35PM +0100, Peter Zijlstra wrote:
>> On Mon, Nov 05, 2018 at 08:36:56AM -0800, Tejun Heo wrote:
>>> Hello,
>>>
>>> So, this looks good to me.  Peter, I'm gonna roll the series into
>>> cgroup/for-4.21-cpuset.  Please holler if you have any objections /
>>> comments.
>> Mostly nits, except I think the interface for cpuset.sched.partition
>> needs to find another term for "not a parition", "normal" really doesn't
>> work for me.
> cpuset.sched.partition: "true", "false", "error"
>
> makes more sense to me.

The terms "true", "false", "error" look good to me. I can change the
patchset to use those if Tejun has no objection. I will fix the other
nits you have as well.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 10/12] cpuset: Add documentation about the new "cpuset.sched.partition" flag
  2018-11-06 11:50   ` Peter Zijlstra
@ 2018-11-06 14:09     ` Waiman Long
  2018-11-07 22:58     ` Waiman Long
  1 sibling, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-11-06 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On 11/06/2018 06:50 AM, Peter Zijlstra wrote:
> On Mon, Oct 15, 2018 at 04:29:35PM -0400, Waiman Long wrote:
>> The cgroup-v2.rst file is updated to document the purpose of the new
>> "cpuset.sched.partition" flag and how its usage.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  Documentation/admin-guide/cgroup-v2.rst | 66 +++++++++++++++++++++++++
>>  1 file changed, 66 insertions(+)
>>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index 533e85cb851b..178cda473a26 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -1686,6 +1686,72 @@ Cpuset Interface Files
>>  
>>  	Its value will be affected by memory nodes hotplug events.
>>  
>> +  cpuset.sched.partition
>> +	A read-write single value file which exists on non-root
>> +	cpuset-enabled cgroups.  It accepts either "0" (off) or "1"
>> +	(on) when written to.  
>> +     This flag is set and owned by the
>> +	parent cgroup.
> What does that mean? The parent cgroup doesn't 'set' anything at all.
> The user will.
>
>> +
>> +	If set, it indicates that the current cgroup is the root of a
>> +	new partition or scheduling domain that comprises itself and
>> +	all its descendants except those that are separate partition
>> +	roots themselves and their descendants.  The root cgroup is
>> +	always a partition root.
>> +
>> +	There are constraints on where this flag can be set.  It can
>> +	only be set in a cgroup if all the following conditions are true.
>> +
>> +	1) The "cpuset.cpus" is not empty and the list of CPUs are
>> +	   exclusive, i.e. they are not shared by any of its siblings.
>> +	2) The parent cgroup is a partition root.
>> +	3) The "cpuset.cpus" is also a proper subset of the parent's
>> +	   "cpuset.cpus.effective".
>> +	4) There is no child cgroups with cpuset enabled.  This is for
>> +	   eliminating corner cases that have to be handled if such a
>> +	   condition is allowed.
>> +
>> +	Setting this flag will take the CPUs away from the effective
>> +	CPUs of the parent cgroup.  Once it is set, this flag cannot
>> +	be cleared if there are any child cgroups with cpuset enabled.
>> +
>> +	A parent partition cannot distribute all its CPUs to its
>> +	child partitions.  There must be at least one cpu left in the
>> +	parent partition.
>> +
>> +	Once becoming a partition root, changes to "cpuset.cpus" is
>> +	generally allowed as long as the first condition above is true,
>> +	the change will not take away all the CPUs from the parent
>> +	partition and the new "cpuset.cpus" value is a superset of its
>> +	children's "cpuset.cpus" values.
>> +	Sometimes, external factors like changes to ancestors'
>> +	"cpuset.cpus" or cpu hotplug can cause the state of the partition
>> +	root to change.  On read, the "cpuset.sched.partition" file
>> +	can show the following values.
> Are those the only conditions under which that -1 can happen? Parent
> taking away CPUs it previously granted and hotplug?
>
>> +
>> +	"0"  Not a partition root
>> +	"1"  Partition root
>> +	"-1" Erroneous partition root
>> +
>> +	It is a partition root if the first 2 partition root conditions
>> +	above are true and at least one CPU from "cpuset.cpus" is
>> +	granted by the parent cgroup.
>> +
>> +	A partition root can become an erroneous partition root if none
>> +	of CPUs requested in "cpuset.cpus" can be granted by the parent
>> +	cgroup or the parent cgroup is no longer a partition root.
>> +	In this case, it is not a real partition even though the
>> +	restriction of the first partition root condition above will
>> +	still apply.  All the tasks in the cgroup will be migrated to
>> +	the nearest ancestor partition.
> Effectively or actual? Actual migrating tasks out of the cgroup is
> irreversible.

I should have reworded it to emphasize that this flag is not delegatable
to the child cpuset.

>> +	An erroneous partition root can be transitioned back to a real
>> +	partition root if at least one of the requested CPUs can now be
>> +	granted by its parent.	In this case, the tasks will be migrated
>> +	back to the newly created partition.  Clearing the partition
>> +	flag of an erroneous partition root is always allowed even if
>> +	child cpusets are present.
> So you need to clarify the above point (I think it is effectively),
> because otherwise you don't know which tasks to put back.

OK, will do that.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 00/12] Enable cpuset controller in default hierarchy
       [not found]         ` <CAOS58YPye=7Ga+y-ujFsgHqo6vdVnjykmON1z+UjNQLvvM_g4w@mail.gmail.com>
@ 2018-11-06 14:11           ` Tejun Heo
  0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2018-11-06 14:11 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

Hello, again, guys.

On Tue, Nov 06, 2018 at 09:09:07AM -0500, Tejun Heo wrote:
> Hello, Waiman, Peter.
> 
> Let's skip this patch for now. I'll think more about the interface and
> adapt your patch later.

Sorry about the last message.  I thought I configured web interface
but obviously didn't and it went out with html subpart.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition
  2018-11-06 11:37   ` Peter Zijlstra
@ 2018-11-06 14:17     ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-11-06 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On 11/06/2018 06:37 AM, Peter Zijlstra wrote:
> On Mon, Oct 15, 2018 at 04:29:30PM -0400, Waiman Long wrote:
>> Handling error returned by update_parent_subparts_cpumask() in
>> update_cpumasks_hier() is problematic as the states may become
>> inconsistent. To avoid that and increase flexibility in handling other
>> error cases, a new error state (-1) is added to the partition_root_state
>> flag. This new error state is set internally and user cannot write this
>> value to "cpuset.sched.partition".
>>
>> In this error state, the partition root becomes an erroneous one. It is
>> no longer a real partition root, but the CS_CPU_EXCLUSIVE flag will
>> still be set as it can be changed back to a real one if appropriate
>> change happens later on.
> I feel this Changelog should be much more explicit about the reasons
> this -1 state can happen. As is, I've got no clue.

OK, I will elaborate more on this changelog.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 00/12] Enable cpuset controller in default hierarchy
  2018-11-06 14:06       ` Waiman Long
       [not found]         ` <CAOS58YPye=7Ga+y-ujFsgHqo6vdVnjykmON1z+UjNQLvvM_g4w@mail.gmail.com>
@ 2018-11-07 21:32         ` Tejun Heo
  2018-11-07 21:52           ` Waiman Long
  2018-11-08  9:41           ` Peter Zijlstra
  1 sibling, 2 replies; 39+ messages in thread
From: Tejun Heo @ 2018-11-07 21:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

Hello,

On Tue, Nov 06, 2018 at 09:06:01AM -0500, Waiman Long wrote:
> On 11/06/2018 06:55 AM, Peter Zijlstra wrote:
> > On Tue, Nov 06, 2018 at 12:53:35PM +0100, Peter Zijlstra wrote:
> >> On Mon, Nov 05, 2018 at 08:36:56AM -0800, Tejun Heo wrote:
> >>> Hello,
> >>>
> >>> So, this looks good to me.  Peter, I'm gonna roll the series into
> >>> cgroup/for-4.21-cpuset.  Please holler if you have any objections /
> >>> comments.
> >> Mostly nits, except I think the interface for cpuset.sched.partition
> >> needs to find another term for "not a parition", "normal" really doesn't
> >> work for me.
> > cpuset.sched.partition: "true", "false", "error"
> >
> > makes more sense to me.
> 
> The terms "true", "false", "error" look good to me. I can change the
> patchset to use those if Tejun has no objection. I will fix the other
> nits you have as well.

How about the following?

Interface file is named "cpuset.cpus.partition" as it's a partition of
cpus.  The writeable keys are "root" and "member" - a partition root
and a partition member.  When "root" is requested but can't be
satisfied it shows "root invalid".  This seems pretty consistent with
cgroup.type and other cpuset files.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 00/12] Enable cpuset controller in default hierarchy
  2018-11-07 21:32         ` Tejun Heo
@ 2018-11-07 21:52           ` Waiman Long
  2018-11-08  9:41           ` Peter Zijlstra
  1 sibling, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-11-07 21:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On 11/07/2018 04:32 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Nov 06, 2018 at 09:06:01AM -0500, Waiman Long wrote:
>> On 11/06/2018 06:55 AM, Peter Zijlstra wrote:
>>> On Tue, Nov 06, 2018 at 12:53:35PM +0100, Peter Zijlstra wrote:
>>>> On Mon, Nov 05, 2018 at 08:36:56AM -0800, Tejun Heo wrote:
>>>>> Hello,
>>>>>
>>>>> So, this looks good to me.  Peter, I'm gonna roll the series into
>>>>> cgroup/for-4.21-cpuset.  Please holler if you have any objections /
>>>>> comments.
>>>> Mostly nits, except I think the interface for cpuset.sched.partition
>>>> needs to find another term for "not a parition", "normal" really doesn't
>>>> work for me.
>>> cpuset.sched.partition: "true", "false", "error"
>>>
>>> makes more sense to me.
>> The terms "true", "false", "error" look good to me. I can change the
>> patchset to use those if Tejun has no objection. I will fix the other
>> nits you have as well.
> How about the following?
>
> Interface file is named "cpuset.cpus.partition" as it's a partition of
> cpus.  The writeable keys are "root" and "member" - a partition root
> and a partition member.  When "root" is requested but can't be
> satisfied it shows "root invalid".  This seems pretty consistent with
> cgroup.type and other cpuset files.
>
> Thanks.
>
Yes, that looks good. I will revise my patchset to include those changes
unless Peter has other better idea.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 10/12] cpuset: Add documentation about the new "cpuset.sched.partition" flag
  2018-11-06 11:50   ` Peter Zijlstra
  2018-11-06 14:09     ` Waiman Long
@ 2018-11-07 22:58     ` Waiman Long
  1 sibling, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-11-07 22:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On 11/06/2018 06:50 AM, Peter Zijlstra wrote:
> On Mon, Oct 15, 2018 at 04:29:35PM -0400, Waiman Long wrote:
>> The cgroup-v2.rst file is updated to document the purpose of the new
>> "cpuset.sched.partition" flag and how its usage.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  Documentation/admin-guide/cgroup-v2.rst | 66 +++++++++++++++++++++++++
>>  1 file changed, 66 insertions(+)
>>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index 533e85cb851b..178cda473a26 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -1686,6 +1686,72 @@ Cpuset Interface Files
>>  
>>  	Its value will be affected by memory nodes hotplug events.
>>  
>> +  cpuset.sched.partition
>> +	A read-write single value file which exists on non-root
>> +	cpuset-enabled cgroups.  It accepts either "0" (off) or "1"
>> +	(on) when written to.  
>> +     This flag is set and owned by the
>> +	parent cgroup.
> What does that mean? The parent cgroup doesn't 'set' anything at all.
> The user will.
>
>> +
>> +	If set, it indicates that the current cgroup is the root of a
>> +	new partition or scheduling domain that comprises itself and
>> +	all its descendants except those that are separate partition
>> +	roots themselves and their descendants.  The root cgroup is
>> +	always a partition root.
>> +
>> +	There are constraints on where this flag can be set.  It can
>> +	only be set in a cgroup if all the following conditions are true.
>> +
>> +	1) The "cpuset.cpus" is not empty and the list of CPUs are
>> +	   exclusive, i.e. they are not shared by any of its siblings.
>> +	2) The parent cgroup is a partition root.
>> +	3) The "cpuset.cpus" is also a proper subset of the parent's
>> +	   "cpuset.cpus.effective".
>> +	4) There is no child cgroups with cpuset enabled.  This is for
>> +	   eliminating corner cases that have to be handled if such a
>> +	   condition is allowed.
>> +
>> +	Setting this flag will take the CPUs away from the effective
>> +	CPUs of the parent cgroup.  Once it is set, this flag cannot
>> +	be cleared if there are any child cgroups with cpuset enabled.
>> +
>> +	A parent partition cannot distribute all its CPUs to its
>> +	child partitions.  There must be at least one cpu left in the
>> +	parent partition.
>> +
>> +	Once becoming a partition root, changes to "cpuset.cpus" is
>> +	generally allowed as long as the first condition above is true,
>> +	the change will not take away all the CPUs from the parent
>> +	partition and the new "cpuset.cpus" value is a superset of its
>> +	children's "cpuset.cpus" values.
>> +	Sometimes, external factors like changes to ancestors'
>> +	"cpuset.cpus" or cpu hotplug can cause the state of the partition
>> +	root to change.  On read, the "cpuset.sched.partition" file
>> +	can show the following values.
> Are those the only conditions under which that -1 can happen? Parent
> taking away CPUs it previously granted and hotplug?

Yes, if none of the cpus are available. It will become invalid. It still
remains a partition (a partial one) if at least one cpu is allocated to
that partition.

>> +
>> +	"0"  Not a partition root
>> +	"1"  Partition root
>> +	"-1" Erroneous partition root
>> +
>> +	It is a partition root if the first 2 partition root conditions
>> +	above are true and at least one CPU from "cpuset.cpus" is
>> +	granted by the parent cgroup.
>> +
>> +	A partition root can become an erroneous partition root if none
>> +	of CPUs requested in "cpuset.cpus" can be granted by the parent
>> +	cgroup or the parent cgroup is no longer a partition root.
>> +	In this case, it is not a real partition even though the
>> +	restriction of the first partition root condition above will
>> +	still apply.  All the tasks in the cgroup will be migrated to
>> +	the nearest ancestor partition.
> Effectively or actual? Actual migrating tasks out of the cgroup is
> irreversible.

I am not talking about actual migration to a different cgroup. I am
talking moving to a different partition. When a partition become
invalid, it will inherent the partition parent's effective cpumask.

>> +	An erroneous partition root can be transitioned back to a real
>> +	partition root if at least one of the requested CPUs can now be
>> +	granted by its parent.	In this case, the tasks will be migrated
>> +	back to the newly created partition.  Clearing the partition
>> +	flag of an erroneous partition root is always allowed even if
>> +	child cpusets are present.
> So you need to clarify the above point (I think it is effectively),
> because otherwise you don't know which tasks to put back.

I will clarify that.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition
  2018-11-06 11:40   ` Peter Zijlstra
@ 2018-11-07 23:13     ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2018-11-07 23:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On 11/06/2018 06:40 AM, Peter Zijlstra wrote:
> On Mon, Oct 15, 2018 at 04:29:30PM -0400, Waiman Long wrote:
>> +				spin_lock_irq(&callback_lock);
>> +				clear_bit(CS_CPU_EXCLUSIVE, &cp->flags);
>> +				spin_unlock_irq(&callback_lock);
> A single atomic bitop wrapped in a spinlock..  maybe it is correct, but
> it looks suspicous.

Yes, you are right. The spin_lock() should not be needed in this case.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14 00/12] Enable cpuset controller in default hierarchy
  2018-11-07 21:32         ` Tejun Heo
  2018-11-07 21:52           ` Waiman Long
@ 2018-11-08  9:41           ` Peter Zijlstra
  1 sibling, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2018-11-08  9:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Tom Hromatka

On Wed, Nov 07, 2018 at 01:32:29PM -0800, Tejun Heo wrote:

> How about the following?
> 
> Interface file is named "cpuset.cpus.partition" as it's a partition of
> cpus.  The writeable keys are "root" and "member" - a partition root
> and a partition member.  When "root" is requested but can't be
> satisfied it shows "root invalid".  This seems pretty consistent with
> cgroup.type and other cpuset files.

I can live with that, thanks!

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2018-11-08  9:41 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-15 20:29 [PATCH v14 00/12] Enable cpuset controller in default hierarchy Waiman Long
2018-10-15 20:29 ` [PATCH v14 01/12] cpuset: " Waiman Long
2018-10-15 20:29 ` [PATCH v14 02/12] cpuset: Define data structures to support scheduling partition Waiman Long
2018-10-15 20:29 ` [PATCH v14 03/12] cpuset: Simply allocation and freeing of cpumasks Waiman Long
2018-10-19 15:28   ` Tom Hromatka
2018-10-15 20:29 ` [PATCH v14 04/12] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
2018-11-06 11:35   ` Peter Zijlstra
2018-10-15 20:29 ` [PATCH v14 05/12] cpuset: Add an error state to cpuset.sched.partition Waiman Long
2018-11-06 11:37   ` Peter Zijlstra
2018-11-06 14:17     ` Waiman Long
2018-11-06 11:40   ` Peter Zijlstra
2018-11-07 23:13     ` Waiman Long
2018-11-06 11:40   ` Peter Zijlstra
2018-10-15 20:29 ` [PATCH v14 06/12] cpuset: Track cpusets that use parent's effective_cpus Waiman Long
2018-10-15 20:29 ` [PATCH v14 07/12] cpuset: Make CPU hotplug work with partition Waiman Long
2018-10-15 20:29 ` [PATCH v14 08/12] cpuset: Make generate_sched_domains() " Waiman Long
2018-10-15 20:29 ` [PATCH v14 09/12] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
2018-10-15 20:29 ` [PATCH v14 10/12] cpuset: Add documentation about the new "cpuset.sched.partition" flag Waiman Long
2018-11-06 11:50   ` Peter Zijlstra
2018-11-06 14:09     ` Waiman Long
2018-11-07 22:58     ` Waiman Long
2018-10-15 20:29 ` [PATCH v14 11/12] cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug Waiman Long
2018-10-15 20:29 ` [PATCH v14 12/12] cpuset: Show descriptive text when reading cpuset.sched.partition Waiman Long
2018-10-17 15:08   ` Tejun Heo
2018-10-17 15:20     ` Waiman Long
2018-10-19 18:56     ` Waiman Long
2018-10-19 19:24       ` Tejun Heo
2018-10-19 19:32         ` Waiman Long
2018-11-02 14:34         ` Waiman Long
2018-11-06 11:52   ` Peter Zijlstra
2018-11-05 16:36 ` [PATCH v14 00/12] Enable cpuset controller in default hierarchy Tejun Heo
2018-11-05 16:57   ` Peter Zijlstra
2018-11-06 11:53   ` Peter Zijlstra
2018-11-06 11:55     ` Peter Zijlstra
2018-11-06 14:06       ` Waiman Long
     [not found]         ` <CAOS58YPye=7Ga+y-ujFsgHqo6vdVnjykmON1z+UjNQLvvM_g4w@mail.gmail.com>
2018-11-06 14:11           ` Tejun Heo
2018-11-07 21:32         ` Tejun Heo
2018-11-07 21:52           ` Waiman Long
2018-11-08  9:41           ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).