* [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy
@ 2018-10-12 17:55 Waiman Long
  2018-10-12 17:55 ` [PATCH v13 01/11] " Waiman Long
                   ` (11 more replies)
  0 siblings, 12 replies; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

v13:
 - A major rewrite of the partition code so that partition roots are no
   longer turned off automatically. Instead, a partition root can enter
   an error state and be restored to a partition root later on.
 - Patches 1 and 9 are the same as in the previous version; the rest
   are either new or substantially revised.

v12:
 - Take out the debugging patch to print partitions.
 - Add a patch to force turning off the partition flag if the newly
   modified CPU list doesn't meet the requirement of being a partition
   root.
 - Remove some unneeded checking code in update_reserved_cpumask().

v11:
 - Change the "domain_root" name to "partition" as suggested by Peter and
   update the documentation and code accordingly.
 - Remove the dying cgroup check in update_reserved_cpus() as the check
   may not be needed after all.
 - Document the effect of losing CPU affinity after offlining all the
   cpus in a partition.
 - There are no other major code changes in this version.

v10:
 - Remove the cpuset.sched.load_balance patch for now as it may not
   be that useful.
 - Break the large patch 2 into smaller patches to make them a bit
   easier to review.
 - Test and fix issues related to changing "cpuset.cpus" and cpu
   online/offline in a domain root.
 - Rename isolated_cpus to reserved_cpus as this cpumask holds CPUs
   reserved for child sched domains.
 - Rework the scheduling domain debug printing code in the last patch.
 - Update the documentation in the newly moved
   Documentation/admin-guide/cgroup-v2.rst.

v10 patch: https://lkml.org/lkml/2018/6/18/3
v11 patch: https://lkml.org/lkml/2018/6/24/30
v12 patch: https://lkml.org/lkml/2018/8/27/423

The purpose of this patchset is to provide a basic set of cpuset
control files for cgroup v2. This basic set includes the non-root
"cpus", "mems" and "sched.partition" files. The "cpus.effective" and
"mems.effective" files will appear in all cpuset-enabled cgroups.

The new control file that is unique to v2 is "sched.partition". It is
a tristate flag file that designates whether a cgroup is the root of a
new scheduling domain or partition, with its own unique set of CPUs
that is disjoint, from the scheduling perspective, from other
partitions. A user can write only "1" or "0" into this file to turn
the partition root on or off. Depending on circumstances, a partition
root may become erroneous and have a flag value of -1. However, if
conditions become favorable again, it can be changed back to a
partition root automatically.
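
As a minimal illustration (a hypothetical userspace sketch, not part
of this patchset; the "mypart" cgroup path is made up and cgroup v2 is
assumed to be mounted at /sys/fs/cgroup):

    /* Sketch: turn the hypothetical cgroup "mypart" into a partition root. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/sys/fs/cgroup/mypart/cpuset.sched.partition",
                          O_WRONLY);

            /* Writing "1" turns the partition root on, "0" turns it off. */
            if (fd < 0 || write(fd, "1", 1) != 1) {
                    perror("cpuset.sched.partition");
                    return 1;
            }
            close(fd);
            return 0;
    }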

The root cgroup is always a partition root. Multiple levels of partitions
are supported with some limitations. So a container partition root can
behave like a real root.

When a partition root cgroup is removed, its list of exclusive CPUs
will be returned to the parent's cpus.effective automatically.

A container root can be a partition root with sub-partitions
created underneath it. One difference from the real root is that the
"cpuset.sched.partition" flag isn't present in the real root, but is
present in a container root. This is also true for other cpuset control
files as well as those from the other controllers. This is a general
issue that is not going to be addressed in this patchset.

This patchset does not exclude the possibility of adding more features
in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their effective
counterparts.

Patch 2 defines new data structures to support partitioning.

Patch 3 simplifies the allocation and freeing of cpumasks in the cpuset
code and prepares for use by subsequent patches.

Patch 4 adds a new "sched.partition" control file for setting up multiple
scheduling domains or partitions. A partition root implies cpu_exclusive.

Patch 5 makes the new "sched.partition" file support an error value of
-1, which indicates that the partition root has entered an erroneous
state where some of the constraints of a partition root (like
cpu_exclusive) still hold but it is not a real partition root anymore.
This allows the cpuset to change back to a partition root
automatically later on if the conditions become favorable again.

Patch 6 adds tracking of the number of cpusets that use the parent's
effective_cpus in order to make sure that those cpusets will be
properly updated if their parent's effective_cpus changes because of
changes in sibling partitions.

Patch 7 makes the hotplug code deal with partition roots properly.

Patch 8 updates the scheduling domain generation code to work with
the new partition feature.

Patch 9 exposes cpus.effective and mems.effective to the root cgroup,
as enabling child partitions will take CPUs away from the root cgroup
and it is useful to monitor what CPUs are left there.

Patch 10 updates the cgroup v2 documentation file with information
about the new "sched.partition" file.

Patch 11 adds a new read-only "cpus.subpartitions" file that lists the
CPUs in the subparts_cpus mask in the cpuset data structure when the
command line option "cgroup_debug" is specified. This is mostly used
for debugging and verification purposes.

A test script with various cpuset configurations was run on both regular
and debug kernels with this patchset applied to verify that the cpusets
behaved appropriately without unexpected errors.

Waiman Long (11):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Define data structures to support scheduling partition
  cpuset: Simplify allocation and freeing of cpumasks
  cpuset: Add new v2 cpuset.sched.partition flag
  cpuset: Add an error state to cpuset.sched.partition
  cpuset: Track cpusets that use parent's effective_cpus
  cpuset: Make CPU hotplug work with partition
  cpuset: Make generate_sched_domains() work with partition
  cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  cpuset: Add documentation about the new "cpuset.sched.partition" flag
  cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug

 Documentation/admin-guide/cgroup-v2.rst | 175 ++++++-
 include/linux/cgroup-defs.h             |   1 +
 kernel/cgroup/cgroup-internal.h         |   2 +
 kernel/cgroup/cgroup.c                  |  14 +-
 kernel/cgroup/cpuset.c                  | 887 +++++++++++++++++++++++++++++---
 kernel/cgroup/debug.c                   |   4 +-
 6 files changed, 1007 insertions(+), 76 deletions(-)

-- 
1.8.3.1



* [PATCH v13 01/11] cpuset: Enable cpuset controller in default hierarchy
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
@ 2018-10-12 17:55 ` Waiman Long
  2018-10-15 18:31   ` Tom Hromatka
  2018-10-12 17:55 ` [PATCH v13 02/11] cpuset: Define data structures to support scheduling partition Waiman Long
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Now that thread mode has been merged into 4.14, it is time to enable
cpuset to be used in the default hierarchy (cgroup v2) as it is
clearly threaded.

The cpuset controller has experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables the cpuset controller in the default hierarchy
with a minimal set of features, namely just the cpus and mems files
and their effective_* counterparts.  We can certainly add more
features to the default hierarchy later on if there is a real user
need for them.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to the
memory controller or maybe to the cpu controller instead of staying
with cpuset.
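
For reference, enabling the controller from userspace would look
roughly like the sketch below (illustrative only; it assumes cgroup v2
is mounted at /sys/fs/cgroup):

    /* Sketch: enable the cpuset controller for children of the root cgroup. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/sys/fs/cgroup/cgroup.subtree_control", O_WRONLY);

            if (fd < 0 || write(fd, "+cpuset", 7) != 7) {
                    perror("cgroup.subtree_control");
                    return 1;
            }
            close(fd);
            return 0;
    }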

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 109 ++++++++++++++++++++++++++++++--
 kernel/cgroup/cpuset.c                  |  48 +++++++++++++-
 2 files changed, 149 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 184193b..0465c23 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -56,11 +56,13 @@ v1 is available under Documentation/cgroup-v1/.
          5-3-3-2. IO Latency Interface Files
      5-4. PID
        5-4-1. PID Interface Files
-     5-5. Device
-     5-6. RDMA
-       5-6-1. RDMA Interface Files
-     5-7. Misc
-       5-7-1. perf_event
+     5-5. Cpuset
+       5-5-1. Cpuset Interface Files
+     5-6. Device
+     5-7. RDMA
+       5-7-1. RDMA Interface Files
+     5-8. Misc
+       5-8-1. perf_event
      5-N. Non-normative information
        5-N-1. CPU controller root cgroup process behaviour
        5-N-2. IO controller root cgroup process behaviour
@@ -1588,6 +1590,103 @@ through fork() or clone(). These will return -EAGAIN if the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the requested CPUs to be used by tasks within this
+	cgroup.  The actual list of CPUs to be granted, however, is
+	subjected to constraints imposed by its parent and can differ
+	from the requested CPUs.
+
+	The CPU numbers are comma-separated numbers or ranges.
+	For example:
+
+	  # cat cpuset.cpus
+	  0-4,6,8-10
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.cpus" or all the available CPUs if none is found.
+
+	The value of "cpuset.cpus" stays constant until the next update
+	and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined CPUs that are actually granted to this
+	cgroup by its parent.  These CPUs are allowed to be used by
+	tasks within the current cgroup.
+
+	If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
+	all the CPUs from the parent cgroup that can be available to
+	be used by this cgroup.  Otherwise, it should be a subset of
+	"cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
+	can be granted.  In this case, it will be treated just like an
+	empty "cpuset.cpus".
+
+	Its value will be affected by CPU hotplug events.
+
+  cpuset.mems
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the requested memory nodes to be used by tasks within
+	this cgroup.  The actual list of memory nodes granted, however,
+	is subjected to constraints imposed by its parent and can differ
+	from the requested memory nodes.
+
+	The memory node numbers are comma-separated numbers or ranges.
+	For example:
+
+	  # cat cpuset.mems
+	  0-1,3
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.mems" or all the available memory nodes if none
+	is found.
+
+	The value of "cpuset.mems" stays constant until the next update
+	and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined memory nodes that are actually granted to
+	this cgroup by its parent. These memory nodes are allowed to
+	be used by tasks within the current cgroup.
+
+	If "cpuset.mems" is empty, it shows all the memory nodes from the
+	parent cgroup that will be available to be used by this cgroup.
+	Otherwise, it should be a subset of "cpuset.mems" unless none of
+	the memory nodes listed in "cpuset.mems" can be granted.  In this
+	case, it will be treated just like an empty "cpuset.mems".
+
+	Its value will be affected by memory nodes hotplug events.
+
+
 Device controller
 -----------------
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 266f10c..2b5c447 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1824,12 +1824,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	return 0;
 }
 
-
 /*
  * for the common functions, 'private' gives the type of file
  */
 
-static struct cftype files[] = {
+static struct cftype legacy_files[] = {
 	{
 		.name = "cpus",
 		.seq_show = cpuset_common_seq_show,
@@ -1932,6 +1931,47 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 };
 
 /*
+ * This is currently a minimal set for the default hierarchy. It can be
+ * expanded later on by migrating more features and control files from v1.
+ */
+static struct cftype dfl_files[] = {
+	{
+		.name = "cpus",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * NR_CPUS),
+		.private = FILE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * MAX_NUMNODES),
+		.private = FILE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "cpus.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{ }	/* terminate */
+};
+
+
+/*
  *	cpuset_css_alloc - allocate a cpuset css
  *	cgrp:	control group that the new cpuset will be part of
  */
@@ -2105,8 +2145,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 	.post_attach	= cpuset_post_attach,
 	.bind		= cpuset_bind,
 	.fork		= cpuset_fork,
-	.legacy_cftypes	= files,
+	.legacy_cftypes	= legacy_files,
+	.dfl_cftypes	= dfl_files,
 	.early_init	= true,
+	.threaded	= true,
 };
 
 /**
-- 
1.8.3.1



* [PATCH v13 02/11] cpuset: Define data structures to support scheduling partition
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
  2018-10-12 17:55 ` [PATCH v13 01/11] " Waiman Long
@ 2018-10-12 17:55 ` Waiman Long
  2018-10-12 17:55 ` [PATCH v13 03/11] cpuset: Simplify allocation and freeing of cpumasks Waiman Long
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

From a cpuset point of view, a scheduling partition is a group of
cpusets with their own set of exclusive CPUs that are not shared by
other tasks outside the scheduling partition.

In the legacy hierarchy, scheduling partitions are supported
indirectly via the right use of the load balancing and cpu exclusive
flags, which is not intuitive and can be hard to use.

To fully support the concept of scheduling partitions in the default
hierarchy, we need to add some new fields into the cpuset structure as
well as a new tmpmasks structure that is used to pre-allocate cpumasks
in the top-level cpuset functions. This avoids memory allocation in
the inner functions, where an allocation failure could leave a cpuset
in an inconsistent state.
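
The intended calling pattern can be sketched as follows (illustrative
only; update_partition() is a made-up stand-in for the inner update
functions of the later patches, and the alloc/free helpers are added
in the next patch):

    struct tmpmasks tmp;

    /* Allocate all temporary cpumasks up front ... */
    if (alloc_cpumasks(NULL, &tmp))
            return -ENOMEM;         /* fail before any state is touched */

    /*
     * ... so the inner update code never needs to allocate memory and
     * thus has no failure path that could leave the cpuset in an
     * inconsistent state.
     */
    update_partition(cs, &tmp);     /* hypothetical inner helper */

    free_cpumasks(&tmp);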

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2b5c447..29a2bdc 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -110,6 +110,13 @@ struct cpuset {
 	nodemask_t effective_mems;
 
 	/*
+	 * CPUs allocated to child sub-partitions (default hierarchy only)
+	 * - CPUs granted by the parent = effective_cpus U subparts_cpus
+	 * - effective_cpus and subparts_cpus are mutually exclusive.
+	 */
+	cpumask_var_t subparts_cpus;
+
+	/*
 	 * This is old Memory Nodes tasks took on.
 	 *
 	 * - top_cpuset.old_mems_allowed is initialized to mems_allowed.
@@ -134,6 +141,30 @@ struct cpuset {
 
 	/* for custom sched domain */
 	int relax_domain_level;
+
+	/* number of CPUs in subparts_cpus */
+	int nr_subparts_cpus;
+
+	/* partition root state */
+	int partition_root_state;
+};
+
+/*
+ * Partition root states:
+ *
+ *   0 - not a partition root
+ *   1 - partition root
+ */
+#define PRS_DISABLED		0
+#define PRS_ENABLED		1
+
+/*
+ * Temporary cpumasks for working with partitions that are passed among
+ * functions to avoid memory allocation in inner functions.
+ */
+struct tmpmasks {
+	cpumask_var_t addmask, delmask;	/* For partition root */
+	cpumask_var_t new_cpus;		/* For update_cpumasks_hier() */
 };
 
 static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
@@ -218,9 +249,15 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_partition_root(const struct cpuset *cs)
+{
+	return cs->partition_root_state;
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
 		  (1 << CS_MEM_EXCLUSIVE)),
+	.partition_root_state = PRS_ENABLED,
 };
 
 /**
-- 
1.8.3.1



* [PATCH v13 03/11] cpuset: Simplify allocation and freeing of cpumasks
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
  2018-10-12 17:55 ` [PATCH v13 01/11] " Waiman Long
  2018-10-12 17:55 ` [PATCH v13 02/11] cpuset: Define data structures to support scheduling partition Waiman Long
@ 2018-10-12 17:55 ` Waiman Long
  2018-10-15 18:35   ` Tom Hromatka
  2018-10-12 17:55 ` [PATCH v13 04/11] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

The previous commit introduces a new subparts_cpus mask into the cpuset
data structure and a new tmpmasks structure.  Managing the allocation
and freeing of those cpumasks is becoming more complex.

So a number of helper functions are added to simplify and streamline
the management of those cpumasks. To keep things simple, all the
cpumasks are now pre-cleared on allocation.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 104 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 71 insertions(+), 33 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 29a2bdc..9ac5f94 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -456,6 +456,57 @@ static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
 }
 
 /**
+ * alloc_cpumasks - allocate three cpumasks for cpuset
+ * @cs:  the cpuset that has cpumasks to be allocated.
+ * @tmp: the tmpmasks structure pointer
+ * Return: 0 if successful, -ENOMEM otherwise.
+ *
+ * Only one of the two input arguments should be non-NULL.
+ */
+static inline int alloc_cpumasks(struct cpuset *cs, struct tmpmasks *tmp)
+{
+	cpumask_var_t *pmask1, *pmask2, *pmask3;
+
+	if (cs) {
+		pmask1 = &cs->cpus_allowed;
+		pmask2 = &cs->effective_cpus;
+		pmask3 = &cs->subparts_cpus;
+	} else {
+		pmask1 = &tmp->new_cpus;
+		pmask2 = &tmp->addmask;
+		pmask3 = &tmp->delmask;
+	}
+
+	if (!zalloc_cpumask_var(pmask1, GFP_KERNEL))
+		return -ENOMEM;
+
+	if (!zalloc_cpumask_var(pmask2, GFP_KERNEL))
+		goto free_one;
+
+	if (!zalloc_cpumask_var(pmask3, GFP_KERNEL))
+		goto free_two;
+
+	return 0;
+
+free_two:
+	free_cpumask_var(*pmask2);
+free_one:
+	free_cpumask_var(*pmask1);
+	return -ENOMEM;
+}
+
+/**
+ * free_cpumasks - free cpumasks in a tmpmasks structure
+ * @tmp: the tmpmasks structure pointer
+ */
+static inline void free_cpumasks(struct tmpmasks *tmp)
+{
+	free_cpumask_var(tmp->new_cpus);
+	free_cpumask_var(tmp->addmask);
+	free_cpumask_var(tmp->delmask);
+}
+
+/**
  * alloc_trial_cpuset - allocate a trial cpuset
  * @cs: the cpuset that the trial cpuset duplicates
  */
@@ -467,31 +518,26 @@ static struct cpuset *alloc_trial_cpuset(struct cpuset *cs)
 	if (!trial)
 		return NULL;
 
-	if (!alloc_cpumask_var(&trial->cpus_allowed, GFP_KERNEL))
-		goto free_cs;
-	if (!alloc_cpumask_var(&trial->effective_cpus, GFP_KERNEL))
-		goto free_cpus;
+	if (alloc_cpumasks(trial, NULL)) {
+		kfree(trial);
+		return NULL;
+	}
 
 	cpumask_copy(trial->cpus_allowed, cs->cpus_allowed);
 	cpumask_copy(trial->effective_cpus, cs->effective_cpus);
 	return trial;
-
-free_cpus:
-	free_cpumask_var(trial->cpus_allowed);
-free_cs:
-	kfree(trial);
-	return NULL;
 }
 
 /**
- * free_trial_cpuset - free the trial cpuset
- * @trial: the trial cpuset to be freed
+ * free_cpuset - free the cpuset
+ * @cs: the cpuset to be freed
  */
-static void free_trial_cpuset(struct cpuset *trial)
+static inline void free_cpuset(struct cpuset *cs)
 {
-	free_cpumask_var(trial->effective_cpus);
-	free_cpumask_var(trial->cpus_allowed);
-	kfree(trial);
+	free_cpumask_var(cs->effective_cpus);
+	free_cpumask_var(cs->cpus_allowed);
+	free_cpumask_var(cs->subparts_cpus);
+	kfree(cs);
 }
 
 /*
@@ -1385,7 +1431,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	if (spread_flag_changed)
 		update_tasks_flags(cs);
 out:
-	free_trial_cpuset(trialcs);
+	free_cpuset(trialcs);
 	return err;
 }
 
@@ -1769,7 +1815,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 		break;
 	}
 
-	free_trial_cpuset(trialcs);
+	free_cpuset(trialcs);
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	kernfs_unbreak_active_protection(of->kn);
@@ -2024,26 +2070,19 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
 	if (!cs)
 		return ERR_PTR(-ENOMEM);
-	if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
-		goto free_cs;
-	if (!alloc_cpumask_var(&cs->effective_cpus, GFP_KERNEL))
-		goto free_cpus;
+
+	if (alloc_cpumasks(cs, NULL)) {
+		kfree(cs);
+		return ERR_PTR(-ENOMEM);
+	}
 
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
-	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
-	cpumask_clear(cs->effective_cpus);
 	nodes_clear(cs->effective_mems);
 	fmeter_init(&cs->fmeter);
 	cs->relax_domain_level = -1;
 
 	return &cs->css;
-
-free_cpus:
-	free_cpumask_var(cs->cpus_allowed);
-free_cs:
-	kfree(cs);
-	return ERR_PTR(-ENOMEM);
 }
 
 static int cpuset_css_online(struct cgroup_subsys_state *css)
@@ -2134,9 +2173,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);
 
-	free_cpumask_var(cs->effective_cpus);
-	free_cpumask_var(cs->cpus_allowed);
-	kfree(cs);
+	free_cpuset(cs);
 }
 
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
@@ -2200,6 +2237,7 @@ int __init cpuset_init(void)
 
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL));
 
 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
-- 
1.8.3.1



* [PATCH v13 04/11] cpuset: Add new v2 cpuset.sched.partition flag
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (2 preceding siblings ...)
  2018-10-12 17:55 ` [PATCH v13 03/11] cpuset: Simplify allocation and freeing of cpumasks Waiman Long
@ 2018-10-12 17:55 ` Waiman Long
  2018-10-13 14:10   ` kbuild test robot
  2018-10-12 17:55 ` [PATCH v13 05/11] cpuset: Add an error state to cpuset.sched.partition Waiman Long
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

A new cpuset.sched.partition boolean flag is added to cpuset v2.
This new flag, if set, indicates that the cgroup is the root of a
new scheduling domain or partition that includes itself and all its
descendants except those that are scheduling domain roots themselves
and their descendants.

With this new flag, one can directly create as many partitions as
necessary without ever using the v1 trick of turning off load balancing
in specific cpusets to create partitions as a side effect.

This new flag is owned by the parent and will cause the CPUs in the
cpuset to be removed from the effective CPUs of its parent.

This is implemented internally by adding a new subparts_cpus mask that
holds the CPUs belonging to child partitions so that:

        subparts_cpus | effective_cpus = cpus_allowed
        subparts_cpus & effective_cpus = 0

This new flag can only be turned on in a cpuset if its parent is a
partition root itself. The state of this flag cannot be changed if the
cpuset has children.

Once turned on, further changes to "cpuset.cpus" are allowed as long
as there is at least one CPU left that can be granted from the parent
and a child partition root cannot use up all the CPUs in the parent's
effective_cpus.
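
The bookkeeping can be illustrated with plain bitmasks in place of
kernel cpumasks (a self-contained userspace sketch, not kernel code):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t parent_effective = 0xff; /* CPUs 0-7 granted to parent */
            uint64_t parent_subparts  = 0;
            uint64_t child_cpus       = 0x0f; /* child partition gets CPUs 0-3 */

            parent_subparts  |= child_cpus;   /* add to subparts_cpus     */
            parent_effective &= ~child_cpus;  /* drop from effective_cpus */

            /* The two invariants stated above still hold. */
            assert((parent_subparts | parent_effective) == 0xff);
            assert((parent_subparts & parent_effective) == 0);
            return 0;
    }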

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 362 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 349 insertions(+), 13 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9ac5f94..9c2e73c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -964,10 +964,191 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	css_task_iter_end(&it);
 }
 
+/**
+ * compute_effective_cpumask - Compute the effective cpumask of the cpuset
+ * @new_cpus: the temp variable for the new effective_cpus mask
+ * @cs: the cpuset that needs to recompute the new effective_cpus mask
+ * @parent: the parent cpuset
+ *
+ * If the parent has subpartition CPUs, include them in the list of
+ * allowable CPUs in computing the new effective_cpus mask.
+ */
+static void compute_effective_cpumask(struct cpumask *new_cpus,
+				      struct cpuset *cs, struct cpuset *parent)
+{
+	if (parent->nr_subparts_cpus) {
+		cpumask_or(new_cpus, parent->effective_cpus,
+			   parent->subparts_cpus);
+		cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
+	} else {
+		cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
+	}
+}
+
+/*
+ * Commands for update_parent_subparts_cpumask
+ */
+enum subparts_cmd {
+	partcmd_enable,		/* Enable partition root	 */
+	partcmd_disable,	/* Disable partition root	 */
+	partcmd_update,		/* Update parent's subparts_cpus */
+};
+
+/**
+ * update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset
+ * @cpuset:  The cpuset that requests change in partition root state
+ * @cmd:     Partition root state change command
+ * @newmask: Optional new cpumask for partcmd_update
+ * @tmp:     Temporary addmask and delmask
+ * Return:   0, 1 or an error code
+ *
+ * For partcmd_enable, the cpuset is being transformed from a non-partition
+ * root to a partition root. The cpus_allowed mask of the given cpuset will
+ * be put into parent's subparts_cpus and taken away from parent's
+ * effective_cpus. The function will return 0 if all the CPUs listed in
+ * cpus_allowed can be granted or an error code will be returned.
+ *
+ * For partcmd_disable, the cpuset is being transformed from a partition
+ * root back to a non-partition root. Any CPUs in cpus_allowed that are in
+ * parent's subparts_cpus will be taken away from that cpumask and put back
+ * into parent's effective_cpus. 0 should always be returned.
+ *
+ * For partcmd_update, if the optional newmask is specified, the cpu
+ * list is to be changed from cpus_allowed to newmask. Otherwise,
+ * cpus_allowed is assumed to remain the same.  The function will return
+ * 1 if changes to parent's subparts_cpus and effective_cpus happen or 0
+ * otherwise. In case of error, an error code will be returned.
+ *
+ * The partcmd_enable and partcmd_disable commands are used by
+ * update_prstate(). The partcmd_update command is used by
+ * update_cpumasks_hier() with newmask NULL and update_cpumask() with
+ * newmask set.
+ *
+ * The checking is more strict when enabling partition root than the
+ * other two commands.
+ *
+ * Because of the implicit cpu exclusive nature of a partition root,
+ * cpumask changes that violate the cpu exclusivity rule will not be
+ * permitted when checked by validate_change(). The validate_change()
+ * function will also prevent any changes to the cpu list if it is not
+ * a superset of children's cpu lists.
+ *
+ * Called with cpuset_mutex held.
+ */
+static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
+					  struct cpumask *newmask,
+					  struct tmpmasks *tmp)
+{
+	struct cpuset *parent = parent_cs(cpuset);
+	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
+	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
+
+	/*
+	 * The parent must be a partition root.
+	 * The new cpumask, if present, or the current cpus_allowed must
+	 * not be empty.
+	 */
+	if (!is_partition_root(parent) ||
+	   (newmask && cpumask_empty(newmask)) ||
+	   (!newmask && cpumask_empty(cpuset->cpus_allowed)))
+		return -EINVAL;
+
+	/*
+	 * Enabling/disabling partition root is not allowed if there are
+	 * online children.
+	 */
+	if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
+		return -EBUSY;
+
+	/*
+	 * Enabling partition root is not allowed if not all the CPUs
+	 * can be granted from parent's effective_cpus or at least one
+	 * CPU will be left after that.
+	 */
+	if ((cmd == partcmd_enable) &&
+	   (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
+	     cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
+		return -EINVAL;
+
+	/*
+	 * A cpumask update cannot make parent's effective_cpus become empty.
+	 */
+	adding = deleting = false;
+	if (cmd == partcmd_enable) {
+		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
+		adding = true;
+	} else if (cmd == partcmd_disable) {
+		deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
+				       parent->subparts_cpus);
+	} else if (newmask) {
+		/*
+		 * partcmd_update with newmask:
+		 *
+		 * delmask = cpus_allowed & ~newmask & parent->subparts_cpus
+		 * addmask = newmask & parent->effective_cpus
+		 *		     & ~parent->subparts_cpus
+		 */
+		cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask);
+		deleting = cpumask_and(tmp->delmask, tmp->delmask,
+				       parent->subparts_cpus);
+
+		cpumask_and(tmp->addmask, newmask, parent->effective_cpus);
+		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
+					parent->subparts_cpus);
+		/*
+		 * Return error if the new effective_cpus could become empty.
+		 */
+		if (adding && !deleting &&
+		    cpumask_equal(parent->effective_cpus, tmp->addmask))
+			return -EINVAL;
+	} else {
+		/*
+		 * partcmd_update w/o newmask:
+		 *
+		 * addmask = cpus_allowed & parent->effective_cpus
+		 *
+		 * Note that parent's subparts_cpus may have been
+		 * pre-shrunk in case the CPUs granted to the parent
+		 * by the grandparent changes. So no deletion is needed.
+		 */
+		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
+				     parent->effective_cpus);
+		if (cpumask_equal(tmp->addmask, parent->effective_cpus))
+			return -EINVAL;
+	}
+
+	if (!adding && !deleting)
+		return 0;
+
+	/*
+	 * Change the parent's subparts_cpus.
+	 * Newly added CPUs will be removed from effective_cpus and
+	 * newly deleted ones will be added back to effective_cpus.
+	 */
+	spin_lock_irq(&callback_lock);
+	if (adding) {
+		cpumask_or(parent->subparts_cpus,
+			   parent->subparts_cpus, tmp->addmask);
+		cpumask_andnot(parent->effective_cpus,
+			       parent->effective_cpus, tmp->addmask);
+	}
+	if (deleting) {
+		cpumask_andnot(parent->subparts_cpus,
+			       parent->subparts_cpus, tmp->delmask);
+		cpumask_or(parent->effective_cpus,
+			   parent->effective_cpus, tmp->delmask);
+	}
+
+	parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus);
+	spin_unlock_irq(&callback_lock);
+
+	return cmd == partcmd_update;
+}
+
 /*
  * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree
- * @cs: the cpuset to consider
- * @new_cpus: temp variable for calculating new effective_cpus
+ * @cs:  the cpuset to consider
+ * @tmp: temp variables for calculating effective_cpus & partition setup
  *
  * When the configured cpumask is changed, the effective cpumasks of this cpuset
  * and all its descendants need to be updated.
@@ -976,7 +1157,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
  *
  * Called with cpuset_mutex held
  */
-static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
+static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 {
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
@@ -985,28 +1166,61 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 	rcu_read_lock();
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
+		bool cs_empty;
 
-		cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
+		compute_effective_cpumask(tmp->new_cpus, cp, parent);
+		cs_empty = cpumask_empty(tmp->new_cpus);
+
+		/*
+		 * A partition root cannot have empty effective_cpus
+		 */
+		WARN_ON_ONCE(cs_empty && is_partition_root(cp));
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
 		 * parent, which is guaranteed to have some CPUs.
 		 */
-		if (is_in_v2_mode() && cpumask_empty(new_cpus))
-			cpumask_copy(new_cpus, parent->effective_cpus);
+		if (is_in_v2_mode() && cs_empty)
+			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
 
-		/* Skip the whole subtree if the cpumask remains the same. */
-		if (cpumask_equal(new_cpus, cp->effective_cpus)) {
+		/*
+		 * Skip the whole subtree if the cpumask remains the same
+		 * and has no partition root state.
+		 */
+		if (!is_partition_root(cp) &&
+		    cpumask_equal(tmp->new_cpus, cp->effective_cpus)) {
 			pos_css = css_rightmost_descendant(pos_css);
 			continue;
 		}
 
+		/*
+		 * update_parent_subparts_cpumask() should have been called
+		 * for cs already in update_cpumask(). We should also call
+		 * update_tasks_cpumask() again for tasks in the parent
+		 * cpuset if the parent's subparts_cpus changes.
+		 */
+		if ((cp != cs) && cp->partition_root_state &&
+		    update_parent_subparts_cpumask(cp, partcmd_update,
+						   NULL, tmp)) {
+			if (parent != &top_cpuset)
+				update_tasks_cpumask(parent);
+		}
+
 		if (!css_tryget_online(&cp->css))
 			continue;
 		rcu_read_unlock();
 
 		spin_lock_irq(&callback_lock);
-		cpumask_copy(cp->effective_cpus, new_cpus);
+
+		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+		if (cp->nr_subparts_cpus) {
+			/*
+			 * Make sure that effective_cpus & subparts_cpus
+			 * are mutually exclusive.
+			 */
+			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
+				       cp->subparts_cpus);
+		}
 		spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
@@ -1041,6 +1255,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 			  const char *buf)
 {
 	int retval;
+	struct tmpmasks tmp;
 
 	/* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */
 	if (cs == &top_cpuset)
@@ -1072,12 +1287,36 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
+	/*
+	 * Use the cpumasks in trialcs for tmpmasks.
+	 */
+	tmp.addmask  = trialcs->subparts_cpus;
+	tmp.delmask  = trialcs->effective_cpus;
+	tmp.new_cpus = trialcs->cpus_allowed;
+
+	if (cs->partition_root_state) {
+		/* Cpumask of a partition root cannot be empty */
+		if (cpumask_empty(trialcs->cpus_allowed))
+			return -EINVAL;
+		if (update_parent_subparts_cpumask(cs, partcmd_update,
+					trialcs->cpus_allowed, &tmp) < 0)
+			return -EINVAL;
+	}
+
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
+
+	/*
+	 * Make sure that subparts_cpus is a subset of cpus_allowed.
+	 */
+	if (cs->nr_subparts_cpus) {
+		cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus,
+			       cs->cpus_allowed);
+		cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
+	}
 	spin_unlock_irq(&callback_lock);
 
-	/* use trialcs->cpus_allowed as a temp variable */
-	update_cpumasks_hier(cs, trialcs->cpus_allowed);
+	update_cpumasks_hier(cs, &tmp);
 	return 0;
 }
 
@@ -1436,6 +1675,80 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 }
 
 /*
+ * update_prstate - update partition_root_state
+ * cs:	the cpuset to update
+ * val: 0 - disabled, 1 - enabled
+ *
+ * Call with cpuset_mutex held.
+ */
+static int update_prstate(struct cpuset *cs, int val)
+{
+	int err;
+	struct cpuset *parent = parent_cs(cs);
+	struct tmpmasks tmp;
+
+	if ((val != 0) && (val != 1))
+		return -EINVAL;
+	if (val == cs->partition_root_state)
+		return 0;
+
+	/*
+	 * Cannot force a partial or erroneous partition root to a full
+	 * partition root.
+	 */
+	if (val && cs->partition_root_state)
+		return -EINVAL;
+
+	if (alloc_cpumasks(NULL, &tmp))
+		return -ENOMEM;
+
+	err = -EINVAL;
+	if (!cs->partition_root_state) {
+		/*
+		 * Turning on partition root requires setting the
+		 * CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed
+		 * cannot be empty.
+		 */
+		if (cpumask_empty(cs->cpus_allowed))
+			goto out;
+
+		err = update_flag(CS_CPU_EXCLUSIVE, cs, 1);
+		if (err)
+			goto out;
+
+		err = update_parent_subparts_cpumask(cs, partcmd_enable,
+						     NULL, &tmp);
+		if (err) {
+			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+			goto out;
+		}
+		cs->partition_root_state = PRS_ENABLED;
+	} else {
+		err = update_parent_subparts_cpumask(cs, partcmd_disable,
+						     NULL, &tmp);
+		if (err)
+			goto out;
+
+		cs->partition_root_state = 0;
+
+		/* Turning off CS_CPU_EXCLUSIVE will not return error */
+		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+	}
+
+	/*
+	 * Update cpumask of parent's tasks except when it is the top
+	 * cpuset as some system daemons cannot be mapped to other CPUs.
+	 */
+	if (parent != &top_cpuset)
+		update_tasks_cpumask(parent);
+
+	rebuild_sched_domains_locked();
+out:
+	free_cpumasks(&tmp);
+	return err;
+}
+
+/*
  * Frequency meter - How fast is some event occurring?
  *
  * These routines manage a digitally filtered, constant time based,
@@ -1680,6 +1993,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	FILE_MEM_EXCLUSIVE,
 	FILE_MEM_HARDWALL,
 	FILE_SCHED_LOAD_BALANCE,
+	FILE_PARTITION_ROOT,
 	FILE_SCHED_RELAX_DOMAIN_LEVEL,
 	FILE_MEMORY_PRESSURE_ENABLED,
 	FILE_MEMORY_PRESSURE,
@@ -1749,6 +2063,9 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
 		retval = update_relax_domain_level(cs, val);
 		break;
+	case FILE_PARTITION_ROOT:
+		retval = update_prstate(cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1899,6 +2216,8 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	switch (type) {
 	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
 		return cs->relax_domain_level;
+	case FILE_PARTITION_ROOT:
+		return cs->partition_root_state;
 	default:
 		BUG();
 	}
@@ -2050,6 +2369,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "sched.partition",
+		.read_s64 = cpuset_read_s64,
+		.write_s64 = cpuset_write_s64,
+		.private = FILE_PARTITION_ROOT,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
 	{ }	/* terminate */
 };
 
@@ -2151,7 +2478,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 /*
  * If the cpuset being removed has its flag 'sched_load_balance'
  * enabled, then simulate turning sched_load_balance off, which
- * will call rebuild_sched_domains_locked().
+ * will call rebuild_sched_domains_locked(). That is not needed
+ * in the default hierarchy, where only changes in partitions
+ * will cause repartitioning.
+ *
+ * If the cpuset has the 'sched.partition' flag enabled, simulate
+ * turning 'sched.partition' off.
  */
 
 static void cpuset_css_offline(struct cgroup_subsys_state *css)
@@ -2160,7 +2492,11 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 
 	mutex_lock(&cpuset_mutex);
 
-	if (is_sched_load_balance(cs))
+	if (is_partition_root(cs))
+		update_prstate(cs, 0);
+
+	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+	    is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
 	cpuset_dec();
-- 
1.8.3.1



* [PATCH v13 05/11] cpuset: Add an error state to cpuset.sched.partition
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (3 preceding siblings ...)
  2018-10-12 17:55 ` [PATCH v13 04/11] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
@ 2018-10-12 17:55 ` Waiman Long
  2018-10-12 17:55 ` [PATCH v13 06/11] cpuset: Track cpusets that use parent's effective_cpus Waiman Long
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Handling an error returned by update_parent_subparts_cpumask() in
update_cpumasks_hier() is problematic as the states may become
inconsistent. To avoid that and increase flexibility in handling other
error cases, a new error state (-1) is added to the partition_root_state
flag. This new error state is set internally and a user cannot write
this value to "cpuset.sched.partition".

In this error state, the partition root becomes an erroneous one. It is
no longer a real partition root, but the CS_CPU_EXCLUSIVE flag will
still be set as it can be changed back to a real one if an
appropriate change happens later on.
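
The automatic transition between the two states boils down to the
following check (condensed from update_parent_subparts_cpumask() in
this patch; part_error means that none of the requested CPUs can be
granted by the parent):

    switch (cpuset->partition_root_state) {
    case PRS_ENABLED:
            if (part_error)         /* no CPU can be granted        */
                    cpuset->partition_root_state = PRS_ERROR;
            break;
    case PRS_ERROR:
            if (!part_error)        /* constraints can be met again */
                    cpuset->partition_root_state = PRS_ENABLED;
            break;
    }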

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 146 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 123 insertions(+), 23 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9c2e73c..12ec61d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -153,10 +153,19 @@ struct cpuset {
  * Partition root states:
  *
  *   0 - not a partition root
+ *
  *   1 - partition root
+ *
+ *  -1 - erroneous partition root
+ *       None of the cpus in cpus_allowed can be put into the parent's
+ *       subparts_cpus. In this case, the cpuset is not a real partition
+ *       root anymore.  However, the CPU_EXCLUSIVE bit will still be set
+ *       and the cpuset can be restored back to a partition root if the
+ *       parent cpuset can give more CPUs back to this child cpuset.
  */
 #define PRS_DISABLED		0
 #define PRS_ENABLED		1
+#define PRS_ERROR		-1
 
 /*
  * Temporary cpumasks for working with partitions that are passed among
@@ -251,7 +260,7 @@ static inline int is_spread_slab(const struct cpuset *cs)
 
 static inline int is_partition_root(const struct cpuset *cs)
 {
-	return cs->partition_root_state;
+	return cs->partition_root_state > 0;
 }
 
 static struct cpuset top_cpuset = {
@@ -1015,9 +1024,12 @@ enum subparts_cmd {
  *
  * For partcmd_update, if the optional newmask is specified, the cpu
  * list is to be changed from cpus_allowed to newmask. Otherwise,
- * cpus_allowed is assumed to remain the same.  The function will return
- * 1 if changes to parent's subparts_cpus and effective_cpus happen or 0
- * otherwise. In case of error, an error code will be returned.
+ * cpus_allowed is assumed to remain the same. The cpuset should either
+ * be a partition root or an erroneous partition root. The partition root
+ * state may change if newmask is NULL and none of the requested CPUs can
+ * be granted by the parent. The function will return 1 if changes to
+ * parent's subparts_cpus and effective_cpus happen or 0 otherwise.
+ * Error code should only be returned when newmask is non-NULL.
  *
  * The partcmd_enable and partcmd_disable commands are used by
  * update_prstate(). The partcmd_update command is used by
@@ -1042,6 +1054,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	struct cpuset *parent = parent_cs(cpuset);
 	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
+	bool part_error = false;	/* Partition error? */
 
 	/*
 	 * The parent must be a partition root.
@@ -1108,13 +1121,48 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		 * addmask = cpus_allowed & parent->effective_cpus
 		 *
 		 * Note that parent's subparts_cpus may have been
-		 * pre-shrunk in case the CPUs granted to the parent
-		 * by the grandparent changes. So no deletion is needed.
+		 * pre-shrunk in case there is a change in the cpu list.
+		 * So no deletion is needed.
 		 */
 		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
 				     parent->effective_cpus);
-		if (cpumask_equal(tmp->addmask, parent->effective_cpus))
-			return -EINVAL;
+		part_error = cpumask_equal(tmp->addmask,
+					   parent->effective_cpus);
+	}
+
+	if (cmd == partcmd_update) {
+		int prev_prs = cpuset->partition_root_state;
+
+		/*
+		 * Check for possible transition between PRS_ENABLED
+		 * and PRS_ERROR.
+		 */
+		switch (cpuset->partition_root_state) {
+		case PRS_ENABLED:
+			if (part_error)
+				cpuset->partition_root_state = PRS_ERROR;
+			break;
+		case PRS_ERROR:
+			if (!part_error)
+				cpuset->partition_root_state = PRS_ENABLED;
+			break;
+		}
+		/*
+		 * Set part_error if previously in erroneous state.
+		 */
+		part_error = (prev_prs == PRS_ERROR);
+	}
+
+	if (!part_error && (cpuset->partition_root_state == PRS_ERROR))
+		return 0;	/* Nothing needs to be done */
+
+	if (cpuset->partition_root_state == PRS_ERROR) {
+		/*
+		 * Remove all its cpus from parent's subparts_cpus.
+		 */
+		adding = false;
+		deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
+				       parent->subparts_cpus);
 	}
 
 	if (!adding && !deleting)
@@ -1166,28 +1214,21 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 	rcu_read_lock();
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
-		bool cs_empty;
 
 		compute_effective_cpumask(tmp->new_cpus, cp, parent);
-		cs_empty = cpumask_empty(tmp->new_cpus);
-
-		/*
-		 * A partition root cannot have empty effective_cpus
-		 */
-		WARN_ON_ONCE(cs_empty && is_partition_root(cp));
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
 		 * parent, which is guaranteed to have some CPUs.
 		 */
-		if (is_in_v2_mode() && cs_empty)
+		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus))
 			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
 
 		/*
 		 * Skip the whole subtree if the cpumask remains the same
 		 * and has no partition root state.
 		 */
-		if (!is_partition_root(cp) &&
+		if (!cp->partition_root_state &&
 		    cpumask_equal(tmp->new_cpus, cp->effective_cpus)) {
 			pos_css = css_rightmost_descendant(pos_css);
 			continue;
@@ -1199,11 +1240,39 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		 * update_tasks_cpumask() again for tasks in the parent
 		 * cpuset if the parent's subparts_cpus changes.
 		 */
-		if ((cp != cs) && cp->partition_root_state &&
-		    update_parent_subparts_cpumask(cp, partcmd_update,
-						   NULL, tmp)) {
-			if (parent != &top_cpuset)
-				update_tasks_cpumask(parent);
+		if ((cp != cs) && cp->partition_root_state) {
+			switch (parent->partition_root_state) {
+			case PRS_DISABLED:
+				/*
+				 * If parent is not a partition root or an
+				 * erroneous partition root, clear the
+				 * state and the CS_CPU_EXCLUSIVE flag.
+				 */
+				WARN_ON_ONCE(cp->partition_root_state
+					     != PRS_ERROR);
+				cp->partition_root_state = 0;
+				spin_lock_irq(&callback_lock);
+				clear_bit(CS_CPU_EXCLUSIVE, &cp->flags);
+				spin_unlock_irq(&callback_lock);
+				break;
+
+			case PRS_ENABLED:
+				if (update_parent_subparts_cpumask(cp,
+					partcmd_update, NULL, tmp))
+					update_tasks_cpumask(parent);
+				break;
+
+			case PRS_ERROR:
+				/*
+				 * When the parent is erroneous, the child has to be too.
+				 */
+				cp->partition_root_state = PRS_ERROR;
+				if (cp->nr_subparts_cpus) {
+					cp->nr_subparts_cpus = 0;
+					cpumask_clear(cp->subparts_cpus);
+				}
+				break;
+			}
 		}
 
 		if (!css_tryget_online(&cp->css))
@@ -1213,13 +1282,33 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		spin_lock_irq(&callback_lock);
 
 		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-		if (cp->nr_subparts_cpus) {
+		if (cp->nr_subparts_cpus &&
+		   (cp->partition_root_state != PRS_ENABLED)) {
+			cp->nr_subparts_cpus = 0;
+			cpumask_clear(cp->subparts_cpus);
+		} else if (cp->nr_subparts_cpus) {
 			/*
 			 * Make sure that effective_cpus & subparts_cpus
 			 * are mutually exclusive.
+			 *
+			 * In the unlikely event that effective_cpus
+			 * becomes empty, we clear cp->nr_subparts_cpus and
+			 * let its child partition roots compete for
+			 * CPUs again.
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
+			if (cpumask_empty(cp->effective_cpus)) {
+				cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+				cpumask_clear(cp->subparts_cpus);
+				cp->nr_subparts_cpus = 0;
+			} else if (!cpumask_subset(cp->subparts_cpus,
+						   tmp->new_cpus)) {
+				cpumask_andnot(cp->subparts_cpus,
+					cp->subparts_cpus, tmp->new_cpus);
+				cp->nr_subparts_cpus
+					= cpumask_weight(cp->subparts_cpus);
+			}
 		}
 		spin_unlock_irq(&callback_lock);
 
@@ -1724,6 +1813,17 @@ static int update_prstate(struct cpuset *cs, int val)
 		}
 		cs->partition_root_state = PRS_ENABLED;
 	} else {
+		/*
+		 * Turning off partition root will clear the
+		 * CS_CPU_EXCLUSIVE bit.
+		 */
+		if (cs->partition_root_state == PRS_ERROR) {
+			cs->partition_root_state = 0;
+			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+			err = 0;
+			goto out;
+		}
+
 		err = update_parent_subparts_cpumask(cs, partcmd_disable,
 						     NULL, &tmp);
 		if (err)
-- 
1.8.3.1



* [PATCH v13 06/11] cpuset: Track cpusets that use parent's effective_cpus
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (4 preceding siblings ...)
  2018-10-12 17:55 ` [PATCH v13 05/11] cpuset: Add an error state to cpuset.sched.partition Waiman Long
@ 2018-10-12 17:55 ` Waiman Long
  2018-10-12 17:55 ` [PATCH v13 07/11] cpuset: Make CPU hotplug work with partition Waiman Long
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

In the default hierarchy, a cpuset will use the parent's effective_cpus
if none of the requested CPUs can be granted from the parent. That can
be a problem if a parent is a partition root with child partition
roots. Changes to a parent's effective_cpus list due to changes in a
child partition root may not be properly reflected in a child cpuset
that uses the parent's effective_cpus, because the cpu_exclusive rule
of a partition root will not guard against that.

In order to avoid the mismatch, two new tracking variables are added
to the cpuset structure to track if a cpuset uses the parent's
effective_cpus and the number of child cpusets that use its
effective_cpus. So whenever cpumask changes are made to a parent, it
will also check to see if it has other child cpusets that use its
effective_cpus and call update_cpumasks_hier() if that is the case.
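
Condensed from update_cpumasks_hier() in this patch, the bookkeeping
looks like this (uses_parent_mask is a stand-in for the condition that
the computed effective mask was empty and the cpuset fell back to the
parent's effective_cpus):

    if (uses_parent_mask) {
            if (!cp->use_parent_ecpus) {
                    cp->use_parent_ecpus = true;
                    parent->child_ecpus_count++;
            }
    } else if (cp->use_parent_ecpus) {
            cp->use_parent_ecpus = false;
            parent->child_ecpus_count--;
    }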

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 70 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 12ec61d..463c583 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -147,6 +147,14 @@ struct cpuset {
 
 	/* partition root state */
 	int partition_root_state;
+
+	/*
+	 * Default hierarchy only:
+	 * use_parent_ecpus - set if using parent's effective_cpus
+	 * child_ecpus_count - # of children with use_parent_ecpus set
+	 */
+	int use_parent_ecpus;
+	int child_ecpus_count;
 };
 
 /*
@@ -1221,8 +1229,17 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		 * If it becomes empty, inherit the effective mask of the
 		 * parent, which is guaranteed to have some CPUs.
 		 */
-		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus))
+		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
 			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
+			if (!cp->use_parent_ecpus) {
+				cp->use_parent_ecpus = true;
+				parent->child_ecpus_count++;
+			}
+		} else if (cp->use_parent_ecpus) {
+			cp->use_parent_ecpus = false;
+			WARN_ON_ONCE(!parent->child_ecpus_count);
+			parent->child_ecpus_count--;
+		}
 
 		/*
 		 * Skip the whole subtree if the cpumask remains the same
@@ -1335,6 +1352,35 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 }
 
 /**
+ * update_sibling_cpumasks - Update siblings cpumasks
+ * @parent:  Parent cpuset
+ * @cs:      Current cpuset
+ * @tmp:     Temp variables
+ */
+static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
+				    struct tmpmasks *tmp)
+{
+	struct cpuset *sibling;
+	struct cgroup_subsys_state *pos_css;
+
+	/*
+	 * Check all its siblings and call update_cpumasks_hier()
+	 * if their use_parent_ecpus flag is set in order for them
+	 * to use the right effective_cpus value.
+	 */
+	rcu_read_lock();
+	cpuset_for_each_child(sibling, pos_css, parent) {
+		if (sibling == cs)
+			continue;
+		if (!sibling->use_parent_ecpus)
+			continue;
+
+		update_cpumasks_hier(sibling, tmp);
+	}
+	rcu_read_unlock();
+}
+
+/**
  * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
  * @cs: the cpuset to consider
  * @trialcs: trial cpuset
@@ -1406,6 +1452,17 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	spin_unlock_irq(&callback_lock);
 
 	update_cpumasks_hier(cs, &tmp);
+
+	if (cs->partition_root_state) {
+		struct cpuset *parent = parent_cs(cs);
+
+		/*
+		 * For partition root, update the cpumasks of sibling
+		 * cpusets if they use parent's effective_cpus.
+		 */
+		if (parent->child_ecpus_count)
+			update_sibling_cpumasks(parent, cs, &tmp);
+	}
 	return 0;
 }
 
@@ -1842,6 +1899,9 @@ static int update_prstate(struct cpuset *cs, int val)
 	if (parent != &top_cpuset)
 		update_tasks_cpumask(parent);
 
+	if (parent->child_ecpus_count)
+		update_sibling_cpumasks(parent, cs, &tmp);
+
 	rebuild_sched_domains_locked();
 out:
 	free_cpumasks(&tmp);
@@ -2536,6 +2596,8 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (is_in_v2_mode()) {
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
 		cs->effective_mems = parent->effective_mems;
+		cs->use_parent_ecpus = true;
+		parent->child_ecpus_count++;
 	}
 	spin_unlock_irq(&callback_lock);
 
@@ -2599,6 +2661,13 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	    is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
+	if (cs->use_parent_ecpus) {
+		struct cpuset *parent = parent_cs(cs);
+
+		cs->use_parent_ecpus = false;
+		parent->child_ecpus_count--;
+	}
+
 	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
-- 
1.8.3.1



* [PATCH v13 07/11] cpuset: Make CPU hotplug work with partition
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (5 preceding siblings ...)
  2018-10-12 17:55 ` [PATCH v13 06/11] cpuset: Track cpusets that use parent's effective_cpus Waiman Long
@ 2018-10-12 17:55 ` Waiman Long
  2018-10-12 17:55 ` [PATCH v13 08/11] cpuset: Make generate_sched_domains() " Waiman Long
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

When there is a CPU hotplug event (CPU online or offline), the
partitions may need to be reconfigured and regenerated. So code is
added to the hotplug functions to make them work with the new
subparts_cpus mask and compute the right effective_cpus for each of
the affected cpusets. A hotplug event may also change the state of a
partition root from a real one to an erroneous one, or vice versa.
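
The core of the handling can be sketched as follows (a condensed
excerpt of the compute_effective_cpumask() change below, shown for the
partition-root-parent case only):

	/*
	 * CPUs granted by a partition-root parent are
	 * effective_cpus U subparts_cpus, but subparts_cpus may still
	 * contain offlined CPUs, so they are masked out with
	 * cpu_active_mask.
	 */
	cpumask_or(new_cpus, parent->effective_cpus, parent->subparts_cpus);
	cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
	cpumask_and(new_cpus, new_cpus, cpu_active_mask);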

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 132 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 117 insertions(+), 15 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 463c583..578e6ae 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -113,6 +113,9 @@ struct cpuset {
 	 * CPUs allocated to child sub-partitions (default hierarchy only)
 	 * - CPUs granted by the parent = effective_cpus U subparts_cpus
 	 * - effective_cpus and subparts_cpus are mutually exclusive.
+	 *
+	 * effective_cpus contains only onlined CPUs, but subparts_cpus
+	 * may have offlined ones.
 	 */
 	cpumask_var_t subparts_cpus;
 
@@ -988,7 +991,9 @@ static void update_tasks_cpumask(struct cpuset *cs)
  * @parent: the parent cpuset
  *
  * If the parent has subpartition CPUs, include them in the list of
- * allowable CPUs in computing the new effective_cpus mask.
+ * allowable CPUs in computing the new effective_cpus mask. Since offlined
+ * CPUs are not removed from subparts_cpus, we have to use cpu_active_mask
+ * to mask those out.
  */
 static void compute_effective_cpumask(struct cpumask *new_cpus,
 				      struct cpuset *cs, struct cpuset *parent)
@@ -997,6 +1002,7 @@ static void compute_effective_cpumask(struct cpumask *new_cpus,
 		cpumask_or(new_cpus, parent->effective_cpus,
 			   parent->subparts_cpus);
 		cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
+		cpumask_and(new_cpus, new_cpus, cpu_active_mask);
 	} else {
 		cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
 	}
@@ -1119,9 +1125,20 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		/*
 		 * Return error if the new effective_cpus could become empty.
 		 */
-		if (adding && !deleting &&
-		    cpumask_equal(parent->effective_cpus, tmp->addmask))
-			return -EINVAL;
+		if (adding &&
+		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
+			if (!deleting)
+				return -EINVAL;
+			/*
+			 * As some of the CPUs in subparts_cpus might have
+			 * been offlined, we need to compute the real delmask
+			 * to confirm that.
+			 */
+			if (!cpumask_and(tmp->addmask, tmp->delmask,
+					 cpu_active_mask))
+				return -EINVAL;
+			cpumask_copy(tmp->addmask, parent->effective_cpus);
+		}
 	} else {
 		/*
 		 * partcmd_update w/o newmask:
@@ -1191,6 +1208,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	if (deleting) {
 		cpumask_andnot(parent->subparts_cpus,
 			       parent->subparts_cpus, tmp->delmask);
+		/*
+		 * Some of the CPUs in subparts_cpus might have been offlined.
+		 */
+		cpumask_and(tmp->delmask, tmp->delmask, cpu_active_mask);
 		cpumask_or(parent->effective_cpus,
 			   parent->effective_cpus, tmp->delmask);
 	}
@@ -2849,20 +2870,29 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
 		update_tasks_nodemask(cs);
 }
 
+static bool force_rebuild;
+
+void cpuset_force_rebuild(void)
+{
+	force_rebuild = true;
+}
+
 /**
  * cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug
  * @cs: cpuset in interest
+ * @tmp: the tmpmasks structure pointer
  *
  * Compare @cs's cpu and mem masks against top_cpuset and if some have gone
  * offline, update @cs accordingly.  If @cs ends up with no CPU or memory,
  * all its tasks are moved to the nearest ancestor with both resources.
  */
-static void cpuset_hotplug_update_tasks(struct cpuset *cs)
+static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 {
 	static cpumask_t new_cpus;
 	static nodemask_t new_mems;
 	bool cpus_updated;
 	bool mems_updated;
+	struct cpuset *parent;
 retry:
 	wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
 
@@ -2877,9 +2907,60 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
 		goto retry;
 	}
 
-	cpumask_and(&new_cpus, cs->cpus_allowed, parent_cs(cs)->effective_cpus);
-	nodes_and(new_mems, cs->mems_allowed, parent_cs(cs)->effective_mems);
+	parent = parent_cs(cs);
+	compute_effective_cpumask(&new_cpus, cs, parent);
+	nodes_and(new_mems, cs->mems_allowed, parent->effective_mems);
+
+	if (cs->nr_subparts_cpus)
+		/*
+		 * Make sure that CPUs allocated to child partitions
+		 * do not show up in effective_cpus.
+		 */
+		cpumask_andnot(&new_cpus, &new_cpus, cs->subparts_cpus);
+
+	if (!tmp || !cs->partition_root_state)
+		goto update_tasks;
+
+	/*
+	 * In the unlikely event that a partition root has empty
+	 * effective_cpus or its parent becomes erroneous, we have to
+	 * transition it to the erroneous state.
+	 */
+	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
+	   (parent->partition_root_state == PRS_ERROR))) {
+		if (cs->nr_subparts_cpus) {
+			cs->nr_subparts_cpus = 0;
+			cpumask_clear(cs->subparts_cpus);
+			compute_effective_cpumask(&new_cpus, cs, parent);
+		}
 
+		/*
+		 * If the effective_cpus is empty because the child
+		 * partitions take away all the CPUs, we can keep
+		 * the current partition and let the child partitions
+		 * fight for available CPUs.
+		 */
+		if ((parent->partition_root_state == PRS_ERROR) ||
+		     cpumask_empty(&new_cpus)) {
+			update_parent_subparts_cpumask(cs, partcmd_disable,
+						       NULL, tmp);
+			cs->partition_root_state = PRS_ERROR;
+		}
+		cpuset_force_rebuild();
+	}
+
+	/*
+	 * On the other hand, an erroneous partition root may be transitioned
+	 * back to a regular one or a partition root with no CPU allocated
+	 * from the parent may change to erroneous.
+	 */
+	if (is_partition_root(parent) &&
+	   ((cs->partition_root_state == PRS_ERROR) ||
+	    !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
+	     update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
+		cpuset_force_rebuild();
+
+update_tasks:
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
 	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
 
@@ -2893,13 +2974,6 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
 	mutex_unlock(&cpuset_mutex);
 }
 
-static bool force_rebuild;
-
-void cpuset_force_rebuild(void)
-{
-	force_rebuild = true;
-}
-
 /**
  * cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset
  *
@@ -2922,6 +2996,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	static nodemask_t new_mems;
 	bool cpus_updated, mems_updated;
 	bool on_dfl = is_in_v2_mode();
+	struct tmpmasks tmp, *ptmp = NULL;
+
+	if (on_dfl && !alloc_cpumasks(NULL, &tmp))
+		ptmp = &tmp;
 
 	mutex_lock(&cpuset_mutex);
 
@@ -2929,6 +3007,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	cpumask_copy(&new_cpus, cpu_active_mask);
 	new_mems = node_states[N_MEMORY];
 
+	/*
+	 * If subparts_cpus is populated, it is likely that the check below
+	 * will produce a false positive on cpus_updated when the cpu list
+	 * isn't changed. It is extra work, but it is better to be safe.
+	 */
 	cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
 	mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
 
@@ -2937,6 +3020,22 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+		/*
+		 * Make sure that CPUs allocated to child partitions
+		 * do not show up in effective_cpus. If no CPU is left,
+		 * we clear the subparts_cpus & let the child partitions
+		 * fight for the CPUs again.
+		 */
+		if (top_cpuset.nr_subparts_cpus) {
+			if (cpumask_subset(&new_cpus,
+					   top_cpuset.subparts_cpus)) {
+				top_cpuset.nr_subparts_cpus = 0;
+				cpumask_clear(top_cpuset.subparts_cpus);
+			} else {
+				cpumask_andnot(&new_cpus, &new_cpus,
+					       top_cpuset.subparts_cpus);
+			}
+		}
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
 		spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
@@ -2965,7 +3064,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 				continue;
 			rcu_read_unlock();
 
-			cpuset_hotplug_update_tasks(cs);
+			cpuset_hotplug_update_tasks(cs, ptmp);
 
 			rcu_read_lock();
 			css_put(&cs->css);
@@ -2978,6 +3077,9 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		force_rebuild = false;
 		rebuild_sched_domains();
 	}
+
+	if (ptmp)
+		free_cpumasks(&tmp);
 }
 
 void cpuset_update_active_cpus(void)
-- 
1.8.3.1



* [PATCH v13 08/11] cpuset: Make generate_sched_domains() work with partition
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (6 preceding siblings ...)
  2018-10-12 17:55 ` [PATCH v13 07/11] cpuset: Make CPU hotplug work with partition Waiman Long
@ 2018-10-12 17:55 ` Waiman Long
  2018-10-12 17:55 ` [PATCH v13 09/11] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

The generate_sched_domains() function is modified to work correctly
with the newly introduced subparts_cpus mask when generating
scheduling domains.
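
The reworked walk can be summarized as follows (a condensed sketch of
the generate_sched_domains() loop in the diff below):

	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
		if (cp == &top_cpuset)
			continue;

		/* With a load-balancing root, skip cpusets fully
		 * covered by the root's effective_cpus. */
		if (root_load_balance &&
		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
			continue;

		if (is_sched_load_balance(cp))
			csa[csn++] = cp;

		/* Only prune a subtree if its top is not a partition
		 * root; partition roots below must still be visited. */
		if (!is_partition_root(cp))
			pos_css = css_rightmost_descendant(pos_css);
	}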

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 578e6ae..c52074e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -763,13 +763,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	int ndoms = 0;		/* number of sched domains in result */
 	int nslot;		/* next empty doms[] struct cpumask slot */
 	struct cgroup_subsys_state *pos_css;
+	bool root_load_balance = is_sched_load_balance(&top_cpuset);
 
 	doms = NULL;
 	dattr = NULL;
 	csa = NULL;
 
 	/* Special case for the 99% of systems with one, full, sched domain */
-	if (is_sched_load_balance(&top_cpuset)) {
+	if (root_load_balance && !top_cpuset.nr_subparts_cpus) {
 		ndoms = 1;
 		doms = alloc_sched_domains(ndoms);
 		if (!doms)
@@ -792,6 +793,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	csn = 0;
 
 	rcu_read_lock();
+	if (root_load_balance)
+		csa[csn++] = &top_cpuset;
 	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
 		if (cp == &top_cpuset)
 			continue;
@@ -802,6 +805,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		 * parent's cpus, so just skip them, and then we call
 		 * update_domain_attr_tree() to calc relax_domain_level of
 		 * the corresponding sched domain.
+		 *
+		 * If root is load-balancing, we can skip @cp if it
+		 * is a subset of the root's effective_cpus.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    !(is_sched_load_balance(cp) &&
 		      cpumask_intersects(cp->cpus_allowed,
 					 housekeeping_cpumask(HK_FLAG_DOMAIN))))
 					 housekeeping_cpumask(HK_FLAG_DOMAIN))))
 			continue;
 
+		if (root_load_balance &&
+		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
+			continue;
+
 		if (is_sched_load_balance(cp))
 			csa[csn++] = cp;
 
-		/* skip @cp's subtree */
-		pos_css = css_rightmost_descendant(pos_css);
+		/* skip @cp's subtree if not a partition root */
+		if (!is_partition_root(cp))
+			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
 
@@ -941,7 +952,12 @@ static void rebuild_sched_domains_locked(void)
 	 * passing doms with offlined cpu to partition_sched_domains().
 	 * Anyways, hotplug work item will rebuild sched domains.
 	 */
-	if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+	if (!top_cpuset.nr_subparts_cpus &&
+	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+		goto out;
+
+	if (top_cpuset.nr_subparts_cpus &&
+	   !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
 		goto out;
 
 	/* Generate domain masks and attrs */
@@ -1356,11 +1372,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		update_tasks_cpumask(cp);
 
 		/*
-		 * If the effective cpumask of any non-empty cpuset is changed,
-		 * we need to rebuild sched domains.
+		 * On legacy hierarchy, if the effective cpumask of any non-
+		 * empty cpuset is changed, we need to rebuild sched domains.
+		 * On default hierarchy, the cpuset needs to be a partition
+		 * root as well.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
-		    is_sched_load_balance(cp))
+		    is_sched_load_balance(cp) &&
+		   (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
+		    is_partition_root(cp)))
 			need_rebuild_sched_domains = true;
 
 		rcu_read_lock();
-- 
1.8.3.1



* [PATCH v13 09/11] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (7 preceding siblings ...)
  2018-10-12 17:55 ` [PATCH v13 08/11] cpuset: Make generate_sched_domains() " Waiman Long
@ 2018-10-12 17:55 ` Waiman Long
  2018-10-12 17:55 ` [PATCH v13 10/11] cpuset: Add documentation about the new "cpuset.sched.partition" flag Waiman Long
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Because setting "cpuset.sched.partition" in a direct child of the
root can remove CPUs from the root's effective CPU list, it makes
sense to know which CPUs are left in the root cgroup for scheduling
purposes. So the "cpuset.cpus.effective" control file is now exposed
in the v2 cgroup root.

For consistency, the "cpuset.mems.effective" control file is exposed
as well.
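
For example, the remaining CPUs can now be read from the root cgroup
(a userspace sketch; the cgroup v2 mount point at /sys/fs/cgroup is an
assumption):

	#include <stdio.h>

	int main(void)
	{
		char buf[256];
		FILE *f = fopen("/sys/fs/cgroup/cpuset.cpus.effective", "r");

		if (!f)
			return 1;
		if (fgets(buf, sizeof(buf), f))
			printf("root effective cpus: %s", buf);
		fclose(f);
		return 0;
	}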

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 4 ++--
 kernel/cgroup/cpuset.c                  | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 0465c23..533e85c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1631,7 +1631,7 @@ Cpuset Interface Files
 	and won't be affected by any CPU hotplug events.
 
   cpuset.cpus.effective
-	A read-only multiple values file which exists on non-root
+	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
 
 	It lists the onlined CPUs that are actually granted to this
@@ -1671,7 +1671,7 @@ Cpuset Interface Files
 	and won't be affected by any memory nodes hotplug events.
 
   cpuset.mems.effective
-	A read-only multiple values file which exists on non-root
+	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
 
 	It lists the onlined memory nodes that are actually granted to
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c52074e..a3b81d4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2560,14 +2560,12 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.name = "cpus.effective",
 		.seq_show = cpuset_common_seq_show,
 		.private = FILE_EFFECTIVE_CPULIST,
-		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
 	{
 		.name = "mems.effective",
 		.seq_show = cpuset_common_seq_show,
 		.private = FILE_EFFECTIVE_MEMLIST,
-		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
 	{
-- 
1.8.3.1



* [PATCH v13 10/11] cpuset: Add documentation about the new "cpuset.sched.partition" flag
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (8 preceding siblings ...)
  2018-10-12 17:55 ` [PATCH v13 09/11] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
@ 2018-10-12 17:55 ` Waiman Long
  2018-10-12 17:55 ` [PATCH v13 11/11] cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug Waiman Long
  2018-10-15 16:35 ` [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Tejun Heo
  11 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

The cgroup-v2.rst file is updated to document the purpose of the new
"cpuset.sched.partition" flag and how it is used.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 66 +++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 533e85c..178cda4 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1686,6 +1686,72 @@ Cpuset Interface Files
 
 	Its value will be affected by memory nodes hotplug events.
 
+  cpuset.sched.partition
+	A read-write single value file which exists on non-root
+	cpuset-enabled cgroups.  It accepts either "0" (off) or "1"
+	(on) when written to.  This flag is set and owned by the
+	parent cgroup.
+
+	If set, it indicates that the current cgroup is the root of a
+	new partition or scheduling domain that comprises itself and
+	all its descendants except those that are separate partition
+	roots themselves and their descendants.  The root cgroup is
+	always a partition root.
+
+	There are constraints on where this flag can be set.  It can
+	only be set in a cgroup if all the following conditions are true.
+
+	1) The "cpuset.cpus" is not empty and the list of CPUs is
+	   exclusive, i.e. they are not shared by any of its siblings.
+	2) The parent cgroup is a partition root.
+	3) The "cpuset.cpus" is also a proper subset of the parent's
+	   "cpuset.cpus.effective".
+	4) There are no child cgroups with cpuset enabled.  This
+	   eliminates corner cases that would have to be handled if such
+	   a condition were allowed.
+
+	Setting this flag will take the CPUs away from the effective
+	CPUs of the parent cgroup.  Once it is set, this flag cannot
+	be cleared if there are any child cgroups with cpuset enabled.
+
+	A parent partition cannot distribute all its CPUs to its
+	child partitions.  There must be at least one cpu left in the
+	parent partition.
+
+	Once a cgroup becomes a partition root, changes to "cpuset.cpus"
+	are generally allowed as long as the first condition above
+	remains true, the change does not take away all the CPUs from
+	the parent partition, and the new "cpuset.cpus" value is a
+	superset of its children's "cpuset.cpus" values.
+
+	Sometimes, external factors like changes to ancestors'
+	"cpuset.cpus" or cpu hotplug can cause the state of the partition
+	root to change.  On read, the "cpuset.sched.partition" file
+	can show the following values.
+
+	"0"  Not a partition root
+	"1"  Partition root
+	"-1" Erroneous partition root
+
+	It is a partition root if the first 2 partition root conditions
+	above are true and at least one CPU from "cpuset.cpus" is
+	granted by the parent cgroup.
+
+	A partition root can become an erroneous partition root if none
+	of the CPUs requested in "cpuset.cpus" can be granted by the
+	parent cgroup or the parent cgroup is no longer a partition root.
+	In this case, it is not a real partition even though the
+	restriction of the first partition root condition above will
+	still apply.  All the tasks in the cgroup will be migrated to
+	the nearest ancestor partition.
+
+	An erroneous partition root can be transitioned back to a real
+	partition root if at least one of the requested CPUs can now be
+	granted by its parent.	In this case, the tasks will be migrated
+	back to the newly created partition.  Clearing the partition
+	flag of an erroneous partition root is always allowed even if
+	child cpusets are present.
+
 
 Device controller
 -----------------
-- 
1.8.3.1
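
As a usage illustration of the interface documented above, here is a
minimal userspace sketch that carves CPUs 2-3 out as a partition.  The
cgroup path "/sys/fs/cgroup/part1" is an assumption, and the cgroup is
assumed to already exist with the cpuset controller enabled:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static int write_str(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);

		if (fd < 0)
			return -1;
		if (write(fd, val, strlen(val)) < 0) {
			close(fd);
			return -1;
		}
		return close(fd);
	}

	int main(void)
	{
		/* Request exclusive CPUs first, then turn on the flag. */
		if (write_str("/sys/fs/cgroup/part1/cpuset.cpus", "2-3"))
			perror("cpuset.cpus");
		if (write_str("/sys/fs/cgroup/part1/cpuset.sched.partition", "1"))
			perror("cpuset.sched.partition");
		return 0;
	}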



* [PATCH v13 11/11] cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (9 preceding siblings ...)
  2018-10-12 17:55 ` [PATCH v13 10/11] cpuset: Add documentation about the new "cpuset.sched.partition" flag Waiman Long
@ 2018-10-12 17:55 ` Waiman Long
  2018-10-15 16:35 ` [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Tejun Heo
  11 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2018-10-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

For debugging purposes, it is useful to expose the content of
subparts_cpus as a read-only file to verify that the code works
correctly. However, subparts_cpus will not be used at all in most use
cases, so adding a new cpuset file that clutters the cgroup directory
may not be desirable. Instead, the hidden "cgroup_debug" kernel
command line option is used to expose a new
"cpuset.cpus.subpartitions" file.

That option was originally used by the debug controller to expose
itself when configured into the kernel. This is now extended to set an
internal flag used by cgroup_addrm_files(). A new CFTYPE_DEBUG flag
can now be used to specify that a cgroup file should only be created
when the "cgroup_debug" option is specified.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cgroup-defs.h     |  1 +
 kernel/cgroup/cgroup-internal.h |  2 ++
 kernel/cgroup/cgroup.c          | 14 +++++++++++++-
 kernel/cgroup/cpuset.c          | 11 +++++++++++
 kernel/cgroup/debug.c           |  4 +---
 5 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ff20b67..94c7a78 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -91,6 +91,7 @@ enum {
 
 	CFTYPE_NO_PREFIX	= (1 << 3),	/* (DON'T USE FOR NEW FILES) no subsys prefix */
 	CFTYPE_WORLD_WRITABLE	= (1 << 4),	/* (DON'T USE FOR NEW FILES) S_IWUGO */
+	CFTYPE_DEBUG		= (1 << 5),	/* create when cgroup_debug */
 
 	/* internal flags, do not use outside cgroup core proper */
 	__CFTYPE_ONLY_ON_DFL	= (1 << 16),	/* only on default hierarchy */
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 75568fc..c950864 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -11,6 +11,8 @@
 #define TRACE_CGROUP_PATH_LEN 1024
 extern spinlock_t trace_cgroup_path_lock;
 extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
+extern bool cgroup_debug;
+extern void __init enable_debug_cgroup(void);
 
 /*
  * cgroup_path() takes a spin lock. It is good practice not to take
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index aae10ba..067e48a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -85,6 +85,7 @@
 
 DEFINE_SPINLOCK(trace_cgroup_path_lock);
 char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
+bool cgroup_debug __read_mostly;
 
 /*
  * Protects cgroup_idr and css_idr so that IDs can be released without
@@ -3616,7 +3617,8 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 			continue;
 		if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgroup_parent(cgrp))
 			continue;
-
+		if ((cft->flags & CFTYPE_DEBUG) && !cgroup_debug)
+			continue;
 		if (is_add) {
 			ret = cgroup_add_file(css, cgrp, cft);
 			if (ret) {
@@ -5695,6 +5697,16 @@ static int __init cgroup_disable(char *str)
 }
 __setup("cgroup_disable=", cgroup_disable);
 
+void __init __weak enable_debug_cgroup(void) { }
+
+static int __init enable_cgroup_debug(char *str)
+{
+	cgroup_debug = true;
+	enable_debug_cgroup();
+	return 1;
+}
+__setup("cgroup_debug", enable_cgroup_debug);
+
 /**
  * css_tryget_online_from_dir - get corresponding css from a cgroup dentry
  * @dentry: directory dentry of interest
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a3b81d4..9db4f10 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2190,6 +2190,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	FILE_MEMLIST,
 	FILE_EFFECTIVE_CPULIST,
 	FILE_EFFECTIVE_MEMLIST,
+	FILE_SUBPARTS_CPULIST,
 	FILE_CPU_EXCLUSIVE,
 	FILE_MEM_EXCLUSIVE,
 	FILE_MEM_HARDWALL,
@@ -2371,6 +2372,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	case FILE_EFFECTIVE_MEMLIST:
 		seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems));
 		break;
+	case FILE_SUBPARTS_CPULIST:
+		seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->subparts_cpus));
+		break;
 	default:
 		ret = -EINVAL;
 	}
@@ -2576,6 +2580,13 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "cpus.subpartitions",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_SUBPARTS_CPULIST,
+		.flags = CFTYPE_DEBUG,
+	},
+
 	{ }	/* terminate */
 };
 
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index 9caeda6..5f1b873 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -373,11 +373,9 @@ struct cgroup_subsys debug_cgrp_subsys = {
  * On v2, debug is an implicit controller enabled by "cgroup_debug" boot
  * parameter.
  */
-static int __init enable_cgroup_debug(char *str)
+void __init enable_debug_cgroup(void)
 {
 	debug_cgrp_subsys.dfl_cftypes = debug_files;
 	debug_cgrp_subsys.implicit_on_dfl = true;
 	debug_cgrp_subsys.threaded = true;
-	return 1;
 }
-__setup("cgroup_debug", enable_cgroup_debug);
-- 
1.8.3.1



* Re: [PATCH v13 04/11] cpuset: Add new v2 cpuset.sched.partition flag
  2018-10-12 17:55 ` [PATCH v13 04/11] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
@ 2018-10-13 14:10   ` kbuild test robot
  0 siblings, 0 replies; 20+ messages in thread
From: kbuild test robot @ 2018-10-13 14:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: kbuild-all, Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 3911 bytes --]

Hi Waiman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on cgroup/for-next]
[also build test ERROR on v4.19-rc7 next-20181012]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/cpuset-Enable-cpuset-controller-in-default-hierarchy/20181013-213352
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
config: i386-randconfig-x078-201840 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   kernel/cgroup/cpuset.c: In function 'update_cpumask':
>> kernel/cgroup/cpuset.c:1293:15: error: assignment to expression with array type
     tmp.addmask  = trialcs->subparts_cpus;
                  ^
   kernel/cgroup/cpuset.c:1294:15: error: assignment to expression with array type
     tmp.delmask  = trialcs->effective_cpus;
                  ^
   kernel/cgroup/cpuset.c:1295:15: error: assignment to expression with array type
     tmp.new_cpus = trialcs->cpus_allowed;
                  ^

vim +1293 kernel/cgroup/cpuset.c

  1247	
  1248	/**
  1249	 * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
  1250	 * @cs: the cpuset to consider
  1251	 * @trialcs: trial cpuset
  1252	 * @buf: buffer of cpu numbers written to this cpuset
  1253	 */
  1254	static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
  1255				  const char *buf)
  1256	{
  1257		int retval;
  1258		struct tmpmasks tmp;
  1259	
  1260		/* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */
  1261		if (cs == &top_cpuset)
  1262			return -EACCES;
  1263	
  1264		/*
  1265		 * An empty cpus_allowed is ok only if the cpuset has no tasks.
  1266		 * Since cpulist_parse() fails on an empty mask, we special case
  1267		 * that parsing.  The validate_change() call ensures that cpusets
  1268		 * with tasks have cpus.
  1269		 */
  1270		if (!*buf) {
  1271			cpumask_clear(trialcs->cpus_allowed);
  1272		} else {
  1273			retval = cpulist_parse(buf, trialcs->cpus_allowed);
  1274			if (retval < 0)
  1275				return retval;
  1276	
  1277			if (!cpumask_subset(trialcs->cpus_allowed,
  1278					    top_cpuset.cpus_allowed))
  1279				return -EINVAL;
  1280		}
  1281	
  1282		/* Nothing to do if the cpus didn't change */
  1283		if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed))
  1284			return 0;
  1285	
  1286		retval = validate_change(cs, trialcs);
  1287		if (retval < 0)
  1288			return retval;
  1289	
  1290		/*
  1291		 * Use the cpumasks in trialcs for tmpmasks.
  1292		 */
> 1293		tmp.addmask  = trialcs->subparts_cpus;
  1294		tmp.delmask  = trialcs->effective_cpus;
  1295		tmp.new_cpus = trialcs->cpus_allowed;
  1296	
  1297		if (cs->partition_root_state) {
  1298			/* Cpumask of a partition root cannot be empty */
  1299			if (cpumask_empty(trialcs->cpus_allowed))
  1300				return -EINVAL;
  1301			if (update_parent_subparts_cpumask(cs, partcmd_update,
  1302						trialcs->cpus_allowed, &tmp) < 0)
  1303				return -EINVAL;
  1304		}
  1305	
  1306		spin_lock_irq(&callback_lock);
  1307		cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
  1308	
  1309		/*
  1310		 * Make sure that subparts_cpus is a subset of cpus_allowed.
  1311		 */
  1312		if (cs->nr_subparts_cpus) {
  1313			cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus,
  1314				       cs->cpus_allowed);
  1315			cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
  1316		}
  1317		spin_unlock_irq(&callback_lock);
  1318	
  1319		update_cpumasks_hier(cs, &tmp);
  1320		return 0;
  1321	}
  1322	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28944 bytes --]


* Re: [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy
  2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
                   ` (10 preceding siblings ...)
  2018-10-12 17:55 ` [PATCH v13 11/11] cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug Waiman Long
@ 2018-10-15 16:35 ` Tejun Heo
  2018-10-15 17:04   ` Waiman Long
  11 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2018-10-15 16:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

Hello, Waiman.

This looks great to me.  I have only one small nit in terms of
interface.  Currently, cpuset.partition file uses -1, 0, 1; however,
given that this is consistent with how cgroup.type behaves (something
can be set but may be invalid), I wonder whether using a similar
syntax would be more consistent.  e.g. sth like cpuset.type which
takes "normal" or "partition" and shows one of "normal", "partition"
and "partition invalid".  I'd be perfectly happy with that being a
follow-up patch.

Peter, does the patchset look good to you?  It does everything we want
it to in terms of sched domain partitioning and is compliant all
cgroup2 conventions.  If so, I'll apply the series to the cgroup devel
branch.

Thank you very much for the work and persistence!

-- 
tejun


* Re: [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy
  2018-10-15 16:35 ` [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Tejun Heo
@ 2018-10-15 17:04   ` Waiman Long
  0 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2018-10-15 17:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 10/15/2018 12:35 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> This looks great to me.  I have only one small nit in terms of
> interface.  Currently, cpuset.partition file uses -1, 0, 1; however,
> given that this is consistent with how cgroup.type behaves (something
> can be set but may be invalid), I wonder whether using a similar
> syntax would be more consistent.  e.g. sth like cpuset.type which
> takes "normal" or "partition" and shows one of "normal", "partition"
> and "partition invalid".  I'd be perfectly happy with that being a
> follow-up patch.

Sure. I can send a follow-up patch to add that.
> Peter, does the patchset look good to you?  It does everything we want
> it to in terms of sched domain partitioning and is compliant all
> cgroup2 conventions.  If so, I'll apply the series to the cgroup devel
> branch.
>
> Thank you very much for the work and persistence!
>
Thanks for the comment. There is a minor bug in patch 3 because of
the two possibilities of how cpumask_var_t can be defined. I will send
an updated version of that patch to fix that.
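
For context, the two definitions live in include/linux/cpumask.h, so a
plain assignment to a cpumask_var_t only compiles in one of the two
configurations:

	#ifdef CONFIG_CPUMASK_OFFSTACK
	typedef struct cpumask *cpumask_var_t;	   /* pointer: assignable */
	#else
	typedef struct cpumask cpumask_var_t[1];   /* array: not assignable */
	#endif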

-Longman




* Re: [PATCH v13 01/11] cpuset: Enable cpuset controller in default hierarchy
  2018-10-12 17:55 ` [PATCH v13 01/11] " Waiman Long
@ 2018-10-15 18:31   ` Tom Hromatka
  2018-10-15 18:41     ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tom Hromatka @ 2018-10-15 18:31 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Li Zefan, Johannes Weiner,
	Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi



On 10/12/2018 11:55 AM, Waiman Long wrote:
> Given the fact that thread mode had been merged into 4.14, it is now
> time to enable cpuset to be used in the default hierarchy (cgroup v2)
> as it is clearly threaded.
>
> The cpuset controller had experienced feature creep since its
> introduction more than a decade ago. Besides the core cpus and mems
> control files to limit cpus and memory nodes, there are a bunch of
> additional features that can be controlled from the userspace. Some of
> the features are of doubtful usefulness and may not be actively used.
>
> This patch enables cpuset controller in the default hierarchy with
> a minimal set of features, namely just the cpus and mems and their
> effective_* counterparts.  We can certainly add more features to the
> default hierarchy in the future if there is a real user need for them
> later on.

I really appreciate this simplification of the cpuset controller.

I currently have an in-house customer that is unhappy with the
way cgroups cpuset v1 handles offlined CPUs.  This patchset makes two
significant improvements over current cpuset v1 behavior.  1) It allows
for offlined CPUs to be added to a cpuset (even though the CPU is
offline) and 2) it restores CPUs back to the requested cpuset when they
come back online.  This is more intuitive and secure for my customers.
Thank you!

In part because of this patchset, I plan on proposing similar behavior
for offlined CPUs to cpuset v1.

Thanks.

Tom


* Re: [PATCH v13 03/11] cpuset: Simplify allocation and freeing of cpumasks
  2018-10-12 17:55 ` [PATCH v13 03/11] cpuset: Simplify allocation and freeing of cpumasks Waiman Long
@ 2018-10-15 18:35   ` Tom Hromatka
  2018-10-15 18:43     ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tom Hromatka @ 2018-10-15 18:35 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Li Zefan, Johannes Weiner,
	Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi



On 10/12/2018 11:55 AM, Waiman Long wrote:
> The previous commit introduces a new subparts_cpus mask into the cpuset
> data structure and a new tmpmasks structure.  Managing the allocation
> and freeing of those cpumasks is becoming more complex.
>
> So a number of helper functions are added to simplify and streamline
> the management of those cpumasks. To make it simple, all the cpumasks
> are now pre-cleared on allocation.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>   kernel/cgroup/cpuset.c | 104 +++++++++++++++++++++++++++++++++----------------
>   1 file changed, 71 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 29a2bdc..9ac5f94 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -456,6 +456,57 @@ static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
>   }
>   
>   /**
> + * alloc_cpumasks - allocate three cpumasks for cpuset
> + * @cs:  the cpuset that have cpumasks to be allocated.
> + * @tmp: the tmpmasks structure pointer
> + * Return: 0 if successful, -ENOMEM otherwise.
> + *
> + * Only one of the two input arguments should be non-NULL.
> + */
> +static inline int alloc_cpumasks(struct cpuset *cs, struct tmpmasks *tmp)
> +{
> +	cpumask_var_t *pmask1, *pmask2, *pmask3;
> +
> +	if (cs) {
> +		pmask1 = &cs->cpus_allowed;
> +		pmask2 = &cs->effective_cpus;
> +		pmask3 = &cs->subparts_cpus;
> +	} else {
> +		pmask1 = &tmp->new_cpus;
> +		pmask2 = &tmp->addmask;
> +		pmask3 = &tmp->delmask;
> +	}
> +
> +	if (!zalloc_cpumask_var(pmask1, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	if (!zalloc_cpumask_var(pmask2, GFP_KERNEL))
> +		goto free_one;
> +
> +	if (!zalloc_cpumask_var(pmask3, GFP_KERNEL))
> +		goto free_two;
> +
> +	return 0;
> +
> +free_two:
> +	free_cpumask_var(*pmask2);
> +free_one:
> +	free_cpumask_var(*pmask1);
> +	return -ENOMEM;
> +}
> +
> +/**
> + * free_cpumasks - free cpumasks in a tmpmasks structure
> + * @tmp: the tmpmasks structure pointer
> + */
> +static inline void free_cpumasks(struct tmpmasks *tmp)
> +{
> +	free_cpumask_var(tmp->new_cpus);
> +	free_cpumask_var(tmp->addmask);
> +	free_cpumask_var(tmp->delmask);
> +}
> +

I hesitate to bring this up, but since you're respinning this
patch for a different bug...

Would it make sense to have free_cpumasks() have a similar
API and behavior to alloc_cpumasks()?  I could see this potentially
causing bugs/confusion in future patches.
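
Something like the following, perhaps (a sketch of the suggested
symmetric API, not a patch that has been posted):

	/**
	 * free_cpumasks - free cpumasks in a cpuset or a tmpmasks structure
	 * @cs:  the cpuset that has cpumasks to be freed (may be NULL)
	 * @tmp: the tmpmasks structure pointer (may be NULL)
	 */
	static inline void free_cpumasks(struct cpuset *cs, struct tmpmasks *tmp)
	{
		if (cs) {
			free_cpumask_var(cs->cpus_allowed);
			free_cpumask_var(cs->effective_cpus);
			free_cpumask_var(cs->subparts_cpus);
		}
		if (tmp) {
			free_cpumask_var(tmp->new_cpus);
			free_cpumask_var(tmp->addmask);
			free_cpumask_var(tmp->delmask);
		}
	}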

Thanks.

Tom




* Re: [PATCH v13 01/11] cpuset: Enable cpuset controller in default hierarchy
  2018-10-15 18:31   ` Tom Hromatka
@ 2018-10-15 18:41     ` Waiman Long
  2018-10-15 19:10       ` Tom Hromatka
  0 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2018-10-15 18:41 UTC (permalink / raw)
  To: Tom Hromatka, Tejun Heo, Li Zefan, Johannes Weiner,
	Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi

On 10/15/2018 02:31 PM, Tom Hromatka wrote:
>
>
> On 10/12/2018 11:55 AM, Waiman Long wrote:
>> Given the fact that thread mode had been merged into 4.14, it is now
>> time to enable cpuset to be used in the default hierarchy (cgroup v2)
>> as it is clearly threaded.
>>
>> The cpuset controller had experienced feature creep since its
>> introduction more than a decade ago. Besides the core cpus and mems
>> control files to limit cpus and memory nodes, there are a bunch of
>> additional features that can be controlled from the userspace. Some of
>> the features are of doubtful usefulness and may not be actively used.
>>
>> This patch enables cpuset controller in the default hierarchy with
>> a minimal set of features, namely just the cpus and mems and their
>> effective_* counterparts.  We can certainly add more features to the
>> default hierarchy in the future if there is a real user need for them
>> later on.
>
> I really appreciate this simplification of the cpuset controller.
>
> I currently have an in-house customer that is unhappy with the
> way cgroups cpuset v1 handles offlined CPUs.  This patchset makes two
> significant improvements over current cpuset v1 behavior.  1) It allows
> for offlined CPUs to be added to a cpuset (even though the CPU is
> offline) and 2) it restores CPUs back to the requested cpuset when they
> come back online.  This is more intuitive and secure for my customers.
> Thank you!
>
> In part because of this patchset, I plan on proposing similar behavior
> for offlined CPUs to cpuset v1.
>
> Thanks.
>
> Tom

If you are just talking about the ability for cpuset to revert to its
original state when an offlined CPU becomes online again, that
capability can be activated in v1 using the "cpuset_v2_mode" mount
option when mounting the cpuset cgroup. However, that is not the
default for v1.
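
For example (a sketch; the mount point path is an assumption):

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* Equivalent of:
		 *   mount -t cgroup -o cpuset,cpuset_v2_mode cgroup \
		 *         /sys/fs/cgroup/cpuset
		 */
		if (mount("cgroup", "/sys/fs/cgroup/cpuset", "cgroup", 0,
			  "cpuset,cpuset_v2_mode"))
			perror("mount");
		return 0;
	}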

-Longman




* Re: [PATCH v13 03/11] cpuset: Simplify allocation and freeing of cpumasks
  2018-10-15 18:35   ` Tom Hromatka
@ 2018-10-15 18:43     ` Waiman Long
  0 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2018-10-15 18:43 UTC (permalink / raw)
  To: Tom Hromatka, Tejun Heo, Li Zefan, Johannes Weiner,
	Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi

On 10/15/2018 02:35 PM, Tom Hromatka wrote:
>
>
> On 10/12/2018 11:55 AM, Waiman Long wrote:
>> The previous commit introduces a new subparts_cpus mask into the cpuset
>> data structure and a new tmpmasks structure.  Managing the allocation
>> and freeing of those cpumasks is becoming more complex.
>>
>> So a number of helper functions are added to simplify and streamline
>> the management of those cpumasks. To make it simple, all the cpumasks
>> are now pre-cleared on allocation.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c | 104
>> +++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 71 insertions(+), 33 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 29a2bdc..9ac5f94 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -456,6 +456,57 @@ static int is_cpuset_subset(const struct cpuset
>> *p, const struct cpuset *q)
>>   }
>>     /**
>> + * alloc_cpumasks - allocate three cpumasks for cpuset
>> + * @cs:  the cpuset that have cpumasks to be allocated.
>> + * @tmp: the tmpmasks structure pointer
>> + * Return: 0 if successful, -ENOMEM otherwise.
>> + *
>> + * Only one of the two input arguments should be non-NULL.
>> + */
>> +static inline int alloc_cpumasks(struct cpuset *cs, struct tmpmasks
>> *tmp)
>> +{
>> +    cpumask_var_t *pmask1, *pmask2, *pmask3;
>> +
>> +    if (cs) {
>> +        pmask1 = &cs->cpus_allowed;
>> +        pmask2 = &cs->effective_cpus;
>> +        pmask3 = &cs->subparts_cpus;
>> +    } else {
>> +        pmask1 = &tmp->new_cpus;
>> +        pmask2 = &tmp->addmask;
>> +        pmask3 = &tmp->delmask;
>> +    }
>> +
>> +    if (!zalloc_cpumask_var(pmask1, GFP_KERNEL))
>> +        return -ENOMEM;
>> +
>> +    if (!zalloc_cpumask_var(pmask2, GFP_KERNEL))
>> +        goto free_one;
>> +
>> +    if (!zalloc_cpumask_var(pmask3, GFP_KERNEL))
>> +        goto free_two;
>> +
>> +    return 0;
>> +
>> +free_two:
>> +    free_cpumask_var(*pmask2);
>> +free_one:
>> +    free_cpumask_var(*pmask1);
>> +    return -ENOMEM;
>> +}
>> +
>> +/**
>> + * free_cpumasks - free cpumasks in a tmpmasks structure
>> + * @tmp: the tmpmasks structure pointer
>> + */
>> +static inline void free_cpumasks(struct tmpmasks *tmp)
>> +{
>> +    free_cpumask_var(tmp->new_cpus);
>> +    free_cpumask_var(tmp->addmask);
>> +    free_cpumask_var(tmp->delmask);
>> +}
>> +
>
> I hesitate to bring this up, but since you're respinning this
> patch for a different bug...
>
> Would it make sense to have free_cpumasks() have a similar
> API and behavior to alloc_cpumasks()?  I could see this potentially
> causing bugs/confusion in future patches.
>
> Thanks.
>
> Tom
>
>
Thanks for the comment. Yes I will make the change.

-Longman



* Re: [PATCH v13 01/11] cpuset: Enable cpuset controller in default hierarchy
  2018-10-15 18:41     ` Waiman Long
@ 2018-10-15 19:10       ` Tom Hromatka
  0 siblings, 0 replies; 20+ messages in thread
From: Tom Hromatka @ 2018-10-15 19:10 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Li Zefan, Johannes Weiner,
	Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi



On 10/15/2018 12:41 PM, Waiman Long wrote:
> On 10/15/2018 02:31 PM, Tom Hromatka wrote:
>>
>> On 10/12/2018 11:55 AM, Waiman Long wrote:
>>> Given the fact that thread mode had been merged into 4.14, it is now
>>> time to enable cpuset to be used in the default hierarchy (cgroup v2)
>>> as it is clearly threaded.
>>>
>>> The cpuset controller had experienced feature creep since its
>>> introduction more than a decade ago. Besides the core cpus and mems
>>> control files to limit cpus and memory nodes, there are a bunch of
>>> additional features that can be controlled from the userspace. Some of
>>> the features are of doubtful usefulness and may not be actively used.
>>>
>>> This patch enables cpuset controller in the default hierarchy with
>>> a minimal set of features, namely just the cpus and mems and their
>>> effective_* counterparts.  We can certainly add more features to the
>>> default hierarchy in the future if there is a real user need for them
>>> later on.
>> I really appreciate this simplification of the cpuset controller.
>>
>> I currently have an in-house customer that is unhappy with the
>> way cgroups cpuset v1 handles offlined CPUs.  This patchset makes two
>> significant improvements over current cpuset v1 behavior.  1) It allows
>> for offlined CPUs to be added to a cpuset (even though the CPU is
>> offline) and 2) it restores CPUs back to the requested cpuset when they
>> come back online.  This is more intuitive and secure for my customers.
>> Thank you!
>>
>> In part because of this patchset, I plan on proposing similar behavior
>> for offlined CPUs to cpuset v1.
>>
>> Thanks.
>>
>> Tom
> If you are just talking about the ability for cpuset to revert back to
> its original state when an offlined CPU becomes online again, that
> capability can be activated in v1 using the "cpuset_v2_mode" mount
> option when mounting the cpuset cgroup. However, that is not the default
> for v1.
>
> -Longman
>
>

Thanks!!!  I didn't realize cpuset_v2_mode exists.  (I see you
added it last year.  Thank you!)  I'll try it out and see if it
meets my customer's needs.

Tom






Thread overview: 20+ messages
2018-10-12 17:55 [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Waiman Long
2018-10-12 17:55 ` [PATCH v13 01/11] " Waiman Long
2018-10-15 18:31   ` Tom Hromatka
2018-10-15 18:41     ` Waiman Long
2018-10-15 19:10       ` Tom Hromatka
2018-10-12 17:55 ` [PATCH v13 02/11] cpuset: Define data structures to support scheduling partition Waiman Long
2018-10-12 17:55 ` [PATCH v13 03/11] cpuset: Simplify allocation and freeing of cpumasks Waiman Long
2018-10-15 18:35   ` Tom Hromatka
2018-10-15 18:43     ` Waiman Long
2018-10-12 17:55 ` [PATCH v13 04/11] cpuset: Add new v2 cpuset.sched.partition flag Waiman Long
2018-10-13 14:10   ` kbuild test robot
2018-10-12 17:55 ` [PATCH v13 05/11] cpuset: Add an error state to cpuset.sched.partition Waiman Long
2018-10-12 17:55 ` [PATCH v13 06/11] cpuset: Track cpusets that use parent's effective_cpus Waiman Long
2018-10-12 17:55 ` [PATCH v13 07/11] cpuset: Make CPU hotplug work with partition Waiman Long
2018-10-12 17:55 ` [PATCH v13 08/11] cpuset: Make generate_sched_domains() " Waiman Long
2018-10-12 17:55 ` [PATCH v13 09/11] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
2018-10-12 17:55 ` [PATCH v13 10/11] cpuset: Add documentation about the new "cpuset.sched.partition" flag Waiman Long
2018-10-12 17:55 ` [PATCH v13 11/11] cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug Waiman Long
2018-10-15 16:35 ` [PATCH v13 00/11] cpuset: Enable cpuset controller in default hierarchy Tejun Heo
2018-10-15 17:04   ` Waiman Long
