linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring
@ 2022-05-03 16:21 Waiman Long
  2022-05-03 16:21 ` [PATCH v10 1/8] cgroup/cpuset: Add top_cpuset check in update_tasks_cpumask() Waiman Long
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Waiman Long @ 2022-05-03 16:21 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

v10:
 - Relax constraints for changes made to "cpuset.cpus"
   and "cpuset.cpus.partition" as suggested. Now almost all changes
   are allowed.

v9:
 - Add a new patch 1 to remove the child cpuset restriction on parent's
   "cpuset.cpus".
 - Relax initial root partition entry limitation to allow cpuset.cpus to
   overlap that of parent's.
 - An "isolated invalid" displayed type is added to
   cpuset.cpus.partition.
 - Resetting partition root to "member" will leave child partition root
   as invalid.
 - Update documentation and test accordingly.

v8:
 - Reorganize the patch series and rationalize the features and
   constraints of a partition.
 - Update patch descriptions and documentation accordingly.

This patchset include the following enhancements to the cpuset v2
partition code.
 1) Allow partitions that have no task to have empty effective cpus.
 2) Relax the constraints on what changes are allowed in cpuset.cpus
    and cpuset.cpus.partition. However, the partition remain invalid
    until the constraints of a valid partition root is satisfied.
 3) Add a new "isolated" partition type for partitions with no load
    balancing which is available in v1 but not yet in v2.
 4) Allow the reading of cpuset.cpus.partition to include a reason
    string as to why the partition remain invalid.

 In addition, the cgroup-v2.rst documentation file is updated and
 a self test is added to verify the correctness the partition code.

Waiman Long (8):
  cgroup/cpuset: Add top_cpuset check in update_tasks_cpumask()
  cgroup/cpuset: Miscellaneous cleanups & add helper functions
  cgroup/cpuset: Allow no-task partition to have empty
    cpuset.cpus.effective
  cgroup/cpuset: Relax constraints to partition & cpus changes
  cgroup/cpuset: Add a new isolated cpus.partition type
  cgroup/cpuset: Show invalid partition reason string
  cgroup/cpuset: Update description of cpuset.cpus.partition in
    cgroup-v2.rst
  kselftest/cgroup: Add cpuset v2 partition root state test

 Documentation/admin-guide/cgroup-v2.rst       | 145 ++--
 kernel/cgroup/cpuset.c                        | 712 +++++++++++-------
 tools/testing/selftests/cgroup/Makefile       |   5 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 674 +++++++++++++++++
 tools/testing/selftests/cgroup/wait_inotify.c |  87 +++
 5 files changed, 1295 insertions(+), 328 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
 create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

-- 
2.27.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v10 1/8] cgroup/cpuset: Add top_cpuset check in update_tasks_cpumask()
  2022-05-03 16:21 [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Waiman Long
@ 2022-05-03 16:21 ` Waiman Long
  2022-05-03 16:21 ` [PATCH v10 2/8] cgroup/cpuset: Miscellaneous cleanups & add helper functions Waiman Long
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2022-05-03 16:21 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Right now, update_tasks_cpumask() is not supposed to be called with top
cpuset.  With cpuset partition that takes CPUs away from the top cpuset,
we may want to consider adjusting the cpu_allows mask of the tasks in
the top cpuset as well. However, we need more investigation to figure
out a way to do that without breaking non-cgroup aware applications in
the top cpuset. So move the top_cpuset check into update_tasks_cpumask()
and add a comment about possible future work.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9390bfd9f1cd..d90f33b068a3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1128,6 +1128,14 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
+	/*
+	 * TODO: With cpuset partition that takes CPUs away from the top
+	 * cpuset, we may want to properly adjust the cpus_allowed mask of
+	 * tasks in the top cpuset as well.
+	 */
+	if (cs == &top_cpuset)
+		return;
+
 	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		set_cpus_allowed_ptr(task, cs->effective_cpus);
@@ -2092,12 +2100,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 	}
 
-	/*
-	 * Update cpumask of parent's tasks except when it is the top
-	 * cpuset as some system daemons cannot be mapped to other CPUs.
-	 */
-	if (parent != &top_cpuset)
-		update_tasks_cpumask(parent);
+	update_tasks_cpumask(parent);
 
 	if (parent->child_ecpus_count)
 		update_sibling_cpumasks(parent, cs, &tmpmask);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v10 2/8] cgroup/cpuset: Miscellaneous cleanups & add helper functions
  2022-05-03 16:21 [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Waiman Long
  2022-05-03 16:21 ` [PATCH v10 1/8] cgroup/cpuset: Add top_cpuset check in update_tasks_cpumask() Waiman Long
@ 2022-05-03 16:21 ` Waiman Long
  2022-05-03 16:21 ` [PATCH v10 3/8] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective Waiman Long
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2022-05-03 16:21 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

The partition root state (PRS) macro names do not currently match the
external names. Change them to match the external names and add helper
functions to read or change the state.

Shorten the cpuset argument of update_parent_subparts_cpumask() to cs
to match other cpuset functions.

Remove the new_prs argument from notify_partition_change() as the
cs->partition_root_state has already been set to new_prs before it
is called.

There is no functional change.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 165 +++++++++++++++++++++--------------------
 1 file changed, 86 insertions(+), 79 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d90f33b068a3..d156a39d7a08 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -175,20 +175,18 @@ struct cpuset {
 /*
  * Partition root states:
  *
- *   0 - not a partition root
- *
+ *   0 - member (not a partition root)
  *   1 - partition root
- *
  *  -1 - invalid partition root
- *       None of the cpus in cpus_allowed can be put into the parent's
- *       subparts_cpus. In this case, the cpuset is not a real partition
- *       root anymore.  However, the CPU_EXCLUSIVE bit will still be set
- *       and the cpuset can be restored back to a partition root if the
- *       parent cpuset can give more CPUs back to this child cpuset.
  */
-#define PRS_DISABLED		0
-#define PRS_ENABLED		1
-#define PRS_ERROR		-1
+#define PRS_MEMBER		0
+#define PRS_ROOT		1
+#define PRS_INVALID_ROOT	-1
+
+static inline bool is_prs_invalid(int prs_state)
+{
+	return prs_state < 0;
+}
 
 /*
  * Temporary cpumasks for working with partitions that are passed among
@@ -268,25 +266,35 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
-static inline int is_partition_root(const struct cpuset *cs)
+static inline int is_partition_valid(const struct cpuset *cs)
 {
 	return cs->partition_root_state > 0;
 }
 
+static inline int is_partition_invalid(const struct cpuset *cs)
+{
+	return cs->partition_root_state < 0;
+}
+
+static inline void set_partition_invalid(struct cpuset *cs)
+{
+	cs->partition_root_state = PRS_INVALID_ROOT;
+}
+
 /*
  * Send notification event of whenever partition_root_state changes.
  */
-static inline void notify_partition_change(struct cpuset *cs,
-					   int old_prs, int new_prs)
+static inline void notify_partition_change(struct cpuset *cs, int old_prs)
 {
-	if (old_prs != new_prs)
-		cgroup_file_notify(&cs->partition_file);
+	if (old_prs == cs->partition_root_state)
+		return;
+	cgroup_file_notify(&cs->partition_file);
 }
 
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
 		  (1 << CS_MEM_EXCLUSIVE)),
-	.partition_root_state = PRS_ENABLED,
+	.partition_root_state = PRS_ROOT,
 };
 
 /**
@@ -875,7 +883,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 			csa[csn++] = cp;
 
 		/* skip @cp's subtree if not a partition root */
-		if (!is_partition_root(cp))
+		if (!is_partition_valid(cp))
 			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
@@ -1081,7 +1089,7 @@ static void rebuild_sched_domains_locked(void)
 	if (top_cpuset.nr_subparts_cpus) {
 		rcu_read_lock();
 		cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
-			if (!is_partition_root(cs)) {
+			if (!is_partition_valid(cs)) {
 				pos_css = css_rightmost_descendant(pos_css);
 				continue;
 			}
@@ -1215,11 +1223,11 @@ enum subparts_cmd {
  * cpumask changes that violates the cpu exclusivity rule will not be
  * permitted when checked by validate_change().
  */
-static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
+static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 					  struct cpumask *newmask,
 					  struct tmpmasks *tmp)
 {
-	struct cpuset *parent = parent_cs(cpuset);
+	struct cpuset *parent = parent_cs(cs);
 	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
 	int old_prs, new_prs;
@@ -1232,16 +1240,16 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	 * The new cpumask, if present, or the current cpus_allowed must
 	 * not be empty.
 	 */
-	if (!is_partition_root(parent) ||
+	if (!is_partition_valid(parent) ||
 	   (newmask && cpumask_empty(newmask)) ||
-	   (!newmask && cpumask_empty(cpuset->cpus_allowed)))
+	   (!newmask && cpumask_empty(cs->cpus_allowed)))
 		return -EINVAL;
 
 	/*
 	 * Enabling/disabling partition root is not allowed if there are
 	 * online children.
 	 */
-	if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
+	if ((cmd != partcmd_update) && css_has_online_children(&cs->css))
 		return -EBUSY;
 
 	/*
@@ -1250,20 +1258,20 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	 * CPU will be left after that.
 	 */
 	if ((cmd == partcmd_enable) &&
-	   (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
-	     cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
+	   (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus) ||
+	     cpumask_equal(cs->cpus_allowed, parent->effective_cpus)))
 		return -EINVAL;
 
 	/*
 	 * A cpumask update cannot make parent's effective_cpus become empty.
 	 */
 	adding = deleting = false;
-	old_prs = new_prs = cpuset->partition_root_state;
+	old_prs = new_prs = cs->partition_root_state;
 	if (cmd == partcmd_enable) {
-		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
+		cpumask_copy(tmp->addmask, cs->cpus_allowed);
 		adding = true;
 	} else if (cmd == partcmd_disable) {
-		deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
+		deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
 				       parent->subparts_cpus);
 	} else if (newmask) {
 		/*
@@ -1273,7 +1281,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		 * addmask = newmask & parent->effective_cpus
 		 *		     & ~parent->subparts_cpus
 		 */
-		cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask);
+		cpumask_andnot(tmp->delmask, cs->cpus_allowed, newmask);
 		deleting = cpumask_and(tmp->delmask, tmp->delmask,
 				       parent->subparts_cpus);
 
@@ -1307,44 +1315,44 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		 * pre-shrunk in case there is a change in the cpu list.
 		 * So no deletion is needed.
 		 */
-		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
+		adding = cpumask_and(tmp->addmask, cs->cpus_allowed,
 				     parent->effective_cpus);
 		part_error = cpumask_equal(tmp->addmask,
 					   parent->effective_cpus);
 	}
 
 	if (cmd == partcmd_update) {
-		int prev_prs = cpuset->partition_root_state;
+		int prev_prs = cs->partition_root_state;
 
 		/*
-		 * Check for possible transition between PRS_ENABLED
-		 * and PRS_ERROR.
+		 * Check for possible transition between PRS_ROOT
+		 * and PRS_INVALID_ROOT.
 		 */
-		switch (cpuset->partition_root_state) {
-		case PRS_ENABLED:
+		switch (cs->partition_root_state) {
+		case PRS_ROOT:
 			if (part_error)
-				new_prs = PRS_ERROR;
+				new_prs = PRS_INVALID_ROOT;
 			break;
-		case PRS_ERROR:
+		case PRS_INVALID_ROOT:
 			if (!part_error)
-				new_prs = PRS_ENABLED;
+				new_prs = PRS_ROOT;
 			break;
 		}
 		/*
 		 * Set part_error if previously in invalid state.
 		 */
-		part_error = (prev_prs == PRS_ERROR);
+		part_error = is_prs_invalid(prev_prs);
 	}
 
-	if (!part_error && (new_prs == PRS_ERROR))
+	if (!part_error && is_prs_invalid(new_prs))
 		return 0;	/* Nothing need to be done */
 
-	if (new_prs == PRS_ERROR) {
+	if (is_prs_invalid(new_prs)) {
 		/*
 		 * Remove all its cpus from parent's subparts_cpus.
 		 */
 		adding = false;
-		deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
+		deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
 				       parent->subparts_cpus);
 	}
 
@@ -1377,10 +1385,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus);
 
 	if (old_prs != new_prs)
-		cpuset->partition_root_state = new_prs;
+		cs->partition_root_state = new_prs;
 
 	spin_unlock_irq(&callback_lock);
-	notify_partition_change(cpuset, old_prs, new_prs);
+	notify_partition_change(cs, old_prs);
 
 	return cmd == partcmd_update;
 }
@@ -1445,15 +1453,14 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		old_prs = new_prs = cp->partition_root_state;
 		if ((cp != cs) && old_prs) {
 			switch (parent->partition_root_state) {
-			case PRS_DISABLED:
+			case PRS_MEMBER:
 				/*
 				 * If parent is not a partition root or an
 				 * invalid partition root, clear its state
 				 * and its CS_CPU_EXCLUSIVE flag.
 				 */
-				WARN_ON_ONCE(cp->partition_root_state
-					     != PRS_ERROR);
-				new_prs = PRS_DISABLED;
+				WARN_ON_ONCE(!is_partition_invalid(cp));
+				new_prs = PRS_MEMBER;
 
 				/*
 				 * clear_bit() is an atomic operation and
@@ -1465,16 +1472,16 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 				clear_bit(CS_CPU_EXCLUSIVE, &cp->flags);
 				break;
 
-			case PRS_ENABLED:
+			case PRS_ROOT:
 				if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
 					update_tasks_cpumask(parent);
 				break;
 
-			case PRS_ERROR:
+			case PRS_INVALID_ROOT:
 				/*
 				 * When parent is invalid, it has to be too.
 				 */
-				new_prs = PRS_ERROR;
+				new_prs = PRS_INVALID_ROOT;
 				break;
 			}
 		}
@@ -1486,7 +1493,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		spin_lock_irq(&callback_lock);
 
 		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-		if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
+		if (cp->nr_subparts_cpus && !is_partition_valid(cp)) {
 			cp->nr_subparts_cpus = 0;
 			cpumask_clear(cp->subparts_cpus);
 		} else if (cp->nr_subparts_cpus) {
@@ -1518,7 +1525,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			cp->partition_root_state = new_prs;
 
 		spin_unlock_irq(&callback_lock);
-		notify_partition_change(cp, old_prs, new_prs);
+		notify_partition_change(cp, old_prs);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
@@ -1534,7 +1541,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    is_sched_load_balance(cp) &&
 		   (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
-		    is_partition_root(cp)))
+		    is_partition_valid(cp)))
 			need_rebuild_sched_domains = true;
 
 		rcu_read_lock();
@@ -2034,10 +2041,11 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	return err;
 }
 
-/*
+/**
  * update_prstate - update partition_root_state
- * cs: the cpuset to update
- * new_prs: new partition root state
+ * @cs: the cpuset to update
+ * @new_prs: new partition root state
+ * Return: 0 if successful, < 0 if error
  *
  * Call with cpuset_rwsem held.
  */
@@ -2054,7 +2062,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	 * Cannot force a partial or invalid partition root to a full
 	 * partition root.
 	 */
-	if (new_prs && (old_prs == PRS_ERROR))
+	if (new_prs && is_prs_invalid(old_prs))
 		return -EINVAL;
 
 	if (alloc_cpumasks(NULL, &tmpmask))
@@ -2085,7 +2093,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		 * Turning off partition root will clear the
 		 * CS_CPU_EXCLUSIVE bit.
 		 */
-		if (old_prs == PRS_ERROR) {
+		if (is_prs_invalid(old_prs)) {
 			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			err = 0;
 			goto out;
@@ -2111,7 +2119,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		spin_lock_irq(&callback_lock);
 		cs->partition_root_state = new_prs;
 		spin_unlock_irq(&callback_lock);
-		notify_partition_change(cs, old_prs, new_prs);
+		notify_partition_change(cs, old_prs);
 	}
 
 	free_cpumasks(NULL, &tmpmask);
@@ -2604,13 +2612,13 @@ static int sched_partition_show(struct seq_file *seq, void *v)
 	struct cpuset *cs = css_cs(seq_css(seq));
 
 	switch (cs->partition_root_state) {
-	case PRS_ENABLED:
+	case PRS_ROOT:
 		seq_puts(seq, "root\n");
 		break;
-	case PRS_DISABLED:
+	case PRS_MEMBER:
 		seq_puts(seq, "member\n");
 		break;
-	case PRS_ERROR:
+	case PRS_INVALID_ROOT:
 		seq_puts(seq, "root invalid\n");
 		break;
 	}
@@ -2630,9 +2638,9 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 	 * Convert "root" to ENABLED, and convert "member" to DISABLED.
 	 */
 	if (!strcmp(buf, "root"))
-		val = PRS_ENABLED;
+		val = PRS_ROOT;
 	else if (!strcmp(buf, "member"))
-		val = PRS_DISABLED;
+		val = PRS_MEMBER;
 	else
 		return -EINVAL;
 
@@ -2931,7 +2939,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	cpus_read_lock();
 	percpu_down_write(&cpuset_rwsem);
 
-	if (is_partition_root(cs))
+	if (is_partition_valid(cs))
 		update_prstate(cs, 0);
 
 	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
@@ -3176,11 +3184,11 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus or its parent becomes erroneous, we have to
-	 * transition it to the erroneous state.
+	 * effective_cpus or its parent becomes invalid, we have to
+	 * transition it to the invalid state.
 	 */
-	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
-	   (parent->partition_root_state == PRS_ERROR))) {
+	if (is_partition_valid(cs) && (cpumask_empty(&new_cpus) ||
+	    is_partition_invalid(parent))) {
 		if (cs->nr_subparts_cpus) {
 			spin_lock_irq(&callback_lock);
 			cs->nr_subparts_cpus = 0;
@@ -3195,30 +3203,29 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		 * the current partition and let the child partitions
 		 * fight for available CPUs.
 		 */
-		if ((parent->partition_root_state == PRS_ERROR) ||
+		if (is_partition_invalid(parent) ||
 		     cpumask_empty(&new_cpus)) {
 			int old_prs;
 
 			update_parent_subparts_cpumask(cs, partcmd_disable,
 						       NULL, tmp);
 			old_prs = cs->partition_root_state;
-			if (old_prs != PRS_ERROR) {
+			if (!is_prs_invalid(old_prs)) {
 				spin_lock_irq(&callback_lock);
-				cs->partition_root_state = PRS_ERROR;
+				set_partition_invalid(cs);
 				spin_unlock_irq(&callback_lock);
-				notify_partition_change(cs, old_prs, PRS_ERROR);
+				notify_partition_change(cs, old_prs);
 			}
 		}
 		cpuset_force_rebuild();
 	}
 
 	/*
-	 * On the other hand, an erroneous partition root may be transitioned
+	 * On the other hand, an invalid partition root may be transitioned
 	 * back to a regular one or a partition root with no CPU allocated
-	 * from the parent may change to erroneous.
+	 * from the parent may change to invalid.
 	 */
-	if (is_partition_root(parent) &&
-	   ((cs->partition_root_state == PRS_ERROR) ||
+	if (is_partition_valid(parent) && (is_partition_invalid(cs) ||
 	    !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
 	     update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
 		cpuset_force_rebuild();
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v10 3/8] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
  2022-05-03 16:21 [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Waiman Long
  2022-05-03 16:21 ` [PATCH v10 1/8] cgroup/cpuset: Add top_cpuset check in update_tasks_cpumask() Waiman Long
  2022-05-03 16:21 ` [PATCH v10 2/8] cgroup/cpuset: Miscellaneous cleanups & add helper functions Waiman Long
@ 2022-05-03 16:21 ` Waiman Long
  2022-05-03 17:54   ` Phil Auld
  2022-05-03 16:21 ` [PATCH v10 4/8] cgroup/cpuset: Relax constraints to partition & cpus changes Waiman Long
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Waiman Long @ 2022-05-03 16:21 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Currently, a partition root cannot have empty "cpuset.cpus.effective".
As a result, a parent partition root cannot distribute out all its CPUs
 to child partitions with no CPUs left. However in most cases, there
shouldn't be any tasks associated with intermediate nodes of the default
 hierarchy. So the current rule is too restrictive and can waste valuable
 CPU resource.

To address this issue, we are now allowing a partition to have empty
"cpuset.cpus.effective" as long as it has no task. Therefore, a parent
partition with no task can now have all its CPUs distributed out to its
child partitions. The top cpuset always have some house-keeping tasks
running and so its list of effective cpu can't never be empty.

Once a partition with empty "cpuset.cpus.effective" is formed, no
new task can be moved into it until "cpuset.cpus.effective" becomes
non-empty.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 111 +++++++++++++++++++++++++++++++----------
 1 file changed, 84 insertions(+), 27 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d156a39d7a08..7d9abd50a1b9 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -412,6 +412,41 @@ static inline bool is_in_v2_mode(void)
 	      (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
 }
 
+/**
+ * partition_is_populated - check if partition has tasks
+ * @cs: partition root to be checked
+ * @excluded_child: a child cpuset to be excluded in task checking
+ * Return: true if there are tasks, false otherwise
+ *
+ * It is assumed that @cs is a valid partition root. @excluded_child should
+ * be non-NULL when this cpuset is going to become a partition itself.
+ */
+static inline bool partition_is_populated(struct cpuset *cs,
+					  struct cpuset *excluded_child)
+{
+	struct cgroup_subsys_state *css;
+	struct cpuset *child;
+
+	if (cs->css.cgroup->nr_populated_csets)
+		return true;
+	if (!excluded_child && !cs->nr_subparts_cpus)
+		return cgroup_is_populated(cs->css.cgroup);
+
+	rcu_read_lock();
+	cpuset_for_each_child(child, css, cs) {
+		if (child == excluded_child)
+			continue;
+		if (is_partition_valid(child))
+			continue;
+		if (cgroup_is_populated(child->css.cgroup)) {
+			rcu_read_unlock();
+			return true;
+		}
+	}
+	rcu_read_unlock();
+	return false;
+}
+
 /*
  * Return in pmask the portion of a task's cpusets's cpus_allowed that
  * are online and are capable of running the task.  If none are found,
@@ -1252,22 +1287,25 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 	if ((cmd != partcmd_update) && css_has_online_children(&cs->css))
 		return -EBUSY;
 
-	/*
-	 * Enabling partition root is not allowed if not all the CPUs
-	 * can be granted from parent's effective_cpus or at least one
-	 * CPU will be left after that.
-	 */
-	if ((cmd == partcmd_enable) &&
-	   (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus) ||
-	     cpumask_equal(cs->cpus_allowed, parent->effective_cpus)))
-		return -EINVAL;
-
-	/*
-	 * A cpumask update cannot make parent's effective_cpus become empty.
-	 */
 	adding = deleting = false;
 	old_prs = new_prs = cs->partition_root_state;
 	if (cmd == partcmd_enable) {
+		/*
+		 * Enabling partition root is not allowed if not all the CPUs
+		 * can be granted from parent's effective_cpus.
+		 */
+		if (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus))
+			return -EINVAL;
+
+		/*
+		 * A parent can be left with no CPU as long as there is no
+		 * task directly associated with the parent partition. For
+		 * such a parent, no new task can be moved into it.
+		 */
+		if (partition_is_populated(parent, cs) &&
+		    cpumask_equal(cs->cpus_allowed, parent->effective_cpus))
+			return -EINVAL;
+
 		cpumask_copy(tmp->addmask, cs->cpus_allowed);
 		adding = true;
 	} else if (cmd == partcmd_disable) {
@@ -1289,9 +1327,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
 					parent->subparts_cpus);
 		/*
-		 * Return error if the new effective_cpus could become empty.
+		 * Return error if the new effective_cpus could become empty
+		 * and there are tasks in the parent.
 		 */
-		if (adding &&
+		if (adding && partition_is_populated(parent, cs) &&
 		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
 			if (!deleting)
 				return -EINVAL;
@@ -1317,8 +1356,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 		 */
 		adding = cpumask_and(tmp->addmask, cs->cpus_allowed,
 				     parent->effective_cpus);
-		part_error = cpumask_equal(tmp->addmask,
-					   parent->effective_cpus);
+		part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) &&
+			     partition_is_populated(parent, cs);
 	}
 
 	if (cmd == partcmd_update) {
@@ -1420,9 +1459,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
-		 * parent, which is guaranteed to have some CPUs.
+		 * parent, which is guaranteed to have some CPUs unless
+		 * it is a partition root that has explicitly distributed
+		 * out all its CPUs.
 		 */
 		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
+			if (is_partition_valid(cp) &&
+			    cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
+				goto update_parent_subparts;
+
 			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
 			if (!cp->use_parent_ecpus) {
 				cp->use_parent_ecpus = true;
@@ -1444,6 +1489,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			continue;
 		}
 
+update_parent_subparts:
 		/*
 		 * update_parent_subparts_cpumask() should have been called
 		 * for cs already in update_cpumask(). We should also call
@@ -2249,6 +2295,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	    (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
 		goto out_unlock;
 
+	/*
+	 * On default hierarchy, task cannot be moved to a cpuset with empty
+	 * effective cpus.
+	 */
+	if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
+		goto out_unlock;
+
 	cgroup_taskset_for_each(task, css, tset) {
 		ret = task_can_attach(task, cs->cpus_allowed);
 		if (ret)
@@ -3115,7 +3168,8 @@ hotplug_update_tasks(struct cpuset *cs,
 		     struct cpumask *new_cpus, nodemask_t *new_mems,
 		     bool cpus_updated, bool mems_updated)
 {
-	if (cpumask_empty(new_cpus))
+	/* A partition root is allowed to have empty effective cpus */
+	if (cpumask_empty(new_cpus) && !is_partition_valid(cs))
 		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
@@ -3184,10 +3238,11 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus or its parent becomes invalid, we have to
-	 * transition it to the invalid state.
+	 * effective_cpus with tasks or its parent becomes invalid, we
+	 * have to transition it to the invalid state.
 	 */
-	if (is_partition_valid(cs) && (cpumask_empty(&new_cpus) ||
+	if (is_partition_valid(cs) &&
+	   ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
 	    is_partition_invalid(parent))) {
 		if (cs->nr_subparts_cpus) {
 			spin_lock_irq(&callback_lock);
@@ -3198,13 +3253,15 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		}
 
 		/*
-		 * If the effective_cpus is empty because the child
-		 * partitions take away all the CPUs, we can keep
-		 * the current partition and let the child partitions
-		 * fight for available CPUs.
+		 * Force the partition to become invalid if either one of
+		 * the following conditions hold:
+		 * 1) empty effective cpus but not valid empty partition.
+		 * 2) parent is invalid or doesn't grant any cpus to child
+		 *    partitions.
 		 */
 		if (is_partition_invalid(parent) ||
-		     cpumask_empty(&new_cpus)) {
+		    (cpumask_empty(&new_cpus) &&
+		     partition_is_populated(cs, NULL))) {
 			int old_prs;
 
 			update_parent_subparts_cpumask(cs, partcmd_disable,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v10 4/8] cgroup/cpuset: Relax constraints to partition & cpus changes
  2022-05-03 16:21 [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Waiman Long
                   ` (2 preceding siblings ...)
  2022-05-03 16:21 ` [PATCH v10 3/8] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective Waiman Long
@ 2022-05-03 16:21 ` Waiman Long
  2022-05-03 16:21 ` [PATCH v10 5/8] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2022-05-03 16:21 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Currently, enabling a partition root is only allowed if all the
constraints of a valid partition are satisfied. Even changes to
"cpuset.cpus" may not be allowed in some cases. Moreover, there are
limits to changes made to a parent cpuset if it is a valid partition
root. This is contrary to the general cgroup v2 philosophy.

This patch relaxes the constraints of changing the state of "cpuset.cpus"
and "cpuset.cpus.partition". Now all valid changes ("member" or "root")
to "cpuset.cpus.partition" are allowed even if there are child cpusets
underneath it.

Trying to make a cpuset a partition root, however, will cause its state
to become invalid if the following constraints of a valid partition
root are not satisfied.

 1) The "cpuset.cpus" is non-empty and exclusive.
 2) The parent cpuset is a valid partition root.
 3) The "cpuset.cpus" overlaps parent's "cpuset.cpus".

Similarly, almost all changes to "cpuset.cpus" are allowed with the
exception that if the underlying CS_CPU_EXCLUSIVE flag is set, the
exclusivity rule will still apply.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 400 +++++++++++++++++++++--------------------
 1 file changed, 208 insertions(+), 192 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7d9abd50a1b9..1f1d76013046 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1218,13 +1218,15 @@ enum subparts_cmd {
 	partcmd_update,		/* Update parent's subparts_cpus */
 };
 
+static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
+		       int turning_on);
 /**
  * update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset
  * @cpuset:  The cpuset that requests change in partition root state
  * @cmd:     Partition root state change command
  * @newmask: Optional new cpumask for partcmd_update
  * @tmp:     Temporary addmask and delmask
- * Return:   0, 1 or an error code
+ * Return:   0 or -1 (error)
  *
  * For partcmd_enable, the cpuset is being transformed from a non-partition
  * root to a partition root. The cpus_allowed mask of the given cpuset will
@@ -1235,28 +1237,19 @@ enum subparts_cmd {
  * For partcmd_disable, the cpuset is being transformed from a partition
  * root back to a non-partition root. Any CPUs in cpus_allowed that are in
  * parent's subparts_cpus will be taken away from that cpumask and put back
- * into parent's effective_cpus. 0 should always be returned.
+ * into parent's effective_cpus. 0 will always be returned.
  *
- * For partcmd_update, if the optional newmask is specified, the cpu
- * list is to be changed from cpus_allowed to newmask. Otherwise,
- * cpus_allowed is assumed to remain the same. The cpuset should either
- * be a partition root or an invalid partition root. The partition root
- * state may change if newmask is NULL and none of the requested CPUs can
- * be granted by the parent. The function will return 1 if changes to
- * parent's subparts_cpus and effective_cpus happen or 0 otherwise.
- * Error code should only be returned when newmask is non-NULL.
+ * For partcmd_update, if the optional newmask is specified, the cpu list is
+ * to be changed from cpus_allowed to newmask. Otherwise, cpus_allowed is
+ * assumed to remain the same. The cpuset should either be a valid or invalid
+ * partition root. The partition root state may change from valid to invalid
+ * or vice versa. An error code will only be returned if transitioning from
+ * invalid to valid violates the exclusivity rule.
  *
  * The partcmd_enable and partcmd_disable commands are used by
  * update_prstate(). The partcmd_update command is used by
  * update_cpumasks_hier() with newmask NULL and update_cpumask() with
  * newmask set.
- *
- * The checking is more strict when enabling partition root than the
- * other two commands.
- *
- * Because of the implicit cpu exclusive nature of a partition root,
- * cpumask changes that violates the cpu exclusivity rule will not be
- * permitted when checked by validate_change().
  */
 static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 					  struct cpumask *newmask,
@@ -1278,91 +1271,103 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 	if (!is_partition_valid(parent) ||
 	   (newmask && cpumask_empty(newmask)) ||
 	   (!newmask && cpumask_empty(cs->cpus_allowed)))
-		return -EINVAL;
-
-	/*
-	 * Enabling/disabling partition root is not allowed if there are
-	 * online children.
-	 */
-	if ((cmd != partcmd_update) && css_has_online_children(&cs->css))
-		return -EBUSY;
+		return -1;
 
 	adding = deleting = false;
 	old_prs = new_prs = cs->partition_root_state;
 	if (cmd == partcmd_enable) {
 		/*
-		 * Enabling partition root is not allowed if not all the CPUs
-		 * can be granted from parent's effective_cpus.
+		 * Enabling partition root is not allowed if cpus_allowed
+		 * doesn't overlap parent's cpus_allowed.
 		 */
-		if (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus))
-			return -EINVAL;
+		if (!cpumask_intersects(cs->cpus_allowed, parent->cpus_allowed))
+			return -1;
 
 		/*
 		 * A parent can be left with no CPU as long as there is no
-		 * task directly associated with the parent partition. For
-		 * such a parent, no new task can be moved into it.
+		 * task directly associated with the parent partition.
 		 */
 		if (partition_is_populated(parent, cs) &&
-		    cpumask_equal(cs->cpus_allowed, parent->effective_cpus))
-			return -EINVAL;
+		   !cpumask_intersects(cs->cpus_allowed, parent->effective_cpus))
+			return -1;
 
 		cpumask_copy(tmp->addmask, cs->cpus_allowed);
 		adding = true;
 	} else if (cmd == partcmd_disable) {
-		deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
+		/*
+		 * Need to remove cpus from parent's subparts_cpus for valid
+		 * partition root.
+		 */
+		deleting = !is_prs_invalid(old_prs) &&
+			   cpumask_and(tmp->delmask, cs->cpus_allowed,
 				       parent->subparts_cpus);
 	} else if (newmask) {
 		/*
 		 * partcmd_update with newmask:
 		 *
+		 * Compute add/delete mask to/from subparts_cpus
+		 *
 		 * delmask = cpus_allowed & ~newmask & parent->subparts_cpus
-		 * addmask = newmask & parent->effective_cpus
+		 * addmask = newmask & parent->cpus_allowed
 		 *		     & ~parent->subparts_cpus
 		 */
 		cpumask_andnot(tmp->delmask, cs->cpus_allowed, newmask);
 		deleting = cpumask_and(tmp->delmask, tmp->delmask,
 				       parent->subparts_cpus);
 
-		cpumask_and(tmp->addmask, newmask, parent->effective_cpus);
+		cpumask_and(tmp->addmask, newmask, parent->cpus_allowed);
 		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
 					parent->subparts_cpus);
 		/*
-		 * Return error if the new effective_cpus could become empty
-		 * and there are tasks in the parent.
+		 * Make partition invalid if parent's effective_cpus could
+		 * become empty and there are tasks in the parent.
 		 */
 		if (adding && partition_is_populated(parent, cs) &&
-		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
-			if (!deleting)
-				return -EINVAL;
-			/*
-			 * As some of the CPUs in subparts_cpus might have
-			 * been offlined, we need to compute the real delmask
-			 * to confirm that.
-			 */
-			if (!cpumask_and(tmp->addmask, tmp->delmask,
-					 cpu_active_mask))
-				return -EINVAL;
-			cpumask_copy(tmp->addmask, parent->effective_cpus);
+		    cpumask_subset(parent->effective_cpus, tmp->addmask) &&
+		    !cpumask_intersects(tmp->delmask, cpu_active_mask)) {
+			part_error = true;
+			adding = false;
+			deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
+					       parent->subparts_cpus);
 		}
 	} else {
 		/*
 		 * partcmd_update w/o newmask:
 		 *
-		 * addmask = cpus_allowed & parent->effective_cpus
+		 * delmask = cpus_allowed & parent->subparts_cpus
+		 * addmask = cpus_allowed & parent->cpus_allowed
+		 *			  & ~parent->subparts_cpus
 		 *
-		 * Note that parent's subparts_cpus may have been
-		 * pre-shrunk in case there is a change in the cpu list.
-		 * So no deletion is needed.
+		 * This gets invoked either due to a hotplug event or from
+		 * update_cpumasks_hier(). This can cause the state of a
+		 * partition root to transition from valid to invalid or vice
+		 * versa. So we still need to compute the addmask and delmask.
+
+		 * A partition error happens when:
+		 * 1) Cpuset is valid partition, but parent does not distribute
+		 *    out any CPUs.
+		 * 2) Parent has tasks and all its effective CPUs will have
+		 *    to be distributed out.
 		 */
-		adding = cpumask_and(tmp->addmask, cs->cpus_allowed,
-				     parent->effective_cpus);
-		part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) &&
-			     partition_is_populated(parent, cs);
+		cpumask_and(tmp->addmask, cs->cpus_allowed,
+					  parent->cpus_allowed);
+		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
+					parent->subparts_cpus);
+		if ((is_partition_valid(cs) && !parent->nr_subparts_cpus) ||
+		    (adding &&
+		     cpumask_subset(parent->effective_cpus, tmp->addmask) &&
+		     partition_is_populated(parent, cs))) {
+			part_error = true;
+			adding = false;
+		}
+
+		if (part_error && is_partition_valid(cs) &&
+		    parent->nr_subparts_cpus)
+			deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
+					       parent->subparts_cpus);
 	}
 
 	if (cmd == partcmd_update) {
-		int prev_prs = cs->partition_root_state;
-
 		/*
 		 * Check for possible transition between PRS_ROOT
 		 * and PRS_INVALID_ROOT.
@@ -1377,27 +1382,21 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 				new_prs = PRS_ROOT;
 			break;
 		}
-		/*
-		 * Set part_error if previously in invalid state.
-		 */
-		part_error = is_prs_invalid(prev_prs);
-	}
-
-	if (!part_error && is_prs_invalid(new_prs))
-		return 0;	/* Nothing need to be done */
-
-	if (is_prs_invalid(new_prs)) {
-		/*
-		 * Remove all its cpus from parent's subparts_cpus.
-		 */
-		adding = false;
-		deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
-				       parent->subparts_cpus);
 	}
 
 	if (!adding && !deleting && (new_prs == old_prs))
 		return 0;
 
+	/*
+	 * Transitioning from invalid to valid (partcmd_update) may require
+	 * setting CS_CPU_EXCLUSIVE.
+	 */
+	if ((old_prs != new_prs) && is_prs_invalid(old_prs)) {
+		if (!is_cpu_exclusive(cs) &&
+		    (update_flag(CS_CPU_EXCLUSIVE, cs, 1) < 0))
+			return -1;
+	}
+
 	/*
 	 * Change the parent's subparts_cpus.
 	 * Newly added CPUs will be removed from effective_cpus and
@@ -1427,15 +1426,20 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 		cs->partition_root_state = new_prs;
 
 	spin_unlock_irq(&callback_lock);
+
+	if (adding || deleting)
+		update_tasks_cpumask(parent);
+
 	notify_partition_change(cs, old_prs);
 
-	return cmd == partcmd_update;
+	return 0;
 }
 
 /*
  * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree
  * @cs:  the cpuset to consider
  * @tmp: temp variables for calculating effective_cpus & partition setup
+ * @force: don't skip any descendant cpusets if set
  *
  * When configured cpumask is changed, the effective cpumasks of this cpuset
  * and all its descendants need to be updated.
@@ -1444,7 +1448,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
  *
  * Called with cpuset_rwsem held
  */
-static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
+static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
+				 bool force)
 {
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
@@ -1454,6 +1459,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 	rcu_read_lock();
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
+		bool update_parent = false;
 
 		compute_effective_cpumask(tmp->new_cpus, cp, parent);
 
@@ -1481,9 +1487,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		/*
 		 * Skip the whole subtree if the cpumask remains the same
-		 * and has no partition root state.
+		 * and has no partition root state and force flag not set.
 		 */
-		if (!cp->partition_root_state &&
+		if (!cp->partition_root_state && !force &&
 		    cpumask_equal(tmp->new_cpus, cp->effective_cpus)) {
 			pos_css = css_rightmost_descendant(pos_css);
 			continue;
@@ -1499,33 +1505,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		old_prs = new_prs = cp->partition_root_state;
 		if ((cp != cs) && old_prs) {
 			switch (parent->partition_root_state) {
-			case PRS_MEMBER:
-				/*
-				 * If parent is not a partition root or an
-				 * invalid partition root, clear its state
-				 * and its CS_CPU_EXCLUSIVE flag.
-				 */
-				WARN_ON_ONCE(!is_partition_invalid(cp));
-				new_prs = PRS_MEMBER;
-
-				/*
-				 * clear_bit() is an atomic operation and
-				 * readers aren't interested in the state
-				 * of CS_CPU_EXCLUSIVE anyway. So we can
-				 * just update the flag without holding
-				 * the callback_lock.
-				 */
-				clear_bit(CS_CPU_EXCLUSIVE, &cp->flags);
-				break;
-
 			case PRS_ROOT:
-				if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
-					update_tasks_cpumask(parent);
+				update_parent = true;
 				break;
 
-			case PRS_INVALID_ROOT:
+			default:
 				/*
-				 * When parent is invalid, it has to be too.
+				 * When parent is not a partition root or is
+				 * invalid, child partition roots become
+				 * invalid too.
 				 */
 				new_prs = PRS_INVALID_ROOT;
 				break;
@@ -1536,41 +1524,43 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			continue;
 		rcu_read_unlock();
 
+		if (update_parent) {
+			update_parent_subparts_cpumask(cp, partcmd_update, NULL,
+						       tmp);
+			/*
+			 * The cpuset partition_root_state may become
+			 * invalid. Capture it.
+			 */
+			new_prs = cp->partition_root_state;
+		}
+
 		spin_lock_irq(&callback_lock);
 
-		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
 		if (cp->nr_subparts_cpus && !is_partition_valid(cp)) {
+			/*
+			 * Put all active subparts_cpus back to effective_cpus.
+			 */
+			cpumask_or(tmp->new_cpus, tmp->new_cpus,
+				   cp->subparts_cpus);
+			cpumask_and(tmp->new_cpus, tmp->new_cpus,
+				   cpu_active_mask);
 			cp->nr_subparts_cpus = 0;
 			cpumask_clear(cp->subparts_cpus);
-		} else if (cp->nr_subparts_cpus) {
+		}
+
+		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+		if (cp->nr_subparts_cpus) {
 			/*
 			 * Make sure that effective_cpus & subparts_cpus
 			 * are mutually exclusive.
-			 *
-			 * In the unlikely event that effective_cpus
-			 * becomes empty. we clear cp->nr_subparts_cpus and
-			 * let its child partition roots to compete for
-			 * CPUs again.
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
-			if (cpumask_empty(cp->effective_cpus)) {
-				cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-				cpumask_clear(cp->subparts_cpus);
-				cp->nr_subparts_cpus = 0;
-			} else if (!cpumask_subset(cp->subparts_cpus,
-						   tmp->new_cpus)) {
-				cpumask_andnot(cp->subparts_cpus,
-					cp->subparts_cpus, tmp->new_cpus);
-				cp->nr_subparts_cpus
-					= cpumask_weight(cp->subparts_cpus);
-			}
 		}
 
-		if (new_prs != old_prs)
-			cp->partition_root_state = new_prs;
-
+		cp->partition_root_state = new_prs;
 		spin_unlock_irq(&callback_lock);
+
 		notify_partition_change(cp, old_prs);
 
 		WARN_ON(!is_in_v2_mode() &&
@@ -1631,7 +1621,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
 			continue;
 
 		rcu_read_unlock();
-		update_cpumasks_hier(sibling, tmp);
+		update_cpumasks_hier(sibling, tmp, false);
 		rcu_read_lock();
 		css_put(&sibling->css);
 	}
@@ -1691,27 +1681,36 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 #endif
 
 	if (cs->partition_root_state) {
-		/* Cpumask of a partition root cannot be empty */
-		if (cpumask_empty(trialcs->cpus_allowed))
-			return -EINVAL;
-		if (update_parent_subparts_cpumask(cs, partcmd_update,
-					trialcs->cpus_allowed, &tmp) < 0)
-			return -EINVAL;
+		update_parent_subparts_cpumask(cs, partcmd_update,
+					       trialcs->cpus_allowed, &tmp);
 	}
 
+	compute_effective_cpumask(trialcs->effective_cpus, trialcs,
+				  parent_cs(cs));
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
 
 	/*
-	 * Make sure that subparts_cpus is a subset of cpus_allowed.
+	 * Make sure that subparts_cpus, if not empty, is a subset of
+	 * cpus_allowed. Clear subparts_cpus if there is an error or
+	 * empty effective cpus with tasks.
 	 */
 	if (cs->nr_subparts_cpus) {
-		cpumask_and(cs->subparts_cpus, cs->subparts_cpus, cs->cpus_allowed);
-		cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
+		if (cs->prs_err ||
+		   (partition_is_populated(cs, NULL) &&
+		    cpumask_subset(trialcs->effective_cpus, cs->subparts_cpus))) {
+			cs->nr_subparts_cpus = 0;
+			cpumask_clear(cs->subparts_cpus);
+		} else {
+			cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
+				    cs->cpus_allowed);
+			cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
+		}
 	}
 	spin_unlock_irq(&callback_lock);
 
-	update_cpumasks_hier(cs, &tmp);
+	/* effective_cpus will be updated here */
+	update_cpumasks_hier(cs, &tmp, false);
 
 	if (cs->partition_root_state) {
 		struct cpuset *parent = parent_cs(cs);
@@ -2097,7 +2096,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
  */
 static int update_prstate(struct cpuset *cs, int new_prs)
 {
-	int err, old_prs = cs->partition_root_state;
+	int err = 0, old_prs = cs->partition_root_state;
 	struct cpuset *parent = parent_cs(cs);
 	struct tmpmasks tmpmask;
 
@@ -2105,24 +2104,25 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		return 0;
 
 	/*
-	 * Cannot force a partial or invalid partition root to a full
-	 * partition root.
+	 * For a previously invalid partition root, leave it at being
+	 * invalid if new_prs is not "member".
 	 */
 	if (new_prs && is_prs_invalid(old_prs))
-		return -EINVAL;
+		return 0;
 
 	if (alloc_cpumasks(NULL, &tmpmask))
 		return -ENOMEM;
 
-	err = -EINVAL;
 	if (!old_prs) {
 		/*
 		 * Turning on partition root requires setting the
 		 * CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed
-		 * cannot be NULL.
+		 * cannot be empty.
 		 */
-		if (cpumask_empty(cs->cpus_allowed))
+		if (cpumask_empty(cs->cpus_allowed)) {
+			err = 1;
 			goto out;
+		}
 
 		err = update_flag(CS_CPU_EXCLUSIVE, cs, 1);
 		if (err)
@@ -2136,19 +2136,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		}
 	} else {
 		/*
-		 * Turning off partition root will clear the
-		 * CS_CPU_EXCLUSIVE bit.
+		 * Switching back to member is always allowed even if it
+		 * disables child partitions.
 		 */
-		if (is_prs_invalid(old_prs)) {
-			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
-			err = 0;
-			goto out;
-		}
+		update_parent_subparts_cpumask(cs, partcmd_disable, NULL,
+					       &tmpmask);
 
-		err = update_parent_subparts_cpumask(cs, partcmd_disable,
-						     NULL, &tmpmask);
-		if (err)
-			goto out;
+		/*
+		 * If there are child partitions, they will all become invalid.
+		 */
+		if (unlikely(cs->nr_subparts_cpus)) {
+			spin_lock_irq(&callback_lock);
+			cs->nr_subparts_cpus = 0;
+			cpumask_clear(cs->subparts_cpus);
+			compute_effective_cpumask(cs->effective_cpus, cs, parent);
+			spin_unlock_irq(&callback_lock);
+		}
 
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
@@ -2161,15 +2164,24 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 
 	rebuild_sched_domains_locked();
 out:
-	if (!err) {
-		spin_lock_irq(&callback_lock);
-		cs->partition_root_state = new_prs;
-		spin_unlock_irq(&callback_lock);
-		notify_partition_change(cs, old_prs);
-	}
+	/*
+	 * Make partition invalid if an error happen
+	 */
+	if (err)
+		new_prs = PRS_INVALID_ROOT;
+	spin_lock_irq(&callback_lock);
+	cs->partition_root_state = new_prs;
+	spin_unlock_irq(&callback_lock);
+	/*
+	 * Update child cpusets, if present.
+	 * Force update if switching back to member.
+	 */
+	if (!list_empty(&cs->css.children))
+		update_cpumasks_hier(cs, &tmpmask, !new_prs);
 
+	notify_partition_change(cs, old_prs);
 	free_cpumasks(NULL, &tmpmask);
-	return err;
+	return 0;
 }
 
 /*
@@ -3238,12 +3250,31 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus with tasks or its parent becomes invalid, we
-	 * have to transition it to the invalid state.
+	 * effective_cpus with tasks, we will have to invalidate child
+	 * partitions, if present, by setting nr_subparts_cpus to 0 to
+	 * reclaim their cpus.
 	 */
-	if (is_partition_valid(cs) &&
-	   ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
-	    is_partition_invalid(parent))) {
+	if (cs->nr_subparts_cpus && is_partition_valid(cs) &&
+	    cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) {
+		spin_lock_irq(&callback_lock);
+		cs->nr_subparts_cpus = 0;
+		cpumask_clear(cs->subparts_cpus);
+		spin_unlock_irq(&callback_lock);
+		compute_effective_cpumask(&new_cpus, cs, parent);
+	}
+
+	/*
+	 * Force the partition to become invalid if either one of
+	 * the following conditions hold:
+	 * 1) empty effective cpus but not valid empty partition.
+	 * 2) parent is invalid or doesn't grant any cpus to child
+	 *    partitions.
+	 */
+	if (is_partition_valid(cs) && (!parent->nr_subparts_cpus ||
+	   (cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)))) {
+		int old_prs;
+
+		update_parent_subparts_cpumask(cs, partcmd_disable, NULL, tmp);
 		if (cs->nr_subparts_cpus) {
 			spin_lock_irq(&callback_lock);
 			cs->nr_subparts_cpus = 0;
@@ -3252,40 +3283,25 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 			compute_effective_cpumask(&new_cpus, cs, parent);
 		}
 
-		/*
-		 * Force the partition to become invalid if either one of
-		 * the following conditions hold:
-		 * 1) empty effective cpus but not valid empty partition.
-		 * 2) parent is invalid or doesn't grant any cpus to child
-		 *    partitions.
-		 */
-		if (is_partition_invalid(parent) ||
-		    (cpumask_empty(&new_cpus) &&
-		     partition_is_populated(cs, NULL))) {
-			int old_prs;
-
-			update_parent_subparts_cpumask(cs, partcmd_disable,
-						       NULL, tmp);
-			old_prs = cs->partition_root_state;
-			if (!is_prs_invalid(old_prs)) {
-				spin_lock_irq(&callback_lock);
-				set_partition_invalid(cs);
-				spin_unlock_irq(&callback_lock);
-				notify_partition_change(cs, old_prs);
-			}
+		old_prs = cs->partition_root_state;
+		if (is_partition_valid(cs)) {
+			spin_lock_irq(&callback_lock);
+			set_partition_invalid(cs);
+			spin_unlock_irq(&callback_lock);
+			notify_partition_change(cs, old_prs);
 		}
 		cpuset_force_rebuild();
 	}
 
 	/*
 	 * On the other hand, an invalid partition root may be transitioned
-	 * back to a regular one or a partition root with no CPU allocated
-	 * from the parent may change to invalid.
+	 * back to a regular one.
 	 */
-	if (is_partition_valid(parent) && (is_partition_invalid(cs) ||
-	    !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
-	     update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
-		cpuset_force_rebuild();
+	else if (is_partition_valid(parent) && is_partition_invalid(cs)) {
+		update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp);
+		if (is_partition_valid(cs))
+			cpuset_force_rebuild();
+	}
 
 update_tasks:
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v10 5/8] cgroup/cpuset: Add a new isolated cpus.partition type
  2022-05-03 16:21 [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Waiman Long
                   ` (3 preceding siblings ...)
  2022-05-03 16:21 ` [PATCH v10 4/8] cgroup/cpuset: Relax constraints to partition & cpus changes Waiman Long
@ 2022-05-03 16:21 ` Waiman Long
  2022-05-03 16:21 ` [PATCH v10 6/8] cgroup/cpuset: Show invalid partition reason string Waiman Long
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2022-05-03 16:21 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Cpuset v1 uses the sched_load_balance control file to determine if load
balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
as its use may require disabling load balancing at cgroup root.

For workloads that require very low latency like DPDK, the latency
jitters caused by periodic load balancing may exceed the desired
latency limit.

When cpuset v2 is in use, the only way to avoid this latency cost is to
use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
the kernel boot, however, there is no way to add or remove CPUs from
this isolated set. For workloads that are more dynamic in nature, that
means users have to provision enough CPUs for the worst case situation
resulting in excess idle CPUs.

To address this issue for cpuset v2, a new cpuset.cpus.partition type
"isolated" is added which allows the creation of a cpuset partition
without load balancing. This will allow system administrators to
dynamically adjust the size of isolated partition to the current need
of the workload without rebooting the system.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 74 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 63 insertions(+), 11 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1f1d76013046..073db69ac8fc 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -177,11 +177,15 @@ struct cpuset {
  *
  *   0 - member (not a partition root)
  *   1 - partition root
+ *   2 - partition root without load balancing (isolated)
  *  -1 - invalid partition root
+ *  -2 - invalid isolated partition root
  */
 #define PRS_MEMBER		0
 #define PRS_ROOT		1
+#define PRS_ISOLATED		2
 #define PRS_INVALID_ROOT	-1
+#define PRS_INVALID_ISOLATED	-2
 
 static inline bool is_prs_invalid(int prs_state)
 {
@@ -278,7 +282,8 @@ static inline int is_partition_invalid(const struct cpuset *cs)
 
 static inline void set_partition_invalid(struct cpuset *cs)
 {
-	cs->partition_root_state = PRS_INVALID_ROOT;
+	if (is_partition_valid(cs))
+		cs->partition_root_state = -cs->partition_root_state;
 }
 
 /*
@@ -1369,17 +1374,19 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 
 	if (cmd == partcmd_update) {
 		/*
-		 * Check for possible transition between PRS_ROOT
-		 * and PRS_INVALID_ROOT.
+		 * Check for possible transition between valid and invalid
+		 * partition root.
 		 */
 		switch (cs->partition_root_state) {
 		case PRS_ROOT:
+		case PRS_ISOLATED:
 			if (part_error)
-				new_prs = PRS_INVALID_ROOT;
+				new_prs = -old_prs;
 			break;
 		case PRS_INVALID_ROOT:
+		case PRS_INVALID_ISOLATED:
 			if (!part_error)
-				new_prs = PRS_ROOT;
+				new_prs = -old_prs;
 			break;
 		}
 	}
@@ -1389,7 +1396,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 
 	/*
 	 * Transitioning from invalid to valid (partcmd_update) may require
-	 * setting CS_CPU_EXCLUSIVE.
+	 * setting CS_CPU_EXCLUSIVE and clearing CS_SCHED_LOAD_BALANCE later.
 	 */
 	if ((old_prs != new_prs) && is_prs_invalid(old_prs)) {
 		if (!is_cpu_exclusive(cs) &&
@@ -1430,8 +1437,17 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 	if (adding || deleting)
 		update_tasks_cpumask(parent);
 
+	/*
+	 * Set or clear CS_SCHED_LOAD_BALANCE when partcmd_update, if necessary.
+	 * rebuild_sched_domains_locked() may be called.
+	 */
+	if ((old_prs != new_prs) && (cmd == partcmd_update)) {
+		if (old_prs == PRS_ISOLATED)
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
+		else if (new_prs == PRS_ISOLATED)
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+	}
 	notify_partition_change(cs, old_prs);
-
 	return 0;
 }
 
@@ -1506,6 +1522,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 		if ((cp != cs) && old_prs) {
 			switch (parent->partition_root_state) {
 			case PRS_ROOT:
+			case PRS_ISOLATED:
 				update_parent = true;
 				break;
 
@@ -1515,7 +1532,8 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 				 * invalid, child partition roots become
 				 * invalid too.
 				 */
-				new_prs = PRS_INVALID_ROOT;
+				if (is_partition_valid(cp))
+					new_prs = -cp->partition_root_state;
 				break;
 			}
 		}
@@ -2097,6 +2115,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 static int update_prstate(struct cpuset *cs, int new_prs)
 {
 	int err = 0, old_prs = cs->partition_root_state;
+	bool sched_domain_rebuilt = false;
 	struct cpuset *parent = parent_cs(cs);
 	struct tmpmasks tmpmask;
 
@@ -2107,8 +2126,10 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	 * For a previously invalid partition root, leave it at being
 	 * invalid if new_prs is not "member".
 	 */
-	if (new_prs && is_prs_invalid(old_prs))
+	if (new_prs && is_prs_invalid(old_prs)) {
+		cs->partition_root_state = -new_prs;
 		return 0;
+	}
 
 	if (alloc_cpumasks(NULL, &tmpmask))
 		return -ENOMEM;
@@ -2134,6 +2155,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			goto out;
 		}
+
+		if (new_prs == PRS_ISOLATED) {
+			/*
+			 * Disable the load balance flag should not return an
+			 * error unless the system is running out of memory.
+			 */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+			sched_domain_rebuilt = true;
+		}
+	} else if (old_prs && new_prs) {
+		/*
+		 * A change in load balance state only, no change in cpumasks.
+		 */
+		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
+		sched_domain_rebuilt = true;
+		goto out;	/* Sched domain is rebuilt in update_flag() */
 	} else {
 		/*
 		 * Switching back to member is always allowed even if it
@@ -2155,6 +2192,12 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+
+		if (!is_sched_load_balance(cs)) {
+			/* Make sure load balance is on */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
+			sched_domain_rebuilt = true;
+		}
 	}
 
 	update_tasks_cpumask(parent);
@@ -2162,13 +2205,14 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	if (parent->child_ecpus_count)
 		update_sibling_cpumasks(parent, cs, &tmpmask);
 
-	rebuild_sched_domains_locked();
+	if (!sched_domain_rebuilt)
+		rebuild_sched_domains_locked();
 out:
 	/*
 	 * Make partition invalid if an error happen
 	 */
 	if (err)
-		new_prs = PRS_INVALID_ROOT;
+		new_prs = -new_prs;
 	spin_lock_irq(&callback_lock);
 	cs->partition_root_state = new_prs;
 	spin_unlock_irq(&callback_lock);
@@ -2680,12 +2724,18 @@ static int sched_partition_show(struct seq_file *seq, void *v)
 	case PRS_ROOT:
 		seq_puts(seq, "root\n");
 		break;
+	case PRS_ISOLATED:
+		seq_puts(seq, "isolated\n");
+		break;
 	case PRS_MEMBER:
 		seq_puts(seq, "member\n");
 		break;
 	case PRS_INVALID_ROOT:
 		seq_puts(seq, "root invalid\n");
 		break;
+	case PRS_INVALID_ISOLATED:
+		seq_puts(seq, "isolated invalid\n");
+		break;
 	}
 	return 0;
 }
@@ -2706,6 +2756,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		val = PRS_ROOT;
 	else if (!strcmp(buf, "member"))
 		val = PRS_MEMBER;
+	else if (!strcmp(buf, "isolated"))
+		val = PRS_ISOLATED;
 	else
 		return -EINVAL;
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v10 6/8] cgroup/cpuset: Show invalid partition reason string
  2022-05-03 16:21 [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Waiman Long
                   ` (4 preceding siblings ...)
  2022-05-03 16:21 ` [PATCH v10 5/8] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
@ 2022-05-03 16:21 ` Waiman Long
  2022-05-03 16:21 ` [PATCH v10 7/8] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2022-05-03 16:21 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

There are a number of different reasons which can cause a partition to
become invalid. A user seeing an invalid partition may not know exactly
why. To help user to get a better understanding of the underlying reason,
The cpuset.cpus.partition control file, when read, will now report the
reason why a partition become invalid. When a partition does become
invalid, reading the control file will show "root invalid (<reason>)"
where <reason> is a string that describes why the partition is invalid.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 91 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 74 insertions(+), 17 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 073db69ac8fc..90ee0e4d8d7e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -85,6 +85,30 @@ struct fmeter {
 	spinlock_t lock;	/* guards read or write of above */
 };
 
+/*
+ * Invalid partition error code
+ */
+enum prs_errcode {
+	PERR_NONE = 0,
+	PERR_INVCPUS,
+	PERR_INVPARENT,
+	PERR_NOTPART,
+	PERR_NOTEXCL,
+	PERR_NOCPUS,
+	PERR_HOTPLUG,
+	PERR_CPUSEMPTY,
+};
+
+static const char * const perr_strings[] = {
+	[PERR_INVCPUS]   = "Invalid cpu list in cpuset.cpus",
+	[PERR_INVPARENT] = "Parent is an invalid partition root",
+	[PERR_NOTPART]   = "Parent is not a partition root",
+	[PERR_NOTEXCL]   = "Cpu list in cpuset.cpus not exclusive",
+	[PERR_NOCPUS]    = "Parent unable to distribute cpu downstream",
+	[PERR_HOTPLUG]   = "No cpu available due to hotplug",
+	[PERR_CPUSEMPTY] = "cpuset.cpus is empty",
+};
+
 struct cpuset {
 	struct cgroup_subsys_state css;
 
@@ -168,6 +192,9 @@ struct cpuset {
 	int use_parent_ecpus;
 	int child_ecpus_count;
 
+	/* Invalid partition error code, not lock protected */
+	enum prs_errcode prs_err;
+
 	/* Handle for cpuset.cpus.partition */
 	struct cgroup_file partition_file;
 };
@@ -294,6 +321,10 @@ static inline void notify_partition_change(struct cpuset *cs, int old_prs)
 	if (old_prs == cs->partition_root_state)
 		return;
 	cgroup_file_notify(&cs->partition_file);
+
+	/* Reset prs_err if not invalid */
+	if (is_partition_valid(cs))
+		WRITE_ONCE(cs->prs_err, PERR_NONE);
 }
 
 static struct cpuset top_cpuset = {
@@ -1231,7 +1262,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
  * @cmd:     Partition root state change command
  * @newmask: Optional new cpumask for partcmd_update
  * @tmp:     Temporary addmask and delmask
- * Return:   0 or -1 (error)
+ * Return:   0 or a partition root state error code
  *
  * For partcmd_enable, the cpuset is being transformed from a non-partition
  * root to a partition root. The cpus_allowed mask of the given cpuset will
@@ -1264,7 +1295,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
 	int old_prs, new_prs;
-	bool part_error = false;	/* Partition error? */
+	int part_error = PERR_NONE;	/* Partition error? */
 
 	percpu_rwsem_assert_held(&cpuset_rwsem);
 
@@ -1273,10 +1304,13 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 	 * The new cpumask, if present, or the current cpus_allowed must
 	 * not be empty.
 	 */
-	if (!is_partition_valid(parent) ||
-	   (newmask && cpumask_empty(newmask)) ||
+	if (!is_partition_valid(parent)) {
+		return is_partition_invalid(parent)
+		       ? PERR_INVPARENT : PERR_NOTPART;
+	}
+	if ((newmask && cpumask_empty(newmask)) ||
 	   (!newmask && cpumask_empty(cs->cpus_allowed)))
-		return -1;
+		return PERR_CPUSEMPTY;
 
 	adding = deleting = false;
 	old_prs = new_prs = cs->partition_root_state;
@@ -1286,7 +1320,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 		 * doesn't overlap parent's cpus_allowed.
 		 */
 		if (!cpumask_intersects(cs->cpus_allowed, parent->cpus_allowed))
-			return -1;
+			return PERR_INVCPUS;
 
 		/*
 		 * A parent can be left with no CPU as long as there is no
@@ -1294,7 +1328,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 		 */
 		if (partition_is_populated(parent, cs) &&
 		   !cpumask_intersects(cs->cpus_allowed, parent->effective_cpus))
-			return -1;
+			return PERR_NOCPUS;
 
 		cpumask_copy(tmp->addmask, cs->cpus_allowed);
 		adding = true;
@@ -1330,7 +1364,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 		if (adding && partition_is_populated(parent, cs) &&
 		    cpumask_subset(parent->effective_cpus, tmp->addmask) &&
 		    !cpumask_intersects(tmp->delmask, cpu_active_mask)) {
-			part_error = true;
+			part_error = PERR_NOCPUS;
 			adding = false;
 			deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
 					       parent->subparts_cpus);
@@ -1362,7 +1396,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 		    (adding &&
 		     cpumask_subset(parent->effective_cpus, tmp->addmask) &&
 		     partition_is_populated(parent, cs))) {
-			part_error = true;
+			part_error = PERR_NOCPUS;
 			adding = false;
 		}
 
@@ -1371,6 +1405,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 			deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
 					       parent->subparts_cpus);
 	}
+	if (part_error)
+		WRITE_ONCE(cs->prs_err, part_error);
 
 	if (cmd == partcmd_update) {
 		/*
@@ -1401,7 +1437,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 	if ((old_prs != new_prs) && is_prs_invalid(old_prs)) {
 		if (!is_cpu_exclusive(cs) &&
 		    (update_flag(CS_CPU_EXCLUSIVE, cs, 1) < 0))
-			return -1;
+			return PERR_NOTEXCL;
 	}
 
 	/*
@@ -1534,6 +1570,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 				 */
 				if (is_partition_valid(cp))
 					new_prs = -cp->partition_root_state;
+				WRITE_ONCE(cp->prs_err,
+					   is_partition_invalid(parent)
+					   ? PERR_INVPARENT : PERR_NOTPART);
 				break;
 			}
 		}
@@ -2108,13 +2147,13 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
  * update_prstate - update partition_root_state
  * @cs: the cpuset to update
  * @new_prs: new partition root state
- * Return: 0 if successful, < 0 if error
+ * Return: 0 if successful, != 0 if error
  *
  * Call with cpuset_rwsem held.
  */
 static int update_prstate(struct cpuset *cs, int new_prs)
 {
-	int err = 0, old_prs = cs->partition_root_state;
+	int err = PERR_NONE, old_prs = cs->partition_root_state;
 	bool sched_domain_rebuilt = false;
 	struct cpuset *parent = parent_cs(cs);
 	struct tmpmasks tmpmask;
@@ -2141,13 +2180,15 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		 * cannot be empty.
 		 */
 		if (cpumask_empty(cs->cpus_allowed)) {
-			err = 1;
+			err = PERR_CPUSEMPTY;
 			goto out;
 		}
 
 		err = update_flag(CS_CPU_EXCLUSIVE, cs, 1);
-		if (err)
+		if (err) {
+			err = PERR_NOTEXCL;
 			goto out;
+		}
 
 		err = update_parent_subparts_cpumask(cs, partcmd_enable,
 						     NULL, &tmpmask);
@@ -2719,6 +2760,7 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 static int sched_partition_show(struct seq_file *seq, void *v)
 {
 	struct cpuset *cs = css_cs(seq_css(seq));
+	const char *err, *type = NULL;
 
 	switch (cs->partition_root_state) {
 	case PRS_ROOT:
@@ -2731,9 +2773,17 @@ static int sched_partition_show(struct seq_file *seq, void *v)
 		seq_puts(seq, "member\n");
 		break;
 	case PRS_INVALID_ROOT:
-		seq_puts(seq, "root invalid\n");
-		break;
+		type = "root";
+		fallthrough;
 	case PRS_INVALID_ISOLATED:
+		if (!type)
+			type = "isolated";
+		err = perr_strings[READ_ONCE(cs->prs_err)];
+		if (err)
+			seq_printf(seq, "%s invalid (%s)\n", type, err);
+		else
+			seq_printf(seq, "%s invalid\n", type);
+		break;
 		seq_puts(seq, "isolated invalid\n");
 		break;
 	}
@@ -3324,7 +3374,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	 */
 	if (is_partition_valid(cs) && (!parent->nr_subparts_cpus ||
 	   (cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)))) {
-		int old_prs;
+		int old_prs, parent_prs;
 
 		update_parent_subparts_cpumask(cs, partcmd_disable, NULL, tmp);
 		if (cs->nr_subparts_cpus) {
@@ -3336,10 +3386,17 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		}
 
 		old_prs = cs->partition_root_state;
+		parent_prs = parent->partition_root_state;
 		if (is_partition_valid(cs)) {
 			spin_lock_irq(&callback_lock);
 			set_partition_invalid(cs);
 			spin_unlock_irq(&callback_lock);
+			if (is_prs_invalid(parent_prs))
+				WRITE_ONCE(cs->prs_err, PERR_INVPARENT);
+			else if (!parent_prs)
+				WRITE_ONCE(cs->prs_err, PERR_NOTPART);
+			else
+				WRITE_ONCE(cs->prs_err, PERR_HOTPLUG);
 			notify_partition_change(cs, old_prs);
 		}
 		cpuset_force_rebuild();
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v10 7/8] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2022-05-03 16:21 [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Waiman Long
                   ` (5 preceding siblings ...)
  2022-05-03 16:21 ` [PATCH v10 6/8] cgroup/cpuset: Show invalid partition reason string Waiman Long
@ 2022-05-03 16:21 ` Waiman Long
  2022-05-04 11:25   ` Michal Koutný
  2022-05-03 16:21 ` [PATCH v10 8/8] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long
  2022-05-04 11:28 ` [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Michal Koutný
  8 siblings, 1 reply; 15+ messages in thread
From: Waiman Long @ 2022-05-03 16:21 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced
"isolated" cpuset partition type as well as other changes made in other
cpuset patches.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 145 +++++++++++++-----------
 1 file changed, 79 insertions(+), 66 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69d7a6983f78..94e1e3771830 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2110,74 +2110,87 @@ Cpuset Interface Files
 	It accepts only the following input values when written to.
 
 	  ========	================================
-	  "root"	a partition root
-	  "member"	a non-root member of a partition
+	  "member"	Non-root member of a partition
+	  "root"	Partition root
+	  "isolated"	Partition root without load balancing
 	  ========	================================
 
-	When set to be a partition root, the current cgroup is the
-	root of a new partition or scheduling domain that comprises
-	itself and all its descendants except those that are separate
-	partition roots themselves and their descendants.  The root
-	cgroup is always a partition root.
-
-	There are constraints on where a partition root can be set.
-	It can only be set in a cgroup if all the following conditions
-	are true.
-
-	1) The "cpuset.cpus" is not empty and the list of CPUs are
-	   exclusive, i.e. they are not shared by any of its siblings.
-	2) The parent cgroup is a partition root.
-	3) The "cpuset.cpus" is also a proper subset of the parent's
-	   "cpuset.cpus.effective".
-	4) There is no child cgroups with cpuset enabled.  This is for
-	   eliminating corner cases that have to be handled if such a
-	   condition is allowed.
-
-	Setting it to partition root will take the CPUs away from the
-	effective CPUs of the parent cgroup.  Once it is set, this
-	file cannot be reverted back to "member" if there are any child
-	cgroups with cpuset enabled.
-
-	A parent partition cannot distribute all its CPUs to its
-	child partitions.  There must be at least one cpu left in the
-	parent partition.
-
-	Once becoming a partition root, changes to "cpuset.cpus" is
-	generally allowed as long as the first condition above is true,
-	the change will not take away all the CPUs from the parent
-	partition and the new "cpuset.cpus" value is a superset of its
-	children's "cpuset.cpus" values.
-
-	Sometimes, external factors like changes to ancestors'
-	"cpuset.cpus" or cpu hotplug can cause the state of the partition
-	root to change.  On read, the "cpuset.sched.partition" file
-	can show the following values.
-
-	  ==============	==============================
-	  "member"		Non-root member of a partition
-	  "root"		Partition root
-	  "root invalid"	Invalid partition root
-	  ==============	==============================
-
-	It is a partition root if the first 2 partition root conditions
-	above are true and at least one CPU from "cpuset.cpus" is
-	granted by the parent cgroup.
-
-	A partition root can become invalid if none of CPUs requested
-	in "cpuset.cpus" can be granted by the parent cgroup or the
-	parent cgroup is no longer a partition root itself.  In this
-	case, it is not a real partition even though the restriction
-	of the first partition root condition above will still apply.
-	The cpu affinity of all the tasks in the cgroup will then be
-	associated with CPUs in the nearest ancestor partition.
-
-	An invalid partition root can be transitioned back to a
-	real partition root if at least one of the requested CPUs
-	can now be granted by its parent.  In this case, the cpu
-	affinity of all the tasks in the formerly invalid partition
-	will be associated to the CPUs of the newly formed partition.
-	Changing the partition state of an invalid partition root to
-	"member" is always allowed even if child cpusets are present.
+	The root cgroup is always a partition root and its state
+	cannot be changed.  All other non-root cgroups start out as
+	"member".
+
+	When set to "root", the current cgroup is the root of a new
+	partition or scheduling domain that comprises itself and all
+	its descendants except those that are separate partition roots
+	themselves and their descendants.
+
+	When set to "isolated", the CPUs in that partition root will
+	be in an isolated state without any load balancing from the
+	scheduler.  Tasks placed in such a partition with multiple
+	CPUs should be carefully distributed and bound to each of the
+	individual CPUs for optimal performance.
+
+	The value shown in "cpuset.cpus.effective" of a partition root is
+	the CPUs that the parent partition root can dedicate to the new
+	partition root.  They are subtracted from "cpuset.cpus.effective"
+	of the parent and may be different from "cpuset.cpus"
+
+	A partition root ("root" or "isolated") can be in one of the
+	two possible states - valid or invalid.  An invalid partition
+	root is in a degraded state where some state information may
+	be retained, but behaves more like a "member".
+
+	All possible state transitions among "member", "root" and
+	"isolated" are allowed.
+
+	On read, the "cpuset.cpus.partition" file can show the following
+	values.
+
+	  ======================	==============================
+	  "member"			Non-root member of a partition
+	  "root"			Partition root
+	  "isolated"			Partition root without load balancing
+	  "root invalid (<reason>)"	Invalid partition root
+	  "isolated invalid (<reason>)"	Invalid isolated partition root
+	  ======================	==============================
+
+	In the case of an invalid partition root, a descriptive string on
+	why the partition is invalid is included within parentheses.
+
+	For a partition root to become valid, the following conditions
+	must be met.
+
+	1) The "cpuset.cpus" is exclusive, i.e. they are not shared by
+	   any of its siblings (exclusivity rule).
+	2) The parent cgroup is a valid partition root.
+	3) The "cpuset.cpus" is not empty and must contain at least
+	   one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
+        4) The "cpuset.cpus.effective" must be a subset of "cpuset.cpus"
+           and cannot be empty unless there is no task associated with
+           this partition.
+
+	External events like hotplug or changes to "cpuset.cpus" can
+	cause a valid partition root to become invalid and vice versa.
+	Note that a task cannot be moved to a cgroup with empty
+	"cpuset.cpus.effective".
+
+        For a valid partition root or an invalid partition root with
+        the exclusivity rule enabled, changes made to "cpuset.cpus"
+        that violate the exclusivity rule will not be allowed.
+
+	A valid non-root parent partition may distribute out all its CPUs
+	to its child partitions when there is no task associated with it.
+
+        Care must be taken to change a valid partition root to "member"
+        as all its child partitions, if present, will become invalid.
+
+	Poll and inotify events are triggered whenever the state of
+	"cpuset.cpus.partition" changes.  That includes changes caused
+	by write to "cpuset.cpus.partition", cpu hotplug or other
+	changes that modify the validity status of the partition.
+	This will allow user space agents to monitor unexpected changes
+	to "cpuset.cpus.partition" without the need to do continuous
+	polling.
 
 
 Device controller
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v10 8/8] kselftest/cgroup: Add cpuset v2 partition root state test
  2022-05-03 16:21 [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Waiman Long
                   ` (6 preceding siblings ...)
  2022-05-03 16:21 ` [PATCH v10 7/8] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
@ 2022-05-03 16:21 ` Waiman Long
  2022-05-04 11:28 ` [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Michal Koutný
  8 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2022-05-03 16:21 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Add a test script test_cpuset_prs.sh with a helper program wait_inotify
for exercising the cpuset v2 partition root state code.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/testing/selftests/cgroup/Makefile       |   5 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 674 ++++++++++++++++++
 tools/testing/selftests/cgroup/wait_inotify.c |  87 +++
 3 files changed, 764 insertions(+), 2 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
 create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 745fe25fa0b9..01687418b92f 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -1,10 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -Wall -pthread
 
-all:
+all: ${HELPER_PROGS}
 
 TEST_FILES     := with_stress.sh
-TEST_PROGS     := test_stress.sh
+TEST_PROGS     := test_stress.sh test_cpuset_prs.sh
+TEST_GEN_FILES := wait_inotify
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_kmem
 TEST_GEN_PROGS += test_core
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
new file mode 100755
index 000000000000..3ee12f8dc752
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -0,0 +1,674 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test for cpuset v2 partition root state (PRS)
+#
+# The sched verbose flag is set, if available, so that the console log
+# can be examined for the correct setting of scheduling domain.
+#
+
+skip_test() {
+	echo "$1"
+	echo "Test SKIPPED"
+	exit 0
+}
+
+[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
+
+# Set sched verbose flag, if available
+[[ -d /sys/kernel/debug/sched ]] && echo Y > /sys/kernel/debug/sched/verbose
+
+# Get wait_inotify location
+WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
+
+# Find cgroup v2 mount point
+CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
+
+CPUS=$(lscpu | grep "^CPU(s)" | sed -e "s/.*:[[:space:]]*//")
+[[ $CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!"
+
+# Set verbose flag and delay factor
+PROG=$1
+VERBOSE=
+DELAY_FACTOR=1
+while [[ "$1" = -* ]]
+do
+	case "$1" in
+		-v) VERBOSE=1
+		    break
+		    ;;
+		-d) DELAY_FACTOR=$2
+		    shift
+		    break
+		    ;;
+		*)  echo "Usage: $PROG [-v] [-d <delay-factor>"
+		    exit
+		    ;;
+	esac
+	shift
+done
+
+cd $CGROUP2
+echo +cpuset > cgroup.subtree_control
+[[ -d test ]] || mkdir test
+cd test
+
+# Pause in ms
+pause()
+{
+	DELAY=$1
+	LOOP=0
+	while [[ $LOOP -lt $DELAY_FACTOR ]]
+	do
+		sleep $DELAY
+		((LOOP++))
+	done
+	return 0
+}
+
+console_msg()
+{
+	MSG=$1
+	echo "$MSG"
+	echo "" > /dev/console
+	echo "$MSG" > /dev/console
+	pause 0.01
+}
+
+test_partition()
+{
+	EXPECTED_VAL=$1
+	echo $EXPECTED_VAL > cpuset.cpus.partition
+	[[ $? -eq 0 ]] || exit 1
+	ACTUAL_VAL=$(cat cpuset.cpus.partition)
+	[[ $ACTUAL_VAL != $EXPECTED_VAL ]] && {
+		echo "cpuset.cpus.partition: expect $EXPECTED_VAL, found $EXPECTED_VAL"
+		echo "Test FAILED"
+		exit 1
+	}
+}
+
+test_effective_cpus()
+{
+	EXPECTED_VAL=$1
+	ACTUAL_VAL=$(cat cpuset.cpus.effective)
+	[[ "$ACTUAL_VAL" != "$EXPECTED_VAL" ]] && {
+		echo "cpuset.cpus.effective: expect '$EXPECTED_VAL', found '$EXPECTED_VAL'"
+		echo "Test FAILED"
+		exit 1
+	}
+}
+
+# Adding current process to cgroup.procs as a test
+test_add_proc()
+{
+	OUTSTR="$1"
+	ERRMSG=$((echo $$ > cgroup.procs) |& cat)
+	echo $ERRMSG | grep -q "$OUTSTR"
+	[[ $? -ne 0 ]] && {
+		echo "cgroup.procs: expect '$OUTSTR', got '$ERRMSG'"
+		echo "Test FAILED"
+		exit 1
+	}
+	echo $$ > $CGROUP2/cgroup.procs	# Move out the task
+}
+
+#
+# Testing the new "isolated" partition root type
+#
+test_isolated()
+{
+	echo 2-3 > cpuset.cpus
+	TYPE=$(cat cpuset.cpus.partition)
+	[[ $TYPE = member ]] || echo member > cpuset.cpus.partition
+
+	console_msg "Change from member to root"
+	test_partition root
+
+	console_msg "Change from root to isolated"
+	test_partition isolated
+
+	console_msg "Change from isolated to member"
+	test_partition member
+
+	console_msg "Change from member to isolated"
+	test_partition isolated
+
+	console_msg "Change from isolated to root"
+	test_partition root
+
+	console_msg "Change from root to member"
+	test_partition member
+
+	#
+	# Testing partition root with no cpu
+	#
+	console_msg "Distribute all cpus to child partition"
+	echo +cpuset > cgroup.subtree_control
+	test_partition root
+
+	mkdir A1
+	cd A1
+	echo 2-3 > cpuset.cpus
+	test_partition root
+	test_effective_cpus 2-3
+	cd ..
+	test_effective_cpus ""
+
+	console_msg "Moving task to partition test"
+	test_add_proc "No space left"
+	cd A1
+	test_add_proc ""
+	cd ..
+
+	console_msg "Shrink and expand child partition"
+	cd A1
+	echo 2 > cpuset.cpus
+	cd ..
+	test_effective_cpus 3
+	cd A1
+	echo 2-3 > cpuset.cpus
+	cd ..
+	test_effective_cpus ""
+
+	# Cleaning up
+	console_msg "Cleaning up"
+	echo $$ > $CGROUP2/cgroup.procs
+	[[ -d A1 ]] && rmdir A1
+}
+
+#
+# Cpuset controller state transition test matrix.
+#
+# Cgroup test hierarchy
+#
+# test -- A1 -- A2 -- A3
+#      \- B1
+#
+#  P<v> = set cpus.partition (0:member, 1:root, 2:isolated, -1:root invalid)
+#  C<l> = add cpu-list
+#  S<p> = use prefix in subtree_control
+#  T    = put a task into cgroup
+#  O<c>-<v> = Write <v> to CPU online file of <c>
+#
+SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
+TEST_MATRIX=(
+	# test  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+	# ----  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+	"  S+    C0-1     .      .    C2-3    S+    C4-5     .      .     0 A2:0-1"
+	"  S+    C0-1     .      .    C2-3    P1      .      .      .     0 "
+	"  S+    C0-1     .      .    C2-3   P1:S+ C0-1:P1   .      .     0 "
+	"  S+    C0-1     .      .    C2-3   P1:S+  C1:P1    .      .     0 "
+	"  S+   C0-1:S+   .      .    C2-3     .      .      .     P1     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+     C1      .      .     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+    C1:P1    .      .     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+    C1:P1    .     P1     0 "
+	"  S+   C0-1:P1   .      .    C2-3   C4-5     .      .      .     0 A1:4-5"
+	"  S+   C0-1:P1   .      .    C2-3  S+:C4-5   .      .      .     0 A1:4-5"
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .     C2     0 "
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .    C4-5    0 B1:4-5"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .      .      .      .      .     0 A1:0-1,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     C1-3    .      .      .     0 A1:1,A2:2-3"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      .      .      .     0 A1:,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      P0     .      .     0 A1:3,A2:3 A1:P1,A2:P0"
+	"  S+ C2-3:P1:S+  C2:P1  .      .     C2-4    .      .      .     0 A1:3-4,A2:2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      .      .     C0-2   0 A1:,B1:0-2 A1:P1,A2:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     C2-3    .      .      .     0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+	# CPU offlining cases:
+	"  S+    C0-1     .      .    C2-3    S+    C4-5     .     O2-0   0 A1:0-1,B1:3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O2-0    .      .      .     0 A1:0-1,A2:3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O2-0   O2-1    .      .     0 A1:0-1,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O1-0    .      .      .     0 A1:0,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O1-0   O1-1    .      .     0 A1:0-1,A2:2-3"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O3-0   O3-1    .      .     0 A1:2,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P2  .      .     O3-0   O3-1    .      .     0 A1:2,A2:3 A1:P1,A2:P2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O2-0   O2-1    .      .     0 A1:2,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P2  .      .     O2-0   O2-1    .      .     0 A1:2,A2:3 A1:P1,A2:P2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O2-0    .      .      .     0 A1:,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O3-0    .      .      .     0 A1:2,A2: A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .    T:O2-0   .      .      .     0 A1:3,A2:3 A1:P1,A2:P-1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .      .    T:O3-0   .      .     0 A1:2,A2:2 A1:P1,A2:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O1-0    .      .      .     0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O2-0    .      .      .     0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O3-0    .      .      .     0 A1:1,A2:2,A3: A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0   .      .      .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .    T:O2-0   .      .     0 A1:1,A2:3,A3:3 A1:P1,A2:P1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .      .    T:O3-0   .     0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O1-1    .      .     0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .    T:O2-0  O2-1    .     0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .      .    T:O3-0  O3-1   0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O2-0   O1-1    .     0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O2-0   O2-1    .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+
+	# test  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+	# ----  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+	#
+	# Incorrect change to cpuset.cpus invalidates partition root
+	#
+	# Adding CPUs to partition root that are not in parent's
+	# cpuset.cpus is allowed, but those extra CPUs are ignored.
+	"  S+ C2-3:P1:S+ C3:P1   .      .      .     C2-4    .      .     0 A1:,A2:2-3 A1:P1,A2:P1"
+
+	# Taking away all CPUs from parent or itself if there are tasks
+	# will make the partition invalid.
+	"  S+ C2-3:P1:S+  C3:P1  .      .      T     C2-3    .      .     0 A1:2-3,A2:2-3 A1:P1,A2:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:C2-3   .      .      .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    . T:C2-3:C1-3 .      .      .     0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+	# Changing a partition root to member makes child partitions invalid
+	"  S+ C2-3:P1:S+  C3:P1  .      .      P0     .      .      .     0 A1:2-3,A2:3 A1:P0,A2:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .     C2-3    P0     .      .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P0,A3:P-1"
+
+	# cpuset.cpus can contains cpus not in parent's cpuset.cpus as long
+	# as they overlap.
+	"  S+ C2-3:P1:S+  .      .      .      .   C3-4:P1   .      .     0 A1:2,A2:3 A1:P1,A2:P1"
+
+	# Deletion of CPUs distributed to child cgroup is allowed.
+	"  S+ C0-1:P1:S+ C1      .    C2-3   C4-5     .      .      .     0 A1:4-5,A2:4-5"
+
+	# To become a valid partition root, cpuset.cpus must overlap parent's
+	# cpuset.cpus.
+	"  S+   C0-1:P1   .      .    C2-3    S+   C4-5:P1   .      .     0 A1:0-1,A2:0-1 A1:P1,A2:P-1"
+
+	# Enabling partition with child cpusets is allowed
+	"  S+   C0-1:S+  C1      .    C2-3    P1      .      .      .     0 A1:0-1,A2:1 A1:P1"
+
+	# A partition root with non-partition root parent is invalid, but it
+	# can be made valid if its parent becomes a partition root too.
+	"  S+   C0-1:S+  C1      .    C2-3     .      P2     .      .     0 A1:0-1,A2:1 A1:P0,A2:P-2"
+	"  S+   C0-1:S+ C1:P2    .    C2-3     P1     .      .      .     0 A1:0,A2:1 A1:P1,A2:P2"
+
+	# test  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+	# ----  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+	# Failure cases:
+
+	# Any change to cpuset.cpus of a partition root must be exclusive.
+	"  S+   C0-1:P1   .      .    C2-3   C0-2     .      .      .     1 "
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .     C1     1 "
+	"  S+ C2-3:P1:S+  C2:P1  .     C1    C1-3     .      .      .     1 "
+
+	# A task cannot be added to a partition with no cpu
+	"  S+ C2-3:P1:S+  C3:P1  .      .    O2-0:T   .      .      .     1 A1:,A2:3 A1:P1,A2:P1"
+)
+
+#
+# Write to the cpu online file
+#  $1 - <c>-<v> where <c> = cpu number, <v> value to be written
+#
+write_cpu_online()
+{
+	CPU=${1%-*}
+	VAL=${1#*-}
+	CPUFILE=//sys/devices/system/cpu/cpu${CPU}/online
+	if [[ $VAL -eq 0 ]]
+	then
+		OFFLINE_CPUS="$OFFLINE_CPUS $CPU"
+	else
+		[[ -n "$OFFLINE_CPUS" ]] && {
+			OFFLINE_CPUS=$(echo $CPU $CPU $OFFLINE_CPUS | fmt -1 |\
+					sort | uniq -u)
+		}
+	fi
+	echo $VAL > $CPUFILE
+	pause 0.01
+}
+
+#
+# Set controller state
+#  $1 - cgroup directory
+#  $2 - state
+#  $3 - showerr
+#
+# The presence of ":" in state means transition from one to the next.
+#
+set_ctrl_state()
+{
+	TMPMSG=/tmp/.msg_$$
+	CGRP=$1
+	STATE=$2
+	SHOWERR=${3}${VERBOSE}
+	CTRL=${CTRL:=$CONTROLLER}
+	HASERR=0
+	REDIRECT="2> $TMPMSG"
+	[[ -z "$STATE" || "$STATE" = '.' ]] && return 0
+
+	rm -f $TMPMSG
+	for CMD in $(echo $STATE | sed -e "s/:/ /g")
+	do
+		TFILE=$CGRP/cgroup.procs
+		SFILE=$CGRP/cgroup.subtree_control
+		PFILE=$CGRP/cpuset.cpus.partition
+		CFILE=$CGRP/cpuset.cpus
+		S=$(expr substr $CMD 1 1)
+		if [[ $S = S ]]
+		then
+			PREFIX=${CMD#?}
+			COMM="echo ${PREFIX}${CTRL} > $SFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = C ]]
+		then
+			CPUS=${CMD#?}
+			COMM="echo $CPUS > $CFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = P ]]
+		then
+			VAL=${CMD#?}
+			case $VAL in
+			0)  VAL=member
+			    ;;
+			1)  VAL=root
+			    ;;
+			2)  VAL=isolated
+			    ;;
+			*)
+			    echo "Invalid partition state - $VAL"
+			    exit 1
+			    ;;
+			esac
+			COMM="echo $VAL > $PFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = O ]]
+		then
+			VAL=${CMD#?}
+			write_cpu_online $VAL
+		elif [[ $S = T ]]
+		then
+			COMM="echo 0 > $TFILE"
+			eval $COMM $REDIRECT
+		fi
+		RET=$?
+		[[ $RET -ne 0 ]] && {
+			[[ -n "$SHOWERR" ]] && {
+				echo "$COMM"
+				cat $TMPMSG
+			}
+			HASERR=1
+		}
+		pause 0.01
+		rm -f $TMPMSG
+	done
+	return $HASERR
+}
+
+set_ctrl_state_noerr()
+{
+	CGRP=$1
+	STATE=$2
+	[[ -d $CGRP ]] || mkdir $CGRP
+	set_ctrl_state $CGRP $STATE 1
+	[[ $? -ne 0 ]] && {
+		echo "ERROR: Failed to set $2 to cgroup $1!"
+		exit 1
+	}
+}
+
+online_cpus()
+{
+	[[ -n "OFFLINE_CPUS" ]] && {
+		for C in $OFFLINE_CPUS
+		do
+			write_cpu_online ${C}-1
+		done
+	}
+}
+
+#
+# Return 1 if the list of effective cpus isn't the same as the initial list.
+#
+reset_cgroup_states()
+{
+	echo 0 > $CGROUP2/cgroup.procs
+	online_cpus
+	rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1
+	set_ctrl_state . S-
+	pause 0.01
+}
+
+dump_states()
+{
+	for DIR in A1 A1/A2 A1/A2/A3 B1
+	do
+		ECPUS=$DIR/cpuset.cpus.effective
+		PRS=$DIR/cpuset.cpus.partition
+		[[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)"
+		[[ -e $PRS   ]] && echo "$PRS: $(cat $PRS)"
+	done
+}
+
+#
+# Check effective cpus
+# $1 - check string, format: <cgroup>:<cpu-list>[,<cgroup>:<cpu-list>]*
+#
+check_effective_cpus()
+{
+	CHK_STR=$1
+	for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+	do
+		set -- $(echo $CHK | sed -e "s/:/ /g")
+		CGRP=$1
+		CPUS=$2
+		[[ $CGRP = A2 ]] && CGRP=A1/A2
+		[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+		FILE=$CGRP/cpuset.cpus.effective
+		[[ -e $FILE ]] || return 1
+		[[ $CPUS = $(cat $FILE) ]] || return 1
+	done
+}
+
+#
+# Check cgroup states
+#  $1 - check string, format: <cgroup>:<state>[,<cgroup>:<state>]*
+#
+check_cgroup_states()
+{
+	CHK_STR=$1
+	for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+	do
+		set -- $(echo $CHK | sed -e "s/:/ /g")
+		CGRP=$1
+		STATE=$2
+		FILE=
+		EVAL=$(expr substr $STATE 2 2)
+		[[ $CGRP = A2 ]] && CGRP=A1/A2
+		[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+
+		case $STATE in
+			P*) FILE=$CGRP/cpuset.cpus.partition
+			    ;;
+			*)  echo "Unknown state: $STATE!"
+			    exit 1
+			    ;;
+		esac
+		VAL=$(cat $FILE)
+
+		case "$VAL" in
+			member) VAL=0
+				;;
+			root)	VAL=1
+				;;
+			isolated)
+				VAL=2
+				;;
+			"root invalid"*)
+				VAL=-1
+				;;
+			"isolated invalid"*)
+				VAL=-2
+				;;
+		esac
+		[[ $EVAL != $VAL ]] && return 1
+	done
+	return 0
+}
+
+#
+# Run cpuset state transition test
+#  $1 - test matrix name
+#
+# This test is somewhat fragile as delays (sleep x) are added in various
+# places to make sure state changes are fully propagated before the next
+# action. These delays may need to be adjusted if running in a slower machine.
+#
+run_state_test()
+{
+	TEST=$1
+	CONTROLLER=cpuset
+	CPULIST=0-6
+	I=0
+	eval CNT="\${#$TEST[@]}"
+
+	reset_cgroup_states
+	echo $CPULIST > cpuset.cpus
+	echo root > cpuset.cpus.partition
+	console_msg "Running state transition test ..."
+
+	while [[ $I -lt $CNT ]]
+	do
+		echo "Running test $I ..." > /dev/console
+		eval set -- "\${$TEST[$I]}"
+		ROOT=$1
+		OLD_A1=$2
+		OLD_A2=$3
+		OLD_A3=$4
+		OLD_B1=$5
+		NEW_A1=$6
+		NEW_A2=$7
+		NEW_A3=$8
+		NEW_B1=$9
+		RESULT=${10}
+		ECPUS=${11}
+		STATES=${12}
+
+		set_ctrl_state_noerr .        $ROOT
+		set_ctrl_state_noerr A1       $OLD_A1
+		set_ctrl_state_noerr A1/A2    $OLD_A2
+		set_ctrl_state_noerr A1/A2/A3 $OLD_A3
+		set_ctrl_state_noerr B1       $OLD_B1
+		RETVAL=0
+		set_ctrl_state A1       $NEW_A1; ((RETVAL += $?))
+		set_ctrl_state A1/A2    $NEW_A2; ((RETVAL += $?))
+		set_ctrl_state A1/A2/A3 $NEW_A3; ((RETVAL += $?))
+		set_ctrl_state B1       $NEW_B1; ((RETVAL += $?))
+
+		[[ $RETVAL -ne $RESULT ]] && {
+			echo "Test $TEST[$I] failed result check!"
+			eval echo \"\${$TEST[$I]}\"
+			dump_states
+			online_cpus
+			exit 1
+		}
+
+		[[ -n "$ECPUS" && "$ECPUS" != . ]] && {
+			check_effective_cpus $ECPUS
+			[[ $? -ne 0 ]] && {
+				echo "Test $TEST[$I] failed effective CPU check!"
+				eval echo \"\${$TEST[$I]}\"
+				echo
+				dump_states
+				online_cpus
+				exit 1
+			}
+		}
+
+		[[ -n "$STATES" ]] && {
+			check_cgroup_states $STATES
+			[[ $? -ne 0 ]] && {
+				echo "FAILED: Test $TEST[$I] failed states check!"
+				eval echo \"\${$TEST[$I]}\"
+				echo
+				dump_states
+				online_cpus
+				exit 1
+			}
+		}
+
+		reset_cgroup_states
+		#
+		# Check to see if effective cpu list changes
+		#
+		pause 0.05
+		NEWLIST=$(cat cpuset.cpus.effective)
+		[[ $NEWLIST != $CPULIST ]] && {
+			echo "Effective cpus changed to $NEWLIST after test $I!"
+			exit 1
+		}
+		[[ -n "$VERBOSE" ]] && echo "Test $I done."
+		((I++))
+	done
+	echo "All $I tests of $TEST PASSED."
+
+	echo member > cpuset.cpus.partition
+}
+
+#
+# Wait for inotify event for the given file and read it
+# $1: cgroup file to wait for
+# $2: file to store the read result
+#
+wait_inotify()
+{
+	CGROUP_FILE=$1
+	OUTPUT_FILE=$2
+
+	$WAIT_INOTIFY $CGROUP_FILE
+	cat $CGROUP_FILE > $OUTPUT_FILE
+}
+
+#
+# Test if inotify events are properly generated when going into and out of
+# invalid partition state.
+#
+test_inotify()
+{
+	ERR=0
+	PRS=/tmp/.prs_$$
+	[[ -f $WAIT_INOTIFY ]] || {
+		echo "wait_inotify not found, inotify test SKIPPED."
+		return
+	}
+
+	pause 0.01
+	echo 1 > cpuset.cpus
+	echo 0 > cgroup.procs
+	echo root > cpuset.cpus.partition
+	pause 0.01
+	rm -f $PRS
+	wait_inotify $PWD/cpuset.cpus.partition $PRS &
+	pause 0.01
+	set_ctrl_state . "O1-0"
+	pause 0.01
+	check_cgroup_states ".:P-1"
+	if [[ $? -ne 0 ]]
+	then
+		echo "FAILED: Inotify test - partition not invalid"
+		ERR=1
+	elif [[ ! -f $PRS ]]
+	then
+		echo "FAILED: Inotify test - event not generated"
+		ERR=1
+		kill %1
+	elif [[ $(cat $PRS) != "root invalid"* ]]
+	then
+		echo "FAILED: Inotify test - incorrect state"
+		cat $PRS
+		ERR=1
+	fi
+	online_cpus
+	echo member > cpuset.cpus.partition
+	echo 0 > ../cgroup.procs
+	if [[ $ERR -ne 0 ]]
+	then
+		exit 1
+	else
+		echo "Inotify test PASSED"
+	fi
+}
+
+run_state_test TEST_MATRIX
+test_isolated
+test_inotify
+echo "All tests PASSED."
+cd ..
+rmdir test
diff --git a/tools/testing/selftests/cgroup/wait_inotify.c b/tools/testing/selftests/cgroup/wait_inotify.c
new file mode 100644
index 000000000000..e11b431e1b62
--- /dev/null
+++ b/tools/testing/selftests/cgroup/wait_inotify.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Wait until an inotify event on the given cgroup file.
+ */
+#include <linux/limits.h>
+#include <sys/inotify.h>
+#include <sys/mman.h>
+#include <sys/ptrace.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+static const char usage[] = "Usage: %s [-v] <cgroup_file>\n";
+static char *file;
+static int verbose;
+
+static inline void fail_message(char *msg)
+{
+	fprintf(stderr, msg, file);
+	exit(1);
+}
+
+int main(int argc, char *argv[])
+{
+	char *cmd = argv[0];
+	int c, fd;
+	struct pollfd fds = { .events = POLLIN, };
+
+	while ((c = getopt(argc, argv, "v")) != -1) {
+		switch (c) {
+		case 'v':
+			verbose++;
+			break;
+		}
+		argv++, argc--;
+	}
+
+	if (argc != 2) {
+		fprintf(stderr, usage, cmd);
+		return -1;
+	}
+	file = argv[1];
+	fd = open(file, O_RDONLY);
+	if (fd < 0)
+		fail_message("Cgroup file %s not found!\n");
+	close(fd);
+
+	fd = inotify_init();
+	if (fd < 0)
+		fail_message("inotify_init() fails on %s!\n");
+	if (inotify_add_watch(fd, file, IN_MODIFY) < 0)
+		fail_message("inotify_add_watch() fails on %s!\n");
+	fds.fd = fd;
+
+	/*
+	 * poll waiting loop
+	 */
+	for (;;) {
+		int ret = poll(&fds, 1, 10000);
+
+		if (ret < 0) {
+			if (errno == EINTR)
+				continue;
+			perror("poll");
+			exit(1);
+		}
+		if ((ret > 0) && (fds.revents & POLLIN))
+			break;
+	}
+	if (verbose) {
+		struct inotify_event events[10];
+		long len;
+
+		usleep(1000);
+		len = read(fd, events, sizeof(events));
+		printf("Number of events read = %ld\n",
+			len/sizeof(struct inotify_event));
+	}
+	close(fd);
+	return 0;
+}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 3/8] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
  2022-05-03 16:21 ` [PATCH v10 3/8] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective Waiman Long
@ 2022-05-03 17:54   ` Phil Auld
  2022-05-03 18:39     ` Waiman Long
  0 siblings, 1 reply; 15+ messages in thread
From: Phil Auld @ 2022-05-03 17:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

Hi Waiman

On Tue, May 03, 2022 at 12:21:44PM -0400 Waiman Long wrote:
> Currently, a partition root cannot have empty "cpuset.cpus.effective".
> As a result, a parent partition root cannot distribute out all its CPUs
>  to child partitions with no CPUs left. However in most cases, there
> shouldn't be any tasks associated with intermediate nodes of the default
>  hierarchy. So the current rule is too restrictive and can waste valuable
>  CPU resource.
> 
> To address this issue, we are now allowing a partition to have empty
> "cpuset.cpus.effective" as long as it has no task. Therefore, a parent
> partition with no task can now have all its CPUs distributed out to its
> child partitions. The top cpuset always have some house-keeping tasks
> running and so its list of effective cpu can't never be empty.

s/never/ever/

> 
> Once a partition with empty "cpuset.cpus.effective" is formed, no
> new task can be moved into it until "cpuset.cpus.effective" becomes
> non-empty.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c | 111 +++++++++++++++++++++++++++++++----------
>  1 file changed, 84 insertions(+), 27 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index d156a39d7a08..7d9abd50a1b9 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -412,6 +412,41 @@ static inline bool is_in_v2_mode(void)
>  	      (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
>  }
>  
> +/**
> + * partition_is_populated - check if partition has tasks
> + * @cs: partition root to be checked
> + * @excluded_child: a child cpuset to be excluded in task checking
> + * Return: true if there are tasks, false otherwise
> + *
> + * It is assumed that @cs is a valid partition root. @excluded_child should
> + * be non-NULL when this cpuset is going to become a partition itself.
> + */
> +static inline bool partition_is_populated(struct cpuset *cs,
> +					  struct cpuset *excluded_child)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct cpuset *child;
> +
> +	if (cs->css.cgroup->nr_populated_csets)
> +		return true;
> +	if (!excluded_child && !cs->nr_subparts_cpus)
> +		return cgroup_is_populated(cs->css.cgroup);
> +
> +	rcu_read_lock();
> +	cpuset_for_each_child(child, css, cs) {
> +		if (child == excluded_child)
> +			continue;
> +		if (is_partition_valid(child))
> +			continue;
> +		if (cgroup_is_populated(child->css.cgroup)) {
> +			rcu_read_unlock();
> +			return true;
> +		}
> +	}
> +	rcu_read_unlock();
> +	return false;
> +}
> +
>  /*
>   * Return in pmask the portion of a task's cpusets's cpus_allowed that
>   * are online and are capable of running the task.  If none are found,
> @@ -1252,22 +1287,25 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
>  	if ((cmd != partcmd_update) && css_has_online_children(&cs->css))
>  		return -EBUSY;
>  
> -	/*
> -	 * Enabling partition root is not allowed if not all the CPUs
> -	 * can be granted from parent's effective_cpus or at least one
> -	 * CPU will be left after that.
> -	 */
> -	if ((cmd == partcmd_enable) &&
> -	   (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus) ||
> -	     cpumask_equal(cs->cpus_allowed, parent->effective_cpus)))
> -		return -EINVAL;
> -
> -	/*
> -	 * A cpumask update cannot make parent's effective_cpus become empty.
> -	 */
>  	adding = deleting = false;
>  	old_prs = new_prs = cs->partition_root_state;
>  	if (cmd == partcmd_enable) {
> +		/*
> +		 * Enabling partition root is not allowed if not all the CPUs
> +		 * can be granted from parent's effective_cpus.
> +		 */
> +		if (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus))
> +			return -EINVAL;
> +
> +		/*
> +		 * A parent can be left with no CPU as long as there is no
> +		 * task directly associated with the parent partition. For
> +		 * such a parent, no new task can be moved into it.
> +		 */
> +		if (partition_is_populated(parent, cs) &&
> +		    cpumask_equal(cs->cpus_allowed, parent->effective_cpus))
> +			return -EINVAL;
> +

You might consider switching these around to check the cpumasks first.


>  		cpumask_copy(tmp->addmask, cs->cpus_allowed);
>  		adding = true;
>  	} else if (cmd == partcmd_disable) {
> @@ -1289,9 +1327,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
>  		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
>  					parent->subparts_cpus);
>  		/*
> -		 * Return error if the new effective_cpus could become empty.
> +		 * Return error if the new effective_cpus could become empty
> +		 * and there are tasks in the parent.
>  		 */
> -		if (adding &&
> +		if (adding && partition_is_populated(parent, cs) &&
>  		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {

Same.



Cheers,
Phil


>  			if (!deleting)
>  				return -EINVAL;
> @@ -1317,8 +1356,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
>  		 */
>  		adding = cpumask_and(tmp->addmask, cs->cpus_allowed,
>  				     parent->effective_cpus);
> -		part_error = cpumask_equal(tmp->addmask,
> -					   parent->effective_cpus);
> +		part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) &&
> +			     partition_is_populated(parent, cs);
>  	}
>  
>  	if (cmd == partcmd_update) {
> @@ -1420,9 +1459,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
>  
>  		/*
>  		 * If it becomes empty, inherit the effective mask of the
> -		 * parent, which is guaranteed to have some CPUs.
> +		 * parent, which is guaranteed to have some CPUs unless
> +		 * it is a partition root that has explicitly distributed
> +		 * out all its CPUs.
>  		 */
>  		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
> +			if (is_partition_valid(cp) &&
> +			    cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
> +				goto update_parent_subparts;
> +
>  			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
>  			if (!cp->use_parent_ecpus) {
>  				cp->use_parent_ecpus = true;
> @@ -1444,6 +1489,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
>  			continue;
>  		}
>  
> +update_parent_subparts:
>  		/*
>  		 * update_parent_subparts_cpumask() should have been called
>  		 * for cs already in update_cpumask(). We should also call
> @@ -2249,6 +2295,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>  	    (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
>  		goto out_unlock;
>  
> +	/*
> +	 * On default hierarchy, task cannot be moved to a cpuset with empty
> +	 * effective cpus.
> +	 */
> +	if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
> +		goto out_unlock;
> +
>  	cgroup_taskset_for_each(task, css, tset) {
>  		ret = task_can_attach(task, cs->cpus_allowed);
>  		if (ret)
> @@ -3115,7 +3168,8 @@ hotplug_update_tasks(struct cpuset *cs,
>  		     struct cpumask *new_cpus, nodemask_t *new_mems,
>  		     bool cpus_updated, bool mems_updated)
>  {
> -	if (cpumask_empty(new_cpus))
> +	/* A partition root is allowed to have empty effective cpus */
> +	if (cpumask_empty(new_cpus) && !is_partition_valid(cs))
>  		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
>  	if (nodes_empty(*new_mems))
>  		*new_mems = parent_cs(cs)->effective_mems;
> @@ -3184,10 +3238,11 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>  
>  	/*
>  	 * In the unlikely event that a partition root has empty
> -	 * effective_cpus or its parent becomes invalid, we have to
> -	 * transition it to the invalid state.
> +	 * effective_cpus with tasks or its parent becomes invalid, we
> +	 * have to transition it to the invalid state.
>  	 */
> -	if (is_partition_valid(cs) && (cpumask_empty(&new_cpus) ||
> +	if (is_partition_valid(cs) &&
> +	   ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
>  	    is_partition_invalid(parent))) {
>  		if (cs->nr_subparts_cpus) {
>  			spin_lock_irq(&callback_lock);
> @@ -3198,13 +3253,15 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>  		}
>  
>  		/*
> -		 * If the effective_cpus is empty because the child
> -		 * partitions take away all the CPUs, we can keep
> -		 * the current partition and let the child partitions
> -		 * fight for available CPUs.
> +		 * Force the partition to become invalid if either one of
> +		 * the following conditions hold:
> +		 * 1) empty effective cpus but not valid empty partition.
> +		 * 2) parent is invalid or doesn't grant any cpus to child
> +		 *    partitions.
>  		 */
>  		if (is_partition_invalid(parent) ||
> -		     cpumask_empty(&new_cpus)) {
> +		    (cpumask_empty(&new_cpus) &&
> +		     partition_is_populated(cs, NULL))) {
>  			int old_prs;
>  
>  			update_parent_subparts_cpumask(cs, partcmd_disable,
> -- 
> 2.27.0
> 


-- 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 3/8] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
  2022-05-03 17:54   ` Phil Auld
@ 2022-05-03 18:39     ` Waiman Long
  0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2022-05-03 18:39 UTC (permalink / raw)
  To: Phil Auld
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On 5/3/22 13:54, Phil Auld wrote:
> Hi Waiman
>
> On Tue, May 03, 2022 at 12:21:44PM -0400 Waiman Long wrote:
>> Currently, a partition root cannot have empty "cpuset.cpus.effective".
>> As a result, a parent partition root cannot distribute out all its CPUs
>>   to child partitions with no CPUs left. However in most cases, there
>> shouldn't be any tasks associated with intermediate nodes of the default
>>   hierarchy. So the current rule is too restrictive and can waste valuable
>>   CPU resource.
>>
>> To address this issue, we are now allowing a partition to have empty
>> "cpuset.cpus.effective" as long as it has no task. Therefore, a parent
>> partition with no task can now have all its CPUs distributed out to its
>> child partitions. The top cpuset always have some house-keeping tasks
>> running and so its list of effective cpu can't never be empty.
> s/never/ever/
It is a double negative. I think I will just remove "never".
>> Once a partition with empty "cpuset.cpus.effective" is formed, no
>> new task can be moved into it until "cpuset.cpus.effective" becomes
>> non-empty.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c | 111 +++++++++++++++++++++++++++++++----------
>>   1 file changed, 84 insertions(+), 27 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index d156a39d7a08..7d9abd50a1b9 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -412,6 +412,41 @@ static inline bool is_in_v2_mode(void)
>>   	      (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
>>   }
>>   
>> +/**
>> + * partition_is_populated - check if partition has tasks
>> + * @cs: partition root to be checked
>> + * @excluded_child: a child cpuset to be excluded in task checking
>> + * Return: true if there are tasks, false otherwise
>> + *
>> + * It is assumed that @cs is a valid partition root. @excluded_child should
>> + * be non-NULL when this cpuset is going to become a partition itself.
>> + */
>> +static inline bool partition_is_populated(struct cpuset *cs,
>> +					  struct cpuset *excluded_child)
>> +{
>> +	struct cgroup_subsys_state *css;
>> +	struct cpuset *child;
>> +
>> +	if (cs->css.cgroup->nr_populated_csets)
>> +		return true;
>> +	if (!excluded_child && !cs->nr_subparts_cpus)
>> +		return cgroup_is_populated(cs->css.cgroup);
>> +
>> +	rcu_read_lock();
>> +	cpuset_for_each_child(child, css, cs) {
>> +		if (child == excluded_child)
>> +			continue;
>> +		if (is_partition_valid(child))
>> +			continue;
>> +		if (cgroup_is_populated(child->css.cgroup)) {
>> +			rcu_read_unlock();
>> +			return true;
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	return false;
>> +}
>> +
>>   /*
>>    * Return in pmask the portion of a task's cpusets's cpus_allowed that
>>    * are online and are capable of running the task.  If none are found,
>> @@ -1252,22 +1287,25 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
>>   	if ((cmd != partcmd_update) && css_has_online_children(&cs->css))
>>   		return -EBUSY;
>>   
>> -	/*
>> -	 * Enabling partition root is not allowed if not all the CPUs
>> -	 * can be granted from parent's effective_cpus or at least one
>> -	 * CPU will be left after that.
>> -	 */
>> -	if ((cmd == partcmd_enable) &&
>> -	   (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus) ||
>> -	     cpumask_equal(cs->cpus_allowed, parent->effective_cpus)))
>> -		return -EINVAL;
>> -
>> -	/*
>> -	 * A cpumask update cannot make parent's effective_cpus become empty.
>> -	 */
>>   	adding = deleting = false;
>>   	old_prs = new_prs = cs->partition_root_state;
>>   	if (cmd == partcmd_enable) {
>> +		/*
>> +		 * Enabling partition root is not allowed if not all the CPUs
>> +		 * can be granted from parent's effective_cpus.
>> +		 */
>> +		if (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus))
>> +			return -EINVAL;
>> +
>> +		/*
>> +		 * A parent can be left with no CPU as long as there is no
>> +		 * task directly associated with the parent partition. For
>> +		 * such a parent, no new task can be moved into it.
>> +		 */
>> +		if (partition_is_populated(parent, cs) &&
>> +		    cpumask_equal(cs->cpus_allowed, parent->effective_cpus))
>> +			return -EINVAL;
>> +
> You might consider switching these around to check the cpumasks first.
Good point, partition_is_populated() is more expensive.
>
>
>>   		cpumask_copy(tmp->addmask, cs->cpus_allowed);
>>   		adding = true;
>>   	} else if (cmd == partcmd_disable) {
>> @@ -1289,9 +1327,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
>>   		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
>>   					parent->subparts_cpus);
>>   		/*
>> -		 * Return error if the new effective_cpus could become empty.
>> +		 * Return error if the new effective_cpus could become empty
>> +		 * and there are tasks in the parent.
>>   		 */
>> -		if (adding &&
>> +		if (adding && partition_is_populated(parent, cs) &&
>>   		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
> Same.
Thanks,
Longman


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 7/8] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2022-05-03 16:21 ` [PATCH v10 7/8] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
@ 2022-05-04 11:25   ` Michal Koutný
  2022-05-04 16:02     ` Waiman Long
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Koutný @ 2022-05-04 11:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

Hello.

On Tue, May 03, 2022 at 12:21:48PM -0400, Waiman Long <longman@redhat.com> wrote:
>  Documentation/admin-guide/cgroup-v2.rst | 145 +++++++++++++-----------
>  1 file changed, 79 insertions(+), 66 deletions(-)

A note across various lines -- it seems your new text accidentally mixes
both spaces and tabs for indentation.

> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 69d7a6983f78..94e1e3771830 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> [...]
> +	The value shown in "cpuset.cpus.effective" of a partition root is
> +	the CPUs that the parent partition root can dedicate to the new
> +	partition root.  They are subtracted from "cpuset.cpus.effective"
> +	of the parent and may be different from "cpuset.cpus"

I find this paragraph a bit hard to comprehend (I read it as it talks
about three levels of cgroups (parent, child, grandparent). It is
correct but I'd suggect following formulation (where I additionally
simplifed it by talking about "available" cpus):

> The value shown in "cpuset.cpus.effective" of a partition root is
> the CPUs that the partition root can dedicate to a potential new child
> partition root. The new child subtracts available CPUs from its parent
> "cpuset.cpus.effective".


> +	For a partition root to become valid, the following conditions
> +	must be met.
> +
> +	1) The "cpuset.cpus" is exclusive, i.e. they are not shared by
> +	   any of its siblings (exclusivity rule).
> +	2) The parent cgroup is a valid partition root.
> +	3) The "cpuset.cpus" is not empty and must contain at least
> +	   one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
> +        4) The "cpuset.cpus.effective" must be a subset of "cpuset.cpus"
> +           and cannot be empty unless there is no task associated with
> +           this partition.

This sounds good to me.

> +        Care must be taken to change a valid partition root to "member"
> +        as all its child partitions, if present, will become invalid.

This does not talk about recovering. Is it intentional? (I.e. to left
implementation defined)

Except the remarks above, I find the concepts described here good. I'll
reply to implementation separately & later.

Regards,
Michal

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring
  2022-05-03 16:21 [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Waiman Long
                   ` (7 preceding siblings ...)
  2022-05-03 16:21 ` [PATCH v10 8/8] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long
@ 2022-05-04 11:28 ` Michal Koutný
  2022-05-04 18:33   ` Waiman Long
  8 siblings, 1 reply; 15+ messages in thread
From: Michal Koutný @ 2022-05-04 11:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

Hello.

On Tue, May 03, 2022 at 12:21:41PM -0400, Waiman Long <longman@redhat.com> wrote:
> v10:
>  - Relax constraints for changes made to "cpuset.cpus"
>    and "cpuset.cpus.partition" as suggested. Now almost all changes
>    are allowed.

I see there were also some other changes from v9 (like the first patches
of series).
Any chance you have a public git repo with both versions for a
convenient range-diff?

Thanks,
Michal

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 7/8] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2022-05-04 11:25   ` Michal Koutný
@ 2022-05-04 16:02     ` Waiman Long
  0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2022-05-04 16:02 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

On 5/4/22 07:25, Michal Koutný wrote:
> Hello.
>
> On Tue, May 03, 2022 at 12:21:48PM -0400, Waiman Long <longman@redhat.com> wrote:
>>   Documentation/admin-guide/cgroup-v2.rst | 145 +++++++++++++-----------
>>   1 file changed, 79 insertions(+), 66 deletions(-)
> A note across various lines -- it seems your new text accidentally mixes
> both spaces and tabs for indentation.

You are right. I will fix that.

>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index 69d7a6983f78..94e1e3771830 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> [...]
>> +	The value shown in "cpuset.cpus.effective" of a partition root is
>> +	the CPUs that the parent partition root can dedicate to the new
>> +	partition root.  They are subtracted from "cpuset.cpus.effective"
>> +	of the parent and may be different from "cpuset.cpus"
> I find this paragraph a bit hard to comprehend (I read it as it talks
> about three levels of cgroups (parent, child, grandparent). It is
> correct but I'd suggect following formulation (where I additionally
> simplifed it by talking about "available" cpus):
>
>> The value shown in "cpuset.cpus.effective" of a partition root is
>> the CPUs that the partition root can dedicate to a potential new child
>> partition root. The new child subtracts available CPUs from its parent
>> "cpuset.cpus.effective".


Thanks for the suggestion, will modify the text as suggested.


>
>> +	For a partition root to become valid, the following conditions
>> +	must be met.
>> +
>> +	1) The "cpuset.cpus" is exclusive, i.e. they are not shared by
>> +	   any of its siblings (exclusivity rule).
>> +	2) The parent cgroup is a valid partition root.
>> +	3) The "cpuset.cpus" is not empty and must contain at least
>> +	   one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
>> +        4) The "cpuset.cpus.effective" must be a subset of "cpuset.cpus"
>> +           and cannot be empty unless there is no task associated with
>> +           this partition.
> This sounds good to me.
>
>> +        Care must be taken to change a valid partition root to "member"
>> +        as all its child partitions, if present, will become invalid.
> This does not talk about recovering. Is it intentional? (I.e. to left
> implementation defined)

This new patch series does have the ability to recover now.  I am just 
not emphasizing the recovery aspect of it in the doc file. I will add a 
sentence about it.

>
> Except the remarks above, I find the concepts described here good. I'll
> reply to implementation separately & later.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring
  2022-05-04 11:28 ` [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Michal Koutný
@ 2022-05-04 18:33   ` Waiman Long
  0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2022-05-04 18:33 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

[-- Attachment #1: Type: text/plain, Size: 1005 bytes --]

On 5/4/22 07:28, Michal Koutný wrote:
> Hello.
>
> On Tue, May 03, 2022 at 12:21:41PM -0400, Waiman Long <longman@redhat.com> wrote:
>> v10:
>>   - Relax constraints for changes made to "cpuset.cpus"
>>     and "cpuset.cpus.partition" as suggested. Now almost all changes
>>     are allowed.
> I see there were also some other changes from v9 (like the first patches
> of series).
> Any chance you have a public git repo with both versions for a
> convenient range-diff?

That is true. Both patches 1 and 2 are new and the changes are pretty 
straight forward. Patch 1 of v9 has been merged with a latent bug. Patch 
4 of this series is a replacement of patch 3 "cgroup/cpuset: Refining 
features and constraints of a  partition" of v9. The other patches are 
similar to their versions in v9 with some adjustment based on the 
different code base.

I don't have a public repo. Attached is the file diff between v9 and v10 
in cpuset.c with some other unrelated cpuset patches included.

Cheers,
Longman

[-- Attachment #2: cpuset.c.v9.diff --]
[-- Type: text/x-patch, Size: 36505 bytes --]

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8703a8452c33..90ee0e4d8d7e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -71,7 +71,7 @@ DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
 
 /*
  * There could be abnormal cpuset configurations for cpu or memory
- * node binding, add this key to provide a quick low-cost judgement
+ * node binding, add this key to provide a quick low-cost judgment
  * of the situation.
  */
 DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
@@ -93,16 +93,20 @@ enum prs_errcode {
 	PERR_INVCPUS,
 	PERR_INVPARENT,
 	PERR_NOTPART,
+	PERR_NOTEXCL,
 	PERR_NOCPUS,
 	PERR_HOTPLUG,
+	PERR_CPUSEMPTY,
 };
 
 static const char * const perr_strings[] = {
-	[PERR_INVCPUS]   = "Invalid change to cpuset.cpus",
+	[PERR_INVCPUS]   = "Invalid cpu list in cpuset.cpus",
 	[PERR_INVPARENT] = "Parent is an invalid partition root",
 	[PERR_NOTPART]   = "Parent is not a partition root",
+	[PERR_NOTEXCL]   = "Cpu list in cpuset.cpus not exclusive",
 	[PERR_NOCPUS]    = "Parent unable to distribute cpu downstream",
 	[PERR_HOTPLUG]   = "No cpu available due to hotplug",
+	[PERR_CPUSEMPTY] = "cpuset.cpus is empty",
 };
 
 struct cpuset {
@@ -198,23 +202,22 @@ struct cpuset {
 /*
  * Partition root states:
  *
- *   0 - not a partition root
- *
+ *   0 - member (not a partition root)
  *   1 - partition root
- *
  *   2 - partition root without load balancing (isolated)
- *
  *  -1 - invalid partition root
- *       None of the cpus in cpus_allowed can be put into the parent's
- *       subparts_cpus. In this case, the cpuset is not a real partition
- *       root anymore.  However, the CPU_EXCLUSIVE bit will still be set
- *       and the cpuset can be restored back to a partition root if the
- *       parent cpuset can give more CPUs back to this child cpuset.
+ *  -2 - invalid isolated partition root
  */
-#define PRS_DISABLED		0
-#define PRS_ENABLED		1
+#define PRS_MEMBER		0
+#define PRS_ROOT		1
 #define PRS_ISOLATED		2
-#define PRS_ERROR		-1
+#define PRS_INVALID_ROOT	-1
+#define PRS_INVALID_ISOLATED	-2
+
+static inline bool is_prs_invalid(int prs_state)
+{
+	return prs_state < 0;
+}
 
 /*
  * Temporary cpumasks for working with partitions that are passed among
@@ -294,30 +297,40 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
-static inline int is_partition_root(const struct cpuset *cs)
+static inline int is_partition_valid(const struct cpuset *cs)
 {
 	return cs->partition_root_state > 0;
 }
 
+static inline int is_partition_invalid(const struct cpuset *cs)
+{
+	return cs->partition_root_state < 0;
+}
+
+static inline void set_partition_invalid(struct cpuset *cs)
+{
+	if (is_partition_valid(cs))
+		cs->partition_root_state = -cs->partition_root_state;
+}
+
 /*
  * Send notification event of whenever partition_root_state changes.
  */
-static inline void notify_partition_change(struct cpuset *cs,
-					   int old_prs, int new_prs)
+static inline void notify_partition_change(struct cpuset *cs, int old_prs)
 {
-	if (old_prs == new_prs)
+	if (old_prs == cs->partition_root_state)
 		return;
 	cgroup_file_notify(&cs->partition_file);
 
 	/* Reset prs_err if not invalid */
-	if (new_prs != PRS_ERROR)
+	if (is_partition_valid(cs))
 		WRITE_ONCE(cs->prs_err, PERR_NONE);
 }
 
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
 		  (1 << CS_MEM_EXCLUSIVE)),
-	.partition_root_state = PRS_ENABLED,
+	.partition_root_state = PRS_ROOT,
 };
 
 /**
@@ -459,7 +472,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
 	cpuset_for_each_child(child, css, cs) {
 		if (child == excluded_child)
 			continue;
-		if (is_partition_root(child))
+		if (is_partition_valid(child))
 			continue;
 		if (cgroup_is_populated(child->css.cgroup)) {
 			rcu_read_unlock();
@@ -656,6 +669,35 @@ static inline void free_cpuset(struct cpuset *cs)
 	kfree(cs);
 }
 
+/*
+ * validate_change_legacy() - Validate conditions specific to legacy (v1)
+ *                            behavior.
+ */
+static int validate_change_legacy(struct cpuset *cur, struct cpuset *trial)
+{
+	struct cgroup_subsys_state *css;
+	struct cpuset *c, *par;
+	int ret;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	/* Each of our child cpusets must be a subset of us */
+	ret = -EBUSY;
+	cpuset_for_each_child(c, css, cur)
+		if (!is_cpuset_subset(c, trial))
+			goto out;
+
+	/* On legacy hierarchy, we must be a subset of our parent cpuset. */
+	ret = -EACCES;
+	par = parent_cs(cur);
+	if (par && !is_cpuset_subset(trial, par))
+		goto out;
+
+	ret = 0;
+out:
+	return ret;
+}
+
 /*
  * validate_change() - Used to validate that any proposed cpuset change
  *		       follows the structural rules for cpusets.
@@ -680,20 +722,21 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 {
 	struct cgroup_subsys_state *css;
 	struct cpuset *c, *par;
-	int ret;
-
-	/* The checks don't apply to root cpuset */
-	if (cur == &top_cpuset)
-		return 0;
+	int ret = 0;
 
 	rcu_read_lock();
-	par = parent_cs(cur);
 
-	/* On legacy hierarchy, we must be a subset of our parent cpuset. */
-	ret = -EACCES;
-	if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
+	if (!is_in_v2_mode())
+		ret = validate_change_legacy(cur, trial);
+	if (ret)
+		goto out;
+
+	/* Remaining checks don't apply to root cpuset */
+	if (cur == &top_cpuset)
 		goto out;
 
+	par = parent_cs(cur);
+
 	/*
 	 * If either I or some sibling (!= me) is exclusive, we can't
 	 * overlap
@@ -869,7 +912,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 			update_domain_attr_tree(dattr, &top_cpuset);
 		}
 		cpumask_and(doms[0], top_cpuset.effective_cpus,
-			    housekeeping_cpumask(HK_FLAG_DOMAIN));
+			    housekeeping_cpumask(HK_TYPE_DOMAIN));
 
 		goto done;
 	}
@@ -899,7 +942,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    !(is_sched_load_balance(cp) &&
 		      cpumask_intersects(cp->cpus_allowed,
-					 housekeeping_cpumask(HK_FLAG_DOMAIN))))
+					 housekeeping_cpumask(HK_TYPE_DOMAIN))))
 			continue;
 
 		if (root_load_balance &&
@@ -911,7 +954,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 			csa[csn++] = cp;
 
 		/* skip @cp's subtree if not a partition root */
-		if (!is_partition_root(cp))
+		if (!is_partition_valid(cp))
 			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
@@ -988,7 +1031,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 
 			if (apn == b->pn) {
 				cpumask_or(dp, dp, b->effective_cpus);
-				cpumask_and(dp, dp, housekeeping_cpumask(HK_FLAG_DOMAIN));
+				cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
 				if (dattr)
 					update_domain_attr_tree(dattr + nslot, b);
 
@@ -1117,7 +1160,7 @@ static void rebuild_sched_domains_locked(void)
 	if (top_cpuset.nr_subparts_cpus) {
 		rcu_read_lock();
 		cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
-			if (!is_partition_root(cs)) {
+			if (!is_partition_valid(cs)) {
 				pos_css = css_rightmost_descendant(pos_css);
 				continue;
 			}
@@ -1164,6 +1207,14 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
+	/*
+	 * TODO: With cpuset partition that takes CPUs away from the top
+	 * cpuset, we may want to properly adjust the cpus_allowed mask of
+	 * tasks in the top cpuset as well.
+	 */
+	if (cs == &top_cpuset)
+		return;
+
 	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		set_cpus_allowed_ptr(task, cs->effective_cpus);
@@ -1203,13 +1254,15 @@ enum subparts_cmd {
 	partcmd_update,		/* Update parent's subparts_cpus */
 };
 
+static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
+		       int turning_on);
 /**
  * update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset
  * @cpuset:  The cpuset that requests change in partition root state
  * @cmd:     Partition root state change command
  * @newmask: Optional new cpumask for partcmd_update
  * @tmp:     Temporary addmask and delmask
- * Return:   0, 1 or an error code
+ * Return:   0 or a partition root state error code
  *
  * For partcmd_enable, the cpuset is being transformed from a non-partition
  * root to a partition root. The cpus_allowed mask of the given cpuset will
@@ -1217,43 +1270,32 @@ enum subparts_cmd {
  * effective_cpus. The function will return 0 if all the CPUs listed in
  * cpus_allowed can be granted or an error code will be returned.
  *
- * For partcmd_disable, the cpuset is being transofrmed from a partition
+ * For partcmd_disable, the cpuset is being transformed from a partition
  * root back to a non-partition root. Any CPUs in cpus_allowed that are in
  * parent's subparts_cpus will be taken away from that cpumask and put back
- * into parent's effective_cpus. 0 should always be returned.
+ * into parent's effective_cpus. 0 will always be returned.
  *
- * For partcmd_update, if the optional newmask is specified, the cpu
- * list is to be changed from cpus_allowed to newmask. Otherwise,
- * cpus_allowed is assumed to remain the same. The cpuset should either
- * be a partition root or an invalid partition root. The partition root
- * state may change if newmask is NULL and none of the requested CPUs can
- * be granted by the parent. The function will return 1 if changes to
- * parent's subparts_cpus and effective_cpus happen or 0 otherwise.
- * Error code should only be returned when newmask is non-NULL.
+ * For partcmd_update, if the optional newmask is specified, the cpu list is
+ * to be changed from cpus_allowed to newmask. Otherwise, cpus_allowed is
+ * assumed to remain the same. The cpuset should either be a valid or invalid
+ * partition root. The partition root state may change from valid to invalid
+ * or vice versa. An error code will only be returned if transitioning from
+ * invalid to valid violates the exclusivity rule.
  *
  * The partcmd_enable and partcmd_disable commands are used by
  * update_prstate(). The partcmd_update command is used by
  * update_cpumasks_hier() with newmask NULL and update_cpumask() with
  * newmask set.
- *
- * The checking is more strict when enabling partition root than the
- * other two commands.
- *
- * Because of the implicit cpu exclusive nature of a partition root,
- * cpumask changes that violates the cpu exclusivity rule will not be
- * permitted when checked by validate_change(). The validate_change()
- * function will also prevent any changes to the cpu list if it is not
- * a superset of children's cpu lists.
  */
-static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
+static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 					  struct cpumask *newmask,
 					  struct tmpmasks *tmp)
 {
-	struct cpuset *parent = parent_cs(cpuset);
+	struct cpuset *parent = parent_cs(cs);
 	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
 	int old_prs, new_prs;
-	bool part_error = false;	/* Partition error? */
+	int part_error = PERR_NONE;	/* Partition error? */
 
 	percpu_rwsem_assert_held(&cpuset_rwsem);
 
@@ -1262,39 +1304,41 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	 * The new cpumask, if present, or the current cpus_allowed must
 	 * not be empty.
 	 */
-	if (!is_partition_root(parent) ||
-	   (newmask && cpumask_empty(newmask)) ||
-	   (!newmask && cpumask_empty(cpuset->cpus_allowed)))
-		return -EINVAL;
-
-	/*
-	 * Enabling partition root is not allowed if there are online children.
-	 */
-	if ((cmd == partcmd_enable) && css_has_online_children(&cpuset->css))
-		return -EBUSY;
+	if (!is_partition_valid(parent)) {
+		return is_partition_invalid(parent)
+		       ? PERR_INVPARENT : PERR_NOTPART;
+	}
+	if ((newmask && cpumask_empty(newmask)) ||
+	   (!newmask && cpumask_empty(cs->cpus_allowed)))
+		return PERR_CPUSEMPTY;
 
 	adding = deleting = false;
-	old_prs = new_prs = cpuset->partition_root_state;
+	old_prs = new_prs = cs->partition_root_state;
 	if (cmd == partcmd_enable) {
 		/*
 		 * Enabling partition root is not allowed if cpus_allowed
 		 * doesn't overlap parent's cpus_allowed.
 		 */
-		if (!cpumask_intersects(cpuset->cpus_allowed, parent->cpus_allowed))
-			return -EINVAL;
+		if (!cpumask_intersects(cs->cpus_allowed, parent->cpus_allowed))
+			return PERR_INVCPUS;
 
 		/*
 		 * A parent can be left with no CPU as long as there is no
 		 * task directly associated with the parent partition.
 		 */
-		if (partition_is_populated(parent, cpuset) &&
-		    cpumask_subset(parent->effective_cpus, cpuset->cpus_allowed))
-			return -EINVAL;
+		if (partition_is_populated(parent, cs) &&
+		   !cpumask_intersects(cs->cpus_allowed, parent->effective_cpus))
+			return PERR_NOCPUS;
 
-		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
+		cpumask_copy(tmp->addmask, cs->cpus_allowed);
 		adding = true;
 	} else if (cmd == partcmd_disable) {
-		deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
+		/*
+		 * Need to remove cpus from parent's subparts_cpus for valid
+		 * partition root.
+		 */
+		deleting = !is_prs_invalid(old_prs) &&
+			   cpumask_and(tmp->delmask, cs->cpus_allowed,
 				       parent->subparts_cpus);
 	} else if (newmask) {
 		/*
@@ -1306,7 +1350,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		 * addmask = newmask & parent->cpus_allowed
 		 *		     & ~parent->subparts_cpus
 		 */
-		cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask);
+		cpumask_andnot(tmp->delmask, cs->cpus_allowed, newmask);
 		deleting = cpumask_and(tmp->delmask, tmp->delmask,
 				       parent->subparts_cpus);
 
@@ -1317,77 +1361,85 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		 * Make partition invalid if parent's effective_cpus could
 		 * become empty and there are tasks in the parent.
 		 */
-		part_error = partition_is_populated(parent, cpuset) &&
-			cpumask_subset(parent->effective_cpus, tmp->addmask) &&
-			!cpumask_intersects(tmp->delmask, cpu_active_mask);
-
-		if ((READ_ONCE(cpuset->prs_err) == PERR_NONE) && part_error)
-			WRITE_ONCE(cpuset->prs_err, PERR_INVCPUS);
+		if (adding && partition_is_populated(parent, cs) &&
+		    cpumask_subset(parent->effective_cpus, tmp->addmask) &&
+		    !cpumask_intersects(tmp->delmask, cpu_active_mask)) {
+			part_error = PERR_NOCPUS;
+			adding = false;
+			deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
+					       parent->subparts_cpus);
+		}
 	} else {
 		/*
 		 * partcmd_update w/o newmask:
 		 *
-		 * addmask = cpus_allowed & parent->effective_cpus
-		 *
-		 * This gets invoked either due to a hotplug event or
-		 * from update_cpumasks_hier() where we can't return an
-		 * error. This can cause a partition root to become invalid
-		 * in the case of a hotplug.
+		 * delmask = cpus_allowed & parent->subparts_cpus
+		 * addmask = cpus_allowed & parent->cpus_allowed
+		 *			  & ~parent->subparts_cpus
 		 *
+		 * This gets invoked either due to a hotplug event or from
+		 * update_cpumasks_hier(). This can cause the state of a
+		 * partition root to transition from valid to invalid or vice
+		 * versa. So we still need to compute the addmask and delmask.
+
 		 * A partition error happens when:
 		 * 1) Cpuset is valid partition, but parent does not distribute
 		 *    out any CPUs.
 		 * 2) Parent has tasks and all its effective CPUs will have
 		 *    to be distributed out.
 		 */
-		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
-				     parent->effective_cpus);
-		part_error = (is_partition_root(cpuset) &&
-			      !parent->nr_subparts_cpus) ||
-			     (cpumask_equal(parent->effective_cpus, tmp->addmask) &&
-			      partition_is_populated(parent, cpuset));
+		cpumask_and(tmp->addmask, cs->cpus_allowed,
+					  parent->cpus_allowed);
+		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
+					parent->subparts_cpus);
+		if ((is_partition_valid(cs) && !parent->nr_subparts_cpus) ||
+		    (adding &&
+		     cpumask_subset(parent->effective_cpus, tmp->addmask) &&
+		     partition_is_populated(parent, cs))) {
+			part_error = PERR_NOCPUS;
+			adding = false;
+		}
 
-		if (is_partition_root(cpuset) && part_error)
-			WRITE_ONCE(cpuset->prs_err, PERR_NOCPUS);
+		if (part_error && is_partition_valid(cs) &&
+		    parent->nr_subparts_cpus)
+			deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
+					       parent->subparts_cpus);
 	}
+	if (part_error)
+		WRITE_ONCE(cs->prs_err, part_error);
 
 	if (cmd == partcmd_update) {
 		/*
-		 * Check for possible transition between PRS_ERROR and
-		 * PRS_ENABLED/PRS_ISOLATED.
+		 * Check for possible transition between valid and invalid
+		 * partition root.
 		 */
-		switch (cpuset->partition_root_state) {
-		case PRS_ENABLED:
+		switch (cs->partition_root_state) {
+		case PRS_ROOT:
 		case PRS_ISOLATED:
 			if (part_error)
-				new_prs = PRS_ERROR;
+				new_prs = -old_prs;
 			break;
-		case PRS_ERROR:
-			if (part_error)
-				break;
-			if (is_sched_load_balance(cpuset))
-				new_prs = PRS_ENABLED;
-			else
-				new_prs = PRS_ISOLATED;
+		case PRS_INVALID_ROOT:
+		case PRS_INVALID_ISOLATED:
+			if (!part_error)
+				new_prs = -old_prs;
 			break;
 		}
 	}
 
-	if ((old_prs == PRS_ERROR) && (new_prs == PRS_ERROR))
-		return 0;	/* Nothing need to be done */
-
-	if (new_prs == PRS_ERROR) {
-		/*
-		 * Remove all its cpus from parent's subparts_cpus.
-		 */
-		adding = false;
-		deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
-				       parent->subparts_cpus);
-	}
-
 	if (!adding && !deleting && (new_prs == old_prs))
 		return 0;
 
+	/*
+	 * Transitioning from invalid to valid (partcmd_update) may require
+	 * setting CS_CPU_EXCLUSIVE and clearing CS_SCHED_LOAD_BALANCE later.
+	 */
+	if ((old_prs != new_prs) && is_prs_invalid(old_prs)) {
+		if (!is_cpu_exclusive(cs) &&
+		    (update_flag(CS_CPU_EXCLUSIVE, cs, 1) < 0))
+			return PERR_NOTEXCL;
+	}
+
 	/*
 	 * Change the parent's subparts_cpus.
 	 * Newly added CPUs will be removed from effective_cpus and
@@ -1414,12 +1466,25 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus);
 
 	if (old_prs != new_prs)
-		cpuset->partition_root_state = new_prs;
+		cs->partition_root_state = new_prs;
 
 	spin_unlock_irq(&callback_lock);
-	notify_partition_change(cpuset, old_prs, new_prs);
 
-	return cmd == partcmd_update;
+	if (adding || deleting)
+		update_tasks_cpumask(parent);
+
+	/*
+	 * Set or clear CS_SCHED_LOAD_BALANCE when partcmd_update, if necessary.
+	 * rebuild_sched_domains_locked() may be called.
+	 */
+	if ((old_prs != new_prs) && (cmd == partcmd_update)) {
+		if (old_prs == PRS_ISOLATED)
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
+		else if (new_prs == PRS_ISOLATED)
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+	}
+	notify_partition_change(cs, old_prs);
+	return 0;
 }
 
 /*
@@ -1457,7 +1522,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 		 * out all its CPUs.
 		 */
 		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
-			if (is_partition_root(cp) &&
+			if (is_partition_valid(cp) &&
 			    cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
 				goto update_parent_subparts;
 
@@ -1474,7 +1539,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 
 		/*
 		 * Skip the whole subtree if the cpumask remains the same
-		 * with no partition root state and force flag not set.
+		 * and has no partition root state and force flag not set.
 		 */
 		if (!cp->partition_root_state && !force &&
 		    cpumask_equal(tmp->new_cpus, cp->effective_cpus)) {
@@ -1492,21 +1557,21 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 		old_prs = new_prs = cp->partition_root_state;
 		if ((cp != cs) && old_prs) {
 			switch (parent->partition_root_state) {
-			case PRS_ENABLED:
+			case PRS_ROOT:
 			case PRS_ISOLATED:
 				update_parent = true;
 				break;
 
-			case PRS_DISABLED:
-			case PRS_ERROR:
+			default:
 				/*
 				 * When parent is not a partition root or is
 				 * invalid, child partition roots become
 				 * invalid too.
 				 */
-				new_prs = PRS_ERROR;
+				if (is_partition_valid(cp))
+					new_prs = -cp->partition_root_state;
 				WRITE_ONCE(cp->prs_err,
-					  (parent->partition_root_state == PRS_ERROR)
+					   is_partition_invalid(parent)
 					   ? PERR_INVPARENT : PERR_NOTPART);
 				break;
 			}
@@ -1517,25 +1582,25 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 		rcu_read_unlock();
 
 		if (update_parent) {
-			if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
-				update_tasks_cpumask(parent);
+			update_parent_subparts_cpumask(cp, partcmd_update, NULL,
+						       tmp);
 			/*
-			 * The cpuset partition_root_state may be changed
-			 * to PRS_ERROR. Capture it.
+			 * The cpuset partition_root_state may become
+			 * invalid. Capture it.
 			 */
 			new_prs = cp->partition_root_state;
 		}
 
 		spin_lock_irq(&callback_lock);
 
-		if (cp->nr_subparts_cpus && (new_prs <= 0)) {
+		if (cp->nr_subparts_cpus && !is_partition_valid(cp)) {
 			/*
 			 * Put all active subparts_cpus back to effective_cpus.
 			 */
 			cpumask_or(tmp->new_cpus, tmp->new_cpus,
 				   cp->subparts_cpus);
 			cpumask_and(tmp->new_cpus, tmp->new_cpus,
-				    cpu_active_mask);
+				   cpu_active_mask);
 			cp->nr_subparts_cpus = 0;
 			cpumask_clear(cp->subparts_cpus);
 		}
@@ -1552,7 +1617,8 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 
 		cp->partition_root_state = new_prs;
 		spin_unlock_irq(&callback_lock);
-		notify_partition_change(cp, old_prs, new_prs);
+
+		notify_partition_change(cp, old_prs);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
@@ -1568,7 +1634,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    is_sched_load_balance(cp) &&
 		   (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
-		    is_partition_root(cp)))
+		    is_partition_valid(cp)))
 			need_rebuild_sched_domains = true;
 
 		rcu_read_lock();
@@ -1592,10 +1658,15 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
 	struct cpuset *sibling;
 	struct cgroup_subsys_state *pos_css;
 
+	percpu_rwsem_assert_held(&cpuset_rwsem);
+
 	/*
 	 * Check all its siblings and call update_cpumasks_hier()
 	 * if their use_parent_ecpus flag is set in order for them
 	 * to use the right effective_cpus value.
+	 *
+	 * The update_cpumasks_hier() function may sleep. So we have to
+	 * release the RCU read lock before calling it.
 	 */
 	rcu_read_lock();
 	cpuset_for_each_child(sibling, pos_css, parent) {
@@ -1603,8 +1674,13 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
 			continue;
 		if (!sibling->use_parent_ecpus)
 			continue;
+		if (!css_tryget_online(&sibling->css))
+			continue;
 
+		rcu_read_unlock();
 		update_cpumasks_hier(sibling, tmp, false);
+		rcu_read_lock();
+		css_put(&sibling->css);
 	}
 	rcu_read_unlock();
 }
@@ -1662,27 +1738,35 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 #endif
 
 	if (cs->partition_root_state) {
-		/* Cpumask of a partition root cannot be empty */
-		if (cpumask_empty(trialcs->cpus_allowed))
-			return -EINVAL;
-		if (update_parent_subparts_cpumask(cs, partcmd_update,
-					trialcs->cpus_allowed, &tmp) < 0)
-			return -EINVAL;
+		update_parent_subparts_cpumask(cs, partcmd_update,
+					       trialcs->cpus_allowed, &tmp);
 	}
 
+	compute_effective_cpumask(trialcs->effective_cpus, trialcs,
+				  parent_cs(cs));
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
 
 	/*
-	 * Make sure that subparts_cpus is a subset of cpus_allowed.
+	 * Make sure that subparts_cpus, if not empty, is a subset of
+	 * cpus_allowed. Clear subparts_cpus if there is an error or
+	 * empty effective cpus with tasks.
 	 */
 	if (cs->nr_subparts_cpus) {
-		cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
-			    cs->cpus_allowed);
-		cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
+		if (cs->prs_err ||
+		   (partition_is_populated(cs, NULL) &&
+		    cpumask_subset(trialcs->effective_cpus, cs->subparts_cpus))) {
+			cs->nr_subparts_cpus = 0;
+			cpumask_clear(cs->subparts_cpus);
+		} else {
+			cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
+				    cs->cpus_allowed);
+			cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
+		}
 	}
 	spin_unlock_irq(&callback_lock);
 
+	/* effective_cpus will be updated here */
 	update_cpumasks_hier(cs, &tmp, false);
 
 	if (cs->partition_root_state) {
@@ -2059,16 +2143,17 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	return err;
 }
 
-/*
- * update_prstate - update partititon_root_state
- * cs: the cpuset to update
- * new_prs: new partition root state
+/**
+ * update_prstate - update partition_root_state
+ * @cs: the cpuset to update
+ * @new_prs: new partition root state
+ * Return: 0 if successful, != 0 if error
  *
  * Call with cpuset_rwsem held.
  */
 static int update_prstate(struct cpuset *cs, int new_prs)
 {
-	int err, old_prs = cs->partition_root_state;
+	int err = PERR_NONE, old_prs = cs->partition_root_state;
 	bool sched_domain_rebuilt = false;
 	struct cpuset *parent = parent_cs(cs);
 	struct tmpmasks tmpmask;
@@ -2077,28 +2162,33 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		return 0;
 
 	/*
-	 * Cannot force a partial or invalid partition root to a full
-	 * partition root.
+	 * For a previously invalid partition root, leave it at being
+	 * invalid if new_prs is not "member".
 	 */
-	if (new_prs && (old_prs == PRS_ERROR))
-		return -EINVAL;
+	if (new_prs && is_prs_invalid(old_prs)) {
+		cs->partition_root_state = -new_prs;
+		return 0;
+	}
 
 	if (alloc_cpumasks(NULL, &tmpmask))
 		return -ENOMEM;
 
-	err = -EINVAL;
 	if (!old_prs) {
 		/*
 		 * Turning on partition root requires setting the
 		 * CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed
-		 * cannot be NULL.
+		 * cannot be empty.
 		 */
-		if (cpumask_empty(cs->cpus_allowed))
+		if (cpumask_empty(cs->cpus_allowed)) {
+			err = PERR_CPUSEMPTY;
 			goto out;
+		}
 
 		err = update_flag(CS_CPU_EXCLUSIVE, cs, 1);
-		if (err)
+		if (err) {
+			err = PERR_NOTEXCL;
 			goto out;
+		}
 
 		err = update_parent_subparts_cpumask(cs, partcmd_enable,
 						     NULL, &tmpmask);
@@ -2120,16 +2210,16 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		 * A change in load balance state only, no change in cpumasks.
 		 */
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
-		err = 0;
+		sched_domain_rebuilt = true;
 		goto out;	/* Sched domain is rebuilt in update_flag() */
 	} else {
 		/*
-		 * Switch back to member is always allowed even if it
+		 * Switching back to member is always allowed even if it
 		 * disables child partitions.
 		 */
-		err = 0;
 		update_parent_subparts_cpumask(cs, partcmd_disable, NULL,
 					       &tmpmask);
+
 		/*
 		 * If there are child partitions, they will all become invalid.
 		 */
@@ -2151,12 +2241,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		}
 	}
 
-	/*
-	 * Update cpumask of parent's tasks except when it is the top
-	 * cpuset as some system daemons cannot be mapped to other CPUs.
-	 */
-	if (parent != &top_cpuset)
-		update_tasks_cpumask(parent);
+	update_tasks_cpumask(parent);
 
 	if (parent->child_ecpus_count)
 		update_sibling_cpumasks(parent, cs, &tmpmask);
@@ -2164,20 +2249,24 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	if (!sched_domain_rebuilt)
 		rebuild_sched_domains_locked();
 out:
-	if (!err) {
-		spin_lock_irq(&callback_lock);
-		cs->partition_root_state = new_prs;
-		spin_unlock_irq(&callback_lock);
-		/*
-		 * Update child cpusets when disabling partition.
-		 */
-		if (new_prs == PRS_DISABLED && !list_empty(&cs->css.children))
-			update_cpumasks_hier(cs, &tmpmask, true);
-		notify_partition_change(cs, old_prs, new_prs);
-	}
+	/*
+	 * Make partition invalid if an error happen
+	 */
+	if (err)
+		new_prs = -new_prs;
+	spin_lock_irq(&callback_lock);
+	cs->partition_root_state = new_prs;
+	spin_unlock_irq(&callback_lock);
+	/*
+	 * Update child cpusets, if present.
+	 * Force update if switching back to member.
+	 */
+	if (!list_empty(&cs->css.children))
+		update_cpumasks_hier(cs, &tmpmask, !new_prs);
 
+	notify_partition_change(cs, old_prs);
 	free_cpumasks(NULL, &tmpmask);
-	return err;
+	return 0;
 }
 
 /*
@@ -2361,6 +2450,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
 
+	cpus_read_lock();
 	percpu_down_write(&cpuset_rwsem);
 
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
@@ -2414,6 +2504,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		wake_up(&cpuset_attach_wq);
 
 	percpu_up_write(&cpuset_rwsem);
+	cpus_read_unlock();
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -2669,26 +2760,32 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 static int sched_partition_show(struct seq_file *seq, void *v)
 {
 	struct cpuset *cs = css_cs(seq_css(seq));
-	const char *err, *type;
+	const char *err, *type = NULL;
 
 	switch (cs->partition_root_state) {
-	case PRS_ENABLED:
+	case PRS_ROOT:
 		seq_puts(seq, "root\n");
 		break;
 	case PRS_ISOLATED:
 		seq_puts(seq, "isolated\n");
 		break;
-	case PRS_DISABLED:
+	case PRS_MEMBER:
 		seq_puts(seq, "member\n");
 		break;
-	case PRS_ERROR:
-		type = is_sched_load_balance(cs) ? "root" : "isolated";
+	case PRS_INVALID_ROOT:
+		type = "root";
+		fallthrough;
+	case PRS_INVALID_ISOLATED:
+		if (!type)
+			type = "isolated";
 		err = perr_strings[READ_ONCE(cs->prs_err)];
 		if (err)
 			seq_printf(seq, "%s invalid (%s)\n", type, err);
 		else
 			seq_printf(seq, "%s invalid\n", type);
 		break;
+		seq_puts(seq, "isolated invalid\n");
+		break;
 	}
 	return 0;
 }
@@ -2706,9 +2803,9 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 	 * Convert "root" to ENABLED, and convert "member" to DISABLED.
 	 */
 	if (!strcmp(buf, "root"))
-		val = PRS_ENABLED;
+		val = PRS_ROOT;
 	else if (!strcmp(buf, "member"))
-		val = PRS_DISABLED;
+		val = PRS_MEMBER;
 	else if (!strcmp(buf, "isolated"))
 		val = PRS_ISOLATED;
 	else
@@ -2960,7 +3057,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	/*
 	 * Clone @parent's configuration if CGRP_CPUSET_CLONE_CHILDREN is
 	 * set.  This flag handling is implemented in cgroup core for
-	 * histrical reasons - the flag may be specified during mount.
+	 * historical reasons - the flag may be specified during mount.
 	 *
 	 * Currently, if any sibling cpusets have exclusive cpus or mem, we
 	 * refuse to clone the configuration - thereby refusing the task to
@@ -3009,7 +3106,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	cpus_read_lock();
 	percpu_down_write(&cpuset_rwsem);
 
-	if (is_partition_root(cs))
+	if (is_partition_valid(cs))
 		update_prstate(cs, 0);
 
 	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
@@ -3157,7 +3254,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 
 	/*
 	 * Don't call update_tasks_cpumask() if the cpuset becomes empty,
-	 * as the tasks will be migratecd to an ancestor.
+	 * as the tasks will be migrated to an ancestor.
 	 */
 	if (cpus_updated && !cpumask_empty(cs->cpus_allowed))
 		update_tasks_cpumask(cs);
@@ -3186,7 +3283,7 @@ hotplug_update_tasks(struct cpuset *cs,
 		     bool cpus_updated, bool mems_updated)
 {
 	/* A partition root is allowed to have empty effective cpus */
-	if (cpumask_empty(new_cpus) && !is_partition_root(cs))
+	if (cpumask_empty(new_cpus) && !is_partition_valid(cs))
 		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
@@ -3259,8 +3356,8 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	 * partitions, if present, by setting nr_subparts_cpus to 0 to
 	 * reclaim their cpus.
 	 */
-	if (is_partition_root(cs) && cpumask_empty(&new_cpus) &&
-	    cs->nr_subparts_cpus && partition_is_populated(cs, NULL)) {
+	if (cs->nr_subparts_cpus && is_partition_valid(cs) &&
+	    cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) {
 		spin_lock_irq(&callback_lock);
 		cs->nr_subparts_cpus = 0;
 		cpumask_clear(cs->subparts_cpus);
@@ -3271,16 +3368,15 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	/*
 	 * Force the partition to become invalid if either one of
 	 * the following conditions hold:
-	 * 1) empty effective cpus with tasks in partition
-	 * 2) parent is invalid or doesn't grant any cpus to child partitions.
+	 * 1) empty effective cpus but not valid empty partition.
+	 * 2) parent is invalid or doesn't grant any cpus to child
+	 *    partitions.
 	 */
-	if (is_partition_root(cs) &&
-	   ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
-	    !parent->nr_subparts_cpus)) {
+	if (is_partition_valid(cs) && (!parent->nr_subparts_cpus ||
+	   (cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)))) {
 		int old_prs, parent_prs;
 
-		update_parent_subparts_cpumask(cs, partcmd_disable,
-					       NULL, tmp);
+		update_parent_subparts_cpumask(cs, partcmd_disable, NULL, tmp);
 		if (cs->nr_subparts_cpus) {
 			spin_lock_irq(&callback_lock);
 			cs->nr_subparts_cpus = 0;
@@ -3291,29 +3387,30 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 		old_prs = cs->partition_root_state;
 		parent_prs = parent->partition_root_state;
-		if (old_prs != PRS_ERROR) {
+		if (is_partition_valid(cs)) {
 			spin_lock_irq(&callback_lock);
-			cs->partition_root_state = PRS_ERROR;
+			set_partition_invalid(cs);
 			spin_unlock_irq(&callback_lock);
-			if (parent_prs == PRS_ERROR)
+			if (is_prs_invalid(parent_prs))
 				WRITE_ONCE(cs->prs_err, PERR_INVPARENT);
 			else if (!parent_prs)
 				WRITE_ONCE(cs->prs_err, PERR_NOTPART);
 			else
 				WRITE_ONCE(cs->prs_err, PERR_HOTPLUG);
-			notify_partition_change(cs, old_prs, PRS_ERROR);
+			notify_partition_change(cs, old_prs);
 		}
 		cpuset_force_rebuild();
 	}
 
 	/*
-	 * On the other hand, an erroneous partition root may be transitioned
+	 * On the other hand, an invalid partition root may be transitioned
 	 * back to a regular one.
 	 */
-	else if (is_partition_root(parent) &&
-		(cs->partition_root_state == PRS_ERROR) &&
-		 update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
-		cpuset_force_rebuild();
+	else if (is_partition_valid(parent) && is_partition_invalid(cs)) {
+		update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp);
+		if (is_partition_valid(cs))
+			cpuset_force_rebuild();
+	}
 
 update_tasks:
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
@@ -3619,8 +3716,8 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
 	return cs;
 }
 
-/**
- * cpuset_node_allowed - Can we allocate on a memory node?
+/*
+ * __cpuset_node_allowed - Can we allocate on a memory node?
  * @node: is this an allowed node?
  * @gfp_mask: memory allocation flags
  *
@@ -3662,7 +3759,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
 bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 {
 	struct cpuset *cs;		/* current cpuset ancestors */
-	int allowed;			/* is allocation in zone z allowed? */
+	bool allowed;			/* is allocation in zone z allowed? */
 	unsigned long flags;
 
 	if (in_interrupt())
@@ -3791,8 +3888,8 @@ void cpuset_print_current_mems_allowed(void)
 
 int cpuset_memory_pressure_enabled __read_mostly;
 
-/**
- * cpuset_memory_pressure_bump - keep stats of per-cpuset reclaims.
+/*
+ * __cpuset_memory_pressure_bump - keep stats of per-cpuset reclaims.
  *
  * Keep a running average of the rate of synchronous (direct)
  * page reclaim efforts initiated by tasks in each cpuset.
@@ -3807,7 +3904,7 @@ int cpuset_memory_pressure_enabled __read_mostly;
  * "memory_pressure".  Value displayed is an integer
  * representing the recent rate of entry into the synchronous
  * (direct) page reclaim by any task attached to the cpuset.
- **/
+ */
 
 void __cpuset_memory_pressure_bump(void)
 {

^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-05-04 18:36 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-03 16:21 [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Waiman Long
2022-05-03 16:21 ` [PATCH v10 1/8] cgroup/cpuset: Add top_cpuset check in update_tasks_cpumask() Waiman Long
2022-05-03 16:21 ` [PATCH v10 2/8] cgroup/cpuset: Miscellaneous cleanups & add helper functions Waiman Long
2022-05-03 16:21 ` [PATCH v10 3/8] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective Waiman Long
2022-05-03 17:54   ` Phil Auld
2022-05-03 18:39     ` Waiman Long
2022-05-03 16:21 ` [PATCH v10 4/8] cgroup/cpuset: Relax constraints to partition & cpus changes Waiman Long
2022-05-03 16:21 ` [PATCH v10 5/8] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
2022-05-03 16:21 ` [PATCH v10 6/8] cgroup/cpuset: Show invalid partition reason string Waiman Long
2022-05-03 16:21 ` [PATCH v10 7/8] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
2022-05-04 11:25   ` Michal Koutný
2022-05-04 16:02     ` Waiman Long
2022-05-03 16:21 ` [PATCH v10 8/8] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long
2022-05-04 11:28 ` [PATCH v10 0/8] cgroup/cpuset: Major cpu partition code restructuring Michal Koutný
2022-05-04 18:33   ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).