linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
@ 2021-08-25 21:37 Waiman Long
  2021-08-25 21:37 ` [PATCH v7 1/6] cgroup/cpuset: Properly transition to invalid partition Waiman Long
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Waiman Long @ 2021-08-25 21:37 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

v7:
 - Simplify the documentation patch (patch 5) as suggested by Tejun.
 - Fix a typo in patch 2 and improper commit log in patch 3.

v6:
 - Remove duplicated tmpmask from update_prstate() which should fix the
   frame size too large problem reported by kernel test robot.

v5:
 - Rebased to the latest for-5.15 branch of cgroup git tree and drop the
   1st v4 patch as it has been merged.
 - Update patch 1 to always allow changing partition root back to member
   even if it invalidates child partitions undeneath it.
 - Adjust the empty effective cpu partition patch to not allow 0 effective
   cpu for terminal partition which will make it invalid).
 - Add a new patch to enable reading of cpuset.cpus.partition to display
   the reason that causes invalid partition.
 - Adjust the documentation and testing patch accordingly.

This patchset makes four enhancements to the cpuset v2 code.

 Patch 1: Properly handle partition root tree and make partition
 invalid in case changes to cpuset.cpus violate any of the partition
 root constraints.

 Patch 2: Enable the "cpuset.cpus.partition" file to show the reason
 that causes invalid partition like "root invalid (No cpu available
 due to hotplug)".

 Patch 3: Add a new partition state "isolated" to create a partition
 root without load balancing. This is for handling intermitten workloads
 that have a strict low latency requirement.

 Patch 4: Allow partition roots that are not the top cpuset to distribute
 all its cpus to child partitions as long as there is no task associated
 with that partition root. This allows more flexibility for middleware
 to manage multiple partitions.

Patch 5 updates the cgroup-v2.rst file accordingly. Patch 6 adds a new
cpuset test to test the new cpuset partition code.


Waiman Long (6):
  cgroup/cpuset: Properly transition to invalid partition
  cgroup/cpuset: Show invalid partition reason string
  cgroup/cpuset: Add a new isolated cpus.partition type
  cgroup/cpuset: Allow non-top parent partition to distribute out all
    CPUs
  cgroup/cpuset: Update description of cpuset.cpus.partition in
    cgroup-v2.rst
  kselftest/cgroup: Add cpuset v2 partition root state test

 Documentation/admin-guide/cgroup-v2.rst       | 112 +--
 kernel/cgroup/cpuset.c                        | 337 ++++++---
 tools/testing/selftests/cgroup/Makefile       |   5 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 663 ++++++++++++++++++
 tools/testing/selftests/cgroup/wait_inotify.c |  86 +++
 5 files changed, 1050 insertions(+), 153 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
 create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

-- 
2.18.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v7 1/6] cgroup/cpuset: Properly transition to invalid partition
  2021-08-25 21:37 [PATCH v7 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
@ 2021-08-25 21:37 ` Waiman Long
  2021-08-25 21:37 ` [PATCH v7 2/6] cgroup/cpuset: Show invalid partition reason string Waiman Long
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2021-08-25 21:37 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

For cpuset partition, the special state of PRS_ERROR (invalid partition
root) was originally designed to handle hotplug events.  In this state,
CPUs allocated to the partition root is released back to the parent
but the cpuset exclusive flags remain unchanged.

Changing a cpuset into a partition root is strictly controlled. The
following constraints must be satisfied in order to make the transition
possible:

 - The "cpuset.cpus" is not empty and the list of CPUs are exclusive,
   i.e. they are not shared by any of its siblings.
 - The parent cgroup is a partition root.
 - The "cpuset.cpus" is a subset of the parent's "cpuset.cpus.effective".
 - There is no child cgroups with cpuset enabled.

Changing a partition root back to a member is always allowed, though care
must be taken to make sure that this change won't break child cpusets,
if present.

Since partition root sets the CPU_EXCLUSIVE flag, cpuset.cpus changes
that break the cpu exclusivity rule will not be allowed. However,
other changes to cpuset.cpus on a partition root may still cause it to
become invalid. So users must always check the partition root state of
"cpuset.cpus.partition" after making changes to cpuset.cpus to make sure
that the partition root is still valid.

For a partition root tree with parent and child partition roots, there
are two cases where the child partitions can become invalid. Firstly,
changing partition state to "member" will force the child partitions
to become invalid.

Secondly, if some cpus are taken away from the parent partition root
so that its cpuset.cpus.effective becomes empty, it will try to pull
cpus away from the child partitions and force them to become invalid
which may allow the parent partition to remain valid.

This patch makes sure that partitions are properly changed to invalid
when some of the valid partition constraints are violated.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 167 ++++++++++++++++++++++-------------------
 1 file changed, 90 insertions(+), 77 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d497a65c4f04..ddea05e4d1f0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1177,10 +1177,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		return -EINVAL;
 
 	/*
-	 * Enabling/disabling partition root is not allowed if there are
-	 * online children.
+	 * Enabling partition root is not allowed if there are online children.
 	 */
-	if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
+	if ((cmd == partcmd_enable) && css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
 	/*
@@ -1208,6 +1207,14 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		/*
 		 * partcmd_update with newmask:
 		 *
+		 * Make partition invalid if newmask isn't a subset of
+		 * (cpus_allowed | parent->effective_cpus).
+		 */
+		cpumask_or(tmp->addmask, cpuset->cpus_allowed,
+					 parent->effective_cpus);
+		part_error = !cpumask_subset(newmask, tmp->addmask);
+
+		/*
 		 * delmask = cpus_allowed & ~newmask & parent->subparts_cpus
 		 * addmask = newmask & parent->effective_cpus
 		 *		     & ~parent->subparts_cpus
@@ -1220,20 +1227,21 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
 					parent->subparts_cpus);
 		/*
-		 * Return error if the new effective_cpus could become empty.
+		 * Make partition invalid if parent's effective_cpus could
+		 * become empty.
 		 */
 		if (adding &&
 		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
 			if (!deleting)
-				return -EINVAL;
+				part_error = true;
 			/*
 			 * As some of the CPUs in subparts_cpus might have
 			 * been offlined, we need to compute the real delmask
 			 * to confirm that.
 			 */
-			if (!cpumask_and(tmp->addmask, tmp->delmask,
-					 cpu_active_mask))
-				return -EINVAL;
+			else if (!cpumask_and(tmp->addmask, tmp->delmask,
+					      cpu_active_mask))
+				part_error = true;
 			cpumask_copy(tmp->addmask, parent->effective_cpus);
 		}
 	} else {
@@ -1242,19 +1250,23 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		 *
 		 * addmask = cpus_allowed & parent->effective_cpus
 		 *
+		 * This gets invoked either due to a hotplug event or
+		 * from update_cpumasks_hier() where we can't return an
+		 * error. This can cause a partition root to become invalid
+		 * in the case of a hotplug.
+		 *
 		 * Note that parent's subparts_cpus may have been
 		 * pre-shrunk in case there is a change in the cpu list.
 		 * So no deletion is needed.
 		 */
 		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
 				     parent->effective_cpus);
-		part_error = cpumask_equal(tmp->addmask,
-					   parent->effective_cpus);
+		part_error = (is_partition_root(cpuset) &&
+			      !parent->nr_subparts_cpus) ||
+			     cpumask_equal(tmp->addmask, parent->effective_cpus);
 	}
 
 	if (cmd == partcmd_update) {
-		int prev_prs = cpuset->partition_root_state;
-
 		/*
 		 * Check for possible transition between PRS_ENABLED
 		 * and PRS_ERROR.
@@ -1269,13 +1281,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 				new_prs = PRS_ENABLED;
 			break;
 		}
-		/*
-		 * Set part_error if previously in invalid state.
-		 */
-		part_error = (prev_prs == PRS_ERROR);
 	}
 
-	if (!part_error && (new_prs == PRS_ERROR))
+	if ((old_prs == PRS_ERROR) && (new_prs == PRS_ERROR))
 		return 0;	/* Nothing need to be done */
 
 	if (new_prs == PRS_ERROR) {
@@ -1407,6 +1415,11 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			case PRS_ENABLED:
 				if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
 					update_tasks_cpumask(parent);
+				/*
+				 * The cpuset partition_root_state may be
+				 * changed to PRS_ERROR. Capture it.
+				 */
+				new_prs = cp->partition_root_state;
 				break;
 
 			case PRS_ERROR:
@@ -1424,33 +1437,27 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		spin_lock_irq(&callback_lock);
 
-		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
 		if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
+			/*
+			 * Put all active subparts_cpus back to effective_cpus.
+			 */
+			cpumask_or(tmp->new_cpus, tmp->new_cpus,
+				   cp->subparts_cpus);
+			cpumask_and(tmp->new_cpus, tmp->new_cpus,
+				    cpu_active_mask);
 			cp->nr_subparts_cpus = 0;
 			cpumask_clear(cp->subparts_cpus);
-		} else if (cp->nr_subparts_cpus) {
+		}
+
+		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+		if (cp->nr_subparts_cpus) {
 			/*
 			 * Make sure that effective_cpus & subparts_cpus
-			 * are mutually exclusive.
-			 *
-			 * In the unlikely event that effective_cpus
-			 * becomes empty. we clear cp->nr_subparts_cpus and
-			 * let its child partition roots to compete for
-			 * CPUs again.
+			 * of a partition root are mutually exclusive.
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
-			if (cpumask_empty(cp->effective_cpus)) {
-				cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-				cpumask_clear(cp->subparts_cpus);
-				cp->nr_subparts_cpus = 0;
-			} else if (!cpumask_subset(cp->subparts_cpus,
-						   tmp->new_cpus)) {
-				cpumask_andnot(cp->subparts_cpus,
-					cp->subparts_cpus, tmp->new_cpus);
-				cp->nr_subparts_cpus
-					= cpumask_weight(cp->subparts_cpus);
-			}
+			WARN_ON_ONCE(cpumask_empty(cp->effective_cpus));
 		}
 
 		if (new_prs != old_prs)
@@ -1582,8 +1589,8 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	 * Make sure that subparts_cpus is a subset of cpus_allowed.
 	 */
 	if (cs->nr_subparts_cpus) {
-		cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus,
-			       cs->cpus_allowed);
+		cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
+			    cs->cpus_allowed);
 		cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
 	}
 	spin_unlock_irq(&callback_lock);
@@ -2010,20 +2017,24 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		}
 	} else {
 		/*
-		 * Turning off partition root will clear the
-		 * CS_CPU_EXCLUSIVE bit.
+		 * Switch back to member is always allowed even if it
+		 * causes child partitions to become invalid.
 		 */
-		if (old_prs == PRS_ERROR) {
-			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
-			err = 0;
-			goto out;
+		err = 0;
+		update_parent_subparts_cpumask(cs, partcmd_disable, NULL,
+					       &tmpmask);
+		/*
+		 * If there are child partitions, we have to make them invalid.
+		 */
+		if (unlikely(cs->nr_subparts_cpus)) {
+			spin_lock_irq(&callback_lock);
+			cs->nr_subparts_cpus = 0;
+			cpumask_clear(cs->subparts_cpus);
+			compute_effective_cpumask(cs->effective_cpus, cs, parent);
+			spin_unlock_irq(&callback_lock);
+			update_cpumasks_hier(cs, &tmpmask);
 		}
 
-		err = update_parent_subparts_cpumask(cs, partcmd_disable,
-						     NULL, &tmpmask);
-		if (err)
-			goto out;
-
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 	}
@@ -3109,11 +3120,28 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus or its parent becomes erroneous, we have to
-	 * transition it to the erroneous state.
+	 * effective_cpus, we will have to force any child partitions,
+	 * if present, to become invalid by setting nr_subparts_cpus to 0
+	 * without causing itself to become invalid.
+	 */
+	if (is_partition_root(cs) && cs->nr_subparts_cpus &&
+	    cpumask_empty(&new_cpus)) {
+		cs->nr_subparts_cpus = 0;
+		cpumask_clear(cs->subparts_cpus);
+		compute_effective_cpumask(&new_cpus, cs, parent);
+	}
+
+	/*
+	 * If empty effective_cpus or zero nr_subparts_cpus or its parent
+	 * becomes erroneous, we have to transition it to the erroneous state.
 	 */
 	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
-	   (parent->partition_root_state == PRS_ERROR))) {
+	    (parent->partition_root_state == PRS_ERROR) ||
+	    !parent->nr_subparts_cpus)) {
+		int old_prs;
+
+		update_parent_subparts_cpumask(cs, partcmd_disable,
+					       NULL, tmp);
 		if (cs->nr_subparts_cpus) {
 			spin_lock_irq(&callback_lock);
 			cs->nr_subparts_cpus = 0;
@@ -3122,38 +3150,23 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 			compute_effective_cpumask(&new_cpus, cs, parent);
 		}
 
-		/*
-		 * If the effective_cpus is empty because the child
-		 * partitions take away all the CPUs, we can keep
-		 * the current partition and let the child partitions
-		 * fight for available CPUs.
-		 */
-		if ((parent->partition_root_state == PRS_ERROR) ||
-		     cpumask_empty(&new_cpus)) {
-			int old_prs;
-
-			update_parent_subparts_cpumask(cs, partcmd_disable,
-						       NULL, tmp);
-			old_prs = cs->partition_root_state;
-			if (old_prs != PRS_ERROR) {
-				spin_lock_irq(&callback_lock);
-				cs->partition_root_state = PRS_ERROR;
-				spin_unlock_irq(&callback_lock);
-				notify_partition_change(cs, old_prs, PRS_ERROR);
-			}
+		old_prs = cs->partition_root_state;
+		if (old_prs != PRS_ERROR) {
+			spin_lock_irq(&callback_lock);
+			cs->partition_root_state = PRS_ERROR;
+			spin_unlock_irq(&callback_lock);
+			notify_partition_change(cs, old_prs, PRS_ERROR);
 		}
 		cpuset_force_rebuild();
 	}
 
 	/*
 	 * On the other hand, an erroneous partition root may be transitioned
-	 * back to a regular one or a partition root with no CPU allocated
-	 * from the parent may change to erroneous.
+	 * back to a regular one.
 	 */
-	if (is_partition_root(parent) &&
-	   ((cs->partition_root_state == PRS_ERROR) ||
-	    !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
-	     update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
+	else if (is_partition_root(parent) &&
+		(cs->partition_root_state == PRS_ERROR) &&
+		 update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
 		cpuset_force_rebuild();
 
 update_tasks:
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v7 2/6] cgroup/cpuset: Show invalid partition reason string
  2021-08-25 21:37 [PATCH v7 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
  2021-08-25 21:37 ` [PATCH v7 1/6] cgroup/cpuset: Properly transition to invalid partition Waiman Long
@ 2021-08-25 21:37 ` Waiman Long
  2021-08-25 21:37 ` [PATCH v7 3/6] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2021-08-25 21:37 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

There are a number of different reasons which can cause a partition to
become invalid. A user seeing an invalid partition may not know exactly
why. To help user to get a better understanding of the underlying reason,
The cpuset.cpus.partition control file, when read, will now report the
reason why a partition become invalid. When a partition does become
invalid, reading the control file will show "root invalid (<reason>)"
where <reason> is a string that describes why the partition is invalid.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 46 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 43 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ddea05e4d1f0..eb2e81f9326b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -78,6 +78,24 @@ struct fmeter {
 	spinlock_t lock;	/* guards read or write of above */
 };
 
+/*
+ * Invalid partition error code
+ */
+enum prs_errcode {
+	PERR_NONE = 0,
+	PERR_INVCPUS,
+	PERR_NOCPUS,
+	PERR_PARENT,
+	PERR_HOTPLUG,
+};
+
+static const char * const perr_strings[] = {
+	[PERR_INVCPUS] = "Invalid change to cpuset.cpus",
+	[PERR_PARENT]  = "Parent is no longer a partition root",
+	[PERR_NOCPUS]  = "Parent unable to distribute cpu downstream",
+	[PERR_HOTPLUG] = "No cpu available due to hotplug",
+};
+
 struct cpuset {
 	struct cgroup_subsys_state css;
 
@@ -163,6 +181,9 @@ struct cpuset {
 
 	/* Handle for cpuset.cpus.partition */
 	struct cgroup_file partition_file;
+
+	/* Invalid partition error code, not lock protected */
+	enum prs_errcode prs_err;
 };
 
 /*
@@ -272,8 +293,13 @@ static inline int is_partition_root(const struct cpuset *cs)
 static inline void notify_partition_change(struct cpuset *cs,
 					   int old_prs, int new_prs)
 {
-	if (old_prs != new_prs)
-		cgroup_file_notify(&cs->partition_file);
+	if (old_prs == new_prs)
+		return;
+	cgroup_file_notify(&cs->partition_file);
+
+	/* Reset prs_err if not invalid */
+	if (new_prs != PRS_ERROR)
+		WRITE_ONCE(cs->prs_err, PERR_NONE);
 }
 
 static struct cpuset top_cpuset = {
@@ -1243,6 +1269,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 					      cpu_active_mask))
 				part_error = true;
 			cpumask_copy(tmp->addmask, parent->effective_cpus);
+			if ((READ_ONCE(cpuset->prs_err) == PERR_NONE) && part_error)
+				WRITE_ONCE(cpuset->prs_err, PERR_INVCPUS);
 		}
 	} else {
 		/*
@@ -1264,6 +1292,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		part_error = (is_partition_root(cpuset) &&
 			      !parent->nr_subparts_cpus) ||
 			     cpumask_equal(tmp->addmask, parent->effective_cpus);
+		if (is_partition_root(cpuset) && part_error)
+			WRITE_ONCE(cpuset->prs_err, PERR_NOCPUS);
 	}
 
 	if (cmd == partcmd_update) {
@@ -1427,6 +1457,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 				 * When parent is invalid, it has to be too.
 				 */
 				new_prs = PRS_ERROR;
+				WRITE_ONCE(cp->prs_err, PERR_PARENT);
 				break;
 			}
 		}
@@ -2546,6 +2577,7 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 static int sched_partition_show(struct seq_file *seq, void *v)
 {
 	struct cpuset *cs = css_cs(seq_css(seq));
+	const char *err;
 
 	switch (cs->partition_root_state) {
 	case PRS_ENABLED:
@@ -2555,7 +2587,11 @@ static int sched_partition_show(struct seq_file *seq, void *v)
 		seq_puts(seq, "member\n");
 		break;
 	case PRS_ERROR:
-		seq_puts(seq, "root invalid\n");
+		err = perr_strings[READ_ONCE(cs->prs_err)];
+		if (err)
+			seq_printf(seq, "root invalid (%s)\n", err);
+		else
+			seq_puts(seq, "root invalid\n");
 		break;
 	}
 	return 0;
@@ -3155,6 +3191,10 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 			spin_lock_irq(&callback_lock);
 			cs->partition_root_state = PRS_ERROR;
 			spin_unlock_irq(&callback_lock);
+			if (parent->partition_root_state == PRS_ERROR)
+				WRITE_ONCE(cs->prs_err, PERR_PARENT);
+			else
+				WRITE_ONCE(cs->prs_err, PERR_HOTPLUG);
 			notify_partition_change(cs, old_prs, PRS_ERROR);
 		}
 		cpuset_force_rebuild();
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v7 3/6] cgroup/cpuset: Add a new isolated cpus.partition type
  2021-08-25 21:37 [PATCH v7 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
  2021-08-25 21:37 ` [PATCH v7 1/6] cgroup/cpuset: Properly transition to invalid partition Waiman Long
  2021-08-25 21:37 ` [PATCH v7 2/6] cgroup/cpuset: Show invalid partition reason string Waiman Long
@ 2021-08-25 21:37 ` Waiman Long
  2021-08-25 21:37 ` [PATCH v7 4/6] cgroup/cpuset: Allow non-top parent partition to distribute out all CPUs Waiman Long
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2021-08-25 21:37 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Cpuset v1 uses the sched_load_balance control file to determine if load
balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
as its use may require disabling load balancing at cgroup root.

For workloads that require very low latency like DPDK, the latency
jitters caused by periodic load balancing may exceed the desired
latency limit.

When cpuset v2 is in use, the only way to avoid this latency cost is to
use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
the kernel boot, however, there is no way to add or remove CPUs from
this isolated set. For workloads that are more dynamic in nature, that
means users have to provision enough CPUs for the worst case situation
resulting in excess idle CPUs.

To address this issue for cpuset v2, a new cpuset.cpus.partition type
"isolated" is added which allows the creation of a cpuset partition
without load balancing. This will allow system administrators to
dynamically adjust the size of isolated partition to the current need
of the workload without rebooting the system.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 48 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 43 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index eb2e81f9326b..de770a3781c2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -193,6 +193,8 @@ struct cpuset {
  *
  *   1 - partition root
  *
+ *   2 - partition root without load balancing (isolated)
+ *
  *  -1 - invalid partition root
  *       None of the cpus in cpus_allowed can be put into the parent's
  *       subparts_cpus. In this case, the cpuset is not a real partition
@@ -202,6 +204,7 @@ struct cpuset {
  */
 #define PRS_DISABLED		0
 #define PRS_ENABLED		1
+#define PRS_ISOLATED		2
 #define PRS_ERROR		-1
 
 /*
@@ -1298,17 +1301,22 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 
 	if (cmd == partcmd_update) {
 		/*
-		 * Check for possible transition between PRS_ENABLED
-		 * and PRS_ERROR.
+		 * Check for possible transition between PRS_ERROR and
+		 * PRS_ENABLED/PRS_ISOLATED.
 		 */
 		switch (cpuset->partition_root_state) {
 		case PRS_ENABLED:
+		case PRS_ISOLATED:
 			if (part_error)
 				new_prs = PRS_ERROR;
 			break;
 		case PRS_ERROR:
-			if (!part_error)
+			if (part_error)
+				break;
+			if (is_sched_load_balance(cpuset))
 				new_prs = PRS_ENABLED;
+			else
+				new_prs = PRS_ISOLATED;
 			break;
 		}
 	}
@@ -1443,6 +1451,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 				break;
 
 			case PRS_ENABLED:
+			case PRS_ISOLATED:
 				if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
 					update_tasks_cpumask(parent);
 				/*
@@ -1468,7 +1477,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		spin_lock_irq(&callback_lock);
 
-		if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
+		if (cp->nr_subparts_cpus && (new_prs <= 0)) {
 			/*
 			 * Put all active subparts_cpus back to effective_cpus.
 			 */
@@ -2012,6 +2021,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	int err, old_prs = cs->partition_root_state;
 	struct cpuset *parent = parent_cs(cs);
 	struct tmpmasks tmpmask;
+	bool sched_domain_rebuilt = false;
 
 	if (old_prs == new_prs)
 		return 0;
@@ -2046,6 +2056,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			goto out;
 		}
+
+		if (new_prs == PRS_ISOLATED) {
+			/*
+			 * Disable the load balance flag should not return an
+			 * error unless the system is running out of memory.
+			 */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+			sched_domain_rebuilt = true;
+		}
+	} else if (old_prs && new_prs) {
+		/*
+		 * A change in load balance state only, no change in cpumasks.
+		 */
+		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
+		err = 0;
+		goto out;	/* Sched domain is rebuilt in update_flag() */
 	} else {
 		/*
 		 * Switch back to member is always allowed even if it
@@ -2068,6 +2094,12 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+
+		if (!is_sched_load_balance(cs)) {
+			/* Make sure load balance is on */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
+			sched_domain_rebuilt = true;
+		}
 	}
 
 	/*
@@ -2080,7 +2112,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	if (parent->child_ecpus_count)
 		update_sibling_cpumasks(parent, cs, &tmpmask);
 
-	rebuild_sched_domains_locked();
+	if (!sched_domain_rebuilt)
+		rebuild_sched_domains_locked();
 out:
 	if (!err) {
 		spin_lock_irq(&callback_lock);
@@ -2583,6 +2616,9 @@ static int sched_partition_show(struct seq_file *seq, void *v)
 	case PRS_ENABLED:
 		seq_puts(seq, "root\n");
 		break;
+	case PRS_ISOLATED:
+		seq_puts(seq, "isolated\n");
+		break;
 	case PRS_DISABLED:
 		seq_puts(seq, "member\n");
 		break;
@@ -2613,6 +2649,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		val = PRS_ENABLED;
 	else if (!strcmp(buf, "member"))
 		val = PRS_DISABLED;
+	else if (!strcmp(buf, "isolated"))
+		val = PRS_ISOLATED;
 	else
 		return -EINVAL;
 
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v7 4/6] cgroup/cpuset: Allow non-top parent partition to distribute out all CPUs
  2021-08-25 21:37 [PATCH v7 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (2 preceding siblings ...)
  2021-08-25 21:37 ` [PATCH v7 3/6] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
@ 2021-08-25 21:37 ` Waiman Long
  2021-08-25 21:37 ` [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
  2021-08-25 21:37 ` [PATCH v7 6/6] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long
  5 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2021-08-25 21:37 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Currently, a parent partition cannot distribute all its CPUs to child
partitions with no CPUs left. However in some use cases, a management
application may want to create a parent partition as a management unit
with no task associated with it and has all its CPUs distributed to
various child partitions dynamically according to their needs. Leaving
a cpu in the parent partition in such a case is now a waste.

To accommodate such use cases, a parent partition can now have all its
CPUs distributed to its child partitions with 0 effective cpu left as
long as it is not the top cpuset and it has no task at the time the
child partition is being created. A terminal partition with no child
partition underlying it, however, cannot have 0 effective cpu which
will make the partition invalid.

Once an empty parent partition is formed, no new task can be moved
into it.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 96 ++++++++++++++++++++++++++++++------------
 1 file changed, 69 insertions(+), 27 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index de770a3781c2..3658270d694c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -305,6 +305,21 @@ static inline void notify_partition_change(struct cpuset *cs,
 		WRITE_ONCE(cs->prs_err, PERR_NONE);
 }
 
+static inline int cpuset_has_tasks(const struct cpuset *cs)
+{
+	return cs->css.cgroup->nr_populated_csets;
+}
+
+/*
+ * A empty partition (one with no effective cpu) is valid if it has no
+ * associated task and all its cpus have been distributed out to child
+ * partitions.
+ */
+static inline bool valid_empty_partition(const struct cpuset *cs)
+{
+	return !cpuset_has_tasks(cs) && cs->nr_subparts_cpus;
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
 		  (1 << CS_MEM_EXCLUSIVE)),
@@ -1211,22 +1226,32 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	if ((cmd == partcmd_enable) && css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
-	/*
-	 * Enabling partition root is not allowed if not all the CPUs
-	 * can be granted from parent's effective_cpus or at least one
-	 * CPU will be left after that.
-	 */
-	if ((cmd == partcmd_enable) &&
-	   (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
-	     cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
-		return -EINVAL;
-
 	/*
 	 * A cpumask update cannot make parent's effective_cpus become empty.
 	 */
 	adding = deleting = false;
 	old_prs = new_prs = cpuset->partition_root_state;
 	if (cmd == partcmd_enable) {
+		bool parent_is_top_cpuset = !parent_cs(parent);
+		bool no_cpu_in_parent = cpumask_equal(cpuset->cpus_allowed,
+						      parent->effective_cpus);
+		/*
+		 * Enabling partition root is not allowed if not all the CPUs
+		 * can be granted from parent's effective_cpus. If the parent
+		 * is the top cpuset, at least one CPU must be left after that.
+		 */
+		if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
+		    (parent_is_top_cpuset && no_cpu_in_parent))
+			return -EINVAL;
+
+		/*
+		 * A non-top parent can be left with no CPU as long as there
+		 * is no task directly associated with the parent. For such
+		 * a parent, no new task can be moved into it.
+		 */
+		if (no_cpu_in_parent && cpuset_has_tasks(parent))
+			return -EINVAL;
+
 		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
 		adding = true;
 	} else if (cmd == partcmd_disable) {
@@ -1257,9 +1282,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 					parent->subparts_cpus);
 		/*
 		 * Make partition invalid if parent's effective_cpus could
-		 * become empty.
+		 * become empty and there are tasks in the parent.
 		 */
-		if (adding &&
+		if (adding && cpuset_has_tasks(parent) &&
 		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
 			if (!deleting)
 				part_error = true;
@@ -1294,7 +1319,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 				     parent->effective_cpus);
 		part_error = (is_partition_root(cpuset) &&
 			      !parent->nr_subparts_cpus) ||
-			     cpumask_equal(tmp->addmask, parent->effective_cpus);
+			     (cpumask_equal(tmp->addmask, parent->effective_cpus) &&
+			      cpuset_has_tasks(parent));
+
 		if (is_partition_root(cpuset) && part_error)
 			WRITE_ONCE(cpuset->prs_err, PERR_NOCPUS);
 	}
@@ -1397,9 +1424,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
-		 * parent, which is guaranteed to have some CPUs.
+		 * parent, which is guaranteed to have some CPUs unless
+		 * it is a partition root that has explicitly distributed
+		 * out all its CPUs.
 		 */
 		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
+			if (is_partition_root(cp) &&
+			    cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
+				goto update_parent_subparts;
+
 			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
 			if (!cp->use_parent_ecpus) {
 				cp->use_parent_ecpus = true;
@@ -1421,6 +1454,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			continue;
 		}
 
+update_parent_subparts:
 		/*
 		 * update_parent_subparts_cpumask() should have been called
 		 * for cs already in update_cpumask(). We should also call
@@ -1497,12 +1531,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
-			WARN_ON_ONCE(cpumask_empty(cp->effective_cpus));
 		}
 
-		if (new_prs != old_prs)
-			cp->partition_root_state = new_prs;
-
+		cp->partition_root_state = new_prs;
 		spin_unlock_irq(&callback_lock);
 		notify_partition_change(cp, old_prs, new_prs);
 
@@ -2249,6 +2280,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	    (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
 		goto out_unlock;
 
+	/*
+	 * On default hierarchy, task cannot be moved to a cpuset with empty
+	 * effective cpus.
+	 */
+	if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
+		goto out_unlock;
+
 	cgroup_taskset_for_each(task, css, tset) {
 		ret = task_can_attach(task, cs->cpus_allowed);
 		if (ret)
@@ -3125,7 +3163,8 @@ hotplug_update_tasks(struct cpuset *cs,
 		     struct cpumask *new_cpus, nodemask_t *new_mems,
 		     bool cpus_updated, bool mems_updated)
 {
-	if (cpumask_empty(new_cpus))
+	/* A partition root is allowed to have empty effective cpus */
+	if (cpumask_empty(new_cpus) && !is_partition_root(cs))
 		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
@@ -3194,22 +3233,25 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus, we will have to force any child partitions,
-	 * if present, to become invalid by setting nr_subparts_cpus to 0
-	 * without causing itself to become invalid.
+	 * effective_cpus with tasks, we will have to force any child
+	 * partitions, if present, to become invalid by setting
+	 * nr_subparts_cpus to 0 without causing itself to become invalid.
 	 */
-	if (is_partition_root(cs) && cs->nr_subparts_cpus &&
-	    cpumask_empty(&new_cpus)) {
+	if (is_partition_root(cs) && cpumask_empty(&new_cpus) &&
+	    !valid_empty_partition(cs)) {
 		cs->nr_subparts_cpus = 0;
 		cpumask_clear(cs->subparts_cpus);
 		compute_effective_cpumask(&new_cpus, cs, parent);
 	}
 
 	/*
-	 * If empty effective_cpus or zero nr_subparts_cpus or its parent
-	 * becomes erroneous, we have to transition it to the erroneous state.
+	 * Force the partition to become invalid if either one of
+	 * the following conditions hold:
+	 * 1) empty effective cpus but not valid empty partition.
+	 * 2) parent is invalid or doesn't grant any cpus to child partitions.
 	 */
-	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
+	if (is_partition_root(cs) &&
+	   ((cpumask_empty(&new_cpus) && !valid_empty_partition(cs)) ||
 	    (parent->partition_root_state == PRS_ERROR) ||
 	    !parent->nr_subparts_cpus)) {
 		int old_prs;
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-08-25 21:37 [PATCH v7 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (3 preceding siblings ...)
  2021-08-25 21:37 ` [PATCH v7 4/6] cgroup/cpuset: Allow non-top parent partition to distribute out all CPUs Waiman Long
@ 2021-08-25 21:37 ` Waiman Long
  2021-08-26 17:35   ` Tejun Heo
  2021-08-25 21:37 ` [PATCH v7 6/6] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long
  5 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2021-08-25 21:37 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced
"isolated" cpuset partition type as well as the ability to create
non-top cpuset partition with no cpu allocated to it.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 112 +++++++++++++-----------
 1 file changed, 63 insertions(+), 49 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index babbe04c8d37..e759b0898bce 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2091,8 +2091,9 @@ Cpuset Interface Files
 	It accepts only the following input values when written to.
 
 	  ========	================================
-	  "root"	a partition root
-	  "member"	a non-root member of a partition
+	  "member"	Non-root member of a partition
+	  "root"	Partition root
+	  "isolated"	Partition root without load balancing
 	  ========	================================
 
 	When set to be a partition root, the current cgroup is the
@@ -2101,64 +2102,77 @@ Cpuset Interface Files
 	partition roots themselves and their descendants.  The root
 	cgroup is always a partition root.
 
-	There are constraints on where a partition root can be set.
-	It can only be set in a cgroup if all the following conditions
-	are true.
+	When set to "isolated", the CPUs in that partition root will
+	be in an isolated state without any load balancing from the
+	scheduler.  Tasks in such a partition must be explicitly bound
+	to each individual CPU.
+
+	There are constraints on where a partition root can be set
+	("root" or "isolated").  It can only be set in a cgroup if all
+	the following conditions are true.
 
 	1) The "cpuset.cpus" is not empty and the list of CPUs are
 	   exclusive, i.e. they are not shared by any of its siblings.
 	2) The parent cgroup is a partition root.
-	3) The "cpuset.cpus" is also a proper subset of the parent's
+	3) The "cpuset.cpus" is a subset of the parent's
 	   "cpuset.cpus.effective".
 	4) There is no child cgroups with cpuset enabled.  This is for
 	   eliminating corner cases that have to be handled if such a
 	   condition is allowed.
 
-	Setting it to partition root will take the CPUs away from the
-	effective CPUs of the parent cgroup.  Once it is set, this
-	file cannot be reverted back to "member" if there are any child
+	Setting it to a partition root will take the CPUs away from the
+	effective CPUs of the parent cgroup.  Once it is set, this file
+	should not be reverted back to "member" if there are any child
 	cgroups with cpuset enabled.
 
-	A parent partition cannot distribute all its CPUs to its
-	child partitions.  There must be at least one cpu left in the
-	parent partition.
-
-	Once becoming a partition root, changes to "cpuset.cpus" is
-	generally allowed as long as the first condition above is true,
-	the change will not take away all the CPUs from the parent
-	partition and the new "cpuset.cpus" value is a superset of its
-	children's "cpuset.cpus" values.
-
-	Sometimes, external factors like changes to ancestors'
-	"cpuset.cpus" or cpu hotplug can cause the state of the partition
-	root to change.  On read, the "cpuset.sched.partition" file
-	can show the following values.
-
-	  ==============	==============================
-	  "member"		Non-root member of a partition
-	  "root"		Partition root
-	  "root invalid"	Invalid partition root
-	  ==============	==============================
-
-	It is a partition root if the first 2 partition root conditions
-	above are true and at least one CPU from "cpuset.cpus" is
-	granted by the parent cgroup.
-
-	A partition root can become invalid if none of CPUs requested
-	in "cpuset.cpus" can be granted by the parent cgroup or the
-	parent cgroup is no longer a partition root itself.  In this
-	case, it is not a real partition even though the restriction
-	of the first partition root condition above will still apply.
-	The cpu affinity of all the tasks in the cgroup will then be
-	associated with CPUs in the nearest ancestor partition.
-
-	An invalid partition root can be transitioned back to a
-	real partition root if at least one of the requested CPUs
-	can now be granted by its parent.  In this case, the cpu
-	affinity of all the tasks in the formerly invalid partition
-	will be associated to the CPUs of the newly formed partition.
-	Changing the partition state of an invalid partition root to
-	"member" is always allowed even if child cpusets are present.
+	A parent partition may distribute all its CPUs to its child
+	partitions as long as it is not the root cgroup.
+
+	Once becoming a partition root, changes to "cpuset.cpus"
+	is generally allowed as long as the first condition above
+	(cpu exclusivity rule) is true.
+
+	Sometimes, changes to "cpuset.cpus" or cpu hotplug may cause
+	the state of the partition root to become invalid when the
+	other constraints of partition root are violated.  Therefore,
+	user space agents that manage partition roots should avoid
+	unnecessary changes to "cpuset.cpus" and monitor the state of
+	"cpuset.cpus.partition" to make sure that the partitions are
+	functioning as expected.
+
+	On read, the "cpuset.cpus.partition" file can show the following
+	values.
+
+	  ======================	==============================
+	  "member"			Non-root member of a partition
+	  "root"			Partition root
+	  "isolated"			Partition root without load balancing
+	  "root invalid (<reason>)"	Invalid partition root
+	  ======================	==============================
+
+	A partition root becomes invalid if all the CPUs requested in
+	"cpuset.cpus" become unavailable.  This can happen if all the
+	CPUs have been offlined, or the state of an ancestor partition
+	root become invalid. "<reason>" is a string that describes why
+	the partition becomes invalid.
+
+	An invalid partition is not a real partition even though some
+	internal states may still be kept.  The cpu affinity of all
+	the tasks in the cgroup will then be associated with CPUs in
+	the nearest ancestor partition.
+
+	An invalid partition root can be reverted back to a real
+	partition root if at least one of the requested CPUs become
+	available again.  In this case, the cpu affinity of all the
+	tasks in the formerly invalid partition will be associated to
+	the CPUs of the newly formed partition.
+
+	Poll and inotify events are triggered whenever the state of
+	"cpuset.cpus.partition" changes.  That includes changes caused by
+	write to "cpuset.cpus.partition", cpu hotplug and other changes
+	that make the partition invalid.  This will allow user space
+	agents to monitor unexpected changes to "cpuset.cpus.partition"
+	without the need to do continuous polling.
 
 
 Device controller
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v7 6/6] kselftest/cgroup: Add cpuset v2 partition root state test
  2021-08-25 21:37 [PATCH v7 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (4 preceding siblings ...)
  2021-08-25 21:37 ` [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
@ 2021-08-25 21:37 ` Waiman Long
  5 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2021-08-25 21:37 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Add a test script test_cpuset_prs.sh with a helper program wait_inotify
for exercising the cpuset v2 partition root state code.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/testing/selftests/cgroup/Makefile       |   5 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 663 ++++++++++++++++++
 tools/testing/selftests/cgroup/wait_inotify.c |  87 +++
 3 files changed, 753 insertions(+), 2 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
 create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 59e222460581..3f1fd3f93f41 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -1,10 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -Wall -pthread
 
-all:
+all: ${HELPER_PROGS}
 
 TEST_FILES     := with_stress.sh
-TEST_PROGS     := test_stress.sh
+TEST_PROGS     := test_stress.sh test_cpuset_prs.sh
+TEST_GEN_FILES := wait_inotify
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_kmem
 TEST_GEN_PROGS += test_core
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
new file mode 100755
index 000000000000..07c4a6ef3700
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -0,0 +1,663 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test for cpuset v2 partition root state (PRS)
+#
+# The sched verbose flag is set, if available, so that the console log
+# can be examined for the correct setting of scheduling domain.
+#
+
+skip_test() {
+	echo "$1"
+	echo "Test SKIPPED"
+	exit 0
+}
+
+[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
+
+# Set sched verbose flag, if available
+[[ -d /sys/kernel/debug/sched ]] && echo Y > /sys/kernel/debug/sched/verbose
+
+# Get wait_inotify location
+WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
+
+# Find cgroup v2 mount point
+CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
+
+CPUS=$(lscpu | grep "^CPU(s)" | sed -e "s/.*:[[:space:]]*//")
+[[ $CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!"
+
+# Set verbose flag and delay factor
+PROG=$1
+VERBOSE=
+DELAY_FACTOR=1
+while [[ "$1" = -* ]]
+do
+	case "$1" in
+		-v) VERBOSE=1
+		    break
+		    ;;
+		-d) DELAY_FACTOR=$2
+		    shift
+		    break
+		    ;;
+		*)  echo "Usage: $PROG [-v] [-d <delay-factor>"
+		    exit
+		    ;;
+	esac
+	shift
+done
+
+cd $CGROUP2
+echo +cpuset > cgroup.subtree_control
+[[ -d test ]] || mkdir test
+cd test
+
+# Pause in ms
+pause()
+{
+	DELAY=$1
+	LOOP=0
+	while [[ $LOOP -lt $DELAY_FACTOR ]]
+	do
+		sleep $DELAY
+		((LOOP++))
+	done
+	return 0
+}
+
+console_msg()
+{
+	MSG=$1
+	echo "$MSG"
+	echo "" > /dev/console
+	echo "$MSG" > /dev/console
+	pause 0.01
+}
+
+test_partition()
+{
+	EXPECTED_VAL=$1
+	echo $EXPECTED_VAL > cpuset.cpus.partition
+	[[ $? -eq 0 ]] || exit 1
+	ACTUAL_VAL=$(cat cpuset.cpus.partition)
+	[[ $ACTUAL_VAL != $EXPECTED_VAL ]] && {
+		echo "cpuset.cpus.partition: expect $EXPECTED_VAL, found $EXPECTED_VAL"
+		echo "Test FAILED"
+		exit 1
+	}
+}
+
+test_effective_cpus()
+{
+	EXPECTED_VAL=$1
+	ACTUAL_VAL=$(cat cpuset.cpus.effective)
+	[[ "$ACTUAL_VAL" != "$EXPECTED_VAL" ]] && {
+		echo "cpuset.cpus.effective: expect '$EXPECTED_VAL', found '$EXPECTED_VAL'"
+		echo "Test FAILED"
+		exit 1
+	}
+}
+
+# Adding current process to cgroup.procs as a test
+test_add_proc()
+{
+	OUTSTR="$1"
+	ERRMSG=$((echo $$ > cgroup.procs) |& cat)
+	echo $ERRMSG | grep -q "$OUTSTR"
+	[[ $? -ne 0 ]] && {
+		echo "cgroup.procs: expect '$OUTSTR', got '$ERRMSG'"
+		echo "Test FAILED"
+		exit 1
+	}
+	echo $$ > $CGROUP2/cgroup.procs	# Move out the task
+}
+
+#
+# Testing the new "isolated" partition root type
+#
+test_isolated()
+{
+	echo 2-3 > cpuset.cpus
+	TYPE=$(cat cpuset.cpus.partition)
+	[[ $TYPE = member ]] || echo member > cpuset.cpus.partition
+
+	console_msg "Change from member to root"
+	test_partition root
+
+	console_msg "Change from root to isolated"
+	test_partition isolated
+
+	console_msg "Change from isolated to member"
+	test_partition member
+
+	console_msg "Change from member to isolated"
+	test_partition isolated
+
+	console_msg "Change from isolated to root"
+	test_partition root
+
+	console_msg "Change from root to member"
+	test_partition member
+
+	#
+	# Testing partition root with no cpu
+	#
+	console_msg "Distribute all cpus to child partition"
+	echo +cpuset > cgroup.subtree_control
+	test_partition root
+
+	mkdir A1
+	cd A1
+	echo 2-3 > cpuset.cpus
+	test_partition root
+	test_effective_cpus 2-3
+	cd ..
+	test_effective_cpus ""
+
+	console_msg "Moving task to partition test"
+	test_add_proc "No space left"
+	cd A1
+	test_add_proc ""
+	cd ..
+
+	console_msg "Shrink and expand child partition"
+	cd A1
+	echo 2 > cpuset.cpus
+	cd ..
+	test_effective_cpus 3
+	cd A1
+	echo 2-3 > cpuset.cpus
+	cd ..
+	test_effective_cpus ""
+
+	# Cleaning up
+	console_msg "Cleaning up"
+	echo $$ > $CGROUP2/cgroup.procs
+	[[ -d A1 ]] && rmdir A1
+}
+
+#
+# Cpuset controller state transition test matrix.
+#
+# Cgroup test hierarchy
+#
+# test -- A1 -- A2 -- A3
+#      \- B1
+#
+#  P<v> = set cpus.partition (0:member, 1:root, 2:isolated, -1:root invalid)
+#  C<l> = add cpu-list
+#  S<p> = use prefix in subtree_control
+#  T    = put a task into cgroup
+#  O<c>-<v> = Write <v> to CPU online file of <c>
+#
+SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
+TEST_MATRIX=(
+	# test  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+	# ----  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+	"  S+    C0-1     .      .    C2-3    S+    C4-5     .      .     0 A2:0-1"
+	"  S+    C0-1     .      .    C2-3    P1      .      .      .     0 "
+	"  S+    C0-1     .      .    C2-3   P1:S+ C0-1:P1   .      .     0 "
+	"  S+    C0-1     .      .    C2-3   P1:S+  C1:P1    .      .     0 "
+	"  S+   C0-1:S+   .      .    C2-3     .      .      .     P1     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+     C1      .      .     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+    C1:P1    .      .     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+    C1:P1    .     P1     0 "
+	"  S+   C0-1:P1   .      .    C2-3   C4-5     .      .      .     0 A1:4-5"
+	"  S+   C0-1:P1   .      .    C2-3  S+:C4-5   .      .      .     0 A1:4-5"
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .     C2     0 "
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .    C4-5    0 B1:4-5"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .      .      .      .      .     0 A1:0-1,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     C1-3    .      .      .     0 A1:1,A2:2-3"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      .      .      .     0 A1:,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      P0     .      .     0 A1:3,A2:3 A1:P1,A2:P0"
+	"  S+ C2-3:P1:S+  C2:P1  .      .     C2-4    .      .      .     0 A1:3-4,A2:2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      .      .     C0-2   0 A1:,B1:0-2 A1:P1,A2:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     C2-3    .      .      .     0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+	# CPU offlining cases:
+	"  S+    C0-1     .      .    C2-3    S+    C4-5     .     O2-0   0 A1:0-1,B1:3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O2-0    .      .      .     0 A1:0-1,A2:3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O2-0   O2-1    .      .     0 A1:0-1,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O1-0    .      .      .     0 A1:0,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O1-0   O1-1    .      .     0 A1:0-1,A2:2-3"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O3-0   O3-1    .      .     0 A1:2,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P2  .      .     O3-0   O3-1    .      .     0 A1:2,A2:3 A1:P1,A2:P2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O2-0   O2-1    .      .     0 A1:2,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P2  .      .     O2-0   O2-1    .      .     0 A1:2,A2:3 A1:P1,A2:P2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O2-0    .      .      .     0 A1:,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O3-0    .      .      .     0 A1:2,A2:2 A1:P1,A2:P-1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .    T:O2-0   .      .      .     0 A1:3,A2:3 A1:P1,A2:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O1-0    .      .      .     0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O2-0    .      .      .     0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O3-0    .      .      .     0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0   .      .      .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .    T:O2-0   .      .     0 A1:1,A2:3,A3:3 A1:P1,A2:P1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .      .    T:O3-0   .     0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O1-1    .      .     0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .    T:O2-0  O2-1    .     0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .      .    T:O3-0  O3-1   0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O2-0   O1-1    .     0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O2-0   O2-1    .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+
+	# test  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+	# ----  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+	#
+	# Incorrect change to cpuset.cpus invalidates partition root
+	#
+	# Adding CPUs to partition root that are not in parent's
+	# cpuset.cpus.effective makes it invalid.
+	"  S+ C2-3:P1:S+ C3:P1   .      .      .     C2-4    .      .     0 A1:2-3,A2:2-3 A1:P1,A2:P-1"
+
+	# Taking away all CPUs from parent or itself if there are tasks
+	# will make the partition invalid.
+	"  S+ C2-3:P1:S+  C3:P1  .      .      T     C2-3    .      .     0 A1:2-3,A2:2-3 A1:P1,A2:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:C2-3   .      .      .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    . T:C2-3:C1-3 .      .      .     0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+	# Changing a partition root member invalidates child partitions
+	"  S+ C2-3:P1:S+  C3:P1  .      .      P0     .      .      .     0 A1:2-3,A2:3 A1:P0,A2:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .     C2-3    P0     .      .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P0,A3:P-1"
+
+	# test  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+	# ----  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+	# Failure cases:
+
+	# To become a partition root, cpuset.cpus must be a subset of
+	# parent's cpuset.cpus.effective.
+	"  S+    C0-1     .      .    C2-3    S+   C4-5:P1   .      .     1 "
+
+	# A cpuset cannot become a partition root if it has child cpusets
+	# with non-empty cpuset.cpus.
+	"  S+   C0-1:S+   C1     .    C2-3    P1      .      .      .     1 "
+
+	# Any change to cpuset.cpus of a partition root must be exclusive.
+	"  S+   C0-1:P1   .      .    C2-3   C0-2     .      .      .     1 "
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .     C1     1 "
+	"  S+ C2-3:P1:S+  C2:P1  .     C1    C1-3     .      .      .     1 "
+
+	# Deletion of CPUs distributed to child partition root is not allowed.
+	"  S+ C0-1:P1:S+ C1      .    C2-3   C4-5     .      .      .     1 "
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .    C0-2     .      .      .     1 "
+
+	# A task cannot be added to a non-terminal partition with no cpu
+	"  S+ C2-3:P1:S+  C3:P1  .      .    O2-0:T   .      .      .     1 A1:,A2:3 A1:P1,A2:P1"
+)
+
+#
+# Write to the cpu online file
+#  $1 - <c>-<v> where <c> = cpu number, <v> value to be written
+#
+write_cpu_online()
+{
+	CPU=${1%-*}
+	VAL=${1#*-}
+	CPUFILE=//sys/devices/system/cpu/cpu${CPU}/online
+	if [[ $VAL -eq 0 ]]
+	then
+		OFFLINE_CPUS="$OFFLINE_CPUS $CPU"
+	else
+		[[ -n "$OFFLINE_CPUS" ]] && {
+			OFFLINE_CPUS=$(echo $CPU $CPU $OFFLINE_CPUS | fmt -1 |\
+					sort | uniq -u)
+		}
+	fi
+	echo $VAL > $CPUFILE
+	pause 0.01
+}
+
+#
+# Set controller state
+#  $1 - cgroup directory
+#  $2 - state
+#  $3 - showerr
+#
+# The presence of ":" in state means transition from one to the next.
+#
+set_ctrl_state()
+{
+	TMPMSG=/tmp/.msg_$$
+	CGRP=$1
+	STATE=$2
+	SHOWERR=${3}${VERBOSE}
+	CTRL=${CTRL:=$CONTROLLER}
+	HASERR=0
+	REDIRECT="2> $TMPMSG"
+	[[ -z "$STATE" || "$STATE" = '.' ]] && return 0
+
+	rm -f $TMPMSG
+	for CMD in $(echo $STATE | sed -e "s/:/ /g")
+	do
+		TFILE=$CGRP/cgroup.procs
+		SFILE=$CGRP/cgroup.subtree_control
+		PFILE=$CGRP/cpuset.cpus.partition
+		CFILE=$CGRP/cpuset.cpus
+		S=$(expr substr $CMD 1 1)
+		if [[ $S = S ]]
+		then
+			PREFIX=${CMD#?}
+			COMM="echo ${PREFIX}${CTRL} > $SFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = C ]]
+		then
+			CPUS=${CMD#?}
+			COMM="echo $CPUS > $CFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = P ]]
+		then
+			VAL=${CMD#?}
+			case $VAL in
+			0)  VAL=member
+			    ;;
+			1)  VAL=root
+			    ;;
+			2)  VAL=isolated
+			    ;;
+			*)
+			    echo "Invalid partition state - $VAL"
+			    exit 1
+			    ;;
+			esac
+			COMM="echo $VAL > $PFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = O ]]
+		then
+			VAL=${CMD#?}
+			write_cpu_online $VAL
+		elif [[ $S = T ]]
+		then
+			COMM="echo 0 > $TFILE"
+			eval $COMM $REDIRECT
+		fi
+		RET=$?
+		[[ $RET -ne 0 ]] && {
+			[[ -n "$SHOWERR" ]] && {
+				echo "$COMM"
+				cat $TMPMSG
+			}
+			HASERR=1
+		}
+		pause 0.01
+		rm -f $TMPMSG
+	done
+	return $HASERR
+}
+
+set_ctrl_state_noerr()
+{
+	CGRP=$1
+	STATE=$2
+	[[ -d $CGRP ]] || mkdir $CGRP
+	set_ctrl_state $CGRP $STATE 1
+	[[ $? -ne 0 ]] && {
+		echo "ERROR: Failed to set $2 to cgroup $1!"
+		exit 1
+	}
+}
+
+online_cpus()
+{
+	[[ -n "OFFLINE_CPUS" ]] && {
+		for C in $OFFLINE_CPUS
+		do
+			write_cpu_online ${C}-1
+		done
+	}
+}
+
+#
+# Return 1 if the list of effective cpus isn't the same as the initial list.
+#
+reset_cgroup_states()
+{
+	echo 0 > $CGROUP2/cgroup.procs
+	online_cpus
+	rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1
+	set_ctrl_state . S-
+	pause 0.01
+}
+
+dump_states()
+{
+	for DIR in A1 A1/A2 A1/A2/A3 B1
+	do
+		ECPUS=$DIR/cpuset.cpus.effective
+		PRS=$DIR/cpuset.cpus.partition
+		[[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)"
+		[[ -e $PRS   ]] && echo "$PRS: $(cat $PRS)"
+	done
+}
+
+#
+# Check effective cpus
+# $1 - check string, format: <cgroup>:<cpu-list>[,<cgroup>:<cpu-list>]*
+#
+check_effective_cpus()
+{
+	CHK_STR=$1
+	for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+	do
+		set -- $(echo $CHK | sed -e "s/:/ /g")
+		CGRP=$1
+		CPUS=$2
+		[[ $CGRP = A2 ]] && CGRP=A1/A2
+		[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+		FILE=$CGRP/cpuset.cpus.effective
+		[[ -e $FILE ]] || return 1
+		[[ $CPUS = $(cat $FILE) ]] || return 1
+	done
+}
+
+#
+# Check cgroup states
+#  $1 - check string, format: <cgroup>:<state>[,<cgroup>:<state>]*
+#
+check_cgroup_states()
+{
+	CHK_STR=$1
+	for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+	do
+		set -- $(echo $CHK | sed -e "s/:/ /g")
+		CGRP=$1
+		STATE=$2
+		FILE=
+		EVAL=$(expr substr $STATE 2 2)
+		[[ $CGRP = A2 ]] && CGRP=A1/A2
+		[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+
+		case $STATE in
+			P*) FILE=$CGRP/cpuset.cpus.partition
+			    ;;
+			*)  echo "Unknown state: $STATE!"
+			    exit 1
+			    ;;
+		esac
+		VAL=$(cat $FILE)
+
+		case "$VAL" in
+			member) VAL=0
+				;;
+			root)	VAL=1
+				;;
+			isolated)
+				VAL=2
+				;;
+			"root invalid"*)
+				VAL=-1
+				;;
+		esac
+		[[ $EVAL != $VAL ]] && return 1
+	done
+	return 0
+}
+
+#
+# Run cpuset state transition test
+#  $1 - test matrix name
+#
+# This test is somewhat fragile as delays (sleep x) are added in various
+# places to make sure state changes are fully propagated before the next
+# action. These delays may need to be adjusted if running in a slower machine.
+#
+run_state_test()
+{
+	TEST=$1
+	CONTROLLER=cpuset
+	CPULIST=0-6
+	I=0
+	eval CNT="\${#$TEST[@]}"
+
+	reset_cgroup_states
+	echo $CPULIST > cpuset.cpus
+	echo root > cpuset.cpus.partition
+	console_msg "Running state transition test ..."
+
+	while [[ $I -lt $CNT ]]
+	do
+		echo "Running test $I ..." > /dev/console
+		eval set -- "\${$TEST[$I]}"
+		ROOT=$1
+		OLD_A1=$2
+		OLD_A2=$3
+		OLD_A3=$4
+		OLD_B1=$5
+		NEW_A1=$6
+		NEW_A2=$7
+		NEW_A3=$8
+		NEW_B1=$9
+		RESULT=${10}
+		ECPUS=${11}
+		STATES=${12}
+
+		set_ctrl_state_noerr .        $ROOT
+		set_ctrl_state_noerr A1       $OLD_A1
+		set_ctrl_state_noerr A1/A2    $OLD_A2
+		set_ctrl_state_noerr A1/A2/A3 $OLD_A3
+		set_ctrl_state_noerr B1       $OLD_B1
+		RETVAL=0
+		set_ctrl_state A1       $NEW_A1; ((RETVAL += $?))
+		set_ctrl_state A1/A2    $NEW_A2; ((RETVAL += $?))
+		set_ctrl_state A1/A2/A3 $NEW_A3; ((RETVAL += $?))
+		set_ctrl_state B1       $NEW_B1; ((RETVAL += $?))
+
+		[[ $RETVAL -ne $RESULT ]] && {
+			echo "Test $TEST[$I] failed result check!"
+			eval echo \"\${$TEST[$I]}\"
+			dump_states
+			online_cpus
+			exit 1
+		}
+
+		[[ -n "$ECPUS" && "$ECPUS" != . ]] && {
+			check_effective_cpus $ECPUS
+			[[ $? -ne 0 ]] && {
+				echo "Test $TEST[$I] failed effective CPU check!"
+				eval echo \"\${$TEST[$I]}\"
+				echo
+				dump_states
+				online_cpus
+				exit 1
+			}
+		}
+
+		[[ -n "$STATES" ]] && {
+			check_cgroup_states $STATES
+			[[ $? -ne 0 ]] && {
+				echo "FAILED: Test $TEST[$I] failed states check!"
+				eval echo \"\${$TEST[$I]}\"
+				echo
+				dump_states
+				online_cpus
+				exit 1
+			}
+		}
+
+		reset_cgroup_states
+		#
+		# Check to see if effective cpu list changes
+		#
+		pause 0.05
+		NEWLIST=$(cat cpuset.cpus.effective)
+		[[ $NEWLIST != $CPULIST ]] && {
+			echo "Effective cpus changed to $NEWLIST after test $I!"
+			exit 1
+		}
+		[[ -n "$VERBOSE" ]] && echo "Test $I done."
+		((I++))
+	done
+	echo "All $I tests of $TEST PASSED."
+
+	echo member > cpuset.cpus.partition
+}
+
+#
+# Wait for inotify event for the given file and read it
+# $1: cgroup file to wait for
+# $2: file to store the read result
+#
+wait_inotify()
+{
+	CGROUP_FILE=$1
+	OUTPUT_FILE=$2
+
+	$WAIT_INOTIFY $CGROUP_FILE
+	cat $CGROUP_FILE > $OUTPUT_FILE
+}
+
+#
+# Test if inotify events are properly generated when going into and out of
+# invalid partition state.
+#
+test_inotify()
+{
+	ERR=0
+	PRS=/tmp/.prs_$$
+	[[ -f $WAIT_INOTIFY ]] || {
+		echo "wait_inotify not found, inotify test SKIPPED."
+		return
+	}
+
+	pause 0.01
+	echo 1 > cpuset.cpus
+	echo 0 > cgroup.procs
+	echo root > cpuset.cpus.partition
+	pause 0.01
+	rm -f $PRS
+	wait_inotify $PWD/cpuset.cpus.partition $PRS &
+	pause 0.01
+	set_ctrl_state . "O1-0"
+	pause 0.01
+	check_cgroup_states ".:P-1"
+	if [[ $? -ne 0 ]]
+	then
+		echo "FAILED: Inotify test - partition not invalid"
+		ERR=1
+	elif [[ ! -f $PRS ]]
+	then
+		echo "FAILED: Inotify test - event not generated"
+		ERR=1
+		kill %1
+	elif [[ $(cat $PRS) != "root invalid"* ]]
+	then
+		echo "FAILED: Inotify test - incorrect state"
+		cat $PRS
+		ERR=1
+	fi
+	online_cpus
+	echo member > cpuset.cpus.partition
+	echo 0 > ../cgroup.procs
+	if [[ $ERR -ne 0 ]]
+	then
+		exit 1
+	else
+		echo "Inotify test PASSED"
+	fi
+}
+
+run_state_test TEST_MATRIX
+test_isolated
+test_inotify
+echo "All tests PASSED."
+cd ..
+rmdir test
diff --git a/tools/testing/selftests/cgroup/wait_inotify.c b/tools/testing/selftests/cgroup/wait_inotify.c
new file mode 100644
index 000000000000..e11b431e1b62
--- /dev/null
+++ b/tools/testing/selftests/cgroup/wait_inotify.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Wait until an inotify event on the given cgroup file.
+ */
+#include <linux/limits.h>
+#include <sys/inotify.h>
+#include <sys/mman.h>
+#include <sys/ptrace.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+static const char usage[] = "Usage: %s [-v] <cgroup_file>\n";
+static char *file;
+static int verbose;
+
+static inline void fail_message(char *msg)
+{
+	fprintf(stderr, msg, file);
+	exit(1);
+}
+
+int main(int argc, char *argv[])
+{
+	char *cmd = argv[0];
+	int c, fd;
+	struct pollfd fds = { .events = POLLIN, };
+
+	while ((c = getopt(argc, argv, "v")) != -1) {
+		switch (c) {
+		case 'v':
+			verbose++;
+			break;
+		}
+		argv++, argc--;
+	}
+
+	if (argc != 2) {
+		fprintf(stderr, usage, cmd);
+		return -1;
+	}
+	file = argv[1];
+	fd = open(file, O_RDONLY);
+	if (fd < 0)
+		fail_message("Cgroup file %s not found!\n");
+	close(fd);
+
+	fd = inotify_init();
+	if (fd < 0)
+		fail_message("inotify_init() fails on %s!\n");
+	if (inotify_add_watch(fd, file, IN_MODIFY) < 0)
+		fail_message("inotify_add_watch() fails on %s!\n");
+	fds.fd = fd;
+
+	/*
+	 * poll waiting loop
+	 */
+	for (;;) {
+		int ret = poll(&fds, 1, 10000);
+
+		if (ret < 0) {
+			if (errno == EINTR)
+				continue;
+			perror("poll");
+			exit(1);
+		}
+		if ((ret > 0) && (fds.revents & POLLIN))
+			break;
+	}
+	if (verbose) {
+		struct inotify_event events[10];
+		long len;
+
+		usleep(1000);
+		len = read(fd, events, sizeof(events));
+		printf("Number of events read = %ld\n",
+			len/sizeof(struct inotify_event));
+	}
+	close(fd);
+	return 0;
+}
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-08-25 21:37 ` [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
@ 2021-08-26 17:35   ` Tejun Heo
  2021-08-27  3:01     ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2021-08-26 17:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

Hello, Waiman.

Let's stop iterating on the patchset until we reach a consensus.

On Wed, Aug 25, 2021 at 05:37:49PM -0400, Waiman Long wrote:
>  	1) The "cpuset.cpus" is not empty and the list of CPUs are
>  	   exclusive, i.e. they are not shared by any of its siblings.

Part of it can be reached by cpus going offline.

>  	2) The parent cgroup is a partition root.

This condition can happen if a parent stop being a partition.

> -	3) The "cpuset.cpus" is also a proper subset of the parent's
> +	3) The "cpuset.cpus" is a subset of the parent's
>  	   "cpuset.cpus.effective".

This can happen if cpus go offline.

>  	4) There is no child cgroups with cpuset enabled.  This is for
>  	   eliminating corner cases that have to be handled if such a
>  	   condition is allowed.

This may make sense as a short cut for us but doesn't really stem from
interface or behavior requirements.

Of the four conditions listed, two are bogus (the states can be
reached through a different path and the configuration success or
failure can be timing dependent if configuration racaes against cpu
hotplug operations) and one maybe makes sense half-way and one is more
of a shortcut.

Can't we just replace these with transitions to invalid state with
proper explanation? That'd get rid of the error handling duplications
from both the kernel and user side, make automated configurations
which may race against hot plug operations reliable, and consistently
provide users with why something failed.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-08-26 17:35   ` Tejun Heo
@ 2021-08-27  3:01     ` Waiman Long
  2021-08-27  4:00       ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2021-08-27  3:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On 8/26/21 1:35 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> Let's stop iterating on the patchset until we reach a consensus.
>
> On Wed, Aug 25, 2021 at 05:37:49PM -0400, Waiman Long wrote:
>>   	1) The "cpuset.cpus" is not empty and the list of CPUs are
>>   	   exclusive, i.e. they are not shared by any of its siblings.
> Part of it can be reached by cpus going offline.
>
>>   	2) The parent cgroup is a partition root.
> This condition can happen if a parent stop being a partition.
>
>> -	3) The "cpuset.cpus" is also a proper subset of the parent's
>> +	3) The "cpuset.cpus" is a subset of the parent's
>>   	   "cpuset.cpus.effective".
> This can happen if cpus go offline.
>
>>   	4) There is no child cgroups with cpuset enabled.  This is for
>>   	   eliminating corner cases that have to be handled if such a
>>   	   condition is allowed.
> This may make sense as a short cut for us but doesn't really stem from
> interface or behavior requirements.
>
> Of the four conditions listed, two are bogus (the states can be
> reached through a different path and the configuration success or
> failure can be timing dependent if configuration racaes against cpu
> hotplug operations) and one maybe makes sense half-way and one is more
> of a shortcut.
>
> Can't we just replace these with transitions to invalid state with
> proper explanation? That'd get rid of the error handling duplications
> from both the kernel and user side, make automated configurations
> which may race against hot plug operations reliable, and consistently
> provide users with why something failed.

What I am doing here is setting a high bar for transitioning from member 
to either "root" or "isolated". Once it becomes a partition, there are 
multiple ways that can make it invalid. I am fine with that. However, I 
am not sure it is a good idea to allow users to echo "root" to 
cpuset.cpus.partition anywhere in the cgroup hierarchy and require them 
to read it back to see if it succeed.

All the checking are done with cpuset_rwsem held. So there shouldn't be 
any racing. Of course, a hotplug can immediately follow and make the 
partition invalid.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-08-27  3:01     ` Waiman Long
@ 2021-08-27  4:00       ` Tejun Heo
  2021-08-27 21:19         ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2021-08-27  4:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

Hello,

On Thu, Aug 26, 2021 at 11:01:30PM -0400, Waiman Long wrote:
> What I am doing here is setting a high bar for transitioning from member to
> either "root" or "isolated". Once it becomes a partition, there are multiple
> ways that can make it invalid. I am fine with that. However, I am not sure
> it is a good idea to allow users to echo "root" to cpuset.cpus.partition
> anywhere in the cgroup hierarchy and require them to read it back to see if
> it succeed.

The problem is that the "high" bar is rather arbitrary. It might feel like a
good idea to some but not to others. There are no clear technical reasons or
principles for rules to be set this particular way.

> All the checking are done with cpuset_rwsem held. So there shouldn't be any
> racing. Of course, a hotplug can immediately follow and make the partition
> invalid.

Imagine a system which dynamically on/offlines its cpus based on load or
whatever and also configures partitions for cases where the needed cpus are
online. If the partitions are set up while the cpus are online, it'd work as
expected - partitions are in effect when the system can support them and
ignored otherwise. However, if the partition configuration is attempted
while the cpus happen to be offline, the configuration will fail, and there
is no guaranteed way to make that configuration stick short of disabling
hotplug operations. This is a pretty jarring brekage happening exactly
because the behavior is an inconsistent amalgam.

It's usually not a good sign if interface restrictions can be added or
removed because how one feels without clear functional reasons and often
indicates that there's something broken, which seems to be the case here
too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-08-27  4:00       ` Tejun Heo
@ 2021-08-27 21:19         ` Waiman Long
  2021-08-27 21:27           ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2021-08-27 21:19 UTC (permalink / raw)
  To: Tejun Heo, Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On 8/27/21 12:00 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 26, 2021 at 11:01:30PM -0400, Waiman Long wrote:
>> What I am doing here is setting a high bar for transitioning from member to
>> either "root" or "isolated". Once it becomes a partition, there are multiple
>> ways that can make it invalid. I am fine with that. However, I am not sure
>> it is a good idea to allow users to echo "root" to cpuset.cpus.partition
>> anywhere in the cgroup hierarchy and require them to read it back to see if
>> it succeed.
> The problem is that the "high" bar is rather arbitrary. It might feel like a
> good idea to some but not to others. There are no clear technical reasons or
> principles for rules to be set this particular way.
>
>> All the checking are done with cpuset_rwsem held. So there shouldn't be any
>> racing. Of course, a hotplug can immediately follow and make the partition
>> invalid.
> Imagine a system which dynamically on/offlines its cpus based on load or
> whatever and also configures partitions for cases where the needed cpus are
> online. If the partitions are set up while the cpus are online, it'd work as
> expected - partitions are in effect when the system can support them and
> ignored otherwise. However, if the partition configuration is attempted
> while the cpus happen to be offline, the configuration will fail, and there
> is no guaranteed way to make that configuration stick short of disabling
> hotplug operations. This is a pretty jarring brekage happening exactly
> because the behavior is an inconsistent amalgam.
>
> It's usually not a good sign if interface restrictions can be added or
> removed because how one feels without clear functional reasons and often
> indicates that there's something broken, which seems to be the case here
> too.

Well, that is a valid point. The cpus may have been offlined when a 
partition is being created. I can certainly relent on this check in 
forming a partition. IOW, cpus_allowed can contain some or all offline 
cpus and a valid (some are online) or invalid (all are offline) 
partition can be formed. I can also allow an invalid child partition to 
be formed with an invalid parent partition. However, the cpu exclusivity 
rules will still apply.

Other than that, do you envision any other circumstances where we should 
allow an invalid partition to be formed?

Cheers,
Longman



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-08-27 21:19         ` Waiman Long
@ 2021-08-27 21:27           ` Tejun Heo
  2021-08-27 22:50             ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2021-08-27 21:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

Hello,

On Fri, Aug 27, 2021 at 05:19:31PM -0400, Waiman Long wrote:
> Well, that is a valid point. The cpus may have been offlined when a
> partition is being created. I can certainly relent on this check in forming
> a partition. IOW, cpus_allowed can contain some or all offline cpus and a
> valid (some are online) or invalid (all are offline) partition can be
> formed. I can also allow an invalid child partition to be formed with an
> invalid parent partition. However, the cpu exclusivity rules will still
> apply.
> 
> Other than that, do you envision any other circumstances where we should
> allow an invalid partition to be formed?

Now that most restrictions are removed from configuration side, just go all
the way? Given that the user must check the status afterwards anyway, I
don't see technical or even usability reasons for leaving some pieces
behind. Going all the way would be easier to use too - bang in the target
config and read the resulting state to reliably find out why a partition
isn't valid, especially if we list *all* the reasons so that the user can
tell whether the configuration is as intended immediately.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-08-27 21:27           ` Tejun Heo
@ 2021-08-27 22:50             ` Waiman Long
  2021-08-27 23:35               ` Tejun Heo
  2021-08-30 17:59               ` Michal Koutný
  0 siblings, 2 replies; 19+ messages in thread
From: Waiman Long @ 2021-08-27 22:50 UTC (permalink / raw)
  To: Tejun Heo, Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On 8/27/21 5:27 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Aug 27, 2021 at 05:19:31PM -0400, Waiman Long wrote:
>> Well, that is a valid point. The cpus may have been offlined when a
>> partition is being created. I can certainly relent on this check in forming
>> a partition. IOW, cpus_allowed can contain some or all offline cpus and a
>> valid (some are online) or invalid (all are offline) partition can be
>> formed. I can also allow an invalid child partition to be formed with an
>> invalid parent partition. However, the cpu exclusivity rules will still
>> apply.
>>
>> Other than that, do you envision any other circumstances where we should
>> allow an invalid partition to be formed?
> Now that most restrictions are removed from configuration side, just go all
> the way? Given that the user must check the status afterwards anyway, I
> don't see technical or even usability reasons for leaving some pieces
> behind. Going all the way would be easier to use too - bang in the target
> config and read the resulting state to reliably find out why a partition
> isn't valid, especially if we list *all* the reasons so that the user
> tell whether the configuration is as intended immediately.

The cpu exclusivity rule is due to the setting of CPU_EXCLUSIVE bit. 
This is a pre-existing condition unless you want to change how the 
cpuset.cpu_exclusive works.

So the new rules will be:

1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.
2) The parent cgroup is a partition root (can be an invalid one).
3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.
4) No child cgroup with cpuset enabled.

I think they are reasonable. What do you think?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-08-27 22:50             ` Waiman Long
@ 2021-08-27 23:35               ` Tejun Heo
  2021-08-28  1:14                 ` Waiman Long
       [not found]                 ` <3533e4f9-169c-d13c-9c4e-d9ec6bdc78f0@redhat.com>
  2021-08-30 17:59               ` Michal Koutný
  1 sibling, 2 replies; 19+ messages in thread
From: Tejun Heo @ 2021-08-27 23:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

Hello,

On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long wrote:
> The cpu exclusivity rule is due to the setting of CPU_EXCLUSIVE bit. This is
> a pre-existing condition unless you want to change how the
> cpuset.cpu_exclusive works.
>
> So the new rules will be:
> 
> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.

Empty cpu list can be considered an exclusive one.

> 2) The parent cgroup is a partition root (can be an invalid one).

Does this mean a partition parent can't stop being a partition if one or
more of its children become partitions? If so, it violates the rule that a
descendant shouldn't be able to restrict what its ancestors can do.

> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.

Why not just go by effective? This would mean that a parent can't withdraw
CPUs from its allowed set once descendants are configured. Restrictions like
this are fine when the entire hierarchy is configured by a single entity but
become awkward when configurations are multi-tiered, automated and dynamic.

> 4) No child cgroup with cpuset enabled.

idk, maybe? I'm having a hard time seeing the point in adding these
restrictions when the state transitions are asynchronous anyway. Would it
help if we try to separate what's absoluately and technically necessary and
what seems reasonable or high bar and try to justify why each of the latter
should be added?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-08-27 23:35               ` Tejun Heo
@ 2021-08-28  1:14                 ` Waiman Long
       [not found]                 ` <3533e4f9-169c-d13c-9c4e-d9ec6bdc78f0@redhat.com>
  1 sibling, 0 replies; 19+ messages in thread
From: Waiman Long @ 2021-08-28  1:14 UTC (permalink / raw)
  To: Tejun Heo, Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On 8/27/21 7:35 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long wrote:
>> The cpu exclusivity rule is due to the setting of CPU_EXCLUSIVE bit. This is
>> a pre-existing condition unless you want to change how the
>> cpuset.cpu_exclusive works.
>>
>> So the new rules will be:
>>
>> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.
> Empty cpu list can be considered an exclusive one.
It doesn't make sense to me to have a partition with no cpu configured 
at all. I very much prefer the users to set cpuset.cpus first before 
turning it into a partition.
>
>> 2) The parent cgroup is a partition root (can be an invalid one).
> Does this mean a partition parent can't stop being a partition if one or
> more of its children become partitions? If so, it violates the rule that a
> descendant shouldn't be able to restrict what its ancestors can do.

No. As I said in the documentation, transitioning from partition root to 
member is allowed. Against, it is illogical to allow a cpuset to become 
a potential partition if it parent is not even a partition root at all. 
In the case that the parent is reverted back to a member, the child 
partitions will stay invalid forever unless the parent become a valid 
partition again.

>
>> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.
> Why not just go by effective? This would mean that a parent can't withdraw
> CPUs from its allowed set once descendants are configured. Restrictions like
> this are fine when the entire hierarchy is configured by a single entity but
> become awkward when configurations are multi-tiered, automated and dynamic.

The original rule is to be based on effective cpus. However, to properly 
handle the case of allowing offlined cpus to be included in the 
partition, I have to change it to cpu_allowed instead. I can certainly 
change it back to effective if you prefer.

>
>> 4) No child cgroup with cpuset enabled.
> idk, maybe? I'm having a hard time seeing the point in adding these
> restrictions when the state transitions are asynchronous anyway. Would it
> help if we try to separate what's absoluately and technically necessary and
> what seems reasonable or high bar and try to justify why each of the latter
> should be added?

This rule is there mainly for ease of implementation. Otherwise, I need 
to add additional code to handle the conversion of child cpusets which 
can be rather complex and require a lot more debugging. This rule will 
no longer apply once the cpuset becomes a partition root.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-08-27 22:50             ` Waiman Long
  2021-08-27 23:35               ` Tejun Heo
@ 2021-08-30 17:59               ` Michal Koutný
  1 sibling, 0 replies; 19+ messages in thread
From: Michal Koutný @ 2021-08-30 17:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

Hello.

On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long <llong@redhat.com> wrote:
> So the new rules will be:

When I followed the thread, it seemed to me you're talking past each
other a bit. I'd suggest the following terminology:

- config space: what's written by the user and saved,

- reality space: what's currently available (primarily subject to
  on-/offlinng but I think it'd be helpful to consider here also what's
  given by the parent),

- effect space: what's actually possible and happening.

Not all elements of config_space x reality_space (Cartesian product) can
be represented in the effect_space (e.g. root partition with no
(effective) cpus).

IIUC, Waiman's "high bar" is supposed to be defined over transitions in
the config_space. However, there can be independent changes in the
reality_space so the rules should be actually formulated in the
effect_space:

The conditions for being a valid partition root rewritten into the effect
space:

> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.
- effective CPUs are non-empty and exclusive wrt siblings
- (E.g. setting empty cpuset.cpus might be possible but it invalidates
  the partition root, same as offlining or removal by an ancestor.)

> 2) The parent cgroup is a partition root (can be an invalid one).
- parent cgroup is a (valid) partition
- (Being valid partition means owning "stolen" cpus from the parent, if
  the parent is not valid partition itself, you can't steal what is not
  owned.)
- (And I think it's OK that: "the child partitions will stay invalid
  forever unless the parent become a valid partition again" [1].)

> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.
- I'm not sure what is the use of this condition (together with the
  rewrite of the 1st condition which covers effective cpus). I think it
  would make sense if being a valid parition root guaranteed that all
  configured cpuset.cpus will be available, however, that's not the case
  IIUC (e.g. due to offlining).

> 4) No child cgroup with cpuset enabled.
- A child cgroup with cpuset enabled is OK in the effect space
  (achievable by switching first and creating children later).
- For technical reasons this may be a condition on the transitions in
  the config_space.

Generally, most config changes should succeed and user should check (or
watch) how they landed in combination with the reality_space.

Regards,
Michal

[1] This follows the general model where ancestors can "preempt"
resources from their subtree.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
       [not found]                 ` <3533e4f9-169c-d13c-9c4e-d9ec6bdc78f0@redhat.com>
@ 2021-10-12 14:39                   ` Michal Koutný
  2021-10-13 21:45                     ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Koutný @ 2021-10-12 14:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

On Wed, Oct 06, 2021 at 02:21:03PM -0400, Waiman Long <llong@redhat.com> wrote:
> Sorry for not following up with this patchset sooner as I was busy on other
> tasks.

Thanks for continuing with this.

> 	1) The "cpuset.cpus" is not empty and the list of CPUs are
> 	   exclusive, i.e. they are not shared by any of its siblings.
> 	2) The parent cgroup is a partition root.
> 	3) The "cpuset.cpus" is a subset of the union of parent's
> 	   "cpuset.cpus.effective" and offlined CPUs in parent's
> 	   "cpuset.cpus".
> 	4) There is no child cgroups with cpuset enabled.  This avoids
> 	   cpu migrations of multiple cgroups simultaneously which can
> 	   be problematic.
> 
>         A partition, when enabled, can be in an invalid state. An example
>         is when its parent is also an invalid partition.

You say:
"it can only be enabled in a cgroup if all the following conditions are met.",
"2) The parent cgroup is a partition root."

and then the example:
"A partition, when enabled, can be in an invalid state. An example is
when its parent is also an invalid partition."

But the first two statements imply you can't have enabled the partition
in such a case.

I think there is still mixup of partition validity conditions and
transition conditions, yours would roughly divide into (not precisely,
just to share my understanding):

Validity conditions
 	1) The "cpuset.cpus" is not empty and the list of CPUs are
 	   exclusive, i.e. they are not shared by any of its siblings.
 	2) The parent cgroup is a partition root.

Transition conditions:
 	3) The "cpuset.cpus" is a subset of the union of parent's
 	   "cpuset.cpus.effective" and offlined CPUs in parent's
 	   "cpuset.cpus".
 	4) There is no child cgroups with cpuset enabled.  This avoids
 	   cpu migrations of multiple cgroups simultaneously which can
 	   be problematic.

(I've put no. 3 into transition conditions because _after_ the
transition parent's cpuset.cpus.effective are subtracted the new root's
cpuset.cpus but I'd like to have something similar as a validity
condition but I haven't come up with that yet.)

I consider the following situation:

r		// all cpus 0-7
`- part1	cpus=0-3	root >partition
   ` subpart1	cpus=0-1	root >partition
   ` subpart2	cpus=2-3	root >partition
`- other	cpus=4-7	// member by default

Both subpart1 and subpart2 are valid partition roots.
Look at actions listed below (as alternatives, not a sequence):

a) hotplug offlines cpu 3
  - would part1 still be considered a valid root? 
    - perhaps not
  - would subpart1 still be considered a valid root? 
    - it could be, but its parent is invalid so no?
  - would subpart2 still be considered a valid root? 
    - perhaps not
    
b) administrative change writes 0-2 into part1 cpus
  - would part1 still be considered a valid root? 
    - yes
  - would subpart1 still be considered a valid root? 
    - yes
  - would subpart2 still be considered a valid root? 
    - perhaps not

c) administrative change writes 3-7 into `other` cpus
  - should this fail or invalidate a root partition part1?
    - perhaps fail since the same "owner" manages all siblings and
      should reduce part1 first

The answers above are just my "natural" responses, the ideal may be
different. The issue I want to illustrate is that if all the conditions
are formed as transition conditions only, they can't be used to reason
about hotplug or config changes (except for cpuset.cpus.partitions
writes).

What would help me with the understanding -- the invalid root partition is defined as
1) such a cgroup where no cpus are granted from the top (and thus has to fall back to ancestors)
or
2) such a cgroup where cpus requested in cpuset.cpus can't be fulfilled (i.e. any missing invalidates)?

Furthermore, another example (motivated by the patch 4/6)

r		// all cpus 0-7
`- part1	cpus=0-4	root >partition
   ` subpart1	cpus=0-1	root >partition
   ` subpart2	cpus=2-3	root >partition
   ` task
`- other	cpus=5-7	// member by default

It's a valid and achievable state (even on v2 since cpuset is a threaded
controller). 

a) cpu 4 is offlined
  - this should invalidate part1 (and propagate invalidation into
    subpart1 and subpart2).
b) administrative write 0-3 into part1 cpus
  - should this invalidate part1 or be rejected?


In conclusion, it'd be good to have validity conditions separate from
transition conditions (since hotplug transition can't be rejected) and
perhaps treat administrative changes from an ancestor equally as a
hotplug.

Thanks,
Michal

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-10-12 14:39                   ` Michal Koutný
@ 2021-10-13 21:45                     ` Waiman Long
  2021-10-13 22:11                       ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2021-10-13 21:45 UTC (permalink / raw)
  To: Michal Koutný, Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti


On 10/12/21 10:39 AM, Michal Koutný wrote:
> On Wed, Oct 06, 2021 at 02:21:03PM -0400, Waiman Long <llong@redhat.com> wrote:
>> Sorry for not following up with this patchset sooner as I was busy on other
>> tasks.
> Thanks for continuing with this.
>
>> 	1) The "cpuset.cpus" is not empty and the list of CPUs are
>> 	   exclusive, i.e. they are not shared by any of its siblings.
>> 	2) The parent cgroup is a partition root.
>> 	3) The "cpuset.cpus" is a subset of the union of parent's
>> 	   "cpuset.cpus.effective" and offlined CPUs in parent's
>> 	   "cpuset.cpus".
>> 	4) There is no child cgroups with cpuset enabled.  This avoids
>> 	   cpu migrations of multiple cgroups simultaneously which can
>> 	   be problematic.
>>
>>          A partition, when enabled, can be in an invalid state. An example

Thanks for the comments.


>>          is when its parent is also an invalid partition.
> You say:
> "it can only be enabled in a cgroup if all the following conditions are met.",
> "2) The parent cgroup is a partition root."
>
> and then the example:
> "A partition, when enabled, can be in an invalid state. An example is
> when its parent is also an invalid partition."
>
> But the first two statements imply you can't have enabled the partition
> in such a case.

Yes, you are right. We should not allow enabling partition when the 
parent is an invalid right. I will fix that.


> I think there is still mixup of partition validity conditions and
> transition conditions, yours would roughly divide into (not precisely,
> just to share my understanding):
>
> Validity conditions
>   	1) The "cpuset.cpus" is not empty and the list of CPUs are
>   	   exclusive, i.e. they are not shared by any of its siblings.
>   	2) The parent cgroup is a partition root.
>
> Transition conditions:
>   	3) The "cpuset.cpus" is a subset of the union of parent's
>   	   "cpuset.cpus.effective" and offlined CPUs in parent's
>   	   "cpuset.cpus".

I am going to change this condition to just "cpuset.cpus" is a subset of 
parent's "cpuset.cpus". After some deliberation, I  had concluded it 
doesn't make sense from the system partition planning point of view to 
allow a valid partition to contain cpus that are not in the designated 
"cpuset.cpus". That will automatically included offlined cpus in 
parent's "cpuset.cpus".


>   	4) There is no child cgroups with cpuset enabled.  This avoids
>   	   cpu migrations of multiple cgroups simultaneously which can
>   	   be problematic.
>
> (I've put no. 3 into transition conditions because _after_ the
> transition parent's cpuset.cpus.effective are subtracted the new root's
> cpuset.cpus but I'd like to have something similar as a validity
> condition but I haven't come up with that yet.)
>
> I consider the following situation:
>
> r		// all cpus 0-7
> `- part1	cpus=0-3	root >partition
>     ` subpart1	cpus=0-1	root >partition
>     ` subpart2	cpus=2-3	root >partition
> `- other	cpus=4-7	// member by default
>
> Both subpart1 and subpart2 are valid partition roots.
> Look at actions listed below (as alternatives, not a sequence):
>
> a) hotplug offlines cpu 3
>    - would part1 still be considered a valid root?
>      - perhaps not
>    - would subpart1 still be considered a valid root?
>      - it could be, but its parent is invalid so no?
>    - would subpart2 still be considered a valid root?
>      - perhaps not
>      

They will all be valid roots. They will become invalid only when their 
effective cpus are empty and there are tasks in the partition.

> b) administrative change writes 0-2 into part1 cpus

That is actually not allowed because of the following code in 
validate_change():

static int validate_change(struct cpuset *cur, struct cpuset *trial)
{
     :
         /* Each of our child cpusets must be a subset of us */
         ret = -EBUSY;
         cpuset_for_each_child(c, css, cur)
                 if (!is_cpuset_subset(c, trial))
                         goto out;

>    - would part1 still be considered a valid root?
>      - yes
>    - would subpart1 still be considered a valid root?
>      - yes
>    - would subpart2 still be considered a valid root?
>      - perhaps not
>
> c) administrative change writes 3-7 into `other` cpus
>    - should this fail or invalidate a root partition part1?
>      - perhaps fail since the same "owner" manages all siblings and
>        should reduce part1 first
Again, this will not be allowed because of the CPU_EXCLUSIVE flag set in 
part1.
>
> The answers above are just my "natural" responses, the ideal may be
> different. The issue I want to illustrate is that if all the conditions
> are formed as transition conditions only, they can't be used to reason
> about hotplug or config changes (except for cpuset.cpus.partitions
> writes).
>
> What would help me with the understanding -- the invalid root partition is defined as
> 1) such a cgroup where no cpus are granted from the top (and thus has to fall back to ancestors)
> or
> 2) such a cgroup where cpus requested in cpuset.cpus can't be fulfilled (i.e. any missing invalidates)?
For a valid partition, "cpuset.cpus.effective" is always a subset of 
"cpuset.cpus". When "cpuset.cpus.effective" becomes empty and there are 
tasks in the partition, it becomes invalid and inherent the non-empty 
cpuset.cpus.effective of the nearest ancestor. The condition that causes 
"cpuset.cpus.effective" to become empty can be hotplug or changes to 
"cpuset.cpus".
> Furthermore, another example (motivated by the patch 4/6)
>
> r		// all cpus 0-7
> `- part1	cpus=0-4	root >partition
>     ` subpart1	cpus=0-1	root >partition
>     ` subpart2	cpus=2-3	root >partition
>     ` task
> `- other	cpus=5-7	// member by default
>
> It's a valid and achievable state (even on v2 since cpuset is a threaded
> controller).
>
> a) cpu 4 is offlined
>    - this should invalidate part1 (and propagate invalidation into
>      subpart1 and subpart2).

That is subject to design. My current thought is to keep part1 as valid 
but invalidate the child partitions (subpart1 and subpart2).


> b) administrative write 0-3 into part1 cpus
>    - should this invalidate part1 or be rejected?

The result should be the same as (a).

>
> In conclusion, it'd be good to have validity conditions separate from
> transition conditions (since hotplug transition can't be rejected) and
> perhaps treat administrative changes from an ancestor equally as a
> hotplug.

I am trying to make the result of changing "cpuset.cpus" as close to 
hotplug as possible but there are cases where the "cpuset.cpus" change 
is prohibited but hotplug can still happen to remove the cpu.

Hope this will help to clarify the current design.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-10-13 21:45                     ` Waiman Long
@ 2021-10-13 22:11                       ` Waiman Long
  0 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2021-10-13 22:11 UTC (permalink / raw)
  To: Michal Koutný, Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

[-- Attachment #1: Type: text/plain, Size: 643 bytes --]

On 10/13/21 5:45 PM, Waiman Long wrote:
>
>
>>
>> In conclusion, it'd be good to have validity conditions separate from
>> transition conditions (since hotplug transition can't be rejected) and
>> perhaps treat administrative changes from an ancestor equally as a
>> hotplug.
>
> I am trying to make the result of changing "cpuset.cpus" as close to 
> hotplug as possible but there are cases where the "cpuset.cpus" change 
> is prohibited but hotplug can still happen to remove the cpu.
>
> Hope this will help to clarify the current design.
>
BTW, the attached file is the current draft of cpuset.cpus.partition 
document.

Cheers,
Longman


[-- Attachment #2: cpuset.cpus.partition.txt --]
[-- Type: text/plain, Size: 4889 bytes --]

  cpuset.cpus.partition
	A read-write single value file which exists on non-root
	cpuset-enabled cgroups.  This flag is owned by the parent cgroup
	and is not delegatable.

	It accepts only the following input values when written to.

	  ========	================================
	  "member"	Non-root member of a partition
	  "root"	Partition root
	  "isolated"	Partition root without load balancing
	  ========	================================

	When set to be a partition root, the current cgroup is the
	root of a new partition or scheduling domain that comprises
	itself and all its descendants except those that are separate
	partition roots themselves and their descendants.  The root
	cgroup is always a partition root.

	When set to "isolated", the CPUs in that partition root will
	be in an isolated state without any load balancing from the
	scheduler.  Tasks in such a partition must be explicitly bound
	to each individual CPU.

	"cpuset.cpus" must always be set up first before enabling
	partition.  Unlike "member" whose "cpuset.cpus.effective" can
	contain CPUs not in "cpuset.cpus", this can never happen with a
	valid partition root.  In other words, "cpuset.cpus.effective"
	is always a subset of "cpuset.cpus" for a valid partition root.

	When a parent partition root cannot exclusively grant any of
	the CPUs specified in "cpuset.cpus", "cpuset.cpus.effective"
	becomes empty. If there are tasks in the partition root, the
	partition root becomes invalid and "cpuset.cpus.effective"
	is reset to that of the nearest non-empty ancestor.

        Note that a task cannot be moved to a cgroup with empty
        "cpuset.cpus.effective".

	There are additional constraints on where a partition root can
	be enabled ("root" or "isolated").  It can only be enabled in
	a cgroup if all the following conditions are met.

	1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are
	   not shared by any of its siblings.
	2) The parent cgroup is a valid partition root.
	3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus".
	4) There is no child cgroups with cpuset enabled.  This avoids
	   cpu migrations of multiple cgroups simultaneously which can
	   be problematic.

	On read, the "cpuset.cpus.partition" file can show the following
	values.

	  ======================	==============================
	  "member"			Non-root member of a partition
	  "root"			Partition root
	  "isolated"			Partition root without load balancing
	  "root invalid (<reason>)"	Invalid partition root
	  ======================	==============================

        In the case of an invalid partition root, a descriptive string on
        why the partition is invalid is included within parentheses.

	Once becoming a partition root, changes to "cpuset.cpus" is
	generally allowed as long as the cpu list is exclusive and is
	a superset of children's cpu lists.

        The constraints of a valid partition root are as follows:

        1) "cpuset.cpus" is non-empty and exclusive.
        2) The parent cgroup is a valid partition root.
        3) "cpuset.cpus.effective" is a subset of "cpuset.cpus"
        4) "cpuset.cpus.effective" is non-empty when there are tasks
           in the partition.

	Changes to "cpuset.cpus" or cpu hotplug may cause the state
	of a valid partition root to become invalid when one or more
	constraints of a valid partition root are violated.  Therefore,
	user space agents that manage partition roots should avoid
	unnecessary changes to "cpuset.cpus" and always check the state
	of "cpuset.cpus.partition" after making changes to make sure
	that the partitions are functioning properly as expected.

        Changing a partition root to "member" is always allowed.
        If there are child partition roots underneath it, however,
        they will be forced to be switched back to "member" too and
        lose their partitions. So care must be taken to double check
        for this condition before disabling a partition root.

	Setting a cgroup to a valid partition root will take the CPUs
	away from the effective CPUs of the parent partition.

	A valid parent partition may distribute out all its CPUs to
	its child partitions as long as it is not the root cgroup as
	we need some house-keeping CPUs in the root cgroup.

	An invalid partition is not a real partition even though some
	internal states may still be kept.

	An invalid partition root can be reverted back to a real
	partition root if none of the constraints of a valid partition
        root are violated.

	Poll and inotify events are triggered whenever the state of
	"cpuset.cpus.partition" changes.  That includes changes caused by
	write to "cpuset.cpus.partition", cpu hotplug and other changes
	that make the partition invalid.  This will allow user space
	agents to monitor unexpected changes to "cpuset.cpus.partition"
	without the need to do continuous polling.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-10-13 22:11 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-25 21:37 [PATCH v7 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
2021-08-25 21:37 ` [PATCH v7 1/6] cgroup/cpuset: Properly transition to invalid partition Waiman Long
2021-08-25 21:37 ` [PATCH v7 2/6] cgroup/cpuset: Show invalid partition reason string Waiman Long
2021-08-25 21:37 ` [PATCH v7 3/6] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
2021-08-25 21:37 ` [PATCH v7 4/6] cgroup/cpuset: Allow non-top parent partition to distribute out all CPUs Waiman Long
2021-08-25 21:37 ` [PATCH v7 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
2021-08-26 17:35   ` Tejun Heo
2021-08-27  3:01     ` Waiman Long
2021-08-27  4:00       ` Tejun Heo
2021-08-27 21:19         ` Waiman Long
2021-08-27 21:27           ` Tejun Heo
2021-08-27 22:50             ` Waiman Long
2021-08-27 23:35               ` Tejun Heo
2021-08-28  1:14                 ` Waiman Long
     [not found]                 ` <3533e4f9-169c-d13c-9c4e-d9ec6bdc78f0@redhat.com>
2021-10-12 14:39                   ` Michal Koutný
2021-10-13 21:45                     ` Waiman Long
2021-10-13 22:11                       ` Waiman Long
2021-08-30 17:59               ` Michal Koutný
2021-08-25 21:37 ` [PATCH v7 6/6] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).