linux-kselftest.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
@ 2021-10-18 14:36 Waiman Long
  2021-10-18 14:36 ` [PATCH v8 1/6] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective Waiman Long
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: Waiman Long @ 2021-10-18 14:36 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

v8:
 - Reorganize the patch series and rationalize the features and
   constraints of a partition.
 - Update patch descriptions and documentation accordingly.

v7:
 - Simplify the documentation patch (patch 5) as suggested by Tejun.
 - Fix a typo in patch 2 and improper commit log in patch 3.

v6:
 - Remove duplicated tmpmask from update_prstate() which should fix the
   frame size too large problem reported by kernel test robot.

This patchset makes four enhancements to the cpuset v2 code.

 Patch 1: Enable partition with no task to have empty cpuset.cpus.effective.

 Patch 2: Refining the features and constraints of a cpuset partition
 clarifying what changes are allowed.

 Patch 3: Add a new partition state "isolated" to create a partition
 root without load balancing. This is for handling intermitten workloads
 that have a strict low latency requirement.

 Patch 4: Enable the "cpuset.cpus.partition" file to show the reason
 that causes invalid partition like "root invalid (No cpu available
 due to hotplug)".

Patch 5 updates the cgroup-v2.rst file accordingly. Patch 6 adds a new
cpuset test to test the new cpuset partition code.

Waiman Long (6):
  cgroup/cpuset: Allow no-task partition to have empty
    cpuset.cpus.effective
  cgroup/cpuset: Refining features and constraints of a partition
  cgroup/cpuset: Add a new isolated cpus.partition type
  cgroup/cpuset: Show invalid partition reason string
  cgroup/cpuset: Update description of cpuset.cpus.partition in
    cgroup-v2.rst
  kselftest/cgroup: Add cpuset v2 partition root state test

 Documentation/admin-guide/cgroup-v2.rst       | 153 ++--
 kernel/cgroup/cpuset.c                        | 393 +++++++----
 tools/testing/selftests/cgroup/Makefile       |   5 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 664 ++++++++++++++++++
 tools/testing/selftests/cgroup/wait_inotify.c |  87 +++
 5 files changed, 1115 insertions(+), 187 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
 create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

-- 
2.27.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v8 1/6] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
  2021-10-18 14:36 [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
@ 2021-10-18 14:36 ` Waiman Long
  2021-10-18 14:36 ` [PATCH v8 2/6] cgroup/cpuset: Refining features and constraints of a partition Waiman Long
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-10-18 14:36 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Currently, a partition root cannot have empty "cpuset.cpus.effective".
As a result, a parent partition root cannot distribute out all its CPUs
 to child partitions with no CPUs left. However in most cases, there
shouldn't be any tasks associated with intermediate nodes of the default
 hierarchy. So the current rule is too restrictive and can waste valuable
 CPU resource.

To address this issue, we are now allowing a partition to have empty
"cpuset.cpus.effective" as long as it has no task. Therefore, a parent
partition with no task can now have all its CPUs distributed out to its
child partitions. The top cpuset always have some house-keeping tasks
running and so its list of effective cpu can't never be empty.

Once a partition with empty "cpuset.cpus.effective" is formed, no
new task can be moved into it until "cpuset.cpus.effective" becomes
non-empty.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 113 +++++++++++++++++++++++++++++++----------
 1 file changed, 85 insertions(+), 28 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2a9695ccb65f..e3a6609bc919 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -386,6 +386,41 @@ static inline bool is_in_v2_mode(void)
 	      (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
 }
 
+/**
+ * partition_is_populated - check if partition has tasks
+ * @cs: partition root to be checked
+ * @excluded_child: a child cpuset to be excluded in task checking
+ * Return: true if there are tasks, false otherwise
+ *
+ * It is assumed that @cs is a valid partition root. @excluded_child should
+ * be non-NULL when this cpuset is going to become a partition itself.
+ */
+static inline bool partition_is_populated(struct cpuset *cs,
+					  struct cpuset *excluded_child)
+{
+	struct cgroup_subsys_state *css;
+	struct cpuset *child;
+
+	if (cs->css.cgroup->nr_populated_csets)
+		return true;
+	if (!excluded_child && !cs->nr_subparts_cpus)
+		return cgroup_is_populated(cs->css.cgroup);
+
+	rcu_read_lock();
+	cpuset_for_each_child(child, css, cs) {
+		if (child == excluded_child)
+			continue;
+		if (is_partition_root(child))
+			continue;
+		if (cgroup_is_populated(child->css.cgroup)) {
+			rcu_read_unlock();
+			return true;
+		}
+	}
+	rcu_read_unlock();
+	return false;
+}
+
 /*
  * Return in pmask the portion of a task's cpusets's cpus_allowed that
  * are online and are capable of running the task.  If none are found,
@@ -1198,22 +1233,25 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
-	/*
-	 * Enabling partition root is not allowed if not all the CPUs
-	 * can be granted from parent's effective_cpus or at least one
-	 * CPU will be left after that.
-	 */
-	if ((cmd == partcmd_enable) &&
-	   (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
-	     cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
-		return -EINVAL;
-
-	/*
-	 * A cpumask update cannot make parent's effective_cpus become empty.
-	 */
 	adding = deleting = false;
 	old_prs = new_prs = cpuset->partition_root_state;
 	if (cmd == partcmd_enable) {
+		/*
+		 * Enabling partition root is not allowed if not all the CPUs
+		 * can be granted from parent's effective_cpus.
+		 */
+		if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus))
+			return -EINVAL;
+
+		/*
+		 * A parent can be left with no CPU as long as there is no
+		 * task directly associated with the parent partition. For
+		 * such a parent, no new task can be moved into it.
+		 */
+		if (partition_is_populated(parent, cpuset) &&
+		    cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus))
+			return -EINVAL;
+
 		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
 		adding = true;
 	} else if (cmd == partcmd_disable) {
@@ -1235,9 +1273,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
 					parent->subparts_cpus);
 		/*
-		 * Return error if the new effective_cpus could become empty.
+		 * Return error if the new effective_cpus could become empty
+		 * and there are tasks in the parent.
 		 */
-		if (adding &&
+		if (adding && partition_is_populated(parent, cpuset) &&
 		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
 			if (!deleting)
 				return -EINVAL;
@@ -1263,8 +1302,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		 */
 		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
 				     parent->effective_cpus);
-		part_error = cpumask_equal(tmp->addmask,
-					   parent->effective_cpus);
+		part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) &&
+			     partition_is_populated(parent, cpuset);
 	}
 
 	if (cmd == partcmd_update) {
@@ -1366,9 +1405,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
-		 * parent, which is guaranteed to have some CPUs.
+		 * parent, which is guaranteed to have some CPUs unless
+		 * it is a partition root that has explicitly distributed
+		 * out all its CPUs.
 		 */
 		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
+			if (is_partition_root(cp) &&
+			    cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
+				goto update_parent_subparts;
+
 			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
 			if (!cp->use_parent_ecpus) {
 				cp->use_parent_ecpus = true;
@@ -1390,6 +1435,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			continue;
 		}
 
+update_parent_subparts:
 		/*
 		 * update_parent_subparts_cpumask() should have been called
 		 * for cs already in update_cpumask(). We should also call
@@ -2189,6 +2235,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	    (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
 		goto out_unlock;
 
+	/*
+	 * On default hierarchy, task cannot be moved to a cpuset with empty
+	 * effective cpus.
+	 */
+	if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
+		goto out_unlock;
+
 	cgroup_taskset_for_each(task, css, tset) {
 		ret = task_can_attach(task, cs->cpus_allowed);
 		if (ret)
@@ -3053,7 +3106,8 @@ hotplug_update_tasks(struct cpuset *cs,
 		     struct cpumask *new_cpus, nodemask_t *new_mems,
 		     bool cpus_updated, bool mems_updated)
 {
-	if (cpumask_empty(new_cpus))
+	/* A partition root is allowed to have empty effective cpus */
+	if (cpumask_empty(new_cpus) && !is_partition_root(cs))
 		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
@@ -3122,11 +3176,12 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus or its parent becomes erroneous, we have to
-	 * transition it to the erroneous state.
+	 * effective_cpus with tasks or its parent becomes erroneous, we
+	 * have to transition it to the erroneous state.
 	 */
-	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
-	   (parent->partition_root_state == PRS_ERROR))) {
+	if (is_partition_root(cs) &&
+	   ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
+	    (parent->partition_root_state == PRS_ERROR))) {
 		if (cs->nr_subparts_cpus) {
 			spin_lock_irq(&callback_lock);
 			cs->nr_subparts_cpus = 0;
@@ -3136,13 +3191,15 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		}
 
 		/*
-		 * If the effective_cpus is empty because the child
-		 * partitions take away all the CPUs, we can keep
-		 * the current partition and let the child partitions
-		 * fight for available CPUs.
+		 * Force the partition to become invalid if either one of
+		 * the following conditions hold:
+		 * 1) empty effective cpus but not valid empty partition.
+		 * 2) parent is invalid or doesn't grant any cpus to child
+		 *    partitions.
 		 */
 		if ((parent->partition_root_state == PRS_ERROR) ||
-		     cpumask_empty(&new_cpus)) {
+		    (cpumask_empty(&new_cpus) &&
+		     partition_is_populated(cs, NULL))) {
 			int old_prs;
 
 			update_parent_subparts_cpumask(cs, partcmd_disable,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v8 2/6] cgroup/cpuset: Refining features and constraints of a partition
  2021-10-18 14:36 [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
  2021-10-18 14:36 ` [PATCH v8 1/6] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective Waiman Long
@ 2021-10-18 14:36 ` Waiman Long
  2021-10-18 14:36 ` [PATCH v8 3/6] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-10-18 14:36 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

In order to plan for proper partitioning of a system, a valid partition
must be using only CPUs designated in "cpuset.cpus" of the partition
root. It can be a subset, but it can't use CPUs outside of its designed
list. If none of the CPUs in the desginated list can be granted to the
partition, its "cpuset.cpus.effective" becomes empty. This is allowed
as long as there is no task in the partition.

To ease implementation, there are additional contraints in enabling a
partition root.

 1) The "cpuset.cpus" is non-empty and exclusive.
 2) The parent cgroup is a valid partition root.
 3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus".
 4) There is no child cgroups with cpuset enabled.

This allows offlined cpus in parent's "cpuset.cpus" to be granted to
a child partition which can have empty "cpuset.cpus.effective" when it
has no task.

The cpuset's subparts_cpus keeps track of CPUs (including offline CPUs)
that are allocated to child partitions. It does not change during
hotplug operations.

Once a partition root has been enabled, changes to "cpuset.cpus" is
generally allowed as long as the cpu list is exclusive, non-empty and
is a superset of children's cpu lists. These existing rules are enforced
by validate_change().

A partition will become invalid when one or more of the following
constraints are violated:

 1) The parent cgroup is a valid partition root.
 2) "cpuset.cpus.effective" is a subset of "cpuset.cpus"
 3) "cpuset.cpus.effective" is non-empty when there are tasks
    in the partition.

Disabling a partition root is always allowed even if there are child
partitions underneath it. In this case, all the child partitions are
also disabled (switch to "member"). So care must be taken to double check
if there are child partitions underneath it before disabling a partition.

This patch makes the necessary change to support the above features
and constraints.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 228 +++++++++++++++++++++--------------------
 1 file changed, 117 insertions(+), 111 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e3a6609bc919..e2c01345353c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1227,29 +1227,27 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		return -EINVAL;
 
 	/*
-	 * Enabling/disabling partition root is not allowed if there are
-	 * online children.
+	 * Enabling partition root is not allowed if there are online children.
 	 */
-	if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
+	if ((cmd == partcmd_enable) && css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
 	adding = deleting = false;
 	old_prs = new_prs = cpuset->partition_root_state;
 	if (cmd == partcmd_enable) {
 		/*
-		 * Enabling partition root is not allowed if not all the CPUs
-		 * can be granted from parent's effective_cpus.
+		 * Enabling partition root is not allowed if cpus_allowed isn't
+		 * a subset of parent's cpus_allowed.
 		 */
-		if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus))
+		if (!cpumask_subset(cpuset->cpus_allowed, parent->cpus_allowed))
 			return -EINVAL;
 
 		/*
 		 * A parent can be left with no CPU as long as there is no
-		 * task directly associated with the parent partition. For
-		 * such a parent, no new task can be moved into it.
+		 * task directly associated with the parent partition.
 		 */
 		if (partition_is_populated(parent, cpuset) &&
-		    cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus))
+		    cpumask_subset(parent->effective_cpus, cpuset->cpus_allowed))
 			return -EINVAL;
 
 		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
@@ -1261,54 +1259,52 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		/*
 		 * partcmd_update with newmask:
 		 *
+		 * Compute add/delete mask to/from subparts_cpus
+		 *
 		 * delmask = cpus_allowed & ~newmask & parent->subparts_cpus
-		 * addmask = newmask & parent->effective_cpus
+		 * addmask = newmask & parent->cpus_allowed
 		 *		     & ~parent->subparts_cpus
 		 */
 		cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask);
 		deleting = cpumask_and(tmp->delmask, tmp->delmask,
 				       parent->subparts_cpus);
 
-		cpumask_and(tmp->addmask, newmask, parent->effective_cpus);
+		cpumask_and(tmp->addmask, newmask, parent->cpus_allowed);
 		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
 					parent->subparts_cpus);
 		/*
-		 * Return error if the new effective_cpus could become empty
-		 * and there are tasks in the parent.
+		 * Make partition invalid if parent's effective_cpus could
+                 * become empty and there are tasks in the parent.
 		 */
-		if (adding && partition_is_populated(parent, cpuset) &&
-		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
-			if (!deleting)
-				return -EINVAL;
-			/*
-			 * As some of the CPUs in subparts_cpus might have
-			 * been offlined, we need to compute the real delmask
-			 * to confirm that.
-			 */
-			if (!cpumask_and(tmp->addmask, tmp->delmask,
-					 cpu_active_mask))
-				return -EINVAL;
-			cpumask_copy(tmp->addmask, parent->effective_cpus);
-		}
+		part_error = partition_is_populated(parent, cpuset) &&
+			cpumask_subset(parent->effective_cpus, tmp->addmask) &&
+			!cpumask_intersects(tmp->delmask, cpu_active_mask);
 	} else {
 		/*
 		 * partcmd_update w/o newmask:
 		 *
 		 * addmask = cpus_allowed & parent->effective_cpus
 		 *
-		 * Note that parent's subparts_cpus may have been
-		 * pre-shrunk in case there is a change in the cpu list.
-		 * So no deletion is needed.
+		 * This gets invoked either due to a hotplug event or
+                 * from update_cpumasks_hier() where we can't return an
+                 * error. This can cause a partition root to become invalid
+                 * in the case of a hotplug.
+		 *
+		 * A partition error happens when:
+		 * 1) Cpuset is valid partition, but parent does not distribute
+		 *    out any CPUs.
+		 * 2) Parent has tasks and all its effective CPUs will have
+		 *    to be distributed out.
 		 */
 		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
 				     parent->effective_cpus);
-		part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) &&
-			     partition_is_populated(parent, cpuset);
+		part_error = (is_partition_root(cpuset) &&
+			      !parent->nr_subparts_cpus) ||
+			     (cpumask_equal(parent->effective_cpus, tmp->addmask) &&
+			      partition_is_populated(parent, cpuset));
 	}
 
 	if (cmd == partcmd_update) {
-		int prev_prs = cpuset->partition_root_state;
-
 		/*
 		 * Check for possible transition between PRS_ENABLED
 		 * and PRS_ERROR.
@@ -1323,13 +1319,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 				new_prs = PRS_ENABLED;
 			break;
 		}
-		/*
-		 * Set part_error if previously in invalid state.
-		 */
-		part_error = (prev_prs == PRS_ERROR);
 	}
 
-	if (!part_error && (new_prs == PRS_ERROR))
+	if ((old_prs == PRS_ERROR) && (new_prs == PRS_ERROR))
 		return 0;	/* Nothing need to be done */
 
 	if (new_prs == PRS_ERROR) {
@@ -1382,6 +1374,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
  * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree
  * @cs:  the cpuset to consider
  * @tmp: temp variables for calculating effective_cpus & partition setup
+ * @force: don't skip any descendant cpusets if set
  *
  * When configured cpumask is changed, the effective cpumasks of this cpuset
  * and all its descendants need to be updated.
@@ -1390,7 +1383,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
  *
  * Called with cpuset_rwsem held
  */
-static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
+static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
+				 bool force)
 {
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
@@ -1400,6 +1394,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 	rcu_read_lock();
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
+		bool update_parent = false;
 
 		compute_effective_cpumask(tmp->new_cpus, cp, parent);
 
@@ -1427,9 +1422,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		/*
 		 * Skip the whole subtree if the cpumask remains the same
-		 * and has no partition root state.
+		 * with no partition root state and force flag not set.
 		 */
-		if (!cp->partition_root_state &&
+		if (!cp->partition_root_state && !force &&
 		    cpumask_equal(tmp->new_cpus, cp->effective_cpus)) {
 			pos_css = css_rightmost_descendant(pos_css);
 			continue;
@@ -1466,8 +1461,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 				break;
 
 			case PRS_ENABLED:
-				if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
-					update_tasks_cpumask(parent);
+				update_parent = true;
 				break;
 
 			case PRS_ERROR:
@@ -1483,40 +1477,41 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			continue;
 		rcu_read_unlock();
 
+		if (update_parent) {
+			if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
+				update_tasks_cpumask(parent);
+			/*
+			 * The cpuset partition_root_state may be changed
+			 * to PRS_ERROR. Capture it.
+			 */
+			new_prs = cp->partition_root_state;
+		}
+
 		spin_lock_irq(&callback_lock);
 
-		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-		if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
+		if (cp->nr_subparts_cpus && (new_prs <= 0)) {
+			/*
+			 * Put all active subparts_cpus back to effective_cpus.
+			 */
+			cpumask_or(tmp->new_cpus, tmp->new_cpus,
+				   cp->subparts_cpus);
+			cpumask_and(tmp->new_cpus, tmp->new_cpus,
+				    cpu_active_mask);
 			cp->nr_subparts_cpus = 0;
 			cpumask_clear(cp->subparts_cpus);
-		} else if (cp->nr_subparts_cpus) {
+		}
+
+		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+		if (cp->nr_subparts_cpus) {
 			/*
 			 * Make sure that effective_cpus & subparts_cpus
 			 * are mutually exclusive.
-			 *
-			 * In the unlikely event that effective_cpus
-			 * becomes empty. we clear cp->nr_subparts_cpus and
-			 * let its child partition roots to compete for
-			 * CPUs again.
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
-			if (cpumask_empty(cp->effective_cpus)) {
-				cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-				cpumask_clear(cp->subparts_cpus);
-				cp->nr_subparts_cpus = 0;
-			} else if (!cpumask_subset(cp->subparts_cpus,
-						   tmp->new_cpus)) {
-				cpumask_andnot(cp->subparts_cpus,
-					cp->subparts_cpus, tmp->new_cpus);
-				cp->nr_subparts_cpus
-					= cpumask_weight(cp->subparts_cpus);
-			}
 		}
 
-		if (new_prs != old_prs)
-			cp->partition_root_state = new_prs;
-
+		cp->partition_root_state = new_prs;
 		spin_unlock_irq(&callback_lock);
 		notify_partition_change(cp, old_prs, new_prs);
 
@@ -1570,7 +1565,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
 		if (!sibling->use_parent_ecpus)
 			continue;
 
-		update_cpumasks_hier(sibling, tmp);
+		update_cpumasks_hier(sibling, tmp, false);
 	}
 	rcu_read_unlock();
 }
@@ -1643,13 +1638,13 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	 * Make sure that subparts_cpus is a subset of cpus_allowed.
 	 */
 	if (cs->nr_subparts_cpus) {
-		cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus,
-			       cs->cpus_allowed);
+		cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
+			    cs->cpus_allowed);
 		cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
 	}
 	spin_unlock_irq(&callback_lock);
 
-	update_cpumasks_hier(cs, &tmp);
+	update_cpumasks_hier(cs, &tmp, false);
 
 	if (cs->partition_root_state) {
 		struct cpuset *parent = parent_cs(cs);
@@ -2071,20 +2066,23 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		}
 	} else {
 		/*
-		 * Turning off partition root will clear the
-		 * CS_CPU_EXCLUSIVE bit.
+		 * Switch back to member is always allowed even if it
+                 * disables child partitions.
 		 */
-		if (old_prs == PRS_ERROR) {
-			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
-			err = 0;
-			goto out;
+		err = 0;
+		update_parent_subparts_cpumask(cs, partcmd_disable, NULL,
+					       &tmpmask);
+		/*
+		 * If there are child partitions, we have to disable them.
+		 */
+		if (unlikely(cs->nr_subparts_cpus)) {
+			spin_lock_irq(&callback_lock);
+			cs->nr_subparts_cpus = 0;
+			cpumask_clear(cs->subparts_cpus);
+			compute_effective_cpumask(cs->effective_cpus, cs, parent);
+			spin_unlock_irq(&callback_lock);
 		}
 
-		err = update_parent_subparts_cpumask(cs, partcmd_disable,
-						     NULL, &tmpmask);
-		if (err)
-			goto out;
-
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 	}
@@ -2105,6 +2103,11 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		spin_lock_irq(&callback_lock);
 		cs->partition_root_state = new_prs;
 		spin_unlock_irq(&callback_lock);
+		/*
+		 * Update child cpusets when disabling partition.
+		 */
+		if (new_prs == PRS_DISABLED && !list_empty(&cs->css.children))
+			update_cpumasks_hier(cs, &tmpmask, true);
 		notify_partition_change(cs, old_prs, new_prs);
 	}
 
@@ -3176,12 +3179,32 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus with tasks or its parent becomes erroneous, we
-	 * have to transition it to the erroneous state.
+	 * effective_cpus with tasks, we will have to invalidate child
+	 * partitions, if present, by setting nr_subparts_cpus to 0 to
+	 * reclaim their cpus.
+	 */
+	if (is_partition_root(cs) && cpumask_empty(&new_cpus) &&
+	    cs->nr_subparts_cpus && partition_is_populated(cs, NULL)) {
+		spin_lock_irq(&callback_lock);
+		cs->nr_subparts_cpus = 0;
+		cpumask_clear(cs->subparts_cpus);
+		spin_unlock_irq(&callback_lock);
+		compute_effective_cpumask(&new_cpus, cs, parent);
+	}
+
+	/*
+	 * Force the partition to become invalid if either one of
+	 * the following conditions hold:
+	 * 1) empty effective cpus with tasks in partition
+	 * 2) parent is invalid or doesn't grant any cpus to child partitions.
 	 */
 	if (is_partition_root(cs) &&
 	   ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
-	    (parent->partition_root_state == PRS_ERROR))) {
+	    !parent->nr_subparts_cpus)) {
+		int old_prs;
+
+		update_parent_subparts_cpumask(cs, partcmd_disable,
+					       NULL, tmp);
 		if (cs->nr_subparts_cpus) {
 			spin_lock_irq(&callback_lock);
 			cs->nr_subparts_cpus = 0;
@@ -3190,40 +3213,23 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 			compute_effective_cpumask(&new_cpus, cs, parent);
 		}
 
-		/*
-		 * Force the partition to become invalid if either one of
-		 * the following conditions hold:
-		 * 1) empty effective cpus but not valid empty partition.
-		 * 2) parent is invalid or doesn't grant any cpus to child
-		 *    partitions.
-		 */
-		if ((parent->partition_root_state == PRS_ERROR) ||
-		    (cpumask_empty(&new_cpus) &&
-		     partition_is_populated(cs, NULL))) {
-			int old_prs;
-
-			update_parent_subparts_cpumask(cs, partcmd_disable,
-						       NULL, tmp);
-			old_prs = cs->partition_root_state;
-			if (old_prs != PRS_ERROR) {
-				spin_lock_irq(&callback_lock);
-				cs->partition_root_state = PRS_ERROR;
-				spin_unlock_irq(&callback_lock);
-				notify_partition_change(cs, old_prs, PRS_ERROR);
-			}
+		old_prs = cs->partition_root_state;
+		if (old_prs != PRS_ERROR) {
+			spin_lock_irq(&callback_lock);
+			cs->partition_root_state = PRS_ERROR;
+			spin_unlock_irq(&callback_lock);
+			notify_partition_change(cs, old_prs, PRS_ERROR);
 		}
 		cpuset_force_rebuild();
 	}
 
 	/*
 	 * On the other hand, an erroneous partition root may be transitioned
-	 * back to a regular one or a partition root with no CPU allocated
-	 * from the parent may change to erroneous.
+	 * back to a regular one.
 	 */
-	if (is_partition_root(parent) &&
-	   ((cs->partition_root_state == PRS_ERROR) ||
-	    !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
-	     update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
+	else if (is_partition_root(parent) &&
+		(cs->partition_root_state == PRS_ERROR) &&
+		 update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
 		cpuset_force_rebuild();
 
 update_tasks:
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v8 3/6] cgroup/cpuset: Add a new isolated cpus.partition type
  2021-10-18 14:36 [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
  2021-10-18 14:36 ` [PATCH v8 1/6] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective Waiman Long
  2021-10-18 14:36 ` [PATCH v8 2/6] cgroup/cpuset: Refining features and constraints of a partition Waiman Long
@ 2021-10-18 14:36 ` Waiman Long
  2021-10-18 14:36 ` [PATCH v8 4/6] cgroup/cpuset: Show invalid partition reason string Waiman Long
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-10-18 14:36 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

cgroup/cpuset: Add a new isolated cpus.partition type

Cpuset v1 uses the sched_load_balance control file to determine if load
balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
as its use may require disabling load balancing at cgroup root.

For workloads that require very low latency like DPDK, the latency
jitters caused by periodic load balancing may exceed the desired
latency limit.

When cpuset v2 is in use, the only way to avoid this latency cost is to
use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
the kernel boot, however, there is no way to add or remove CPUs from
this isolated set. For workloads that are more dynamic in nature, that
means users have to provision enough CPUs for the worst case situation
resulting in excess idle CPUs.

To address this issue for cpuset v2, a new cpuset.cpus.partition type
"isolated" is added which allows the creation of a cpuset partition
without load balancing. This will allow system administrators to
dynamically adjust the size of isolated partition to the current need
of the workload without rebooting the system.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 46 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e2c01345353c..fab31921728c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -172,6 +172,8 @@ struct cpuset {
  *
  *   1 - partition root
  *
+ *   2 - partition root without load balancing (isolated)
+ *
  *  -1 - invalid partition root
  *       None of the cpus in cpus_allowed can be put into the parent's
  *       subparts_cpus. In this case, the cpuset is not a real partition
@@ -181,6 +183,7 @@ struct cpuset {
  */
 #define PRS_DISABLED		0
 #define PRS_ENABLED		1
+#define PRS_ISOLATED		2
 #define PRS_ERROR		-1
 
 /*
@@ -1306,17 +1309,22 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 
 	if (cmd == partcmd_update) {
 		/*
-		 * Check for possible transition between PRS_ENABLED
-		 * and PRS_ERROR.
+		 * Check for possible transition between PRS_ERROR and
+		 * PRS_ENABLED/PRS_ISOLATED.
 		 */
 		switch (cpuset->partition_root_state) {
 		case PRS_ENABLED:
+		case PRS_ISOLATED:
 			if (part_error)
 				new_prs = PRS_ERROR;
 			break;
 		case PRS_ERROR:
-			if (!part_error)
+			if (part_error)
+				break;
+			if (is_sched_load_balance(cpuset))
 				new_prs = PRS_ENABLED;
+			else
+				new_prs = PRS_ISOLATED;
 			break;
 		}
 	}
@@ -1461,6 +1469,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 				break;
 
 			case PRS_ENABLED:
+			case PRS_ISOLATED:
 				update_parent = true;
 				break;
 
@@ -2028,6 +2037,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 static int update_prstate(struct cpuset *cs, int new_prs)
 {
 	int err, old_prs = cs->partition_root_state;
+	bool sched_domain_rebuilt = false;
 	struct cpuset *parent = parent_cs(cs);
 	struct tmpmasks tmpmask;
 
@@ -2064,6 +2074,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			goto out;
 		}
+
+		if (new_prs == PRS_ISOLATED) {
+			/*
+			 * Disable the load balance flag should not return an
+			 * error unless the system is running out of memory.
+			 */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+			sched_domain_rebuilt = true;
+		}
+	} else if (old_prs && new_prs) {
+		/*
+		 * A change in load balance state only, no change in cpumasks.
+		 */
+		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
+		err = 0;
+		goto out;	/* Sched domain is rebuilt in update_flag() */
 	} else {
 		/*
 		 * Switch back to member is always allowed even if it
@@ -2085,6 +2111,12 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+
+		if (!is_sched_load_balance(cs)) {
+			/* Make sure load balance is on */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
+			sched_domain_rebuilt = true;
+		}
 	}
 
 	/*
@@ -2097,7 +2129,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	if (parent->child_ecpus_count)
 		update_sibling_cpumasks(parent, cs, &tmpmask);
 
-	rebuild_sched_domains_locked();
+	if (!sched_domain_rebuilt)
+		rebuild_sched_domains_locked();
 out:
 	if (!err) {
 		spin_lock_irq(&callback_lock);
@@ -2609,6 +2642,9 @@ static int sched_partition_show(struct seq_file *seq, void *v)
 	case PRS_ENABLED:
 		seq_puts(seq, "root\n");
 		break;
+	case PRS_ISOLATED:
+		seq_puts(seq, "isolated\n");
+		break;
 	case PRS_DISABLED:
 		seq_puts(seq, "member\n");
 		break;
@@ -2635,6 +2671,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		val = PRS_ENABLED;
 	else if (!strcmp(buf, "member"))
 		val = PRS_DISABLED;
+	else if (!strcmp(buf, "isolated"))
+		val = PRS_ISOLATED;
 	else
 		return -EINVAL;
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v8 4/6] cgroup/cpuset: Show invalid partition reason string
  2021-10-18 14:36 [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (2 preceding siblings ...)
  2021-10-18 14:36 ` [PATCH v8 3/6] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
@ 2021-10-18 14:36 ` Waiman Long
  2021-10-18 14:36 ` [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-10-18 14:36 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

There are a number of different reasons which can cause a partition to
become invalid. A user seeing an invalid partition may not know exactly
why. To help user to get a better understanding of the underlying reason,
The cpuset.cpus.partition control file, when read, will now report the
reason why a partition become invalid. When a partition does become
invalid, reading the control file will show "root invalid (<reason>)"
where <reason> is a string that describes why the partition is invalid.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 48 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index fab31921728c..9c5ecc8d100c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -78,6 +78,24 @@ struct fmeter {
 	spinlock_t lock;	/* guards read or write of above */
 };
 
+/*
+ * Invalid partition error code
+ */
+enum prs_errcode {
+	PERR_NONE = 0,
+	PERR_INVCPUS,
+	PERR_NOCPUS,
+	PERR_PARENT,
+	PERR_HOTPLUG,
+};
+
+static const char * const perr_strings[] = {
+	[PERR_INVCPUS] = "Invalid change to cpuset.cpus",
+	[PERR_PARENT]  = "Parent is no longer a valid partition root",
+	[PERR_NOCPUS]  = "Parent unable to distribute cpu downstream",
+	[PERR_HOTPLUG] = "No cpu available due to hotplug",
+};
+
 struct cpuset {
 	struct cgroup_subsys_state css;
 
@@ -163,6 +181,9 @@ struct cpuset {
 
 	/* Handle for cpuset.cpus.partition */
 	struct cgroup_file partition_file;
+
+	/* Invalid partition error code, not lock protected */
+	enum prs_errcode prs_err;
 };
 
 /*
@@ -275,8 +296,13 @@ static inline int is_partition_root(const struct cpuset *cs)
 static inline void notify_partition_change(struct cpuset *cs,
 					   int old_prs, int new_prs)
 {
-	if (old_prs != new_prs)
-		cgroup_file_notify(&cs->partition_file);
+	if (old_prs == new_prs)
+		return;
+	cgroup_file_notify(&cs->partition_file);
+
+	/* Reset prs_err if not invalid */
+	if (new_prs != PRS_ERROR)
+		WRITE_ONCE(cs->prs_err, PERR_NONE);
 }
 
 static struct cpuset top_cpuset = {
@@ -1282,6 +1308,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		part_error = partition_is_populated(parent, cpuset) &&
 			cpumask_subset(parent->effective_cpus, tmp->addmask) &&
 			!cpumask_intersects(tmp->delmask, cpu_active_mask);
+
+		if ((READ_ONCE(cpuset->prs_err) == PERR_NONE) && part_error)
+			WRITE_ONCE(cpuset->prs_err, PERR_INVCPUS);
 	} else {
 		/*
 		 * partcmd_update w/o newmask:
@@ -1305,6 +1334,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 			      !parent->nr_subparts_cpus) ||
 			     (cpumask_equal(parent->effective_cpus, tmp->addmask) &&
 			      partition_is_populated(parent, cpuset));
+
+		if (is_partition_root(cpuset) && part_error)
+			WRITE_ONCE(cpuset->prs_err, PERR_NOCPUS);
 	}
 
 	if (cmd == partcmd_update) {
@@ -1478,6 +1510,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 				 * When parent is invalid, it has to be too.
 				 */
 				new_prs = PRS_ERROR;
+				WRITE_ONCE(cp->prs_err, PERR_PARENT);
 				break;
 			}
 		}
@@ -2637,6 +2670,7 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 static int sched_partition_show(struct seq_file *seq, void *v)
 {
 	struct cpuset *cs = css_cs(seq_css(seq));
+	const char *err;
 
 	switch (cs->partition_root_state) {
 	case PRS_ENABLED:
@@ -2649,7 +2683,11 @@ static int sched_partition_show(struct seq_file *seq, void *v)
 		seq_puts(seq, "member\n");
 		break;
 	case PRS_ERROR:
-		seq_puts(seq, "root invalid\n");
+		err = perr_strings[READ_ONCE(cs->prs_err)];
+		if (err)
+			seq_printf(seq, "root invalid (%s)\n", err);
+		else
+			seq_puts(seq, "root invalid\n");
 		break;
 	}
 	return 0;
@@ -3256,6 +3294,10 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 			spin_lock_irq(&callback_lock);
 			cs->partition_root_state = PRS_ERROR;
 			spin_unlock_irq(&callback_lock);
+			if (parent->partition_root_state == PRS_ERROR)
+				WRITE_ONCE(cs->prs_err, PERR_PARENT);
+			else
+				WRITE_ONCE(cs->prs_err, PERR_HOTPLUG);
 			notify_partition_change(cs, old_prs, PRS_ERROR);
 		}
 		cpuset_force_rebuild();
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-10-18 14:36 [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (3 preceding siblings ...)
  2021-10-18 14:36 ` [PATCH v8 4/6] cgroup/cpuset: Show invalid partition reason string Waiman Long
@ 2021-10-18 14:36 ` Waiman Long
  2021-11-15 19:31   ` Michal Koutný
  2021-10-18 14:36 ` [PATCH v8 6/6] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2021-10-18 14:36 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced
"isolated" cpuset partition type as well as other changes made in other
cpuset patches.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 153 ++++++++++++++----------
 1 file changed, 93 insertions(+), 60 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 4d8c27eca96b..40d39562a8dd 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2091,8 +2091,9 @@ Cpuset Interface Files
 	It accepts only the following input values when written to.
 
 	  ========	================================
-	  "root"	a partition root
-	  "member"	a non-root member of a partition
+	  "member"	Non-root member of a partition
+	  "root"	Partition root
+	  "isolated"	Partition root without load balancing
 	  ========	================================
 
 	When set to be a partition root, the current cgroup is the
@@ -2101,64 +2102,96 @@ Cpuset Interface Files
 	partition roots themselves and their descendants.  The root
 	cgroup is always a partition root.
 
-	There are constraints on where a partition root can be set.
-	It can only be set in a cgroup if all the following conditions
-	are true.
-
-	1) The "cpuset.cpus" is not empty and the list of CPUs are
-	   exclusive, i.e. they are not shared by any of its siblings.
-	2) The parent cgroup is a partition root.
-	3) The "cpuset.cpus" is also a proper subset of the parent's
-	   "cpuset.cpus.effective".
-	4) There is no child cgroups with cpuset enabled.  This is for
-	   eliminating corner cases that have to be handled if such a
-	   condition is allowed.
-
-	Setting it to partition root will take the CPUs away from the
-	effective CPUs of the parent cgroup.  Once it is set, this
-	file cannot be reverted back to "member" if there are any child
-	cgroups with cpuset enabled.
-
-	A parent partition cannot distribute all its CPUs to its
-	child partitions.  There must be at least one cpu left in the
-	parent partition.
-
-	Once becoming a partition root, changes to "cpuset.cpus" is
-	generally allowed as long as the first condition above is true,
-	the change will not take away all the CPUs from the parent
-	partition and the new "cpuset.cpus" value is a superset of its
-	children's "cpuset.cpus" values.
-
-	Sometimes, external factors like changes to ancestors'
-	"cpuset.cpus" or cpu hotplug can cause the state of the partition
-	root to change.  On read, the "cpuset.sched.partition" file
-	can show the following values.
-
-	  ==============	==============================
-	  "member"		Non-root member of a partition
-	  "root"		Partition root
-	  "root invalid"	Invalid partition root
-	  ==============	==============================
-
-	It is a partition root if the first 2 partition root conditions
-	above are true and at least one CPU from "cpuset.cpus" is
-	granted by the parent cgroup.
-
-	A partition root can become invalid if none of CPUs requested
-	in "cpuset.cpus" can be granted by the parent cgroup or the
-	parent cgroup is no longer a partition root itself.  In this
-	case, it is not a real partition even though the restriction
-	of the first partition root condition above will still apply.
-	The cpu affinity of all the tasks in the cgroup will then be
-	associated with CPUs in the nearest ancestor partition.
-
-	An invalid partition root can be transitioned back to a
-	real partition root if at least one of the requested CPUs
-	can now be granted by its parent.  In this case, the cpu
-	affinity of all the tasks in the formerly invalid partition
-	will be associated to the CPUs of the newly formed partition.
-	Changing the partition state of an invalid partition root to
-	"member" is always allowed even if child cpusets are present.
+	When set to "isolated", the CPUs in that partition root will
+	be in an isolated state without any load balancing from the
+	scheduler.  Tasks in such a partition must be explicitly bound
+	to each individual CPU.
+
+	"cpuset.cpus" must always be set up first before enabling
+	partition.  Unlike "member" whose "cpuset.cpus.effective" can
+	contain CPUs not in "cpuset.cpus", this can never happen with a
+	valid partition root.  In other words, "cpuset.cpus.effective"
+	is always a subset of "cpuset.cpus" for a valid partition root.
+
+	When a parent partition root cannot exclusively grant any of
+	the CPUs specified in "cpuset.cpus", "cpuset.cpus.effective"
+	becomes empty. If there are tasks in the partition root, the
+	partition root becomes invalid and "cpuset.cpus.effective"
+	is reset to that of the nearest non-empty ancestor.
+
+        Note that a task cannot be moved to a cgroup with empty
+        "cpuset.cpus.effective".
+
+	There are additional constraints on where a partition root can
+	be enabled ("root" or "isolated").  It can only be enabled in
+	a cgroup if all the following conditions are met.
+
+	1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are
+	   not shared by any of its siblings.
+	2) The parent cgroup is a valid partition root.
+	3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus".
+	4) There is no child cgroups with cpuset enabled.  This avoids
+	   cpu migrations of multiple cgroups simultaneously which can
+	   be problematic.
+
+	On read, the "cpuset.cpus.partition" file can show the following
+	values.
+
+	  ======================	==============================
+	  "member"			Non-root member of a partition
+	  "root"			Partition root
+	  "isolated"			Partition root without load balancing
+	  "root invalid (<reason>)"	Invalid partition root
+	  ======================	==============================
+
+        In the case of an invalid partition root, a descriptive string on
+        why the partition is invalid is included within parentheses.
+
+	Once becoming a partition root, changes to "cpuset.cpus"
+	is generally allowed as long as the cpu list is exclusive,
+	non-empty and is a superset of children's cpu lists.
+
+        The constraints of a valid partition root are as follows:
+
+        1) The parent cgroup is a valid partition root.
+        2) "cpuset.cpus.effective" is a subset of "cpuset.cpus"
+        3) "cpuset.cpus.effective" is non-empty when there are tasks
+           in the partition.
+
+	Changes to "cpuset.cpus" or cpu hotplug may cause the state
+	of a valid partition root to become invalid when one or more
+	constraints of a valid partition root are violated.  Therefore,
+	user space agents that manage partition roots should avoid
+	unnecessary changes to "cpuset.cpus" and always check the state
+	of "cpuset.cpus.partition" after making changes to make sure
+	that the partitions are functioning properly as expected.
+
+        Changing a partition root to "member" is always allowed.
+        If there are child partition roots underneath it, however,
+        they will be forced to be switched back to "member" too and
+        lose their partitions. So care must be taken to double check
+        for this condition before disabling a partition root.
+
+	Setting a cgroup to a valid partition root will take the CPUs
+	away from the effective CPUs of the parent partition.
+
+	A valid parent partition may distribute out all its CPUs to
+	its child partitions as long as it is not the root cgroup as
+	we need some house-keeping CPUs in the root cgroup.
+
+	An invalid partition is not a real partition even though some
+	internal states may still be kept.
+
+	An invalid partition root can be reverted back to a real
+	partition root if none of the constraints of a valid partition
+        root are violated.
+
+	Poll and inotify events are triggered whenever the state of
+	"cpuset.cpus.partition" changes.  That includes changes caused by
+	write to "cpuset.cpus.partition", cpu hotplug and other changes
+	that make the partition invalid.  This will allow user space
+	agents to monitor unexpected changes to "cpuset.cpus.partition"
+	without the need to do continuous polling.
 
 
 Device controller
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v8 6/6] kselftest/cgroup: Add cpuset v2 partition root state test
  2021-10-18 14:36 [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (4 preceding siblings ...)
  2021-10-18 14:36 ` [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
@ 2021-10-18 14:36 ` Waiman Long
  2021-10-27 23:05 ` [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
  2021-11-10 11:13 ` Felix Moessbauer
  7 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-10-18 14:36 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Add a test script test_cpuset_prs.sh with a helper program wait_inotify
for exercising the cpuset v2 partition root state code.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/testing/selftests/cgroup/Makefile       |   5 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 664 ++++++++++++++++++
 tools/testing/selftests/cgroup/wait_inotify.c |  87 +++
 3 files changed, 754 insertions(+), 2 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
 create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 59e222460581..3f1fd3f93f41 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -1,10 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -Wall -pthread
 
-all:
+all: ${HELPER_PROGS}
 
 TEST_FILES     := with_stress.sh
-TEST_PROGS     := test_stress.sh
+TEST_PROGS     := test_stress.sh test_cpuset_prs.sh
+TEST_GEN_FILES := wait_inotify
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_kmem
 TEST_GEN_PROGS += test_core
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
new file mode 100755
index 000000000000..0a5d3bbad5cd
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -0,0 +1,664 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test for cpuset v2 partition root state (PRS)
+#
+# The sched verbose flag is set, if available, so that the console log
+# can be examined for the correct setting of scheduling domain.
+#
+
+skip_test() {
+	echo "$1"
+	echo "Test SKIPPED"
+	exit 0
+}
+
+[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
+
+# Set sched verbose flag, if available
+[[ -d /sys/kernel/debug/sched ]] && echo Y > /sys/kernel/debug/sched/verbose
+
+# Get wait_inotify location
+WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
+
+# Find cgroup v2 mount point
+CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
+
+CPUS=$(lscpu | grep "^CPU(s)" | sed -e "s/.*:[[:space:]]*//")
+[[ $CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!"
+
+# Set verbose flag and delay factor
+PROG=$1
+VERBOSE=
+DELAY_FACTOR=1
+while [[ "$1" = -* ]]
+do
+	case "$1" in
+		-v) VERBOSE=1
+		    break
+		    ;;
+		-d) DELAY_FACTOR=$2
+		    shift
+		    break
+		    ;;
+		*)  echo "Usage: $PROG [-v] [-d <delay-factor>"
+		    exit
+		    ;;
+	esac
+	shift
+done
+
+cd $CGROUP2
+echo +cpuset > cgroup.subtree_control
+[[ -d test ]] || mkdir test
+cd test
+
+# Pause in ms
+pause()
+{
+	DELAY=$1
+	LOOP=0
+	while [[ $LOOP -lt $DELAY_FACTOR ]]
+	do
+		sleep $DELAY
+		((LOOP++))
+	done
+	return 0
+}
+
+console_msg()
+{
+	MSG=$1
+	echo "$MSG"
+	echo "" > /dev/console
+	echo "$MSG" > /dev/console
+	pause 0.01
+}
+
+test_partition()
+{
+	EXPECTED_VAL=$1
+	echo $EXPECTED_VAL > cpuset.cpus.partition
+	[[ $? -eq 0 ]] || exit 1
+	ACTUAL_VAL=$(cat cpuset.cpus.partition)
+	[[ $ACTUAL_VAL != $EXPECTED_VAL ]] && {
+		echo "cpuset.cpus.partition: expect $EXPECTED_VAL, found $EXPECTED_VAL"
+		echo "Test FAILED"
+		exit 1
+	}
+}
+
+test_effective_cpus()
+{
+	EXPECTED_VAL=$1
+	ACTUAL_VAL=$(cat cpuset.cpus.effective)
+	[[ "$ACTUAL_VAL" != "$EXPECTED_VAL" ]] && {
+		echo "cpuset.cpus.effective: expect '$EXPECTED_VAL', found '$EXPECTED_VAL'"
+		echo "Test FAILED"
+		exit 1
+	}
+}
+
+# Adding current process to cgroup.procs as a test
+test_add_proc()
+{
+	OUTSTR="$1"
+	ERRMSG=$((echo $$ > cgroup.procs) |& cat)
+	echo $ERRMSG | grep -q "$OUTSTR"
+	[[ $? -ne 0 ]] && {
+		echo "cgroup.procs: expect '$OUTSTR', got '$ERRMSG'"
+		echo "Test FAILED"
+		exit 1
+	}
+	echo $$ > $CGROUP2/cgroup.procs	# Move out the task
+}
+
+#
+# Testing the new "isolated" partition root type
+#
+test_isolated()
+{
+	echo 2-3 > cpuset.cpus
+	TYPE=$(cat cpuset.cpus.partition)
+	[[ $TYPE = member ]] || echo member > cpuset.cpus.partition
+
+	console_msg "Change from member to root"
+	test_partition root
+
+	console_msg "Change from root to isolated"
+	test_partition isolated
+
+	console_msg "Change from isolated to member"
+	test_partition member
+
+	console_msg "Change from member to isolated"
+	test_partition isolated
+
+	console_msg "Change from isolated to root"
+	test_partition root
+
+	console_msg "Change from root to member"
+	test_partition member
+
+	#
+	# Testing partition root with no cpu
+	#
+	console_msg "Distribute all cpus to child partition"
+	echo +cpuset > cgroup.subtree_control
+	test_partition root
+
+	mkdir A1
+	cd A1
+	echo 2-3 > cpuset.cpus
+	test_partition root
+	test_effective_cpus 2-3
+	cd ..
+	test_effective_cpus ""
+
+	console_msg "Moving task to partition test"
+	test_add_proc "No space left"
+	cd A1
+	test_add_proc ""
+	cd ..
+
+	console_msg "Shrink and expand child partition"
+	cd A1
+	echo 2 > cpuset.cpus
+	cd ..
+	test_effective_cpus 3
+	cd A1
+	echo 2-3 > cpuset.cpus
+	cd ..
+	test_effective_cpus ""
+
+	# Cleaning up
+	console_msg "Cleaning up"
+	echo $$ > $CGROUP2/cgroup.procs
+	[[ -d A1 ]] && rmdir A1
+}
+
+#
+# Cpuset controller state transition test matrix.
+#
+# Cgroup test hierarchy
+#
+# test -- A1 -- A2 -- A3
+#      \- B1
+#
+#  P<v> = set cpus.partition (0:member, 1:root, 2:isolated, -1:root invalid)
+#  C<l> = add cpu-list
+#  S<p> = use prefix in subtree_control
+#  T    = put a task into cgroup
+#  O<c>-<v> = Write <v> to CPU online file of <c>
+#
+SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
+TEST_MATRIX=(
+	# test  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+	# ----  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+	"  S+    C0-1     .      .    C2-3    S+    C4-5     .      .     0 A2:0-1"
+	"  S+    C0-1     .      .    C2-3    P1      .      .      .     0 "
+	"  S+    C0-1     .      .    C2-3   P1:S+ C0-1:P1   .      .     0 "
+	"  S+    C0-1     .      .    C2-3   P1:S+  C1:P1    .      .     0 "
+	"  S+   C0-1:S+   .      .    C2-3     .      .      .     P1     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+     C1      .      .     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+    C1:P1    .      .     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+    C1:P1    .     P1     0 "
+	"  S+   C0-1:P1   .      .    C2-3   C4-5     .      .      .     0 A1:4-5"
+	"  S+   C0-1:P1   .      .    C2-3  S+:C4-5   .      .      .     0 A1:4-5"
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .     C2     0 "
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .    C4-5    0 B1:4-5"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .      .      .      .      .     0 A1:0-1,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     C1-3    .      .      .     0 A1:1,A2:2-3"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      .      .      .     0 A1:,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      P0     .      .     0 A1:3,A2:3 A1:P1,A2:P0"
+	"  S+ C2-3:P1:S+  C2:P1  .      .     C2-4    .      .      .     0 A1:3-4,A2:2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      .      .     C0-2   0 A1:,B1:0-2 A1:P1,A2:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     C2-3    .      .      .     0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+	# CPU offlining cases:
+	"  S+    C0-1     .      .    C2-3    S+    C4-5     .     O2-0   0 A1:0-1,B1:3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O2-0    .      .      .     0 A1:0-1,A2:3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O2-0   O2-1    .      .     0 A1:0-1,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O1-0    .      .      .     0 A1:0,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O1-0   O1-1    .      .     0 A1:0-1,A2:2-3"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O3-0   O3-1    .      .     0 A1:2,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P2  .      .     O3-0   O3-1    .      .     0 A1:2,A2:3 A1:P1,A2:P2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O2-0   O2-1    .      .     0 A1:2,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P2  .      .     O2-0   O2-1    .      .     0 A1:2,A2:3 A1:P1,A2:P2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O2-0    .      .      .     0 A1:,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O3-0    .      .      .     0 A1:2,A2: A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .    T:O2-0   .      .      .     0 A1:3,A2:3 A1:P1,A2:P-1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .      .    T:O3-0   .      .     0 A1:2,A2:2 A1:P1,A2:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O1-0    .      .      .     0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O2-0    .      .      .     0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O3-0    .      .      .     0 A1:1,A2:2,A3: A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0   .      .      .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .    T:O2-0   .      .     0 A1:1,A2:3,A3:3 A1:P1,A2:P1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .      .    T:O3-0   .     0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O1-1    .      .     0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .    T:O2-0  O2-1    .     0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .      .    T:O3-0  O3-1   0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O2-0   O1-1    .     0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O2-0   O2-1    .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+
+	# test  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+	# ----  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+	#
+	# Incorrect change to cpuset.cpus invalidates partition root
+	#
+	# Adding CPUs to partition root that are not in parent's
+	# cpuset.cpus is allowed, but those extra CPUs are ignored.
+	"  S+ C2-3:P1:S+ C3:P1   .      .      .     C2-4    .      .     0 A1:,A2:2-3 A1:P1,A2:P1"
+
+	# Taking away all CPUs from parent or itself if there are tasks
+	# will make the partition invalid.
+	"  S+ C2-3:P1:S+  C3:P1  .      .      T     C2-3    .      .     0 A1:2-3,A2:2-3 A1:P1,A2:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:C2-3   .      .      .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    . T:C2-3:C1-3 .      .      .     0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+	# Changing a partition root to member disables child partitions
+	"  S+ C2-3:P1:S+  C3:P1  .      .      P0     .      .      .     0 A1:2-3,A2:3 A1:P0,A2:P0"
+	"  S+ $SETUP_A123_PARTITIONS    .     C2-3    P0     .      .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P0,A3:P0"
+
+	# test  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+	# ----  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+	# Failure cases:
+
+	# To become a partition root, cpuset.cpus must be a subset of
+	# parent's cpuset.cpus.
+	"  S+    C0-1     .      .    C2-3    S+   C4-5:P1   .      .     1 "
+
+	# A cpuset cannot become a partition root if it has child cpusets
+	# with non-empty cpuset.cpus.
+	"  S+   C0-1:S+   C1     .    C2-3    P1      .      .      .     1 "
+
+	# Any change to cpuset.cpus of a partition root must be exclusive.
+	"  S+   C0-1:P1   .      .    C2-3   C0-2     .      .      .     1 "
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .     C1     1 "
+	"  S+ C2-3:P1:S+  C2:P1  .     C1    C1-3     .      .      .     1 "
+
+	# Deletion of CPUs distributed to child cgroup is not allowed.
+	"  S+ C0-1:P1:S+ C1      .    C2-3   C4-5     .      .      .     1 "
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .    C0-2     .      .      .     1 "
+
+	# A task cannot be added to a partition with no cpu
+	"  S+ C2-3:P1:S+  C3:P1  .      .    O2-0:T   .      .      .     1 A1:,A2:3 A1:P1,A2:P1"
+)
+
+#
+# Write to the cpu online file
+#  $1 - <c>-<v> where <c> = cpu number, <v> value to be written
+#
+write_cpu_online()
+{
+	CPU=${1%-*}
+	VAL=${1#*-}
+	CPUFILE=//sys/devices/system/cpu/cpu${CPU}/online
+	if [[ $VAL -eq 0 ]]
+	then
+		OFFLINE_CPUS="$OFFLINE_CPUS $CPU"
+	else
+		[[ -n "$OFFLINE_CPUS" ]] && {
+			OFFLINE_CPUS=$(echo $CPU $CPU $OFFLINE_CPUS | fmt -1 |\
+					sort | uniq -u)
+		}
+	fi
+	echo $VAL > $CPUFILE
+	pause 0.01
+}
+
+#
+# Set controller state
+#  $1 - cgroup directory
+#  $2 - state
+#  $3 - showerr
+#
+# The presence of ":" in state means transition from one to the next.
+#
+set_ctrl_state()
+{
+	TMPMSG=/tmp/.msg_$$
+	CGRP=$1
+	STATE=$2
+	SHOWERR=${3}${VERBOSE}
+	CTRL=${CTRL:=$CONTROLLER}
+	HASERR=0
+	REDIRECT="2> $TMPMSG"
+	[[ -z "$STATE" || "$STATE" = '.' ]] && return 0
+
+	rm -f $TMPMSG
+	for CMD in $(echo $STATE | sed -e "s/:/ /g")
+	do
+		TFILE=$CGRP/cgroup.procs
+		SFILE=$CGRP/cgroup.subtree_control
+		PFILE=$CGRP/cpuset.cpus.partition
+		CFILE=$CGRP/cpuset.cpus
+		S=$(expr substr $CMD 1 1)
+		if [[ $S = S ]]
+		then
+			PREFIX=${CMD#?}
+			COMM="echo ${PREFIX}${CTRL} > $SFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = C ]]
+		then
+			CPUS=${CMD#?}
+			COMM="echo $CPUS > $CFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = P ]]
+		then
+			VAL=${CMD#?}
+			case $VAL in
+			0)  VAL=member
+			    ;;
+			1)  VAL=root
+			    ;;
+			2)  VAL=isolated
+			    ;;
+			*)
+			    echo "Invalid partition state - $VAL"
+			    exit 1
+			    ;;
+			esac
+			COMM="echo $VAL > $PFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = O ]]
+		then
+			VAL=${CMD#?}
+			write_cpu_online $VAL
+		elif [[ $S = T ]]
+		then
+			COMM="echo 0 > $TFILE"
+			eval $COMM $REDIRECT
+		fi
+		RET=$?
+		[[ $RET -ne 0 ]] && {
+			[[ -n "$SHOWERR" ]] && {
+				echo "$COMM"
+				cat $TMPMSG
+			}
+			HASERR=1
+		}
+		pause 0.01
+		rm -f $TMPMSG
+	done
+	return $HASERR
+}
+
+set_ctrl_state_noerr()
+{
+	CGRP=$1
+	STATE=$2
+	[[ -d $CGRP ]] || mkdir $CGRP
+	set_ctrl_state $CGRP $STATE 1
+	[[ $? -ne 0 ]] && {
+		echo "ERROR: Failed to set $2 to cgroup $1!"
+		exit 1
+	}
+}
+
+online_cpus()
+{
+	[[ -n "OFFLINE_CPUS" ]] && {
+		for C in $OFFLINE_CPUS
+		do
+			write_cpu_online ${C}-1
+		done
+	}
+}
+
+#
+# Return 1 if the list of effective cpus isn't the same as the initial list.
+#
+reset_cgroup_states()
+{
+	echo 0 > $CGROUP2/cgroup.procs
+	online_cpus
+	rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1
+	set_ctrl_state . S-
+	pause 0.01
+}
+
+dump_states()
+{
+	for DIR in A1 A1/A2 A1/A2/A3 B1
+	do
+		ECPUS=$DIR/cpuset.cpus.effective
+		PRS=$DIR/cpuset.cpus.partition
+		[[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)"
+		[[ -e $PRS   ]] && echo "$PRS: $(cat $PRS)"
+	done
+}
+
+#
+# Check effective cpus
+# $1 - check string, format: <cgroup>:<cpu-list>[,<cgroup>:<cpu-list>]*
+#
+check_effective_cpus()
+{
+	CHK_STR=$1
+	for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+	do
+		set -- $(echo $CHK | sed -e "s/:/ /g")
+		CGRP=$1
+		CPUS=$2
+		[[ $CGRP = A2 ]] && CGRP=A1/A2
+		[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+		FILE=$CGRP/cpuset.cpus.effective
+		[[ -e $FILE ]] || return 1
+		[[ $CPUS = $(cat $FILE) ]] || return 1
+	done
+}
+
+#
+# Check cgroup states
+#  $1 - check string, format: <cgroup>:<state>[,<cgroup>:<state>]*
+#
+check_cgroup_states()
+{
+	CHK_STR=$1
+	for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+	do
+		set -- $(echo $CHK | sed -e "s/:/ /g")
+		CGRP=$1
+		STATE=$2
+		FILE=
+		EVAL=$(expr substr $STATE 2 2)
+		[[ $CGRP = A2 ]] && CGRP=A1/A2
+		[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+
+		case $STATE in
+			P*) FILE=$CGRP/cpuset.cpus.partition
+			    ;;
+			*)  echo "Unknown state: $STATE!"
+			    exit 1
+			    ;;
+		esac
+		VAL=$(cat $FILE)
+
+		case "$VAL" in
+			member) VAL=0
+				;;
+			root)	VAL=1
+				;;
+			isolated)
+				VAL=2
+				;;
+			"root invalid"*)
+				VAL=-1
+				;;
+		esac
+		[[ $EVAL != $VAL ]] && return 1
+	done
+	return 0
+}
+
+#
+# Run cpuset state transition test
+#  $1 - test matrix name
+#
+# This test is somewhat fragile as delays (sleep x) are added in various
+# places to make sure state changes are fully propagated before the next
+# action. These delays may need to be adjusted if running in a slower machine.
+#
+run_state_test()
+{
+	TEST=$1
+	CONTROLLER=cpuset
+	CPULIST=0-6
+	I=0
+	eval CNT="\${#$TEST[@]}"
+
+	reset_cgroup_states
+	echo $CPULIST > cpuset.cpus
+	echo root > cpuset.cpus.partition
+	console_msg "Running state transition test ..."
+
+	while [[ $I -lt $CNT ]]
+	do
+		echo "Running test $I ..." > /dev/console
+		eval set -- "\${$TEST[$I]}"
+		ROOT=$1
+		OLD_A1=$2
+		OLD_A2=$3
+		OLD_A3=$4
+		OLD_B1=$5
+		NEW_A1=$6
+		NEW_A2=$7
+		NEW_A3=$8
+		NEW_B1=$9
+		RESULT=${10}
+		ECPUS=${11}
+		STATES=${12}
+
+		set_ctrl_state_noerr .        $ROOT
+		set_ctrl_state_noerr A1       $OLD_A1
+		set_ctrl_state_noerr A1/A2    $OLD_A2
+		set_ctrl_state_noerr A1/A2/A3 $OLD_A3
+		set_ctrl_state_noerr B1       $OLD_B1
+		RETVAL=0
+		set_ctrl_state A1       $NEW_A1; ((RETVAL += $?))
+		set_ctrl_state A1/A2    $NEW_A2; ((RETVAL += $?))
+		set_ctrl_state A1/A2/A3 $NEW_A3; ((RETVAL += $?))
+		set_ctrl_state B1       $NEW_B1; ((RETVAL += $?))
+
+		[[ $RETVAL -ne $RESULT ]] && {
+			echo "Test $TEST[$I] failed result check!"
+			eval echo \"\${$TEST[$I]}\"
+			dump_states
+			online_cpus
+			exit 1
+		}
+
+		[[ -n "$ECPUS" && "$ECPUS" != . ]] && {
+			check_effective_cpus $ECPUS
+			[[ $? -ne 0 ]] && {
+				echo "Test $TEST[$I] failed effective CPU check!"
+				eval echo \"\${$TEST[$I]}\"
+				echo
+				dump_states
+				online_cpus
+				exit 1
+			}
+		}
+
+		[[ -n "$STATES" ]] && {
+			check_cgroup_states $STATES
+			[[ $? -ne 0 ]] && {
+				echo "FAILED: Test $TEST[$I] failed states check!"
+				eval echo \"\${$TEST[$I]}\"
+				echo
+				dump_states
+				online_cpus
+				exit 1
+			}
+		}
+
+		reset_cgroup_states
+		#
+		# Check to see if effective cpu list changes
+		#
+		pause 0.05
+		NEWLIST=$(cat cpuset.cpus.effective)
+		[[ $NEWLIST != $CPULIST ]] && {
+			echo "Effective cpus changed to $NEWLIST after test $I!"
+			exit 1
+		}
+		[[ -n "$VERBOSE" ]] && echo "Test $I done."
+		((I++))
+	done
+	echo "All $I tests of $TEST PASSED."
+
+	echo member > cpuset.cpus.partition
+}
+
+#
+# Wait for inotify event for the given file and read it
+# $1: cgroup file to wait for
+# $2: file to store the read result
+#
+wait_inotify()
+{
+	CGROUP_FILE=$1
+	OUTPUT_FILE=$2
+
+	$WAIT_INOTIFY $CGROUP_FILE
+	cat $CGROUP_FILE > $OUTPUT_FILE
+}
+
+#
+# Test if inotify events are properly generated when going into and out of
+# invalid partition state.
+#
+test_inotify()
+{
+	ERR=0
+	PRS=/tmp/.prs_$$
+	[[ -f $WAIT_INOTIFY ]] || {
+		echo "wait_inotify not found, inotify test SKIPPED."
+		return
+	}
+
+	pause 0.01
+	echo 1 > cpuset.cpus
+	echo 0 > cgroup.procs
+	echo root > cpuset.cpus.partition
+	pause 0.01
+	rm -f $PRS
+	wait_inotify $PWD/cpuset.cpus.partition $PRS &
+	pause 0.01
+	set_ctrl_state . "O1-0"
+	pause 0.01
+	check_cgroup_states ".:P-1"
+	if [[ $? -ne 0 ]]
+	then
+		echo "FAILED: Inotify test - partition not invalid"
+		ERR=1
+	elif [[ ! -f $PRS ]]
+	then
+		echo "FAILED: Inotify test - event not generated"
+		ERR=1
+		kill %1
+	elif [[ $(cat $PRS) != "root invalid"* ]]
+	then
+		echo "FAILED: Inotify test - incorrect state"
+		cat $PRS
+		ERR=1
+	fi
+	online_cpus
+	echo member > cpuset.cpus.partition
+	echo 0 > ../cgroup.procs
+	if [[ $ERR -ne 0 ]]
+	then
+		exit 1
+	else
+		echo "Inotify test PASSED"
+	fi
+}
+
+run_state_test TEST_MATRIX
+test_isolated
+test_inotify
+echo "All tests PASSED."
+cd ..
+rmdir test
diff --git a/tools/testing/selftests/cgroup/wait_inotify.c b/tools/testing/selftests/cgroup/wait_inotify.c
new file mode 100644
index 000000000000..e11b431e1b62
--- /dev/null
+++ b/tools/testing/selftests/cgroup/wait_inotify.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Wait until an inotify event on the given cgroup file.
+ */
+#include <linux/limits.h>
+#include <sys/inotify.h>
+#include <sys/mman.h>
+#include <sys/ptrace.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+static const char usage[] = "Usage: %s [-v] <cgroup_file>\n";
+static char *file;
+static int verbose;
+
+static inline void fail_message(char *msg)
+{
+	fprintf(stderr, msg, file);
+	exit(1);
+}
+
+int main(int argc, char *argv[])
+{
+	char *cmd = argv[0];
+	int c, fd;
+	struct pollfd fds = { .events = POLLIN, };
+
+	while ((c = getopt(argc, argv, "v")) != -1) {
+		switch (c) {
+		case 'v':
+			verbose++;
+			break;
+		}
+		argv++, argc--;
+	}
+
+	if (argc != 2) {
+		fprintf(stderr, usage, cmd);
+		return -1;
+	}
+	file = argv[1];
+	fd = open(file, O_RDONLY);
+	if (fd < 0)
+		fail_message("Cgroup file %s not found!\n");
+	close(fd);
+
+	fd = inotify_init();
+	if (fd < 0)
+		fail_message("inotify_init() fails on %s!\n");
+	if (inotify_add_watch(fd, file, IN_MODIFY) < 0)
+		fail_message("inotify_add_watch() fails on %s!\n");
+	fds.fd = fd;
+
+	/*
+	 * poll waiting loop
+	 */
+	for (;;) {
+		int ret = poll(&fds, 1, 10000);
+
+		if (ret < 0) {
+			if (errno == EINTR)
+				continue;
+			perror("poll");
+			exit(1);
+		}
+		if ((ret > 0) && (fds.revents & POLLIN))
+			break;
+	}
+	if (verbose) {
+		struct inotify_event events[10];
+		long len;
+
+		usleep(1000);
+		len = read(fd, events, sizeof(events));
+		printf("Number of events read = %ld\n",
+			len/sizeof(struct inotify_event));
+	}
+	close(fd);
+	return 0;
+}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-10-18 14:36 [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (5 preceding siblings ...)
  2021-10-18 14:36 ` [PATCH v8 6/6] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long
@ 2021-10-27 23:05 ` Waiman Long
  2021-11-10 11:13 ` Felix Moessbauer
  7 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-10-27 23:05 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On 10/18/21 10:36 AM, Waiman Long wrote:
> v8:
>   - Reorganize the patch series and rationalize the features and
>     constraints of a partition.
>   - Update patch descriptions and documentation accordingly.
>
> v7:
>   - Simplify the documentation patch (patch 5) as suggested by Tejun.
>   - Fix a typo in patch 2 and improper commit log in patch 3.
>
> v6:
>   - Remove duplicated tmpmask from update_prstate() which should fix the
>     frame size too large problem reported by kernel test robot.
>
> This patchset makes four enhancements to the cpuset v2 code.
>
>   Patch 1: Enable partition with no task to have empty cpuset.cpus.effective.
>
>   Patch 2: Refining the features and constraints of a cpuset partition
>   clarifying what changes are allowed.
>
>   Patch 3: Add a new partition state "isolated" to create a partition
>   root without load balancing. This is for handling intermitten workloads
>   that have a strict low latency requirement.
>
>   Patch 4: Enable the "cpuset.cpus.partition" file to show the reason
>   that causes invalid partition like "root invalid (No cpu available
>   due to hotplug)".
>
> Patch 5 updates the cgroup-v2.rst file accordingly. Patch 6 adds a new
> cpuset test to test the new cpuset partition code.
>
> Waiman Long (6):
>    cgroup/cpuset: Allow no-task partition to have empty
>      cpuset.cpus.effective
>    cgroup/cpuset: Refining features and constraints of a partition
>    cgroup/cpuset: Add a new isolated cpus.partition type
>    cgroup/cpuset: Show invalid partition reason string
>    cgroup/cpuset: Update description of cpuset.cpus.partition in
>      cgroup-v2.rst
>    kselftest/cgroup: Add cpuset v2 partition root state test
>
>   Documentation/admin-guide/cgroup-v2.rst       | 153 ++--
>   kernel/cgroup/cpuset.c                        | 393 +++++++----
>   tools/testing/selftests/cgroup/Makefile       |   5 +-
>   .../selftests/cgroup/test_cpuset_prs.sh       | 664 ++++++++++++++++++
>   tools/testing/selftests/cgroup/wait_inotify.c |  87 +++
>   5 files changed, 1115 insertions(+), 187 deletions(-)
>   create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
>   create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

Any feedback on this patch series?

Thanks,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-10-18 14:36 [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (6 preceding siblings ...)
  2021-10-27 23:05 ` [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
@ 2021-11-10 11:13 ` Felix Moessbauer
  2021-11-10 13:21   ` Marcelo Tosatti
                     ` (2 more replies)
  7 siblings, 3 replies; 39+ messages in thread
From: Felix Moessbauer @ 2021-11-10 11:13 UTC (permalink / raw)
  To: longman
  Cc: akpm, cgroups, corbet, frederic, guro, hannes, juri.lelli,
	linux-doc, linux-kernel, linux-kselftest, lizefan.x, mkoutny,
	mtosatti, pauld, peterz, shuah, tj, jan.kiszka, henning.schild

Hi Weiman,

> v8:
>  - Reorganize the patch series and rationalize the features and
>    constraints of a partition.
>  - Update patch descriptions and documentation accordingly.
> 
> v7:
>  - Simplify the documentation patch (patch 5) as suggested by Tejun.
>  - Fix a typo in patch 2 and improper commit log in patch 3.
> 
> v6:
>  - Remove duplicated tmpmask from update_prstate() which should fix the
>    frame size too large problem reported by kernel test robot.
> 
> This patchset makes four enhancements to the cpuset v2 code.
> 
>  Patch 1: Enable partition with no task to have empty cpuset.cpus.effective.
> 
>  Patch 2: Refining the features and constraints of a cpuset partition
>  clarifying what changes are allowed.
>
>  Patch 3: Add a new partition state "isolated" to create a partition
>  root without load balancing. This is for handling intermitten workloads
>  that have a strict low latency requirement.


I just tested this patch-series and can confirm that it works on 5.15.0-rc7-rt15 (PREEMT_RT).

However, I was not able to see any latency improvements when using
cpuset.cpus.partition=isolated.
The test was performed with jitterdebugger on CPUs 1-3 and the following cmdline:
rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
On the other cpus, stress-ng was executed to generate load.

Just some more general notes:

Even with this new "isolated" type, it is still very tricky to get a similar
behavior as with isolcpus (as long as I don't miss something here):

Consider an RT application that consists of a non-rt thread that should be floating
and a rt-thread that should be placed in the isolated domain.
This requires cgroup.type=threaded on both cgroups and changes to the application
(threads have to be born in non-rt group and moved to rt-group).

Theoretically, this could be done externally, but in case the application sets the
affinity mask manually, you run into a timing issue (setting affinities to CPUs
outside the current cpuset.cpus results in EINVAL).

Best regards,
Felix Moessbauer
Siemens AG

> Patch 4: Enable the "cpuset.cpus.partition" file to show the reason
>  that causes invalid partition like "root invalid (No cpu available
>  due to hotplug)".
> 
> Patch 5 updates the cgroup-v2.rst file accordingly. Patch 6 adds a new
> cpuset test to test the new cpuset partition code.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 11:13 ` Felix Moessbauer
@ 2021-11-10 13:21   ` Marcelo Tosatti
  2021-11-10 13:56   ` Michal Koutný
  2021-11-10 15:20   ` Waiman Long
  2 siblings, 0 replies; 39+ messages in thread
From: Marcelo Tosatti @ 2021-11-10 13:21 UTC (permalink / raw)
  To: Felix Moessbauer
  Cc: longman, akpm, cgroups, corbet, frederic, guro, hannes,
	juri.lelli, linux-doc, linux-kernel, linux-kselftest, lizefan.x,
	mkoutny, pauld, peterz, shuah, tj, jan.kiszka, henning.schild

On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer wrote:
> Hi Weiman,
> 
> > v8:
> >  - Reorganize the patch series and rationalize the features and
> >    constraints of a partition.
> >  - Update patch descriptions and documentation accordingly.
> > 
> > v7:
> >  - Simplify the documentation patch (patch 5) as suggested by Tejun.
> >  - Fix a typo in patch 2 and improper commit log in patch 3.
> > 
> > v6:
> >  - Remove duplicated tmpmask from update_prstate() which should fix the
> >    frame size too large problem reported by kernel test robot.
> > 
> > This patchset makes four enhancements to the cpuset v2 code.
> > 
> >  Patch 1: Enable partition with no task to have empty cpuset.cpus.effective.
> > 
> >  Patch 2: Refining the features and constraints of a cpuset partition
> >  clarifying what changes are allowed.
> >
> >  Patch 3: Add a new partition state "isolated" to create a partition
> >  root without load balancing. This is for handling intermitten workloads
> >  that have a strict low latency requirement.
> 
> 
> I just tested this patch-series and can confirm that it works on 5.15.0-rc7-rt15 (PREEMT_RT).
> 
> However, I was not able to see any latency improvements when using
> cpuset.cpus.partition=isolated.
> The test was performed with jitterdebugger on CPUs 1-3 and the following cmdline:
> rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
> On the other cpus, stress-ng was executed to generate load.

enum hk_flags {
        HK_FLAG_TIMER           = 1,
        HK_FLAG_RCU             = (1 << 1),
        HK_FLAG_MISC            = (1 << 2),
        HK_FLAG_SCHED           = (1 << 3),
        HK_FLAG_TICK            = (1 << 4),
        HK_FLAG_DOMAIN          = (1 << 5),
        HK_FLAG_WQ              = (1 << 6),
        HK_FLAG_MANAGED_IRQ     = (1 << 7),
        HK_FLAG_KTHREAD         = (1 << 8),
};

static int __init housekeeping_nohz_full_setup(char *str)
{
        unsigned int flags;

        flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
                HK_FLAG_MISC | HK_FLAG_KTHREAD;

        return housekeeping_setup(str, flags);
}
__setup("nohz_full=", housekeeping_nohz_full_setup);

So HK_FLAG_SCHED and HK_FLAG_MANAGED_IRQ are unset in your configuration.
Perhaps they are affecting your latency numbers?

This tool might be handy to see what is the reason for the latency source:

https://github.com/xzpeter/rt-trace-bpf

./rt-trace-bcc.py -c isolated-cpu

> Just some more general notes:
> 
> Even with this new "isolated" type, it is still very tricky to get a similar
> behavior as with isolcpus (as long as I don't miss something here):
> 
> Consider an RT application that consists of a non-rt thread that should be floating
> and a rt-thread that should be placed in the isolated domain.
> This requires cgroup.type=threaded on both cgroups and changes to the application
> (threads have to be born in non-rt group and moved to rt-group).
> 
> Theoretically, this could be done externally, but in case the application sets the
> affinity mask manually, you run into a timing issue (setting affinities to CPUs
> outside the current cpuset.cpus results in EINVAL).
> 
> Best regards,
> Felix Moessbauer
> Siemens AG
> 
> > Patch 4: Enable the "cpuset.cpus.partition" file to show the reason
> >  that causes invalid partition like "root invalid (No cpu available
> >  due to hotplug)".
> > 
> > Patch 5 updates the cgroup-v2.rst file accordingly. Patch 6 adds a new
> > cpuset test to test the new cpuset partition code.
> 
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 11:13 ` Felix Moessbauer
  2021-11-10 13:21   ` Marcelo Tosatti
@ 2021-11-10 13:56   ` Michal Koutný
  2021-11-10 15:21     ` Moessbauer, Felix
  2021-11-10 15:20   ` Waiman Long
  2 siblings, 1 reply; 39+ messages in thread
From: Michal Koutný @ 2021-11-10 13:56 UTC (permalink / raw)
  To: Felix Moessbauer
  Cc: longman, akpm, cgroups, corbet, frederic, guro, hannes,
	juri.lelli, linux-doc, linux-kernel, linux-kselftest, lizefan.x,
	mtosatti, pauld, peterz, shuah, tj, jan.kiszka, henning.schild

Hello.

On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer <felix.moessbauer@siemens.com> wrote:
> However, I was not able to see any latency improvements when using
> cpuset.cpus.partition=isolated.

Interesting. What was the baseline against which you compared it
(isolcpus, no cpusets,...)?

> The test was performed with jitterdebugger on CPUs 1-3 and the following cmdline:
> rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
> On the other cpus, stress-ng was executed to generate load.
> [...]

> This requires cgroup.type=threaded on both cgroups and changes to the application
> (threads have to be born in non-rt group and moved to rt-group).

But even with isolcpus the application would need to set affinity of
threads to the selected CPUs (cf cgroup migrating). Do I miss anything?

Thanks,
Michal

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 11:13 ` Felix Moessbauer
  2021-11-10 13:21   ` Marcelo Tosatti
  2021-11-10 13:56   ` Michal Koutný
@ 2021-11-10 15:20   ` Waiman Long
  2 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-11-10 15:20 UTC (permalink / raw)
  To: Felix Moessbauer
  Cc: akpm, cgroups, corbet, frederic, guro, hannes, juri.lelli,
	linux-doc, linux-kernel, linux-kselftest, lizefan.x, mkoutny,
	mtosatti, pauld, peterz, shuah, tj, jan.kiszka, henning.schild


On 11/10/21 06:13, Felix Moessbauer wrote:
> Hi Weiman,
>
>> v8:
>>   - Reorganize the patch series and rationalize the features and
>>     constraints of a partition.
>>   - Update patch descriptions and documentation accordingly.
>>
>> v7:
>>   - Simplify the documentation patch (patch 5) as suggested by Tejun.
>>   - Fix a typo in patch 2 and improper commit log in patch 3.
>>
>> v6:
>>   - Remove duplicated tmpmask from update_prstate() which should fix the
>>     frame size too large problem reported by kernel test robot.
>>
>> This patchset makes four enhancements to the cpuset v2 code.
>>
>>   Patch 1: Enable partition with no task to have empty cpuset.cpus.effective.
>>
>>   Patch 2: Refining the features and constraints of a cpuset partition
>>   clarifying what changes are allowed.
>>
>>   Patch 3: Add a new partition state "isolated" to create a partition
>>   root without load balancing. This is for handling intermitten workloads
>>   that have a strict low latency requirement.
>
> I just tested this patch-series and can confirm that it works on 5.15.0-rc7-rt15 (PREEMT_RT).
>
> However, I was not able to see any latency improvements when using
> cpuset.cpus.partition=isolated.
> The test was performed with jitterdebugger on CPUs 1-3 and the following cmdline:
> rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
> On the other cpus, stress-ng was executed to generate load.
>
> Just some more general notes:
>
> Even with this new "isolated" type, it is still very tricky to get a similar
> behavior as with isolcpus (as long as I don't miss something here):
>
> Consider an RT application that consists of a non-rt thread that should be floating
> and a rt-thread that should be placed in the isolated domain.
> This requires cgroup.type=threaded on both cgroups and changes to the application
> (threads have to be born in non-rt group and moved to rt-group).
>
> Theoretically, this could be done externally, but in case the application sets the
> affinity mask manually, you run into a timing issue (setting affinities to CPUs
> outside the current cpuset.cpus results in EINVAL).

I believe the "isolated" type will have more benefit on non PREEMPT_RT 
kernel. Anyway, having the "isolated" type is just the first step. It 
should be equivalent to "isolcpus=domain". There are other patches 
floating that attempt to move some of the isolcpus=nohz features into 
cpuset as well. It is not there yet, but we should be able to have 
better dynamic cpu isolation down the road.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 13:56   ` Michal Koutný
@ 2021-11-10 15:21     ` Moessbauer, Felix
  2021-11-10 16:10       ` Marcelo Tosatti
  2021-11-10 18:15       ` Michal Koutný
  0 siblings, 2 replies; 39+ messages in thread
From: Moessbauer, Felix @ 2021-11-10 15:21 UTC (permalink / raw)
  To: Michal Koutný
  Cc: longman, akpm, cgroups, corbet, frederic, guro, hannes,
	juri.lelli, linux-doc, linux-kernel, linux-kselftest, lizefan.x,
	mtosatti, pauld, peterz, shuah, tj, jan.kiszka, henning.schild



> -----Original Message-----
> From: Michal Koutný <mkoutny@suse.com>
> Sent: Wednesday, November 10, 2021 2:57 PM
> To: Moessbauer, Felix (T RDA IOT SES-DE) <felix.moessbauer@siemens.com>
> Cc: longman@redhat.com; akpm@linux-foundation.org;
> cgroups@vger.kernel.org; corbet@lwn.net; frederic@kernel.org; guro@fb.com;
> hannes@cmpxchg.org; juri.lelli@redhat.com; linux-doc@vger.kernel.org; linux-
> kernel@vger.kernel.org; linux-kselftest@vger.kernel.org;
> lizefan.x@bytedance.com; mtosatti@redhat.com; pauld@redhat.com;
> peterz@infradead.org; shuah@kernel.org; tj@kernel.org; Kiszka, Jan (T RDA
> IOT) <jan.kiszka@siemens.com>; Schild, Henning (T RDA IOT SES-DE)
> <henning.schild@siemens.com>
> Subject: Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type &
> empty effecitve cpus
> 
> Hello.
> 
> On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer
> <felix.moessbauer@siemens.com> wrote:
> > However, I was not able to see any latency improvements when using
> > cpuset.cpus.partition=isolated.
> 
> Interesting. What was the baseline against which you compared it (isolcpus, no
> cpusets,...)?

For this test, I just compared both settings cpuset.cpus.partition=isolated|root.
There, I did not see a significant difference (but I know, RT tuning depends on a ton of things).

> 
> > The test was performed with jitterdebugger on CPUs 1-3 and the following
> cmdline:
> > rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
> > On the other cpus, stress-ng was executed to generate load.
> > [...]
> 
> > This requires cgroup.type=threaded on both cgroups and changes to the
> > application (threads have to be born in non-rt group and moved to rt-group).
> 
> But even with isolcpus the application would need to set affinity of threads to
> the selected CPUs (cf cgroup migrating). Do I miss anything?

Yes, that's true. But there are two differences (given that you use isolcpus):
1. the application only has to set the affinity for rt threads.
 Threads that do not explicitly set the affinity are automatically excluded from the isolated cores.
 Even common rt test applications like jitterdebugger do not pin their non-rt threads.
2. Threads can be started on non-rt CPUs and then bound to a specific rt CPU.
This binding can be specified before thread creation via pthread_create.
By that, you can make sure that at no point in time a thread has a "forbidden" CPU in its affinities.

With cgroup2, you cannot guarantee the second aspect, as thread creation and moving to a cgroup is not an atomic operation.
Also - please correct me if I'm wrong - you first have to create a thread before moving it into a group.
At creation time, you cannot set the final affinity mask (as you create it in the non-rt group and there the CPU is not in the cpuset.cpus).
Once you move the thread to the rt cgroup, it has a default mask and by that can be executed on other rt cores.

Best regards,
Felix

> 
> Thanks,
> Michal

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 15:21     ` Moessbauer, Felix
@ 2021-11-10 16:10       ` Marcelo Tosatti
  2021-11-10 16:14         ` Marcelo Tosatti
  2021-11-10 16:15         ` Jan Kiszka
  2021-11-10 18:15       ` Michal Koutný
  1 sibling, 2 replies; 39+ messages in thread
From: Marcelo Tosatti @ 2021-11-10 16:10 UTC (permalink / raw)
  To: Moessbauer, Felix
  Cc: Michal Koutný,
	longman, akpm, cgroups, corbet, frederic, guro, hannes,
	juri.lelli, linux-doc, linux-kernel, linux-kselftest, lizefan.x,
	pauld, peterz, shuah, tj, jan.kiszka, henning.schild

On Wed, Nov 10, 2021 at 03:21:54PM +0000, Moessbauer, Felix wrote:
> 
> 
> > -----Original Message-----
> > From: Michal Koutný <mkoutny@suse.com>
> > Sent: Wednesday, November 10, 2021 2:57 PM
> > To: Moessbauer, Felix (T RDA IOT SES-DE) <felix.moessbauer@siemens.com>
> > Cc: longman@redhat.com; akpm@linux-foundation.org;
> > cgroups@vger.kernel.org; corbet@lwn.net; frederic@kernel.org; guro@fb.com;
> > hannes@cmpxchg.org; juri.lelli@redhat.com; linux-doc@vger.kernel.org; linux-
> > kernel@vger.kernel.org; linux-kselftest@vger.kernel.org;
> > lizefan.x@bytedance.com; mtosatti@redhat.com; pauld@redhat.com;
> > peterz@infradead.org; shuah@kernel.org; tj@kernel.org; Kiszka, Jan (T RDA
> > IOT) <jan.kiszka@siemens.com>; Schild, Henning (T RDA IOT SES-DE)
> > <henning.schild@siemens.com>
> > Subject: Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type &
> > empty effecitve cpus
> > 
> > Hello.
> > 
> > On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer
> > <felix.moessbauer@siemens.com> wrote:
> > > However, I was not able to see any latency improvements when using
> > > cpuset.cpus.partition=isolated.
> > 
> > Interesting. What was the baseline against which you compared it (isolcpus, no
> > cpusets,...)?
> 
> For this test, I just compared both settings cpuset.cpus.partition=isolated|root.
> There, I did not see a significant difference (but I know, RT tuning depends on a ton of things).
> 
> > 
> > > The test was performed with jitterdebugger on CPUs 1-3 and the following
> > cmdline:
> > > rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
> > > On the other cpus, stress-ng was executed to generate load.
> > > [...]
> > 
> > > This requires cgroup.type=threaded on both cgroups and changes to the
> > > application (threads have to be born in non-rt group and moved to rt-group).
> > 
> > But even with isolcpus the application would need to set affinity of threads to
> > the selected CPUs (cf cgroup migrating). Do I miss anything?
> 
> Yes, that's true. But there are two differences (given that you use isolcpus):
> 1. the application only has to set the affinity for rt threads.
>  Threads that do not explicitly set the affinity are automatically excluded from the isolated cores.
>  Even common rt test applications like jitterdebugger do not pin their non-rt threads.
> 2. Threads can be started on non-rt CPUs and then bound to a specific rt CPU.
> This binding can be specified before thread creation via pthread_create.
> By that, you can make sure that at no point in time a thread has a "forbidden" CPU in its affinities.
> 
> With cgroup2, you cannot guarantee the second aspect, as thread creation and moving to a cgroup is not an atomic operation.
> Also - please correct me if I'm wrong - you first have to create a thread before moving it into a group.
> At creation time, you cannot set the final affinity mask (as you create it in the non-rt group and there the CPU is not in the cpuset.cpus).
> Once you move the thread to the rt cgroup, it has a default mask and by that can be executed on other rt cores.

man clone3:

       CLONE_NEWCGROUP (since Linux 4.6)
              Create  the  process  in  a  new cgroup namespace.  If this flag is not set, then (as with fork(2)) the
              process is created in the same cgroup namespaces as the calling process.

              For further information on cgroup namespaces, see cgroup_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 16:10       ` Marcelo Tosatti
@ 2021-11-10 16:14         ` Marcelo Tosatti
  2021-11-10 16:15         ` Jan Kiszka
  1 sibling, 0 replies; 39+ messages in thread
From: Marcelo Tosatti @ 2021-11-10 16:14 UTC (permalink / raw)
  To: Moessbauer, Felix
  Cc: Michal Koutný,
	longman, akpm, cgroups, corbet, frederic, guro, hannes,
	juri.lelli, linux-doc, linux-kernel, linux-kselftest, lizefan.x,
	pauld, peterz, shuah, tj, jan.kiszka, henning.schild

On Wed, Nov 10, 2021 at 01:10:20PM -0300, Marcelo Tosatti wrote:
> On Wed, Nov 10, 2021 at 03:21:54PM +0000, Moessbauer, Felix wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Michal Koutný <mkoutny@suse.com>
> > > Sent: Wednesday, November 10, 2021 2:57 PM
> > > To: Moessbauer, Felix (T RDA IOT SES-DE) <felix.moessbauer@siemens.com>
> > > Cc: longman@redhat.com; akpm@linux-foundation.org;
> > > cgroups@vger.kernel.org; corbet@lwn.net; frederic@kernel.org; guro@fb.com;
> > > hannes@cmpxchg.org; juri.lelli@redhat.com; linux-doc@vger.kernel.org; linux-
> > > kernel@vger.kernel.org; linux-kselftest@vger.kernel.org;
> > > lizefan.x@bytedance.com; mtosatti@redhat.com; pauld@redhat.com;
> > > peterz@infradead.org; shuah@kernel.org; tj@kernel.org; Kiszka, Jan (T RDA
> > > IOT) <jan.kiszka@siemens.com>; Schild, Henning (T RDA IOT SES-DE)
> > > <henning.schild@siemens.com>
> > > Subject: Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type &
> > > empty effecitve cpus
> > > 
> > > Hello.
> > > 
> > > On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer
> > > <felix.moessbauer@siemens.com> wrote:
> > > > However, I was not able to see any latency improvements when using
> > > > cpuset.cpus.partition=isolated.
> > > 
> > > Interesting. What was the baseline against which you compared it (isolcpus, no
> > > cpusets,...)?
> > 
> > For this test, I just compared both settings cpuset.cpus.partition=isolated|root.
> > There, I did not see a significant difference (but I know, RT tuning depends on a ton of things).
> > 
> > > 
> > > > The test was performed with jitterdebugger on CPUs 1-3 and the following
> > > cmdline:
> > > > rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
> > > > On the other cpus, stress-ng was executed to generate load.
> > > > [...]
> > > 
> > > > This requires cgroup.type=threaded on both cgroups and changes to the
> > > > application (threads have to be born in non-rt group and moved to rt-group).
> > > 
> > > But even with isolcpus the application would need to set affinity of threads to
> > > the selected CPUs (cf cgroup migrating). Do I miss anything?
> > 
> > Yes, that's true. But there are two differences (given that you use isolcpus):
> > 1. the application only has to set the affinity for rt threads.
> >  Threads that do not explicitly set the affinity are automatically excluded from the isolated cores.
> >  Even common rt test applications like jitterdebugger do not pin their non-rt threads.
> > 2. Threads can be started on non-rt CPUs and then bound to a specific rt CPU.
> > This binding can be specified before thread creation via pthread_create.
> > By that, you can make sure that at no point in time a thread has a "forbidden" CPU in its affinities.
> > 
> > With cgroup2, you cannot guarantee the second aspect, as thread creation and moving to a cgroup is not an atomic operation.
> > Also - please correct me if I'm wrong - you first have to create a thread before moving it into a group.
> > At creation time, you cannot set the final affinity mask (as you create it in the non-rt group and there the CPU is not in the cpuset.cpus).
> > Once you move the thread to the rt cgroup, it has a default mask and by that can be executed on other rt cores.
> 
> man clone3:
> 
>        CLONE_NEWCGROUP (since Linux 4.6)
>               Create  the  process  in  a  new cgroup namespace.  If this flag is not set, then (as with fork(2)) the
>               process is created in the same cgroup namespaces as the calling process.
> 
>               For further information on cgroup namespaces, see cgroup_namespaces(7).
> 
>               Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
> 

Err, CLONE_INTO_CGROUP.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 16:10       ` Marcelo Tosatti
  2021-11-10 16:14         ` Marcelo Tosatti
@ 2021-11-10 16:15         ` Jan Kiszka
  2021-11-10 17:29           ` Marcelo Tosatti
  2021-11-10 17:52           ` Michal Koutný
  1 sibling, 2 replies; 39+ messages in thread
From: Jan Kiszka @ 2021-11-10 16:15 UTC (permalink / raw)
  To: Marcelo Tosatti, Moessbauer, Felix
  Cc: Michal Koutný,
	longman, akpm, cgroups, corbet, frederic, guro, hannes,
	juri.lelli, linux-doc, linux-kernel, linux-kselftest, lizefan.x,
	pauld, peterz, shuah, tj, henning.schild

On 10.11.21 17:10, Marcelo Tosatti wrote:
> On Wed, Nov 10, 2021 at 03:21:54PM +0000, Moessbauer, Felix wrote:
>>
>>
>>> -----Original Message-----
>>> From: Michal Koutný <mkoutny@suse.com>
>>> Sent: Wednesday, November 10, 2021 2:57 PM
>>> To: Moessbauer, Felix (T RDA IOT SES-DE) <felix.moessbauer@siemens.com>
>>> Cc: longman@redhat.com; akpm@linux-foundation.org;
>>> cgroups@vger.kernel.org; corbet@lwn.net; frederic@kernel.org; guro@fb.com;
>>> hannes@cmpxchg.org; juri.lelli@redhat.com; linux-doc@vger.kernel.org; linux-
>>> kernel@vger.kernel.org; linux-kselftest@vger.kernel.org;
>>> lizefan.x@bytedance.com; mtosatti@redhat.com; pauld@redhat.com;
>>> peterz@infradead.org; shuah@kernel.org; tj@kernel.org; Kiszka, Jan (T RDA
>>> IOT) <jan.kiszka@siemens.com>; Schild, Henning (T RDA IOT SES-DE)
>>> <henning.schild@siemens.com>
>>> Subject: Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type &
>>> empty effecitve cpus
>>>
>>> Hello.
>>>
>>> On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer
>>> <felix.moessbauer@siemens.com> wrote:
>>>> However, I was not able to see any latency improvements when using
>>>> cpuset.cpus.partition=isolated.
>>>
>>> Interesting. What was the baseline against which you compared it (isolcpus, no
>>> cpusets,...)?
>>
>> For this test, I just compared both settings cpuset.cpus.partition=isolated|root.
>> There, I did not see a significant difference (but I know, RT tuning depends on a ton of things).
>>
>>>
>>>> The test was performed with jitterdebugger on CPUs 1-3 and the following
>>> cmdline:
>>>> rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
>>>> On the other cpus, stress-ng was executed to generate load.
>>>> [...]
>>>
>>>> This requires cgroup.type=threaded on both cgroups and changes to the
>>>> application (threads have to be born in non-rt group and moved to rt-group).
>>>
>>> But even with isolcpus the application would need to set affinity of threads to
>>> the selected CPUs (cf cgroup migrating). Do I miss anything?
>>
>> Yes, that's true. But there are two differences (given that you use isolcpus):
>> 1. the application only has to set the affinity for rt threads.
>>  Threads that do not explicitly set the affinity are automatically excluded from the isolated cores.
>>  Even common rt test applications like jitterdebugger do not pin their non-rt threads.
>> 2. Threads can be started on non-rt CPUs and then bound to a specific rt CPU.
>> This binding can be specified before thread creation via pthread_create.
>> By that, you can make sure that at no point in time a thread has a "forbidden" CPU in its affinities.
>>
>> With cgroup2, you cannot guarantee the second aspect, as thread creation and moving to a cgroup is not an atomic operation.
>> Also - please correct me if I'm wrong - you first have to create a thread before moving it into a group.
>> At creation time, you cannot set the final affinity mask (as you create it in the non-rt group and there the CPU is not in the cpuset.cpus).
>> Once you move the thread to the rt cgroup, it has a default mask and by that can be executed on other rt cores.
> 
> man clone3:
> 
>        CLONE_NEWCGROUP (since Linux 4.6)
>               Create  the  process  in  a  new cgroup namespace.  If this flag is not set, then (as with fork(2)) the
>               process is created in the same cgroup namespaces as the calling process.
> 
>               For further information on cgroup namespaces, see cgroup_namespaces(7).
> 
>               Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
> 

Is there pthread_attr_setcgroup_np()?

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 16:15         ` Jan Kiszka
@ 2021-11-10 17:29           ` Marcelo Tosatti
  2021-11-10 18:30             ` Waiman Long
  2021-11-10 17:52           ` Michal Koutný
  1 sibling, 1 reply; 39+ messages in thread
From: Marcelo Tosatti @ 2021-11-10 17:29 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Moessbauer, Felix, Michal Koutný,
	longman, akpm, cgroups, corbet, frederic, guro, hannes,
	juri.lelli, linux-doc, linux-kernel, linux-kselftest, lizefan.x,
	pauld, peterz, shuah, tj, henning.schild

On Wed, Nov 10, 2021 at 05:15:41PM +0100, Jan Kiszka wrote:
> On 10.11.21 17:10, Marcelo Tosatti wrote:
> > On Wed, Nov 10, 2021 at 03:21:54PM +0000, Moessbauer, Felix wrote:
> >>
> >>
> >>> -----Original Message-----
> >>> From: Michal Koutný <mkoutny@suse.com>
> >>> Sent: Wednesday, November 10, 2021 2:57 PM
> >>> To: Moessbauer, Felix (T RDA IOT SES-DE) <felix.moessbauer@siemens.com>
> >>> Cc: longman@redhat.com; akpm@linux-foundation.org;
> >>> cgroups@vger.kernel.org; corbet@lwn.net; frederic@kernel.org; guro@fb.com;
> >>> hannes@cmpxchg.org; juri.lelli@redhat.com; linux-doc@vger.kernel.org; linux-
> >>> kernel@vger.kernel.org; linux-kselftest@vger.kernel.org;
> >>> lizefan.x@bytedance.com; mtosatti@redhat.com; pauld@redhat.com;
> >>> peterz@infradead.org; shuah@kernel.org; tj@kernel.org; Kiszka, Jan (T RDA
> >>> IOT) <jan.kiszka@siemens.com>; Schild, Henning (T RDA IOT SES-DE)
> >>> <henning.schild@siemens.com>
> >>> Subject: Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type &
> >>> empty effecitve cpus
> >>>
> >>> Hello.
> >>>
> >>> On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer
> >>> <felix.moessbauer@siemens.com> wrote:
> >>>> However, I was not able to see any latency improvements when using
> >>>> cpuset.cpus.partition=isolated.
> >>>
> >>> Interesting. What was the baseline against which you compared it (isolcpus, no
> >>> cpusets,...)?
> >>
> >> For this test, I just compared both settings cpuset.cpus.partition=isolated|root.
> >> There, I did not see a significant difference (but I know, RT tuning depends on a ton of things).
> >>
> >>>
> >>>> The test was performed with jitterdebugger on CPUs 1-3 and the following
> >>> cmdline:
> >>>> rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
> >>>> On the other cpus, stress-ng was executed to generate load.
> >>>> [...]
> >>>
> >>>> This requires cgroup.type=threaded on both cgroups and changes to the
> >>>> application (threads have to be born in non-rt group and moved to rt-group).
> >>>
> >>> But even with isolcpus the application would need to set affinity of threads to
> >>> the selected CPUs (cf cgroup migrating). Do I miss anything?
> >>
> >> Yes, that's true. But there are two differences (given that you use isolcpus):
> >> 1. the application only has to set the affinity for rt threads.
> >>  Threads that do not explicitly set the affinity are automatically excluded from the isolated cores.
> >>  Even common rt test applications like jitterdebugger do not pin their non-rt threads.
> >> 2. Threads can be started on non-rt CPUs and then bound to a specific rt CPU.
> >> This binding can be specified before thread creation via pthread_create.
> >> By that, you can make sure that at no point in time a thread has a "forbidden" CPU in its affinities.
> >>
> >> With cgroup2, you cannot guarantee the second aspect, as thread creation and moving to a cgroup is not an atomic operation.
> >> Also - please correct me if I'm wrong - you first have to create a thread before moving it into a group.
> >> At creation time, you cannot set the final affinity mask (as you create it in the non-rt group and there the CPU is not in the cpuset.cpus).
> >> Once you move the thread to the rt cgroup, it has a default mask and by that can be executed on other rt cores.
> > 
> > man clone3:
> > 
> >        CLONE_NEWCGROUP (since Linux 4.6)
> >               Create  the  process  in  a  new cgroup namespace.  If this flag is not set, then (as with fork(2)) the
> >               process is created in the same cgroup namespaces as the calling process.
> > 
> >               For further information on cgroup namespaces, see cgroup_namespaces(7).
> > 
> >               Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
> > 
> 
> Is there pthread_attr_setcgroup_np()?
> 
> Jan

Don't know... Waiman? 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 16:15         ` Jan Kiszka
  2021-11-10 17:29           ` Marcelo Tosatti
@ 2021-11-10 17:52           ` Michal Koutný
  2021-11-10 18:04             ` Jan Kiszka
  1 sibling, 1 reply; 39+ messages in thread
From: Michal Koutný @ 2021-11-10 17:52 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Marcelo Tosatti, Moessbauer, Felix, longman, akpm, cgroups,
	corbet, frederic, guro, hannes, juri.lelli, linux-doc,
	linux-kernel, linux-kselftest, lizefan.x, pauld, peterz, shuah,
	tj, henning.schild

On Wed, Nov 10, 2021 at 05:15:41PM +0100, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> Is there pthread_attr_setcgroup_np()?

If I'm not mistaken the 'p' in pthreads stands for POSIX and cgroups are
Linux specific so you won't find that (unless you implement that
yourself). ¯\_(ツ)_/¯

Michal

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 17:52           ` Michal Koutný
@ 2021-11-10 18:04             ` Jan Kiszka
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kiszka @ 2021-11-10 18:04 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Marcelo Tosatti, Moessbauer, Felix, longman, akpm, cgroups,
	corbet, frederic, guro, hannes, juri.lelli, linux-doc,
	linux-kernel, linux-kselftest, lizefan.x, pauld, peterz, shuah,
	tj, henning.schild

On 10.11.21 18:52, Michal Koutný wrote:
> On Wed, Nov 10, 2021 at 05:15:41PM +0100, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> Is there pthread_attr_setcgroup_np()?
> 
> If I'm not mistaken the 'p' in pthreads stands for POSIX and cgroups are
> Linux specific so you won't find that (unless you implement that
> yourself). ¯\_(ツ)_/¯
> 

I know what it stands for :). But I don't want to re-implement pthreads
just to have a single creation-time configurable injected. Neither would
developer of standard application, e.g. libvirt for the rt-kvm special
case while most of their use cases are fine with regular pthread APIs. I
think there is also a demand for a programming model that fits into
existing ones.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 15:21     ` Moessbauer, Felix
  2021-11-10 16:10       ` Marcelo Tosatti
@ 2021-11-10 18:15       ` Michal Koutný
  1 sibling, 0 replies; 39+ messages in thread
From: Michal Koutný @ 2021-11-10 18:15 UTC (permalink / raw)
  To: Moessbauer, Felix
  Cc: longman, akpm, cgroups, corbet, frederic, guro, hannes,
	juri.lelli, linux-doc, linux-kernel, linux-kselftest, lizefan.x,
	mtosatti, pauld, peterz, shuah, tj, jan.kiszka, henning.schild

On Wed, Nov 10, 2021 at 03:21:54PM +0000, "Moessbauer, Felix" <felix.moessbauer@siemens.com> wrote:
> 2. Threads can be started on non-rt CPUs and then bound to a specific rt CPU.
> This binding can be specified before thread creation via pthread_create.
> By that, you can make sure that at no point in time a thread has a
> "forbidden" CPU in its affinities.

It should boil down to some clone$version(2) and sched_setaffinity(2)
calls, so strictly speaking even with pthread_create(3) the thread is
shortly running with the parent's affinity.

> With cgroup2, you cannot guarantee the second aspect, as thread
> creation and moving to a cgroup is not an atomic operation.

As suggested by others, CLONE_INTO_CGROUP (into cpuset cgroup) can
actually "hide" the migration into the clone3() call.

> At creation time, you cannot set the final affinity mask (as you
> create it in the non-rt group and there the CPU is not in the
> cpuset.cpus).
> Once you move the thread to the rt cgroup, it has a default mask and
> by that can be executed on other rt cores.

Good point. Perhaps you could work this around by having another level
of (non-root partition) cpuset cgroups for individual CPUs? (Maybe
there's more clever approach, this is just first to come into my mind.)

Michal

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-11-10 17:29           ` Marcelo Tosatti
@ 2021-11-10 18:30             ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-11-10 18:30 UTC (permalink / raw)
  To: Marcelo Tosatti, Jan Kiszka
  Cc: Moessbauer, Felix, Michal Koutný,
	akpm, cgroups, corbet, frederic, guro, hannes, juri.lelli,
	linux-doc, linux-kernel, linux-kselftest, lizefan.x, pauld,
	peterz, shuah, tj, henning.schild


On 11/10/21 12:29, Marcelo Tosatti wrote:
> On Wed, Nov 10, 2021 at 05:15:41PM +0100, Jan Kiszka wrote:
>> On 10.11.21 17:10, Marcelo Tosatti wrote:
>>> On Wed, Nov 10, 2021 at 03:21:54PM +0000, Moessbauer, Felix wrote:
>>>>
>>>>> -----Original Message-----
>>>>> From: Michal Koutný <mkoutny@suse.com>
>>>>> Sent: Wednesday, November 10, 2021 2:57 PM
>>>>> To: Moessbauer, Felix (T RDA IOT SES-DE) <felix.moessbauer@siemens.com>
>>>>> Cc: longman@redhat.com; akpm@linux-foundation.org;
>>>>> cgroups@vger.kernel.org; corbet@lwn.net; frederic@kernel.org; guro@fb.com;
>>>>> hannes@cmpxchg.org; juri.lelli@redhat.com; linux-doc@vger.kernel.org; linux-
>>>>> kernel@vger.kernel.org; linux-kselftest@vger.kernel.org;
>>>>> lizefan.x@bytedance.com; mtosatti@redhat.com; pauld@redhat.com;
>>>>> peterz@infradead.org; shuah@kernel.org; tj@kernel.org; Kiszka, Jan (T RDA
>>>>> IOT) <jan.kiszka@siemens.com>; Schild, Henning (T RDA IOT SES-DE)
>>>>> <henning.schild@siemens.com>
>>>>> Subject: Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type &
>>>>> empty effecitve cpus
>>>>>
>>>>> Hello.
>>>>>
>>>>> On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer
>>>>> <felix.moessbauer@siemens.com> wrote:
>>>>>> However, I was not able to see any latency improvements when using
>>>>>> cpuset.cpus.partition=isolated.
>>>>> Interesting. What was the baseline against which you compared it (isolcpus, no
>>>>> cpusets,...)?
>>>> For this test, I just compared both settings cpuset.cpus.partition=isolated|root.
>>>> There, I did not see a significant difference (but I know, RT tuning depends on a ton of things).
>>>>
>>>>>> The test was performed with jitterdebugger on CPUs 1-3 and the following
>>>>> cmdline:
>>>>>> rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
>>>>>> On the other cpus, stress-ng was executed to generate load.
>>>>>> [...]
>>>>>> This requires cgroup.type=threaded on both cgroups and changes to the
>>>>>> application (threads have to be born in non-rt group and moved to rt-group).
>>>>> But even with isolcpus the application would need to set affinity of threads to
>>>>> the selected CPUs (cf cgroup migrating). Do I miss anything?
>>>> Yes, that's true. But there are two differences (given that you use isolcpus):
>>>> 1. the application only has to set the affinity for rt threads.
>>>>   Threads that do not explicitly set the affinity are automatically excluded from the isolated cores.
>>>>   Even common rt test applications like jitterdebugger do not pin their non-rt threads.
>>>> 2. Threads can be started on non-rt CPUs and then bound to a specific rt CPU.
>>>> This binding can be specified before thread creation via pthread_create.
>>>> By that, you can make sure that at no point in time a thread has a "forbidden" CPU in its affinities.
>>>>
>>>> With cgroup2, you cannot guarantee the second aspect, as thread creation and moving to a cgroup is not an atomic operation.
>>>> Also - please correct me if I'm wrong - you first have to create a thread before moving it into a group.
>>>> At creation time, you cannot set the final affinity mask (as you create it in the non-rt group and there the CPU is not in the cpuset.cpus).
>>>> Once you move the thread to the rt cgroup, it has a default mask and by that can be executed on other rt cores.
>>> man clone3:
>>>
>>>         CLONE_NEWCGROUP (since Linux 4.6)
>>>                Create  the  process  in  a  new cgroup namespace.  If this flag is not set, then (as with fork(2)) the
>>>                process is created in the same cgroup namespaces as the calling process.
>>>
>>>                For further information on cgroup namespaces, see cgroup_namespaces(7).
>>>
>>>                Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
>>>
>> Is there pthread_attr_setcgroup_np()?
>>
>> Jan
> Don't know... Waiman?

I don't think there is such libpthread call yet.

-Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-10-18 14:36 ` [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
@ 2021-11-15 19:31   ` Michal Koutný
  2021-11-15 20:11     ` Tejun Heo
  2021-11-15 21:10     ` Waiman Long
  0 siblings, 2 replies; 39+ messages in thread
From: Michal Koutný @ 2021-11-15 19:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti


Hello.

On Mon, Oct 18, 2021 at 10:36:18AM -0400, Waiman Long <longman@redhat.com> wrote:
> +	When set to "isolated", the CPUs in that partition root will
> +	be in an isolated state without any load balancing from the
> +	scheduler.  Tasks in such a partition must be explicitly bound
> +	to each individual CPU.

This sounds reasonable but it seems to have some usability issues as was
raised in another thread [1]. (I could only think of the workaround of
single-cpu cgroup leaves + CLONE_INTO_CGROUP.)

TL;DR Do whatever you find suitable but (re)consider sticking to the
delegation principle (making hotplug and ancestor changes equal).

Now to the constraints and partition setups. I think it's useful to have
a model with which the implementation can be compared with.
I tried to condense some "simple rules" from the descriptions you posted
in v8 plus your response to my remarks in v7 [2]. These should only be
the "validity conditions", not "transition conditions".

## Validity conditions

For simplification, there's a condition called 'degraded' that tells
whether a cpuset can host tasks (with the given config) that expands to
two predicates:

	degraded := cpus.internal_effective == ø && has_tasks
	valid_root := !degraded && cpus_exclusive && parent.valid_root
	(valid_member := !degraded)

with a helping predicate
	cpus_exclusive := cpus not shared by a sibling

The effective CPUs basically combine configured+available CPUs

	cpus.internal_effective := (cpus ∩ parent.cpus ∩ online_cpus) - passed

where
	passed := union of children cpus whose partition is not member

Finally, to handle the degraded cpusets gracefully, we define

	if (!degraded)
		cpus.effective := cpus.internal_effective 
	else
		cpus.effective := parent.cpus.effective

(In cases when there's no parent, we replace its cpus with online_cpus.)

---

I'll try applying these conditions to your description.

> +
> +	"cpuset.cpus" must always be set up first before enabling
> +	partition.

This is just a transition condition.

>       Unlike "member" whose "cpuset.cpus.effective" can
> +	contain CPUs not in "cpuset.cpus", this can never happen with a
> +	valid partition root. In other words, "cpuset.cpus.effective"
> +	is always a subset of "cpuset.cpus" for a valid partition root.

IIUC this refers to the cgroup that is 'degraded'. (The consequences for
a valid partition root follow from valid_root definition above.)

> +
> +	When a parent partition root cannot exclusively grant any of
> +	the CPUs specified in "cpuset.cpus", "cpuset.cpus.effective"
> +	becomes empty.

This sounds too strict to me, perhaps you meant 'cannot grant _all_ of
the CPUs'?

>       If there are tasks in the partition root, the
> +	partition root becomes invalid and "cpuset.cpus.effective"
> +	is reset to that of the nearest non-empty ancestor.

This is captured in the definition of 'degraded'.

> +
> +        Note that a task cannot be moved to a croup with empty
> +        "cpuset.cpus.effective".

A transition condition. (Makes sense.)

[With the validity conditions above, it's possible to have 'valid_root'
with empty cpus (hence also empty cpus.internal_effective) if there are
no tasks in there. The transition conditions so far prevented this
corner case.]

> +	There are additional constraints on where a partition root can
> +	be enabled ("root" or "isolated").  It can only be enabled in
> +	a cgroup if all the following conditions are met.

I think the enablement (aka rewriting cpuset.cpus.partition) could be
always possible but it'd result in "root invalid (...)" if the resulting
config doesn't meet the validity condition.

> +
> +	1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are
> +	   not shared by any of its siblings.

The emptiness here is a judgement call (in my formulation of the
conditions it seemed simpler to allow empty cpus.internal_effective with
no tasks).

> +	2) The parent cgroup is a valid partition root.

Captured in the valid_root definition.

> +	3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus".

This is unnecessary strictness. Allow such config,
cpus.internal_effective still can't be more than parent's cpuset.cpus.
(Or do you have a reason to discard such configs?)

> +	4) There is no child cgroups with cpuset enabled.  This avoids
> +	   cpu migrations of multiple cgroups simultaneously which can
> +	   be problematic.

A transition condition (i.e. not relevant to validity conditions).

> +	Once becoming a partition root, changes to "cpuset.cpus"
> +	is generally allowed as long as the cpu list is exclusive,
> +	non-empty and is a superset of children's cpu lists.

Any changes should be allowed otherwise it denies the delegation
principle of v2 (IOW a parent should be able to preempt CPUs given to
chilren previously and not be denied because of them).

(If the change results in failed validity condition the cgroup of course
cannot be be a valid_root anymore.)

> +        The constraints of a valid partition root are as follows:
> +
> +        1) The parent cgroup is a valid partition root.
> +        2) "cpuset.cpus.effective" is a subset of "cpuset.cpus"
> +        3) "cpuset.cpus.effective" is non-empty when there are tasks
> +           in the partition.

(This seem to miss the sibling exclusivity condition.)
Here I'd simply paste the "Validity conditions" specified above instead.

> +        Changing a partition root to "member" is always allowed.
> +        If there are child partition roots underneath it, however,
> +        they will be forced to be switched back to "member" too and
> +        lose their partitions. So care must be taken to double check
> +        for this condition before disabling a partition root.

(Or is this how delegation is intended?) However, AFAICS, parent still
can't remove cpuset.cpus even when the child is a "member". Otherwise,
I agree with the back-switch.


> +	Setting a cgroup to a valid partition root will take the CPUs
> +	away from the effective CPUs of the parent partition.

Captured in the definition of cpus.internal_effective.

> +	A valid parent partition may distribute out all its CPUs to
> +	its child partitions as long as it is not the root cgroup as
> +	we need some house-keeping CPUs in the root cgroup.

This actually applies to any root partition that's supposed to host
tasks. (IOW, 'valid_root' cannot be 'degraded'.)

> +	An invalid partition is not a real partition even though some
> +	internal states may still be kept.

Tautology? (Or new definition of "real".)

> +
> +	An invalid partition root can be reverted back to a real
> +	partition root if none of the constraints of a valid partition
> +        root are violated.

Yes. (Also tautological.)

Anyway, as I said above, I just tried to formulate the model for clearer
understanding and the implementation may introduce transition
constraints but it'd be good to always have the simple rules to tell
what's a valid root in the tree and what's not.

Regards,
Michal

[1] https://lore.kernel.org/r/AM9PR10MB4869C14EAE01B87C0037BF6A89939@AM9PR10MB4869.EURPRD10.PROD.OUTLOOK.COM/
[2] https://lore.kernel.org/lkml/5eacfdcc-148b-b599-3111-4f2971e7ddc0@redhat.com/


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-11-15 19:31   ` Michal Koutný
@ 2021-11-15 20:11     ` Tejun Heo
  2021-11-15 21:27       ` Waiman Long
  2021-11-15 21:10     ` Waiman Long
  1 sibling, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2021-11-15 20:11 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Waiman Long, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

Hello,

On Mon, Nov 15, 2021 at 08:31:22PM +0100, Michal Koutný wrote:
> Now to the constraints and partition setups. I think it's useful to have
> a model with which the implementation can be compared with.
> I tried to condense some "simple rules" from the descriptions you posted
> in v8 plus your response to my remarks in v7 [2]. These should only be
> the "validity conditions", not "transition conditions".

FWIW, my opinion is pretty much in line with Michal's in this regard. Other
than that, everything looks pretty good to me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-11-15 19:31   ` Michal Koutný
  2021-11-15 20:11     ` Tejun Heo
@ 2021-11-15 21:10     ` Waiman Long
  2021-11-16 17:54       ` Michal Koutný
  1 sibling, 1 reply; 39+ messages in thread
From: Waiman Long @ 2021-11-15 21:10 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti


On 11/15/21 14:31, Michal Koutný wrote:
> Hello.
>
> On Mon, Oct 18, 2021 at 10:36:18AM -0400, Waiman Long <longman@redhat.com> wrote:
>> +	When set to "isolated", the CPUs in that partition root will
>> +	be in an isolated state without any load balancing from the
>> +	scheduler.  Tasks in such a partition must be explicitly bound
>> +	to each individual CPU.
> This sounds reasonable but it seems to have some usability issues as was
> raised in another thread [1]. (I could only think of the workaround of
> single-cpu cgroup leaves + CLONE_INTO_CGROUP.)

It can be a problem when one is trying to move from one cgroup to 
another cgroup with non-overlapping cpus laterally. However, if a task 
is initially from a parent cgroup with affinity mask that include cpus 
in the isolated child cgroup, I believe it should be able to move to the 
isolated child cgroup without problem. Otherwise, it is a bug that needs 
to be fixed.


>
> TL;DR Do whatever you find suitable but (re)consider sticking to the
> delegation principle (making hotplug and ancestor changes equal).
>
> Now to the constraints and partition setups. I think it's useful to have
> a model with which the implementation can be compared with.
> I tried to condense some "simple rules" from the descriptions you posted
> in v8 plus your response to my remarks in v7 [2]. These should only be
> the "validity conditions", not "transition conditions".
>
> ## Validity conditions
>
> For simplification, there's a condition called 'degraded' that tells
> whether a cpuset can host tasks (with the given config) that expands to
> two predicates:
>
> 	degraded := cpus.internal_effective == ø && has_tasks
> 	valid_root := !degraded && cpus_exclusive && parent.valid_root
> 	(valid_member := !degraded)
>
> with a helping predicate
> 	cpus_exclusive := cpus not shared by a sibling
>
> The effective CPUs basically combine configured+available CPUs
>
> 	cpus.internal_effective := (cpus ∩ parent.cpus ∩ online_cpus) - passed
>
> where
> 	passed := union of children cpus whose partition is not member
>
> Finally, to handle the degraded cpusets gracefully, we define
>
> 	if (!degraded)
> 		cpus.effective := cpus.internal_effective
> 	else
> 		cpus.effective := parent.cpus.effective
>
> (In cases when there's no parent, we replace its cpus with online_cpus.)
>
> ---
>
> I'll try applying these conditions to your description.
>
>> +
>> +	"cpuset.cpus" must always be set up first before enabling
>> +	partition.
> This is just a transition condition.
>
>>        Unlike "member" whose "cpuset.cpus.effective" can
>> +	contain CPUs not in "cpuset.cpus", this can never happen with a
>> +	valid partition root. In other words, "cpuset.cpus.effective"
>> +	is always a subset of "cpuset.cpus" for a valid partition root.
> IIUC this refers to the cgroup that is 'degraded'. (The consequences for
> a valid partition root follow from valid_root definition above.)
>
>> +
>> +	When a parent partition root cannot exclusively grant any of
>> +	the CPUs specified in "cpuset.cpus", "cpuset.cpus.effective"
>> +	becomes empty.
> This sounds too strict to me, perhaps you meant 'cannot grant _all_ of
> the CPUs'?
I think the wording may be confusing. What I meant is none of the 
requested cpu can be granted. So if there is at least one granted, the 
effective cpus won't be empty.
>>        If there are tasks in the partition root, the
>> +	partition root becomes invalid and "cpuset.cpus.effective"
>> +	is reset to that of the nearest non-empty ancestor.
> This is captured in the definition of 'degraded'.
>
>> +
>> +        Note that a task cannot be moved to a croup with empty
>> +        "cpuset.cpus.effective".
> A transition condition. (Makes sense.)
>
> [With the validity conditions above, it's possible to have 'valid_root'
> with empty cpus (hence also empty cpus.internal_effective) if there are
> no tasks in there. The transition conditions so far prevented this
> corner case.]
>
>> +	There are additional constraints on where a partition root can
>> +	be enabled ("root" or "isolated").  It can only be enabled in
>> +	a cgroup if all the following conditions are met.
> I think the enablement (aka rewriting cpuset.cpus.partition) could be
> always possible but it'd result in "root invalid (...)" if the resulting
> config doesn't meet the validity condition.
>
>> +
>> +	1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are
>> +	   not shared by any of its siblings.
> The emptiness here is a judgement call (in my formulation of the
> conditions it seemed simpler to allow empty cpus.internal_effective with
> no tasks).
There are more constraints in enabling a partition. Once it is enabled, 
there will be less constraints to maintain its validity.
>
>> +	2) The parent cgroup is a valid partition root.
> Captured in the valid_root definition.
>
>> +	3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus".
> This is unnecessary strictness. Allow such config,
> cpus.internal_effective still can't be more than parent's cpuset.cpus.
> (Or do you have a reason to discard such configs?)
>
>> +	4) There is no child cgroups with cpuset enabled.  This avoids
>> +	   cpu migrations of multiple cgroups simultaneously which can
>> +	   be problematic.
> A transition condition (i.e. not relevant to validity conditions).
>
>> +	Once becoming a partition root, changes to "cpuset.cpus"
>> +	is generally allowed as long as the cpu list is exclusive,
>> +	non-empty and is a superset of children's cpu lists.
> Any changes should be allowed otherwise it denies the delegation
> principle of v2 (IOW a parent should be able to preempt CPUs given to
> chilren previously and not be denied because of them).
>
> (If the change results in failed validity condition the cgroup of course
> cannot be be a valid_root anymore.)
>
>> +        The constraints of a valid partition root are as follows:
>> +
>> +        1) The parent cgroup is a valid partition root.
>> +        2) "cpuset.cpus.effective" is a subset of "cpuset.cpus"
>> +        3) "cpuset.cpus.effective" is non-empty when there are tasks
>> +           in the partition.
> (This seem to miss the sibling exclusivity condition.)
> Here I'd simply paste the "Validity conditions" specified above instead.
You currently cannot make change to cpuset.cpus that violates the cpu 
exclusivity rule. The above constraints will not disallow you to make 
the change. They just affect the validity of the partition root.
>
>> +        Changing a partition root to "member" is always allowed.
>> +        If there are child partition roots underneath it, however,
>> +        they will be forced to be switched back to "member" too and
>> +        lose their partitions. So care must be taken to double check
>> +        for this condition before disabling a partition root.
> (Or is this how delegation is intended?) However, AFAICS, parent still
> can't remove cpuset.cpus even when the child is a "member". Otherwise,
> I agree with the back-switch.
There are only 2 possibilities here. Either we force the child 
partitions to be become members or invalid partition root. The purpose 
of invalid partition root is actually a transient state which can be 
recovered in some way to make the partition again. However, changing a 
parent partition root to member breaks the possibility of recovering 
later. That is why I think it is more sensible to force those child 
partitions to become members.
>
>
>> +	Setting a cgroup to a valid partition root will take the CPUs
>> +	away from the effective CPUs of the parent partition.
> Captured in the definition of cpus.internal_effective.
>
>> +	A valid parent partition may distribute out all its CPUs to
>> +	its child partitions as long as it is not the root cgroup as
>> +	we need some house-keeping CPUs in the root cgroup.
> This actually applies to any root partition that's supposed to host
> tasks. (IOW, 'valid_root' cannot be 'degraded'.)
>
>> +	An invalid partition is not a real partition even though some
>> +	internal states may still be kept.
> Tautology? (Or new definition of "real".)
>
>> +
>> +	An invalid partition root can be reverted back to a real
>> +	partition root if none of the constraints of a valid partition
>> +        root are violated.
> Yes. (Also tautological.)
>
> Anyway, as I said above, I just tried to formulate the model for clearer
> understanding and the implementation may introduce transition
> constraints but it'd be good to always have the simple rules to tell
> what's a valid root in the tree and what's not.

Thanks for analyzing each statements for their validity. I will try to 
improve it to make it easier to understand.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-11-15 20:11     ` Tejun Heo
@ 2021-11-15 21:27       ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-11-15 21:27 UTC (permalink / raw)
  To: Tejun Heo, Michal Koutný
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti


On 11/15/21 15:11, Tejun Heo wrote:
> Hello,
>
> On Mon, Nov 15, 2021 at 08:31:22PM +0100, Michal Koutný wrote:
>> Now to the constraints and partition setups. I think it's useful to have
>> a model with which the implementation can be compared with.
>> I tried to condense some "simple rules" from the descriptions you posted
>> in v8 plus your response to my remarks in v7 [2]. These should only be
>> the "validity conditions", not "transition conditions".
> FWIW, my opinion is pretty much in line with Michal's in this regard. Other
> than that, everything looks pretty good to me.

Yes, I am going to streamline the documentation as suggested to make it 
easier to understand.

Coding-wise, do you have other changes you want me to make?

Thanks,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-11-15 21:10     ` Waiman Long
@ 2021-11-16 17:54       ` Michal Koutný
  2021-11-30 15:35         ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Koutný @ 2021-11-16 17:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

On Mon, Nov 15, 2021 at 04:10:29PM -0500, Waiman Long <longman@redhat.com> wrote:
> > On Mon, Oct 18, 2021 at 10:36:18AM -0400, Waiman Long <longman@redhat.com> wrote:
> > > +	scheduler.  Tasks in such a partition must be explicitly bound
> > > +	to each individual CPU.
> [...]
> 
> It can be a problem when one is trying to move from one cgroup to another
> cgroup with non-overlapping cpus laterally. However, if a task is initially
> from a parent cgroup with affinity mask that include cpus in the isolated
> child cgroup, I believe it should be able to move to the isolated child
> cgroup without problem. Otherwise, it is a bug that needs to be fixed.

app_root	cpuset.cpus=0-3
`- non_rt	cpuset.cpus=0-1	cpuset.cpus.partition=member
`- rt		cpuset.cpus=2-3	cpuset.cpus.partition=isolated

The app_root would have cpuset.cpus.effective=0-1 so even the task in
app_root can't sched_setaffinity() to cpus 2-3.
But AFAICS, the migration calls set_cpus_allowed_ptr() anyway, so the
task in the isolated partition needn't to bind explicitly with
sched_setaffinity(). (It'd have two cpus available, so one more
sched_setaffinity() or migration into a single-cpu list is desirable.)

All in all, I think the behavior is OK and the explicit binding of tasks
in an isolated cpuset is optional (not a must as worded currently).


> I think the wording may be confusing. What I meant is none of the requested
> cpu can be granted. So if there is at least one granted, the effective cpus
> won't be empty.

Ack.

> You currently cannot make change to cpuset.cpus that violates the cpu
> exclusivity rule. The above constraints will not disallow you to make the
> change. They just affect the validity of the partition root.

Sibling exclusivity should be a validity condition regardless of whether
transition is allowed or not. (At least it looks simpler to me.)


> > > +        Changing a partition root to "member" is always allowed.
> > > +        If there are child partition roots underneath it, however,
> > > +        they will be forced to be switched back to "member" too and
> > > +        lose their partitions. So care must be taken to double check
> > > +        for this condition before disabling a partition root.
> > (Or is this how delegation is intended?) However, AFAICS, parent still
> > can't remove cpuset.cpus even when the child is a "member". Otherwise,
> > I agree with the back-switch.
> There are only 2 possibilities here. Either we force the child partitions to
> be become members or invalid partition root.

My point here was mostly about preempting the cpus (as a v2 specific
feature). (I'm rather indifferent whether children turn into invalid
roots or members.)

Thanks,
Michal

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-11-16 17:54       ` Michal Koutný
@ 2021-11-30 15:35         ` Waiman Long
  2021-11-30 17:11           ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2021-11-30 15:35 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

On 11/16/21 12:54, Michal Koutný wrote:
> On Mon, Nov 15, 2021 at 04:10:29PM -0500, Waiman Long <longman@redhat.com> wrote:
>>> On Mon, Oct 18, 2021 at 10:36:18AM -0400, Waiman Long <longman@redhat.com> wrote:
>>>> +	scheduler.  Tasks in such a partition must be explicitly bound
>>>> +	to each individual CPU.
>>>> [...]
>>>>
>>>> It can be a problem when one is trying to move from one cgroup to another
>>>> cgroup with non-overlapping cpus laterally. However, if a task is initially
>>>> from a parent cgroup with affinity mask that include cpus in the isolated
>>>> child cgroup, I believe it should be able to move to the isolated child
>>>> cgroup without problem. Otherwise, it is a bug that needs to be fixed.
> app_root	cpuset.cpus=0-3
> `- non_rt	cpuset.cpus=0-1	cpuset.cpus.partition=member
> `- rt		cpuset.cpus=2-3	cpuset.cpus.partition=isolated
>
> The app_root would have cpuset.cpus.effective=0-1 so even the task in
> app_root can't sched_setaffinity() to cpus 2-3.
> But AFAICS, the migration calls set_cpus_allowed_ptr() anyway, so the
> task in the isolated partition needn't to bind explicitly with
> sched_setaffinity(). (It'd have two cpus available, so one more
> sched_setaffinity() or migration into a single-cpu list is desirable.)
>
> All in all, I think the behavior is OK and the explicit binding of tasks
> in an isolated cpuset is optional (not a must as worded currently).
>
>
>> I think the wording may be confusing. What I meant is none of the requested
>> cpu can be granted. So if there is at least one granted, the effective cpus
>> won't be empty.
> Ack.
>
>> You currently cannot make change to cpuset.cpus that violates the cpu
>> exclusivity rule. The above constraints will not disallow you to make the
>> change. They just affect the validity of the partition root.
> Sibling exclusivity should be a validity condition regardless of whether
> transition is allowed or not. (At least it looks simpler to me.)
>
>
>>>> +        Changing a partition root to "member" is always allowed.
>>>> +        If there are child partition roots underneath it, however,
>>>> +        they will be forced to be switched back to "member" too and
>>>> +        lose their partitions. So care must be taken to double check
>>>> +        for this condition before disabling a partition root.
>>> (Or is this how delegation is intended?) However, AFAICS, parent still
>>> can't remove cpuset.cpus even when the child is a "member". Otherwise,
>>> I agree with the back-switch.
>> There are only 2 possibilities here. Either we force the child partitions to
>> be become members or invalid partition root.
> My point here was mostly about preempting the cpus (as a v2 specific
> feature). (I'm rather indifferent whether children turn into invalid
> roots or members.)

Below is my latest iterations of the cpuset.cpus.partition 
documentation. If there is no objection or other suggestion for 
improvement, I am going to send out another iteration of the patch 
series with the updated documentation.

Cheers,
Longman

--------------------------------------------------------------

   cpuset.cpus.partition
     A read-write single value file which exists on non-root
     cpuset-enabled cgroups.  This flag is owned by the parent cgroup
     and is not delegatable.

     It accepts only the following input values when written to.

       ========    ================================
       "member"    Non-root member of a partition
       "root"    Partition root
       "isolated"    Partition root without load balancing
       ========    ================================

     The root cgroup is always a partition root and its state
     cannot be changed.  All other non-root cgroups start out as
     "member".

     When set to "root", the current cgroup is the root of a new
     partition or scheduling domain that comprises itself and
     all its descendants except those that are separate partition
     roots themselves and their descendants.

     The value shown in "cpuset.cpus.effective" of a partition root is
     the CPUs that the parent partition root can dedicate to the new
     partition root.  They are subtracted from "cpuset.cpus.effective"
     of the parent and may be different from "cpuset.cpus"

     When set to "isolated", the CPUs in that partition root will
     be in an isolated state without any load balancing from the
     scheduler.  Tasks placed in such a partition with multiple
     CPUs should be carefully distributed and bound to each of the
     individual CPUs for optimal performance.

     A partition root ("root" or "isolated") can be in one of the
     two possible states - valid or invalid.  An invalid partition
     root is in a degraded state where some state information are
     retained, but behaves more like a "member".

     On read, the "cpuset.cpus.partition" file can show the following
     values.

       ======================    ==============================
       "member"            Non-root member of a partition
       "root"            Partition root
       "isolated"            Partition root without load balancing
       "root invalid (<reason>)"    Invalid partition root
       ======================    ==============================

     In the case of an invalid partition root, a descriptive string on
     why the partition is invalid is included within parentheses.

     Almost all possible state transitions among "member", valid
     and invalid partition roots are allowed except from "member"
     to invalid partition root.

     Before the "member" to partition root transition can happen,
     the following conditions must be met or the transition will
     not be allowed.

     1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are
        not shared by any of its siblings.
     2) The parent cgroup is a valid partition root.
     3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus".
     4) There is no child cgroups with cpuset enabled.  This avoids
        cpu migrations of multiple cgroups simultaneously which can
        be problematic.

     Once becoming a partition root, the following two rules restrict
     what changes can be made to "cpuset.cpus".

     1) The value must be exclusive.
     2) If child cpusets exist, the value must be a superset of what
        are defined in the child cpusets.

     The second rule applies even for "member". Other changes to
     "cpuset.cpus" that do not violate the above rules are always
     allowed.

     External events like hotplug or inappropriate changes to
     "cpuset.cpus" can cause a valid partition root to become invalid.
     Besides the constraints on changing "cpuset.cpus" listed above,
     the other conditions required to maintain the validity of a
     partition root are as follows:

     1) The parent cgroup is a valid partition root.
     2) If "cpuset.cpus.effective" is empty, the partition must have
        no task associated with it. Otherwise, the partition becomes
        invalid and "cpuset.cpus.effective" will fall back to that
        of the nearest non-empty ancestor.

     A corollary of a valid partition root is that
     "cpuset.cpu.effective" is always a subset of "cpuset.cpus".
     Note that a task cannot be moved to a cgroup with empty
     "cpuset.cpus.effective".

     Changing a partition root (valid or invalid) to "member" is
     always allowed.  If there are child partition roots underneath
     it, however, they will be forced to be switched back to "member"
     too and lose their partitions. So care must be taken to double
     check for this condition before disabling a partition root.

     A valid parent partition may distribute out all its CPUs to
     its child partitions as long as it is not the root cgroup and
     there is no task associated with it.

     An invalid partition root can be reverted back to a valid one
     if none of the validity constraints of a valid partition root
     are violated.

     Poll and inotify events are triggered whenever the state of
     "cpuset.cpus.partition" changes.  That includes changes caused by
     write to "cpuset.cpus.partition", cpu hotplug and other changes
     that make the partition invalid.  This will allow user space
     agents to monitor unexpected changes to "cpuset.cpus.partition"
     without the need to do continuous polling.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-11-30 15:35         ` Waiman Long
@ 2021-11-30 17:11           ` Tejun Heo
  2021-12-01  3:56             ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2021-11-30 17:11 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti

Hello, Waiman.

On Tue, Nov 30, 2021 at 10:35:19AM -0500, Waiman Long wrote:
>     On read, the "cpuset.cpus.partition" file can show the following
>     values.
> 
>       ======================    ==============================
>       "member"            Non-root member of a partition
>       "root"            Partition root
>       "isolated"            Partition root without load balancing
>       "root invalid (<reason>)"    Invalid partition root
>       ======================    ==============================

What happens if an isolated domain becomes invalid and then valid again due
to cpu hotplug? Does it go "root invalid" and then back to "isolated"?

...
>     Before the "member" to partition root transition can happen,
>     the following conditions must be met or the transition will
>     not be allowed.
> 
>     1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are
>        not shared by any of its siblings.
>     2) The parent cgroup is a valid partition root.
>     3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus".
>     4) There is no child cgroups with cpuset enabled.  This avoids
>        cpu migrations of multiple cgroups simultaneously which can
>        be problematic.

So, I still have a hard time justifying the above restrictions. 1) can be
broken through hotplug anyway. 2) can be broken by the parent switching to
member. 3) would mean that we'd need to restrict parent's config changes
depending on what children are doing. 4) is more understandable but it's an
implementation detail that we can address in the future.

>     Once becoming a partition root, the following two rules restrict
>     what changes can be made to "cpuset.cpus".
> 
>     1) The value must be exclusive.
>     2) If child cpusets exist, the value must be a superset of what
>        are defined in the child cpusets.
>
>     The second rule applies even for "member". Other changes to
>     "cpuset.cpus" that do not violate the above rules are always
>     allowed.

While it isn't necessarily tied to this series, it's a big no-no to restrict
what a parent can do depending on what its descendants are doing. A cgroup
higher up in the hierarchy should be able to change configuration however it
sees fit as deligation breaks down otherwise.

Maybe you can argue that cpuset is special and shouldn't be subject to such
convention but I can't see strong enough justifications especially given
that most of these restrictions can be broken by hotplug operations anyway
and thus need code to handle those situations.

>     Changing a partition root (valid or invalid) to "member" is
>     always allowed.  If there are child partition roots underneath
>     it, however, they will be forced to be switched back to "member"
>     too and lose their partitions. So care must be taken to double
>     check for this condition before disabling a partition root.

Wouldn't it make more sense for them to retain their configuration and turn
invalid? Why is this special?

>     A valid parent partition may distribute out all its CPUs to
>     its child partitions as long as it is not the root cgroup and
>     there is no task associated with it.

A valid parent partition which isn't root never has tasks in them to begin
with.

>     An invalid partition root can be reverted back to a valid one
>     if none of the validity constraints of a valid partition root
>     are violated.
> 
>     Poll and inotify events are triggered whenever the state of
>     "cpuset.cpus.partition" changes.  That includes changes caused by
>     write to "cpuset.cpus.partition", cpu hotplug and other changes
>     that make the partition invalid.  This will allow user space
>     agents to monitor unexpected changes to "cpuset.cpus.partition"
>     without the need to do continuous polling.

Unfortunately, my sense is still that both the restrictions and behaviors
are pretty arbitrary. I can somewhat see how the restrictions may make sense
in a specific frame of mind but am having a hard time finding strong enough
justifications for them. There are many really specific rules and it isn't
clear why they are the way they are.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-11-30 17:11           ` Tejun Heo
@ 2021-12-01  3:56             ` Waiman Long
  2021-12-01 14:13               ` Michal Koutný
                                 ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Waiman Long @ 2021-12-01  3:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti

On 11/30/21 12:11, Tejun Heo wrote:
> Hello, Waiman.
>
> On Tue, Nov 30, 2021 at 10:35:19AM -0500, Waiman Long wrote:
>>      On read, the "cpuset.cpus.partition" file can show the following
>>      values.
>>
>>        ======================    ==============================
>>        "member"            Non-root member of a partition
>>        "root"            Partition root
>>        "isolated"            Partition root without load balancing
>>        "root invalid (<reason>)"    Invalid partition root
>>        ======================    ==============================
> What happens if an isolated domain becomes invalid and then valid again due
> to cpu hotplug? Does it go "root invalid" and then back to "isolated"?
Yes, the current code allow recovering from an invalid state. In this 
particular case, the transition will be "isolated" --> "root invalid" 
--> "isolated".
> ...
>>      Before the "member" to partition root transition can happen,
>>      the following conditions must be met or the transition will
>>      not be allowed.
>>
>>      1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are
>>         not shared by any of its siblings.
>>      2) The parent cgroup is a valid partition root.
>>      3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus".
>>      4) There is no child cgroups with cpuset enabled.  This avoids
>>         cpu migrations of multiple cgroups simultaneously which can
>>         be problematic.
> So, I still have a hard time justifying the above restrictions. 1) can be
> broken through hotplug anyway. 2) can be broken by the parent switching to
> member. 3) would mean that we'd need to restrict parent's config changes
> depending on what children are doing. 4) is more understandable but it's an
> implementation detail that we can address in the future.
>
The initial transition to a partition root has a higher barrier. Once it 
becomes a partition root. Some restrictions are relaxed.

>>      Once becoming a partition root, the following two rules restrict
>>      what changes can be made to "cpuset.cpus".
>>
>>      1) The value must be exclusive.
>>      2) If child cpusets exist, the value must be a superset of what
>>         are defined in the child cpusets.
>>
>>      The second rule applies even for "member". Other changes to
>>      "cpuset.cpus" that do not violate the above rules are always
>>      allowed.
> While it isn't necessarily tied to this series, it's a big no-no to restrict
> what a parent can do depending on what its descendants are doing. A cgroup
> higher up in the hierarchy should be able to change configuration however it
> sees fit as deligation breaks down otherwise.
>
> Maybe you can argue that cpuset is special and shouldn't be subject to such
> convention but I can't see strong enough justifications especially given
> that most of these restrictions can be broken by hotplug operations anyway
> and thus need code to handle those situations.

These are all pre-existing restrictions before the introduction of 
partition. These are checks done in validate_change(). I am just saying 
out loud the existing behavior. If you think that needs to be changed, I 
am fine with that. However, it will be a separate patch as it is not a 
behavior that is introduced by this series.


>>      Changing a partition root (valid or invalid) to "member" is
>>      always allowed.  If there are child partition roots underneath
>>      it, however, they will be forced to be switched back to "member"
>>      too and lose their partitions. So care must be taken to double
>>      check for this condition before disabling a partition root.
> Wouldn't it make more sense for them to retain their configuration and turn
> invalid? Why is this special?

Once an invalid partition is changed to "member", there is no way for a 
child invalid partition root to recover and become valid again. There is 
why I force them to become "member" also. I am OK if you believe it is 
better to keep them in the invalid state forever until we explicitly 
changed them to "member" eventually.


>
>>      A valid parent partition may distribute out all its CPUs to
>>      its child partitions as long as it is not the root cgroup and
>>      there is no task associated with it.
> A valid parent partition which isn't root never has tasks in them to begin
> with.
I believe there is some corner cases where it is possible to put task in 
an intermediate partition. That is why I put down this statement.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-12-01  3:56             ` Waiman Long
@ 2021-12-01 14:13               ` Michal Koutný
  2021-12-01 14:56                 ` Waiman Long
  2021-12-01 14:26               ` Waiman Long
  2021-12-01 16:46               ` Tejun Heo
  2 siblings, 1 reply; 39+ messages in thread
From: Michal Koutný @ 2021-12-01 14:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

[-- Attachment #1: Type: text/plain, Size: 980 bytes --]

On Tue, Nov 30, 2021 at 10:56:34PM -0500, Waiman Long <longman@redhat.com> wrote:
> > >      A valid parent partition may distribute out all its CPUs to
> > >      its child partitions as long as it is not the root cgroup and
> > >      there is no task associated with it.
> > A valid parent partition which isn't root never has tasks in them to begin
> > with.
> I believe there is some corner cases where it is possible to put task in an
> intermediate partition. That is why I put down this statement.

Just mind the threads -- cpuset controller is threaded and having tasks
in inner cgroup nodes is a real scenario. I wouldn't consider it a
corner case.

[ Actually, the paragraph could IMO be simplified:

> A valid parent partition may distribute out all its CPUs to
> its child partitions as long as there is no task associated with it.

Assuming there's always at least one kernel thread in the root cgroup
that can't be migrated anyway.]


Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-12-01  3:56             ` Waiman Long
  2021-12-01 14:13               ` Michal Koutný
@ 2021-12-01 14:26               ` Waiman Long
  2021-12-01 16:46               ` Tejun Heo
  2 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-12-01 14:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti

On 11/30/21 22:56, Waiman Long wrote:
> On 11/30/21 12:11, Tejun Heo wrote:
>
>
>>>      Once becoming a partition root, the following two rules restrict
>>>      what changes can be made to "cpuset.cpus".
>>>
>>>      1) The value must be exclusive.
>>>      2) If child cpusets exist, the value must be a superset of what
>>>         are defined in the child cpusets.
>>>
>>>      The second rule applies even for "member". Other changes to
>>>      "cpuset.cpus" that do not violate the above rules are always
>>>      allowed.
>> While it isn't necessarily tied to this series, it's a big no-no to 
>> restrict
>> what a parent can do depending on what its descendants are doing. A 
>> cgroup
>> higher up in the hierarchy should be able to change configuration 
>> however it
>> sees fit as deligation breaks down otherwise.
>>
>> Maybe you can argue that cpuset is special and shouldn't be subject 
>> to such
>> convention but I can't see strong enough justifications especially given
>> that most of these restrictions can be broken by hotplug operations 
>> anyway
>> and thus need code to handle those situations.
>
> These are all pre-existing restrictions before the introduction of 
> partition. These are checks done in validate_change(). I am just 
> saying out loud the existing behavior. If you think that needs to be 
> changed, I am fine with that. However, it will be a separate patch as 
> it is not a behavior that is introduced by this series.

Of the 2 restrictions listed above, the exclusivity rule is due to the 
use of CS_CPU_EXCLUSIVE flag. I think it is reasonable as it affects 
only siblings, not the parent.

The second restriction was found during my testing. It is caused by the 
following code in validate_change():

         /* Each of our child cpusets must be a subset of us */
         ret = -EBUSY;
         cpuset_for_each_child(c, css, cur)
                 if (!is_cpuset_subset(c, trial))
                         goto out;

It seems that this code was there since v2.6.12 (the beginning of the 
git era). Later in function, we have

         /* On legacy hierarchy, we must be a subset of our parent 
cpuset. */
         ret = -EACCES;
         if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
                 goto out;

This is actually a duplicate in the case of legacy hierarchy.

I can add a patch to take out the first code block above which I think 
is where most of your objections are. Then I can remove the 2nd 
restriction in my documentation. I would like to emphasize that this is 
a pre-existing behavior which I just happen to document.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-12-01 14:13               ` Michal Koutný
@ 2021-12-01 14:56                 ` Waiman Long
  2021-12-01 16:39                   ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2021-12-01 14:56 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

On 12/1/21 09:13, Michal Koutný wrote:
> On Tue, Nov 30, 2021 at 10:56:34PM -0500, Waiman Long <longman@redhat.com> wrote:
>>>>       A valid parent partition may distribute out all its CPUs to
>>>>       its child partitions as long as it is not the root cgroup and
>>>>       there is no task associated with it.
>>> A valid parent partition which isn't root never has tasks in them to begin
>>> with.
>> I believe there is some corner cases where it is possible to put task in an
>> intermediate partition. That is why I put down this statement.
> Just mind the threads -- cpuset controller is threaded and having tasks
> in inner cgroup nodes is a real scenario. I wouldn't consider it a
> corner case.
>
> [ Actually, the paragraph could IMO be simplified:
Right, I shouldn't say corner cases. Having task in an intermediate 
partition is possible depending on event sequence. I am aware that there 
are code in the cpuset code to prevent that, but it didn't block all cases.
>> A valid parent partition may distribute out all its CPUs to
>>   its child partitions as long as there is no task associated with it.
> Assuming there's always at least one kernel thread in the root cgroup
> that can't be migrated anyway.]

I am aware of that. That is why I said root cgroup must have at least 
one cpu in its "cpuset.cpus.effective".

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-12-01 14:56                 ` Waiman Long
@ 2021-12-01 16:39                   ` Tejun Heo
  2021-12-01 17:49                     ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2021-12-01 16:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti

On Wed, Dec 01, 2021 at 09:56:21AM -0500, Waiman Long wrote:
> Right, I shouldn't say corner cases. Having task in an intermediate
> partition is possible depending on event sequence. I am aware that there are
> code in the cpuset code to prevent that, but it didn't block all cases.
> > > A valid parent partition may distribute out all its CPUs to
> > >   its child partitions as long as there is no task associated with it.
> > Assuming there's always at least one kernel thread in the root cgroup
> > that can't be migrated anyway.]
> 
> I am aware of that. That is why I said root cgroup must have at least one
> cpu in its "cpuset.cpus.effective".

In that case, let's explicitly describe that condition.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-12-01  3:56             ` Waiman Long
  2021-12-01 14:13               ` Michal Koutný
  2021-12-01 14:26               ` Waiman Long
@ 2021-12-01 16:46               ` Tejun Heo
  2021-12-01 18:05                 ` Waiman Long
  2 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2021-12-01 16:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti

Hello, Waiman.

On Tue, Nov 30, 2021 at 10:56:34PM -0500, Waiman Long wrote:
> > What happens if an isolated domain becomes invalid and then valid again due
> > to cpu hotplug? Does it go "root invalid" and then back to "isolated"?
>
> Yes, the current code allow recovering from an invalid state. In this
> particular case, the transition will be "isolated" --> "root invalid" -->
> "isolated".

Wouldn't it be clearer if it became "isolated invalid"?

> > While it isn't necessarily tied to this series, it's a big no-no to restrict
> > what a parent can do depending on what its descendants are doing. A cgroup
> > higher up in the hierarchy should be able to change configuration however it
> > sees fit as deligation breaks down otherwise.
> > 
> > Maybe you can argue that cpuset is special and shouldn't be subject to such
> > convention but I can't see strong enough justifications especially given
> > that most of these restrictions can be broken by hotplug operations anyway
> > and thus need code to handle those situations.
> 
> These are all pre-existing restrictions before the introduction of
> partition. These are checks done in validate_change(). I am just saying out
> loud the existing behavior. If you think that needs to be changed, I am fine
> with that. However, it will be a separate patch as it is not a behavior that
> is introduced by this series.

I see. It looks more problematic now with the addtion of the state
transition error reporting, more possible state transitions and, well,
actual documentation.

> Once an invalid partition is changed to "member", there is no way for a
> child invalid partition root to recover and become valid again. There is why
> I force them to become "member" also. I am OK if you believe it is better to
> keep them in the invalid state forever until we explicitly changed them to
> "member" eventually.

That's because we don't allow turning a cgroup with descendants into a
partition, right?

So, when we were first adding the partition support, the thinking was that
as it's pretty niche anyway, we can take some aberrations and restrictions,
but I don't think it's a good direction to be building up on top of those
like this and would much prefer to clean up the rules and restrictions. I
know that this has been going on for quite a while and am sorry that am
coming back to the same issue repeatedly which isn't necessarily caused by
the proposed change. What do you think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-12-01 16:39                   ` Tejun Heo
@ 2021-12-01 17:49                     ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-12-01 17:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti


On 12/1/21 11:39, Tejun Heo wrote:
> On Wed, Dec 01, 2021 at 09:56:21AM -0500, Waiman Long wrote:
>> Right, I shouldn't say corner cases. Having task in an intermediate
>> partition is possible depending on event sequence. I am aware that there are
>> code in the cpuset code to prevent that, but it didn't block all cases.
>>>> A valid parent partition may distribute out all its CPUs to
>>>>    its child partitions as long as there is no task associated with it.
>>> Assuming there's always at least one kernel thread in the root cgroup
>>> that can't be migrated anyway.]
>> I am aware of that. That is why I said root cgroup must have at least one
>> cpu in its "cpuset.cpus.effective".
> In that case, let's explicitly describe that condition.

Yes, I will. Only non-root cgroup can distribute out all its CPUs. I 
thought I said that in the documentation, maybe it is very clear.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-12-01 16:46               ` Tejun Heo
@ 2021-12-01 18:05                 ` Waiman Long
  2021-12-02  1:28                   ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2021-12-01 18:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti


On 12/1/21 11:46, Tejun Heo wrote:
> Hello, Waiman.
>
> On Tue, Nov 30, 2021 at 10:56:34PM -0500, Waiman Long wrote:
>>> What happens if an isolated domain becomes invalid and then valid again due
>>> to cpu hotplug? Does it go "root invalid" and then back to "isolated"?
>> Yes, the current code allow recovering from an invalid state. In this
>> particular case, the transition will be "isolated" --> "root invalid" -->
>> "isolated".
> Wouldn't it be clearer if it became "isolated invalid"?

You are right. I have overlooked that. Will make the change.


>
>>> While it isn't necessarily tied to this series, it's a big no-no to restrict
>>> what a parent can do depending on what its descendants are doing. A cgroup
>>> higher up in the hierarchy should be able to change configuration however it
>>> sees fit as deligation breaks down otherwise.
>>>
>>> Maybe you can argue that cpuset is special and shouldn't be subject to such
>>> convention but I can't see strong enough justifications especially given
>>> that most of these restrictions can be broken by hotplug operations anyway
>>> and thus need code to handle those situations.
>> These are all pre-existing restrictions before the introduction of
>> partition. These are checks done in validate_change(). I am just saying out
>> loud the existing behavior. If you think that needs to be changed, I am fine
>> with that. However, it will be a separate patch as it is not a behavior that
>> is introduced by this series.
> I see. It looks more problematic now with the addtion of the state
> transition error reporting, more possible state transitions and, well,
> actual documentation.

I am going to add a patch to take out the child superset limitation for 
the default hierarchy as I believe it is probably an oversight that we 
were not aware of before. I would like to keep the exclusivity rule 
though as I think it makes sense.

>
>> Once an invalid partition is changed to "member", there is no way for a
>> child invalid partition root to recover and become valid again. There is why
>> I force them to become "member" also. I am OK if you believe it is better to
>> keep them in the invalid state forever until we explicitly changed them to
>> "member" eventually.
> That's because we don't allow turning a cgroup with descendants into a
> partition, right?
Yes, that is a major part of it.
>
> So, when we were first adding the partition support, the thinking was that
> as it's pretty niche anyway, we can take some aberrations and restrictions,
> but I don't think it's a good direction to be building up on top of those
> like this and would much prefer to clean up the rules and restrictions. I
> know that this has been going on for quite a while and am sorry that am
> coming back to the same issue repeatedly which isn't necessarily caused by
> the proposed change. What do you think?

I think I can relax some of the restrictions, but probably not all of 
them at this time. We can certainly working on removing as much 
restriction and limitations as possible in future update to the 
partition code.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-12-01 18:05                 ` Waiman Long
@ 2021-12-02  1:28                   ` Waiman Long
  2021-12-03 18:25                     ` Michal Koutný
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2021-12-02  1:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti

On 12/1/21 13:05, Waiman Long wrote:
>
> On 12/1/21 11:46, Tejun Heo wrote:
>>
>> So, when we were first adding the partition support, the thinking was 
>> that
>> as it's pretty niche anyway, we can take some aberrations and 
>> restrictions,
>> but I don't think it's a good direction to be building up on top of 
>> those
>> like this and would much prefer to clean up the rules and 
>> restrictions. I
>> know that this has been going on for quite a while and am sorry that am
>> coming back to the same issue repeatedly which isn't necessarily 
>> caused by
>> the proposed change. What do you think?
>
> I think I can relax some of the restrictions, but probably not all of 
> them at this time. We can certainly working on removing as much 
> restriction and limitations as possible in future update to the 
> partition code.

I would say that partition is a cpuset feature that only a minority of 
users may ever need to use. What I don't want to do is to make the 
partition feature as general and accommodating as possible and then some 
of them become dead code that people never use. It won't break binary 
compatibility to relax or remove limitations in the future. However, 
imposing new limitation or restriction in the future may not be 
possible. So I would like to see new use cases evolve that require us to 
remove the limitations. If that happens, I am happy to update the code 
to accommodate the new use cases.

For the current use cases of partition that I am aware of, the current 
limitations as documented will not be a problem for those use cases.

The document below is my latest draft of the document. There are several 
major changes from the earlier draft:

1) The limitation that "cpuset.cpus" has to be a superset of child's 
"cpuset.cpus" has been removed as a new patch to remove that limitation 
will be added.

2) The initial transition from "member" to partition root now requires 
that "cpuset.cpus" overlap with that of the parent's "cpuset.cpus" 
instead of being a superset.

3) Now read back of "cpuset.cpus.partition" may return "isolated invalid".

For the transition back to "member", I haven't changed the current 
wording of forcing child partition roots to become "member" yet. If you 
think keeping them as invalid partition root is better, I can made that 
change too.

Please let me know what other changes you would like to see.

Cheers,
Longman

----------------------------------------------------------------------------------------

   cpuset.cpus.partition
     A read-write single value file which exists on non-root
     cpuset-enabled cgroups.  This flag is owned by the parent cgroup
     and is not delegatable.

     It accepts only the following input values when written to.

       ========    ================================
       "member"    Non-root member of a partition
       "root"    Partition root
       "isolated"    Partition root without load balancing
       ========    ================================

     The root cgroup is always a partition root and its state
     cannot be changed.  All other non-root cgroups start out as
     "member".

     When set to "root", the current cgroup is the root of a new
     partition or scheduling domain that comprises itself and
     all its descendants except those that are separate partition
     roots themselves and their descendants.

     The value shown in "cpuset.cpus.effective" of a partition root is
     the CPUs that the parent partition root can dedicate to the new
     partition root.  They are subtracted from "cpuset.cpus.effective"
     of the parent and may be different from "cpuset.cpus"

     When set to "isolated", the CPUs in that partition root will
     be in an isolated state without any load balancing from the
     scheduler.  Tasks placed in such a partition with multiple
     CPUs should be carefully distributed and bound to each of the
     individual CPUs for optimal performance.

     A partition root ("root" or "isolated") can be in one of the
     two possible states - valid or invalid.  An invalid partition
     root is in a degraded state where some state information are
     retained, but behaves more like a "member".

     On read, the "cpuset.cpus.partition" file can show the following
     values.

       ======================    ==============================
       "member"            Non-root member of a partition
       "root"            Partition root
       "isolated"            Partition root without load balancing
       "root invalid (<reason>)"    Invalid partition root
       "isolated invalid (<reason>)"    Invalid isolated partition root
       ======================    ==============================

     In the case of an invalid partition root, a descriptive string on
     why the partition is invalid is included within parentheses.

     Almost all possible state transitions among "member", valid
     and invalid partition roots are allowed except from "member"
     to invalid partition root.

     Before the "member" to partition root transition can happen,
     the following conditions must be met or the transition will
     not be allowed.

     1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are
        not shared by any of its siblings.
     2) The parent cgroup is a valid partition root.
     3) The "cpuset.cpus" must contain at least one of the CPUs from
        parent's "cpuset.cpus", i.e. they overlap.
     4) There is no child cgroups with cpuset enabled.  This avoids
        cpu migrations of multiple cgroups simultaneously which can
        be problematic.

     Once becoming a partition root, the only rule restricting
     changes made to "cpuset.cpus" is the exclusivity rule where
     none of the siblings of a partition root can share CPUs with
     it.

     External events like hotplug or inappropriate changes to
     "cpuset.cpus" can cause a valid partition root to become invalid.
     Besides the exclusivity rule listed above, the other conditions
     required to maintain the validity of a partition root are
     as follows:

     1) The parent cgroup is a valid partition root.
     2) If "cpuset.cpus.effective" is empty, the partition must have
        no task associated with it. Otherwise, the partition becomes
        invalid and "cpuset.cpus.effective" will fall back to that
        of the nearest non-empty ancestor.

     A corollary of a valid partition root is that
     "cpuset.cpus.effective" is always a subset of "cpuset.cpus".
     Note that a task cannot be moved to a cgroup with empty
     "cpuset.cpus.effective".

     Changing a partition root (valid or invalid) to "member" is
     always allowed.  If there are child partition roots underneath
     it, however, they will be forced to be switched back to "member"
     too and lose their partitions. So care must be taken to double
     check for this condition before disabling a partition root.

     A valid non-root parent partition may distribute out all its CPUs
     to its child partitions when there is no task associated with it.

     An invalid partition root can be reverted back to a valid one
     if none of the validity constraints of a valid partition root
     are violated.

     Poll and inotify events are triggered whenever the state of
     "cpuset.cpus.partition" changes.  That includes changes caused by
     write to "cpuset.cpus.partition", cpu hotplug and other changes
     that make the partition invalid.  This will allow user space
     agents to monitor unexpected changes to "cpuset.cpus.partition"
     without the need to do continuous polling.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-12-02  1:28                   ` Waiman Long
@ 2021-12-03 18:25                     ` Michal Koutný
  2021-12-03 19:27                       ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Koutný @ 2021-12-03 18:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

[-- Attachment #1: Type: text/plain, Size: 1919 bytes --]

Hello Longman.

On Wed, Dec 01, 2021 at 08:28:09PM -0500, Waiman Long <longman@redhat.com> wrote:
> 1) The limitation that "cpuset.cpus" has to be a superset of child's
> "cpuset.cpus" has been removed as a new patch to remove that limitation will
> be added.

Superb!

> 2) The initial transition from "member" to partition root now requires that
> "cpuset.cpus" overlap with that of the parent's "cpuset.cpus" instead of
> being a superset.

That's sensible.

> For the transition back to "member", I haven't changed the current wording
> of forcing child partition roots to become "member" yet. If you think
> keeping them as invalid partition root is better, I can made that change
> too.

I wrote I was indifferent about this in a previous mail but when I think
about it now, switching to invalid root is perhaps better than switching
to member since it'd effectively mean that modifications of the parent
config propagate (permanently) also to a descendant config, which is
an undesired v1-ism.


> Please let me know what other changes you would like to see.

I hope my remarks below are just clarifications and not substantial
changes. Besides that I find your new draft good. Thanks!

> [...]

>     An invalid partition root can be reverted back to a valid one
>     if none of the validity constraints of a valid partition root
>     are violated.

s/can be/will be/ 

(I understand the intention is to make it asynchronously and
automatically, i.e. without writing into the affected descendant(s)
cpuset.partition again.)

>     Poll and inotify events are triggered whenever the state of
>     "cpuset.cpus.partition" changes.  That includes changes caused by
>     write to "cpuset.cpus.partition", cpu hotplug and other changes
>     that make the partition invalid.

-> that change validity status

(In accordance with the comment above.)


Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-12-03 18:25                     ` Michal Koutný
@ 2021-12-03 19:27                       ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2021-12-03 19:27 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

On 12/3/21 13:25, Michal Koutný wrote:
> Hello Longman.
>
> On Wed, Dec 01, 2021 at 08:28:09PM -0500, Waiman Long <longman@redhat.com> wrote:
>> 1) The limitation that "cpuset.cpus" has to be a superset of child's
>> "cpuset.cpus" has been removed as a new patch to remove that limitation will
>> be added.
> Superb!
>
>> 2) The initial transition from "member" to partition root now requires that
>> "cpuset.cpus" overlap with that of the parent's "cpuset.cpus" instead of
>> being a superset.
> That's sensible.
>
>> For the transition back to "member", I haven't changed the current wording
>> of forcing child partition roots to become "member" yet. If you think
>> keeping them as invalid partition root is better, I can made that change
>> too.
> I wrote I was indifferent about this in a previous mail but when I think
> about it now, switching to invalid root is perhaps better than switching
> to member since it'd effectively mean that modifications of the parent
> config propagate (permanently) also to a descendant config, which is
> an undesired v1-ism.

That makes sense. I will keep those child partitions in an invalid state 
then.

>
>> Please let me know what other changes you would like to see.
> I hope my remarks below are just clarifications and not substantial
> changes. Besides that I find your new draft good. Thanks!
>
>> [...]
>>      An invalid partition root can be reverted back to a valid one
>>      if none of the validity constraints of a valid partition root
>>      are violated.
> s/can be/will be/
>
> (I understand the intention is to make it asynchronously and
> automatically, i.e. without writing into the affected descendant(s)
> cpuset.partition again.)
Yes, that will be automatic and the partition will become valid again if 
other events cause changes that unbreak the validity constraints.
>
>>      Poll and inotify events are triggered whenever the state of
>>      "cpuset.cpus.partition" changes.  That includes changes caused by
>>      write to "cpuset.cpus.partition", cpu hotplug and other changes
>>      that make the partition invalid.
> -> that change validity status
>
> (In accordance with the comment above.)
Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2021-12-03 19:27 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-18 14:36 [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
2021-10-18 14:36 ` [PATCH v8 1/6] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective Waiman Long
2021-10-18 14:36 ` [PATCH v8 2/6] cgroup/cpuset: Refining features and constraints of a partition Waiman Long
2021-10-18 14:36 ` [PATCH v8 3/6] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
2021-10-18 14:36 ` [PATCH v8 4/6] cgroup/cpuset: Show invalid partition reason string Waiman Long
2021-10-18 14:36 ` [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
2021-11-15 19:31   ` Michal Koutný
2021-11-15 20:11     ` Tejun Heo
2021-11-15 21:27       ` Waiman Long
2021-11-15 21:10     ` Waiman Long
2021-11-16 17:54       ` Michal Koutný
2021-11-30 15:35         ` Waiman Long
2021-11-30 17:11           ` Tejun Heo
2021-12-01  3:56             ` Waiman Long
2021-12-01 14:13               ` Michal Koutný
2021-12-01 14:56                 ` Waiman Long
2021-12-01 16:39                   ` Tejun Heo
2021-12-01 17:49                     ` Waiman Long
2021-12-01 14:26               ` Waiman Long
2021-12-01 16:46               ` Tejun Heo
2021-12-01 18:05                 ` Waiman Long
2021-12-02  1:28                   ` Waiman Long
2021-12-03 18:25                     ` Michal Koutný
2021-12-03 19:27                       ` Waiman Long
2021-10-18 14:36 ` [PATCH v8 6/6] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long
2021-10-27 23:05 ` [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
2021-11-10 11:13 ` Felix Moessbauer
2021-11-10 13:21   ` Marcelo Tosatti
2021-11-10 13:56   ` Michal Koutný
2021-11-10 15:21     ` Moessbauer, Felix
2021-11-10 16:10       ` Marcelo Tosatti
2021-11-10 16:14         ` Marcelo Tosatti
2021-11-10 16:15         ` Jan Kiszka
2021-11-10 17:29           ` Marcelo Tosatti
2021-11-10 18:30             ` Waiman Long
2021-11-10 17:52           ` Michal Koutný
2021-11-10 18:04             ` Jan Kiszka
2021-11-10 18:15       ` Michal Koutný
2021-11-10 15:20   ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).