* [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effective cpus
@ 2021-07-20 14:18 Waiman Long
  2021-07-20 14:18 ` [PATCH v3 1/9] cgroup/cpuset: Miscellaneous code cleanup Waiman Long
                   ` (9 more replies)
  0 siblings, 10 replies; 27+ messages in thread
From: Waiman Long @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

v3:
 - Add two new patches (patches 2 & 3) to fix bugs found during the
   testing process.
 - Add a new patch to enable inotify event notification when a partition
   becomes invalid.
 - Add a test for event notification when a partition becomes invalid.

v2:
 - Drop v1 patch 1.
 - Break out some cosmetic changes into a separate patch (patch #1).
 - Add a new patch to clarify the transition to invalid partition root
   is mainly caused by hotplug events.
 - Enhance the partition root state test including CPU online/offline
   behavior and fix issues found by the test.

This patchset fixes two bugs and makes four enhancements to the cpuset
v2 code.

Bug fixes:

 Patch 2: Fix a hotplug handling bug that occurs when the offlined CPUs
 are exactly those in subparts_cpus.

 Patch 3: Fix violation of cpuset locking rule.

Enhancements: 

 Patch 4: Enable event notification on "cpuset.cpus.partition" when
 a partition becomes invalid.

 Patch 5: Clarify the use of invalid partition root and add new checks
 to make sure that normal cpuset control file operations will not be
 allowed to create an invalid partition root. It also fixes some of the
 issues in existing code.

 Patch 6: Add a new partition state "isolated" to create a partition
 root without load balancing. This is for handling intermittent workloads
 that have a strict low latency requirement.

 Patch 7: Allow a partition root that is not the top cpuset to distribute
 all its cpus to child partitions as long as there is no task associated
 with that partition root. This gives middleware more flexibility in
 managing multiple partitions.

Patch 8 updates the cgroup-v2.rst file accordingly. Patch 9 adds a new
cpuset test to test the new cpuset partition code.

Waiman Long (9):
  cgroup/cpuset: Miscellaneous code cleanup
  cgroup/cpuset: Fix a partition bug with hotplug
  cgroup/cpuset: Fix violation of cpuset locking rule
  cgroup/cpuset: Enable event notification when partition become invalid
  cgroup/cpuset: Clarify the use of invalid partition root
  cgroup/cpuset: Add a new isolated cpus.partition type
  cgroup/cpuset: Allow non-top parent partition root to distribute out
    all CPUs
  cgroup/cpuset: Update description of cpuset.cpus.partition in
    cgroup-v2.rst
  kselftest/cgroup: Add cpuset v2 partition root state test

 Documentation/admin-guide/cgroup-v2.rst       |  94 ++-
 kernel/cgroup/cpuset.c                        | 360 +++++++---
 tools/testing/selftests/cgroup/Makefile       |   5 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 626 ++++++++++++++++++
 tools/testing/selftests/cgroup/wait_inotify.c |  67 ++
 5 files changed, 1007 insertions(+), 145 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
 create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

-- 
2.18.1



* [PATCH v3 1/9] cgroup/cpuset: Miscellaneous code cleanup
  2021-07-20 14:18 [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effective cpus Waiman Long
@ 2021-07-20 14:18 ` Waiman Long
  2021-07-26 22:56   ` Tejun Heo
  2021-07-20 14:18 ` [PATCH v3 2/9] cgroup/cpuset: Fix a partition bug with hotplug Waiman Long
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Use more descriptive variable names for update_prstate(), remove
unnecessary code and fix some typos. There is no functional change.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 40 +++++++++++++++++++---------------------
 1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index adb5190c4429..f5fef5516d99 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1114,7 +1114,7 @@ enum subparts_cmd {
  * cpus_allowed can be granted or an error code will be returned.
  *
  * For partcmd_disable, the cpuset is being transofrmed from a partition
- * root back to a non-partition root. any CPUs in cpus_allowed that are in
+ * root back to a non-partition root. Any CPUs in cpus_allowed that are in
  * parent's subparts_cpus will be taken away from that cpumask and put back
  * into parent's effective_cpus. 0 should always be returned.
  *
@@ -1225,7 +1225,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		/*
 		 * partcmd_update w/o newmask:
 		 *
-		 * addmask = cpus_allowed & parent->effectiveb_cpus
+		 * addmask = cpus_allowed & parent->effective_cpus
 		 *
 		 * Note that parent's subparts_cpus may have been
 		 * pre-shrunk in case there is a change in the cpu list.
@@ -1365,12 +1365,12 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			case PRS_DISABLED:
 				/*
 				 * If parent is not a partition root or an
-				 * invalid partition root, clear the state
-				 * state and the CS_CPU_EXCLUSIVE flag.
+				 * invalid partition root, clear its state
+				 * and its CS_CPU_EXCLUSIVE flag.
 				 */
 				WARN_ON_ONCE(cp->partition_root_state
 					     != PRS_ERROR);
-				cp->partition_root_state = 0;
+				cp->partition_root_state = PRS_DISABLED;
 
 				/*
 				 * clear_bit() is an atomic operation and
@@ -1937,30 +1937,28 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 
 /*
  * update_prstate - update partititon_root_state
- * cs:	the cpuset to update
- * val: 0 - disabled, 1 - enabled
+ * cs: the cpuset to update
+ * new_prs: new partition root state
  *
  * Call with cpuset_mutex held.
  */
-static int update_prstate(struct cpuset *cs, int val)
+static int update_prstate(struct cpuset *cs, int new_prs)
 {
 	int err;
 	struct cpuset *parent = parent_cs(cs);
-	struct tmpmasks tmp;
+	struct tmpmasks tmpmask;
 
-	if ((val != 0) && (val != 1))
-		return -EINVAL;
-	if (val == cs->partition_root_state)
+	if (new_prs == cs->partition_root_state)
 		return 0;
 
 	/*
 	 * Cannot force a partial or invalid partition root to a full
 	 * partition root.
 	 */
-	if (val && cs->partition_root_state)
+	if (new_prs && (cs->partition_root_state < 0))
 		return -EINVAL;
 
-	if (alloc_cpumasks(NULL, &tmp))
+	if (alloc_cpumasks(NULL, &tmpmask))
 		return -ENOMEM;
 
 	err = -EINVAL;
@@ -1978,7 +1976,7 @@ static int update_prstate(struct cpuset *cs, int val)
 			goto out;
 
 		err = update_parent_subparts_cpumask(cs, partcmd_enable,
-						     NULL, &tmp);
+						     NULL, &tmpmask);
 		if (err) {
 			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			goto out;
@@ -1990,18 +1988,18 @@ static int update_prstate(struct cpuset *cs, int val)
 		 * CS_CPU_EXCLUSIVE bit.
 		 */
 		if (cs->partition_root_state == PRS_ERROR) {
-			cs->partition_root_state = 0;
+			cs->partition_root_state = PRS_DISABLED;
 			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			err = 0;
 			goto out;
 		}
 
 		err = update_parent_subparts_cpumask(cs, partcmd_disable,
-						     NULL, &tmp);
+						     NULL, &tmpmask);
 		if (err)
 			goto out;
 
-		cs->partition_root_state = 0;
+		cs->partition_root_state = PRS_DISABLED;
 
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
@@ -2015,11 +2013,11 @@ static int update_prstate(struct cpuset *cs, int val)
 		update_tasks_cpumask(parent);
 
 	if (parent->child_ecpus_count)
-		update_sibling_cpumasks(parent, cs, &tmp);
+		update_sibling_cpumasks(parent, cs, &tmpmask);
 
 	rebuild_sched_domains_locked();
 out:
-	free_cpumasks(NULL, &tmp);
+	free_cpumasks(NULL, &tmpmask);
 	return err;
 }
 
@@ -3060,7 +3058,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		goto retry;
 	}
 
-	parent =  parent_cs(cs);
+	parent = parent_cs(cs);
 	compute_effective_cpumask(&new_cpus, cs, parent);
 	nodes_and(new_mems, cs->mems_allowed, parent->effective_mems);
 
-- 
2.18.1



* [PATCH v3 2/9] cgroup/cpuset: Fix a partition bug with hotplug
  2021-07-20 14:18 [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effective cpus Waiman Long
  2021-07-20 14:18 ` [PATCH v3 1/9] cgroup/cpuset: Miscellaneous code cleanup Waiman Long
@ 2021-07-20 14:18 ` Waiman Long
  2021-07-26 22:59   ` Tejun Heo
  2021-07-20 14:18 ` [PATCH v3 3/9] cgroup/cpuset: Fix violation of cpuset locking rule Waiman Long
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

In cpuset_hotplug_workfn(), a change in the cpu list is detected by
comparing the effective cpus of the top cpuset with the
cpu_active_mask. However, in the rare case where the offlined CPUs are
exactly those in subparts_cpus, the detection fails and the partition
states are not updated correctly. Fix it by forcing the cpus_updated
flag to true in this particular case.
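
The failure mode described above can be sketched with plain bitmasks.
This is only a userspace model, assuming a 64-bit mask stands in for a
cpumask; struct top_model and both helper functions are hypothetical
names for illustration and are not part of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the top cpuset; one bit per CPU. */
struct top_model {
	uint64_t effective_cpus;	/* CPUs left to the top cpuset */
	uint64_t subparts_cpus;		/* CPUs handed to child partitions */
};

/* Detection before the fix: compare effective cpus with the active mask. */
static bool cpus_updated_before(const struct top_model *top, uint64_t active)
{
	/* effective_cpus and subparts_cpus are expected to be disjoint */
	assert((top->effective_cpus & top->subparts_cpus) == 0);
	return top->effective_cpus != active;
}

/* Detection after the fix: also assume an update if partitions exist. */
static bool cpus_updated_after(const struct top_model *top, uint64_t active)
{
	bool updated = cpus_updated_before(top, active);

	if (!updated && top->subparts_cpus)
		updated = true;
	return updated;
}
```

With CPUs 0-3 in effective_cpus and CPUs 4-5 in subparts_cpus,
offlining exactly CPUs 4-5 leaves the active mask equal to
effective_cpus, so the first helper reports no change while the second
one does.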

Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f5fef5516d99..b00982e6f6d8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3166,6 +3166,13 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
 	mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
 
+	/*
+	 * In the rare case that hotplug removes just all the cpus in
+	 * subparts_cpus, we assumed that cpus are updated.
+	 */
+	if (!cpus_updated && top_cpuset.nr_subparts_cpus)
+		cpus_updated = true;
+
 	/* synchronize cpus_allowed to cpu_active_mask */
 	if (cpus_updated) {
 		spin_lock_irq(&callback_lock);
-- 
2.18.1



* [PATCH v3 3/9] cgroup/cpuset: Fix violation of cpuset locking rule
  2021-07-20 14:18 [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effective cpus Waiman Long
  2021-07-20 14:18 ` [PATCH v3 1/9] cgroup/cpuset: Miscellaneous code cleanup Waiman Long
  2021-07-20 14:18 ` [PATCH v3 2/9] cgroup/cpuset: Fix a partition bug with hotplug Waiman Long
@ 2021-07-20 14:18 ` Waiman Long
  2021-07-26 23:10   ` Tejun Heo
  2021-07-20 14:18 ` [PATCH v3 4/9] cgroup/cpuset: Enable event notification when partition become invalid Waiman Long
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

The cpuset fields that manage partition root state do not strictly
follow the cpuset locking rule that updates to a cpuset have to be done
with both the callback_lock and cpuset_mutex held. Fix this by making
sure that the locking rule is upheld.
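
The pattern the fix follows, computing the new state in a local
variable and committing it only under the inner lock, can be sketched
in userspace as follows. This is a simplified model under stated
assumptions: a C11 atomic_flag stands in for callback_lock, the caller
is assumed to already hold the outer mutex, and step_err is a
hypothetical parameter standing in for the result of the intermediate
cpumask updates.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

enum { PRS_ERROR = -1, PRS_DISABLED = 0, PRS_ENABLED = 1 };

/* Hypothetical stand-in for struct cpuset. */
struct cpuset_model {
	int partition_root_state;
};

/* Models callback_lock as a simple spinlock. */
static atomic_flag callback_lock = ATOMIC_FLAG_INIT;

static void callback_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&callback_lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void callback_lock_release(void)
{
	atomic_flag_clear_explicit(&callback_lock, memory_order_release);
}

/* Caller is assumed to hold the outer mutex (not modelled here). */
static int update_prstate_model(struct cpuset_model *cs, int new_prs,
				int step_err)
{
	int old_prs = cs->partition_root_state;

	assert(new_prs == PRS_DISABLED || new_prs == PRS_ENABLED);
	if (old_prs == new_prs)
		return 0;
	/* An invalid partition root cannot be forced back to enabled. */
	if (new_prs && old_prs == PRS_ERROR)
		return -EINVAL;

	/* Commit the state in one place, and only with the lock held. */
	if (!step_err) {
		callback_lock_acquire();
		cs->partition_root_state = new_prs;
		callback_lock_release();
	}
	return step_err;
}
```

Keeping a single commit point means a reader that takes only the inner
lock never observes a state that the rest of the update later rolls
back.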

Fixes: 3881b86128d0 ("cpuset: Add an error state to cpuset.sched.partition")
Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 58 +++++++++++++++++++++++++-----------------
 1 file changed, 35 insertions(+), 23 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b00982e6f6d8..04a6951abe2a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1148,6 +1148,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	struct cpuset *parent = parent_cs(cpuset);
 	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
+	int new_prs;
 	bool part_error = false;	/* Partition error? */
 
 	percpu_rwsem_assert_held(&cpuset_rwsem);
@@ -1183,6 +1184,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	 * A cpumask update cannot make parent's effective_cpus become empty.
 	 */
 	adding = deleting = false;
+	new_prs = cpuset->partition_root_state;
 	if (cmd == partcmd_enable) {
 		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
 		adding = true;
@@ -1247,11 +1249,11 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		switch (cpuset->partition_root_state) {
 		case PRS_ENABLED:
 			if (part_error)
-				cpuset->partition_root_state = PRS_ERROR;
+				new_prs = PRS_ERROR;
 			break;
 		case PRS_ERROR:
 			if (!part_error)
-				cpuset->partition_root_state = PRS_ENABLED;
+				new_prs = PRS_ENABLED;
 			break;
 		}
 		/*
@@ -1260,10 +1262,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		part_error = (prev_prs == PRS_ERROR);
 	}
 
-	if (!part_error && (cpuset->partition_root_state == PRS_ERROR))
+	if (!part_error && (new_prs == PRS_ERROR))
 		return 0;	/* Nothing need to be done */
 
-	if (cpuset->partition_root_state == PRS_ERROR) {
+	if (new_prs == PRS_ERROR) {
 		/*
 		 * Remove all its cpus from parent's subparts_cpus.
 		 */
@@ -1272,7 +1274,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 				       parent->subparts_cpus);
 	}
 
-	if (!adding && !deleting)
+	if (!adding && !deleting && (new_prs == cpuset->partition_root_state))
 		return 0;
 
 	/*
@@ -1299,6 +1301,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	}
 
 	parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus);
+
+	if (cpuset->partition_root_state != new_prs)
+		cpuset->partition_root_state = new_prs;
 	spin_unlock_irq(&callback_lock);
 
 	return cmd == partcmd_update;
@@ -1321,6 +1326,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
 	bool need_rebuild_sched_domains = false;
+	int new_prs;
 
 	rcu_read_lock();
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
@@ -1360,7 +1366,8 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		 * update_tasks_cpumask() again for tasks in the parent
 		 * cpuset if the parent's subparts_cpus changes.
 		 */
-		if ((cp != cs) && cp->partition_root_state) {
+		new_prs = cp->partition_root_state;
+		if ((cp != cs) && new_prs) {
 			switch (parent->partition_root_state) {
 			case PRS_DISABLED:
 				/*
@@ -1370,7 +1377,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 				 */
 				WARN_ON_ONCE(cp->partition_root_state
 					     != PRS_ERROR);
-				cp->partition_root_state = PRS_DISABLED;
+				new_prs = PRS_DISABLED;
 
 				/*
 				 * clear_bit() is an atomic operation and
@@ -1391,11 +1398,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 				/*
 				 * When parent is invalid, it has to be too.
 				 */
-				cp->partition_root_state = PRS_ERROR;
-				if (cp->nr_subparts_cpus) {
-					cp->nr_subparts_cpus = 0;
-					cpumask_clear(cp->subparts_cpus);
-				}
+				new_prs = PRS_ERROR;
 				break;
 			}
 		}
@@ -1407,8 +1410,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		spin_lock_irq(&callback_lock);
 
 		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-		if (cp->nr_subparts_cpus &&
-		   (cp->partition_root_state != PRS_ENABLED)) {
+		if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
 			cp->nr_subparts_cpus = 0;
 			cpumask_clear(cp->subparts_cpus);
 		} else if (cp->nr_subparts_cpus) {
@@ -1435,6 +1437,10 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 					= cpumask_weight(cp->subparts_cpus);
 			}
 		}
+
+		if (new_prs != cp->partition_root_state)
+			cp->partition_root_state = new_prs;
+
 		spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
@@ -1944,25 +1950,25 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
  */
 static int update_prstate(struct cpuset *cs, int new_prs)
 {
-	int err;
+	int err, old_prs = cs->partition_root_state;
 	struct cpuset *parent = parent_cs(cs);
 	struct tmpmasks tmpmask;
 
-	if (new_prs == cs->partition_root_state)
+	if (old_prs == new_prs)
 		return 0;
 
 	/*
 	 * Cannot force a partial or invalid partition root to a full
 	 * partition root.
 	 */
-	if (new_prs && (cs->partition_root_state < 0))
+	if (new_prs && (old_prs == PRS_ERROR))
 		return -EINVAL;
 
 	if (alloc_cpumasks(NULL, &tmpmask))
 		return -ENOMEM;
 
 	err = -EINVAL;
-	if (!cs->partition_root_state) {
+	if (!old_prs) {
 		/*
 		 * Turning on partition root requires setting the
 		 * CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed
@@ -1981,14 +1987,12 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			goto out;
 		}
-		cs->partition_root_state = PRS_ENABLED;
 	} else {
 		/*
 		 * Turning off partition root will clear the
 		 * CS_CPU_EXCLUSIVE bit.
 		 */
-		if (cs->partition_root_state == PRS_ERROR) {
-			cs->partition_root_state = PRS_DISABLED;
+		if (old_prs == PRS_ERROR) {
 			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			err = 0;
 			goto out;
@@ -1999,8 +2003,6 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		if (err)
 			goto out;
 
-		cs->partition_root_state = PRS_DISABLED;
-
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 	}
@@ -2017,6 +2019,12 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 
 	rebuild_sched_domains_locked();
 out:
+	if (!err) {
+		spin_lock_irq(&callback_lock);
+		cs->partition_root_state = new_prs;
+		spin_unlock_irq(&callback_lock);
+	}
+
 	free_cpumasks(NULL, &tmpmask);
 	return err;
 }
@@ -3080,8 +3088,10 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
 	   (parent->partition_root_state == PRS_ERROR))) {
 		if (cs->nr_subparts_cpus) {
+			spin_lock_irq(&callback_lock);
 			cs->nr_subparts_cpus = 0;
 			cpumask_clear(cs->subparts_cpus);
+			spin_unlock_irq(&callback_lock);
 			compute_effective_cpumask(&new_cpus, cs, parent);
 		}
 
@@ -3095,7 +3105,9 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		     cpumask_empty(&new_cpus)) {
 			update_parent_subparts_cpumask(cs, partcmd_disable,
 						       NULL, tmp);
+			spin_lock_irq(&callback_lock);
 			cs->partition_root_state = PRS_ERROR;
+			spin_unlock_irq(&callback_lock);
 		}
 		cpuset_force_rebuild();
 	}
-- 
2.18.1



* [PATCH v3 4/9] cgroup/cpuset: Enable event notification when partition become invalid
  2021-07-20 14:18 [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effective cpus Waiman Long
                   ` (2 preceding siblings ...)
  2021-07-20 14:18 ` [PATCH v3 3/9] cgroup/cpuset: Fix violation of cpuset locking rule Waiman Long
@ 2021-07-20 14:18 ` Waiman Long
  2021-07-26 23:14   ` Tejun Heo
  2021-07-20 14:18 ` [PATCH v3 5/9] cgroup/cpuset: Clarify the use of invalid partition root Waiman Long
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

A valid cpuset partition can become invalid if all its CPUs are offlined
or somehow removed. This can happen through external events without
"cpuset.cpus.partition" being touched at all.

Users that rely on a partition being present do not currently have a
simple way to be notified of such an event other than constant periodic
polling, which is both inefficient and cumbersome.

To make life easier for those users, event notification is now enabled
for "cpuset.cpus.partition" when it goes into or out of an invalid
partition state.
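
The notification rule described above, notifying only on a transition
into or out of the invalid state, can be sketched as a small userspace
model. The struct, the counter and the set_prs() helper are
hypothetical and only illustrate the condition; the kernel side calls
cgroup_file_notify() instead of bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { PRS_ERROR = -1, PRS_DISABLED = 0, PRS_ENABLED = 1 };

/* Hypothetical stand-in for struct cpuset with a notification counter. */
struct cpuset_model {
	int partition_root_state;
	unsigned int notify_count;	/* models events sent to watchers */
};

/* True only when the state changes into or out of PRS_ERROR. */
static bool should_notify(int old_prs, int new_prs)
{
	if (old_prs == new_prs)
		return false;
	return old_prs == PRS_ERROR || new_prs == PRS_ERROR;
}

/* Update the state and record a notification when the rule asks for one. */
static void set_prs(struct cpuset_model *cs, int new_prs)
{
	int old_prs;

	assert(cs != NULL);
	old_prs = cs->partition_root_state;
	cs->partition_root_state = new_prs;
	if (should_notify(old_prs, new_prs))
		cs->notify_count++;
}
```

A write that moves a cpuset between the disabled and enabled states is
initiated by the user and raises no event in this model; only changes
that involve the invalid state do.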

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 49 ++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 04a6951abe2a..2e34fc5b76f0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -160,6 +160,9 @@ struct cpuset {
 	 */
 	int use_parent_ecpus;
 	int child_ecpus_count;
+
+	/* Handle for cpuset.cpus.partition */
+	struct cgroup_file partition_file;
 };
 
 /*
@@ -263,6 +266,19 @@ static inline int is_partition_root(const struct cpuset *cs)
 	return cs->partition_root_state > 0;
 }
 
+/*
+ * Send notification event of partition_root_state change when going into
+ * or out of PRS_ERROR which may be due to an external event like hotplug.
+ */
+static inline void notify_partition_change(struct cpuset *cs,
+					   int old_prs, int new_prs)
+{
+	if ((old_prs == new_prs) ||
+	   ((old_prs != PRS_ERROR) && (new_prs != PRS_ERROR)))
+		return;
+	cgroup_file_notify(&cs->partition_file);
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
 		  (1 << CS_MEM_EXCLUSIVE)),
@@ -1148,7 +1164,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	struct cpuset *parent = parent_cs(cpuset);
 	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
-	int new_prs;
+	int old_prs, new_prs;
 	bool part_error = false;	/* Partition error? */
 
 	percpu_rwsem_assert_held(&cpuset_rwsem);
@@ -1184,7 +1200,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	 * A cpumask update cannot make parent's effective_cpus become empty.
 	 */
 	adding = deleting = false;
-	new_prs = cpuset->partition_root_state;
+	old_prs = new_prs = cpuset->partition_root_state;
 	if (cmd == partcmd_enable) {
 		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
 		adding = true;
@@ -1274,7 +1290,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 				       parent->subparts_cpus);
 	}
 
-	if (!adding && !deleting && (new_prs == cpuset->partition_root_state))
+	if (!adding && !deleting && (new_prs == old_prs))
 		return 0;
 
 	/*
@@ -1302,9 +1318,11 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 
 	parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus);
 
-	if (cpuset->partition_root_state != new_prs)
+	if (old_prs != new_prs)
 		cpuset->partition_root_state = new_prs;
+
 	spin_unlock_irq(&callback_lock);
+	notify_partition_change(cpuset, old_prs, new_prs);
 
 	return cmd == partcmd_update;
 }
@@ -1326,7 +1344,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
 	bool need_rebuild_sched_domains = false;
-	int new_prs;
+	int old_prs, new_prs;
 
 	rcu_read_lock();
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
@@ -1366,8 +1384,8 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		 * update_tasks_cpumask() again for tasks in the parent
 		 * cpuset if the parent's subparts_cpus changes.
 		 */
-		new_prs = cp->partition_root_state;
-		if ((cp != cs) && new_prs) {
+		old_prs = new_prs = cp->partition_root_state;
+		if ((cp != cs) && old_prs) {
 			switch (parent->partition_root_state) {
 			case PRS_DISABLED:
 				/*
@@ -1438,10 +1456,11 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			}
 		}
 
-		if (new_prs != cp->partition_root_state)
+		if (new_prs != old_prs)
 			cp->partition_root_state = new_prs;
 
 		spin_unlock_irq(&callback_lock);
+		notify_partition_change(cp, old_prs, new_prs);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
@@ -2023,6 +2042,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		spin_lock_irq(&callback_lock);
 		cs->partition_root_state = new_prs;
 		spin_unlock_irq(&callback_lock);
+		notify_partition_change(cs, old_prs, new_prs);
 	}
 
 	free_cpumasks(NULL, &tmpmask);
@@ -2708,6 +2728,7 @@ static struct cftype dfl_files[] = {
 		.write = sched_partition_write,
 		.private = FILE_PARTITION_ROOT,
 		.flags = CFTYPE_NOT_ON_ROOT,
+		.file_offset = offsetof(struct cpuset, partition_file),
 	},
 
 	{
@@ -3103,11 +3124,17 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		 */
 		if ((parent->partition_root_state == PRS_ERROR) ||
 		     cpumask_empty(&new_cpus)) {
+			int old_prs;
+
 			update_parent_subparts_cpumask(cs, partcmd_disable,
 						       NULL, tmp);
-			spin_lock_irq(&callback_lock);
-			cs->partition_root_state = PRS_ERROR;
-			spin_unlock_irq(&callback_lock);
+			old_prs = cs->partition_root_state;
+			if (old_prs != PRS_ERROR) {
+				spin_lock_irq(&callback_lock);
+				cs->partition_root_state = PRS_ERROR;
+				spin_unlock_irq(&callback_lock);
+				notify_partition_change(cs, old_prs, PRS_ERROR);
+			}
 		}
 		cpuset_force_rebuild();
 	}
-- 
2.18.1



* [PATCH v3 5/9] cgroup/cpuset: Clarify the use of invalid partition root
  2021-07-20 14:18 [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effective cpus Waiman Long
                   ` (3 preceding siblings ...)
  2021-07-20 14:18 ` [PATCH v3 4/9] cgroup/cpuset: Enable event notification when partition become invalid Waiman Long
@ 2021-07-20 14:18 ` Waiman Long
  2021-07-20 14:18 ` [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

For cpuset partition, the special state of PRS_ERROR (invalid partition
root) was originally designed to handle hotplug events.  In this state,
CPUs allocated to the partition root are released back to the parent
but the cpuset exclusive flags remain unchanged.

Since a partition root sets the CPU_EXCLUSIVE flag, cpuset.cpus changes
that break the cpu exclusivity rule will not be allowed. However,
other changes to cpuset.cpus on a partition root may still cause it to
become invalid. This is undesirable as we don't want an accidental
change to cpuset.cpus to invalidate a partition root.

Additional checks are now added to make sure that regular cpuset control
file manipulations are not allowed to make a partition root invalid. These
additional checks are:

 1) A partition root can't be changed to a member if it has child
    partition roots.
 2) Removing CPUs from cpuset.cpus in a way that causes the partition
    root to become invalid is not allowed.

Comments are also added to clarify that a partition root becomes
invalid only when an external event like hotplug causes all the
CPUs allocated to a partition root to become unavailable.
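
The two new mask checks applied to a cpuset.cpus update on a partition
root can be sketched with plain bitmasks. This is a simplified
userspace model: 64-bit masks stand in for cpumasks, the function name
and parameters are hypothetical, and the other checks in
update_parent_subparts_cpumask() (such as keeping the parent's
effective_cpus non-empty) are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Model of the added checks for an update with a new mask.
 * Returns 0 if the update passes, or a negative errno value.
 */
static int check_partition_update(uint64_t newmask,
				  uint64_t cpus_allowed,
				  uint64_t parent_effective,
				  uint64_t parent_subparts,
				  uint64_t own_subparts,
				  uint64_t own_effective)
{
	uint64_t delmask;

	/* A partition's own effective and subparts CPUs are disjoint. */
	assert((own_subparts & own_effective) == 0);

	/* newmask must be a subset of (cpus_allowed | parent effective). */
	if (newmask & ~(cpus_allowed | parent_effective))
		return -EINVAL;

	/* CPUs that would be given back to the parent. */
	delmask = cpus_allowed & ~newmask & parent_subparts;

	/*
	 * Reject the update if it removes a CPU already distributed to a
	 * child partition, or if it would empty out effective_cpus.
	 */
	if (delmask &&
	    ((delmask & own_subparts) || delmask == own_effective))
		return -EBUSY;

	return 0;
}
```

For a partition root that owns CPUs 2-3, shrinking cpuset.cpus to CPU 2
passes, while replacing it with CPUs 0-1 would remove every CPU it
currently uses and is rejected with -EBUSY.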

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 144 ++++++++++++++++++++++++-----------------
 1 file changed, 86 insertions(+), 58 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2e34fc5b76f0..c16060b703cc 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -177,7 +177,9 @@ struct cpuset {
  *       subparts_cpus. In this case, the cpuset is not a real partition
  *       root anymore.  However, the CPU_EXCLUSIVE bit will still be set
  *       and the cpuset can be restored back to a partition root if the
- *       parent cpuset can give more CPUs back to this child cpuset.
+ *       parent cpuset can give more CPUs back to this child cpuset. A
+ *       partition root becomes invalid when all its cpus become unavailable
+ *       like being offlined.
  */
 #define PRS_DISABLED		0
 #define PRS_ENABLED		1
@@ -1211,6 +1213,15 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		/*
 		 * partcmd_update with newmask:
 		 *
+		 * Return error if newmask isn't a subset of
+		 * (cpus_allowed | parent->effective_cpus).
+		 */
+		cpumask_or(tmp->addmask, cpuset->cpus_allowed,
+					 parent->effective_cpus);
+		if (!cpumask_subset(newmask, tmp->addmask))
+			return -EINVAL;
+
+		/*
 		 * delmask = cpus_allowed & ~newmask & parent->subparts_cpus
 		 * addmask = newmask & parent->effective_cpus
 		 *		     & ~parent->subparts_cpus
@@ -1223,7 +1234,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
 					parent->subparts_cpus);
 		/*
-		 * Return error if the new effective_cpus could become empty.
+		 * Return error if parent's effective_cpus could become empty.
 		 */
 		if (adding &&
 		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
@@ -1239,20 +1250,35 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 				return -EINVAL;
 			cpumask_copy(tmp->addmask, parent->effective_cpus);
 		}
+
+		/*
+		 * Return error if effective_cpus becomes empty or any CPU
+		 * distributed to child partitions is deleted.
+		 */
+		if (deleting &&
+		   (cpumask_intersects(tmp->delmask, cpuset->subparts_cpus) ||
+		    cpumask_equal(tmp->delmask, cpuset->effective_cpus)))
+			return -EBUSY;
 	} else {
 		/*
 		 * partcmd_update w/o newmask:
 		 *
 		 * addmask = cpus_allowed & parent->effective_cpus
 		 *
+		 * This gets invoked either due to a hotplug event or
+		 * from update_cpumasks_hier() where we can't return an
+		 * error. This can cause a partition root to become invalid
+		 * in the case of a hotplug.
+		 *
 		 * Note that parent's subparts_cpus may have been
 		 * pre-shrunk in case there is a change in the cpu list.
 		 * So no deletion is needed.
 		 */
 		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
 				     parent->effective_cpus);
-		part_error = cpumask_equal(tmp->addmask,
-					   parent->effective_cpus);
+		part_error = (is_partition_root(cpuset) &&
+			      !parent->nr_subparts_cpus) ||
+			     cpumask_equal(tmp->addmask, parent->effective_cpus);
 	}
 
 	if (cmd == partcmd_update) {
@@ -1427,33 +1453,27 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		spin_lock_irq(&callback_lock);
 
-		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
 		if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
+			/*
+			 * Put all active subparts_cpus back to effective_cpus.
+			 */
+			cpumask_or(tmp->new_cpus, tmp->new_cpus,
+				   cp->subparts_cpus);
+			cpumask_and(tmp->new_cpus, tmp->new_cpus,
+				    cpu_active_mask);
 			cp->nr_subparts_cpus = 0;
 			cpumask_clear(cp->subparts_cpus);
-		} else if (cp->nr_subparts_cpus) {
+		}
+
+		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+		if (cp->nr_subparts_cpus) {
 			/*
 			 * Make sure that effective_cpus & subparts_cpus
-			 * are mutually exclusive.
-			 *
-			 * In the unlikely event that effective_cpus
-			 * becomes empty. we clear cp->nr_subparts_cpus and
-			 * let its child partition roots to compete for
-			 * CPUs again.
+			 * of a partition root are mutually exclusive.
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
-			if (cpumask_empty(cp->effective_cpus)) {
-				cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-				cpumask_clear(cp->subparts_cpus);
-				cp->nr_subparts_cpus = 0;
-			} else if (!cpumask_subset(cp->subparts_cpus,
-						   tmp->new_cpus)) {
-				cpumask_andnot(cp->subparts_cpus,
-					cp->subparts_cpus, tmp->new_cpus);
-				cp->nr_subparts_cpus
-					= cpumask_weight(cp->subparts_cpus);
-			}
+			WARN_ON_ONCE(cpumask_empty(cp->effective_cpus));
 		}
 
 		if (new_prs != old_prs)
@@ -1462,7 +1482,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		spin_unlock_irq(&callback_lock);
 		notify_partition_change(cp, old_prs, new_prs);
 
-		WARN_ON(!is_in_v2_mode() &&
+		WARN_ON_ONCE(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
 
 		update_tasks_cpumask(cp);
@@ -1585,8 +1605,8 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	 * Make sure that subparts_cpus is a subset of cpus_allowed.
 	 */
 	if (cs->nr_subparts_cpus) {
-		cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus,
-			       cs->cpus_allowed);
+		cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
+			    cs->cpus_allowed);
 		cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
 	}
 	spin_unlock_irq(&callback_lock);
@@ -2008,20 +2028,26 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		}
 	} else {
 		/*
-		 * Turning off partition root will clear the
-		 * CS_CPU_EXCLUSIVE bit.
+		 * Switch back to member is always allowed if PRS_ERROR.
 		 */
 		if (old_prs == PRS_ERROR) {
-			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			err = 0;
-			goto out;
+			goto reset_flag;
 		}
 
+		/*
+		 * A partition root cannot be reverted to member if some
+		 * CPUs have been distributed to child partition roots.
+		 */
+		if (!cpumask_empty(cs->subparts_cpus))
+			return -EBUSY;
+
 		err = update_parent_subparts_cpumask(cs, partcmd_disable,
 						     NULL, &tmpmask);
 		if (err)
 			goto out;
 
+reset_flag:
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 	}
@@ -3103,11 +3129,28 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus or its parent becomes erroneous, we have to
-	 * transition it to the erroneous state.
+	 * effective_cpus, we will have to force any child partitions,
+	 * if present, to become invalid by setting nr_subparts_cpus to 0
+	 * without causing itself to become invalid.
+	 */
+	if (is_partition_root(cs) && cs->nr_subparts_cpus &&
+	    cpumask_empty(&new_cpus)) {
+		cs->nr_subparts_cpus = 0;
+		cpumask_clear(cs->subparts_cpus);
+		compute_effective_cpumask(&new_cpus, cs, parent);
+	}
+
+	/*
+	 * If empty effective_cpus or zero nr_subparts_cpus or its parent
+	 * becomes erroneous, we have to transition it to the erroneous state.
 	 */
 	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
-	   (parent->partition_root_state == PRS_ERROR))) {
+	    (parent->partition_root_state == PRS_ERROR) ||
+	    !parent->nr_subparts_cpus)) {
+		int old_prs;
+
+		update_parent_subparts_cpumask(cs, partcmd_disable,
+					       NULL, tmp);
 		if (cs->nr_subparts_cpus) {
 			spin_lock_irq(&callback_lock);
 			cs->nr_subparts_cpus = 0;
@@ -3116,38 +3159,23 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 			compute_effective_cpumask(&new_cpus, cs, parent);
 		}
 
-		/*
-		 * If the effective_cpus is empty because the child
-		 * partitions take away all the CPUs, we can keep
-		 * the current partition and let the child partitions
-		 * fight for available CPUs.
-		 */
-		if ((parent->partition_root_state == PRS_ERROR) ||
-		     cpumask_empty(&new_cpus)) {
-			int old_prs;
-
-			update_parent_subparts_cpumask(cs, partcmd_disable,
-						       NULL, tmp);
-			old_prs = cs->partition_root_state;
-			if (old_prs != PRS_ERROR) {
-				spin_lock_irq(&callback_lock);
-				cs->partition_root_state = PRS_ERROR;
-				spin_unlock_irq(&callback_lock);
-				notify_partition_change(cs, old_prs, PRS_ERROR);
-			}
+		old_prs = cs->partition_root_state;
+		if (old_prs != PRS_ERROR) {
+			spin_lock_irq(&callback_lock);
+			cs->partition_root_state = PRS_ERROR;
+			spin_unlock_irq(&callback_lock);
+			notify_partition_change(cs, old_prs, PRS_ERROR);
 		}
 		cpuset_force_rebuild();
 	}
 
 	/*
 	 * On the other hand, an erroneous partition root may be transitioned
-	 * back to a regular one or a partition root with no CPU allocated
-	 * from the parent may change to erroneous.
+	 * back to a regular one.
 	 */
-	if (is_partition_root(parent) &&
-	   ((cs->partition_root_state == PRS_ERROR) ||
-	    !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
-	     update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
+	else if (is_partition_root(parent) &&
+		(cs->partition_root_state == PRS_ERROR) &&
+		 update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
 		cpuset_force_rebuild();
 
 update_tasks:
-- 
2.18.1



* [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type
  2021-07-20 14:18 [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (4 preceding siblings ...)
  2021-07-20 14:18 ` [PATCH v3 5/9] cgroup/cpuset: Clarify the use of invalid partition root Waiman Long
@ 2021-07-20 14:18 ` Waiman Long
  2021-07-27 11:42   ` Frederic Weisbecker
  2021-07-28 16:09   ` Michal Koutný
  2021-07-20 14:18 ` [PATCH v3 7/9] cgroup/cpuset: Allow non-top parent partition root to distribute out all CPUs Waiman Long
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 27+ messages in thread
From: Waiman Long @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD

commit 994fb794cb252edd124a46ca0994e37a4726a100
Author: Waiman Long <longman@redhat.com>
Date:   Sat, 19 Jun 2021 13:28:19 -0400

    cgroup/cpuset: Add a new isolated cpus.partition type

    Cpuset v1 uses the sched_load_balance control file to determine if load
    balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
    as its use may require disabling load balancing at cgroup root.

    For workloads that require very low latency, like DPDK, the latency
    jitter caused by periodic load balancing may exceed the desired
    latency limit.

    When cpuset v2 is in use, the only way to avoid this latency cost is to
    use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
    the kernel boot, however, there is no way to add or remove CPUs from
    this isolated set. For workloads that are more dynamic in nature, that
    means users have to provision enough CPUs for the worst-case situation,
    resulting in excess idle CPUs.

    To address this issue for cpuset v2, a new cpuset.cpus.partition type
    "isolated" is added which allows the creation of a cpuset partition
    without load balancing. This will allow system administrators to
    dynamically adjust the size of an isolated partition to the current
    needs of the workload without rebooting the system.

    Signed-off-by: Waiman Long <longman@redhat.com>

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 48 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 43 insertions(+), 5 deletions(-)
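
As a reading aid, the state handling this patch adds can be modelled in a
few lines of userspace C. This is only a sketch, not kernel code: the
PRS_* values and the accepted strings are taken from the diff below, while
the function names (prs_from_string, prs_to_string, prs_after_update) are
hypothetical, and the input is assumed to be already stripped of
whitespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Same numeric values as the PRS_* macros in the patch. */
enum prs {
	PRS_ERROR    = -1,
	PRS_DISABLED =  0,
	PRS_ENABLED  =  1,
	PRS_ISOLATED =  2,
};

/* Parse a value written to cpuset.cpus.partition; false if rejected. */
static bool prs_from_string(const char *buf, enum prs *out)
{
	assert(buf && out);

	if (!strcmp(buf, "root"))
		*out = PRS_ENABLED;
	else if (!strcmp(buf, "member"))
		*out = PRS_DISABLED;
	else if (!strcmp(buf, "isolated"))
		*out = PRS_ISOLATED;
	else
		return false;	/* the kernel returns -EINVAL here */
	return true;
}

/* String reported when cpuset.cpus.partition is read. */
static const char *prs_to_string(enum prs state)
{
	switch (state) {
	case PRS_ENABLED:	return "root";
	case PRS_ISOLATED:	return "isolated";
	case PRS_DISABLED:	return "member";
	case PRS_ERROR:		return "root invalid";
	}
	return NULL;
}

/*
 * State after a partcmd_update pass, following the switch statement in
 * update_parent_subparts_cpumask(): a valid partition turns invalid on
 * error, and an invalid one recovers as "root" or "isolated" depending
 * on whether its load balance flag is still set.
 */
static enum prs prs_after_update(enum prs cur, bool part_error,
				 bool load_balance)
{
	switch (cur) {
	case PRS_ENABLED:
	case PRS_ISOLATED:
		return part_error ? PRS_ERROR : cur;
	case PRS_ERROR:
		if (part_error)
			return PRS_ERROR;
		return load_balance ? PRS_ENABLED : PRS_ISOLATED;
	default:
		return cur;
	}
}
```

For example, prs_after_update(PRS_ERROR, false, false) yields PRS_ISOLATED:
an invalid partition whose load balance flag is off comes back as
"isolated" rather than "root".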

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c16060b703cc..60562346ecc1 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -172,6 +172,8 @@ struct cpuset {
  *
  *   1 - partition root
  *
+ *   2 - partition root without load balancing (isolated)
+ *
  *  -1 - invalid partition root
  *       None of the cpus in cpus_allowed can be put into the parent's
  *       subparts_cpus. In this case, the cpuset is not a real partition
@@ -183,6 +185,7 @@ struct cpuset {
  */
 #define PRS_DISABLED		0
 #define PRS_ENABLED		1
+#define PRS_ISOLATED		2
 #define PRS_ERROR		-1
 
 /*
@@ -1285,17 +1288,22 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		int prev_prs = cpuset->partition_root_state;
 
 		/*
-		 * Check for possible transition between PRS_ENABLED
-		 * and PRS_ERROR.
+		 * Check for possible transition between PRS_ERROR and
+		 * PRS_ENABLED/PRS_ISOLATED.
 		 */
 		switch (cpuset->partition_root_state) {
 		case PRS_ENABLED:
+		case PRS_ISOLATED:
 			if (part_error)
 				new_prs = PRS_ERROR;
 			break;
 		case PRS_ERROR:
-			if (!part_error)
+			if (part_error)
+				break;
+			if (is_sched_load_balance(cpuset))
 				new_prs = PRS_ENABLED;
+			else
+				new_prs = PRS_ISOLATED;
 			break;
 		}
 		/*
@@ -1434,6 +1442,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 				break;
 
 			case PRS_ENABLED:
+			case PRS_ISOLATED:
 				if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
 					update_tasks_cpumask(parent);
 				break;
@@ -1453,7 +1462,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		spin_lock_irq(&callback_lock);
 
-		if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
+		if (cp->nr_subparts_cpus && (new_prs <= 0)) {
 			/*
 			 * Put all active subparts_cpus back to effective_cpus.
 			 */
@@ -1992,6 +2001,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	int err, old_prs = cs->partition_root_state;
 	struct cpuset *parent = parent_cs(cs);
 	struct tmpmasks tmpmask;
+	bool sched_domain_rebuilt = false;
 
 	if (old_prs == new_prs)
 		return 0;
@@ -2026,6 +2036,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			goto out;
 		}
+
+		if (new_prs == PRS_ISOLATED) {
+			/*
+			 * Disabling the load balance flag should not return an
+			 * error unless the system is running out of memory.
+			 */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+			sched_domain_rebuilt = true;
+		}
+	} else if (old_prs && new_prs) {
+		/*
+		 * A change in load balance state only, no change in cpumasks.
+		 */
+		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
+		err = 0;
+		goto out;	/* Sched domain is rebuilt in update_flag() */
 	} else {
 		/*
 		 * Switch back to member is always allowed if PRS_ERROR.
@@ -2050,6 +2076,12 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 reset_flag:
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+
+		if (!is_sched_load_balance(cs)) {
+			/* Make sure load balance is on */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
+			sched_domain_rebuilt = true;
+		}
 	}
 
 	/*
@@ -2062,7 +2094,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	if (parent->child_ecpus_count)
 		update_sibling_cpumasks(parent, cs, &tmpmask);
 
-	rebuild_sched_domains_locked();
+	if (!sched_domain_rebuilt)
+		rebuild_sched_domains_locked();
 out:
 	if (!err) {
 		spin_lock_irq(&callback_lock);
@@ -2564,6 +2597,9 @@ static int sched_partition_show(struct seq_file *seq, void *v)
 	case PRS_ENABLED:
 		seq_puts(seq, "root\n");
 		break;
+	case PRS_ISOLATED:
+		seq_puts(seq, "isolated\n");
+		break;
 	case PRS_DISABLED:
 		seq_puts(seq, "member\n");
 		break;
@@ -2590,6 +2626,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		val = PRS_ENABLED;
 	else if (!strcmp(buf, "member"))
 		val = PRS_DISABLED;
+	else if (!strcmp(buf, "isolated"))
+		val = PRS_ISOLATED;
 	else
 		return -EINVAL;
 
-- 
2.18.1



* [PATCH v3 7/9] cgroup/cpuset: Allow non-top parent partition root to distribute out all CPUs
  2021-07-20 14:18 [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (5 preceding siblings ...)
  2021-07-20 14:18 ` [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
@ 2021-07-20 14:18 ` Waiman Long
  2021-07-20 14:18 ` [PATCH v3 8/9] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Currently, a parent partition root cannot distribute all its CPUs to
child partition roots; at least one CPU must be left in the parent.
In some use cases, however, a management application may want to create
a parent partition root purely as a management unit, with no task
associated with it and all its CPUs distributed dynamically to child
partition roots according to their needs. Leaving a CPU in the parent
partition root in such a case is a waste.

To accommodate such use cases, a parent partition root can now have
all its CPUs distributed to its child partition roots as long as:
 1) it is not the top cpuset; and
 2) there is no task directly associated with the parent.

Once an empty parent partition root is formed, no new task can be moved
into it.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 90 +++++++++++++++++++++++++++++-------------
 1 file changed, 63 insertions(+), 27 deletions(-)
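
The two conditions listed in the commit message can be illustrated with a
small userspace model of the new partcmd_enable check. This is a sketch
under stated assumptions, not the kernel code: cpumasks are reduced to a
64-bit word, and struct parent_view and can_enable_partition() are
hypothetical names. It covers only the checks added in the hunk below, not
the earlier sanity checks in update_parent_subparts_cpumask().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per CPU; a stand-in for struct cpumask, limited to 64 CPUs. */
typedef uint64_t cpumask64;

/* The parts of the parent cpuset that the check looks at (hypothetical). */
struct parent_view {
	cpumask64 effective_cpus;
	bool is_top_cpuset;	/* !parent_cs(parent) in the patch */
	bool has_tasks;		/* cpuset_has_tasks(parent) in the patch */
};

/* Return 0 if the child may become a partition root, else -EINVAL. */
static int can_enable_partition(cpumask64 child_cpus_allowed,
				const struct parent_view *parent)
{
	bool subset, no_cpu_in_parent;

	assert(parent != NULL);

	subset = (child_cpus_allowed & ~parent->effective_cpus) == 0;
	no_cpu_in_parent = child_cpus_allowed == parent->effective_cpus;

	/* All requested CPUs must come from the parent's effective_cpus. */
	/* The top cpuset must also keep at least one CPU for itself. */
	if (!subset || (parent->is_top_cpuset && no_cpu_in_parent))
		return -EINVAL;

	/* A non-top parent may be emptied only if it has no tasks. */
	if (no_cpu_in_parent && parent->has_tasks)
		return -EINVAL;

	return 0;
}
```

With a task-free, non-top parent whose effective_cpus is 0x0f, a child
requesting 0x0f is accepted; the same request is rejected once the parent
has a task attached or is the top cpuset.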

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 60562346ecc1..d4d0c091a0d3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -284,6 +284,11 @@ static inline void notify_partition_change(struct cpuset *cs,
 	cgroup_file_notify(&cs->partition_file);
 }
 
+static inline int cpuset_has_tasks(const struct cpuset *cs)
+{
+	return cs->css.cgroup->nr_populated_csets;
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
 		  (1 << CS_MEM_EXCLUSIVE)),
@@ -1191,22 +1196,32 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
-	/*
-	 * Enabling partition root is not allowed if not all the CPUs
-	 * can be granted from parent's effective_cpus or at least one
-	 * CPU will be left after that.
-	 */
-	if ((cmd == partcmd_enable) &&
-	   (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
-	     cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
-		return -EINVAL;
-
 	/*
 	 * A cpumask update cannot make parent's effective_cpus become empty.
 	 */
 	adding = deleting = false;
 	old_prs = new_prs = cpuset->partition_root_state;
 	if (cmd == partcmd_enable) {
+		bool parent_is_top_cpuset = !parent_cs(parent);
+		bool no_cpu_in_parent = cpumask_equal(cpuset->cpus_allowed,
+						      parent->effective_cpus);
+		/*
+		 * Enabling partition root is not allowed if not all the CPUs
+		 * can be granted from parent's effective_cpus. If the parent
+		 * is the top cpuset, at least one CPU must be left after that.
+		 */
+		if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
+		    (parent_is_top_cpuset && no_cpu_in_parent))
+			return -EINVAL;
+
+		/*
+		 * A non-top parent can be left with no CPU as long as there
+		 * is no task directly associated with the parent. For such
+		 * a parent, no new task can be moved into it.
+		 */
+		if (no_cpu_in_parent && cpuset_has_tasks(parent))
+			return -EINVAL;
+
 		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
 		adding = true;
 	} else if (cmd == partcmd_disable) {
@@ -1237,9 +1252,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
 					parent->subparts_cpus);
 		/*
-		 * Return error if parent's effective_cpus could become empty.
+		 * Return error if parent's effective_cpus could become empty
+		 * and there are tasks in the parent.
 		 */
-		if (adding &&
+		if (adding && cpuset_has_tasks(parent) &&
 		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
 			if (!deleting)
 				return -EINVAL;
@@ -1255,12 +1271,13 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		}
 
 		/*
-		 * Return error if effective_cpus becomes empty or any CPU
-		 * distributed to child partitions is deleted.
+		 * Return error if effective_cpus becomes empty with tasks
+		 * or any CPU distributed to child partitions is deleted.
 		 */
 		if (deleting &&
 		   (cpumask_intersects(tmp->delmask, cpuset->subparts_cpus) ||
-		    cpumask_equal(tmp->delmask, cpuset->effective_cpus)))
+		   (cpumask_equal(tmp->delmask, cpuset->effective_cpus) &&
+		    cpuset_has_tasks(cpuset))))
 			return -EBUSY;
 	} else {
 		/*
@@ -1281,7 +1298,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 				     parent->effective_cpus);
 		part_error = (is_partition_root(cpuset) &&
 			      !parent->nr_subparts_cpus) ||
-			     cpumask_equal(tmp->addmask, parent->effective_cpus);
+			     (cpumask_equal(tmp->addmask, parent->effective_cpus) &&
+			      cpuset_has_tasks(parent));
 	}
 
 	if (cmd == partcmd_update) {
@@ -1388,9 +1406,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
-		 * parent, which is guaranteed to have some CPUs.
+		 * parent, which is guaranteed to have some CPUs unless
+		 * it is a partition root that has explicitly distributed
+		 * out all its CPUs.
 		 */
 		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
+			if (is_partition_root(cp) &&
+			    cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
+				goto update_parent_subparts;
+
 			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
 			if (!cp->use_parent_ecpus) {
 				cp->use_parent_ecpus = true;
@@ -1412,6 +1436,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			continue;
 		}
 
+update_parent_subparts:
 		/*
 		 * update_parent_subparts_cpumask() should have been called
 		 * for cs already in update_cpumask(). We should also call
@@ -1482,7 +1507,8 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
-			WARN_ON_ONCE(cpumask_empty(cp->effective_cpus));
+			WARN_ON_ONCE(cpumask_empty(cp->effective_cpus) &&
+				     cpuset_has_tasks(cp));
 		}
 
 		if (new_prs != old_prs)
@@ -1816,7 +1842,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 		cp->effective_mems = *new_mems;
 		spin_unlock_irq(&callback_lock);
 
-		WARN_ON(!is_in_v2_mode() &&
+		WARN_ON_ONCE(!is_in_v2_mode() &&
 			!nodes_equal(cp->mems_allowed, cp->effective_mems));
 
 		update_tasks_nodemask(cp);
@@ -2231,6 +2257,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	    (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
 		goto out_unlock;
 
+	/*
+	 * On the default hierarchy, a task cannot be moved to a cpuset with
+	 * empty effective cpus.
+	 */
+	if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
+		goto out_unlock;
+
 	cgroup_taskset_for_each(task, css, tset) {
 		ret = task_can_attach(task, cs->cpus_allowed);
 		if (ret)
@@ -3098,7 +3131,8 @@ hotplug_update_tasks(struct cpuset *cs,
 		     struct cpumask *new_cpus, nodemask_t *new_mems,
 		     bool cpus_updated, bool mems_updated)
 {
-	if (cpumask_empty(new_cpus))
+	/* A partition root is allowed to have empty effective cpus */
+	if (cpumask_empty(new_cpus) && !is_partition_root(cs))
 		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
@@ -3167,22 +3201,24 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus, we will have to force any child partitions,
-	 * if present, to become invalid by setting nr_subparts_cpus to 0
-	 * without causing itself to become invalid.
+	 * effective_cpus with tasks, we will have to force any child
+	 * partitions, if present, to become invalid by setting
+	 * nr_subparts_cpus to 0 without causing itself to become invalid.
 	 */
 	if (is_partition_root(cs) && cs->nr_subparts_cpus &&
-	    cpumask_empty(&new_cpus)) {
+	    cpumask_empty(&new_cpus) && cpuset_has_tasks(cs)) {
 		cs->nr_subparts_cpus = 0;
 		cpumask_clear(cs->subparts_cpus);
 		compute_effective_cpumask(&new_cpus, cs, parent);
 	}
 
 	/*
-	 * If empty effective_cpus or zero nr_subparts_cpus or its parent
-	 * becomes erroneous, we have to transition it to the erroneous state.
+	 * If empty effective_cpus with tasks or zero nr_subparts_cpus or
+	 * its parent becomes erroneous, we have to transition it to the
+	 * erroneous state.
 	 */
-	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
+	if (is_partition_root(cs) &&
+	   ((cpumask_empty(&new_cpus) && cpuset_has_tasks(cs)) ||
 	    (parent->partition_root_state == PRS_ERROR) ||
 	    !parent->nr_subparts_cpus)) {
 		int old_prs;
-- 
2.18.1



* [PATCH v3 8/9] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  2021-07-20 14:18 [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (6 preceding siblings ...)
  2021-07-20 14:18 ` [PATCH v3 7/9] cgroup/cpuset: Allow non-top parent partition root to distribute out all CPUs Waiman Long
@ 2021-07-20 14:18 ` Waiman Long
  2021-07-20 14:18 ` [PATCH v3 9/9] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long
  2021-07-26 23:17 ` [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Tejun Heo
  9 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Update Documentation/admin-guide/cgroup-v2.rst to describe the newly
introduced "isolated" cpuset partition type as well as the ability to
create a non-top cpuset partition with no CPU allocated to it.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 94 +++++++++++++++----------
 1 file changed, 58 insertions(+), 36 deletions(-)
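
The three constraints that the updated text lists for changing
"cpuset.cpus" on a partition root can be modelled roughly as follows.
This is a userspace sketch of the documented rules, not of the kernel
implementation: masks are 64-bit words, check_cpus_update() and the
result enum are hypothetical names, and it assumes that all CPUs are
online and that CPUs removed from the partition go back to the parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU; a stand-in for struct cpumask, limited to 64 CPUs. */
typedef uint64_t cpumask64;

/* Which documented rule, if any, a "cpuset.cpus" change violates. */
enum cpus_update_result {
	CPUS_UPDATE_OK,
	CPUS_UPDATE_NOT_IN_PARENT,	/* rule 1 */
	CPUS_UPDATE_PARENT_EMPTIED,	/* rule 2 */
	CPUS_UPDATE_CHILD_CPU_DELETED,	/* rule 3 */
};

static enum cpus_update_result
check_cpus_update(cpumask64 old_cpus, cpumask64 new_cpus,
		  cpumask64 parent_effective, bool parent_has_tasks,
		  cpumask64 child_partition_cpus)
{
	cpumask64 added = new_cpus & ~old_cpus;
	cpumask64 deleted = old_cpus & ~new_cpus;
	/* Parent's effective CPUs once the change has been applied. */
	cpumask64 parent_after = (parent_effective & ~added) | deleted;

	/* Child partitions can only hold CPUs that this cpuset owns. */
	assert((child_partition_cpus & ~old_cpus) == 0);

	/* 1) Newly added CPUs must be in the parent's effective CPUs. */
	if (added & ~parent_effective)
		return CPUS_UPDATE_NOT_IN_PARENT;

	/* 2) The parent may be left with no CPU only if it has no task. */
	if (added && parent_after == 0 && parent_has_tasks)
		return CPUS_UPDATE_PARENT_EMPTIED;

	/* 3) CPUs handed to child partition roots cannot be deleted. */
	if (deleted & child_partition_cpus)
		return CPUS_UPDATE_CHILD_CPU_DELETED;

	return CPUS_UPDATE_OK;
}
```

The rules are checked in the order in which the documentation lists them,
so the result names the first one that is violated.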

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 5c7377b5bd3e..2e101a353ab1 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2080,8 +2080,9 @@ Cpuset Interface Files
 	It accepts only the following input values when written to.
 
 	  ========	================================
-	  "root"	a partition root
-	  "member"	a non-root member of a partition
+	  "member"	Non-root member of a partition
+	  "root"	Partition root
+	  "isolated"	Partition root without load balancing
 	  ========	================================
 
 	When set to be a partition root, the current cgroup is the
@@ -2090,9 +2091,14 @@ Cpuset Interface Files
 	partition roots themselves and their descendants.  The root
 	cgroup is always a partition root.
 
-	There are constraints on where a partition root can be set.
-	It can only be set in a cgroup if all the following conditions
-	are true.
+	When set to "isolated", the CPUs in that partition root will
+	be in an isolated state without any load balancing from the
+	scheduler.  Tasks in such a partition must be explicitly bound
+	to each individual CPU.
+
+	There are constraints on where a partition root can be set
+	("root" or "isolated").  It can only be set in a cgroup if all
+	the following conditions are true.
 
 	1) The "cpuset.cpus" is not empty and the list of CPUs are
 	   exclusive, i.e. they are not shared by any of its siblings.
@@ -2103,51 +2109,67 @@ Cpuset Interface Files
 	   eliminating corner cases that have to be handled if such a
 	   condition is allowed.
 
-	Setting it to partition root will take the CPUs away from the
-	effective CPUs of the parent cgroup.  Once it is set, this
+	Setting it to a partition root will take the CPUs away from
+	the effective CPUs of the parent cgroup.  Once it is set, this
 	file cannot be reverted back to "member" if there are any child
 	cgroups with cpuset enabled.
 
-	A parent partition cannot distribute all its CPUs to its
-	child partitions.  There must be at least one cpu left in the
-	parent partition.
+	A parent partition may distribute all its CPUs to its child
+	partitions as long as it is not the root cgroup and there is no
+	task directly associated with that parent partition.  Otherwise,
+	there must be at least one cpu left in the parent partition.
+	A new task cannot be moved to a partition root with no effective
+	cpu.
+
+	Once it becomes a partition root, changes to "cpuset.cpus"
+	are generally allowed as long as the first condition above
+	(cpu exclusivity rule) is true.  Other constraints for this
+	operation are as follows.
 
-	Once becoming a partition root, changes to "cpuset.cpus" is
-	generally allowed as long as the first condition above is true,
-	the change will not take away all the CPUs from the parent
-	partition and the new "cpuset.cpus" value is a superset of its
-	children's "cpuset.cpus" values.
+	1) Any newly added CPUs must be a subset of the parent's
+	   "cpuset.cpus.effective".
+	2) Taking away all the CPUs from the parent's "cpuset.cpus.effective"
+	   is only allowed if there is no task associated with the
+	   parent partition.
+	3) Deletion of CPUs that have been distributed to child partition
+	   roots is not allowed.
 
 	Sometimes, external factors like changes to ancestors'
 	"cpuset.cpus" or cpu hotplug can cause the state of the partition
-	root to change.  On read, the "cpuset.sched.partition" file
-	can show the following values.
+	root to change.  On read, the "cpuset.cpus.partition" file can
+	show the following values.
 
 	  ==============	==============================
 	  "member"		Non-root member of a partition
 	  "root"		Partition root
+	  "isolated"		Partition root without load balancing
 	  "root invalid"	Invalid partition root
 	  ==============	==============================
 
-	It is a partition root if the first 2 partition root conditions
-	above are true and at least one CPU from "cpuset.cpus" is
-	granted by the parent cgroup.
-
-	A partition root can become invalid if none of CPUs requested
-	in "cpuset.cpus" can be granted by the parent cgroup or the
-	parent cgroup is no longer a partition root itself.  In this
-	case, it is not a real partition even though the restriction
-	of the first partition root condition above will still apply.
-	The cpu affinity of all the tasks in the cgroup will then be
-	associated with CPUs in the nearest ancestor partition.
-
-	An invalid partition root can be transitioned back to a
-	real partition root if at least one of the requested CPUs
-	can now be granted by its parent.  In this case, the cpu
-	affinity of all the tasks in the formerly invalid partition
-	will be associated to the CPUs of the newly formed partition.
-	Changing the partition state of an invalid partition root to
-	"member" is always allowed even if child cpusets are present.
+	A partition root becomes invalid if all the CPUs requested in
+	"cpuset.cpus" become unavailable.  This can happen if all the
+	CPUs have been offlined, or the state of an ancestor partition
+	root becomes invalid.  In this case, it is not a real partition
+	even though the restriction of the cpu exclusivity rule will
+	still apply.  The cpu affinity of all the tasks in the cgroup
+	will then be associated with CPUs in the nearest ancestor
+	partition.
+
+	In the special case of a parent partition competing with a child
+	partition for the only CPU left, the parent partition wins and
+	the child partition becomes invalid.
+
+	An invalid partition root can be transitioned back to a real
+	partition root if at least one of the requested CPUs becomes
+	available again.  In this case, the cpu affinity of all the tasks
+	in the formerly invalid partition will be associated with the CPUs
+	of the newly formed partition.  Changing the partition state of
+	an invalid partition root to "member" is always allowed even if
+	child cpusets are present.  However, changing a partition root back
+	to member will not be allowed if child partitions are present.
+
+	Poll and inotify events are triggered whenever a transition to
+	or from an invalid partition root happens.
 
 
 Device controller
-- 
2.18.1



* [PATCH v3 9/9] kselftest/cgroup: Add cpuset v2 partition root state test
  2021-07-20 14:18 [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (7 preceding siblings ...)
  2021-07-20 14:18 ` [PATCH v3 8/9] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
@ 2021-07-20 14:18 ` Waiman Long
  2021-07-26 23:17 ` [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Tejun Heo
  9 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný,
	Waiman Long

Add a test script test_cpuset_prs.sh with a helper program wait_inotify
for exercising the cpuset v2 partition root state code.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/testing/selftests/cgroup/Makefile       |   5 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 626 ++++++++++++++++++
 tools/testing/selftests/cgroup/wait_inotify.c |  67 ++
 3 files changed, 696 insertions(+), 2 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
 create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 59e222460581..3f1fd3f93f41 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -1,10 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -Wall -pthread
 
-all:
+all: ${HELPER_PROGS}
 
 TEST_FILES     := with_stress.sh
-TEST_PROGS     := test_stress.sh
+TEST_PROGS     := test_stress.sh test_cpuset_prs.sh
+TEST_GEN_FILES := wait_inotify
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_kmem
 TEST_GEN_PROGS += test_core
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
new file mode 100755
index 000000000000..5a99f7048910
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -0,0 +1,626 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test for cpuset v2 partition root state (PRS)
+#
+# The sched verbose flag is set, if available, so that the console log
+# can be examined for the correct setting of scheduling domain.
+#
+
+skip_test() {
+	echo "$1"
+	echo "Test SKIPPED"
+	exit 0
+}
+
+[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
+
+# Set sched verbose flag, if available
+[[ -d /sys/kernel/debug/sched ]] && echo Y > /sys/kernel/debug/sched/verbose
+
+# Get wait_inotify location
+WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
+
+# Find cgroup v2 mount point
+CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
+
+CPUS=$(lscpu | grep "^CPU(s)" | sed -e "s/.*:[[:space:]]*//")
+[[ $CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!"
+
+# Set verbose flag
+VERBOSE=
+[[ "$1" = -v ]] && VERBOSE=1
+
+cd $CGROUP2
+echo +cpuset > cgroup.subtree_control
+[[ -d test ]] || mkdir test
+cd test
+
+console_msg()
+{
+	MSG=$1
+	echo "$MSG"
+	echo "" > /dev/console
+	echo "$MSG" > /dev/console
+	sleep 0.01
+}
+
+test_partition()
+{
+	EXPECTED_VAL=$1
+	echo $EXPECTED_VAL > cpuset.cpus.partition
+	[[ $? -eq 0 ]] || exit 1
+	ACTUAL_VAL=$(cat cpuset.cpus.partition)
+	[[ $ACTUAL_VAL != $EXPECTED_VAL ]] && {
+		echo "cpuset.cpus.partition: expect $EXPECTED_VAL, found $ACTUAL_VAL"
+		echo "Test FAILED"
+		exit 1
+	}
+}
+
+test_effective_cpus()
+{
+	EXPECTED_VAL=$1
+	ACTUAL_VAL=$(cat cpuset.cpus.effective)
+	[[ "$ACTUAL_VAL" != "$EXPECTED_VAL" ]] && {
+		echo "cpuset.cpus.effective: expect '$EXPECTED_VAL', found '$ACTUAL_VAL'"
+		echo "Test FAILED"
+		exit 1
+	}
+}
+
+# Adding current process to cgroup.procs as a test
+test_add_proc()
+{
+	OUTSTR="$1"
+	ERRMSG=$( (echo $$ > cgroup.procs) |& cat)
+	echo $ERRMSG | grep -q "$OUTSTR"
+	[[ $? -ne 0 ]] && {
+		echo "cgroup.procs: expect '$OUTSTR', got '$ERRMSG'"
+		echo "Test FAILED"
+		exit 1
+	}
+	echo $$ > $CGROUP2/cgroup.procs	# Move out the task
+}
+
+#
+# Testing the new "isolated" partition root type
+#
+test_isolated()
+{
+	echo 2-3 > cpuset.cpus
+	TYPE=$(cat cpuset.cpus.partition)
+	[[ $TYPE = member ]] || echo member > cpuset.cpus.partition
+
+	console_msg "Change from member to root"
+	test_partition root
+
+	console_msg "Change from root to isolated"
+	test_partition isolated
+
+	console_msg "Change from isolated to member"
+	test_partition member
+
+	console_msg "Change from member to isolated"
+	test_partition isolated
+
+	console_msg "Change from isolated to root"
+	test_partition root
+
+	console_msg "Change from root to member"
+	test_partition member
+
+	#
+	# Testing partition root with no cpu
+	#
+	console_msg "Distribute all cpus to child partition"
+	echo +cpuset > cgroup.subtree_control
+	test_partition root
+
+	mkdir A1
+	cd A1
+	echo 2-3 > cpuset.cpus
+	test_partition root
+	test_effective_cpus 2-3
+	cd ..
+	test_effective_cpus ""
+
+	console_msg "Moving task to partition test"
+	test_add_proc "No space left"
+	cd A1
+	test_add_proc ""
+	cd ..
+
+	console_msg "Shrink and expand child partition"
+	cd A1
+	echo 2 > cpuset.cpus
+	cd ..
+	test_effective_cpus 3
+	cd A1
+	echo 2-3 > cpuset.cpus
+	cd ..
+	test_effective_cpus ""
+
+	# Cleaning up
+	console_msg "Cleaning up"
+	echo $$ > $CGROUP2/cgroup.procs
+	[[ -d A1 ]] && rmdir A1
+}
+
+#
+# Cpuset controller state transition test matrix.
+#
+# Cgroup test hierarchy
+#
+# test -- A1 -- A2 -- A3
+#      \- B1
+#
+#  P<v> = set cpus.partition (0:member, 1:root, 2:isolated, -1:root invalid)
+#  C<l> = add cpu-list
+#  S<p> = use prefix in subtree_control
+#  T    = put a task into cgroup
+#  O<c>-<v> = Write <v> to CPU online file of <c>
+#
+SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
+TEST_MATRIX=(
+	# test  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+	# ----  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+	"  S+    C0-1     .      .    C2-3    S+    C4-5     .      .     0 A2:0-1"
+	"  S+    C0-1     .      .    C2-3    P1      .      .      .     0 "
+	"  S+    C0-1     .      .    C2-3   P1:S+ C0-1:P1   .      .     0 "
+	"  S+    C0-1     .      .    C2-3   P1:S+  C1:P1    .      .     0 "
+	"  S+   C0-1:S+   .      .    C2-3     .      .      .     P1     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+     C1      .      .     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+    C1:P1    .      .     0 "
+	"  S+   C0-1:P1   .      .    C2-3    S+    C1:P1    .     P1     0 "
+	"  S+   C0-1:P1   .      .    C2-3   C4-5     .      .      .     0 A1:4-5"
+	"  S+   C0-1:P1   .      .    C2-3  S+:C4-5   .      .      .     0 A1:4-5"
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .     C2     0 "
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .    C4-5    0 B1:4-5"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .      .      .      .      .     0 A1:0-1,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     C1-3    .      .      .     0 A1:1,A2:2-3"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      .      .      .     0 A1:,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      P0     .      .     0 A1:3,A2:3 A1:P1,A2:P0"
+	"  S+ C2-3:P1:S+  C2:P1  .      .     C2-4    .      .      .     0 A1:3-4,A2:2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     C3      .      .     C0-2   0 A1:,B1:0-2 A1:P1,A2:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     C2-3    .      .      .     0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+	# CPU offlining cases:
+	"  S+    C0-1     .      .    C2-3    S+    C4-5     .     O2-0   0 A1:0-1,B1:3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O2-0    .      .      .     0 A1:0-1,A2:3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O2-0   O2-1    .      .     0 A1:0-1,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O1-0    .      .      .     0 A1:0,A2:2-3"
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .     O1-0   O1-1    .      .     0 A1:0-1,A2:2-3"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O3-0   O3-1    .      .     0 A1:2,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P2  .      .     O3-0   O3-1    .      .     0 A1:2,A2:3 A1:P1,A2:P2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O2-0   O2-1    .      .     0 A1:2,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P2  .      .     O2-0   O2-1    .      .     0 A1:2,A2:3 A1:P1,A2:P2"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O2-0    .      .      .     0 A1:,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O3-0    .      .      .     0 A1:2,A2: A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .    T:O2-0   .      .      .     0 A1:3,A2:3 A1:P1,A2:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O1-0    .      .      .     0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O2-0    .      .      .     0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .     O3-0    .      .      .     0 A1:1,A2:2,A3: A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0   .      .      .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .    T:O2-0   .      .     0 A1:1,A2:3,A3:3 A1:P1,A2:P1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .      .    T:O3-0   .     0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O1-1    .      .     0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .    T:O2-0  O2-1    .     0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .      .      .    T:O3-0  O3-1   0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O2-0   O1-1    .     0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+	"  S+ $SETUP_A123_PARTITIONS    .    T:O1-0  O2-0   O2-1    .     0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+
+	# test  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+	# ----  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+	# Failure cases:
+
+	# To become a partition root, cpuset.cpus must be a subset of
+	# parent's cpuset.cpus.effective.
+	"  S+    C0-1     .      .    C2-3    S+   C4-5:P1   .      .     1 "
+
+	# A cpuset cannot become a partition root if it has child cpusets
+	# with non-empty cpuset.cpus.
+	"  S+   C0-1:S+   C1     .    C2-3    P1      .      .      .     1 "
+
+	# Any change to cpuset.cpus of a partition root must be exclusive.
+	"  S+   C0-1:P1   .      .    C2-3   C0-2     .      .      .     1 "
+	"  S+    C0-1     .      .   C2-3:P1   .      .      .     C1     1 "
+	"  S+ C2-3:P1:S+  C2:P1  .     C1    C1-3     .      .      .     1 "
+
+	# Deletion of CPUs distributed to child partition root is not allowed.
+	"  S+ C0-1:P1:S+ C1      .    C2-3   C4-5     .      .      .     1 "
+	"  S+ C0-3:P1:S+ C2-3:P1 .      .    C0-2     .      .      .     1 "
+
+	# Adding CPUs to partition root that are not in parent's
+	# cpuset.cpus.effective is not allowed.
+	"  S+ C2-3:P1:S+  C3:P1  .      .      .     C2-4    .      .     1 "
+
+	# Taking away all CPUs from parent or itself is not allowed if there are tasks.
+	"  S+ C2-3:P1:S+  C3:P1  .      .      T     C2-3    .      .     1 A1:2,A2:3"
+	"  S+ C1-3:P1:S+ C2-3:P1:S+
+	                       C3:P1    .    T:C2-3   .      .      .     1 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+	# A partition root cannot change to member if it has child partition.
+	"  S+ C2-3:P1:S+  C3:P1  .      .      P0     .      .      .     1 "
+	"  S+ $SETUP_A123_PARTITIONS    .     C2-3    P0     .      .     1 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+	# A task cannot be added to a partition with no cpu
+	"  S+ C2-3:P1:S+  C3:P1  .      .    O2-0:T   .      .      .     1 A1:,A2:3 A1:P1,A2:P1"
+	"  S+ C2-3:P1:S+  C3:P1  .      .     O3-0    T      .      .     1 A1:2,A2: A1:P1,A2:P1"
+)
+
+#
+# Write to the cpu online file
+#  $1 - <c>-<v> where <c> = cpu number, <v> value to be written
+#
+write_cpu_online()
+{
+	CPU=${1%-*}
+	VAL=${1#*-}
+	CPUFILE=/sys/devices/system/cpu/cpu${CPU}/online
+	if [[ $VAL -eq 0 ]]
+	then
+		OFFLINE_CPUS="$OFFLINE_CPUS $CPU"
+	else
+		[[ -n "$OFFLINE_CPUS" ]] && {
+			OFFLINE_CPUS=$(echo $CPU $CPU $OFFLINE_CPUS | fmt -1 |\
+					sort | uniq -u)
+		}
+	fi
+	echo $VAL > $CPUFILE
+	sleep 0.01
+}
+
+#
+# Set controller state
+#  $1 - cgroup directory
+#  $2 - state
+#  $3 - showerr
+#
+# The presence of ":" in state means transition from one to the next.
+#
+set_ctrl_state()
+{
+	TMPMSG=/tmp/.msg_$$
+	CGRP=$1
+	STATE=$2
+	SHOWERR=${3}${VERBOSE}
+	CTRL=${CTRL:=$CONTROLLER}
+	HASERR=0
+	REDIRECT="2> $TMPMSG"
+	[[ -z "$STATE" || "$STATE" = '.' ]] && return 0
+
+	rm -f $TMPMSG
+	for CMD in $(echo $STATE | sed -e "s/:/ /g")
+	do
+		TFILE=$CGRP/cgroup.procs
+		SFILE=$CGRP/cgroup.subtree_control
+		PFILE=$CGRP/cpuset.cpus.partition
+		CFILE=$CGRP/cpuset.cpus
+		S=$(expr substr $CMD 1 1)
+		if [[ $S = S ]]
+		then
+			PREFIX=${CMD#?}
+			COMM="echo ${PREFIX}${CTRL} > $SFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = C ]]
+		then
+			CPUS=${CMD#?}
+			COMM="echo $CPUS > $CFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = P ]]
+		then
+			VAL=${CMD#?}
+			case $VAL in
+			0)  VAL=member
+			    ;;
+			1)  VAL=root
+			    ;;
+			2)  VAL=isolated
+			    ;;
+			*)
+			    echo "Invalid partition state - $VAL"
+			    exit 1
+			    ;;
+			esac
+			COMM="echo $VAL > $PFILE"
+			eval $COMM $REDIRECT
+		elif [[ $S = O ]]
+		then
+			VAL=${CMD#?}
+			write_cpu_online $VAL
+		elif [[ $S = T ]]
+		then
+			COMM="echo 0 > $TFILE"
+			eval $COMM $REDIRECT
+		fi
+		RET=$?
+		[[ $RET -ne 0 ]] && {
+			[[ -n "$SHOWERR" ]] && {
+				echo "$COMM"
+				cat $TMPMSG
+			}
+			HASERR=1
+		}
+		sleep 0.01
+		rm -f $TMPMSG
+	done
+	return $HASERR
+}
+
+set_ctrl_state_noerr()
+{
+	CGRP=$1
+	STATE=$2
+	[[ -d $CGRP ]] || mkdir $CGRP
+	set_ctrl_state $CGRP $STATE 1
+	[[ $? -ne 0 ]] && {
+		echo "ERROR: Failed to set $2 to cgroup $1!"
+		exit 1
+	}
+}
+
+online_cpus()
+{
+	[[ -n "$OFFLINE_CPUS" ]] && {
+		for C in $OFFLINE_CPUS
+		do
+			write_cpu_online ${C}-1
+		done
+	}
+}
+
+#
+# Move the task out, bring offlined CPUs back online and remove the test cgroups.
+#
+reset_cgroup_states()
+{
+	echo 0 > $CGROUP2/cgroup.procs
+	online_cpus
+	rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1
+	set_ctrl_state . S-
+	sleep 0.005 # 5ms artificial delay to complete the deletion
+}
+
+dump_states()
+{
+	for DIR in A1 A1/A2 A1/A2/A3 B1
+	do
+		ECPUS=$DIR/cpuset.cpus.effective
+		PRS=$DIR/cpuset.cpus.partition
+		[[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)"
+		[[ -e $PRS   ]] && echo "$PRS: $(cat $PRS)"
+	done
+}
+
+#
+# Check effective cpus
+# $1 - check string, format: <cgroup>:<cpu-list>[,<cgroup>:<cpu-list>]*
+#
+check_effective_cpus()
+{
+	CHK_STR=$1
+	for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+	do
+		set -- $(echo $CHK | sed -e "s/:/ /g")
+		CGRP=$1
+		CPUS=$2
+		[[ $CGRP = A2 ]] && CGRP=A1/A2
+		[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+		FILE=$CGRP/cpuset.cpus.effective
+		[[ -e $FILE ]] || return 1
+		[[ $CPUS = $(cat $FILE) ]] || return 1
+	done
+}
+
+#
+# Check cgroup states
+#  $1 - check string, format: <cgroup>:<state>[,<cgroup>:<state>]*
+#
+check_cgroup_states()
+{
+	CHK_STR=$1
+	for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+	do
+		set -- $(echo $CHK | sed -e "s/:/ /g")
+		CGRP=$1
+		STATE=$2
+		FILE=
+		EVAL=$(expr substr $STATE 2 2)
+		[[ $CGRP = A2 ]] && CGRP=A1/A2
+		[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+
+		case $STATE in
+			P*) FILE=$CGRP/cpuset.cpus.partition
+			    ;;
+			*)  echo "Unknown state: $STATE!"
+			    exit 1
+			    ;;
+		esac
+		VAL=$(cat $FILE)
+
+		case "$VAL" in
+			member) VAL=0
+				;;
+			root)	VAL=1
+				;;
+			isolated)
+				VAL=2
+				;;
+			"root invalid")
+				VAL=-1
+				;;
+		esac
+		[[ $EVAL != $VAL ]] && return 1
+	done
+	return 0
+}
+
+#
+# Run cpuset state transition test
+#  $1 - test matrix name
+#
+# This test is somewhat fragile as delays (sleep x) are added in various
+# places to make sure state changes are fully propagated before the next
+# action. These delays may need to be adjusted if running on a slower machine.
+#
+run_state_test()
+{
+	TEST=$1
+	CONTROLLER=cpuset
+	I=0
+	CPULIST=0-6
+	eval CNT="\${#$TEST[@]}"
+
+	reset_cgroup_states
+	echo $CPULIST > cpuset.cpus
+	echo root > cpuset.cpus.partition
+	console_msg "Running state transition test ..."
+
+	while [[ $I -lt $CNT ]]
+	do
+		echo "Running test $I ..." > /dev/console
+		eval set -- "\${$TEST[$I]}"
+		ROOT=$1
+		OLD_A1=$2
+		OLD_A2=$3
+		OLD_A3=$4
+		OLD_B1=$5
+		NEW_A1=$6
+		NEW_A2=$7
+		NEW_A3=$8
+		NEW_B1=$9
+		RESULT=${10}
+		ECPUS=${11}
+		STATES=${12}
+
+		set_ctrl_state_noerr .        $ROOT
+		set_ctrl_state_noerr A1       $OLD_A1
+		set_ctrl_state_noerr A1/A2    $OLD_A2
+		set_ctrl_state_noerr A1/A2/A3 $OLD_A3
+		set_ctrl_state_noerr B1       $OLD_B1
+		RETVAL=0
+		set_ctrl_state A1       $NEW_A1; ((RETVAL += $?))
+		set_ctrl_state A1/A2    $NEW_A2; ((RETVAL += $?))
+		set_ctrl_state A1/A2/A3 $NEW_A3; ((RETVAL += $?))
+		set_ctrl_state B1       $NEW_B1; ((RETVAL += $?))
+
+		[[ $RETVAL -ne $RESULT ]] && {
+			echo "Test $TEST[$I] failed result check!"
+			eval echo \"\${$TEST[$I]}\"
+			online_cpus
+			exit 1
+		}
+
+		[[ -n "$ECPUS" && "$ECPUS" != . ]] && {
+			check_effective_cpus $ECPUS
+			[[ $? -ne 0 ]] && {
+				echo "Test $TEST[$I] failed effective CPU check!"
+				eval echo \"\${$TEST[$I]}\"
+				echo
+				dump_states
+				online_cpus
+				exit 1
+			}
+		}
+
+		[[ -n "$STATES" ]] && {
+			check_cgroup_states $STATES
+			[[ $? -ne 0 ]] && {
+				echo "FAILED: Test $TEST[$I] failed states check!"
+				eval echo \"\${$TEST[$I]}\"
+				echo
+				dump_states
+				online_cpus
+				exit 1
+			}
+		}
+
+		reset_cgroup_states
+		[[ -n "$VERBOSE" ]] && echo "Test $I done."
+		((I++))
+	done
+	echo "All $I tests of $TEST PASSED."
+
+	#
+	# Check to see if the effective cpu list changes
+	#
+	sleep 0.05
+	NEWLIST=$(cat cpuset.cpus.effective)
+	[[ $NEWLIST != $CPULIST ]] && {
+		echo "Effective cpus changed to $NEWLIST!"
+	}
+	echo member > cpuset.cpus.partition
+}
+
+#
+# Wait for inotify event for the given file and read it
+# $1: cgroup file to wait for
+# $2: file to store the read result
+#
+wait_inotify()
+{
+	CGROUP_FILE=$1
+	OUTPUT_FILE=$2
+
+	$WAIT_INOTIFY $CGROUP_FILE
+	cat $CGROUP_FILE > $OUTPUT_FILE
+}
+
+#
+# Test if inotify events are properly generated when going into and out of
+# invalid partition state.
+#
+test_inotify()
+{
+	ERR=0
+	PRS=/tmp/.prs_$$
+	[[ -f $WAIT_INOTIFY ]] || {
+		echo "wait_inotify not found, inotify test SKIPPED."
+		return
+	}
+
+	sleep 0.01
+	echo 1 > cpuset.cpus
+	echo 0 > cgroup.procs
+	echo root > cpuset.cpus.partition
+	sleep 0.01
+	rm -f $PRS
+	wait_inotify $PWD/cpuset.cpus.partition $PRS &
+	sleep 0.01
+	set_ctrl_state . "O1-0"
+	sleep 0.01
+	check_cgroup_states ".:P-1"
+	if [[ $? -ne 0 ]]
+	then
+		echo "FAILED: Inotify test - partition not invalid"
+		ERR=1
+	elif [[ ! -f $PRS ]]
+	then
+		echo "FAILED: Inotify test - event not generated"
+		ERR=1
+		kill %1
+	elif [[ $(cat $PRS) != "root invalid" ]]
+	then
+		echo "FAILED: Inotify test - incorrect state"
+		cat $PRS
+		ERR=1
+	fi
+	online_cpus
+	echo member > cpuset.cpus.partition
+	echo 0 > ../cgroup.procs
+	if [[ $ERR -ne 0 ]]
+	then
+		exit 1
+	else
+		echo "Inotify test PASSED"
+	fi
+}
+
+run_state_test TEST_MATRIX
+test_isolated
+test_inotify
+echo "All tests PASSED."
+cd ..
+rmdir test
diff --git a/tools/testing/selftests/cgroup/wait_inotify.c b/tools/testing/selftests/cgroup/wait_inotify.c
new file mode 100644
index 000000000000..a6f80d008b4b
--- /dev/null
+++ b/tools/testing/selftests/cgroup/wait_inotify.c
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Wait for an inotify event on the given cgroup file.
+ */
+#include <linux/limits.h>
+#include <sys/inotify.h>
+#include <sys/mman.h>
+#include <sys/ptrace.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+const char usage[] = "Usage: %s <cgroup_file>\n";
+static char *file;
+
+static inline void fail_message(char *msg)
+{
+	fprintf(stderr, msg, file);
+	exit(1);
+}
+
+int main(int argc, char *argv[])
+{
+	char *cmd = argv[0];
+	int fd;
+	struct pollfd fds = { .events = POLLIN, };
+
+	if (argc != 2) {
+		fprintf(stderr, usage, cmd);
+		return -1;
+	}
+	file = argv[1];
+	fd = open(file, O_RDONLY);
+	if (fd < 0)
+		fail_message("Cgroup file %s not found!\n");
+	close(fd);
+
+	fd = inotify_init();
+	if (fd < 0)
+		fail_message("inotify_init() fails on %s!\n");
+	if (inotify_add_watch(fd, file, IN_MODIFY) < 0)
+		fail_message("inotify_add_watch() fails on %s!\n");
+	fds.fd = fd;
+
+	/*
+	 * poll waiting loop
+	 */
+	for (;;) {
+		int ret = poll(&fds, 1, 10000);
+		if (ret < 0) {
+			if (errno == EINTR)
+				continue;
+			perror("poll");
+			exit(1);
+		}
+		if ((ret > 0) && (fds.revents & POLLIN))
+			break;
+	}
+	close(fd);
+	return 0;
+}
-- 
2.18.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/9] cgroup/cpuset: Miscellaneous code cleanup
  2021-07-20 14:18 ` [PATCH v3 1/9] cgroup/cpuset: Miscellaneous code cleanup Waiman Long
@ 2021-07-26 22:56   ` Tejun Heo
  0 siblings, 0 replies; 27+ messages in thread
From: Tejun Heo @ 2021-07-26 22:56 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On Tue, Jul 20, 2021 at 10:18:26AM -0400, Waiman Long wrote:
> Use more descriptive variable names for update_prstate(), remove
> unnecessary code and fix some typos. There is no functional change.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Applied to cgroup/for-5.15.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 2/9] cgroup/cpuset: Fix a partition bug with hotplug
  2021-07-20 14:18 ` [PATCH v3 2/9] cgroup/cpuset: Fix a partition bug with hotplug Waiman Long
@ 2021-07-26 22:59   ` Tejun Heo
  2021-07-27 20:16     ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2021-07-26 22:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On Tue, Jul 20, 2021 at 10:18:27AM -0400, Waiman Long wrote:
> In cpuset_hotplug_workfn(), the detection of whether the cpu list
> has been changed is done by comparing the effective cpus of the top
> cpuset with the cpu_active_mask. However, in the rare case that just
> all the CPUs in the subparts_cpus are offlined, the detection fails
> and the partition states are not updated correctly. Fix it by forcing
> the cpus_updated flag to true in this particular case.
> 
> Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition")
> Signed-off-by: Waiman Long <longman@redhat.com>

Applied to cgroup/for-5.15 w/ a minor update to the comment (I dropped
"just" before "all". It read weird to me.)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 3/9] cgroup/cpuset: Fix violation of cpuset locking rule
  2021-07-20 14:18 ` [PATCH v3 3/9] cgroup/cpuset: Fix violation of cpuset locking rule Waiman Long
@ 2021-07-26 23:10   ` Tejun Heo
  0 siblings, 0 replies; 27+ messages in thread
From: Tejun Heo @ 2021-07-26 23:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On Tue, Jul 20, 2021 at 10:18:28AM -0400, Waiman Long wrote:
> The cpuset fields that manage partition root state do not strictly
> follow the cpuset locking rule that update to cpuset has to be done
> with both the callback_lock and cpuset_mutex held. This is now fixed
> by making sure that the locking rule is upheld.
> 
> Fixes: 3881b86128d0 ("cpuset: Add an error state to cpuset.sched.partition")
> Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition")
> Signed-off-by: Waiman Long <longman@redhat.com>

Applied to cgroup/for-5.15.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 4/9] cgroup/cpuset: Enable event notification when partition become invalid
  2021-07-20 14:18 ` [PATCH v3 4/9] cgroup/cpuset: Enable event notification when partition become invalid Waiman Long
@ 2021-07-26 23:14   ` Tejun Heo
  2021-07-27 20:26     ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2021-07-26 23:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On Tue, Jul 20, 2021 at 10:18:29AM -0400, Waiman Long wrote:
> +static inline void notify_partition_change(struct cpuset *cs,
> +					   int old_prs, int new_prs)
> +{
> +	if ((old_prs == new_prs) ||
> +	   ((old_prs != PRS_ERROR) && (new_prs != PRS_ERROR)))
> +		return;
> +	cgroup_file_notify(&cs->partition_file);

I'd generate an event on any state change. The user has to read the file
to find out what happened anyway.
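
As a rough illustration of the difference, the two notification policies
can be written as plain predicates. This is only a sketch: the enum below
is a hypothetical stand-in for the partition root states (its values are
illustrative), and neither function is the kernel's code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the partition root states; values are illustrative. */
enum toy_prs { TOY_PRS_ERROR = -1, TOY_PRS_MEMBER = 0, TOY_PRS_ROOT = 1 };

/* Condition as posted: notify only on transitions into or out of the error state. */
static bool notify_on_error_only(int old_prs, int new_prs)
{
	if (old_prs == new_prs)
		return false;
	return old_prs == TOY_PRS_ERROR || new_prs == TOY_PRS_ERROR;
}

/* Suggested alternative: notify on every state change. */
static bool notify_on_any_change(int old_prs, int new_prs)
{
	return old_prs != new_prs;
}
```

With these definitions a member -> root transition is silent under the
first predicate but generates an event under the second.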

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-07-20 14:18 [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
                   ` (8 preceding siblings ...)
  2021-07-20 14:18 ` [PATCH v3 9/9] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long
@ 2021-07-26 23:17 ` Tejun Heo
  2021-07-27 21:14   ` Waiman Long
  9 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2021-07-26 23:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

Hello,

On Tue, Jul 20, 2021 at 10:18:25AM -0400, Waiman Long wrote:
> v3:
>  - Add two new patches (patches 2 & 3) to fix bugs found during the
>    testing process.
>  - Add a new patch to enable inotify event notification when partition
>    become invalid.
>  - Add a test to test event notification when partition become invalid.

I applied parts of the series. I think there was a bit of miscommunication.
I meant that we should use the invalid state as the only way to indicate
errors as long as the error state is something which can be reached through
hot unplug or other uncontrollable changes, and require users to monitor the
state transitions for confirmation and error handling.
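
From the user's side, that could look roughly like the sketch below: write
the requested state, then read the file back to see what the kernel
actually settled on. The helper name and return convention are
hypothetical, and the path is whatever control file the caller passes in;
nothing here is an existing API.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write 'want' to a partition control file, then read the file back.
 * Returns 0 if the file reports 'want', 1 if it reports something else
 * (for example "root invalid"), and -1 on I/O error.
 * Hypothetical helper for illustration only.
 */
static int set_and_confirm(const char *path, const char *want)
{
	char got[64] = "";
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%s\n", want) < 0) {
		fclose(f);
		return -1;
	}
	/* A rejected write may only surface when the buffer is flushed. */
	if (fclose(f) != 0)
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(got, sizeof(got), f))
		got[0] = '\0';
	fclose(f);
	got[strcspn(got, "\n")] = '\0';	/* strip the trailing newline */

	return strcmp(got, want) == 0 ? 0 : 1;
}
```

A caller would treat a return value of 1 as "the write was accepted but the
partition did not reach the requested state" and handle that as an error.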

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type
  2021-07-20 14:18 ` [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
@ 2021-07-27 11:42   ` Frederic Weisbecker
  2021-07-27 15:56     ` Waiman Long
  2021-07-28 16:09   ` Michal Koutný
  1 sibling, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2021-07-27 11:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Marcelo Tosatti, Michal Koutný

On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
> 
> commit 994fb794cb252edd124a46ca0994e37a4726a100
> Author: Waiman Long <longman@redhat.com>
> Date:   Sat, 19 Jun 2021 13:28:19 -0400
> 
>     cgroup/cpuset: Add a new isolated cpus.partition type
> 
>     Cpuset v1 uses the sched_load_balance control file to determine if load
>     balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
>     as its use may require disabling load balancing at cgroup root.
> 
>     For workloads that require very low latency like DPDK, the latency
>     jitters caused by periodic load balancing may exceed the desired
>     latency limit.
> 
>     When cpuset v2 is in use, the only way to avoid this latency cost is to
>     use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
>     the kernel boot, however, there is no way to add or remove CPUs from
>     this isolated set. For workloads that are more dynamic in nature, that
>     means users have to provision enough CPUs for the worst case situation
>     resulting in excess idle CPUs.
> 
>     To address this issue for cpuset v2, a new cpuset.cpus.partition type
>     "isolated" is added which allows the creation of a cpuset partition
>     without load balancing. This will allow system administrators to
>     dynamically adjust the size of isolated partition to the current need
>     of the workload without rebooting the system.
> 
>     Signed-off-by: Waiman Long <longman@redhat.com>
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Nice! And while we are adding a new ABI, can we take advantage of that and
add a specific semantic that if a new isolated partition matches a subset of
"isolcpus=", it automatically maps to it. This means that any further
modification to that isolated partition will also modify the associated
isolcpus= subset.

Or to summarize, when we create a new isolated partition, remove the associated
CPUs from isolcpus= ?

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type
  2021-07-27 11:42   ` Frederic Weisbecker
@ 2021-07-27 15:56     ` Waiman Long
  2021-07-29 11:03       ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2021-07-27 15:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Marcelo Tosatti, Michal Koutný

On 7/27/21 7:42 AM, Frederic Weisbecker wrote:
> On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
>> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
>>
>> commit 994fb794cb252edd124a46ca0994e37a4726a100
>> Author: Waiman Long <longman@redhat.com>
>> Date:   Sat, 19 Jun 2021 13:28:19 -0400
>>
>>      cgroup/cpuset: Add a new isolated cpus.partition type
>>
>>      Cpuset v1 uses the sched_load_balance control file to determine if load
>>      balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
>>      as its use may require disabling load balancing at cgroup root.
>>
>>      For workloads that require very low latency like DPDK, the latency
>>      jitters caused by periodic load balancing may exceed the desired
>>      latency limit.
>>
>>      When cpuset v2 is in use, the only way to avoid this latency cost is to
>>      use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
>>      the kernel boot, however, there is no way to add or remove CPUs from
>>      this isolated set. For workloads that are more dynamic in nature, that
>>      means users have to provision enough CPUs for the worst case situation
>>      resulting in excess idle CPUs.
>>
>>      To address this issue for cpuset v2, a new cpuset.cpus.partition type
>>      "isolated" is added which allows the creation of a cpuset partition
>>      without load balancing. This will allow system administrators to
>>      dynamically adjust the size of isolated partition to the current need
>>      of the workload without rebooting the system.
>>
>>      Signed-off-by: Waiman Long <longman@redhat.com>
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Nice! And while we are adding a new ABI, can we take advantage of that and
> add a specific semantic that if a new isolated partition matches a subset of
> "isolcpus=", it automatically maps to it. This means that any further
> modification to that isolated partition will also modify the associated
> isolcpus= subset.
>
> Or to summarize, when we create a new isolated partition, remove the associated
> CPUs from isolcpus= ?

We can certainly do that as a follow-on. Another idea that I have been
thinking about is to automatically generate an isolated partition under
root to match the given isolcpus parameter when the v2 filesystem is
mounted. That needs more experimentation and testing to verify that it
can work.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 2/9] cgroup/cpuset: Fix a partition bug with hotplug
  2021-07-26 22:59   ` Tejun Heo
@ 2021-07-27 20:16     ` Waiman Long
  0 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2021-07-27 20:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On 7/26/21 6:59 PM, Tejun Heo wrote:
> On Tue, Jul 20, 2021 at 10:18:27AM -0400, Waiman Long wrote:
>> In cpuset_hotplug_workfn(), the detection of whether the cpu list
>> has been changed is done by comparing the effective cpus of the top
>> cpuset with the cpu_active_mask. However, in the rare case that just
>> all the CPUs in the subparts_cpus are offlined, the detection fails
>> and the partition states are not updated correctly. Fix it by forcing
>> the cpus_updated flag to true in this particular case.
>>
>> Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition")
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Applied to cgroup/for-5.15 w/ a minor update to the comment (I dropped
> "just" before "all". It read weird to me.)
>
> Thanks.
>
Thanks for fixing the wording.

Cheers,
Longman



* Re: [PATCH v3 4/9] cgroup/cpuset: Enable event notification when partition become invalid
  2021-07-26 23:14   ` Tejun Heo
@ 2021-07-27 20:26     ` Waiman Long
  2021-07-27 20:46       ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2021-07-27 20:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On 7/26/21 7:14 PM, Tejun Heo wrote:
> On Tue, Jul 20, 2021 at 10:18:29AM -0400, Waiman Long wrote:
>> +static inline void notify_partition_change(struct cpuset *cs,
>> +					   int old_prs, int new_prs)
>> +{
>> +	if ((old_prs == new_prs) ||
>> +	   ((old_prs != PRS_ERROR) && (new_prs != PRS_ERROR)))
>> +		return;
>> +	cgroup_file_notify(&cs->partition_file);
> I'd generate an event on any state changes. The user have to read the file
> to find out what happened anyway.
>
> Thanks.

From my own testing with "inotify_add_watch(fd, file, IN_MODIFY)",
poll() will return with an event whenever a user writes to the
cpuset.cpus.partition control file. I haven't really looked into the
sysfs code yet, but I believe event generation is automatic in this
case. So I don't think I need to explicitly add a cgroup_file_notify()
call when users modify the control file directly. Other, indirect
modifications may cause the partition value to change to/from PRS_ERROR,
and I should have captured all of those changes in this patchset. I will
update the patch to note this point to make it clearer.

Cheers,
Longman
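
For reference, the userspace side of the test described above can be
sketched roughly as follows. This is a minimal sketch under stated
assumptions, not the selftest from patch 9: the helper names
watch_modify() and wait_modify() are made up for illustration, and only
inotify_init1(), inotify_add_watch(), poll() and read() are real calls.

```c
#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Hypothetical helper: start watching 'path' for IN_MODIFY.
 * Returns an inotify fd, or -1 on error.
 */
int watch_modify(const char *path)
{
	int fd = inotify_init1(IN_CLOEXEC);

	if (fd < 0)
		return -1;
	if (inotify_add_watch(fd, path, IN_MODIFY) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Hypothetical helper: wait up to 'timeout_ms' for a modify event.
 * Returns 1 if an event arrived (and drains it), 0 on timeout,
 * -1 on error.
 */
int wait_modify(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	char buf[4096]
		__attribute__((aligned(__alignof__(struct inotify_event))));
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret <= 0)
		return ret;
	return read(fd, buf, sizeof(buf)) > 0 ? 1 : -1;
}
```

A caller would pass the path of the cpuset.cpus.partition file of the
cgroup of interest and re-read that file after each wakeup, since the
event itself does not say what the new state is.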




* Re: [PATCH v3 4/9] cgroup/cpuset: Enable event notification when partition become invalid
  2021-07-27 20:26     ` Waiman Long
@ 2021-07-27 20:46       ` Waiman Long
  0 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2021-07-27 20:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On 7/27/21 4:26 PM, Waiman Long wrote:
> On 7/26/21 7:14 PM, Tejun Heo wrote:
>> On Tue, Jul 20, 2021 at 10:18:29AM -0400, Waiman Long wrote:
>>> +static inline void notify_partition_change(struct cpuset *cs,
>>> +                       int old_prs, int new_prs)
>>> +{
>>> +    if ((old_prs == new_prs) ||
>>> +       ((old_prs != PRS_ERROR) && (new_prs != PRS_ERROR)))
>>> +        return;
>>> +    cgroup_file_notify(&cs->partition_file);
>> I'd generate an event on any state changes. The user have to read the 
>> file
>> to find out what happened anyway.
>>
>> Thanks.
>
> From my own testing with "inotify_add_watch(fd, file, IN_MODIFY)", 
> poll() will return with a event whenever a user write to 
> cpuset.cpus.partition control file. I haven't really look into the 
> sysfs code yet, but I believe event generation will be automatic in 
> this case. So I don't think I need to explicitly add a 
> cgroup_file_notify() when users modify the control file directly. 
> Other indirect modification may cause the partition value to change 
> to/from PRS_ERROR and I should have captured all those changes in this 
> patchset. I will update the patch to note this point to make it more 
> clear. 

After thinking about it a bit more, it is probably not a problem to
call cgroup_file_notify() for every change as this is not in a
performance critical path anyway. I will do some more testing to find
out whether calling cgroup_file_notify() on a regular file write causes
an extra, duplicate event to be sent out. If it does, I will probably
stay with the current patch. Otherwise, I can change it to always call
cgroup_file_notify().

Cheers,
Longman
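
The two notification policies under discussion can be compared side by
side in a small standalone sketch. This is illustrative only: the
numeric PRS_* values and the PRS_DISABLED name are assumptions made for
the example, and notify_v3() / notify_any_change() are hypothetical
names; only the condition in notify_v3() is taken from the quoted hunk.

```c
#include <stdbool.h>

/* Assumed values for illustration only. */
enum {
	PRS_ERROR	= -1,
	PRS_DISABLED	= 0,
	PRS_ENABLED	= 1,
	PRS_ISOLATED	= 2,
};

/* v3 patch behaviour: notify only on a real change to/from PRS_ERROR. */
static bool notify_v3(int old_prs, int new_prs)
{
	if ((old_prs == new_prs) ||
	    ((old_prs != PRS_ERROR) && (new_prs != PRS_ERROR)))
		return false;
	return true;
}

/* Suggested alternative: notify on any state change. */
static bool notify_any_change(int old_prs, int new_prs)
{
	return old_prs != new_prs;
}
```

The two predicates differ only for transitions between valid states,
for example PRS_ENABLED to PRS_ISOLATED, where the v3 condition stays
silent and the alternative generates an event.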



* Re: [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-07-26 23:17 ` [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Tejun Heo
@ 2021-07-27 21:14   ` Waiman Long
  2021-08-09 22:46     ` Tejun Heo
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2021-07-27 21:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On 7/26/21 7:17 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Jul 20, 2021 at 10:18:25AM -0400, Waiman Long wrote:
>> v3:
>>   - Add two new patches (patches 2 & 3) to fix bugs found during the
>>     testing process.
>>   - Add a new patch to enable inotify event notification when partition
>>     become invalid.
>>   - Add a test to test event notification when partition become invalid.
> I applied parts of the series. I think there was a bit of miscommunication.
> I meant that we should use the invalid state as the only way to indicate
> errors as long as the error state is something which can be reached through
> hot unplug or other uncontrollable changes, and require users to monitor the
> state transitions for confirmation and error handling.

Yes, that is the point of adding the event notification patch.

In the current code, direct writes to cpuset.cpus.partition are strictly
controlled and invalid transitions are rejected. However, changes to
cpuset.cpus that do not break the cpu exclusivity rule, or cpu hotplug,
may cause a partition to change to invalid. What is currently done in
this patchset is to add extra guards to reject those cpuset.cpus changes
that cause the partition to become invalid, since changes that break the
cpu exclusivity rule will be rejected anyway. I can leave out those extra
guards and allow those invalid cpuset.cpus changes to go forward and
change the partition to invalid instead, if this is what you want.

However, if we have a complicated partition setup with multiple child
partitions, an invalid cpuset.cpus change in a parent partition will
cause all the child partitions to become invalid too. That is the
scenario that I don't want to happen inadvertently. Alternatively, we
can restrict those invalid changes if a child partition exists, and let
the change pass through and make the partition invalid if it is a
standalone partition.

Please let me know which approach you want me to take.

Cheers,
Longman





* Re: [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type
  2021-07-20 14:18 ` [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
  2021-07-27 11:42   ` Frederic Weisbecker
@ 2021-07-28 16:09   ` Michal Koutný
  2021-07-28 16:27     ` Waiman Long
  1 sibling, 1 reply; 27+ messages in thread
From: Michal Koutný @ 2021-07-28 16:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti


Hello Waiman.

On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long <longman@redhat.com> wrote:
> @@ -2026,6 +2036,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> [...]
> +	} else if (old_prs && new_prs) {

If an isolated root partition becomes invalid (new_prs == PRS_ERROR)...

> +		/*
> +		 * A change in load balance state only, no change in cpumasks.
> +		 */
> +		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));

...this seems to erase information about CS_SCHED_LOAD_BALANCE zeroness.

IOW, if there's an isolated partition that becomes invalid and later
valid again (a cpu is (re)added), it will be a normal root partition
without the requested isolation, which is IMO undesired.

I may have overlooked something in broader context but it seems to me
the invalidity should be saved independently of the root/isolated type.

Regards,
Michal




* Re: [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type
  2021-07-28 16:09   ` Michal Koutný
@ 2021-07-28 16:27     ` Waiman Long
  2021-07-28 17:25       ` Michal Koutný
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2021-07-28 16:27 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti

On 7/28/21 12:09 PM, Michal Koutný wrote:
> Hello Waiman.
>
> On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long <longman@redhat.com> wrote:
>> @@ -2026,6 +2036,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
>> [...]
>> +	} else if (old_prs && new_prs) {
> If an isolated root partition becomes invalid (new_prs == PRS_ERROR)...
>
>> +		/*
>> +		 * A change in load balance state only, no change in cpumasks.
>> +		 */
>> +		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
> ...this seems to erase information about CS_SCHED_LOAD_BALANCE zeroness.
>
> IOW, if there's an isolated partition that becomes invalid and later
> valid again (a cpu is (re)added), it will be a normal root partition
> without the requested isolation, which is IMO undesired.
>
> I may have overlooked something in broader context but it seems to me
> the invalidity should be saved independently of the root/isolated type.

PRS_ERROR cannot be passed to update_prstate(). For this patchset, 
PRS_ERROR can only be set by changes in hotplug. The current design will 
maintain the set flag (CS_SCHED_LOAD_BALANCE) and use it to decide to 
switch back to PRS_ENABLED or PRS_ISOLATED when the cpus are available 
again.

Cheers,
Longman
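
The behaviour described above can be modelled in a few lines. This is a
toy model, not the kernel code: struct model_cpuset and the model_*()
helpers are hypothetical, the numeric PRS_* values and the PRS_DISABLED
name are assumed, and the boolean stands in for CS_SCHED_LOAD_BALANCE.

```c
#include <stdbool.h>

/* Assumed values for illustration only. */
enum {
	PRS_ERROR	= -1,
	PRS_DISABLED	= 0,
	PRS_ENABLED	= 1,
	PRS_ISOLATED	= 2,
};

struct model_cpuset {
	int  prs;
	bool load_balance;	/* stands in for CS_SCHED_LOAD_BALANCE */
};

/* User-requested change: only PRS_ISOLATED turns load balancing off. */
static void model_set_prs(struct model_cpuset *cs, int new_prs)
{
	cs->load_balance = (new_prs != PRS_ISOLATED);
	cs->prs = new_prs;
}

/* Hotplug removed the last CPU: the flag is deliberately left alone. */
static void model_cpus_lost(struct model_cpuset *cs)
{
	if (cs->prs > 0)
		cs->prs = PRS_ERROR;
}

/* CPUs are back: the preserved flag picks the state to return to. */
static void model_cpus_back(struct model_cpuset *cs)
{
	if (cs->prs == PRS_ERROR)
		cs->prs = cs->load_balance ? PRS_ENABLED : PRS_ISOLATED;
}
```

In this model an isolated partition that goes through PRS_ERROR comes
back as PRS_ISOLATED, because the invalid state never touches the flag.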



* Re: [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type
  2021-07-28 16:27     ` Waiman Long
@ 2021-07-28 17:25       ` Michal Koutný
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Koutný @ 2021-07-28 17:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, Marcelo Tosatti


On Wed, Jul 28, 2021 at 12:27:58PM -0400, Waiman Long <llong@redhat.com> wrote:
> PRS_ERROR cannot be passed to update_prstate(). For this patchset, PRS_ERROR
> can only be set by changes in hotplug. The current design will maintain the
> set flag (CS_SCHED_LOAD_BALANCE) and use it to decide to switch back to
> PRS_ENABLED or PRS_ISOLATED when the cpus are available again.

I see it now, thanks. (I still find it a bit weird that the "isolated"
partition will be shown as "root invalid" when it's lacking cpus
(instead of "isolated invalid" and returning to "isolated") but I can
understand the approach of having just one "root invalid" for all.)

This patch can have
Reviewed-by: Michal Koutný <mkoutny@suse.com>




* Re: [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type
  2021-07-27 15:56     ` Waiman Long
@ 2021-07-29 11:03       ` Frederic Weisbecker
  0 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2021-07-29 11:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest,
	Andrew Morton, Roman Gushchin, Phil Auld, Peter Zijlstra,
	Juri Lelli, Marcelo Tosatti, Michal Koutný

On Tue, Jul 27, 2021 at 11:56:25AM -0400, Waiman Long wrote:
> On 7/27/21 7:42 AM, Frederic Weisbecker wrote:
> > On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
> > > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
> > > 
> > > commit 994fb794cb252edd124a46ca0994e37a4726a100
> > > Author: Waiman Long <longman@redhat.com>
> > > Date:   Sat, 19 Jun 2021 13:28:19 -0400
> > > 
> > >      cgroup/cpuset: Add a new isolated cpus.partition type
> > > 
> > >      Cpuset v1 uses the sched_load_balance control file to determine if load
> > >      balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
> > >      as its use may require disabling load balancing at cgroup root.
> > > 
> > >      For workloads that require very low latency like DPDK, the latency
> > >      jitters caused by periodic load balancing may exceed the desired
> > >      latency limit.
> > > 
> > >      When cpuset v2 is in use, the only way to avoid this latency cost is to
> > >      use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
> > >      the kernel boot, however, there is no way to add or remove CPUs from
> > >      this isolated set. For workloads that are more dynamic in nature, that
> > >      means users have to provision enough CPUs for the worst case situation
> > >      resulting in excess idle CPUs.
> > > 
> > >      To address this issue for cpuset v2, a new cpuset.cpus.partition type
> > >      "isolated" is added which allows the creation of a cpuset partition
> > >      without load balancing. This will allow system administrators to
> > >      dynamically adjust the size of isolated partition to the current need
> > >      of the workload without rebooting the system.
> > > 
> > >      Signed-off-by: Waiman Long <longman@redhat.com>
> > > 
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > Nice! And while we are adding a new ABI, can we take advantage of that and
> > add a specific semantic that if a new isolated partition matches a subset of
> > "isolcpus=", it automatically maps to it. This means that any further
> > modification to that isolated partition will also modify the associated
> > isolcpus= subset.
> > 
> > Or to summarize, when we create a new isolated partition, remove the associated
> > CPUs from isolcpus= ?
> 
> We can certainly do that as a follow-on.

I'm just concerned that this feature gets merged before we add that new
isolcpus= implicit mapping, which technically is a new ABI. Well I guess I
should hurry up and try to propose a patchset quickly once I'm back from
vacation :-)



> Another idea that I have been
> thinking about is to automatically generating a isolated partition under
> root to match the given isolcpus parameter when the v2 filesystem is
> mounted. That needs more experimentation and testing to verify that it can
> work.

I thought about that too, mounting an "isolcpus" subdirectory within the top
cpuset but I was worried it could break userspace that wouldn't expect that new
thing to show up.

Thanks.


* Re: [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-07-27 21:14   ` Waiman Long
@ 2021-08-09 22:46     ` Tejun Heo
  2021-08-10  1:12       ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2021-08-09 22:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

Hello, Waiman. Sorry about the delay. Was off for a while.

On Tue, Jul 27, 2021 at 05:14:27PM -0400, Waiman Long wrote:
> However, if we have a complicated partition setup with multiple child
> partitions. Invalid cpuset.cpus change in a parent partition will cause all
> the child partitions to become invalid too. That is the scenario that I
> don't want to happen inadvertently. Alternatively, we can restrict those

I don't think there's anything fundamentally wrong with it given the
requirement that userland has to monitor invalid state transitions.
The same mass transition can happen through cpu hotplug operations,
right?

> invalid changes if a child partition exist and let it pass through and make
> it invalid if it is a standalone partition.
> 
> Please let me know which approach do you want me to take.

I think it'd be best if we can stick to some principles rather than
trying to adjust it for specific scenarios. e.g.:

* If a given state can be reached through cpu hot [un]plug, any
  configuration attempt which reaches the same state should be allowed
  with the same end result as cpu hot [un]plug.

* If a given state can't ever be reached in whichever way, the
  configuration attempting to reach such state should be rejected.

Thanks.

-- 
tejun
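
The two principles above can be read as a simple decision rule for a
cpuset.cpus write on a partition root. The sketch below is a heavily
simplified illustration under stated assumptions, not kernel code:
classify_cpus_change() is a hypothetical name, CPUs are plain bitmasks,
"invalid" is modelled as "no requested CPU is online", and the only
unreachable state considered is an overlap with a sibling's exclusive
CPUs.

```c
enum outcome {
	ACCEPT_VALID,	/* change applied, partition stays valid */
	ACCEPT_INVALID,	/* change applied, partition becomes invalid */
	REJECT,		/* write fails, nothing changes */
};

/*
 * Hypothetical classifier for a requested cpuset.cpus value.
 * All arguments are CPU bitmasks (bit n set == CPU n).
 */
static enum outcome classify_cpus_change(unsigned long new_cpus,
					 unsigned long sibling_cpus,
					 unsigned long online_cpus)
{
	/* Hot [un]plug can never create an overlap, so reject it. */
	if (new_cpus & sibling_cpus)
		return REJECT;

	/* Same end state as all of these CPUs going offline. */
	if (!(new_cpus & online_cpus))
		return ACCEPT_INVALID;

	return ACCEPT_VALID;
}
```

With this rule the write path and the hotplug path converge on the same
end states, which leaves userspace with a single thing to monitor: the
transition to invalid.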


* Re: [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
  2021-08-09 22:46     ` Tejun Heo
@ 2021-08-10  1:12       ` Waiman Long
  0 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2021-08-10  1:12 UTC (permalink / raw)
  To: Tejun Heo, Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
	linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
	Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Marcelo Tosatti, Michal Koutný

On 8/9/21 6:46 PM, Tejun Heo wrote:
> Hello, Waiman. Sorry about the delay. Was off for a while.
>
> On Tue, Jul 27, 2021 at 05:14:27PM -0400, Waiman Long wrote:
>> However, if we have a complicated partition setup with multiple child
>> partitions. Invalid cpuset.cpus change in a parent partition will cause all
>> the child partitions to become invalid too. That is the scenario that I
>> don't want to happen inadvertently. Alternatively, we can restrict those
> I don't think there's anything fundamentally wrong with it given the
> requirement that userland has to monitor invalid state transitions.
> The same mass transition can happen through cpu hotplug operations,
> right?
>
>> invalid changes if a child partition exist and let it pass through and make
>> it invalid if it is a standalone partition.
>>
>> Please let me know which approach do you want me to take.
> I think it'd be best if we can stick to some principles rather than
> trying to adjust it for specific scenarios. e.g.:
>
> * If a given state can be reached through cpu hot [un]plug, any
>    configuration attempt which reaches the same state should be allowed
>    with the same end result as cpu hot [un]plug.
>
> * If a given state can't ever be reached in whichever way, the
>    configuration attempting to reach such state should be rejected.

OK, I got it. I will make the necessary changes and submit a new patch 
series.

Thanks,
Longman



end of thread, other threads:[~2021-08-10  1:12 UTC | newest]

Thread overview: 27+ messages
2021-07-20 14:18 [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Waiman Long
2021-07-20 14:18 ` [PATCH v3 1/9] cgroup/cpuset: Miscellaneous code cleanup Waiman Long
2021-07-26 22:56   ` Tejun Heo
2021-07-20 14:18 ` [PATCH v3 2/9] cgroup/cpuset: Fix a partition bug with hotplug Waiman Long
2021-07-26 22:59   ` Tejun Heo
2021-07-27 20:16     ` Waiman Long
2021-07-20 14:18 ` [PATCH v3 3/9] cgroup/cpuset: Fix violation of cpuset locking rule Waiman Long
2021-07-26 23:10   ` Tejun Heo
2021-07-20 14:18 ` [PATCH v3 4/9] cgroup/cpuset: Enable event notification when partition become invalid Waiman Long
2021-07-26 23:14   ` Tejun Heo
2021-07-27 20:26     ` Waiman Long
2021-07-27 20:46       ` Waiman Long
2021-07-20 14:18 ` [PATCH v3 5/9] cgroup/cpuset: Clarify the use of invalid partition root Waiman Long
2021-07-20 14:18 ` [PATCH v3 6/9] cgroup/cpuset: Add a new isolated cpus.partition type Waiman Long
2021-07-27 11:42   ` Frederic Weisbecker
2021-07-27 15:56     ` Waiman Long
2021-07-29 11:03       ` Frederic Weisbecker
2021-07-28 16:09   ` Michal Koutný
2021-07-28 16:27     ` Waiman Long
2021-07-28 17:25       ` Michal Koutný
2021-07-20 14:18 ` [PATCH v3 7/9] cgroup/cpuset: Allow non-top parent partition root to distribute out all CPUs Waiman Long
2021-07-20 14:18 ` [PATCH v3 8/9] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Waiman Long
2021-07-20 14:18 ` [PATCH v3 9/9] kselftest/cgroup: Add cpuset v2 partition root state test Waiman Long
2021-07-26 23:17 ` [PATCH v3 0/9] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus Tejun Heo
2021-07-27 21:14   ` Waiman Long
2021-08-09 22:46     ` Tejun Heo
2021-08-10  1:12       ` Waiman Long
