All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH-cgroup 0/4] cgroup/cpuset: Improve CPU isolation in isolated partitions
@ 2023-10-13 18:11 Waiman Long
  2023-10-13 18:11 ` [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Waiman Long @ 2023-10-13 18:11 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Lai Jiangshan, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-kselftest, Waiman Long

Isolated cpuset partition can currently be created to contain an
exclusive set of CPUs not used in other cgroups and with load balancing
disabled to reduce interference from the scheduler.

The main purpose of this isolated partition type is to dynamically
emulate what can be done via the "isolcpus" boot command line option,
specifically the default domain flag. One effect of the "isolcpus" option
is to remove the isolated CPUs from the cpumasks of unbound workqueues
since running work functions in an isolated CPU can be a major source
of interference. Changing the unbound workqueue cpumasks can be done at
run time by writing an appropriate cpumask without the isolated CPUs to
/sys/devices/virtual/workqueue/cpumask. So one can set up an isolated
cpuset partition and then write to the cpumask sysfs file to achieve
similar level of CPU isolation. However, this manual process can be
error prone.

This patch series implements automatic exclusion of isolated CPUs from
unbound workqueue cpumasks when an isolated cpuset partition is created
and then adds those CPUs back when the isolated partition is destroyed.

There are also other places in the kernel that look at the HK_FLAG_DOMAIN
cpumask or other HK_FLAG_* cpumasks and exclude the isolated CPUs from
certain actions to further reduce interference. CPUs in an isolated
cpuset partition will not be able to avoid those interferences yet. That
may change in the future as the need arises.

Waiman Long (4):
  workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs
    from wq_unbound_cpumask
  selftests/cgroup: Minor code cleanup and reorganization of
    test_cpuset_prs.sh
  cgroup/cpuset: Keep track of CPUs in isolated partitions
  cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask

 Documentation/admin-guide/cgroup-v2.rst       |  10 +-
 include/linux/workqueue.h                     |   2 +-
 kernel/cgroup/cpuset.c                        | 237 +++++++++++++-----
 kernel/workqueue.c                            |  42 +++-
 .../selftests/cgroup/test_cpuset_prs.sh       | 209 +++++++++------
 5 files changed, 350 insertions(+), 150 deletions(-)

-- 
2.39.3


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
  2023-10-13 18:11 [PATCH-cgroup 0/4] cgroup/cpuset: Improve CPU isolation in isolated partitions Waiman Long
@ 2023-10-13 18:11 ` Waiman Long
  2023-10-18  9:24   ` Tejun Heo
  2023-10-13 18:11 ` [PATCH-cgroup 2/4] selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2023-10-13 18:11 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Lai Jiangshan, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-kselftest, Waiman Long

When the "isolcpus" boot command line option is used to add a set
of isolated CPUs, those CPUs will be excluded automatically from
wq_unbound_cpumask to avoid running work functions from unbound
workqueues.

Recently cpuset has been extended to allow the creation of partitions
of isolated CPUs dynamically. To make it closer to the "isolcpus"
in functionality, the CPUs in those isolated cpuset partitions should be
excluded from wq_unbound_cpumask as well. This can be done currently by
explicitly writing to the workqueue's cpumask sysfs file after creating
the isolated partitions. However, this process can be error prone.
Ideally, the cpuset code should be allowed to request the workqueue code
to exclude those isolated CPUs from wq_unbound_cpumask so that this
operation can be done automatically and the isolated CPUs will be returned
back to wq_unbound_cpumask after the destructions of the isolated
cpuset partitions.

This patch adds a new workqueue_unbound_exclude_cpumask() to enable
that. This new function will exclude the specified isolated CPUs
from wq_unbound_cpumask. To be able to restore those isolated CPUs
back after the destruction of isolated cpuset partitions, a new
wq_user_unbound_cpumask is added to store the user provided unbound
cpumask either from the boot command line options or from writing to
the cpumask sysfs file. This new cpumask provides the basis for CPU
exclusion.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/workqueue.h |  2 +-
 kernel/workqueue.c        | 42 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 1c1d06804d45..a936460ccc7e 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -483,7 +483,7 @@ struct workqueue_attrs *alloc_workqueue_attrs(void);
 void free_workqueue_attrs(struct workqueue_attrs *attrs);
 int apply_workqueue_attrs(struct workqueue_struct *wq,
 			  const struct workqueue_attrs *attrs);
-int workqueue_set_unbound_cpumask(cpumask_var_t cpumask);
+extern int workqueue_unbound_exclude_cpumask(cpumask_var_t cpumask);
 
 extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
 			struct work_struct *work);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 010b674b02a7..19d403aa41b0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -381,6 +381,9 @@ static bool workqueue_freezing;		/* PL: have wqs started freezing? */
 /* PL&A: allowable cpus for unbound wqs and work items */
 static cpumask_var_t wq_unbound_cpumask;
 
+/* PL: user-provided unbound cpumask via sysfs */
+static cpumask_var_t wq_user_unbound_cpumask;
+
 /* for further constrain wq_unbound_cpumask by cmdline parameter*/
 static struct cpumask wq_cmdline_cpumask __initdata;
 
@@ -5825,7 +5828,7 @@ static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask)
  *  		-EINVAL	- Invalid @cpumask
  *  		-ENOMEM	- Failed to allocate memory for attrs or pwqs.
  */
-int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
+static int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
 {
 	int ret = -EINVAL;
 
@@ -5836,6 +5839,7 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
 	cpumask_and(cpumask, cpumask, cpu_possible_mask);
 	if (!cpumask_empty(cpumask)) {
 		apply_wqattrs_lock();
+		cpumask_copy(wq_user_unbound_cpumask, cpumask);
 		if (cpumask_equal(cpumask, wq_unbound_cpumask)) {
 			ret = 0;
 			goto out_unlock;
@@ -5850,6 +5854,40 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
 	return ret;
 }
 
+/**
+ * workqueue_unbound_exclude_cpumask - Exclude given CPUs from unbound cpumask
+ * @exclude_cpumask: the cpumask to be excluded from wq_unbound_cpumask
+ *
+ * This function can be called from cpuset code to provide a set of isolated
+ * CPUs that should be excluded from wq_unbound_cpumask.
+ */
+int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
+{
+	cpumask_var_t cpumask;
+	int ret = 0;
+
+	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
+		return -ENOMEM;
+
+	/*
+	 * The caller of this function may have called cpus_read_lock(),
+	 * use cpus_read_trylock() to avoid potential deadlock.
+	 */
+	if (!cpus_read_trylock())
+		return -EBUSY;
+	mutex_lock(&wq_pool_mutex);
+
+	if (!cpumask_andnot(cpumask, wq_user_unbound_cpumask, exclude_cpumask))
+		ret = -EINVAL;	/* The new cpumask can't be empty */
+	else if (!cpumask_equal(cpumask, wq_unbound_cpumask))
+		ret = workqueue_apply_unbound_cpumask(cpumask);
+
+	mutex_unlock(&wq_pool_mutex);
+	cpus_read_unlock();
+	free_cpumask_var(cpumask);
+	return ret;
+}
+
 static int parse_affn_scope(const char *val)
 {
 	int i;
@@ -6520,11 +6558,13 @@ void __init workqueue_init_early(void)
 	BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
 
 	BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
+	BUG_ON(!alloc_cpumask_var(&wq_user_unbound_cpumask, GFP_KERNEL));
 	cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_WQ));
 	cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_DOMAIN));
 
 	if (!cpumask_empty(&wq_cmdline_cpumask))
 		cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, &wq_cmdline_cpumask);
+	cpumask_copy(wq_user_unbound_cpumask, wq_unbound_cpumask);
 
 	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH-cgroup 2/4] selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh
  2023-10-13 18:11 [PATCH-cgroup 0/4] cgroup/cpuset: Improve CPU isolation in isolated partitions Waiman Long
  2023-10-13 18:11 ` [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask Waiman Long
@ 2023-10-13 18:11 ` Waiman Long
  2023-10-13 18:11 ` [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions Waiman Long
  2023-10-13 18:11 ` [PATCH-cgroup 4/4] cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask Waiman Long
  3 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2023-10-13 18:11 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Lai Jiangshan, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-kselftest, Waiman Long

Minor cleanup of test matrix and relocation of test_isolated() function
to prepare for the next patch. There is no functional change.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 .../selftests/cgroup/test_cpuset_prs.sh       | 142 +++++++++---------
 1 file changed, 71 insertions(+), 71 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index a6e9848189d6..2b825019f806 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -146,71 +146,6 @@ test_add_proc()
 	echo $$ > $CGROUP2/cgroup.procs	# Move out the task
 }
 
-#
-# Testing the new "isolated" partition root type
-#
-test_isolated()
-{
-	cd $CGROUP2/test
-	echo 2-3 > cpuset.cpus
-	TYPE=$(cat cpuset.cpus.partition)
-	[[ $TYPE = member ]] || echo member > cpuset.cpus.partition
-
-	console_msg "Change from member to root"
-	test_partition root
-
-	console_msg "Change from root to isolated"
-	test_partition isolated
-
-	console_msg "Change from isolated to member"
-	test_partition member
-
-	console_msg "Change from member to isolated"
-	test_partition isolated
-
-	console_msg "Change from isolated to root"
-	test_partition root
-
-	console_msg "Change from root to member"
-	test_partition member
-
-	#
-	# Testing partition root with no cpu
-	#
-	console_msg "Distribute all cpus to child partition"
-	echo +cpuset > cgroup.subtree_control
-	test_partition root
-
-	mkdir A1
-	cd A1
-	echo 2-3 > cpuset.cpus
-	test_partition root
-	test_effective_cpus 2-3
-	cd ..
-	test_effective_cpus ""
-
-	console_msg "Moving task to partition test"
-	test_add_proc "No space left"
-	cd A1
-	test_add_proc ""
-	cd ..
-
-	console_msg "Shrink and expand child partition"
-	cd A1
-	echo 2 > cpuset.cpus
-	cd ..
-	test_effective_cpus 3
-	cd A1
-	echo 2-3 > cpuset.cpus
-	cd ..
-	test_effective_cpus ""
-
-	# Cleaning up
-	console_msg "Cleaning up"
-	echo $$ > $CGROUP2/cgroup.procs
-	[[ -d A1 ]] && rmdir A1
-}
-
 #
 # Cpuset controller state transition test matrix.
 #
@@ -304,7 +239,7 @@ TEST_MATRIX=(
 								       A1:P0,A2:P2,A3:P1 2-4"
 	" C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \
 				   .      .      X5      .      .    0 A1:0-4,A2:1-4,A3:2-4 \
-								       A1:P0,A2:P-2,A3:P-1 ."
+								       A1:P0,A2:P-2,A3:P-1"
 	" C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \
 				   .      .      .      X1      .    0 A1:0-1,A2:2-4,A3:2-4 \
 								       A1:P0,A2:P2,A3:P-1 2-4"
@@ -347,10 +282,10 @@ TEST_MATRIX=(
 	# cpus_allowed/exclusive_cpus update tests
 	" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
 				   .     C4      .      P2     .     0 A1:4,A2:4,XA2:,XA3:,A3:4 \
-								       A1:P0,A3:P-2 ."
+								       A1:P0,A3:P-2"
 	" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
 				   .     X1      .      P2     .     0 A1:0-3,A2:1-3,XA1:1,XA2:,XA3:,A3:2-3 \
-								       A1:P0,A3:P-2 ."
+								       A1:P0,A3:P-2"
 	" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
 				   .      .     C3      P2     .     0 A1:0-2,A2:0-2,XA2:3,XA3:3,A3:3 \
 								       A1:P0,A3:P2 3"
@@ -359,13 +294,13 @@ TEST_MATRIX=(
 								       A1:P0,A3:P2 3"
 	" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
 				   .      .     X3      .      .     0 A1:0-3,A2:1-3,XA2:3,XA3:3,A3:2-3 \
-								       A1:P0,A3:P-2 ."
+								       A1:P0,A3:P-2"
 	" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
 				   .      .     C3      .      .     0 A1:0-3,A2:3,XA2:3,XA3:3,A3:3 \
-								       A1:P0,A3:P-2 ."
+								       A1:P0,A3:P-2"
 	" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
 				   .     C4      .      .      .     0 A1:4,A2:4,A3:4,XA1:,XA2:,XA3 \
-								       A1:P0,A3:P-2 ."
+								       A1:P0,A3:P-2"
 
 	#  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
 	#  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
@@ -804,6 +739,71 @@ run_state_test()
 	echo "All $I tests of $TEST PASSED."
 }
 
+#
+# Testing the new "isolated" partition root type
+#
+test_isolated()
+{
+	cd $CGROUP2/test
+	echo 2-3 > cpuset.cpus
+	TYPE=$(cat cpuset.cpus.partition)
+	[[ $TYPE = member ]] || echo member > cpuset.cpus.partition
+
+	console_msg "Change from member to root"
+	test_partition root
+
+	console_msg "Change from root to isolated"
+	test_partition isolated
+
+	console_msg "Change from isolated to member"
+	test_partition member
+
+	console_msg "Change from member to isolated"
+	test_partition isolated
+
+	console_msg "Change from isolated to root"
+	test_partition root
+
+	console_msg "Change from root to member"
+	test_partition member
+
+	#
+	# Testing partition root with no cpu
+	#
+	console_msg "Distribute all cpus to child partition"
+	echo +cpuset > cgroup.subtree_control
+	test_partition root
+
+	mkdir A1
+	cd A1
+	echo 2-3 > cpuset.cpus
+	test_partition root
+	test_effective_cpus 2-3
+	cd ..
+	test_effective_cpus ""
+
+	console_msg "Moving task to partition test"
+	test_add_proc "No space left"
+	cd A1
+	test_add_proc ""
+	cd ..
+
+	console_msg "Shrink and expand child partition"
+	cd A1
+	echo 2 > cpuset.cpus
+	cd ..
+	test_effective_cpus 3
+	cd A1
+	echo 2-3 > cpuset.cpus
+	cd ..
+	test_effective_cpus ""
+
+	# Cleaning up
+	console_msg "Cleaning up"
+	echo $$ > $CGROUP2/cgroup.procs
+	[[ -d A1 ]] && rmdir A1
+}
+
 #
 # Wait for inotify event for the given file and read it
 # $1: cgroup file to wait for
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions
  2023-10-13 18:11 [PATCH-cgroup 0/4] cgroup/cpuset: Improve CPU isolation in isolated partitions Waiman Long
  2023-10-13 18:11 ` [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask Waiman Long
  2023-10-13 18:11 ` [PATCH-cgroup 2/4] selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh Waiman Long
@ 2023-10-13 18:11 ` Waiman Long
  2023-10-18  9:26   ` Tejun Heo
  2023-10-13 18:11 ` [PATCH-cgroup 4/4] cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask Waiman Long
  3 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2023-10-13 18:11 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Lai Jiangshan, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-kselftest, Waiman Long

Add a new internal isolated_cpus mask to keep track of the CPUs that
are in isolated partitions. Expose that new cpumask as a new root-only
control file ".__DEBUG__.cpuset.cpus.isolated" when cgroup_debug command
line option is specified.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 190 +++++++++++++++++++++++++++--------------
 1 file changed, 127 insertions(+), 63 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 615daaf87f1f..19c8779798fd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -204,6 +204,11 @@ struct cpuset {
  */
 static cpumask_var_t	subpartitions_cpus;
 
+/*
+ * Exclusive CPUs in isolated partitions
+ */
+static cpumask_var_t	isolated_cpus;
+
 /* List of remote partition root children */
 static struct list_head remote_children;
 
@@ -1317,6 +1322,7 @@ static void compute_effective_cpumask(struct cpumask *new_cpus,
  */
 enum partition_cmd {
 	partcmd_enable,		/* Enable partition root	  */
+	partcmd_enablei,	/* Enable isolated partition root */
 	partcmd_disable,	/* Disable partition root	  */
 	partcmd_update,		/* Update parent's effective_cpus */
 	partcmd_invalidate,	/* Make partition invalid	  */
@@ -1418,6 +1424,74 @@ static void reset_partition_data(struct cpuset *cs)
 	}
 }
 
+/*
+ * partition_xcpus_newstate - Exclusive CPUs state change
+ * @old_prs: old partition_root_state
+ * @new_prs: new partition_root_state
+ * @xcpus: exclusive CPUs with state change
+ */
+static void partition_xcpus_newstate(int old_prs, int new_prs, struct cpumask *xcpus)
+{
+	WARN_ON_ONCE(old_prs == new_prs);
+	if (new_prs == PRS_ISOLATED)
+		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
+	else
+		cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
+}
+
+/*
+ * partition_xcpus_add - Add new exclusive CPUs to partition
+ * @new_prs: new partition_root_state
+ * @parent: parent cpuset
+ * @xcpus: exclusive CPUs to be added
+ *
+ * Remote partition if parent == NULL
+ */
+static void partition_xcpus_add(int new_prs, struct cpuset *parent,
+				struct cpumask *xcpus)
+{
+	WARN_ON_ONCE(new_prs < 0);
+	lockdep_assert_held(&callback_lock);
+	if (!parent)
+		parent = &top_cpuset;
+
+	if (parent == &top_cpuset)
+		cpumask_or(subpartitions_cpus, subpartitions_cpus, xcpus);
+
+	if (new_prs != parent->partition_root_state)
+		partition_xcpus_newstate(parent->partition_root_state, new_prs,
+					 xcpus);
+
+	cpumask_andnot(parent->effective_cpus, parent->effective_cpus, xcpus);
+}
+
+/*
+ * partition_xcpus_del - Remove exclusive CPUs from partition
+ * @old_prs: old partition_root_state
+ * @parent: parent cpuset
+ * @xcpus: exclusive CPUs to be removed
+ *
+ * Remote partition if parent == NULL
+ */
+static void partition_xcpus_del(int old_prs, struct cpuset *parent,
+				struct cpumask *xcpus)
+{
+	WARN_ON_ONCE(old_prs < 0);
+	lockdep_assert_held(&callback_lock);
+	if (!parent)
+		parent = &top_cpuset;
+
+	if (parent == &top_cpuset)
+		cpumask_andnot(subpartitions_cpus, subpartitions_cpus, xcpus);
+
+	if (old_prs != parent->partition_root_state)
+		partition_xcpus_newstate(old_prs, parent->partition_root_state,
+					 xcpus);
+
+	cpumask_and(xcpus, xcpus, cpu_active_mask);
+	cpumask_or(parent->effective_cpus, parent->effective_cpus, xcpus);
+}
+
 /*
  * compute_effective_exclusive_cpumask - compute effective exclusive CPUs
  * @cs: cpuset
@@ -1456,13 +1530,15 @@ static inline bool is_local_partition(struct cpuset *cs)
 /*
  * remote_partition_enable - Enable current cpuset as a remote partition root
  * @cs: the cpuset to update
+ * @new_prs: new partition_root_state
  * @tmp: temparary masks
  * Return: 1 if successful, 0 if error
  *
  * Enable the current cpuset to become a remote partition root taking CPUs
  * directly from the top cpuset. cpuset_mutex must be held by the caller.
  */
-static int remote_partition_enable(struct cpuset *cs, struct tmpmasks *tmp)
+static int remote_partition_enable(struct cpuset *cs, int new_prs,
+				   struct tmpmasks *tmp)
 {
 	/*
 	 * The user must have sysadmin privilege.
@@ -1485,18 +1561,14 @@ static int remote_partition_enable(struct cpuset *cs, struct tmpmasks *tmp)
 		return 0;
 
 	spin_lock_irq(&callback_lock);
-	cpumask_andnot(top_cpuset.effective_cpus,
-		       top_cpuset.effective_cpus, tmp->new_cpus);
-	cpumask_or(subpartitions_cpus,
-		   subpartitions_cpus, tmp->new_cpus);
-
+	partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
+	list_add(&cs->remote_sibling, &remote_children);
 	if (cs->use_parent_ecpus) {
 		struct cpuset *parent = parent_cs(cs);
 
 		cs->use_parent_ecpus = false;
 		parent->child_ecpus_count--;
 	}
-	list_add(&cs->remote_sibling, &remote_children);
 	spin_unlock_irq(&callback_lock);
 
 	/*
@@ -1524,13 +1596,8 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 	WARN_ON_ONCE(!cpumask_subset(tmp->new_cpus, subpartitions_cpus));
 
 	spin_lock_irq(&callback_lock);
-	cpumask_andnot(subpartitions_cpus,
-		       subpartitions_cpus, tmp->new_cpus);
-	cpumask_and(tmp->new_cpus,
-		    tmp->new_cpus, cpu_active_mask);
-	cpumask_or(top_cpuset.effective_cpus,
-		   top_cpuset.effective_cpus, tmp->new_cpus);
 	list_del_init(&cs->remote_sibling);
+	partition_xcpus_del(cs->partition_root_state, NULL, tmp->new_cpus);
 	cs->partition_root_state = -cs->partition_root_state;
 	if (!cs->prs_err)
 		cs->prs_err = PERR_INVCPUS;
@@ -1557,6 +1624,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *newmask,
 			       struct tmpmasks *tmp)
 {
 	bool adding, deleting;
+	int prs = cs->partition_root_state;
 
 	if (WARN_ON_ONCE(!is_remote_partition(cs)))
 		return;
@@ -1580,20 +1648,10 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *newmask,
 		goto invalidate;
 
 	spin_lock_irq(&callback_lock);
-	if (adding) {
-		cpumask_or(subpartitions_cpus,
-			   subpartitions_cpus, tmp->addmask);
-		cpumask_andnot(top_cpuset.effective_cpus,
-			       top_cpuset.effective_cpus, tmp->addmask);
-	}
-	if (deleting) {
-		cpumask_andnot(subpartitions_cpus,
-			       subpartitions_cpus, tmp->delmask);
-		cpumask_and(tmp->delmask,
-			    tmp->delmask, cpu_active_mask);
-		cpumask_or(top_cpuset.effective_cpus,
-			   top_cpuset.effective_cpus, tmp->delmask);
-	}
+	if (adding)
+		partition_xcpus_add(prs, NULL, tmp->addmask);
+	if (deleting)
+		partition_xcpus_del(prs, NULL, tmp->delmask);
 	spin_unlock_irq(&callback_lock);
 
 	/*
@@ -1676,11 +1734,11 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
  * @tmp:     Temporary addmask and delmask
  * Return:   0 or a partition root state error code
  *
- * For partcmd_enable, the cpuset is being transformed from a non-partition
- * root to a partition root. The effective_xcpus (cpus_allowed if effective_xcpus
- * not set) mask of the given cpuset will be taken away from parent's
- * effective_cpus. The function will return 0 if all the CPUs listed in
- * effective_xcpus can be granted or an error code will be returned.
+ * For partcmd_enable*, the cpuset is being transformed from a non-partition
+ * root to a partition root. The effective_xcpus (cpus_allowed if
+ * effective_xcpus not set) mask of the given cpuset will be taken away from
+ * parent's effective_cpus. The function will return 0 if all the CPUs listed
+ * in effective_xcpus can be granted or an error code will be returned.
  *
  * For partcmd_disable, the cpuset is being transformed from a partition
  * root back to a non-partition root. Any CPUs in effective_xcpus will be
@@ -1695,7 +1753,7 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
  *
  * For partcmd_invalidate, the current partition will be made invalid.
  *
- * The partcmd_enable and partcmd_disable commands are used by
+ * The partcmd_enable* and partcmd_disable commands are used by
  * update_prstate(). An error code may be returned and the caller will check
  * for error.
  *
@@ -1760,7 +1818,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 
 	nocpu = tasks_nocpu_error(parent, cs, xcpus);
 
-	if (cmd == partcmd_enable) {
+	if ((cmd == partcmd_enable) || (cmd == partcmd_enablei)) {
 		/*
 		 * Enabling partition root is not allowed if its
 		 * effective_xcpus is empty or doesn't overlap with
@@ -1783,6 +1841,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 		cpumask_copy(tmp->delmask, xcpus);
 		deleting = true;
 		subparts_delta++;
+		new_prs = (cmd == partcmd_enable) ? PRS_ROOT : PRS_ISOLATED;
 	} else if (cmd == partcmd_disable) {
 		/*
 		 * May need to add cpus to parent's effective_cpus for
@@ -1792,6 +1851,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 			  cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus);
 		if (adding)
 			subparts_delta--;
+		new_prs = PRS_MEMBER;
 	} else if (newmask) {
 		/*
 		 * Empty cpumask is not allowed
@@ -1940,37 +2000,24 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	 * newly deleted ones will be added back to effective_cpus.
 	 */
 	spin_lock_irq(&callback_lock);
-	if (adding) {
-		if (parent == &top_cpuset)
-			cpumask_andnot(subpartitions_cpus,
-				       subpartitions_cpus, tmp->addmask);
-		/*
-		 * Some of the CPUs in effective_xcpus might have been offlined.
-		 */
-		cpumask_or(parent->effective_cpus,
-			   parent->effective_cpus, tmp->addmask);
-		cpumask_and(parent->effective_cpus,
-			    parent->effective_cpus, cpu_active_mask);
-	}
-	if (deleting) {
-		if (parent == &top_cpuset)
-			cpumask_or(subpartitions_cpus,
-				   subpartitions_cpus, tmp->delmask);
-		cpumask_andnot(parent->effective_cpus,
-			       parent->effective_cpus, tmp->delmask);
-	}
-
-	if (is_partition_valid(parent)) {
-		parent->nr_subparts += subparts_delta;
-		WARN_ON_ONCE(parent->nr_subparts < 0);
-	}
-
 	if (old_prs != new_prs) {
 		cs->partition_root_state = new_prs;
 		if (new_prs <= 0)
 			cs->nr_subparts = 0;
 	}
+	/*
+	 * Adding to parent's effective_cpus means deletion CPUs from cs
+	 * and vice versa.
+	 */
+	if (adding)
+		partition_xcpus_del(old_prs, parent, tmp->addmask);
+	if (deleting)
+		partition_xcpus_add(new_prs, parent, tmp->delmask);
 
+	if (is_partition_valid(parent)) {
+		parent->nr_subparts += subparts_delta;
+		WARN_ON_ONCE(parent->nr_subparts < 0);
+	}
 	spin_unlock_irq(&callback_lock);
 
 	if ((old_prs != new_prs) && (cmd == partcmd_update))
@@ -2948,6 +2995,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	int err = PERR_NONE, old_prs = cs->partition_root_state;
 	struct cpuset *parent = parent_cs(cs);
 	struct tmpmasks tmpmask;
+	bool new_xcpus_state = false;
 
 	if (old_prs == new_prs)
 		return 0;
@@ -2977,6 +3025,9 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		goto out;
 
 	if (!old_prs) {
+		enum partition_cmd cmd = (new_prs == PRS_ROOT)
+				       ? partcmd_enable : partcmd_enablei;
+
 		/*
 		 * cpus_allowed cannot be empty.
 		 */
@@ -2985,19 +3036,18 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 			goto out;
 		}
 
-		err = update_parent_effective_cpumask(cs, partcmd_enable,
-						      NULL, &tmpmask);
+		err = update_parent_effective_cpumask(cs, cmd, NULL, &tmpmask);
 		/*
 		 * If an attempt to become local partition root fails,
 		 * try to become a remote partition root instead.
 		 */
-		if (err && remote_partition_enable(cs, &tmpmask))
+		if (err && remote_partition_enable(cs, new_prs, &tmpmask))
 			err = 0;
 	} else if (old_prs && new_prs) {
 		/*
 		 * A change in load balance state only, no change in cpumasks.
 		 */
-		;
+		new_xcpus_state = true;
 	} else {
 		/*
 		 * Switching back to member is always allowed even if it
@@ -3029,6 +3079,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	WRITE_ONCE(cs->prs_err, err);
 	if (!is_partition_valid(cs))
 		reset_partition_data(cs);
+	else if (new_xcpus_state)
+		partition_xcpus_newstate(old_prs, new_prs, cs->effective_xcpus);
 	spin_unlock_irq(&callback_lock);
 
 	/* Force update if switching back to member */
@@ -3386,6 +3438,7 @@ typedef enum {
 	FILE_SUBPARTS_CPULIST,
 	FILE_EXCLUSIVE_CPULIST,
 	FILE_EFFECTIVE_XCPULIST,
+	FILE_ISOLATED_CPULIST,
 	FILE_CPU_EXCLUSIVE,
 	FILE_MEM_EXCLUSIVE,
 	FILE_MEM_HARDWALL,
@@ -3582,6 +3635,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	case FILE_SUBPARTS_CPULIST:
 		seq_printf(sf, "%*pbl\n", cpumask_pr_args(subpartitions_cpus));
 		break;
+	case FILE_ISOLATED_CPULIST:
+		seq_printf(sf, "%*pbl\n", cpumask_pr_args(isolated_cpus));
+		break;
 	default:
 		ret = -EINVAL;
 	}
@@ -3875,6 +3931,13 @@ static struct cftype dfl_files[] = {
 		.flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG,
 	},
 
+	{
+		.name = "cpus.isolated",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_ISOLATED_CPULIST,
+		.flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG,
+	},
+
 	{ }	/* terminate */
 };
 
@@ -4194,6 +4257,7 @@ int __init cpuset_init(void)
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_xcpus, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
 
 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH-cgroup 4/4] cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask
  2023-10-13 18:11 [PATCH-cgroup 0/4] cgroup/cpuset: Improve CPU isolation in isolated partitions Waiman Long
                   ` (2 preceding siblings ...)
  2023-10-13 18:11 ` [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions Waiman Long
@ 2023-10-13 18:11 ` Waiman Long
  3 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2023-10-13 18:11 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Lai Jiangshan, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-kselftest, Waiman Long

To make CPUs in isolated cpuset partition closer in isolation to
the boot time isolated CPUs specified in the "isolcpus" boot command
line option, we need to take those CPUs out of the workqueue unbound
cpumask so that work functions from the unbound workqueues won't run
on those CPUs.  Otherwise, they will interfere the user tasks running
on those isolated CPUs.

With the introduction of the workqueue_unbound_exclude_cpumask() helper
function in a previous commit, those isolated CPUs can now be taken
out from the workqueue unbound cpumask.

This patch also updates cgroup-v2.rst to mention that isolated
CPUs will be excluded from unbound workqueue cpumask as well as
updating test_cpuset_prs.sh to verify the correctness of the new
*cpuset.cpus.isolated file, if available via cgroup_debug option.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst       | 10 +--
 kernel/cgroup/cpuset.c                        | 67 ++++++++++++++++---
 .../selftests/cgroup/test_cpuset_prs.sh       | 67 ++++++++++++++++---
 3 files changed, 120 insertions(+), 24 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index e40b8560e002..d91ec638403b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2311,11 +2311,11 @@ Cpuset Interface Files
 	partition or scheduling domain.  The set of exclusive CPUs is
 	determined by the value of its "cpuset.cpus.exclusive.effective".
 
-	When set to "isolated", the CPUs in that partition will
-	be in an isolated state without any load balancing from the
-	scheduler.  Tasks placed in such a partition with multiple
-	CPUs should be carefully distributed and bound to each of the
-	individual CPUs for optimal performance.
+	When set to "isolated", the CPUs in that partition will be in
+	an isolated state without any load balancing from the scheduler
+	and excluded from the unbound workqueues.  Tasks placed in such
+	a partition with multiple CPUs should be carefully distributed
+	and bound to each of the individual CPUs for optimal performance.
 
 	A partition root ("root" or "isolated") can be in one of the
 	two possible states - valid or invalid.  An invalid partition
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 19c8779798fd..da251764611f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -43,6 +43,7 @@
 #include <linux/sched/isolation.h>
 #include <linux/cgroup.h>
 #include <linux/wait.h>
+#include <linux/workqueue.h>
 
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -1444,25 +1445,31 @@ static void partition_xcpus_newstate(int old_prs, int new_prs, struct cpumask *x
  * @new_prs: new partition_root_state
  * @parent: parent cpuset
  * @xcpus: exclusive CPUs to be added
+ * Return: true if isolated_cpus modified, false otherwise
  *
  * Remote partition if parent == NULL
  */
-static void partition_xcpus_add(int new_prs, struct cpuset *parent,
+static bool partition_xcpus_add(int new_prs, struct cpuset *parent,
 				struct cpumask *xcpus)
 {
+	bool isolcpus_updated;
+
 	WARN_ON_ONCE(new_prs < 0);
 	lockdep_assert_held(&callback_lock);
 	if (!parent)
 		parent = &top_cpuset;
 
+
 	if (parent == &top_cpuset)
 		cpumask_or(subpartitions_cpus, subpartitions_cpus, xcpus);
 
-	if (new_prs != parent->partition_root_state)
+	isolcpus_updated = (new_prs != parent->partition_root_state);
+	if (isolcpus_updated)
 		partition_xcpus_newstate(parent->partition_root_state, new_prs,
 					 xcpus);
 
 	cpumask_andnot(parent->effective_cpus, parent->effective_cpus, xcpus);
+	return isolcpus_updated;
 }
 
 /*
@@ -1470,12 +1477,15 @@ static void partition_xcpus_add(int new_prs, struct cpuset *parent,
  * @old_prs: old partition_root_state
  * @parent: parent cpuset
  * @xcpus: exclusive CPUs to be removed
+ * Return: true if isolated_cpus modified, false otherwise
  *
  * Remote partition if parent == NULL
  */
-static void partition_xcpus_del(int old_prs, struct cpuset *parent,
+static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
 				struct cpumask *xcpus)
 {
+	bool isolcpus_updated;
+
 	WARN_ON_ONCE(old_prs < 0);
 	lockdep_assert_held(&callback_lock);
 	if (!parent)
@@ -1484,12 +1494,34 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
 	if (parent == &top_cpuset)
 		cpumask_andnot(subpartitions_cpus, subpartitions_cpus, xcpus);
 
-	if (old_prs != parent->partition_root_state)
+	isolcpus_updated = (old_prs != parent->partition_root_state);
+	if (isolcpus_updated)
 		partition_xcpus_newstate(old_prs, parent->partition_root_state,
 					 xcpus);
 
 	cpumask_and(xcpus, xcpus, cpu_active_mask);
 	cpumask_or(parent->effective_cpus, parent->effective_cpus, xcpus);
+	return isolcpus_updated;
+}
+
+static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
+{
+	int ret;
+
+	if (!isolcpus_updated)
+		return;
+
+	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
+
+	/*
+	 * An error of -EBUSY will be returned if there is a hotplug operation
+	 * in progress. Skip the update and hopefully the unbound workqueue
+	 * cpumask will be correctly set next time.
+	 */
+	if (ret == -EBUSY)
+		return;
+
+	WARN_ON_ONCE(ret < 0);
 }
 
 /*
@@ -1540,6 +1572,8 @@ static inline bool is_local_partition(struct cpuset *cs)
 static int remote_partition_enable(struct cpuset *cs, int new_prs,
 				   struct tmpmasks *tmp)
 {
+	bool isolcpus_updated;
+
 	/*
 	 * The user must have sysadmin privilege.
 	 */
@@ -1561,7 +1595,7 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
 		return 0;
 
 	spin_lock_irq(&callback_lock);
-	partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
+	isolcpus_updated = partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
 	list_add(&cs->remote_sibling, &remote_children);
 	if (cs->use_parent_ecpus) {
 		struct cpuset *parent = parent_cs(cs);
@@ -1570,13 +1604,13 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
 		parent->child_ecpus_count--;
 	}
 	spin_unlock_irq(&callback_lock);
+	update_unbound_workqueue_cpumask(isolcpus_updated);
 
 	/*
 	 * Proprogate changes in top_cpuset's effective_cpus down the hierarchy.
 	 */
 	update_tasks_cpumask(&top_cpuset, tmp->new_cpus);
 	update_sibling_cpumasks(&top_cpuset, NULL, tmp);
-
 	return 1;
 }
 
@@ -1591,18 +1625,22 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
  */
 static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 {
+	bool isolcpus_updated;
+
 	compute_effective_exclusive_cpumask(cs, tmp->new_cpus);
 	WARN_ON_ONCE(!is_remote_partition(cs));
 	WARN_ON_ONCE(!cpumask_subset(tmp->new_cpus, subpartitions_cpus));
 
 	spin_lock_irq(&callback_lock);
 	list_del_init(&cs->remote_sibling);
-	partition_xcpus_del(cs->partition_root_state, NULL, tmp->new_cpus);
+	isolcpus_updated = partition_xcpus_del(cs->partition_root_state,
+					       NULL, tmp->new_cpus);
 	cs->partition_root_state = -cs->partition_root_state;
 	if (!cs->prs_err)
 		cs->prs_err = PERR_INVCPUS;
 	reset_partition_data(cs);
 	spin_unlock_irq(&callback_lock);
+	update_unbound_workqueue_cpumask(isolcpus_updated);
 
 	/*
 	 * Proprogate changes in top_cpuset's effective_cpus down the hierarchy.
@@ -1625,6 +1663,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *newmask,
 {
 	bool adding, deleting;
 	int prs = cs->partition_root_state;
+	int isolcpus_updated = 0;
 
 	if (WARN_ON_ONCE(!is_remote_partition(cs)))
 		return;
@@ -1649,10 +1688,11 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *newmask,
 
 	spin_lock_irq(&callback_lock);
 	if (adding)
-		partition_xcpus_add(prs, NULL, tmp->addmask);
+		isolcpus_updated += partition_xcpus_add(prs, NULL, tmp->addmask);
 	if (deleting)
-		partition_xcpus_del(prs, NULL, tmp->delmask);
+		isolcpus_updated += partition_xcpus_del(prs, NULL, tmp->delmask);
 	spin_unlock_irq(&callback_lock);
+	update_unbound_workqueue_cpumask(isolcpus_updated);
 
 	/*
 	 * Proprogate changes in top_cpuset's effective_cpus down the hierarchy.
@@ -1774,6 +1814,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	int part_error = PERR_NONE;	/* Partition error? */
 	int subparts_delta = 0;
 	struct cpumask *xcpus;		/* cs effective_xcpus */
+	int isolcpus_updated = 0;
 	bool nocpu;
 
 	lockdep_assert_held(&cpuset_mutex);
@@ -2010,15 +2051,18 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	 * and vice versa.
 	 */
 	if (adding)
-		partition_xcpus_del(old_prs, parent, tmp->addmask);
+		isolcpus_updated += partition_xcpus_del(old_prs, parent,
+							tmp->addmask);
 	if (deleting)
-		partition_xcpus_add(new_prs, parent, tmp->delmask);
+		isolcpus_updated += partition_xcpus_add(new_prs, parent,
+							tmp->delmask);
 
 	if (is_partition_valid(parent)) {
 		parent->nr_subparts += subparts_delta;
 		WARN_ON_ONCE(parent->nr_subparts < 0);
 	}
 	spin_unlock_irq(&callback_lock);
+	update_unbound_workqueue_cpumask(isolcpus_updated);
 
 	if ((old_prs != new_prs) && (cmd == partcmd_update))
 		update_partition_exclusive(cs, new_prs);
@@ -3082,6 +3126,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	else if (new_xcpus_state)
 		partition_xcpus_newstate(old_prs, new_prs, cs->effective_xcpus);
 	spin_unlock_irq(&callback_lock);
+	update_unbound_workqueue_cpumask(new_xcpus_state);
 
 	/* Force update if switching back to member */
 	update_cpumasks_hier(cs, &tmpmask, !new_prs ? HIER_CHECKALL : 0);
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 2b825019f806..f117f704e675 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -232,11 +232,11 @@ TEST_MATRIX=(
 	" C0-3:S+ C1-3:S+ C2-3   C4-5   X2-3  X2-3:P1   P2     P1    0 A1:0-1,A2:,A3:2-3,B1:4-5 \
 								       A1:P0,A2:P1,A3:P2,B1:P1 2-3"
 	" C0-3:S+ C1-3:S+ C2-3    C4    X2-3  X2-3:P1   P2     P1    0 A1:0-1,A2:,A3:2-3,B1:4 \
-								       A1:P0,A2:P1,A3:P2,B1:P1 2-4"
+								       A1:P0,A2:P1,A3:P2,B1:P1 2-4,2-3"
 	" C0-3:S+ C1-3:S+  C3     C4    X2-3  X2-3:P1   P2     P1    0 A1:0-1,A2:2,A3:3,B1:4 \
-								       A1:P0,A2:P1,A3:P2,B1:P1 2-4"
+								       A1:P0,A2:P1,A3:P2,B1:P1 2-4,3"
 	" C0-4:S+ C1-4:S+ C2-4     .    X2-4  X2-4:P2  X4:P1    .    0 A1:0-1,A2:2-3,A3:4 \
-								       A1:P0,A2:P2,A3:P1 2-4"
+								       A1:P0,A2:P2,A3:P1 2-4,2-3"
 	" C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \
 				   .      .      X5      .      .    0 A1:0-4,A2:1-4,A3:2-4 \
 								       A1:P0,A2:P-2,A3:P-1"
@@ -248,7 +248,7 @@ TEST_MATRIX=(
 	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X2-3 X2-3:P2:O2=0 .   0 A1:0-1,A2:1,A3:3 A1:P0,A3:P2 2-3"
 	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X2-3 X2-3:P2:O2=0 O2=1 0 A1:0-1,A2:1,A3:2-3 A1:P0,A3:P2 2-3"
 	" C0-3:S+ C1-3:S+  C3      .    X2-3   X2-3    P2:O3=0   .   0 A1:0-2,A2:1-2,A3: A1:P0,A3:P2 3"
-	" C0-3:S+ C1-3:S+  C3      .    X2-3   X2-3   T:P2:O3=0  .   0 A1:0-2,A2:1-2,A3:1-2 A1:P0,A3:P-2 3"
+	" C0-3:S+ C1-3:S+  C3      .    X2-3   X2-3   T:P2:O3=0  .   0 A1:0-2,A2:1-2,A3:1-2 A1:P0,A3:P-2 3,"
 
 	# An invalidated remote partition cannot self-recover from hotplug
 	" C0-3:S+ C1-3:S+  C2      .    X2-3   X2-3   T:P2:O2=0 O2=1 0 A1:0-3,A2:1-3,A3:2 A1:P0,A3:P-2"
@@ -508,12 +508,14 @@ dump_states()
 		XECPUS=$DIR/cpuset.cpus.exclusive.effective
 		PRS=$DIR/cpuset.cpus.partition
 		PCPUS=$DIR/.__DEBUG__.cpuset.cpus.subpartitions
+		ISCPUS=$DIR/.__DEBUG__.cpuset.cpus.isolated
 		[[ -e $CPUS   ]] && echo "$CPUS: $(cat $CPUS)"
 		[[ -e $XCPUS  ]] && echo "$XCPUS: $(cat $XCPUS)"
 		[[ -e $ECPUS  ]] && echo "$ECPUS: $(cat $ECPUS)"
 		[[ -e $XECPUS ]] && echo "$XECPUS: $(cat $XECPUS)"
 		[[ -e $PRS    ]] && echo "$PRS: $(cat $PRS)"
 		[[ -e $PCPUS  ]] && echo "$PCPUS: $(cat $PCPUS)"
+		[[ -e $ISCPUS ]] && echo "$ISCPUS: $(cat $ISCPUS)"
 	done
 }
 
@@ -591,11 +593,17 @@ check_cgroup_states()
 
 #
 # Get isolated (including offline) CPUs by looking at
-# /sys/kernel/debug/sched/domains and compare that with the expected value.
+# /sys/kernel/debug/sched/domains and *cpuset.cpus.isolated control file,
+# if available, and compare that with the expected value.
 #
-# Note that a sched domain of just 1 CPU will be considered isolated.
+# Note that isolated CPUs from the sched/domains context include offline
+# CPUs as well as CPUs in non-isolated 1-CPU partition. Those CPUs may
+# not be included in the *cpuset.cpus.isolated control file which contains
+# only CPUs in isolated partitions.
 #
-# $1 - expected isolated cpu list
+# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
+# <isolcpus1> - expected sched/domains value
+# <isolcpus2> - *cpuset.cpus.isolated value = <isolcpus1> if not defined
 #
 check_isolcpus()
 {
@@ -603,8 +611,33 @@ check_isolcpus()
 	ISOLCPUS=
 	LASTISOLCPU=
 	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
+	ISCPUS=${CGROUP2}/.__DEBUG__.cpuset.cpus.isolated
+	if [[ $EXPECT_VAL = . ]]
+	then
+		EXPECT_VAL=
+		EXPECT_VAL2=
+	elif [[ $(expr $EXPECT_VAL : ".*,.*") > 0 ]]
+	then
+		set -- $(echo $EXPECT_VAL | sed -e "s/,/ /g")
+		EXPECT_VAL=$1
+		EXPECT_VAL2=$2
+	else
+		EXPECT_VAL2=$EXPECT_VAL
+	fi
+
+	#
+	# Check the debug isolated cpumask, if present
+	#
+	[[ -f $ISCPUS ]] && {
+		ISOLCPUS=$(cat $ISCPUS)
+		[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && return 1
+		ISOLCPUS=
+	}
+
+	#
+	# Use the sched domain in debugfs to check isolated CPUs, if available
+	#
 	[[ -d $SCHED_DOMAINS ]] || return 0
-	[[ $EXPECT_VAL = . ]] && EXPECT_VAL=
 
 	for ((CPU=0; CPU < $NR_CPUS; CPU++))
 	do
@@ -648,6 +681,22 @@ test_fail()
 	exit 1
 }
 
+#
+# Check to see if there are unexpected isolated CPUs left
+#
+null_isolcpus_check()
+{
+	[[ $VERBOSE -gt 0 ]] || return 0
+	pause 0.01
+	check_isolcpus "."
+	if [[ $? -ne 0 ]]
+	then
+		echo "Unexpected isolated CPUs: $ISOLCPUS"
+		dump_states
+		exit 1
+	fi
+}
+
 #
 # Run cpuset state transition test
 #  $1 - test matrix name
@@ -733,6 +782,7 @@ run_state_test()
 			echo "Effective cpus changed to $NEWLIST after test $I!"
 			exit 1
 		}
+		null_isolcpus_check
 		[[ $VERBOSE -gt 0 ]] && echo "Test $I done."
 		((I++))
 	done
@@ -802,6 +852,7 @@ test_isolated()
 	console_msg "Cleaning up"
 	echo $$ > $CGROUP2/cgroup.procs
 	[[ -d A1 ]] && rmdir A1
+	null_isolcpus_check
 }
 
 #
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
  2023-10-13 18:11 ` [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask Waiman Long
@ 2023-10-18  9:24   ` Tejun Heo
  2023-10-18 13:41     ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2023-10-18  9:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Lai Jiangshan,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-kselftest

Hello,

On Fri, Oct 13, 2023 at 02:11:19PM -0400, Waiman Long wrote:
> When the "isolcpus" boot command line option is used to add a set
> of isolated CPUs, those CPUs will be excluded automatically from
> wq_unbound_cpumask to avoid running work functions from unbound
> workqueues.
> 
> Recently cpuset has been extended to allow the creation of partitions
> of isolated CPUs dynamically. To make it closer to the "isolcpus"
> in functionality, the CPUs in those isolated cpuset partitions should be
> excluded from wq_unbound_cpumask as well. This can be done currently by
> explicitly writing to the workqueue's cpumask sysfs file after creating
> the isolated partitions. However, this process can be error prone.
> Ideally, the cpuset code should be allowed to request the workqueue code
> to exclude those isolated CPUs from wq_unbound_cpumask so that this
> operation can be done automatically and the isolated CPUs will be returned
> back to wq_unbound_cpumask after the destructions of the isolated
> cpuset partitions.
> 
> This patch adds a new workqueue_unbound_exclude_cpumask() to enable
> that. This new function will exclude the specified isolated CPUs
> from wq_unbound_cpumask. To be able to restore those isolated CPUs
> back after the destruction of isolated cpuset partitions, a new
> wq_user_unbound_cpumask is added to store the user provided unbound
> cpumask either from the boot command line options or from writing to
> the cpumask sysfs file. This new cpumask provides the basis for CPU
> exclusion.

The behaviors around wq_unbound_cpumask is getting pretty inconsistent:

1. Housekeeping excludes isolated CPUs on boot but allows user to override
   it to include isolated CPUs afterwards.

2. If an unbound wq's cpumask doesn't have any intersection with
   wq_unbound_cpumask we ignore the per-wq cpumask and falls back to
   wq_unbound_cpumask.

3. You're adding a masking layer on top with exclude which fails to set if
   the intersection is empty.

Can we do the followings for consistency?

1. User's requested_unbound_cpumask is stored separately (as in this patch).

2. The effect wq_unbound_cpumask is determined by requested_unbound_cpumask
   & housekeeping_cpumask & cpuset_allowed_cpumask. The operation order
   matters. When an & operation yields an cpumask, the cpumask from the
   previous step is the effective one.

3. Expose these cpumasks in sysfs so that what's happening is obvious.

> +int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
> +{
> +	cpumask_var_t cpumask;
> +	int ret = 0;
> +
> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	/*
> +	 * The caller of this function may have called cpus_read_lock(),
> +	 * use cpus_read_trylock() to avoid potential deadlock.
> +	 */
> +	if (!cpus_read_trylock())
> +		return -EBUSY;

This means that a completely unrelated cpus_write_lock() can fail this
operation and thus cpuset config writes. Let's please not do this. Can't we
just make sure that the caller holds the lock?

> +	mutex_lock(&wq_pool_mutex);
> +
> +	if (!cpumask_andnot(cpumask, wq_user_unbound_cpumask, exclude_cpumask))
> +		ret = -EINVAL;	/* The new cpumask can't be empty */

For better or worse, the usual mode-of-failure for "no usable CPU" is just
falling back to something which works rather than failing the operation.
Let's follow that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions
  2023-10-13 18:11 ` [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions Waiman Long
@ 2023-10-18  9:26   ` Tejun Heo
  2023-10-18 13:30     ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2023-10-18  9:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Lai Jiangshan,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-kselftest

On Fri, Oct 13, 2023 at 02:11:21PM -0400, Waiman Long wrote:
...
> @@ -3875,6 +3931,13 @@ static struct cftype dfl_files[] = {
>  		.flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG,
>  	},
>  
> +	{
> +		.name = "cpus.isolated",
> +		.seq_show = cpuset_common_seq_show,
> +		.private = FILE_ISOLATED_CPULIST,
> +		.flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG,
> +	},

I'd much rather show this in a wq sysfs file along with other related masks,
and not in a DEBUG file.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions
  2023-10-18  9:26   ` Tejun Heo
@ 2023-10-18 13:30     ` Waiman Long
  2023-10-18 18:08       ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2023-10-18 13:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Lai Jiangshan,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-kselftest

On 10/18/23 05:26, Tejun Heo wrote:
> On Fri, Oct 13, 2023 at 02:11:21PM -0400, Waiman Long wrote:
> ...
>> @@ -3875,6 +3931,13 @@ static struct cftype dfl_files[] = {
>>   		.flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG,
>>   	},
>>   
>> +	{
>> +		.name = "cpus.isolated",
>> +		.seq_show = cpuset_common_seq_show,
>> +		.private = FILE_ISOLATED_CPULIST,
>> +		.flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG,
>> +	},
> I'd much rather show this in a wq sysfs file along with other related masks,
> and not in a DEBUG file.

It can certainly be exposed as a permanent addition to the cgroup 
control files instead of a debug only file. However this set of isolated 
CPUs may be used by others not just by workqueue. So I doubt if it 
should be a sysfs file in the workqueue directory. I can see if it is 
possible to put a symlink there point back to the cgroupfs.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
  2023-10-18  9:24   ` Tejun Heo
@ 2023-10-18 13:41     ` Waiman Long
  2023-10-18 19:18       ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2023-10-18 13:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Lai Jiangshan,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-kselftest

On 10/18/23 05:24, Tejun Heo wrote:
> Hello,
>
> On Fri, Oct 13, 2023 at 02:11:19PM -0400, Waiman Long wrote:
>> When the "isolcpus" boot command line option is used to add a set
>> of isolated CPUs, those CPUs will be excluded automatically from
>> wq_unbound_cpumask to avoid running work functions from unbound
>> workqueues.
>>
>> Recently cpuset has been extended to allow the creation of partitions
>> of isolated CPUs dynamically. To make it closer to the "isolcpus"
>> in functionality, the CPUs in those isolated cpuset partitions should be
>> excluded from wq_unbound_cpumask as well. This can be done currently by
>> explicitly writing to the workqueue's cpumask sysfs file after creating
>> the isolated partitions. However, this process can be error prone.
>> Ideally, the cpuset code should be allowed to request the workqueue code
>> to exclude those isolated CPUs from wq_unbound_cpumask so that this
>> operation can be done automatically and the isolated CPUs will be returned
>> back to wq_unbound_cpumask after the destructions of the isolated
>> cpuset partitions.
>>
>> This patch adds a new workqueue_unbound_exclude_cpumask() to enable
>> that. This new function will exclude the specified isolated CPUs
>> from wq_unbound_cpumask. To be able to restore those isolated CPUs
>> back after the destruction of isolated cpuset partitions, a new
>> wq_user_unbound_cpumask is added to store the user provided unbound
>> cpumask either from the boot command line options or from writing to
>> the cpumask sysfs file. This new cpumask provides the basis for CPU
>> exclusion.
> The behaviors around wq_unbound_cpumask is getting pretty inconsistent:
>
> 1. Housekeeping excludes isolated CPUs on boot but allows user to override
>     it to include isolated CPUs afterwards.
>
> 2. If an unbound wq's cpumask doesn't have any intersection with
>     wq_unbound_cpumask we ignore the per-wq cpumask and falls back to
>     wq_unbound_cpumask.
>
> 3. You're adding a masking layer on top with exclude which fails to set if
>     the intersection is empty.
>
> Can we do the followings for consistency?
>
> 1. User's requested_unbound_cpumask is stored separately (as in this patch).
>
> 2. The effect wq_unbound_cpumask is determined by requested_unbound_cpumask
>     & housekeeping_cpumask & cpuset_allowed_cpumask. The operation order
>     matters. When an & operation yields an cpumask, the cpumask from the
>     previous step is the effective one.
Sure. I will do that.
>
> 3. Expose these cpumasks in sysfs so that what's happening is obvious.

I can expose the requested_unbound_cpumask. As for the isolated CPUs, 
see my other reply.

>> +int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
>> +{
>> +	cpumask_var_t cpumask;
>> +	int ret = 0;
>> +
>> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
>> +		return -ENOMEM;
>> +
>> +	/*
>> +	 * The caller of this function may have called cpus_read_lock(),
>> +	 * use cpus_read_trylock() to avoid potential deadlock.
>> +	 */
>> +	if (!cpus_read_trylock())
>> +		return -EBUSY;
> This means that a completely unrelated cpus_write_lock() can fail this
> operation and thus cpuset config writes. Let's please not do this. Can't we
> just make sure that the caller holds the lock?
This condition is actually triggered by a few hotplug tests in 
test_cpuset_prs.sh. I will make sure that either cpu read or write lock 
is held before calling this function and eliminate rcu read locking here.
>
>> +	mutex_lock(&wq_pool_mutex);
>> +
>> +	if (!cpumask_andnot(cpumask, wq_user_unbound_cpumask, exclude_cpumask))
>> +		ret = -EINVAL;	/* The new cpumask can't be empty */
> For better or worse, the usual mode-of-failure for "no usable CPU" is just
> falling back to something which works rather than failing the operation.
> Let's follow that.

In this case, it is just leaving the current unbound cpumask unchanged. 
I will follow the precedence discussed above to make sure that there is 
a graceful fallback.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions
  2023-10-18 13:30     ` Waiman Long
@ 2023-10-18 18:08       ` Tejun Heo
  2023-10-18 18:24         ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2023-10-18 18:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Lai Jiangshan,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-kselftest

On Wed, Oct 18, 2023 at 09:30:04AM -0400, Waiman Long wrote:
> On 10/18/23 05:26, Tejun Heo wrote:
> > On Fri, Oct 13, 2023 at 02:11:21PM -0400, Waiman Long wrote:
> > ...
> > > @@ -3875,6 +3931,13 @@ static struct cftype dfl_files[] = {
> > >   		.flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG,
> > >   	},
> > > +	{
> > > +		.name = "cpus.isolated",
> > > +		.seq_show = cpuset_common_seq_show,
> > > +		.private = FILE_ISOLATED_CPULIST,
> > > +		.flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG,
> > > +	},
> > I'd much rather show this in a wq sysfs file along with other related masks,
> > and not in a DEBUG file.
> 
> It can certainly be exposed as a permanent addition to the cgroup control
> files instead of a debug only file. However this set of isolated CPUs may be
> used by others not just by workqueue. So I doubt if it should be a sysfs
> file in the workqueue directory. I can see if it is possible to put a
> symlink there point back to the cgroupfs.

I don't know whether it will happen but let's say there will be three
subsystems which call into workqueue for this. Wouldn't it be better to have
all of them in workqueue sysfs using a consistent naming scheme? What does
putting it in cgroupfs buy us?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions
  2023-10-18 18:08       ` Tejun Heo
@ 2023-10-18 18:24         ` Waiman Long
  2023-10-24  3:25           ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2023-10-18 18:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Lai Jiangshan,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-kselftest

On 10/18/23 14:08, Tejun Heo wrote:
> On Wed, Oct 18, 2023 at 09:30:04AM -0400, Waiman Long wrote:
>> On 10/18/23 05:26, Tejun Heo wrote:
>>> On Fri, Oct 13, 2023 at 02:11:21PM -0400, Waiman Long wrote:
>>> ...
>>>> @@ -3875,6 +3931,13 @@ static struct cftype dfl_files[] = {
>>>>    		.flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG,
>>>>    	},
>>>> +	{
>>>> +		.name = "cpus.isolated",
>>>> +		.seq_show = cpuset_common_seq_show,
>>>> +		.private = FILE_ISOLATED_CPULIST,
>>>> +		.flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG,
>>>> +	},
>>> I'd much rather show this in a wq sysfs file along with other related masks,
>>> and not in a DEBUG file.
>> It can certainly be exposed as a permanent addition to the cgroup control
>> files instead of a debug only file. However this set of isolated CPUs may be
>> used by others not just by workqueue. So I doubt if it should be a sysfs
>> file in the workqueue directory. I can see if it is possible to put a
>> symlink there point back to the cgroupfs.
> I don't know whether it will happen but let's say there will be three
> subsystems which call into workqueue for this. Wouldn't it be better to have
> all of them in workqueue sysfs using a consistent naming scheme? What does
> putting it in cgroupfs buy us?

If you mean saving the exclusion cpumask no matter who the caller is, we 
can add another exclusion cpumask to save it and expose it to sysfs. 
This should be done in the first workqueue patch, not as part of this 
patch. I expose this isolated cpumask for testing purpose to be checked 
by the test_cpuset_prs.sh script for correctness. As said, I can expose 
it without cgroup_debug if you think the information is useful in general.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
  2023-10-18 13:41     ` Waiman Long
@ 2023-10-18 19:18       ` Waiman Long
  2023-10-24  3:28         ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2023-10-18 19:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Lai Jiangshan,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-kselftest

On 10/18/23 09:41, Waiman Long wrote:
> On 10/18/23 05:24, Tejun Heo wrote:
>> Hello,
>>
>> On Fri, Oct 13, 2023 at 02:11:19PM -0400, Waiman Long wrote:
>>> When the "isolcpus" boot command line option is used to add a set
>>> of isolated CPUs, those CPUs will be excluded automatically from
>>> wq_unbound_cpumask to avoid running work functions from unbound
>>> workqueues.
>>>
>>> Recently cpuset has been extended to allow the creation of partitions
>>> of isolated CPUs dynamically. To make it closer to the "isolcpus"
>>> in functionality, the CPUs in those isolated cpuset partitions 
>>> should be
>>> excluded from wq_unbound_cpumask as well. This can be done currently by
>>> explicitly writing to the workqueue's cpumask sysfs file after creating
>>> the isolated partitions. However, this process can be error prone.
>>> Ideally, the cpuset code should be allowed to request the workqueue 
>>> code
>>> to exclude those isolated CPUs from wq_unbound_cpumask so that this
>>> operation can be done automatically and the isolated CPUs will be 
>>> returned
>>> back to wq_unbound_cpumask after the destructions of the isolated
>>> cpuset partitions.
>>>
>>> This patch adds a new workqueue_unbound_exclude_cpumask() to enable
>>> that. This new function will exclude the specified isolated CPUs
>>> from wq_unbound_cpumask. To be able to restore those isolated CPUs
>>> back after the destruction of isolated cpuset partitions, a new
>>> wq_user_unbound_cpumask is added to store the user provided unbound
>>> cpumask either from the boot command line options or from writing to
>>> the cpumask sysfs file. This new cpumask provides the basis for CPU
>>> exclusion.
>> The behaviors around wq_unbound_cpumask is getting pretty inconsistent:
>>
>> 1. Housekeeping excludes isolated CPUs on boot but allows user to 
>> override
>>     it to include isolated CPUs afterwards.
>>
>> 2. If an unbound wq's cpumask doesn't have any intersection with
>>     wq_unbound_cpumask we ignore the per-wq cpumask and falls back to
>>     wq_unbound_cpumask.
>>
>> 3. You're adding a masking layer on top with exclude which fails to 
>> set if
>>     the intersection is empty.
>>
>> Can we do the followings for consistency?
>>
>> 1. User's requested_unbound_cpumask is stored separately (as in this 
>> patch).
>>
>> 2. The effect wq_unbound_cpumask is determined by 
>> requested_unbound_cpumask
>>     & housekeeping_cpumask & cpuset_allowed_cpumask. The operation order
>>     matters. When an & operation yields an cpumask, the cpumask from the
>>     previous step is the effective one.
> Sure. I will do that.

I have a second thought after taking a further look at that. First of 
all, cpuset_allowed_mask isn't relevant here and the mask can certainly 
contain offline CPUs. So cpu_possible_mask is the proper fallback.

With the current patch, wq_user_unbound_cpumask is set up initially as  
(HK_TYPE_WQ ∩ HK_TYPE_DOMAIN) house keeping mask and rewritten by any 
subsequent write to workqueue/cpumask sysfs file. So using 
wq_user_unbound_cpumask has the implied precedence of user-sysfs written 
mask, command line isolcpus or nohz_full option mask and 
cpu_possible_mask. I think just fall back to wq_user_unbound_cpumask if 
the operation fails should be enough.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions
  2023-10-18 18:24         ` Waiman Long
@ 2023-10-24  3:25           ` Tejun Heo
  2023-10-25 18:46             ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2023-10-24  3:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Lai Jiangshan,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-kselftest

Hello, Waiman.

On Wed, Oct 18, 2023 at 02:24:00PM -0400, Waiman Long wrote:
> If you mean saving the exclusion cpumask no matter who the caller is, we can
> add another exclusion cpumask to save it and expose it to sysfs. This should
> be done in the first workqueue patch, not as part of this patch. I expose
> this isolated cpumask for testing purpose to be checked by the
> test_cpuset_prs.sh script for correctness. As said, I can expose it without
> cgroup_debug if you think the information is useful in general.

I don't really care where the cpumask is in the source tree. I just want all
the workqueue cpumasks presented to the userspace in a single place. Also, I
think it makes sense to publish it to userspace in an easily accessible
manner as what the eventual configuration ends up being can be confusing and
the effect it has on the system subtle.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
  2023-10-18 19:18       ` Waiman Long
@ 2023-10-24  3:28         ` Tejun Heo
  2023-10-25 18:47           ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2023-10-24  3:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Lai Jiangshan,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-kselftest

Hello,

On Wed, Oct 18, 2023 at 03:18:52PM -0400, Waiman Long wrote:
> I have a second thought after taking a further look at that. First of all,
> cpuset_allowed_mask isn't relevant here and the mask can certainly contain
> offline CPUs. So cpu_possible_mask is the proper fallback.
> 
> With the current patch, wq_user_unbound_cpumask is set up initially as 
> (HK_TYPE_WQ ∩ HK_TYPE_DOMAIN) house keeping mask and rewritten by any
> subsequent write to workqueue/cpumask sysfs file. So using

The current behavior is not something which is carefully planned. It's more
accidental than anything. If we can come up with a more intutive and
consistent behavior, that should be fine.

> wq_user_unbound_cpumask has the implied precedence of user-sysfs written
> mask, command line isolcpus or nohz_full option mask and cpu_possible_mask.
> I think just fall back to wq_user_unbound_cpumask if the operation fails
> should be enough.

But yeah, that sounds acceptable.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions
  2023-10-24  3:25           ` Tejun Heo
@ 2023-10-25 18:46             ` Waiman Long
  0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2023-10-25 18:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Lai Jiangshan,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-kselftest


On 10/23/23 23:25, Tejun Heo wrote:
> Hello, Waiman.
>
> On Wed, Oct 18, 2023 at 02:24:00PM -0400, Waiman Long wrote:
>> If you mean saving the exclusion cpumask no matter who the caller is, we can
>> add another exclusion cpumask to save it and expose it to sysfs. This should
>> be done in the first workqueue patch, not as part of this patch. I expose
>> this isolated cpumask for testing purpose to be checked by the
>> test_cpuset_prs.sh script for correctness. As said, I can expose it without
>> cgroup_debug if you think the information is useful in general.
> I don't really care where the cpumask is in the source tree. I just want all
> the workqueue cpumasks presented to the userspace in a single place. Also, I
> think it makes sense to publish it to userspace in an easily accessible
> manner as what the eventual configuration ends up being can be confusing and
> the effect it has on the system subtle.

I have added 2 more read-only cpumask sysfs files in v2 to expose the 
information.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
  2023-10-24  3:28         ` Tejun Heo
@ 2023-10-25 18:47           ` Waiman Long
  0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2023-10-25 18:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Lai Jiangshan,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-kselftest


On 10/23/23 23:28, Tejun Heo wrote:
> Hello,
>
> On Wed, Oct 18, 2023 at 03:18:52PM -0400, Waiman Long wrote:
>> I have a second thought after taking a further look at that. First of all,
>> cpuset_allowed_mask isn't relevant here and the mask can certainly contain
>> offline CPUs. So cpu_possible_mask is the proper fallback.
>>
>> With the current patch, wq_user_unbound_cpumask is set up initially as
>> (HK_TYPE_WQ ∩ HK_TYPE_DOMAIN) house keeping mask and rewritten by any
>> subsequent write to workqueue/cpumask sysfs file. So using
> The current behavior is not something which is carefully planned. It's more
> accidental than anything. If we can come up with a more intutive and
> consistent behavior, that should be fine.
>
>> wq_user_unbound_cpumask has the implied precedence of user-sysfs written
>> mask, command line isolcpus or nohz_full option mask and cpu_possible_mask.
>> I think just fall back to wq_user_unbound_cpumask if the operation fails
>> should be enough.
> But yeah, that sounds acceptable.

I have implemented the fallback to the user requested cpumask in the 
failure case.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-10-25 18:47 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-13 18:11 [PATCH-cgroup 0/4] cgroup/cpuset: Improve CPU isolation in isolated partitions Waiman Long
2023-10-13 18:11 ` [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask Waiman Long
2023-10-18  9:24   ` Tejun Heo
2023-10-18 13:41     ` Waiman Long
2023-10-18 19:18       ` Waiman Long
2023-10-24  3:28         ` Tejun Heo
2023-10-25 18:47           ` Waiman Long
2023-10-13 18:11 ` [PATCH-cgroup 2/4] selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh Waiman Long
2023-10-13 18:11 ` [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions Waiman Long
2023-10-18  9:26   ` Tejun Heo
2023-10-18 13:30     ` Waiman Long
2023-10-18 18:08       ` Tejun Heo
2023-10-18 18:24         ` Waiman Long
2023-10-24  3:25           ` Tejun Heo
2023-10-25 18:46             ` Waiman Long
2023-10-13 18:11 ` [PATCH-cgroup 4/4] cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.