* [PATCH 1/2] cgroup: Fix incorrect warning from cgroup_apply_control_disable()
@ 2021-09-10  2:42 ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2021-09-10  2:42 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Juri Lelli
  Cc: cgroups, linux-kernel, Waiman Long

It was found that the following warning was displayed when remounting
controllers from cgroup v2 to v1:

[ 8042.997778] WARNING: CPU: 88 PID: 80682 at kernel/cgroup/cgroup.c:3130 cgroup_apply_control_disable+0x158/0x190
   :
[ 8043.091109] RIP: 0010:cgroup_apply_control_disable+0x158/0x190
[ 8043.096946] Code: ff f6 45 54 01 74 39 48 8d 7d 10 48 c7 c6 e0 46 5a a4 e8 7b 67 33 00 e9 41 ff ff ff 49 8b 84 24 e8 01 00 00 0f b7 40 08 eb 95 <0f> 0b e9 5f ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3
[ 8043.115692] RSP: 0018:ffffba8a47c23d28 EFLAGS: 00010202
[ 8043.120916] RAX: 0000000000000036 RBX: ffffffffa624ce40 RCX: 000000000000181a
[ 8043.128047] RDX: ffffffffa63c43e0 RSI: ffffffffa63c43e0 RDI: ffff9d7284ee1000
[ 8043.135180] RBP: ffff9d72874c5800 R08: ffffffffa624b090 R09: 0000000000000004
[ 8043.142314] R10: ffffffffa624b080 R11: 0000000000002000 R12: ffff9d7284ee1000
[ 8043.149447] R13: ffff9d7284ee1000 R14: ffffffffa624ce70 R15: ffffffffa6269e20
[ 8043.156576] FS:  00007f7747cff740(0000) GS:ffff9d7a5fc00000(0000) knlGS:0000000000000000
[ 8043.164663] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8043.170409] CR2: 00007f7747e96680 CR3: 0000000887d60001 CR4: 00000000007706e0
[ 8043.177539] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8043.184673] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8043.191804] PKRU: 55555554
[ 8043.194517] Call Trace:
[ 8043.196970]  rebind_subsystems+0x18c/0x470
[ 8043.201070]  cgroup_setup_root+0x16c/0x2f0
[ 8043.205177]  cgroup1_root_to_use+0x204/0x2a0
[ 8043.209456]  cgroup1_get_tree+0x3e/0x120
[ 8043.213384]  vfs_get_tree+0x22/0xb0
[ 8043.216883]  do_new_mount+0x176/0x2d0
[ 8043.220550]  __x64_sys_mount+0x103/0x140
[ 8043.224474]  do_syscall_64+0x38/0x90
[ 8043.228063]  entry_SYSCALL_64_after_hwframe+0x44/0xae

The warning was triggered by the "WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt))"
check. A css killed via kill_css() by a previous call to
cgroup_apply_control_disable() had not yet been completely removed, so
the warning fired when cgroup_apply_control_disable() was called again
soon afterward.

Eliminate this incorrect warning by using a percpu_ref_is_dying() check
to skip further action on such a css instead.

Fixes: 945ba19968888 ("cgroup: combine cgroup_mutex locking and offline css draining")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cgroup.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 881ce1470beb..e31bca9fcd46 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3140,7 +3140,16 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
 			if (!css)
 				continue;
 
-			WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
+			/*
+			 * A kill_css() might have been called previously, but
+			 * the css may still linger for a while before being
+			 * removed. Skip it in this case.
+			 */
+			if (percpu_ref_is_dying(&css->refcnt)) {
+				WARN_ON_ONCE(css->parent &&
+					cgroup_ss_mask(dsct) & (1 << ss->id));
+				continue;
+			}
 
 			if (css->parent &&
 			    !(cgroup_ss_mask(dsct) & (1 << ss->id))) {
-- 
2.18.1
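The control flow the patch introduces — quietly skip a css whose refcount is already marked dying instead of warning — can be modeled in plain, userspace C. This is only an illustrative sketch: `fake_css`, its fields, and `apply_control_disable_one()` are stand-ins invented here, not kernel APIs, and a simple bool stands in for the percpu_ref dying state.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model of the fix, NOT kernel code: a css on which
 * kill_css() was already called (modeled by the "dying" flag) may
 * linger for a while before removal, and must be skipped quietly
 * rather than tripping a warning.
 */
struct fake_css {
	bool dying;	/* models percpu_ref_is_dying(&css->refcnt) */
	bool disabled;	/* set once the css is actually torn down   */
};

/* Returns true if the css was processed, false if it was skipped. */
static bool apply_control_disable_one(struct fake_css *css)
{
	if (css->dying)
		return false;	/* lingering from a previous kill; skip */
	css->disabled = true;	/* normal disable path                  */
	return true;
}
```

Before the patch, hitting the "dying" branch produced a WARN_ON_ONCE; after it, that state is treated as expected during back-to-back disable passes (e.g. a v2-to-v1 remount), with the warning retained only for the genuinely unexpected combination of conditions.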



* [PATCH 2/2] cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsem
@ 2021-09-10  2:42   ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2021-09-10  2:42 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Juri Lelli
  Cc: cgroups, linux-kernel, Waiman Long

Since commit 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to
percpu_rwsem"), cpuset_mutex has been replaced by cpuset_rwsem, a percpu
rwsem. However, the comments in kernel/cgroup/cpuset.c still reference
cpuset_mutex, and those references are now incorrect.

Change all comment references to cpuset_mutex over to cpuset_rwsem.

Fixes: 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 56 ++++++++++++++++++++++--------------------
 1 file changed, 29 insertions(+), 27 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index df1ccf4558f8..2a9695ccb65f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -311,17 +311,19 @@ static struct cpuset top_cpuset = {
 		if (is_cpuset_online(((des_cs) = css_cs((pos_css)))))
 
 /*
- * There are two global locks guarding cpuset structures - cpuset_mutex and
+ * There are two global locks guarding cpuset structures - cpuset_rwsem and
  * callback_lock. We also require taking task_lock() when dereferencing a
  * task's cpuset pointer. See "The task_lock() exception", at the end of this
- * comment.
+ * comment.  The cpuset code uses only cpuset_rwsem write lock.  Other
+ * kernel subsystems can use cpuset_read_lock()/cpuset_read_unlock() to
+ * prevent change to cpuset structures.
  *
  * A task must hold both locks to modify cpusets.  If a task holds
- * cpuset_mutex, then it blocks others wanting that mutex, ensuring that it
+ * cpuset_rwsem, it blocks others wanting that rwsem, ensuring that it
  * is the only task able to also acquire callback_lock and be able to
  * modify cpusets.  It can perform various checks on the cpuset structure
  * first, knowing nothing will change.  It can also allocate memory while
- * just holding cpuset_mutex.  While it is performing these checks, various
+ * just holding cpuset_rwsem.  While it is performing these checks, various
  * callback routines can briefly acquire callback_lock to query cpusets.
  * Once it is ready to make the changes, it takes callback_lock, blocking
  * everyone else.
@@ -393,7 +395,7 @@ static inline bool is_in_v2_mode(void)
  * One way or another, we guarantee to return some non-empty subset
  * of cpu_online_mask.
  *
- * Call with callback_lock or cpuset_mutex held.
+ * Call with callback_lock or cpuset_rwsem held.
  */
 static void guarantee_online_cpus(struct task_struct *tsk,
 				  struct cpumask *pmask)
@@ -435,7 +437,7 @@ static void guarantee_online_cpus(struct task_struct *tsk,
  * One way or another, we guarantee to return some non-empty subset
  * of node_states[N_MEMORY].
  *
- * Call with callback_lock or cpuset_mutex held.
+ * Call with callback_lock or cpuset_rwsem held.
  */
 static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
 {
@@ -447,7 +449,7 @@ static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
 /*
  * update task's spread flag if cpuset's page/slab spread flag is set
  *
- * Call with callback_lock or cpuset_mutex held.
+ * Call with callback_lock or cpuset_rwsem held.
  */
 static void cpuset_update_task_spread_flag(struct cpuset *cs,
 					struct task_struct *tsk)
@@ -468,7 +470,7 @@ static void cpuset_update_task_spread_flag(struct cpuset *cs,
  *
  * One cpuset is a subset of another if all its allowed CPUs and
  * Memory Nodes are a subset of the other, and its exclusive flags
- * are only set if the other's are set.  Call holding cpuset_mutex.
+ * are only set if the other's are set.  Call holding cpuset_rwsem.
  */
 
 static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
@@ -577,7 +579,7 @@ static inline void free_cpuset(struct cpuset *cs)
  * If we replaced the flag and mask values of the current cpuset
  * (cur) with those values in the trial cpuset (trial), would
  * our various subset and exclusive rules still be valid?  Presumes
- * cpuset_mutex held.
+ * cpuset_rwsem held.
  *
  * 'cur' is the address of an actual, in-use cpuset.  Operations
  * such as list traversal that depend on the actual address of the
@@ -700,7 +702,7 @@ static void update_domain_attr_tree(struct sched_domain_attr *dattr,
 	rcu_read_unlock();
 }
 
-/* Must be called with cpuset_mutex held.  */
+/* Must be called with cpuset_rwsem held.  */
 static inline int nr_cpusets(void)
 {
 	/* jump label reference count + the top-level cpuset */
@@ -726,7 +728,7 @@ static inline int nr_cpusets(void)
  * domains when operating in the severe memory shortage situations
  * that could cause allocation failures below.
  *
- * Must be called with cpuset_mutex held.
+ * Must be called with cpuset_rwsem held.
  *
  * The three key local variables below are:
  *    cp - cpuset pointer, used (together with pos_css) to perform a
@@ -1005,7 +1007,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
  * 'cpus' is removed, then call this routine to rebuild the
  * scheduler's dynamic sched domains.
  *
- * Call with cpuset_mutex held.  Takes cpus_read_lock().
+ * Call with cpuset_rwsem held.  Takes cpus_read_lock().
  */
 static void rebuild_sched_domains_locked(void)
 {
@@ -1078,7 +1080,7 @@ void rebuild_sched_domains(void)
  * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
  *
  * Iterate through each task of @cs updating its cpus_allowed to the
- * effective cpuset's.  As this function is called with cpuset_mutex held,
+ * effective cpuset's.  As this function is called with cpuset_rwsem held,
  * cpuset membership stays stable.
  */
 static void update_tasks_cpumask(struct cpuset *cs)
@@ -1347,7 +1349,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
  *
  * On legacy hierarchy, effective_cpus will be the same with cpu_allowed.
  *
- * Called with cpuset_mutex held
+ * Called with cpuset_rwsem held
  */
 static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 {
@@ -1704,12 +1706,12 @@ static void *cpuset_being_rebound;
  * @cs: the cpuset in which each task's mems_allowed mask needs to be changed
  *
  * Iterate through each task of @cs updating its mems_allowed to the
- * effective cpuset's.  As this function is called with cpuset_mutex held,
+ * effective cpuset's.  As this function is called with cpuset_rwsem held,
  * cpuset membership stays stable.
  */
 static void update_tasks_nodemask(struct cpuset *cs)
 {
-	static nodemask_t newmems;	/* protected by cpuset_mutex */
+	static nodemask_t newmems;	/* protected by cpuset_rwsem */
 	struct css_task_iter it;
 	struct task_struct *task;
 
@@ -1722,7 +1724,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
 	 * take while holding tasklist_lock.  Forks can happen - the
 	 * mpol_dup() cpuset_being_rebound check will catch such forks,
 	 * and rebind their vma mempolicies too.  Because we still hold
-	 * the global cpuset_mutex, we know that no other rebind effort
+	 * the global cpuset_rwsem, we know that no other rebind effort
 	 * will be contending for the global variable cpuset_being_rebound.
 	 * It's ok if we rebind the same mm twice; mpol_rebind_mm()
 	 * is idempotent.  Also migrate pages in each mm to new nodes.
@@ -1768,7 +1770,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
  *
  * On legacy hierarchy, effective_mems will be the same with mems_allowed.
  *
- * Called with cpuset_mutex held
+ * Called with cpuset_rwsem held
  */
 static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 {
@@ -1821,7 +1823,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
  * mempolicies and if the cpuset is marked 'memory_migrate',
  * migrate the tasks pages to the new memory.
  *
- * Call with cpuset_mutex held. May take callback_lock during call.
+ * Call with cpuset_rwsem held. May take callback_lock during call.
  * Will take tasklist_lock, scan tasklist for tasks in cpuset cs,
  * lock each such tasks mm->mmap_lock, scan its vma's and rebind
  * their mempolicies to the cpusets new mems_allowed.
@@ -1911,7 +1913,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
  * @cs: the cpuset in which each task's spread flags needs to be changed
  *
  * Iterate through each task of @cs updating its spread flags.  As this
- * function is called with cpuset_mutex held, cpuset membership stays
+ * function is called with cpuset_rwsem held, cpuset membership stays
  * stable.
  */
 static void update_tasks_flags(struct cpuset *cs)
@@ -1931,7 +1933,7 @@ static void update_tasks_flags(struct cpuset *cs)
  * cs:		the cpuset to update
  * turning_on: 	whether the flag is being set or cleared
  *
- * Call with cpuset_mutex held.
+ * Call with cpuset_rwsem held.
  */
 
 static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
@@ -1980,7 +1982,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
  * cs: the cpuset to update
  * new_prs: new partition root state
  *
- * Call with cpuset_mutex held.
+ * Call with cpuset_rwsem held.
  */
 static int update_prstate(struct cpuset *cs, int new_prs)
 {
@@ -2167,7 +2169,7 @@ static int fmeter_getrate(struct fmeter *fmp)
 
 static struct cpuset *cpuset_attach_old_cs;
 
-/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
+/* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
 static int cpuset_can_attach(struct cgroup_taskset *tset)
 {
 	struct cgroup_subsys_state *css;
@@ -2219,7 +2221,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 }
 
 /*
- * Protected by cpuset_mutex.  cpus_attach is used only by cpuset_attach()
+ * Protected by cpuset_rwsem.  cpus_attach is used only by cpuset_attach()
  * but we can't allocate it dynamically there.  Define it global and
  * allocate from cpuset_init().
  */
@@ -2227,7 +2229,7 @@ static cpumask_var_t cpus_attach;
 
 static void cpuset_attach(struct cgroup_taskset *tset)
 {
-	/* static buf protected by cpuset_mutex */
+	/* static buf protected by cpuset_rwsem */
 	static nodemask_t cpuset_attach_nodemask_to;
 	struct task_struct *task;
 	struct task_struct *leader;
@@ -2417,7 +2419,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	 * operation like this one can lead to a deadlock through kernfs
 	 * active_ref protection.  Let's break the protection.  Losing the
 	 * protection is okay as we check whether @cs is online after
-	 * grabbing cpuset_mutex anyway.  This only happens on the legacy
+	 * grabbing cpuset_rwsem anyway.  This only happens on the legacy
 	 * hierarchies.
 	 */
 	css_get(&cs->css);
@@ -3672,7 +3674,7 @@ void __cpuset_memory_pressure_bump(void)
  *  - Used for /proc/<pid>/cpuset.
  *  - No need to task_lock(tsk) on this tsk->cpuset reference, as it
  *    doesn't really matter if tsk->cpuset changes after we read it,
- *    and we take cpuset_mutex, keeping cpuset_attach() from changing it
+ *    and we take cpuset_rwsem, keeping cpuset_attach() from changing it
  *    anyway.
  */
 int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
-- 
2.18.1



* Re: [PATCH 1/2] cgroup: Fix incorrect warning from cgroup_apply_control_disable()
@ 2021-09-13 18:05   ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2021-09-13 18:05 UTC (permalink / raw)
  To: Waiman Long; +Cc: Zefan Li, Johannes Weiner, Juri Lelli, cgroups, linux-kernel

Hello,

On Thu, Sep 09, 2021 at 10:42:55PM -0400, Waiman Long wrote:
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 881ce1470beb..e31bca9fcd46 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3140,7 +3140,16 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
>  			if (!css)
>  				continue;
>  
> -			WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
> +			/*
> +			 * A kill_css() might have been called previously, but
> +			 * the css may still linger for a while before being
> +			 * removed. Skip it in this case.
> +			 */
> +			if (percpu_ref_is_dying(&css->refcnt)) {
> +				WARN_ON_ONCE(css->parent &&
> +					cgroup_ss_mask(dsct) & (1 << ss->id));
> +				continue;
> +			}

This warning did help me catch some gnarly bugs. Any chance we can keep it
for normal cases and elide it just for remounting?

Thanks.

-- 
tejun


* Re: [PATCH 2/2] cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsem
  2021-09-10  2:42   ` Waiman Long
  (?)
@ 2021-09-13 18:06   ` Tejun Heo
  -1 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2021-09-13 18:06 UTC (permalink / raw)
  To: Waiman Long; +Cc: Zefan Li, Johannes Weiner, Juri Lelli, cgroups, linux-kernel

On Thu, Sep 09, 2021 at 10:42:56PM -0400, Waiman Long wrote:
> Since commit 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to
> percpu_rwsem"), cpuset_mutex has been replaced by cpuset_rwsem which is
> a percpu rwsem. However, the comments in kernel/cgroup/cpuset.c still
> reference cpuset_mutex which are now incorrect.
> 
> Change all the references of cpuset_mutex to cpuset_rwsem.
> 
> Fixes: 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem")
> Signed-off-by: Waiman Long <longman@redhat.com>

Applied to cgroup/for-5.15-fixes.

Thanks.

-- 
tejun


* Re: [PATCH 1/2] cgroup: Fix incorrect warning from cgroup_apply_control_disable()
@ 2021-09-13 18:35     ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2021-09-13 18:35 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Zefan Li, Johannes Weiner, Juri Lelli, cgroups, linux-kernel

On 9/13/21 2:05 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Sep 09, 2021 at 10:42:55PM -0400, Waiman Long wrote:
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 881ce1470beb..e31bca9fcd46 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -3140,7 +3140,16 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
>>   			if (!css)
>>   				continue;
>>   
>> -			WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
>> +			/*
>> +			 * A kill_css() might have been called previously, but
>> +			 * the css may still linger for a while before being
>> +			 * removed. Skip it in this case.
>> +			 */
>> +			if (percpu_ref_is_dying(&css->refcnt)) {
>> +				WARN_ON_ONCE(css->parent &&
>> +					cgroup_ss_mask(dsct) & (1 << ss->id));
>> +				continue;
>> +			}
> This warning did help me catch some gnarly bugs. Any chance we can keep it
> for normal cases and elide it just for remounting?

The problem with percpu_ref_is_dying() is the fact that it becomes true 
after percpu_ref_exit() is called in css_free_rwork_fn() which has an 
RCU delay. If you want to catch the fact that kill_css() has been 
called, we can check the CSS_DYING flag which is set in kill_css() by 
commit 33c35aa481786 ("cgroup: Prevent kill_css() from being called more 
than once"). Will that be an acceptable alternative?

Cheers,
Longman



* Re: [PATCH 1/2] cgroup: Fix incorrect warning from cgroup_apply_control_disable()
@ 2021-09-13 18:40       ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2021-09-13 18:40 UTC (permalink / raw)
  To: Waiman Long; +Cc: Zefan Li, Johannes Weiner, Juri Lelli, cgroups, linux-kernel

Hello,

On Mon, Sep 13, 2021 at 02:35:03PM -0400, Waiman Long wrote:
> The problem with percpu_ref_is_dying() is the fact that it becomes true
> after percpu_ref_exit() is called in css_free_rwork_fn() which has an RCU
> delay. If you want to catch the fact that kill_css() has been called, we can
> check the CSS_DYING flag which is set in kill_css() by commit 33c35aa481786
> ("cgroup: Prevent kill_css() from being called more than once"). Will that
> be an acceptable alternative?

So, it's checking that cgroup_lock_and_drain_offline() actually waited for
csses which are in the process of going offline, because some operations
aren't safe when racing against them. I think the right solution here is
making sure that the [u]mount path drains the offlining cgroups rather than
getting rid of the warning. I think the warning is pointing out an actual
problem.

Thanks.

-- 
tejun


* Re: [PATCH 1/2] cgroup: Fix incorrect warning from cgroup_apply_control_disable()
@ 2021-09-13 18:43       ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2021-09-13 18:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Zefan Li, Johannes Weiner, Juri Lelli, cgroups, linux-kernel

On 9/13/21 2:35 PM, Waiman Long wrote:
> On 9/13/21 2:05 PM, Tejun Heo wrote:
>> Hello,
>>
>> On Thu, Sep 09, 2021 at 10:42:55PM -0400, Waiman Long wrote:
>>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>>> index 881ce1470beb..e31bca9fcd46 100644
>>> --- a/kernel/cgroup/cgroup.c
>>> +++ b/kernel/cgroup/cgroup.c
>>> @@ -3140,7 +3140,16 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
>>>               if (!css)
>>>                   continue;
>>> 
>>> -            WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
>>> +            /*
>>> +             * A kill_css() might have been called previously, but
>>> +             * the css may still linger for a while before being
>>> +             * removed. Skip it in this case.
>>> +             */
>>> +            if (percpu_ref_is_dying(&css->refcnt)) {
>>> +                WARN_ON_ONCE(css->parent &&
>>> +                    cgroup_ss_mask(dsct) & (1 << ss->id));
>>> +                continue;
>>> +            }
>> This warning did help me catch some gnarly bugs. Any chance we can 
>> keep it
>> for normal cases and elide it just for remounting?
>
> The problem with percpu_ref_is_dying() is the fact that it becomes 
> true after percpu_ref_exit() is called in css_free_rwork_fn() which 
> has an RCU delay. If you want to catch the fact that kill_css() has 
> been called, we can check the CSS_DYING flag which is set in 
> kill_css() by commit 33c35aa481786 ("cgroup: Prevent kill_css() from 
> being called more than once"). Will that be an acceptable alternative? 

Something like

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 881ce1470beb..851e54800ad8 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3140,6 +3140,9 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
                        if (!css)
                                continue;

+                       if (css->flags & CSS_DYING)
+                               continue;
+
                        WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));

                        if (css->parent &&
Cheers,
Longman



* Re: [PATCH 1/2] cgroup: Fix incorrect warning from cgroup_apply_control_disable()
@ 2021-09-13 18:45         ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2021-09-13 18:45 UTC (permalink / raw)
  To: Waiman Long; +Cc: Zefan Li, Johannes Weiner, Juri Lelli, cgroups, linux-kernel

On Mon, Sep 13, 2021 at 02:43:44PM -0400, Waiman Long wrote:
> > The problem with percpu_ref_is_dying() is the fact that it becomes true
> > after percpu_ref_exit() is called in css_free_rwork_fn() which has an
> > RCU delay. If you want to catch the fact that kill_css() has been
> > called, we can check the CSS_DYING flag which is set in kill_css() by
> > commit 33c35aa481786 ("cgroup: Prevent kill_css() from being called more
> > than once"). Will that be an acceptable alternative?
> 
> Something like
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 881ce1470beb..851e54800ad8 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3140,6 +3140,9 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
>                         if (!css)
>                                 continue;
> 
> +                       if (css->flags & CSS_DYING)
> +                               continue;
> +

So, I don't think this would be correct. It is assumed that there are no
dying csses when control reaches this point. The right fix is making sure
that remount path clears up dying csses before calling into this path.

Thanks.

-- 
tejun


* Re: [PATCH 1/2] cgroup: Fix incorrect warning from cgroup_apply_control_disable()
@ 2021-09-13 18:46           ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2021-09-13 18:46 UTC (permalink / raw)
  To: Tejun Heo, Waiman Long
  Cc: Zefan Li, Johannes Weiner, Juri Lelli, cgroups, linux-kernel

On 9/13/21 2:45 PM, Tejun Heo wrote:
> On Mon, Sep 13, 2021 at 02:43:44PM -0400, Waiman Long wrote:
>>> The problem with percpu_ref_is_dying() is the fact that it becomes true
>>> after percpu_ref_exit() is called in css_free_rwork_fn() which has an
>>> RCU delay. If you want to catch the fact that kill_css() has been
>>> called, we can check the CSS_DYING flag which is set in kill_css() by
>>> commit 33c35aa481786 ("cgroup: Prevent kill_css() from being called more
>>> than once"). Will that be an acceptable alternative?
>> Something like
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 881ce1470beb..851e54800ad8 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -3140,6 +3140,9 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
>>                         if (!css)
>>                                 continue;
>>
>> +                       if (css->flags & CSS_DYING)
>> +                               continue;
>> +
> So, I don't think this would be correct. It is assumed that there are no
> dying csses when control reaches this point. The right fix is making sure
> that remount path clears up dying csses before calling into this path.

Thanks for the suggestion. I will look into that.

Cheers,
Longman


