* [PATCH 0/2] cgroup_subsys_state lifecycle fixups
@ 2022-05-25 15:15 ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-05-25 15:15 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk

Two corner cases related to css lifecycles have been hanging around [2][3];
since they're loosely related, I'm sending them together.

The 2nd patch fixes problems that have so far been encountered only in syzbot tests.
Alternative solutions could be:
- daisy-chain css_release_work_fn from the offending css_killed_work_fn call,
- rework kill_css() not to rely on multi-stage css_killed_work_fn() [1].

The simplest approach was chosen.

The other existing users of percpu_ref_kill_and_confirm are not affected by
similar issues.

[1] The rough idea is to synchronize only via a completion, as e.g.
nvmet_sq_destroy() does, and to move most of css_killed_work_fn() to the end
of kill_css(). kill_css() is only used in process context, when de-configuring
controllers or rmdir'ing a cgroup.
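A rough pseudocode sketch of this alternative (untested; the killed_done
completion field is hypothetical and does not exist in the current code):

```c
/* Hypothetical restructuring per [1], kernel-style pseudocode only. */
static void css_killed_ref_fn(struct percpu_ref *ref)
{
	struct cgroup_subsys_state *css =
		container_of(ref, struct cgroup_subsys_state, refcnt);

	if (atomic_dec_and_test(&css->online_cnt))
		complete(&css->killed_done);	/* hypothetical field */
}

static void kill_css(struct cgroup_subsys_state *css)
{
	/* ... existing kill_css() body up to the confirm ... */
	init_completion(&css->killed_done);
	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
	/* Safe to sleep: kill_css() runs in process context only. */
	wait_for_completion(&css->killed_done);

	/* What css_killed_work_fn() used to do, now done inline. */
	offline_css(css);
	css_put(css);
}
```

This trades the multi-stage workqueue dance for a blocking wait, which is
only acceptable because kill_css() never runs in atomic context.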

[2] https://lore.kernel.org/lkml/20220404142535.145975-1-minhquangbui99@gmail.com/
[3] https://lore.kernel.org/lkml/20220412192459.227740-1-tadeusz.struk@linaro.org/

Michal Koutný (2):
  cgroup: Wait for cgroup_subsys_state offlining on unmount
  cgroup: Use separate work structs on css release path

 include/linux/cgroup-defs.h |  5 +++--
 kernel/cgroup/cgroup.c      | 19 +++++++++++--------
 2 files changed, 14 insertions(+), 10 deletions(-)

-- 
2.35.3


^ permalink raw reply	[flat|nested] 40+ messages in thread


* [PATCH 1/2] cgroup: Wait for cgroup_subsys_state offlining on unmount
@ 2022-05-25 15:15   ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-05-25 15:15 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk

The reported problem occurs when a cgroup hierarchy is unmounted quickly
after the last cgroup's removal. The last cgroup prevents the root
cgroup's css->refcnt from being killed, so the respective cgroup root
remains in existence permanently.

This is actually intended behavior for the memory controller, whose state is
long-lived and for which there is no better option to attach it later (see
also commit 3c606d35fe97 ("cgroup: prevent mount hang due to memory
controller lifetime")).

We can improve the situation by checking the children list only after any
cgroups in the middle of removal are gone, which is ensured by flushing
cgroup_destroy_wq.

Reported-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://lore.kernel.org/r/20220404142535.145975-1-minhquangbui99@gmail.com
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 kernel/cgroup/cgroup.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..a5b0d5d54fbc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2205,11 +2205,14 @@ static void cgroup_kill_sb(struct super_block *sb)
 	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
 
 	/*
-	 * If @root doesn't have any children, start killing it.
+	 * If @root doesn't have any children held by residual state (e.g.
+	 * the memory controller), start killing it. Flush the workqueue
+	 * first to filter out transiently offlined children.
 	 * This prevents new mounts by disabling percpu_ref_tryget_live().
 	 *
 	 * And don't kill the default root.
 	 */
+	flush_workqueue(cgroup_destroy_wq);
 	if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
 	    !percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
 		cgroup_bpf_offline(&root->cgrp);
-- 
2.35.3




* [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-05-25 15:15   ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-05-25 15:15 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk

The cgroup_subsys_state objects of cgroup subsystems (as opposed to
cgroup->self) use both kill and release callbacks on their release path
(see the comment for css_free_rwork_fn()).

When the last reference is also the base reference, we run into issues
when the still-active work_struct (1) is re-initialized from css_release() (2).

// ref=1: only base reference
kill_css()
  css_get() // fuse, ref+=1 == 2
  percpu_ref_kill_and_confirm
    // ref -= 1 == 1: kill base references
  [via rcu]
  css_killed_ref_fn == refcnt.confirm_switch
    queue_work(css->destroy_work)               (1)
    [via css->destroy_work]
    css_killed_work_fn == wq.func
      offline_css() // needs fuse
      css_put // ref -= 1 == 0: de-fuse, was last
        ...
        percpu_ref_put_many
           css_release
             queue_work(css->destroy_work)      (2)
             [via css->destroy_work]
             css_release_work_fn == wq.func

Although we take a fuse reference in css_killed_work_fn(), it only serves
to pin the css until after offline_css().

We could check inside css_release() whether destroy_work is still active
(WORK_STRUCT_PENDING_BIT) and daisy-chain css_release_work_fn() from
css_release(). To avoid clashes with the various stages of work item
processing, we instead spend some extra space in the css (my config's css
grows to 232B + 32B) and create a separate work entry for each user.

Reported-by: syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com
Reported-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Link: https://lore.kernel.org/r/20220412192459.227740-1-tadeusz.struk@linaro.org/
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 include/linux/cgroup-defs.h |  5 +++--
 kernel/cgroup/cgroup.c      | 14 +++++++-------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..16b99aa04305 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -178,8 +178,9 @@ struct cgroup_subsys_state {
 	 */
 	atomic_t online_cnt;
 
-	/* percpu_ref killing and RCU release */
-	struct work_struct destroy_work;
+	/* percpu_ref killing, css release, and RCU release work structs */
+	struct work_struct killed_ref_work;
+	struct work_struct release_work;
 	struct rcu_work destroy_rwork;
 
 	/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a5b0d5d54fbc..33b3a44391d7 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5102,7 +5102,7 @@ static struct cftype cgroup_base_files[] = {
  *    css_free_work_fn().
  *
  * It is actually hairier because both step 2 and 4 require process context
- * and thus involve punting to css->destroy_work adding two additional
+ * and thus involve punting to css->release_work adding two additional
  * steps to the already complex sequence.
  */
 static void css_free_rwork_fn(struct work_struct *work)
@@ -5157,7 +5157,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 static void css_release_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, release_work);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
 
@@ -5213,8 +5213,8 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	INIT_WORK(&css->release_work, css_release_work_fn);
+	queue_work(cgroup_destroy_wq, &css->release_work);
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5549,7 +5549,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 static void css_killed_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, killed_ref_work);
 
 	mutex_lock(&cgroup_mutex);
 
@@ -5570,8 +5570,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
-		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_destroy_wq, &css->destroy_work);
+		INIT_WORK(&css->killed_ref_work, css_killed_work_fn);
+		queue_work(cgroup_destroy_wq, &css->killed_ref_work);
 	}
 }
 
-- 
2.35.3




* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-05-25 16:14     ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-05-25 16:14 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk

On Wed, May 25, 2022 at 05:15:17PM +0200, Michal Koutný <mkoutny@suse.com> wrote:
> // ref=1: only base reference
> kill_css()
>   css_get() // fuse, ref+=1 == 2
>   percpu_ref_kill_and_confirm
>     // ref -= 1 == 1: kill base references
>   [via rcu]
>   css_killed_ref_fn == refcnt.confirm_switch
>     queue_work(css->destroy_work)               (1)
>     [via css->destroy_work]
>     css_killed_work_fn == wq.func
>       offline_css() // needs fuse
>       css_put // ref -= 1 == 0: de-fuse, was last
>         ...
>         percpu_ref_put_many
>            css_release
>              queue_work(css->destroy_work)      (2)
>              [via css->destroy_work]
>              css_release_work_fn == wq.func

Apologies, this explanation is wrong. (I thought it explained why
Tadeusz's patch with the double get/put didn't fix the issue, i.e. no
number of extra references would help if the explanation above were right.)

But the above is not correct. I've looked at the stack trace [1] and the
offending percpu_ref_put_many is called from an RCU callback
percpu_ref_switch_to_atomic_rcu(), so I can't actually see why it drops
to zero there...

Regards,
Michal



* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-05-26  9:56       ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-05-26  9:56 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk

On Wed, May 25, 2022 at 06:14:55PM +0200, Michal Koutný <mkoutny@suse.com> wrote:
> But the above is not correct. I've looked at the stack trace [1] and the
> offending percpu_ref_put_many is called from an RCU callback
> percpu_ref_switch_to_atomic_rcu(), so I can't actually see why it drops
> to zero there...

The link [1] in my previous message should have been the [1] below.
After some more thought, the following is a possible sequencing of the
involved functions.

// ref=A: initial state
kill_css()
  css_get // ref+=F == A+F: fuse
  percpu_ref_kill_and_confirm
    __percpu_ref_switch_to_atomic
      percpu_ref_get
        // ref += 1 == A+F+1: atomic mode, self-protection
    percpu_ref_put
      // ref -= 1 == A+F: kill the base reference
  [via rcu]
  percpu_ref_switch_to_atomic_rcu
    percpu_ref_call_confirm_rcu
      css_killed_ref_fn == refcnt.confirm_switch
        queue_work(css->destroy_work)        (1)
                                                     [via css->destroy_work]
                                                     css_killed_work_fn == wq.func
                                                       offline_css() // needs fuse
                                                       css_put // ref -= F == A: de-fuse
      percpu_ref_put
        // ref -= 1 == A-1: remove self-protection
        css_release                                   // A <= 1 -> 2nd queue_work explodes!
          queue_work(css->destroy_work)      (2)
          [via css->destroy_work]
          css_release_work_fn == wq.func

Another CPU would have to dispatch and run the css_killed_work_fn()
callback in parallel with percpu_ref_switch_to_atomic_rcu(). This is a
more accurate explanation; however, its likelihood does seem very low.
Perhaps some debug prints of percpu_ref_data.data in
percpu_ref_call_confirm_rcu() could shed more light on this [2].

HTH,
Michal

[1] https://syzkaller.appspot.com/text?tag=CrashReport&x=162b5781f00000
[2] I tried notifying syzbot about [3] moments ago.
[3] https://github.com/Werkov/linux/tree/cgroup-ml/css-lifecycle-syzbot




* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
  2022-05-26  9:56       ` Michal Koutný
@ 2022-05-26 18:15         ` Tejun Heo
  -1 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-05-26 18:15 UTC (permalink / raw)
  To: Michal Koutný
  Cc: cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh,
	Tadeusz Struk

Hello, Michal.

On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
> // ref=A: initial state
> kill_css()
>   css_get // ref+=F == A+F: fuse
>   percpu_ref_kill_and_confirm
>     __percpu_ref_switch_to_atomic
>       percpu_ref_get
>         // ref += 1 == A+F+1: atomic mode, self-protection
>     percpu_ref_put
>       // ref -= 1 == A+F: kill the base reference
>   [via rcu]
>   percpu_ref_switch_to_atomic_rcu
>     percpu_ref_call_confirm_rcu
>       css_killed_ref_fn == refcnt.confirm_switch
>         queue_work(css->destroy_work)        (1)
>                                                      [via css->destroy_work]
>                                                      css_killed_work_fn == wq.func
>                                                        offline_css() // needs fuse
>                                                        css_put // ref -= F == A: de-fuse
>       percpu_ref_put
>         // ref -= 1 == A-1: remove self-protection
>         css_release                                   // A <= 1 -> 2nd queue_work explodes!

I'm not sure I'm following it but it's perfectly fine to re-use the work
item at this point. The work item actually can be re-cycled from the very
beginning of the work function. The only thing we need to make sure is that
we don't css_put() prematurely to avoid it being freed while we're using it.

For the sharing to be a problem, we should be queueing the release work item
while the destroy instance is still pending, and if that is the case, it
doesn't really matter whether we use two separate work items or not. We're
already broken and would just be shifting the problem to explode elsewhere.

The only possibility that I can think of is that somehow we're ending up
with an extra css_put() somewhere thus triggering the release path
prematurely. If that's the case, we'll prolly need to trace get/puts to find
out who's causing the ref imbalance.

Thanks.

-- 
tejun



* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-05-27 16:39           ` Tadeusz Struk
  0 siblings, 0 replies; 40+ messages in thread
From: Tadeusz Struk @ 2022-05-27 16:39 UTC (permalink / raw)
  To: Tejun Heo, Michal Koutný
  Cc: cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On 5/26/22 11:15, Tejun Heo wrote:
> Hello, Michal.
> 
> On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
>> // ref=A: initial state
>> kill_css()
>>    css_get // ref+=F == A+F: fuse
>>    percpu_ref_kill_and_confirm
>>      __percpu_ref_switch_to_atomic
>>        percpu_ref_get
>>          // ref += 1 == A+F+1: atomic mode, self-protection
>>      percpu_ref_put
>>        // ref -= 1 == A+F: kill the base reference
>>    [via rcu]
>>    percpu_ref_switch_to_atomic_rcu
>>      percpu_ref_call_confirm_rcu
>>        css_killed_ref_fn == refcnt.confirm_switch
>>          queue_work(css->destroy_work)        (1)
>>                                                       [via css->destroy_work]
>>                                                       css_killed_work_fn == wq.func
>>                                                         offline_css() // needs fuse
>>                                                         css_put // ref -= F == A: de-fuse
>>        percpu_ref_put
>>          // ref -= 1 == A-1: remove self-protection
>>          css_release                                   // A <= 1 -> 2nd queue_work explodes!
> 
> I'm not sure I'm following it but it's perfectly fine to re-use the work
> item at this point. The work item actually can be re-cycled from the very
> beginning of the work function. The only thing we need to make sure is that
> we don't css_put() prematurely to avoid it being freed while we're using it.
> 
> For the sharing to be a problem, we should be queueing the release work item
> while the destroy instance is still pending, and if that is the case, it
> doesn't really matter whether we use two separate work items or not. We're
> already broken and would just be shifting the problem to explode elsewhere.
> 
> The only possibility that I can think of is that somehow we're ending up
> with an extra css_put() somewhere thus triggering the release path
> prematurely. If that's the case, we'll prolly need to trace get/puts to find
> out who's causing the ref imbalance.

Hi Michal,
As far as I can see we are trying to test the same thing suggested by Tejun.
I just sent a test request to try this:
https://github.com/tstruk/linux/commit/master

Let me know if you have any more tests to run and I will hold off until
you are done.

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-05-27 16:54             ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-05-27 16:54 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: Tejun Heo, cgroups, linux-kernel, Zefan Li, Johannes Weiner,
	Bui Quang Minh

Hello Tadeusz.

On Fri, May 27, 2022 at 09:39:20AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> As far as I can see we are trying to test the same thing suggested by Tejun.
> I just sent a test request to try this:
> https://github.com/tstruk/linux/commit/master

Yup, I've added a few more prints to get more fine-grained resolution.
Also, I decided to use ftrace printk not to interfere with timing too
much (due to the original race hypothesis).

> Let me know if you have any more tests to run and I will hold off until
> you are done.

My latest attempt is [1] (tip 5500e05d82fd5b5db2203eedb3f786857d3ccbea).

So far, I'm not convinced I extract the complete ftrace buffer from the
syzbot runs, so I'm not drawing any conclusions from the traces I've
got. I'm not going to continue today. You may have more luck with your
plain printk (if it's just an imbalance and it avoids printk-locking
sensitive paths).

HTH,
Michal

[1] https://github.com/Werkov/linux/tree/cgroup-ml/css-lifecycle-b2

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-05-27 17:23               ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-05-27 17:23 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tadeusz Struk, cgroups, linux-kernel, Zefan Li, Johannes Weiner,
	Bui Quang Minh

On Fri, May 27, 2022 at 06:54:29PM +0200, Michal Koutný wrote:
> Hello Tadeusz.
> 
> On Fri, May 27, 2022 at 09:39:20AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> > As far as I can see we are trying to test the same thing suggested by Tejun.
> > I just sent a test request to try this:
> > https://github.com/tstruk/linux/commit/master
> 
> Yup, I've added a few more prints to get more fine-grained resolution.
> Also, I decided to use ftrace printk not to interfere with timing too
> much (due to the original race hypothesis).
> 
> > Let me know if you have any more tests to run and I will hold off until
> > you are done.
> 
> My latest attempt is [1] (tip 5500e05d82fd5b5db2203eedb3f786857d3ccbea).
> 
> So far, I'm not convinced I extract the complete ftrace buffer from the
> syzbot runs, so I'm not drawing any conclusions from the traces I've
> got. I'm not going to continue today. You may have more luck with your
> plain printk (if it's just an imbalance and it avoids printk-locking
> sensitive paths).

At least for triaging, it may be easier to use bpftrace to attach kprobes to
the get/put functions with the right filter and record the kstack counts.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-06-01 23:13           ` Tadeusz Struk
  0 siblings, 0 replies; 40+ messages in thread
From: Tadeusz Struk @ 2022-06-01 23:13 UTC (permalink / raw)
  To: Tejun Heo, Michal Koutný
  Cc: cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On 5/26/22 11:15, Tejun Heo wrote:
> Hello, Michal.
> 
> On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
>> // ref=A: initial state
>> kill_css()
>>    css_get // ref+=F == A+F: fuse
>>    percpu_ref_kill_and_confirm
>>      __percpu_ref_switch_to_atomic
>>        percpu_ref_get
>>          // ref += 1 == A+F+1: atomic mode, self-protection
>>      percpu_ref_put
>>        // ref -= 1 == A+F: kill the base reference
>>    [via rcu]
>>    percpu_ref_switch_to_atomic_rcu
>>      percpu_ref_call_confirm_rcu
>>        css_killed_ref_fn == refcnt.confirm_switch
>>          queue_work(css->destroy_work)        (1)
>>                                                       [via css->destroy_work]
>>                                                       css_killed_work_fn == wq.func
>>                                                         offline_css() // needs fuse
>>                                                         css_put // ref -= F == A: de-fuse
>>        percpu_ref_put
>>          // ref -= 1 == A-1: remove self-protection
>>          css_release                                   // A <= 1 -> 2nd queue_work explodes!
> 
> I'm not sure I'm following it but it's perfectly fine to re-use the work
> item at this point. The work item actually can be re-cycled from the very
> beginning of the work function. The only thing we need to make sure is that
> we don't css_put() prematurely to avoid it being freed while we're using it.

Yes, it is ok to reuse a work struct, but it's not ok to have the same
work struct enqueued twice on the same WQ when list debug is enabled.
That's why we are getting this "BUG: corrupted list.."

> For the sharing to be a problem, we should be queueing the release work item
> while the destroy instance is still pending, and if that is the case, it
> doesn't really matter whether we use two separate work items or not. We're
> already broken and would just be shifting the problem to explode elsewhere.
> 
> The only possibility that I can think of is that somehow we're ending up
> with an extra css_put() somewhere thus triggering the release path
> prematurely. If that's the case, we'll prolly need to trace get/puts to find
> out who's causing the ref imbalance.

That's right. Michal was on the right track for the kill_css() part.
What I think is going on is that once css_create() fails,
cgroup_subtree_control_write() ends up calling first kill_css() and
then css_put() on the same css; I think it's the &cgrp->self of the
kernfs_node. The each_live_descendant_post() iteration also covers the root.
Here is the call flow (sorry for long lines):

cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error
   |
   |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called
   |
   |  css_killed_ref_fn() <- first css->destroy_work enqueue
   |    |
   |    |->  INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work);
   |
   |
   |-> goto out_unlock;
   |     |
   |     |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called
   |
   |
      css_release(percpu_ref) <- second css->destroy_work enqueue
        |
        |->  INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption.


What seems to work for me as the simplest fix is to prevent enqueuing a dying
css in css_release() as below. Please let me know if that makes sense to you.

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..5618211487cc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref)
  	struct cgroup_subsys_state *css =
  		container_of(ref, struct cgroup_subsys_state, refcnt);
  
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	if (!(css->flags & CSS_DYING)) {
+		INIT_WORK(&css->destroy_work, css_release_work_fn);
+		queue_work(cgroup_destroy_wq, &css->destroy_work);
+	}
  }
  
  static void init_and_link_css(struct cgroup_subsys_state *css,

-- 
Thanks,
Tadeusz

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-06-01 23:20             ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-06-01 23:20 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: Michal Koutný,
	cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On Wed, Jun 01, 2022 at 04:13:32PM -0700, Tadeusz Struk wrote:
> > On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
> > > // ref=A: initial state
> > > kill_css()
> > >    css_get // ref+=F == A+F: fuse
> > >    percpu_ref_kill_and_confirm
> > >      __percpu_ref_switch_to_atomic
> > >        percpu_ref_get
> > >          // ref += 1 == A+F+1: atomic mode, self-protection
> > >      percpu_ref_put
> > >        // ref -= 1 == A+F: kill the base reference
> > >    [via rcu]
> > >    percpu_ref_switch_to_atomic_rcu
> > >      percpu_ref_call_confirm_rcu
> > >        css_killed_ref_fn == refcnt.confirm_switch
> > >          queue_work(css->destroy_work)        (1)
> > >                                                       [via css->destroy_work]
> > >                                                       css_killed_work_fn == wq.func
> > >                                                         offline_css() // needs fuse
> > >                                                         css_put // ref -= F == A: de-fuse
> > >        percpu_ref_put
> > >          // ref -= 1 == A-1: remove self-protection
> > >          css_release                                   // A <= 1 -> 2nd queue_work explodes!
> > 
> > I'm not sure I'm following it but it's perfectly fine to re-use the work
> > item at this point. The work item actually can be re-cycled from the very
> > beginning of the work function. The only thing we need to make sure is that
> > we don't css_put() prematurely to avoid it being freed while we're using it.
> 
> Yes, it is ok to reuse a work struct, but it's not ok to have the same
> work struct enqueued twice on the same WQ when list debug is enabled.
> That's why we are getting this "BUG: corrupted list.."

The above scenario isn't that tho. Once the work item starts executing, wq
doesn't care about what happens to it and as killed_work_fn is holding a
reference, the release scheduling shouldn't happen before it starts
executing unless somebody is screwing up the refcnting.

> That's right. Michal was on the right track for the kill_css() part.
> What I think is going on is that once css_create() fails then
> cgroup_subtree_control_write() ends up calling first kill_css() and
> then css_put() on the same css, I think it's &cgrp->self of the kernfs_node.
> The each_live_descendant_post() also iterates on the root.
> Here is the call flow (sorry for long lines):
> 
> cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error
>   |
>   |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called
>   |
>   |  css_killed_ref_fn() <- first css->destroy_work enqueue
>   |    |
>   |    |->  INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work);
>   |
>   |
>   |-> goto out_unlock;
>   |     |
>   |     |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called
>   |
>   |
>      css_release(percpu_ref) <- second css->destroy_work enqueue
>        |
>        |->  INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption.
> 
> 
> What seems to work for me as the simplest fix is to prevent enqueuing a dying
> css in css_release() as below. Please let me know if that makes sense to you.
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1779ccddb734..5618211487cc 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref)
>  	struct cgroup_subsys_state *css =
>  		container_of(ref, struct cgroup_subsys_state, refcnt);
> -	INIT_WORK(&css->destroy_work, css_release_work_fn);
> -	queue_work(cgroup_destroy_wq, &css->destroy_work);
> +	if (!(css->flags & CSS_DYING)) {
> +		INIT_WORK(&css->destroy_work, css_release_work_fn);
> +		queue_work(cgroup_destroy_wq, &css->destroy_work);
> +	}

When the problem is ref imbalance, how can the above be the solution? Of course
the release path won't cause an issue if it doesn't run, but we still need to
free the thing, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-06-01 23:37               ` Tadeusz Struk
  0 siblings, 0 replies; 40+ messages in thread
From: Tadeusz Struk @ 2022-06-01 23:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On 6/1/22 16:20, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 04:13:32PM -0700, Tadeusz Struk wrote:
>>> On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
>>>> // ref=A: initial state
>>>> kill_css()
>>>>     css_get // ref+=F == A+F: fuse
>>>>     percpu_ref_kill_and_confirm
>>>>       __percpu_ref_switch_to_atomic
>>>>         percpu_ref_get
>>>>           // ref += 1 == A+F+1: atomic mode, self-protection
>>>>       percpu_ref_put
>>>>         // ref -= 1 == A+F: kill the base reference
>>>>     [via rcu]
>>>>     percpu_ref_switch_to_atomic_rcu
>>>>       percpu_ref_call_confirm_rcu
>>>>         css_killed_ref_fn == refcnt.confirm_switch
>>>>           queue_work(css->destroy_work)        (1)
>>>>                                                        [via css->destroy_work]
>>>>                                                        css_killed_work_fn == wq.func
>>>>                                                          offline_css() // needs fuse
>>>>                                                          css_put // ref -= F == A: de-fuse
>>>>         percpu_ref_put
>>>>           // ref -= 1 == A-1: remove self-protection
>>>>           css_release                                   // A <= 1 -> 2nd queue_work explodes!
>>>
>>> I'm not sure I'm following it but it's perfectly fine to re-use the work
>>> item at this point. The work item actually can be re-cycled from the very
>>> beginning of the work function. The only thing we need to make sure is that
>>> we don't css_put() prematurely to avoid it being freed while we're using it.
>>
>> Yes, it is ok to reuse a work struct, but it's not ok to have the same
>> work struct enqueued twice on the same WQ when list debug is enabled.
>> That's why we are getting this "BUG: corrupted list.."
> 
> The above scenario isn't that tho. Once the work item starts executing, wq
> doesn't care about what happens to it and as killed_work_fn is holding a
> reference, the release scheduling shouldn't happen before it starts
> executing unless somebody is screwing up the refcnting.
> 
>> That's right. Michal was on the right track for the kill_css() part.
>> What I think is going on is that once css_create() fails then
>> cgroup_subtree_control_write() ends up calling first kill_css() and
>> then css_put() on the same css, I think it's &cgrp->self of the kernfs_node.
>> The each_live_descendant_post() also iterates on the root.
>> Here is the call flow (sorry for long lines):
>>
>> cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error
>>    |
>>    |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called
>>    |
>>    |  css_killed_ref_fn() <- first css->destroy_work enqueue
>>    |    |
>>    |    |->  INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work);
>>    |
>>    |
>>    |-> goto out_unlock;
>>    |     |
>>    |     |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called
>>    |
>>    |
>>       css_release(percpu_ref) <- second css->destroy_work enqueue
>>         |
>>         |->  INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption.
>>
>>
>> What seems to work for me as the simplest fix is to prevent enqueuing a dying
>> css in css_release() as below. Please let me know if that makes sense to you.
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 1779ccddb734..5618211487cc 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref)
>>   	struct cgroup_subsys_state *css =
>>   		container_of(ref, struct cgroup_subsys_state, refcnt);
>> -	INIT_WORK(&css->destroy_work, css_release_work_fn);
>> -	queue_work(cgroup_destroy_wq, &css->destroy_work);
>> +	if (!(css->flags & CSS_DYING)) {
>> +		INIT_WORK(&css->destroy_work, css_release_work_fn);
>> +		queue_work(cgroup_destroy_wq, &css->destroy_work);
>> +	}
> 
> When the problem is ref imbalance, how can above be the solution? Of course
> release path won't cause an issue if they don't run, but we still need to
> free the thing, right?

Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn)
doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called
on it. Only css_get() & css_put() modify the refcnt value.
And for the "free the thing" the css_killed_work_fn() does that.
It calls offline_css(css) and css_put(css) for the whole css hierarchy.

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-06-01 23:43                 ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-06-01 23:43 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: Michal Koutný,
	cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On Wed, Jun 01, 2022 at 04:37:17PM -0700, Tadeusz Struk wrote:
> Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn)
> doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called

Yeah, the base ref is special for percpu_ref.

> on it. Only css_get() & css_put() modify the refcnt value.
> And for the "free the thing" the css_killed_work_fn() does that.
> It calls offline_css(css) and css_put(css) for the whole css hierarchy.

Yeah, the freeing path depends on the css_put(css) invoking css_release()
which schedules the work item which actually frees. Am I misunderstanding
something here?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-06-02  0:00                   ` Tadeusz Struk
  0 siblings, 0 replies; 40+ messages in thread
From: Tadeusz Struk @ 2022-06-02  0:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On 6/1/22 16:43, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 04:37:17PM -0700, Tadeusz Struk wrote:
>> Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn)
>> doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called
> 
> Yeah, the base ref is special for percpu_ref.
> 
>> on it. Only css_get() & css_put() modify the refcnt value.
>> And for the "free the thing" the css_killed_work_fn() does that.
>> It calls offline_css(css) and css_put(css) for the whole css hierarchy.
> 
> Yeah, the freeing path depends on the css_put(css) invoking css_release()
> which schedules the work item which actually frees. Am I misunderstanding
> something here?

What I'm trying to say is that it's not really a ref imbalance problem.
I think once the kill_css() has been called on a css, and it is enqueued to be
killed, we shouldn't call css_put(css) on it anymore outside of the "killed call
flow path". It will all be handled by the css_killed_ref_fn() function.

The fact that css_release() is called (via cgroup_kn_unlock()) just after
kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
(cgroup_destroy_wq), just with a different function. This results in the
"BUG: corrupted list in insert_work" issue.

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-06-02  0:07                     ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-06-02  0:07 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: Michal Koutný,
	cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

Hello,

On Wed, Jun 01, 2022 at 05:00:44PM -0700, Tadeusz Struk wrote:
> What I'm trying to say is that it's not really a ref imbalance problem.
> I think once the kill_css() has been called on a css, and it is enqueued to be
> killed, we shouldn't call css_put(css) on it anymore outside of the "killed call
> flow path". It will all be handled by the css_killed_ref_fn() function.
> 
> The fact the css_release() is called (via cgroup_kn_unlock()) just after
> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
> (cgroup_destroy_wq), just with different function. This results in the
> BUG: corrupted list in insert_work issue.

I have a hard time following here. The kill / release paths relationship
isn't that complicated. The kill path is invoked when the percpu_ref's base
ref is killed and holds an extra ref so that it's guaranteed that release
can't happen before the kill path is done with the css. When the final put
happens - whether that's from the kill path or someone else, which often is
the case - the release path triggers. If we have release getting scheduled
while the kill path isn't finished, it is a reference counting problem,
right?

Can you elaborate the exact scenario that you think is happening? Please
feel free to omit the function calls and all that. Just lay out who's doing
what.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-06-02  0:26                       ` Tadeusz Struk
  0 siblings, 0 replies; 40+ messages in thread
From: Tadeusz Struk @ 2022-06-02  0:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On 6/1/22 17:07, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jun 01, 2022 at 05:00:44PM -0700, Tadeusz Struk wrote:
>> What I'm trying to say is that it's not really a ref imbalance problem.
>> I think once the kill_css() has been called on a css, and it is enqueued to be
>> killed, we shouldn't call css_put(css) on it anymore outside of the "killed call
>> flow path". It will all be handled by the css_killed_ref_fn() function.
>>
>> The fact the css_release() is called (via cgroup_kn_unlock()) just after
>> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
>> (cgroup_destroy_wq), just with different function. This results in the
>> BUG: corrupted list in insert_work issue.
> 
> I have a hard time following here. The kill / release paths relationship
> isn't that complicated. The kill path is invoked when the percpu_ref's base
> ref is killed and holds an extra ref so that it's guaranteed that release
> can't happen before the kill path is done with the css. When the final put
> happens - whether that's from the kill path or someone else, which often is
> the case - the release path triggers. If we have release getting scheduled
> while the kill path isn't finished, it is a reference counting problem,
> right?
> 
> Can you elaborate the exact scenario that you think is happening? Please
> feel free to omit the function calls and all that. Just lay out who's doing
> what.

Ok the problem is that

1. kill_css() triggers css_killed_ref_fn(), which enqueues &css->destroy_work on cgroup_destroy_wq
2. Last put_css() calls css_release(), which enqueues &css->destroy_work on cgroup_destroy_wq

We have two instances of the same work struct enqueued on the same WQ (cgroup_destroy_wq),
which causes "BUG: corrupted list in insert_work"

So I think the easiest way to solve this would be to have two separate work_structs,
one for the killed_ref path and css_release path as in:

https://lore.kernel.org/all/20220523212724.233314-1-tadeusz.struk@linaro.org/

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-06-02  0:29                         ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-06-02  0:29 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: Michal Koutný,
	cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On Wed, Jun 01, 2022 at 05:26:34PM -0700, Tadeusz Struk wrote:
> Ok the problem is that
> 
> 1. kill_css() triggers css_killed_ref_fn(), which enqueues &css->destroy_work on cgroup_destroy_wq
> 2. Last put_css() calls css_release(), which enqueues &css->destroy_work on cgroup_destroy_wq
> 
> We have two instances of the same work struct enqueued on the same WQ (cgroup_destroy_wq),
> which causes "BUG: corrupted list in insert_work"

#2 shouldn't be happening before kill_ref_fn() is done with the css. If what
you're saying is happening, what's broken is the fact that the refcnt is
reaching 0 prematurely.

> So I think the easiest way to solve this would be to have two separate work_structs,
> one for the killed_ref path and css_release path as in:

If you do that, you'd just be racing the free path against the kill path and
the css might get freed while the kill path is still accessing it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-06-02  0:40                           ` Tadeusz Struk
  0 siblings, 0 replies; 40+ messages in thread
From: Tadeusz Struk @ 2022-06-02  0:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On 6/1/22 17:29, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 05:26:34PM -0700, Tadeusz Struk wrote:
>> Ok the problem is that
>>
>> 1. kill_css() triggers css_killed_ref_fn(), which enqueues &css->destroy_work on cgroup_destroy_wq
>> 2. Last put_css() calls css_release(), which enqueues &css->destroy_work on cgroup_destroy_wq
>>
>> We have two instances of the same work struct enqueued on the same WQ (cgroup_destroy_wq),
>> which causes "BUG: corrupted list in insert_work"
> 
> #2 shouldn't be happening before kill_ref_fn() is done with the css. If what
> you're saying is happening, what's broken is the fact that the refcnt is
> reaching 0 prematurely.

css_killed_ref_fn() will be called regardless of the value of refcnt (via percpu_ref_kill_and_confirm())
and it will only enqueue the css_killed_work_fn() to be called later.
Then css_put()->css_release() will be called before the css_killed_work_fn() will even
get a chance to run, and it will also *only* enqueue css_release_work_fn() to be called later.
The problem happens on the second enqueue. So there needs to be something in place that
will make sure that css_killed_work_fn() is done before css_release() can enqueue
the second job. Does it sound right?
  
>> So I think the easiest way to solve this would be to have two separate work_structs,
>> one for the killed_ref path and css_release path as in:
> 
> If you do that, you'd just be racing the free path against the kill path and
> the css might get freed while the kill path is still accessing it.
> 
> Thanks.
> 


-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-06-02 11:47                             ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-06-02 11:47 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: Tejun Heo, cgroups, linux-kernel, Zefan Li, Johannes Weiner,
	Bui Quang Minh

On Wed, Jun 01, 2022 at 05:40:51PM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> css_killed_ref_fn() will be called regardless of the value of refcnt (via percpu_ref_kill_and_confirm())
> and it will only enqueue the css_killed_work_fn() to be called later.
> Then css_put()->css_release() will be called before the css_killed_work_fn() will even
> get a chance to run, and it will also *only* enqueue css_release_work_fn() to be called later.
> The problem happens on the second enqueue. So there needs to be something in place that
> will make sure that css_killed_work_fn() is done before css_release() can enqueue
> the second job.

IIUC, here you describe the same scenario I broke down at [1].

> Does it sound right?

I added a parameter A there (that is sum of base and percpu references
before kill_css()).
I thought it fails because A == 1 (i.e. killing the base reference),
however, that seems an unlikely situation (because cgroup code uses a
"fuse" reference to pin css for offline_css()).

So the remaining option (at least I find it more likely now) is that
A == 0 (A < 0 would trigger the warning in
percpu_ref_switch_to_atomic_rcu()), aka the ref imbalance. I hope we can
get to the bottom of this with detailed enough tracing of gets/puts.

Splitting the work struct is contradictory to the existing approach with
the "fuse" reference.

(BTW you also wrote On Wed, Jun 01, 2022 at 05:00:44PM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> The fact the css_release() is called (via cgroup_kn_unlock()) just after
> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
> (cgroup_destroy_wq), just with different function. This results in the
> BUG: corrupted list in insert_work issue.

Where do you see a critical css_release called from cgroup_kn_unlock()?
I always observed the css_release() being called via
percpu_ref_call_confirm_rcu() (in the original and subsequent syzbot
logs.))

Thanks,
Michal

[1] https://lore.kernel.org/r/Yo7KfEOz92kS2z5Y@blackbook/

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-06-02 14:28                               ` Tadeusz Struk
  0 siblings, 0 replies; 40+ messages in thread
From: Tadeusz Struk @ 2022-06-02 14:28 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, cgroups, linux-kernel, Zefan Li, Johannes Weiner,
	Bui Quang Minh

On 6/2/22 04:47, Michal Koutný wrote:
> On Wed, Jun 01, 2022 at 05:40:51PM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
>> css_killed_ref_fn() will be called regardless of the value of refcnt (via percpu_ref_kill_and_confirm())
>> and it will only enqueue the css_killed_work_fn() to be called later.
>> Then css_put()->css_release() will be called before the css_killed_work_fn() will even
>> get a chance to run, and it will also *only* enqueue css_release_work_fn() to be called later.
>> The problem happens on the second enqueue. So there needs to be something in place that
>> will make sure that css_killed_work_fn() is done before css_release() can enqueue
>> the second job.
> IIUC, here you describe the same scenario I broke down at [1].

Right, except for the last css_put(), which I think is called from cgroup_kn_unlock().
See below.

>> Does it sound right?
> I added a parameter A there (that is sum of base and percpu references
> before kill_css()).
> I thought it fails because A == 1 (i.e. killing the base reference),
> however, that seems an unlikely situation (because cgroup code uses a
> "fuse" reference to pin css for offline_css()).
> 
> So the remaining option (at least I find it more likely now) is that
> A == 0 (A < 0 would trigger the warning in
> percpu_ref_switch_to_atomic_rcu()), aka the ref imbalance. I hope we can
> get to the bottom of this with detailed enough tracing of gets/puts.
> 
> Splitting the work struct contradicts the existing approach with
> the "fuse" reference.
> 
> (BTW, you also wrote, on Wed, Jun 01, 2022 at 05:00:44PM -0700:
>> The fact the css_release() is called (via cgroup_kn_unlock()) just after
>> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
>> (cgroup_destroy_wq), just with different function. This results in the
>> BUG: corrupted list in insert_work issue.
> Where do you see a critical css_release called from cgroup_kn_unlock()?
> I always observed the css_release() being called via
> percpu_ref_call_confirm_rcu() (in the original and subsequent syzbot

it goes like this:
cgroup_kn_unlock(kn)->cgroup_put(cgrp)->css_put(&cgrp->self), which
brings the refcnt to zero and triggers css_release().
I think what's missing is something that will serialize the kill
and release paths. I will try to put something together today.
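One possible shape of that serialization, modeled in userspace Python with a completion-like Event (the cover letter's footnote [1] suggests the same idea via a struct completion, as nvmet_sq_destroy() does; all names here are illustrative, not the kernel API):

```python
import threading

killed_done = threading.Event()     # plays the role of a struct completion
log = []

def css_killed_work():              # the offline half of destruction
    log.append("killed")
    killed_done.set()               # complete() once offlining is done

def css_release():                  # the final-put half
    killed_done.wait()              # wait until the kill work has finished
    log.append("release")           # only then reuse the work machinery

t = threading.Thread(target=css_release)
t.start()                           # release path may race ahead...
css_killed_work()                   # ...but blocks until the kill is done
t.join()
assert log == ["killed", "release"]
```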

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2022-06-02 14:28 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-25 15:15 [PATCH 0/2] cgroup_subsys_state lifecycle fixups Michal Koutný
2022-05-25 15:15 ` [PATCH 1/2] cgroup: Wait for cgroup_subsys_state offlining on unmount Michal Koutný
2022-05-25 15:15 ` [PATCH 2/2] cgroup: Use separate work structs on css release path Michal Koutný
2022-05-25 16:14   ` Michal Koutný
2022-05-26  9:56     ` Michal Koutný
2022-05-26 18:15       ` Tejun Heo
2022-05-27 16:39         ` Tadeusz Struk
2022-05-27 16:54           ` Michal Koutný
2022-05-27 17:23             ` Tejun Heo
2022-06-01 23:13         ` Tadeusz Struk
2022-06-01 23:20           ` Tejun Heo
2022-06-01 23:37             ` Tadeusz Struk
2022-06-01 23:43               ` Tejun Heo
2022-06-02  0:00                 ` Tadeusz Struk
2022-06-02  0:07                   ` Tejun Heo
2022-06-02  0:26                     ` Tadeusz Struk
2022-06-02  0:29                       ` Tejun Heo
2022-06-02  0:40                         ` Tadeusz Struk
2022-06-02 11:47                           ` Michal Koutný
2022-06-02 14:28                             ` Tadeusz Struk
