* [PATCH 0/2] cgroup_subsys_state lifecycle fixups
@ 2022-05-25 15:15 ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-05-25 15:15 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk

Two corner cases related to css lifecycles were hanging around [2][3]; since
they're loosely related, I'm sending them together. The 2nd patch fixes
problems encountered in syzbot tests only.

Alternative solutions could be:
- daisy-chain css_release_work_fn from the offending css_killed_work_fn call,
- rework kill_css() not to rely on the multi-stage css_killed_work_fn() [1].

The simplest approach was chosen. The other existing users of
percpu_ref_kill_and_confirm are not affected by similar issues.

[1] The rough idea is to synchronize only via a completion, like e.g.
    nvmet_sq_destroy() does, and move most of css_killed_work_fn() to the
    end of kill_css(). kill_css() is only used in process context, when
    de-configuring controllers or rmdir'ing a cgroup.
[2] https://lore.kernel.org/lkml/20220404142535.145975-1-minhquangbui99@gmail.com/
[3] https://lore.kernel.org/lkml/20220412192459.227740-1-tadeusz.struk@linaro.org/

Michal Koutný (2):
  cgroup: Wait for cgroup_subsys_state offlining on unmount
  cgroup: Use separate work structs on css release path

 include/linux/cgroup-defs.h |  5 +++--
 kernel/cgroup/cgroup.c      | 19 +++++++++++--------
 2 files changed, 14 insertions(+), 10 deletions(-)

-- 
2.35.3

^ permalink raw reply	[flat|nested] 40+ messages in thread
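For reference, the completion-based alternative in [1] could take roughly the following shape. This is kernel-style pseudocode, not a tested implementation: the `destroy_done` field is an invented addition to cgroup_subsys_state, and the sketch ignores the `online_cnt` cascade and locking details that the real css_killed_work_fn() handles.

```
/* Hypothetical shape of alternative [1]: signal a completion from the
 * confirm callback, the way nvmet_sq_destroy() waits on sq->confirm_done,
 * and run the former css_killed_work_fn() body from process context in
 * kill_css() itself. `destroy_done` would be a new (assumed) field in
 * cgroup_subsys_state. */
static void css_killed_ref_fn(struct percpu_ref *ref)
{
	struct cgroup_subsys_state *css =
		container_of(ref, struct cgroup_subsys_state, refcnt);

	complete(&css->destroy_done);	/* no work item, just a wakeup */
}

static void kill_css(struct cgroup_subsys_state *css)
{
	css_get(css);			/* fuse: pin css across offlining */
	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
	wait_for_completion(&css->destroy_done);	/* OK: process context */

	/* most of css_killed_work_fn() would move here */
	offline_css(css);
	css_put(css);			/* de-fuse */
}
```

This would remove one of the two users of the shared work item entirely, at the cost of blocking in kill_css(); the cover letter chose the simpler separate-work-structs approach instead.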
* [PATCH 1/2] cgroup: Wait for cgroup_subsys_state offlining on unmount
@ 2022-05-25 15:15 ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-05-25 15:15 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk

The reported problem occurs when a cgroup hierarchy is unmounted quickly
after the last cgroup is removed. The last cgroup prevents the root
cgroup's css->refcnt from being killed, so the respective cgroup root
remains in existence permanently. This is actually intended behavior for
the memory controller, whose state is long-lived and which has no better
option than staying attached (see also commit 3c606d35fe97 ("cgroup:
prevent mount hang due to memory controller lifetime")).

We can improve the situation by checking the children list only after
any cgroups in the middle of removal are gone, detected via
cgroup_destroy_wq.

Reported-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://lore.kernel.org/r/20220404142535.145975-1-minhquangbui99@gmail.com
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 kernel/cgroup/cgroup.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..a5b0d5d54fbc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2205,11 +2205,14 @@ static void cgroup_kill_sb(struct super_block *sb)
 	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
 
 	/*
-	 * If @root doesn't have any children, start killing it.
+	 * If @root doesn't have any children held by residual state (e.g.
+	 * memory controller), start killing it, flush workqueue to filter out
+	 * transiently offlined children.
 	 * This prevents new mounts by disabling percpu_ref_tryget_live().
 	 *
 	 * And don't kill the default root.
 	 */
+	flush_workqueue(cgroup_destroy_wq);
 	if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
 	    !percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
 		cgroup_bpf_offline(&root->cgrp);
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 40+ messages in thread
* [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-05-25 15:15 ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-05-25 15:15 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk

The cgroup_subsys_state of cgroup subsystems (not cgroup->self) uses
both kill and release callbacks on its release path (see the comment for
css_free_rwork_fn()). When the last reference is also the base
reference, we run into trouble when the still-active work_struct (1) is
re-initialized from css_release (2):

// ref=1: only base reference
kill_css()
  css_get() // fuse, ref+=1 == 2
  percpu_ref_kill_and_confirm
    // ref -= 1 == 1: kill base references
[via rcu]
css_killed_ref_fn == refcnt.confirm_switch
  queue_work(css->destroy_work) (1)
[via css->destroy_work]
css_killed_work_fn == wq.func
  offline_css() // needs fuse
  css_put // ref -= 1 == 0: de-fuse, was last
    ...
    percpu_ref_put_many
      css_release
        queue_work(css->destroy_work) (2)
[via css->destroy_work]
css_release_work_fn == wq.func

Although we take a fuse reference, it only serves to pin the css until
just after offline_css() in css_killed_work_fn(). We could check inside
css_release() whether destroy_work is active (WORK_STRUCT_PENDING_BIT)
and daisy-chain css_release_work_fn() from css_release(). To avoid
clashes with the various stages of work item processing, we instead
spend some space in the css (my config's css grows to 232B + 32B) and
create a separate work entry for each user.

Reported-by: syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com
Reported-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Link: https://lore.kernel.org/r/20220412192459.227740-1-tadeusz.struk@linaro.org/
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 include/linux/cgroup-defs.h |  5 +++--
 kernel/cgroup/cgroup.c      | 14 +++++++-------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..16b99aa04305 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -178,8 +178,9 @@ struct cgroup_subsys_state {
 	 */
 	atomic_t online_cnt;
 
-	/* percpu_ref killing and RCU release */
-	struct work_struct destroy_work;
+	/* percpu_ref killing, css release, and RCU release work structs */
+	struct work_struct killed_ref_work;
+	struct work_struct release_work;
 	struct rcu_work destroy_rwork;
 
 	/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a5b0d5d54fbc..33b3a44391d7 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5102,7 +5102,7 @@ static struct cftype cgroup_base_files[] = {
  * css_free_work_fn().
  *
  * It is actually hairier because both step 2 and 4 require process context
- * and thus involve punting to css->destroy_work adding two additional
+ * and thus involve punting to css->release_work adding two additional
  * steps to the already complex sequence.
  */
 static void css_free_rwork_fn(struct work_struct *work)
@@ -5157,7 +5157,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 static void css_release_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, release_work);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
@@ -5213,8 +5213,8 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	INIT_WORK(&css->release_work, css_release_work_fn);
+	queue_work(cgroup_destroy_wq, &css->release_work);
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5549,7 +5549,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 static void css_killed_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, killed_ref_work);
 
 	mutex_lock(&cgroup_mutex);
@@ -5570,8 +5570,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
-		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_destroy_wq, &css->destroy_work);
+		INIT_WORK(&css->killed_ref_work, css_killed_work_fn);
+		queue_work(cgroup_destroy_wq, &css->killed_ref_work);
 	}
 }
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-05-25 16:14 ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-05-25 16:14 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk

On Wed, May 25, 2022 at 05:15:17PM +0200, Michal Koutný <mkoutny@suse.com> wrote:
> // ref=1: only base reference
> kill_css()
>   css_get() // fuse, ref+=1 == 2
>   percpu_ref_kill_and_confirm
>     // ref -= 1 == 1: kill base references
> [via rcu]
> css_killed_ref_fn == refcnt.confirm_switch
>   queue_work(css->destroy_work) (1)
> [via css->destroy_work]
> css_killed_work_fn == wq.func
>   offline_css() // needs fuse
>   css_put // ref -= 1 == 0: de-fuse, was last
>     ...
>     percpu_ref_put_many
>       css_release
>         queue_work(css->destroy_work) (2)
> [via css->destroy_work]
> css_release_work_fn == wq.func

Apologies, this explanation is wrong. (I thought it explained why
Tadeusz's patch with the double get/put didn't fix the issue, i.e. no
number of extra references would help under the explanation above.)

But the above is not correct. I've looked at the stack trace [1], and
the offending percpu_ref_put_many is called from an RCU callback,
percpu_ref_switch_to_atomic_rcu(), so I can't actually see why the
count drops to zero there...

Regards,
Michal

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-05-26  9:56 ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-05-26  9:56 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk

On Wed, May 25, 2022 at 06:14:55PM +0200, Michal Koutný <mkoutny@suse.com> wrote:
> But the above is not correct. I've looked at the stack trace [1] and the
> offending percpu_ref_put_many is called from an RCU callback
> percpu_ref_switch_to_atomic_rcu(), so I can't actually see why it drops
> to zero there...

The link [1] was missing there; it should have been [1] below.

After some more thought, the following is a possible sequencing of the
involved functions:

// ref=A: initial state
kill_css()
  css_get // ref += F == A+F: fuse
  percpu_ref_kill_and_confirm
    __percpu_ref_switch_to_atomic
      percpu_ref_get
      // ref += 1 == A+F+1: atomic mode, self-protection
    percpu_ref_put
    // ref -= 1 == A+F: kill the base reference
[via rcu]
percpu_ref_switch_to_atomic_rcu
  percpu_ref_call_confirm_rcu
    css_killed_ref_fn == refcnt.confirm_switch
      queue_work(css->destroy_work) (1)
[via css->destroy_work]
css_killed_work_fn == wq.func
  offline_css() // needs fuse
  css_put // ref -= F == A: de-fuse
    percpu_ref_put
    // ref -= 1 == A-1: remove self-protection
    css_release // A <= 1 -> 2nd queue_work explodes!
      queue_work(css->destroy_work) (2)
[via css->destroy_work]
css_release_work_fn == wq.func

Another CPU would have to dispatch and run the css_killed_work_fn
callback in parallel with percpu_ref_switch_to_atomic_rcu. This is a
more accurate explanation; however, its likelihood seems very low.
Perhaps some debug prints of percpu_ref_data.data in
percpu_ref_call_confirm_rcu could shed more light on this [2].

HTH,
Michal

[1] https://syzkaller.appspot.com/text?tag=CrashReport&x=162b5781f00000
[2] I tried notifying syzbot about [3] moments ago.
[3] https://github.com/Werkov/linux/tree/cgroup-ml/css-lifecycle-syzbot

^ permalink raw reply	[flat|nested] 40+ messages in thread
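The refcount bookkeeping in the sequence above can be sanity-checked with a plain-integer sketch. This is ordinary userspace C with an invented function name; it models only the counting (A = count at kill_css() entry, F = the "fuse" reference), not the kernel's percpu_ref machinery:

```c
/*
 * Plain-integer model of the refcount arithmetic in the sequence above.
 * A = count at kill_css() entry, F = the "fuse" reference taken by
 * css_get(). The function name is made up for illustration.
 */
static int simulate_css_refs(int A, int F)
{
	int ref = A;	/* initial state */

	ref += F;	/* kill_css(): css_get() fuse                    -> A+F   */
	ref += 1;	/* __percpu_ref_switch_to_atomic(): self-protect -> A+F+1 */
	ref -= 1;	/* percpu_ref_put(): kill the base reference     -> A+F   */

	/* css_killed_work_fn() dispatched and run on another CPU: */
	ref -= F;	/* css_put(): de-fuse                            -> A     */

	/* back in percpu_ref_switch_to_atomic_rcu(): */
	ref -= 1;	/* drop self-protection                          -> A-1   */

	return ref;
}
```

With A == 1 (only the base reference left at kill_css() time), `simulate_css_refs(1, 1)` returns 0: the count reaches zero inside the RCU callback, so css_release() queues css->destroy_work while the killed-work instance may still be in flight.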
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
  2022-05-26  9:56 ` Michal Koutný
@ 2022-05-26 18:15 ` Tejun Heo
  -1 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-05-26 18:15 UTC (permalink / raw)
  To: Michal Koutný
  Cc: cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk

Hello, Michal.

On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
> // ref=A: initial state
> kill_css()
>   css_get // ref += F == A+F: fuse
>   percpu_ref_kill_and_confirm
>     __percpu_ref_switch_to_atomic
>       percpu_ref_get
>       // ref += 1 == A+F+1: atomic mode, self-protection
>     percpu_ref_put
>     // ref -= 1 == A+F: kill the base reference
> [via rcu]
> percpu_ref_switch_to_atomic_rcu
>   percpu_ref_call_confirm_rcu
>     css_killed_ref_fn == refcnt.confirm_switch
>       queue_work(css->destroy_work) (1)
> [via css->destroy_work]
> css_killed_work_fn == wq.func
>   offline_css() // needs fuse
>   css_put // ref -= F == A: de-fuse
>     percpu_ref_put
>     // ref -= 1 == A-1: remove self-protection
>     css_release // A <= 1 -> 2nd queue_work explodes!
>       queue_work(css->destroy_work) (2)
> [via css->destroy_work]
> css_release_work_fn == wq.func

I'm not sure I'm following it, but it's perfectly fine to re-use the
work item at this point. The work item can actually be recycled from the
very beginning of the work function. The only thing we need to make sure
is that we don't css_put() prematurely, to avoid the css being freed
while we're still using it.

For the sharing to be a problem, we would have to be queueing the
release work item while the destroy instance is still pending, and if
that is the case, it doesn't really matter whether we use two separate
work items or not. We're already broken and would just be shifting the
problem to explode elsewhere.

The only possibility that I can think of is that somehow we're ending up
with an extra css_put() somewhere, thus triggering the release path
prematurely. If that's the case, we'll probably need to trace get/puts
to find out who's causing the ref imbalance.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
@ 2022-05-27 16:39 ` Tadeusz Struk
  0 siblings, 0 replies; 40+ messages in thread
From: Tadeusz Struk @ 2022-05-27 16:39 UTC (permalink / raw)
  To: Tejun Heo, Michal Koutný
  Cc: cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On 5/26/22 11:15, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
>> // ref=A: initial state
>> kill_css()
>>   css_get // ref += F == A+F: fuse
>>   percpu_ref_kill_and_confirm
>>     __percpu_ref_switch_to_atomic
>>       percpu_ref_get
>>       // ref += 1 == A+F+1: atomic mode, self-protection
>>     percpu_ref_put
>>     // ref -= 1 == A+F: kill the base reference
>> [via rcu]
>> percpu_ref_switch_to_atomic_rcu
>>   percpu_ref_call_confirm_rcu
>>     css_killed_ref_fn == refcnt.confirm_switch
>>       queue_work(css->destroy_work) (1)
>> [via css->destroy_work]
>> css_killed_work_fn == wq.func
>>   offline_css() // needs fuse
>>   css_put // ref -= F == A: de-fuse
>>     percpu_ref_put
>>     // ref -= 1 == A-1: remove self-protection
>>     css_release // A <= 1 -> 2nd queue_work explodes!
>>       queue_work(css->destroy_work) (2)
>> [via css->destroy_work]
>> css_release_work_fn == wq.func
>
> I'm not sure I'm following it but it's perfectly fine to re-use the work
> item at this point. The work item actually can be re-cycled from the very
> beginning of the work function. The only thing we need to make sure is that
> we don't css_put() prematurely to avoid it being freed while we're using it.
>
> For the sharing to be a problem, we should be queueing the release work item
> while the destroy instance is still pending, and if that is the case, it
> doesn't really matter whether we use two separate work items or not. We're
> already broken and would just be shifting the problem to explode elsewhere.
>
> The only possibility that I can think of is that somehow we're ending up
> with an extra css_put() somewhere thus triggering the release path
> prematurely. If that's the case, we'll prolly need to trace get/puts to find
> out who's causing the ref imbalance.

Hi Michal,

As far as I can see, we are trying to test the same thing Tejun
suggested. I just sent a test request to try this:
https://github.com/tstruk/linux/commit/master

Let me know if you have any more tests to run, and I will hold off until
you are done.

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path @ 2022-05-27 16:54 ` Michal Koutný 0 siblings, 0 replies; 40+ messages in thread From: Michal Koutný @ 2022-05-27 16:54 UTC (permalink / raw) To: Tadeusz Struk Cc: Tejun Heo, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh Hello Tadeusz. On Fri, May 27, 2022 at 09:39:20AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote: > As far as I can see we are trying to test the same thing suggested by Tejun. > I just sent a test request to try this: > https://github.com/tstruk/linux/commit/master Yup, I've added a few more prints to get more fine-grained resolution. Also, I decided to use ftrace printk so as not to interfere with timing too much (due to the original race hypothesis). > Let me know if you have any more tests to run and I will hold off until > you are done. My latest attempt is [1] (tip 5500e05d82fd5b5db2203eedb3f786857d3ccbea). So far, I'm not convinced I extract the complete ftrace buffer from the syzbot runs, so I'm not drawing any conclusions from the traces I've got. I'm not going to continue today. You may have more luck with your plain printk (if it's just an imbalance and it avoids printk-locking-sensitive paths). HTH, Michal [1] https://github.com/Werkov/linux/tree/cgroup-ml/css-lifecycle-b2 ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path @ 2022-05-27 17:23 ` Tejun Heo 0 siblings, 0 replies; 40+ messages in thread From: Tejun Heo @ 2022-05-27 17:23 UTC (permalink / raw) To: Michal Koutný Cc: Tadeusz Struk, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh On Fri, May 27, 2022 at 06:54:29PM +0200, Michal Koutný wrote: > Hello Tadeusz. > > On Fri, May 27, 2022 at 09:39:20AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote: > > As far as I can see we are trying to test the same thing suggested by Tejun. > > I just sent a test request to try this: > > https://github.com/tstruk/linux/commit/master > > Yup, I've added few more prints to get more fine-grained resolution. > Also, I decided to use ftrace printk not to interfere with timing too > much (due to the original race hypothesis). > > > Let me know if you have any more tests to run and I will hold off until > > you are done. > > My latest attempt is [1] (tip 5500e05d82fd5b5db2203eedb3f786857d3ccbea). > > So far, I'm not convinced, I extract the complete ftrace buffer from the > syzbot runs, so I'm not drawing any conclusions from the traces I've > got. I'm not going to continue today. You may have more luck with your > plain printk (if it's just imbalance and it avoids printk locking > sensitive paths). At least for triaging, it may be easier to use bpftrace to attach kprobes to the get/put functions with the right filter and record the kstack counts. Thanks. -- tejun ^ permalink raw reply [flat|nested] 40+ messages in thread
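Tejun's triaging suggestion could look roughly like the bpftrace sketch below. Everything in it is an assumption for illustration: css_killed_ref_fn() and css_release() exist in kernel/cgroup/cgroup.c, but whether they are kprobe-able on a given build (i.e. not inlined; check /proc/kallsyms) must be verified, and css_get()/css_put() themselves are inline helpers that cannot be probed directly.

```
# Hypothetical sketch, not a tested command; probe names are assumptions.
sudo bpftrace -e '
    kprobe:css_killed_ref_fn, kprobe:css_release {
        @stacks[probe, kstack] = count();
    }'
```

Counting kernel stacks per probe site would show which call chain issues the put that triggers the premature release.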
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path @ 2022-06-01 23:13 ` Tadeusz Struk 0 siblings, 0 replies; 40+ messages in thread From: Tadeusz Struk @ 2022-06-01 23:13 UTC (permalink / raw) To: Tejun Heo, Michal Koutný Cc: cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh On 5/26/22 11:15, Tejun Heo wrote: > Hello, Michal. > > On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote: >> // ref=A: initial state >> kill_css() >> css_get // ref+=F == A+F: fuse >> percpu_ref_kill_and_confirm >> __percpu_ref_switch_to_atomic >> percpu_ref_get >> // ref += 1 == A+F+1: atomic mode, self-protection >> percpu_ref_put >> // ref -= 1 == A+F: kill the base reference >> [via rcu] >> percpu_ref_switch_to_atomic_rcu >> percpu_ref_call_confirm_rcu >> css_killed_ref_fn == refcnt.confirm_switch >> queue_work(css->destroy_work) (1) >> [via css->destroy_work] >> css_killed_work_fn == wq.func >> offline_css() // needs fuse >> css_put // ref -= F == A: de-fuse >> percpu_ref_put >> // ref -= 1 == A-1: remove self-protection >> css_release // A <= 1 -> 2nd queue_work explodes! > > I'm not sure I'm following it but it's perfectly fine to re-use the work > item at this point. The work item actually can be re-cycled from the very > beginning of the work function. The only thing we need to make sure is that > we don't css_put() prematurely to avoid it being freed while we're using it. Yes, it is ok to reuse a work struct, but it's not ok to have the same work struct enqueued twice on the same WQ when list debug is enabled. That's why we are getting this "BUG: corrupted list.." > For the sharing to be a problem, we should be queueing the release work item > while the destroy instance is still pending, and if that is the case, it > doesn't really matter whether we use two separate work items or not. We're > already broken and would just be shifting the problem to explode elsewhere. 
> > The only possibility that I can think of is that somehow we're ending up > with an extra css_put() somewhere thus triggering the release path > prematurely. If that's the case, we'll prolly need to trace get/puts to find > out who's causing the ref imbalance. That's right. Michal was on the right track for the kill_css() part. What I think is going on is that once css_create() fails then cgroup_subtree_control_write() ends up calling first kill_css() and then css_put() on the same css, I think it's &cgrp->self of the kernfs_node. The each_live_descendant_post() also iterates on the root. Here is the call flow (sorry for long lines): cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error | |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called | | css_killed_ref_fn() <- first css->destroy_work enqueue | | | |-> INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work); | | |-> goto out_unlock; | | | |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called | | css_release(percpu_ref) <- second css->destroy_work enqueue | |-> INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption. What seems to work for me as the simplest fix is to prevent enqueuing a dying css in css_release() as below. Please let me know if that makes sense to you. 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 1779ccddb734..5618211487cc 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref) struct cgroup_subsys_state *css = container_of(ref, struct cgroup_subsys_state, refcnt); - INIT_WORK(&css->destroy_work, css_release_work_fn); - queue_work(cgroup_destroy_wq, &css->destroy_work); + if (!(css->flags & CSS_DYING)) { + INIT_WORK(&css->destroy_work, css_release_work_fn); + queue_work(cgroup_destroy_wq, &css->destroy_work); + } } static void init_and_link_css(struct cgroup_subsys_state *css, -- Thanks, Tadeusz ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path @ 2022-06-01 23:20 ` Tejun Heo 0 siblings, 0 replies; 40+ messages in thread From: Tejun Heo @ 2022-06-01 23:20 UTC (permalink / raw) To: Tadeusz Struk Cc: Michal Koutný, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh On Wed, Jun 01, 2022 at 04:13:32PM -0700, Tadeusz Struk wrote: > > On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote: > > > // ref=A: initial state > > > kill_css() > > > css_get // ref+=F == A+F: fuse > > > percpu_ref_kill_and_confirm > > > __percpu_ref_switch_to_atomic > > > percpu_ref_get > > > // ref += 1 == A+F+1: atomic mode, self-protection > > > percpu_ref_put > > > // ref -= 1 == A+F: kill the base reference > > > [via rcu] > > > percpu_ref_switch_to_atomic_rcu > > > percpu_ref_call_confirm_rcu > > > css_killed_ref_fn == refcnt.confirm_switch > > > queue_work(css->destroy_work) (1) > > > [via css->destroy_work] > > > css_killed_work_fn == wq.func > > > offline_css() // needs fuse > > > css_put // ref -= F == A: de-fuse > > > percpu_ref_put > > > // ref -= 1 == A-1: remove self-protection > > > css_release // A <= 1 -> 2nd queue_work explodes! > > > > I'm not sure I'm following it but it's perfectly fine to re-use the work > > item at this point. The work item actually can be re-cycled from the very > > beginning of the work function. The only thing we need to make sure is that > > we don't css_put() prematurely to avoid it being freed while we're using it. > > Yes, it is ok to reuse a work struct, but it's not ok to have the same > work struct enqueued twice on the same WQ when list debug is enabled. > That's why we are getting this "BUG: corrupted list.." The above scenario isn't that tho. Once the work item starts executing, wq doesn't care about what happens to it and as killed_work_fn is holding a reference, the release scheduling shouldn't happen before it starts executing unless somebody is screwing up the refcnting. > That's right. 
Michal was on the right track for the kill_css() part. > What I think is going on is that once css_create() fails then > cgroup_subtree_control_write() ends up calling first kill_css() and > then css_put() on the same css, I think it's &cgrp->self of the kernfs_node. > The each_live_descendant_post() also iterates on the root. > Here is the call flow (sorry for long lines): > > cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error > | > |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called > | > | css_killed_ref_fn() <- first css->destroy_work enqueue > | | > | |-> INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work); > | > | > |-> goto out_unlock; > | | > | |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called > | > | > css_release(percpu_ref) <- second css->destroy_work enqueue > | > |-> INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption. > > > What seems to work for me as the simplest fix is to prevent enqueuing a dying > css in css_release() as below. Please let me know if that makes sense to you. 
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 1779ccddb734..5618211487cc 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref) > struct cgroup_subsys_state *css = > container_of(ref, struct cgroup_subsys_state, refcnt); > - INIT_WORK(&css->destroy_work, css_release_work_fn); > - queue_work(cgroup_destroy_wq, &css->destroy_work); > + if (!(css->flags & CSS_DYING)) { > + INIT_WORK(&css->destroy_work, css_release_work_fn); > + queue_work(cgroup_destroy_wq, &css->destroy_work); > + } When the problem is ref imbalance, how can above be the solution? Of course release path won't cause an issue if they don't run, but we still need to free the thing, right? Thanks. -- tejun ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path @ 2022-06-01 23:37 ` Tadeusz Struk 0 siblings, 0 replies; 40+ messages in thread From: Tadeusz Struk @ 2022-06-01 23:37 UTC (permalink / raw) To: Tejun Heo Cc: Michal Koutný, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh On 6/1/22 16:20, Tejun Heo wrote: > On Wed, Jun 01, 2022 at 04:13:32PM -0700, Tadeusz Struk wrote: >>> On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote: >>>> // ref=A: initial state >>>> kill_css() >>>> css_get // ref+=F == A+F: fuse >>>> percpu_ref_kill_and_confirm >>>> __percpu_ref_switch_to_atomic >>>> percpu_ref_get >>>> // ref += 1 == A+F+1: atomic mode, self-protection >>>> percpu_ref_put >>>> // ref -= 1 == A+F: kill the base reference >>>> [via rcu] >>>> percpu_ref_switch_to_atomic_rcu >>>> percpu_ref_call_confirm_rcu >>>> css_killed_ref_fn == refcnt.confirm_switch >>>> queue_work(css->destroy_work) (1) >>>> [via css->destroy_work] >>>> css_killed_work_fn == wq.func >>>> offline_css() // needs fuse >>>> css_put // ref -= F == A: de-fuse >>>> percpu_ref_put >>>> // ref -= 1 == A-1: remove self-protection >>>> css_release // A <= 1 -> 2nd queue_work explodes! >>> >>> I'm not sure I'm following it but it's perfectly fine to re-use the work >>> item at this point. The work item actually can be re-cycled from the very >>> beginning of the work function. The only thing we need to make sure is that >>> we don't css_put() prematurely to avoid it being freed while we're using it. >> >> Yes, it is ok to reuse a work struct, but it's not ok to have the same >> work struct enqueued twice on the same WQ when list debug is enabled. >> That's why we are getting this "BUG: corrupted list.." > > The above scenario isn't that tho. 
Once the work item starts executing, wq > doesn't care about what happens to it and as killed_work_fn is holding a > reference, the release scheduling shouldn't happen before it starts > executing unless somebody is screwing up the refcnting. > >> That's right. Michal was on the right track for the kill_css() part. >> What I think is going on is that once css_create() fails then >> cgroup_subtree_control_write() ends up calling first kill_css() and >> then css_put() on the same css, I think it's &cgrp->self of the kernfs_node. >> The each_live_descendant_post() also iterates on the root. >> Here is the call flow (sorry for long lines): >> >> cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error >> | >> |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called >> | >> | css_killed_ref_fn() <- first css->destroy_work enqueue >> | | >> | |-> INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work); >> | >> | >> |-> goto out_unlock; >> | | >> | |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called >> | >> | >> css_release(percpu_ref) <- second css->destroy_work enqueue >> | >> |-> INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption. >> >> >> What seems to work for me as the simplest fix is to prevent enqueuing a dying >> css in css_release() as below. Please let me know if that makes sense to you. 
>> >> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c >> index 1779ccddb734..5618211487cc 100644 >> --- a/kernel/cgroup/cgroup.c >> +++ b/kernel/cgroup/cgroup.c >> @@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref) >> struct cgroup_subsys_state *css = >> container_of(ref, struct cgroup_subsys_state, refcnt); >> - INIT_WORK(&css->destroy_work, css_release_work_fn); >> - queue_work(cgroup_destroy_wq, &css->destroy_work); >> + if (!(css->flags & CSS_DYING)) { >> + INIT_WORK(&css->destroy_work, css_release_work_fn); >> + queue_work(cgroup_destroy_wq, &css->destroy_work); >> + } > > When the problem is ref imbalance, how can above be the solution? Of course > release path won't cause an issue if they don't run, but we still need to > free the thing, right? Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called on it. Only css_get() & css_put() modify the refcnt value. And for the "free the thing" the css_killed_work_fn() does that. It calls offline_css(css) and css_put(css) for the whole css hierarchy. -- Thanks, Tadeusz ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
From: Tejun Heo @ 2022-06-01 23:43 UTC
To: Tadeusz Struk
Cc: Michal Koutný, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On Wed, Jun 01, 2022 at 04:37:17PM -0700, Tadeusz Struk wrote:
> Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn)
> doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called

Yeah, the base ref is special for percpu_ref.

> on it. Only css_get() & css_put() modify the refcnt value.
> And for the "free the thing" the css_killed_work_fn() does that.
> It calls offline_css(css) and css_put(css) for the whole css hierarchy.

Yeah, the freeing path depends on the css_put(css) invoking css_release()
which schedules the work item which actually frees. Am I misunderstanding
something here?

Thanks.

--
tejun
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
From: Tadeusz Struk @ 2022-06-02 0:00 UTC
To: Tejun Heo
Cc: Michal Koutný, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On 6/1/22 16:43, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 04:37:17PM -0700, Tadeusz Struk wrote:
>> Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn)
>> doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called
>
> Yeah, the base ref is special for percpu_ref.
>
>> on it. Only css_get() & css_put() modify the refcnt value.
>> And for the "free the thing" the css_killed_work_fn() does that.
>> It calls offline_css(css) and css_put(css) for the whole css hierarchy.
>
> Yeah, the freeing path depends on the css_put(css) invoking css_release()
> which schedules the work item which actually frees. Am I misunderstanding
> something here?

What I'm trying to say is that it's not really a ref imbalance problem.
I think once the kill_css() has been called on a css, and it is enqueued to be
killed, we shouldn't call css_put(css) on it anymore outside of the "killed call
flow path". It will all be handled by the css_killed_ref_fn() function.

The fact the css_release() is called (via cgroup_kn_unlock()) just after
kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
(cgroup_destroy_wq), just with different function. This results in the
BUG: corrupted list in insert_work issue.

--
Thanks,
Tadeusz
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
From: Tejun Heo @ 2022-06-02 0:07 UTC
To: Tadeusz Struk
Cc: Michal Koutný, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

Hello,

On Wed, Jun 01, 2022 at 05:00:44PM -0700, Tadeusz Struk wrote:
> What I'm trying to say is that it's not really a ref imbalance problem.
> I think once the kill_css() has been called on a css, and it is enqueued to be
> killed, we shouldn't call css_put(css) on it anymore outside of the "killed call
> flow path". It will all be handled by the css_killed_ref_fn() function.
>
> The fact the css_release() is called (via cgroup_kn_unlock()) just after
> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
> (cgroup_destroy_wq), just with different function. This results in the
> BUG: corrupted list in insert_work issue.

I have a hard time following here. The kill / release paths relationship
isn't that complicated. The kill path is invoked when the percpu_ref's base
ref is killed and holds an extra ref so that it's guaranteed that release
can't happen before the kill path is done with the css. When the final put
happens - whether that's from the kill path or someone else, which often is
the case - the release path triggers. If we have release getting scheduled
while the kill path isn't finished, it is a reference counting problem,
right?

Can you elaborate the exact scenario that you think is happening? Please
feel free to omit the function calls and all that. Just lay out who's doing
what.

Thanks.

--
tejun
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
From: Tadeusz Struk @ 2022-06-02 0:26 UTC
To: Tejun Heo
Cc: Michal Koutný, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On 6/1/22 17:07, Tejun Heo wrote:
> Hello,
>
> On Wed, Jun 01, 2022 at 05:00:44PM -0700, Tadeusz Struk wrote:
>> What I'm trying to say is that it's not really a ref imbalance problem.
>> I think once the kill_css() has been called on a css, and it is enqueued to be
>> killed, we shouldn't call css_put(css) on it anymore outside of the "killed call
>> flow path". It will all be handled by the css_killed_ref_fn() function.
>>
>> The fact the css_release() is called (via cgroup_kn_unlock()) just after
>> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
>> (cgroup_destroy_wq), just with different function. This results in the
>> BUG: corrupted list in insert_work issue.
>
> I have a hard time following here. The kill / release paths relationship
> isn't that complicated. The kill path is invoked when the percpu_ref's base
> ref is killed and holds an extra ref so that it's guaranteed that release
> can't happen before the kill path is done with the css. When the final put
> happens - whether that's from the kill path or someone else, which often is
> the case - the release path triggers. If we have release getting scheduled
> while the kill path isn't finished, it is a reference counting problem,
> right?
>
> Can you elaborate the exact scenario that you think is happening? Please
> feel free to omit the function calls and all that. Just lay out who's doing
> what.

Ok the problem is that

1. kill_css() triggers css_killed_ref_fn(), which enqueues &css->destroy_work on cgroup_destroy_wq
2. Last put_css() calls css_release(), which enqueues &css->destroy_work on cgroup_destroy_wq

We have two instances of the same work struct enqueued on the same WQ (cgroup_destroy_wq),
which causes "BUG: corrupted list in insert_work"

So I think the easiest way to solve this would be to have two separate work_structs,
one for the killed_ref path and css_release path as in:
https://lore.kernel.org/all/20220523212724.233314-1-tadeusz.struk@linaro.org/

--
Thanks,
Tadeusz
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
From: Tejun Heo @ 2022-06-02 0:29 UTC
To: Tadeusz Struk
Cc: Michal Koutný, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On Wed, Jun 01, 2022 at 05:26:34PM -0700, Tadeusz Struk wrote:
> Ok the problem is that
>
> 1. kill_css() triggers css_killed_ref_fn(), which enqueues &css->destroy_work on cgroup_destroy_wq
> 2. Last put_css() calls css_release(), which enqueues &css->destroy_work on cgroup_destroy_wq
>
> We have two instances of the same work struct enqueued on the same WQ (cgroup_destroy_wq),
> which causes "BUG: corrupted list in insert_work"

#2 shouldn't be happening before kill_ref_fn() is done with the css. If what
you're saying is happening, what's broken is the fact that the refcnt is
reaching 0 prematurely.

> So I think the easiest way to solve this would be to have two separate work_structs,
> one for the killed_ref path and css_release path as in:

If you do that, you'd just be racing the free path against the kill path and
the css might get freed while the kill path is still accessing it.

Thanks.

--
tejun
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
From: Tadeusz Struk @ 2022-06-02 0:40 UTC
To: Tejun Heo
Cc: Michal Koutný, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On 6/1/22 17:29, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 05:26:34PM -0700, Tadeusz Struk wrote:
>> Ok the problem is that
>>
>> 1. kill_css() triggers css_killed_ref_fn(), which enqueues &css->destroy_work on cgroup_destroy_wq
>> 2. Last put_css() calls css_release(), which enqueues &css->destroy_work on cgroup_destroy_wq
>>
>> We have two instances of the same work struct enqueued on the same WQ (cgroup_destroy_wq),
>> which causes "BUG: corrupted list in insert_work"
>
> #2 shouldn't be happening before kill_ref_fn() is done with the css. If what
> you're saying is happening, what's broken is the fact that the refcnt is
> reaching 0 prematurely.

css_killed_ref_fn() will be called regardless of the value of refcnt (via percpu_ref_kill_and_confirm())
and it will only enqueue the css_killed_work_fn() to be called later.
Then css_put()->css_release() will be called before the css_killed_work_fn() will even
get a chance to run, and it will also *only* enqueue css_release_work_fn() to be called later.
The problem happens on the second enqueue. So there needs to be something in place that
will make sure that css_killed_work_fn() is done before css_release() can enqueue
the second job. Does it sound right?

>> So I think the easiest way to solve this would be to have two separate work_structs,
>> one for the killed_ref path and css_release path as in:
>
> If you do that, you'd just be racing the free path against the kill path and
> the css might get freed while the kill path is still accessing it.
>
> Thanks.

--
Thanks,
Tadeusz
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
From: Michal Koutný @ 2022-06-02 11:47 UTC
To: Tadeusz Struk
Cc: Tejun Heo, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On Wed, Jun 01, 2022 at 05:40:51PM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> css_killed_ref_fn() will be called regardless of the value of refcnt (via percpu_ref_kill_and_confirm())
> and it will only enqueue the css_killed_work_fn() to be called later.
> Then css_put()->css_release() will be called before the css_killed_work_fn() will even
> get a chance to run, and it will also *only* enqueue css_release_work_fn() to be called later.
> The problem happens on the second enqueue. So there need to be something in place that
> will make sure that css_killed_work_fn() is done before css_release() can enqueue
> the second job.

IIUC, here you describe the same scenario I broke down at [1].

> Does it sound right?

I added a parameter A there (that is sum of base and percpu references
before kill_css()).
I thought it fails because A == 1 (i.e. killing the base reference),
however, that seems an unlikely situation (because cgroup code uses a
"fuse" reference to pin css for offline_css()).

So the remaining option (at least I find it more likely now) is that
A == 0 (A < 0 would trigger the warning in
percpu_ref_switch_to_atomic_rcu()), aka the ref imbalance. I hope we can
get to the bottom of this with detailed enough tracing of gets/puts.

Splitting the work struct is contradictory to the existing approach with
the "fuse" reference.

(BTW you also wrote
On Wed, Jun 01, 2022 at 05:00:44PM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> The fact the css_release() is called (via cgroup_kn_unlock()) just after
> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
> (cgroup_destroy_wq), just with different function. This results in the
> BUG: corrupted list in insert_work issue.

Where do you see a critical css_release called from cgroup_kn_unlock()?
I always observed the css_release() being called via
percpu_ref_call_confirm_rcu() (in the original and subsequent syzbot
logs.))

Thanks,
Michal

[1] https://lore.kernel.org/r/Yo7KfEOz92kS2z5Y@blackbook/
* Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
From: Tadeusz Struk @ 2022-06-02 14:28 UTC
To: Michal Koutný
Cc: Tejun Heo, cgroups, linux-kernel, Zefan Li, Johannes Weiner, Bui Quang Minh

On 6/2/22 04:47, Michal Koutný wrote:
> On Wed, Jun 01, 2022 at 05:40:51PM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
>> css_killed_ref_fn() will be called regardless of the value of refcnt (via percpu_ref_kill_and_confirm())
>> and it will only enqueue the css_killed_work_fn() to be called later.
>> Then css_put()->css_release() will be called before the css_killed_work_fn() will even
>> get a chance to run, and it will also *only* enqueue css_release_work_fn() to be called later.
>> The problem happens on the second enqueue. So there need to be something in place that
>> will make sure that css_killed_work_fn() is done before css_release() can enqueue
>> the second job.
> IIUC, here you describe the same scenario I broke down at [1].

Right, except the last css_put(), which I think is called from cgroup_kn_unlock().
See below.

>> Does it sound right?
> I added a parameter A there (that is sum of base and percpu references
> before kill_css()).
> I thought it fails because A == 1 (i.e. killing the base reference),
> however, that seems an unlikely situation (because cgroup code uses a
> "fuse" reference to pin css for offline_css()).
>
> So the remaining option (at least I find it more likely now) is that
> A == 0 (A < 0 would trigger the warning in
> percpu_ref_switch_to_atomic_rcu()), aka the ref imbalance. I hope we can
> get to the bottom of this with detailed enough tracing of gets/puts.
>
> Splitting the work struct is condradictive to the existing approach with
> the "fuse" reference.
>
> (BTW you also wrote On Wed, Jun 01, 2022 at 05:00:44PM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
>> The fact the css_release() is called (via cgroup_kn_unlock()) just after
>> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
>> (cgroup_destroy_wq), just with different function. This results in the
>> BUG: corrupted list in insert_work issue.
> Where do you see a critical css_release called from cgroup_kn_unlock()?
> I always observed the css_release() being called via
> percpu_ref_call_confirm_rcu() (in the original and subsequent syzbot

it goes like this:
cgroup_kn_unlock(kn)->cgroup_put(cgrp)->css_put(&cgrp->self),
which brings the refcnt to zero and triggers css_release().

I think what's missing is something that will serialize the kill and release paths.
I will try to put something together today.

--
Thanks,
Tadeusz