* [PATCH] cgroup: don't queue css_release_work if one already pending
@ 2022-04-12 19:24 Tadeusz Struk
  2022-04-14 16:44   ` Michal Koutný
  2022-05-23 21:27 ` [PATCH v2] cgroups: separate destroy_work into two separate wq Tadeusz Struk
  0 siblings, 2 replies; 23+ messages in thread
From: Tadeusz Struk @ 2022-04-12 19:24 UTC (permalink / raw)
  To: cgroups
  Cc: Tadeusz Struk, Tejun Heo, Zefan Li, Johannes Weiner,
	Christian Brauner, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, netdev, bpf, stable, linux-kernel,
	syzbot+e42ae441c3b10acf9e9d

Syzbot found a corrupted list bug scenario that can be triggered from
cgroup css_create(). The reproducer writes to the cgroup.subtree_control
file, which invokes cgroup_apply_control_enable(), css_create(), and
css_populate_dir(), which then randomly fails with a fault-injected -ENOMEM.
In such a scenario the css_create() error path rcu-enqueues the
css_free_rwork_fn work for a css->refcnt initialized with the css_release()
destructor, and there is a chance that the css_release() function will be
invoked for a cgroup_subsys_state for which a destroy_work has already been
queued via the css_create() error path. This causes a list_add corruption,
as can be seen in the syzkaller report [1].
This can be avoided by adding a check to css_release() that verifies
whether the work has already been enqueued.

[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd

Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: <cgroups@vger.kernel.org>
Cc: <netdev@vger.kernel.org>
Cc: <bpf@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: <linux-kernel@vger.kernel.org>

Reported-by: syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com
Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item")
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
---
 kernel/cgroup/cgroup.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..9ae2de29f8c9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5210,8 +5210,11 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
+			      work_data_bits(&css->destroy_work))) {
+		INIT_WORK(&css->destroy_work, css_release_work_fn);
+		queue_work(cgroup_destroy_wq, &css->destroy_work);
+	}
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
@ 2022-04-14 16:44   ` Michal Koutný
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Koutný @ 2022-04-14 16:44 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: cgroups, Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, netdev, bpf, stable, linux-kernel,
	syzbot+e42ae441c3b10acf9e9d

Hello Tadeusz.

Thanks for analyzing this syzbot report. Let me provide my understanding
of the test case and explanation why I think your patch fixes it but is
not fully correct.

On Tue, Apr 12, 2022 at 12:24:59PM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> Syzbot found a corrupted list bug scenario that can be triggered from
> cgroup css_create(). The reproduces writes to cgroup.subtree_control
> file, which invokes cgroup_apply_control_enable(), css_create(), and
> css_populate_dir(), which then randomly fails with a fault injected -ENOMEM.

The reproducer code makes it hard for me to understand which function
fails with ENOMEM.
But I can see that your patch fixes the reproducer, and your additional
debug patch proves that css->destroy_work is re-queued.

> In such scenario the css_create() error path rcu enqueues css_free_rwork_fn
> work for an css->refcnt initialized with css_release() destructor,

Note that css_free_rwork_fn() utilizes css->destroy_*r*work.
The error path in css_create() open codes relevant parts of
css_release_work_fn() so that css_release() can be skipped and the
refcnt is eventually just percpu_ref_exit()'d.

> and there is a chance that the css_release() function will be invoked
> for a cgroup_subsys_state, for which a destroy_work has already been
> queued via css_create() error path.

But I think the problem is css_populate_dir() failing in
cgroup_apply_control_enable(). (Is this what you actually meant?
css_create() error path is then irrelevant, no?)

The already created csses should then be rolled back via 
	cgroup_restore_control(cgrp);
	cgroup_apply_control_disable(cgrp);
	   ...
	   kill_css(css)

I suspect the double-queuing is a result of the fact that there exists
only the single reference to the css->refcnt. I.e. it's
percpu_ref_kill_and_confirm()'d and released both at the same time.

(Normally (when not killing the last reference), css->destroy_work reuse
is not a problem because of the sequenced chain
css_killed_work_fn()->css_put()->css_release().)
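
For reference, that chain looks roughly like this (abbreviated from
kernel/cgroup/cgroup.c of this era; by the time the css_put() below can
reach css_release() and re-INIT_WORK the same destroy_work, the killed
work item has already started executing, so the reuse is safe):

static void css_killed_work_fn(struct work_struct *work)
{
        struct cgroup_subsys_state *css =
                container_of(work, struct cgroup_subsys_state, destroy_work);

        mutex_lock(&cgroup_mutex);

        do {
                offline_css(css);
                css_put(css);   /* may end up calling css_release() for @css */
                /* @css can't go away while we're holding cgroup_mutex */
                css = css->parent;
        } while (css && atomic_dec_and_test(&css->online_cnt));

        mutex_unlock(&cgroup_mutex);
}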

> This can be avoided by adding a check to css_release() that checks
> if it has already been enqueued.

If that's what's happening, then your patch omits the final
css_release_work_fn() in favor of css_killed_work_fn() but both should
be run during the rollback upon css_populate_dir() failure.

So an alternative approach to tackle this situation would be to split
css->destroy_work into two work_structs (one for killing, one for
releasing) at the cost of inflating cgroup_subsys_state.
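
Something along these lines (just an illustrative sketch, field names
made up for the example):

struct cgroup_subsys_state {
        ...
        /* separate work items for the kill and the release stage */
        struct work_struct killed_ref_work;     /* css_killed_ref_fn() -> css_killed_work_fn() */
        struct work_struct release_work;        /* css_release() -> css_release_work_fn() */
        struct rcu_work destroy_rwork;          /* RCU-delayed css_free_rwork_fn(), as today */
        ...
};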

Take my hypothesis with a grain of salt; maybe the assumption (last
reference == initial reference) is not different from normal operation.

Regards,
Michal

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
@ 2022-04-14 17:51     ` Tadeusz Struk
  0 siblings, 0 replies; 23+ messages in thread
From: Tadeusz Struk @ 2022-04-14 17:51 UTC (permalink / raw)
  To: Michal Koutný
  Cc: cgroups, Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, netdev, bpf, stable, linux-kernel,
	syzbot+e42ae441c3b10acf9e9d

Hi Michal,
Thanks for your analysis.

On 4/14/22 09:44, Michal Koutný wrote:
> Hello Tadeusz.
> 
> Thanks for analyzing this syzbot report. Let me provide my understanding
> of the test case and explanation why I think your patch fixes it but is
> not fully correct.
> 
> On Tue, Apr 12, 2022 at 12:24:59PM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
>> Syzbot found a corrupted list bug scenario that can be triggered from
>> cgroup css_create(). The reproduces writes to cgroup.subtree_control
>> file, which invokes cgroup_apply_control_enable(), css_create(), and
>> css_populate_dir(), which then randomly fails with a fault injected -ENOMEM.
> 
> The reproducer code makes it hard for me to understand which function
> fails with ENOMEM.
> But I can see your patch fixes the reproducer and your additional debug
> patch which proves that css->destroy_work is re-queued.

Yes, it is hard to see the actual failing point because I think it is randomly
failing in different places. I think the actual failure that causes the list
corruption is in fact in css_create().
It is the css_create() error path that does the first rcu enqueue in:

https://elixir.bootlin.com/linux/v5.10.109/source/kernel/cgroup/cgroup.c#L5228

and the second is triggered by css->refcnt hitting zero and calling css_release().

The reason why we don't see it actually failing in css_create() in the trace
dump is that the fail_dump() is rate-limited, see:
https://elixir.bootlin.com/linux/v5.18-rc2/source/lib/fault-inject.c#L44

I was confused as well, so I put additional debug prints in every place
where css_create() can fail, and it was actually in
css_create()->cgroup_idr_alloc() that failed in my case.

What happened was, the write triggered:
cgroup_subtree_control_write()->cgroup_apply_control()->cgroup_apply_control_enable()->css_create()

which allocates and initializes the css, then fails in cgroup_idr_alloc(),
bails out and calls queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);

then cgroup_subtree_control_write() bails out to out_unlock:, which then goes:

cgroup_kn_unlock()->cgroup_put()->css_put()->percpu_ref_put(&css->refcnt)->percpu_ref_put_many(ref)

which then calls ref->data->release(ref) and enqueues the same
&css->destroy_rwork on cgroup_destroy_wq causing list corruption in insert_work.
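
For reference, the first enqueue site, abbreviated from the v5.10
css_create() error path linked above (the second site is css_release()
queuing css->destroy_work onto the same cgroup_destroy_wq, as shown in
the patch hunk at the top of the thread):

err_free_css:
        ...
        INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
        queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
        return ERR_PTR(err);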

>> In such scenario the css_create() error path rcu enqueues css_free_rwork_fn
>> work for an css->refcnt initialized with css_release() destructor,
> 
> Note that css_free_rwork_fn() utilizes css->destroy_*r*work.
> The error path in css_create() open codes relevant parts of
> css_release_work_fn() so that css_release() can be skipped and the
> refcnt is eventually just percpu_ref_exit()'d.
> 
>> and there is a chance that the css_release() function will be invoked
>> for a cgroup_subsys_state, for which a destroy_work has already been
>> queued via css_create() error path.
> 
> But I think the problem is css_populate_dir() failing in
> cgroup_apply_control_enable(). (Is this what you actually meant?
> css_create() error path is then irrelevant, no?)

I thought so too at first, as the crash dump shows that this is failing
in css_populate_dir(), but this is not the failure that causes the list corruption.
The code can recover from the failure in css_populate_dir().
The failure that causes trouble is in css_create(), which makes it go to its error path.
I can dig out the patch with my debug prints and request syzbot to run it
if you want.

> 
> The already created csses should then be rolled back via
> 	cgroup_restore_control(cgrp);
> 	cgroup_apply_control_disable(cgrp);
> 	   ...
> 	   kill_css(css)
> 
> I suspect the double-queuing is a result of the fact that there exists
> only the single reference to the css->refcnt. I.e. it's
> percpu_ref_kill_and_confirm()'d and released both at the same time.
> 
> (Normally (when not killing the last reference), css->destroy_work reuse
> is not a problem because of the sequenced chain
> css_killed_work_fn()->css_put()->css_release().)
> 
>> This can be avoided by adding a check to css_release() that checks
>> if it has already been enqueued.
> 
> If that's what's happening, then your patch omits the final
> css_release_work_fn() in favor of css_killed_work_fn() but both should
> be run during the rollback upon css_populate_dir() failure.

This change only prevents the double enqueue:

queue_[rcu]_work(cgroup_destroy_wq, &css->destroy_rwork);

I don't see how it affects the css_killed_work_fn() cleanup path.
I didn't look at it, since I thought it was irrelevant in this case.

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
@ 2022-04-21 23:43       ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2022-04-21 23:43 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: Michal Koutný,
	cgroups, Zefan Li, Johannes Weiner, Christian Brauner,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, netdev, bpf, stable, linux-kernel,
	syzbot+e42ae441c3b10acf9e9d

Hello,

On Thu, Apr 14, 2022 at 10:51:18AM -0700, Tadeusz Struk wrote:
> What happened was, the write triggered:
> cgroup_subtree_control_write()->cgroup_apply_control()->cgroup_apply_control_enable()->css_create()
> 
> which, allocates and initializes the css, then fails in cgroup_idr_alloc(),
> bails out and calls queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);

Yes, but this css hasn't been installed yet.

> then cgroup_subtree_control_write() bails out to out_unlock:, which then goes:
> 
> cgroup_kn_unlock()->cgroup_put()->css_put()->percpu_ref_put(&css->refcnt)->percpu_ref_put_many(ref)

And this is a different css: cgroup->self, which isn't connected to the
half-built css which got destroyed in css_create().
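
For reference, cgroup_put() is essentially just (abbreviated from the
cgroup internals):

static inline void cgroup_put(struct cgroup *cgrp)
{
        css_put(&cgrp->self);   /* puts the cgroup's own ->self css, not the half-built child css */
}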

So, I have a bit of difficulty following this scenario. The way that the
current code uses destroy_work is definitely nasty and it'd probably be a
good idea to separate out the different use cases, but let's first
understand what's failing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-04-14 16:44   ` Michal Koutný
  (?)
  (?)
@ 2022-04-22  0:00   ` Tejun Heo
  2022-04-22 11:05     ` Michal Koutný
  -1 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2022-04-22  0:00 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tadeusz Struk, cgroups, Zefan Li, Johannes Weiner,
	Christian Brauner, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, netdev, bpf, stable, linux-kernel,
	syzbot+e42ae441c3b10acf9e9d

On Thu, Apr 14, 2022 at 06:44:09PM +0200, Michal Koutný wrote:
> I suspect the double-queuing is a result of the fact that there exists
> only the single reference to the css->refcnt. I.e. it's
> percpu_ref_kill_and_confirm()'d and released both at the same time.
> 
> (Normally (when not killing the last reference), css->destroy_work reuse
> is not a problem because of the sequenced chain
> css_killed_work_fn()->css_put()->css_release().)

If this is the case, we need to hold an extra reference to be put by the
css_killed_work_fn(), right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-04-22  0:00   ` Tejun Heo
@ 2022-04-22 11:05     ` Michal Koutný
  2022-05-18 16:48         ` Tadeusz Struk
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Koutný @ 2022-04-22 11:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tadeusz Struk, cgroups, Zefan Li, Johannes Weiner,
	Christian Brauner, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, netdev, bpf, stable, linux-kernel,
	syzbot+e42ae441c3b10acf9e9d

On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo <tj@kernel.org> wrote:
> If this is the case, we need to hold an extra reference to be put by the
> css_killed_work_fn(), right?

I looked into it a bit more lately and found that there already is such
a fuse in kill_css() [1].
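
Abbreviated, the fuse is the extra css_get() taken before the kill
(roughly, from kernel/cgroup/cgroup.c):

static void kill_css(struct cgroup_subsys_state *css)
{
        ...
        css_clear_dir(css);

        /*
         * Killing would put the base ref, but we need to keep it alive
         * until after ->css_offline().
         */
        css_get(css);

        percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
}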

At the same time, syzbot's stack trace demonstrates the fuse is
ineffective:

> css_release+0xae/0xc0 kernel/cgroup/cgroup.c:5146                    (**)
> percpu_ref_put_many include/linux/percpu-refcount.h:322 [inline]
> percpu_ref_put include/linux/percpu-refcount.h:338 [inline]
> percpu_ref_call_confirm_rcu lib/percpu-refcount.c:162 [inline]        (*)
> percpu_ref_switch_to_atomic_rcu+0x5a2/0x5b0 lib/percpu-refcount.c:199
> rcu_do_batch+0x4f8/0xbc0 kernel/rcu/tree.c:2485
> rcu_core+0x59b/0xe30 kernel/rcu/tree.c:2722
> rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2735
> __do_softirq+0x27e/0x596 kernel/softirq.c:305

(*) this calls css_killed_ref_fn confirm_switch
(**) zero references after confirmed kill?

So, I was also looking at the possible race with css_free_rwork_fn()
(from failed css_create()) but that would likely emit a warning from
__percpu_ref_exit().

So, I still think there's something fishy (so far possible only via
artificial ENOMEM injection) that needs an explanation...

Michal

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup.c#n5608


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-04-22 11:05     ` Michal Koutný
@ 2022-05-18 16:48         ` Tadeusz Struk
  0 siblings, 0 replies; 23+ messages in thread
From: Tadeusz Struk @ 2022-05-18 16:48 UTC (permalink / raw)
  To: Michal Koutný, Tejun Heo
  Cc: cgroups, Zefan Li, Johannes Weiner, Christian Brauner,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, netdev, bpf, stable, linux-kernel,
	syzbot+e42ae441c3b10acf9e9d

On 4/22/22 04:05, Michal Koutný wrote:
> On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo <tj@kernel.org> wrote:
>> If this is the case, we need to hold an extra reference to be put by the
>> css_killed_work_fn(), right?
> 
> I looked into it a bit more lately and found that there already is such
> a fuse in kill_css() [1].
> 
> At the same type syzbots stack trace demonstrates the fuse is
> ineffective
> 
>> css_release+0xae/0xc0 kernel/cgroup/cgroup.c:5146                    (**)
>> percpu_ref_put_many include/linux/percpu-refcount.h:322 [inline]
>> percpu_ref_put include/linux/percpu-refcount.h:338 [inline]
>> percpu_ref_call_confirm_rcu lib/percpu-refcount.c:162 [inline]        (*)
>> percpu_ref_switch_to_atomic_rcu+0x5a2/0x5b0 lib/percpu-refcount.c:199
>> rcu_do_batch+0x4f8/0xbc0 kernel/rcu/tree.c:2485
>> rcu_core+0x59b/0xe30 kernel/rcu/tree.c:2722
>> rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2735
>> __do_softirq+0x27e/0x596 kernel/softirq.c:305
> 
> (*) this calls css_killed_ref_fn confirm_switch
> (**) zero references after confirmed kill?
> 
> So, I was also looking at the possible race with css_free_rwork_fn()
> (from failed css_create()) but that would likely emit a warning from
> __percpu_ref_exit().
> 
> So, I still think there's something fishy (so far possible only via
> artificial ENOMEM injection) that needs an explanation...

I can't reliably reproduce this issue on either mainline or v5.10, where
syzbot originally found it. It still triggers for syzbot, though.

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-05-18 16:48         ` Tadeusz Struk
  (?)
@ 2022-05-19 11:23         ` Hillf Danton
  2022-05-19 23:26           ` Tadeusz Struk
  -1 siblings, 1 reply; 23+ messages in thread
From: Hillf Danton @ 2022-05-19 11:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tadeusz Struk, Michal Koutny, linux-kernel, linux-mm,
	syzbot+e42ae441c3b10acf9e9d

On Wed, 18 May 2022 09:48:21 -0700 Tadeusz Struk  wrote:
> On 4/22/22 04:05, Michal Koutny wrote:
> > On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo <tj@kernel.org> wrote:
> >> If this is the case, we need to hold an extra reference to be put by the
> >> css_killed_work_fn(), right?

OTOH, that put could trigger INIT_WORK in css_release() and a warning [1]
on an "init active (active state 0)" object, as the same
css->destroy_work is used in both the kill and release paths.

Hillf

[1] https://lore.kernel.org/lkml/000000000000ff747805debce6c6@google.com/
> > 
> > I looked into it a bit more lately and found that there already is such
> > a fuse in kill_css() [1].
> > 
> > At the same type syzbots stack trace demonstrates the fuse is
> > ineffective
> > 
> >> css_release+0xae/0xc0 kernel/cgroup/cgroup.c:5146                    (**)
> >> percpu_ref_put_many include/linux/percpu-refcount.h:322 [inline]
> >> percpu_ref_put include/linux/percpu-refcount.h:338 [inline]
> >> percpu_ref_call_confirm_rcu lib/percpu-refcount.c:162 [inline]        (*)
> >> percpu_ref_switch_to_atomic_rcu+0x5a2/0x5b0 lib/percpu-refcount.c:199
> >> rcu_do_batch+0x4f8/0xbc0 kernel/rcu/tree.c:2485
> >> rcu_core+0x59b/0xe30 kernel/rcu/tree.c:2722
> >> rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2735
> >> __do_softirq+0x27e/0x596 kernel/softirq.c:305
> > 
> > (*) this calls css_killed_ref_fn confirm_switch
> > (**) zero references after confirmed kill?
> > 
> > So, I was also looking at the possible race with css_free_rwork_fn()
> > (from failed css_create()) but that would likely emit a warning from
> > __percpu_ref_exit().
> > 
> > So, I still think there's something fishy (so far possible only via
> > artificial ENOMEM injection) that needs an explanation...
> 
> I can't reliably reproduce this issue on neither mainline nor v5.10, where
> syzbot originally found it. It still triggers for syzbot though.
> 
> -- 
> Thanks,
> Tadeusz


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-05-19 11:23         ` Hillf Danton
@ 2022-05-19 23:26           ` Tadeusz Struk
  2022-05-20  8:13             ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Tadeusz Struk @ 2022-05-19 23:26 UTC (permalink / raw)
  To: Hillf Danton, Tejun Heo
  Cc: Michal Koutny, linux-kernel, linux-mm, syzbot+e42ae441c3b10acf9e9d

On 5/19/22 04:23, Hillf Danton wrote:
> On Wed, 18 May 2022 09:48:21 -0700 Tadeusz Struk  wrote:
>> On 4/22/22 04:05, Michal Koutny wrote:
>>> On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo<tj@kernel.org>  wrote:
>>>> If this is the case, we need to hold an extra reference to be put by the
>>>> css_killed_work_fn(), right?
> That put could trigger INIT_WORK in css_release() and warning [1]
> on init active (active state 0) object OTOH as the same
> css->destroy_work is used in both kill and release pathes.

Would it help if there were two WQs, one for the css_release path
and one for the rcu_work?

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..a4873b33e488 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -124,6 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
   * which may lead to deadlock.
   */
  static struct workqueue_struct *cgroup_destroy_wq;
+static struct workqueue_struct *cgroup_destroy_rcu_wq;
  
  /* generate an array of cgroup subsystem pointers */
  #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
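
(Presumably the rcu/free path, i.e. the queue_rcu_work() callers, would
then move to the new queue, something like:

        queue_rcu_work(cgroup_destroy_rcu_wq, &css->destroy_rwork);

while css_release() keeps queuing css->destroy_work on cgroup_destroy_wq.)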

-- 
Thanks,
Tadeusz

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-05-19 23:26           ` Tadeusz Struk
@ 2022-05-20  8:13             ` Tejun Heo
  2022-05-20 16:38               ` Tadeusz Struk
  2022-05-20 23:48               ` Hillf Danton
  0 siblings, 2 replies; 23+ messages in thread
From: Tejun Heo @ 2022-05-20  8:13 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: Hillf Danton, Michal Koutny, linux-kernel, linux-mm,
	syzbot+e42ae441c3b10acf9e9d

On Thu, May 19, 2022 at 04:26:51PM -0700, Tadeusz Struk wrote:
> On 5/19/22 04:23, Hillf Danton wrote:
> > On Wed, 18 May 2022 09:48:21 -0700 Tadeusz Struk  wrote:
> > > On 4/22/22 04:05, Michal Koutny wrote:
> > > > On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo<tj@kernel.org>  wrote:
> > > > > If this is the case, we need to hold an extra reference to be put by the
> > > > > css_killed_work_fn(), right?
> > That put could trigger INIT_WORK in css_release() and warning [1]
> > on init active (active state 0) object OTOH as the same
> > css->destroy_work is used in both kill and release pathes.

Hmm... wouldn't the extra reference keep release from happening?

> Will this help if there would be two WQs, one for the css_release path
> and one for the rcu_work?
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index adb820e98f24..a4873b33e488 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -124,6 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
>   * which may lead to deadlock.
>   */
>  static struct workqueue_struct *cgroup_destroy_wq;
> +static struct workqueue_struct *cgroup_destroy_rcu_wq;

I don't understand why this would help. Care to elaborate?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-05-20  8:13             ` Tejun Heo
@ 2022-05-20 16:38               ` Tadeusz Struk
  2022-05-20 16:42                 ` Michal Koutný
  2022-05-20 23:48               ` Hillf Danton
  1 sibling, 1 reply; 23+ messages in thread
From: Tadeusz Struk @ 2022-05-20 16:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hillf Danton, Michal Koutny, linux-kernel, linux-mm,
	syzbot+e42ae441c3b10acf9e9d

On 5/20/22 01:13, Tejun Heo wrote:
> On Thu, May 19, 2022 at 04:26:51PM -0700, Tadeusz Struk wrote:
>> On 5/19/22 04:23, Hillf Danton wrote:
>>> On Wed, 18 May 2022 09:48:21 -0700 Tadeusz Struk  wrote:
>>>> On 4/22/22 04:05, Michal Koutny wrote:
>>>>> On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo<tj@kernel.org>  wrote:
>>>>>> If this is the case, we need to hold an extra reference to be put by the
>>>>>> css_killed_work_fn(), right?
>>> That put could trigger INIT_WORK in css_release() and warning [1]
>>> on init active (active state 0) object OTOH as the same
>>> css->destroy_work is used in both kill and release paths.
> 
> Hmm... wouldn't the extra reference keep release from happening?
> 
>> Will this help if there would be two WQs, one for the css_release path
>> and one for the rcu_work?
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index adb820e98f24..a4873b33e488 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -124,6 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
>>    * which may lead to deadlock.
>>    */
>>   static struct workqueue_struct *cgroup_destroy_wq;
>> +static struct workqueue_struct *cgroup_destroy_rcu_wq;
> 
> I don't understand why this would help. Care to elaborate?

I think it will help to solve the list corruption issue:

list_add corruption. prev->next should be next (ffff8881f705c060), but was ffff888113123870. (prev=ffff888113123870).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:28!

as this is a result of enqueuing the same css->destroy_work onto the same WQ,
once on the rcu path and once on the css_release path.
I will prototype it today and test it with syzbot.

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-05-20 16:38               ` Tadeusz Struk
@ 2022-05-20 16:42                 ` Michal Koutný
  2022-05-20 16:56                   ` Tadeusz Struk
  2022-05-23 19:00                   ` Tadeusz Struk
  0 siblings, 2 replies; 23+ messages in thread
From: Michal Koutný @ 2022-05-20 16:42 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: Tejun Heo, Hillf Danton, linux-kernel, linux-mm,
	syzbot+e42ae441c3b10acf9e9d

On Fri, May 20, 2022 at 09:38:12AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> as this is a result of enqueuing the same css->destroy_work onto the same WQ,
> one on the rcu path and one on the css_release path.
> I will prototype it today and test with syzbot.

In my understanding, you'd need two independent work_structs in a css,
not two separate workqueues to put the single entry on.

Michal

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-05-20 16:42                 ` Michal Koutný
@ 2022-05-20 16:56                   ` Tadeusz Struk
  2022-05-23 19:00                   ` Tadeusz Struk
  1 sibling, 0 replies; 23+ messages in thread
From: Tadeusz Struk @ 2022-05-20 16:56 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Hillf Danton, linux-kernel, linux-mm,
	syzbot+e42ae441c3b10acf9e9d

On 5/20/22 09:42, Michal Koutný wrote:
> On Fri, May 20, 2022 at 09:38:12AM -0700, Tadeusz Struk<tadeusz.struk@linaro.org>  wrote:
>> as this is a result of enqueuing the same css->destroy_work onto the same WQ,
>> one on the rcu path and one on the css_release path.
>> I will prototype it today and test with syzbot.
> In my understanding, you'd need two independent work_structs in a css,
> not two separate workqueues to put the single entry on.

I think either way would work, but two workqueues would be less churn.
I can try both.

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-05-20  8:13             ` Tejun Heo
  2022-05-20 16:38               ` Tadeusz Struk
@ 2022-05-20 23:48               ` Hillf Danton
  1 sibling, 0 replies; 23+ messages in thread
From: Hillf Danton @ 2022-05-20 23:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tadeusz Struk, Michal Koutny, linux-kernel, linux-mm,
	syzbot+e42ae441c3b10acf9e9d

On Thu, 19 May 2022 22:13:47 -1000 Tejun Heo  wrote:
> On Thu, May 19, 2022 at 04:26:51PM -0700, Tadeusz Struk wrote:
> > On 5/19/22 04:23, Hillf Danton wrote:
> > > On Wed, 18 May 2022 09:48:21 -0700 Tadeusz Struk  wrote:
> > > > On 4/22/22 04:05, Michal Koutny wrote:
> > > > > On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo<tj@kernel.org>  wrote:
> > > > > > If this is the case, we need to hold an extra reference to be put by the
> > > > > > css_killed_work_fn(), right?
> > > 
> > > That put could trigger INIT_WORK in css_release() and warning [1]
> > > on init active (active state 0) object OTOH as the same
> > > css->destroy_work is used in both kill and release pathes.
> 
> Hmm... wouldn't the extra reference keep release from happening?

Hm... Hm... I have difficulty following up, given the risk of a memleak
without release.

> 
> > Will this help if there would be two WQs, one for the css_release path
> > and one for the rcu_work?

Unlikely, as adding one more queue does not help with the INIT_WORK issue.

Hillf
> > 
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index adb820e98f24..a4873b33e488 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -124,6 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
> >   * which may lead to deadlock.
> >   */
> >  static struct workqueue_struct *cgroup_destroy_wq;
> > +static struct workqueue_struct *cgroup_destroy_rcu_wq;
> 
> I don't understand why this would help. Care to elaborate?
> 
> Thanks.
> 
> -- 
> tejun


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-05-20 16:42                 ` Michal Koutný
  2022-05-20 16:56                   ` Tadeusz Struk
@ 2022-05-23 19:00                   ` Tadeusz Struk
  2022-05-23 19:02                     ` Tejun Heo
  1 sibling, 1 reply; 23+ messages in thread
From: Tadeusz Struk @ 2022-05-23 19:00 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Hillf Danton, linux-kernel, linux-mm,
	syzbot+e42ae441c3b10acf9e9d

On 5/20/22 09:42, Michal Koutný wrote:
> On Fri, May 20, 2022 at 09:38:12AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
>> as this is a result of enqueuing the same css->destroy_work onto the same WQ,
>> one on the rcu path and one on the css_release path.
>> I will prototype it today and test with syzbot.
> 
> In my understanding, you'd need two independent work_structs in a css,
> not two separate workqueues to put the single entry on.

Yes, separating the css_killed_ref and css_release paths with two separate work_structs
fixes the two syzbot list corruption issues [1] & [2].
I tested it on mainline v5.18.0 and on v5.10.117.
In the case of [2], mainline triggers an issue, but it is unrelated to the list corruption.

[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd
[2] https://syzkaller.appspot.com/bug?id=3c7ff113ccb695e839b859da3fc481c36eb1cfd5

I will send a patch soon.

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-05-23 19:00                   ` Tadeusz Struk
@ 2022-05-23 19:02                     ` Tejun Heo
  2022-05-23 19:08                       ` Tadeusz Struk
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2022-05-23 19:02 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: Michal Koutný,
	Hillf Danton, linux-kernel, linux-mm,
	syzbot+e42ae441c3b10acf9e9d

On Mon, May 23, 2022 at 12:00:06PM -0700, Tadeusz Struk wrote:
> On 5/20/22 09:42, Michal Koutný wrote:
> > On Fri, May 20, 2022 at 09:38:12AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> > > as this is a result of enqueuing the same css->destroy_work onto the same WQ,
> > > one on the rcu path and one on the css_release path.
> > > I will prototype it today and test with syzbot.
> > 
> > In my understanding, you'd need two independent work_structs in a css,
> > not two separate workqueues to put the single entry on.
> 
> Yes, separating the css_killed_ref and css_release paths with two separate work_structs
> fixes the two syzbot list corruption issues [1]&[2].
> I tested it on mainline v5.18.0 and v5.10.117
> In case of [2] the mainline triggers an issue, but it is unrelated to list corruption.

Can you try holding an extra ref in the killed path?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-05-23 19:02                     ` Tejun Heo
@ 2022-05-23 19:08                       ` Tadeusz Struk
  2022-05-23 20:05                         ` Tadeusz Struk
  0 siblings, 1 reply; 23+ messages in thread
From: Tadeusz Struk @ 2022-05-23 19:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Hillf Danton, linux-kernel, linux-mm,
	syzbot+e42ae441c3b10acf9e9d

On 5/23/22 12:02, Tejun Heo wrote:
> Can you try holding an extra ref in the killed path?

Yes, I can try that. Should that be with or without the extra work_struct?

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cgroup: don't queue css_release_work if one already pending
  2022-05-23 19:08                       ` Tadeusz Struk
@ 2022-05-23 20:05                         ` Tadeusz Struk
  0 siblings, 0 replies; 23+ messages in thread
From: Tadeusz Struk @ 2022-05-23 20:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Hillf Danton, linux-kernel, linux-mm,
	syzbot+e42ae441c3b10acf9e9d

On 5/23/22 12:08, Tadeusz Struk wrote:
> On 5/23/22 12:02, Tejun Heo wrote:
>> Can you try holding an extra ref in the killed path?
> 
> Yes, I can try that. Should that be with or without the extra work_struct?

With this patch:
https://syzkaller.appspot.com/text?tag=Patch&x=1264b23df00000

The issue still triggers:
https://syzkaller.appspot.com/text?tag=CrashReport&x=162b5781f00000

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2] cgroups: separate destroy_work into two separate wq
  2022-04-12 19:24 [PATCH] cgroup: don't queue css_release_work if one already pending Tadeusz Struk
  2022-04-14 16:44   ` Michal Koutný
@ 2022-05-23 21:27 ` Tadeusz Struk
  1 sibling, 0 replies; 23+ messages in thread
From: Tadeusz Struk @ 2022-05-23 21:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tadeusz Struk, Zefan Li, Johannes Weiner, Christian Brauner,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, cgroups, netdev, bpf, stable, linux-kernel,
	syzbot+e42ae441c3b10acf9e9d

Syzbot found a corrupted list bug scenario that can be triggered from
cgroup css_create(). The reproducer writes to the cgroup.subtree_control
file, which invokes cgroup_apply_control_enable(), css_create(), and
css_populate_dir(), which then randomly fails with a fault-injected -ENOMEM.
In such a scenario the css_create() error path rcu-enqueues the
css_free_rwork_fn work for a css->refcnt initialized with the css_release()
destructor, and there is a chance that the css_release() function will be
invoked for a cgroup_subsys_state for which a destroy_work has already been
queued via the css_create() error path. This causes a list_add corruption,
as can be seen in the syzkaller report [1].
This can be fixed by separating the css_release and ref_kill paths so that
they work with two separate work_structs.

[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd

Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: <cgroups@vger.kernel.org>
Cc: <netdev@vger.kernel.org>
Cc: <bpf@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: <linux-kernel@vger.kernel.org>

Reported-and-tested-by: syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com
Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item")
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
---
v2: Add a separate work_struct for the css_ref_kill path instead of
    checking if a work has already been enqueued.
---
 include/linux/cgroup-defs.h |  5 +++--
 kernel/cgroup/cgroup.c      | 14 +++++++-------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..92b0c5e8c472 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -178,8 +178,9 @@ struct cgroup_subsys_state {
 	 */
 	atomic_t online_cnt;
 
-	/* percpu_ref killing and RCU release */
-	struct work_struct destroy_work;
+	/* percpu_ref killing, css release, and RCU release work structs */
+	struct work_struct release_work;
+	struct work_struct killed_ref_work;
 	struct rcu_work destroy_rwork;
 
 	/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..3e00a793e15d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5099,7 +5099,7 @@ static struct cftype cgroup_base_files[] = {
  *    css_free_work_fn().
  *
  * It is actually hairier because both step 2 and 4 require process context
- * and thus involve punting to css->destroy_work adding two additional
+ * and thus involve punting to css->release_work adding two additional
  * steps to the already complex sequence.
  */
 static void css_free_rwork_fn(struct work_struct *work)
@@ -5154,7 +5154,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 static void css_release_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, release_work);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
 
@@ -5210,8 +5210,8 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	INIT_WORK(&css->release_work, css_release_work_fn);
+	queue_work(cgroup_destroy_wq, &css->release_work);
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5546,7 +5546,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 static void css_killed_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, killed_ref_work);
 
 	mutex_lock(&cgroup_mutex);
 
@@ -5567,8 +5567,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
-		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_destroy_wq, &css->destroy_work);
+		INIT_WORK(&css->killed_ref_work, css_killed_work_fn);
+		queue_work(cgroup_destroy_wq, &css->killed_ref_work);
 	}
 }
 
-- 
2.36.1

^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-05-23 21:30 UTC | newest]

Thread overview: 23+ messages
2022-04-12 19:24 [PATCH] cgroup: don't queue css_release_work if one already pending Tadeusz Struk
2022-04-14 16:44 ` Michal Koutný
2022-04-14 17:51   ` Tadeusz Struk
2022-04-21 23:43     ` Tejun Heo
2022-04-22  0:00   ` Tejun Heo
2022-04-22 11:05     ` Michal Koutný
2022-05-18 16:48       ` Tadeusz Struk
2022-05-19 11:23         ` Hillf Danton
2022-05-19 23:26           ` Tadeusz Struk
2022-05-20  8:13             ` Tejun Heo
2022-05-20 16:38               ` Tadeusz Struk
2022-05-20 16:42                 ` Michal Koutný
2022-05-20 16:56                   ` Tadeusz Struk
2022-05-23 19:00                   ` Tadeusz Struk
2022-05-23 19:02                     ` Tejun Heo
2022-05-23 19:08                       ` Tadeusz Struk
2022-05-23 20:05                         ` Tadeusz Struk
2022-05-20 23:48               ` Hillf Danton
2022-05-23 21:27 ` [PATCH v2] cgroups: separate destroy_work into two separate wq Tadeusz Struk
