stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] sched/psi: fix use-after-free in ep_remove_wait_queue()
       [not found] <CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com>
@ 2023-02-02  3:00 ` Munehisa Kamata
  2023-02-02  4:56   ` Eric Biggers
  0 siblings, 1 reply; 14+ messages in thread
From: Munehisa Kamata @ 2023-02-02  3:00 UTC (permalink / raw)
  To: surenb
  Cc: ebiggers, hannes, hdanton, kamatam, linux-kernel, linux-mm,
	mengcc, stable

If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed without clearing the queue and reference, then it can
result in use-after-free as below. Use wake_up_pollfree() instead to
ensure that the queue and reference are cleared before freeing.

 BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
 Write of size 4 at addr ffff88810e625328 by task a.out/4404

 CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
 Call Trace:
 <TASK>
 dump_stack_lvl+0x73/0xa0
 print_report+0x16c/0x4e0
 ? _printk+0x59/0x80
 ? __virt_addr_valid+0xb8/0x130
 ? _raw_spin_lock_irqsave+0x60/0xc0
 kasan_report+0xc3/0xf0
 ? _raw_spin_lock_irqsave+0x60/0xc0
 kasan_check_range+0x2d2/0x310
 _raw_spin_lock_irqsave+0x60/0xc0
 remove_wait_queue+0x1a/0xa0
 ep_free+0x12c/0x170
 ep_eventpoll_release+0x26/0x30
 __fput+0x202/0x400
 task_work_run+0x11d/0x170
 do_exit+0x495/0x1130
 ? update_cfs_rq_load_avg+0x2c2/0x2e0
 do_group_exit+0x100/0x100
 get_signal+0xd67/0xde0
 ? finish_task_switch+0x15f/0x3a0
 arch_do_signal_or_restart+0x2a/0x2b0
 exit_to_user_mode_prepare+0x94/0x100
 syscall_exit_to_user_mode+0x20/0x40
 do_syscall_64+0x52/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x7f8e392bfb91
 Code: Unable to access opcode bytes at 0x7f8e392bfb67.
 RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
 RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
 RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
 RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
 R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
 R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

 Allocated by task 4404:
 kasan_set_track+0x3d/0x60
 __kasan_kmalloc+0x85/0x90
 psi_trigger_create+0x113/0x3e0
 pressure_write+0x146/0x2e0
 cgroup_file_write+0x11c/0x250
 kernfs_fop_write_iter+0x186/0x220
 vfs_write+0x3d8/0x5c0
 ksys_write+0x90/0x110
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

 Freed by task 4407:
 kasan_set_track+0x3d/0x60
 kasan_save_free_info+0x27/0x40
 ____kasan_slab_free+0x11d/0x170
 slab_free_freelist_hook+0x87/0x150
 __kmem_cache_free+0xcb/0x180
 psi_trigger_destroy+0x2e8/0x310
 cgroup_file_release+0x4f/0xb0
 kernfs_drain_open_files+0x165/0x1f0
 kernfs_drain+0x162/0x1a0
 __kernfs_remove+0x1fb/0x310
 kernfs_remove_by_name_ns+0x95/0xe0
 cgroup_addrm_files+0x67f/0x700
 cgroup_destroy_locked+0x283/0x3c0
 cgroup_rmdir+0x29/0x100
 kernfs_iop_rmdir+0xd1/0x140
 vfs_rmdir+0xfe/0x240
 do_rmdir+0x13d/0x280
 __x64_sys_rmdir+0x2c/0x30
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Cc: stable@vger.kernel.org
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
---
 kernel/sched/psi.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8ac8b81bfee6..6e66c15f6450 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
 
 	group = t->group;
 	/*
-	 * Wakeup waiters to stop polling. Can happen if cgroup is deleted
-	 * from under a polling process.
+	 * Wakeup waiters to stop polling and clear the queue to prevent it from
+	 * being accessed later. Can happen if cgroup is deleted from under a
+	 * polling process otherwise.
 	 */
-	wake_up_interruptible(&t->event_wait);
+	wake_up_pollfree(&t->event_wait);
 
 	mutex_lock(&group->trigger_lock);
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-02  3:00 ` [PATCH] sched/psi: fix use-after-free in ep_remove_wait_queue() Munehisa Kamata
@ 2023-02-02  4:56   ` Eric Biggers
  2023-02-02 21:11     ` Suren Baghdasaryan
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Biggers @ 2023-02-02  4:56 UTC (permalink / raw)
  To: Munehisa Kamata
  Cc: surenb, hannes, hdanton, linux-kernel, linux-mm, mengcc, stable

On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 8ac8b81bfee6..6e66c15f6450 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
>  
>  	group = t->group;
>  	/*
> -	 * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> -	 * from under a polling process.
> +	 * Wakeup waiters to stop polling and clear the queue to prevent it from
> +	 * being accessed later. Can happen if cgroup is deleted from under a
> +	 * polling process otherwise.
>  	 */
> -	wake_up_interruptible(&t->event_wait);
> +	wake_up_pollfree(&t->event_wait);
>  
>  	mutex_lock(&group->trigger_lock);

wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
lifetime of the waitqueue be fixed instead?

- Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-02  4:56   ` Eric Biggers
@ 2023-02-02 21:11     ` Suren Baghdasaryan
  2023-02-09 17:09       ` Suren Baghdasaryan
  0 siblings, 1 reply; 14+ messages in thread
From: Suren Baghdasaryan @ 2023-02-02 21:11 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Munehisa Kamata, hannes, hdanton, linux-kernel, linux-mm, mengcc, stable

On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index 8ac8b81bfee6..6e66c15f6450 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> >
> >       group = t->group;
> >       /*
> > -      * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > -      * from under a polling process.
> > +      * Wakeup waiters to stop polling and clear the queue to prevent it from
> > +      * being accessed later. Can happen if cgroup is deleted from under a
> > +      * polling process otherwise.
> >        */
> > -     wake_up_interruptible(&t->event_wait);
> > +     wake_up_pollfree(&t->event_wait);
> >
> >       mutex_lock(&group->trigger_lock);
>
> wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
> lifetime of the waitqueue be fixed instead?

waitqueue lifetime in this case is linked to cgroup_file_release(),
which seems appropriate to me here. Unfortunately
cgroup_file_release() is not directly linked to the file's lifetime.
For more details see:
https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
.
So, if we want to fix the lifetime of the waitqueue, we would have to
tie cgroup_file_release() to the fput() somehow. IOW, the fix would
have to be done at the cgroups or higher (kernfs?) layer.
Thanks,
Suren.

>
> - Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-02 21:11     ` Suren Baghdasaryan
@ 2023-02-09 17:09       ` Suren Baghdasaryan
  2023-02-09 18:46         ` Eric Biggers
  0 siblings, 1 reply; 14+ messages in thread
From: Suren Baghdasaryan @ 2023-02-09 17:09 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Munehisa Kamata, hannes, hdanton, linux-kernel, linux-mm, mengcc, stable

On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > --- a/kernel/sched/psi.c
> > > +++ b/kernel/sched/psi.c
> > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > >
> > >       group = t->group;
> > >       /*
> > > -      * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > -      * from under a polling process.
> > > +      * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > +      * being accessed later. Can happen if cgroup is deleted from under a
> > > +      * polling process otherwise.
> > >        */
> > > -     wake_up_interruptible(&t->event_wait);
> > > +     wake_up_pollfree(&t->event_wait);
> > >
> > >       mutex_lock(&group->trigger_lock);
> >
> > wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
> > lifetime of the waitqueue be fixed instead?
>
> waitqueue lifetime in this case is linked to cgroup_file_release(),
> which seems appropriate to me here. Unfortunately
> cgroup_file_release() is not directly linked to the file's lifetime.
> For more details see:
> https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> .
> So, if we want to fix the lifetime of the waitqueue, we would have to
> tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> have to be done at the cgroups or higher (kernfs?) layer.

Hi Eric,
Do you still object to using wake_up_pollfree() for this case?
Changing higher levels to make cgroup_file_release() be tied to fput()
would be ideal but I think that would be a big change for this one
case. If you agree I'll Ack this patch.
Thanks,
Suren.

> Thanks,
> Suren.
>
> >
> > - Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-09 17:09       ` Suren Baghdasaryan
@ 2023-02-09 18:46         ` Eric Biggers
  2023-02-09 19:13           ` Suren Baghdasaryan
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Biggers @ 2023-02-09 18:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Munehisa Kamata, hannes, hdanton, linux-kernel, linux-mm, mengcc, stable

On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:
> On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > >
> > > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > > --- a/kernel/sched/psi.c
> > > > +++ b/kernel/sched/psi.c
> > > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > > >
> > > >       group = t->group;
> > > >       /*
> > > > -      * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > > -      * from under a polling process.
> > > > +      * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > > +      * being accessed later. Can happen if cgroup is deleted from under a
> > > > +      * polling process otherwise.
> > > >        */
> > > > -     wake_up_interruptible(&t->event_wait);
> > > > +     wake_up_pollfree(&t->event_wait);
> > > >
> > > >       mutex_lock(&group->trigger_lock);
> > >
> > > wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
> > > lifetime of the waitqueue be fixed instead?
> >
> > waitqueue lifetime in this case is linked to cgroup_file_release(),
> > which seems appropriate to me here. Unfortunately
> > cgroup_file_release() is not directly linked to the file's lifetime.
> > For more details see:
> > https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> > .
> > So, if we want to fix the lifetime of the waitqueue, we would have to
> > tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> > have to be done at the cgroups or higher (kernfs?) layer.
> 
> Hi Eric,
> Do you still object to using wake_up_pollfree() for this case?
> Changing higher levels to make cgroup_file_release() be tied to fput()
> would be ideal but I think that would be a big change for this one
> case. If you agree I'll Ack this patch.
> Thanks,
> Suren.
> 

I haven't read the code closely in this case.  I'm just letting you know that
wake_up_pollfree() is very much a last-resort option for when the waitqueue
lifetime can't be fixed.  So if you want to use wake_up_pollfree(), you need to
explain why no other fix is possible.  For example maybe the UAPI depends on the
waitqueue having a nonstandard lifetime.

- Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-09 18:46         ` Eric Biggers
@ 2023-02-09 19:13           ` Suren Baghdasaryan
  2023-02-13 23:50             ` Suren Baghdasaryan
  0 siblings, 1 reply; 14+ messages in thread
From: Suren Baghdasaryan @ 2023-02-09 19:13 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Munehisa Kamata, hannes, hdanton, linux-kernel, linux-mm, mengcc, stable

On Thu, Feb 9, 2023 at 10:46 AM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:
> > On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > > >
> > > > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > > > --- a/kernel/sched/psi.c
> > > > > +++ b/kernel/sched/psi.c
> > > > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > > > >
> > > > >       group = t->group;
> > > > >       /*
> > > > > -      * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > > > -      * from under a polling process.
> > > > > +      * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > > > +      * being accessed later. Can happen if cgroup is deleted from under a
> > > > > +      * polling process otherwise.
> > > > >        */
> > > > > -     wake_up_interruptible(&t->event_wait);
> > > > > +     wake_up_pollfree(&t->event_wait);
> > > > >
> > > > >       mutex_lock(&group->trigger_lock);
> > > >
> > > > wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
> > > > lifetime of the waitqueue be fixed instead?
> > >
> > > waitqueue lifetime in this case is linked to cgroup_file_release(),
> > > which seems appropriate to me here. Unfortunately
> > > cgroup_file_release() is not directly linked to the file's lifetime.
> > > For more details see:
> > > https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> > > .
> > > So, if we want to fix the lifetime of the waitqueue, we would have to
> > > tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> > > have to be done at the cgroups or higher (kernfs?) layer.
> >
> > Hi Eric,
> > Do you still object to using wake_up_pollfree() for this case?
> > Changing higher levels to make cgroup_file_release() be tied to fput()
> > would be ideal but I think that would be a big change for this one
> > case. If you agree I'll Ack this patch.
> > Thanks,
> > Suren.
> >
>
> I haven't read the code closely in this case.  I'm just letting you know that
> wake_up_pollfree() is very much a last-resort option for when the waitqueue
> lifetime can't be fixed.

Got it. Thanks for the warning.
I think it can be fixed but the right fix would require a sizable
higher level refactoring which might be more justifiable if we have
more such cases in the future.

>  So if you want to use wake_up_pollfree(), you need to
> explain why no other fix is possible.  For example maybe the UAPI depends on the
> waitqueue having a nonstandard lifetime.

I think the changelog should explain that the waitqueue lifetime in
cases of non-root cgroups is tied to cgroup_file_release() callback,
which in turn is not tied to file's lifetime. That's the reason for
waitqueue and the file having different lifecycles. Would that suffice
as the justification?
Again, I'm not saying that no other fix is possible, but that the
right fix would be much more complex.
Thanks,
Suren.

>
> - Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-09 19:13           ` Suren Baghdasaryan
@ 2023-02-13 23:50             ` Suren Baghdasaryan
  2023-02-14  7:04               ` [PATCH v2] " Munehisa Kamata
  0 siblings, 1 reply; 14+ messages in thread
From: Suren Baghdasaryan @ 2023-02-13 23:50 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Munehisa Kamata, hannes, hdanton, linux-kernel, linux-mm, mengcc, stable

On Thu, Feb 9, 2023 at 11:13 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Feb 9, 2023 at 10:46 AM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > > > >
> > > > > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > > > > --- a/kernel/sched/psi.c
> > > > > > +++ b/kernel/sched/psi.c
> > > > > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > > > > >
> > > > > >       group = t->group;
> > > > > >       /*
> > > > > > -      * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > > > > -      * from under a polling process.
> > > > > > +      * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > > > > +      * being accessed later. Can happen if cgroup is deleted from under a
> > > > > > +      * polling process otherwise.
> > > > > >        */
> > > > > > -     wake_up_interruptible(&t->event_wait);
> > > > > > +     wake_up_pollfree(&t->event_wait);
> > > > > >
> > > > > >       mutex_lock(&group->trigger_lock);
> > > > >
> > > > > wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
> > > > > lifetime of the waitqueue be fixed instead?
> > > >
> > > > waitqueue lifetime in this case is linked to cgroup_file_release(),
> > > > which seems appropriate to me here. Unfortunately
> > > > cgroup_file_release() is not directly linked to the file's lifetime.
> > > > For more details see:
> > > > https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> > > > .
> > > > So, if we want to fix the lifetime of the waitqueue, we would have to
> > > > tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> > > > have to be done at the cgroups or higher (kernfs?) layer.
> > >
> > > Hi Eric,
> > > Do you still object to using wake_up_pollfree() for this case?
> > > Changing higher levels to make cgroup_file_release() be tied to fput()
> > > would be ideal but I think that would be a big change for this one
> > > case. If you agree I'll Ack this patch.
> > > Thanks,
> > > Suren.
> > >
> >
> > I haven't read the code closely in this case.  I'm just letting you know that
> > wake_up_pollfree() is very much a last-resort option for when the waitqueue
> > lifetime can't be fixed.
>
> Got it. Thanks for the warning.
> I think it can be fixed but the right fix would require a sizable
> higher level refactoring which might be more justifiable if we have
> more such cases in the future.
>
> >  So if you want to use wake_up_pollfree(), you need to
> > explain why no other fix is possible.  For example maybe the UAPI depends on the
> > waitqueue having a nonstandard lifetime.
>
> I think the changelog should explain that the waitqueue lifetime in
> cases of non-root cgroups is tied to cgroup_file_release() callback,
> which in turn is not tied to file's lifetime. That's the reason for
> waitqueue and the file having different lifecycles. Would that suffice
> as the justification?

Ok, in the absence of objections, I would suggest resending this patch
with the changelog including details about waitqueue lifetime and
reasons wake_up_pollfree() is required here.
Munehisa, feel free to reuse
https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
if you find it useful.
Thanks,
Suren.

> Again, I'm not saying that no other fix is possible, but that the
> right fix would be much more complex.
> Thanks,
> Suren.
>
> >
> > - Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-13 23:50             ` Suren Baghdasaryan
@ 2023-02-14  7:04               ` Munehisa Kamata
  2023-02-14 17:10                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 14+ messages in thread
From: Munehisa Kamata @ 2023-02-14  7:04 UTC (permalink / raw)
  To: surenb
  Cc: ebiggers, hannes, hdanton, kamatam, linux-kernel, linux-mm,
	mengcc, stable

If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed without clearing the queue and reference in the
following path.

 do_rmdir
   cgroup_rmdir
     kernfs_drain_open_files
       cgroup_file_release
         cgroup_pressure_release
           psi_trigger_destroy

However, the polling thread can keep having the last reference to the
pressure file that is tied to the freed waitqueue until explicit close or
exit later.

 fput
   ep_eventpoll_release
     ep_free
       ep_remove_wait_queue
         remove_wait_queue

Then, the thread accesses to the already-freed waitqueue when dropping the
reference and results in use-after-free as pasted below.

The fundamental problem here is that the lifetime of the waitqueue is not
tied to the file's real lifetime as shown above. Using wake_up_pollfree()
here might be less than ideal, but it also is not fully contradicting the
comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the
waitqueue's lifetime is not tied to file's one and can be considered as
another special case. While this would be fixable by somehow making
cgroup_file_release() be tied to the fput(), it would require sizable
refactoring at cgroups or higher layer which might be more justifiable if
we identify more cases like this.

 BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
 Write of size 4 at addr ffff88810e625328 by task a.out/4404

 CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
 Call Trace:
 <TASK>
 dump_stack_lvl+0x73/0xa0
 print_report+0x16c/0x4e0
 ? _printk+0x59/0x80
 ? __virt_addr_valid+0xb8/0x130
 ? _raw_spin_lock_irqsave+0x60/0xc0
 kasan_report+0xc3/0xf0
 ? _raw_spin_lock_irqsave+0x60/0xc0
 kasan_check_range+0x2d2/0x310
 _raw_spin_lock_irqsave+0x60/0xc0
 remove_wait_queue+0x1a/0xa0
 ep_free+0x12c/0x170
 ep_eventpoll_release+0x26/0x30
 __fput+0x202/0x400
 task_work_run+0x11d/0x170
 do_exit+0x495/0x1130
 ? update_cfs_rq_load_avg+0x2c2/0x2e0
 do_group_exit+0x100/0x100
 get_signal+0xd67/0xde0
 ? finish_task_switch+0x15f/0x3a0
 arch_do_signal_or_restart+0x2a/0x2b0
 exit_to_user_mode_prepare+0x94/0x100
 syscall_exit_to_user_mode+0x20/0x40
 do_syscall_64+0x52/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x7f8e392bfb91
 Code: Unable to access opcode bytes at 0x7f8e392bfb67.
 RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
 RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
 RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
 RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
 R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
 R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

 Allocated by task 4404:
 kasan_set_track+0x3d/0x60
 __kasan_kmalloc+0x85/0x90
 psi_trigger_create+0x113/0x3e0
 pressure_write+0x146/0x2e0
 cgroup_file_write+0x11c/0x250
 kernfs_fop_write_iter+0x186/0x220
 vfs_write+0x3d8/0x5c0
 ksys_write+0x90/0x110
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

 Freed by task 4407:
 kasan_set_track+0x3d/0x60
 kasan_save_free_info+0x27/0x40
 ____kasan_slab_free+0x11d/0x170
 slab_free_freelist_hook+0x87/0x150
 __kmem_cache_free+0xcb/0x180
 psi_trigger_destroy+0x2e8/0x310
 cgroup_file_release+0x4f/0xb0
 kernfs_drain_open_files+0x165/0x1f0
 kernfs_drain+0x162/0x1a0
 __kernfs_remove+0x1fb/0x310
 kernfs_remove_by_name_ns+0x95/0xe0
 cgroup_addrm_files+0x67f/0x700
 cgroup_destroy_locked+0x283/0x3c0
 cgroup_rmdir+0x29/0x100
 kernfs_iop_rmdir+0xd1/0x140
 vfs_rmdir+0xfe/0x240
 do_rmdir+0x13d/0x280
 __x64_sys_rmdir+0x2c/0x30
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

v2: updated commit message

Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Cc: stable@vger.kernel.org
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
---
 kernel/sched/psi.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8ac8b81bfee6..6e66c15f6450 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
 
 	group = t->group;
 	/*
-	 * Wakeup waiters to stop polling. Can happen if cgroup is deleted
-	 * from under a polling process.
+	 * Wakeup waiters to stop polling and clear the queue to prevent it from
+	 * being accessed later. Can happen if cgroup is deleted from under a
+	 * polling process otherwise.
 	 */
-	wake_up_interruptible(&t->event_wait);
+	wake_up_pollfree(&t->event_wait);
 
 	mutex_lock(&group->trigger_lock);
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-14  7:04               ` [PATCH v2] " Munehisa Kamata
@ 2023-02-14 17:10                 ` Suren Baghdasaryan
  2023-02-14 18:13                   ` [PATCH v3] " Munehisa Kamata
  0 siblings, 1 reply; 14+ messages in thread
From: Suren Baghdasaryan @ 2023-02-14 17:10 UTC (permalink / raw)
  To: Munehisa Kamata
  Cc: ebiggers, hannes, hdanton, linux-kernel, linux-mm, mengcc, stable

Thanks!
Overall LGTM, just a couple of nits (simplifications):

On Mon, Feb 13, 2023 at 11:04 PM Munehisa Kamata <kamatam@amazon.com> wrote:
>
> If a non-root cgroup gets removed when there is a thread that registered
> trigger and is polling on a pressure file within the cgroup, the polling
> waitqueue gets freed without clearing the queue and reference in the
> following path.

Let's remove "without clearing the queue and reference" in the above
sentence. The next section explains why this is problematic, therefore
mentioning that here is unnecessary IMHO.

>
>  do_rmdir
>    cgroup_rmdir
>      kernfs_drain_open_files
>        cgroup_file_release
>          cgroup_pressure_release
>            psi_trigger_destroy
>
> However, the polling thread can keep having the last reference to the
> pressure file that is tied to the freed waitqueue until explicit close or
> exit later.

Suggest replacing: However, the polling thread still has a reference
to the pressure file it is polling and will access the freed waitqueue
when file is closed or upon exit:

>
>  fput
>    ep_eventpoll_release
>      ep_free
>        ep_remove_wait_queue
>          remove_wait_queue
>
> Then, the thread accesses to the already-freed waitqueue when dropping the
> reference and results in use-after-free as pasted below.

Suggest replacing: This results is use-after-free as pasted below.

>
> The fundamental problem here is that the lifetime of the waitqueue is not
> tied to the file's real lifetime as shown above.

The fundamental problem here is that cgroup_file_release() (and
consequently waitqueue's lifetime) is not tied to the file's real
lifetime.

> Using wake_up_pollfree()
> here might be less than ideal, but it also is not fully contradicting the
> comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the
> waitqueue's lifetime is not tied to file's one and can be considered as
> another special case. While this would be fixable by somehow making
> cgroup_file_release() be tied to the fput(), it would require sizable
> refactoring at cgroups or higher layer which might be more justifiable if
> we identify more cases like this.
>
>  BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
>  Write of size 4 at addr ffff88810e625328 by task a.out/4404
>
>  CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
>  Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
>  Call Trace:
>  <TASK>
>  dump_stack_lvl+0x73/0xa0
>  print_report+0x16c/0x4e0
>  ? _printk+0x59/0x80
>  ? __virt_addr_valid+0xb8/0x130
>  ? _raw_spin_lock_irqsave+0x60/0xc0
>  kasan_report+0xc3/0xf0
>  ? _raw_spin_lock_irqsave+0x60/0xc0
>  kasan_check_range+0x2d2/0x310
>  _raw_spin_lock_irqsave+0x60/0xc0
>  remove_wait_queue+0x1a/0xa0
>  ep_free+0x12c/0x170
>  ep_eventpoll_release+0x26/0x30
>  __fput+0x202/0x400
>  task_work_run+0x11d/0x170
>  do_exit+0x495/0x1130
>  ? update_cfs_rq_load_avg+0x2c2/0x2e0
>  do_group_exit+0x100/0x100
>  get_signal+0xd67/0xde0
>  ? finish_task_switch+0x15f/0x3a0
>  arch_do_signal_or_restart+0x2a/0x2b0
>  exit_to_user_mode_prepare+0x94/0x100
>  syscall_exit_to_user_mode+0x20/0x40
>  do_syscall_64+0x52/0x90
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>  RIP: 0033:0x7f8e392bfb91
>  Code: Unable to access opcode bytes at 0x7f8e392bfb67.
>  RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
>  RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
>  RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
>  RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
>  R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
>  R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
>  </TASK>
>
>  Allocated by task 4404:
>  kasan_set_track+0x3d/0x60
>  __kasan_kmalloc+0x85/0x90
>  psi_trigger_create+0x113/0x3e0
>  pressure_write+0x146/0x2e0
>  cgroup_file_write+0x11c/0x250
>  kernfs_fop_write_iter+0x186/0x220
>  vfs_write+0x3d8/0x5c0
>  ksys_write+0x90/0x110
>  do_syscall_64+0x43/0x90
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
>  Freed by task 4407:
>  kasan_set_track+0x3d/0x60
>  kasan_save_free_info+0x27/0x40
>  ____kasan_slab_free+0x11d/0x170
>  slab_free_freelist_hook+0x87/0x150
>  __kmem_cache_free+0xcb/0x180
>  psi_trigger_destroy+0x2e8/0x310
>  cgroup_file_release+0x4f/0xb0
>  kernfs_drain_open_files+0x165/0x1f0
>  kernfs_drain+0x162/0x1a0
>  __kernfs_remove+0x1fb/0x310
>  kernfs_remove_by_name_ns+0x95/0xe0
>  cgroup_addrm_files+0x67f/0x700
>  cgroup_destroy_locked+0x283/0x3c0
>  cgroup_rmdir+0x29/0x100
>  kernfs_iop_rmdir+0xd1/0x140
>  vfs_rmdir+0xfe/0x240
>  do_rmdir+0x13d/0x280
>  __x64_sys_rmdir+0x2c/0x30
>  do_syscall_64+0x43/0x90
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> v2: updated commit message
>
> Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
> Fixes: 0e94682b73bf ("psi: introduce psi monitor")
> Cc: stable@vger.kernel.org
> Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
> Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
> ---
>  kernel/sched/psi.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 8ac8b81bfee6..6e66c15f6450 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
>
>         group = t->group;
>         /*
> -        * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> -        * from under a polling process.
> +        * Wakeup waiters to stop polling and clear the queue to prevent it from
> +        * being accessed later. Can happen if cgroup is deleted from under a
> +        * polling process otherwise.

This "otherwise" at the end seems extra. Was there a continuation of
this comment which was removed without removing this "otherwise" ?

>          */
> -       wake_up_interruptible(&t->event_wait);
> +       wake_up_pollfree(&t->event_wait);
>
>         mutex_lock(&group->trigger_lock);
>
> --
> 2.38.1
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-14 17:10                 ` Suren Baghdasaryan
@ 2023-02-14 18:13                   ` Munehisa Kamata
  2023-02-14 18:28                     ` Suren Baghdasaryan
  2023-02-14 18:55                     ` Eric Biggers
  0 siblings, 2 replies; 14+ messages in thread
From: Munehisa Kamata @ 2023-02-14 18:13 UTC (permalink / raw)
  To: surenb
  Cc: ebiggers, hannes, hdanton, kamatam, linux-kernel, linux-mm,
	mengcc, stable

If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed in the following path.

 do_rmdir
   cgroup_rmdir
     kernfs_drain_open_files
       cgroup_file_release
         cgroup_pressure_release
           psi_trigger_destroy

However, the polling thread still has a reference to the pressure file and
will access the freed waitqueue when the file is closed or upon exit.

 fput
   ep_eventpoll_release
     ep_free
       ep_remove_wait_queue
         remove_wait_queue

This results in use-after-free as pasted below.

The fundamental problem here is that cgroup_file_release() (and
consequently waitqueue's lifetime) is not tied to the file's real lifetime.
Using wake_up_pollfree() here might be less than ideal, but it also is not
fully contradicting the comment at commit 42288cb44c4b ("wait: add
wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's
one and can be considered as another special case. While this would be
fixable by somehow making cgroup_file_release() be tied to the fput(), it
would require sizable refactoring at cgroups or higher layer which might be
more justifiable if we identify more cases like this.

 BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
 Write of size 4 at addr ffff88810e625328 by task a.out/4404

 CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
 Call Trace:
 <TASK>
 dump_stack_lvl+0x73/0xa0
 print_report+0x16c/0x4e0
 ? _printk+0x59/0x80
 ? __virt_addr_valid+0xb8/0x130
 ? _raw_spin_lock_irqsave+0x60/0xc0
 kasan_report+0xc3/0xf0
 ? _raw_spin_lock_irqsave+0x60/0xc0
 kasan_check_range+0x2d2/0x310
 _raw_spin_lock_irqsave+0x60/0xc0
 remove_wait_queue+0x1a/0xa0
 ep_free+0x12c/0x170
 ep_eventpoll_release+0x26/0x30
 __fput+0x202/0x400
 task_work_run+0x11d/0x170
 do_exit+0x495/0x1130
 ? update_cfs_rq_load_avg+0x2c2/0x2e0
 do_group_exit+0x100/0x100
 get_signal+0xd67/0xde0
 ? finish_task_switch+0x15f/0x3a0
 arch_do_signal_or_restart+0x2a/0x2b0
 exit_to_user_mode_prepare+0x94/0x100
 syscall_exit_to_user_mode+0x20/0x40
 do_syscall_64+0x52/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x7f8e392bfb91
 Code: Unable to access opcode bytes at 0x7f8e392bfb67.
 RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
 RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
 RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
 RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
 R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
 R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

 Allocated by task 4404:
 kasan_set_track+0x3d/0x60
 __kasan_kmalloc+0x85/0x90
 psi_trigger_create+0x113/0x3e0
 pressure_write+0x146/0x2e0
 cgroup_file_write+0x11c/0x250
 kernfs_fop_write_iter+0x186/0x220
 vfs_write+0x3d8/0x5c0
 ksys_write+0x90/0x110
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

 Freed by task 4407:
 kasan_set_track+0x3d/0x60
 kasan_save_free_info+0x27/0x40
 ____kasan_slab_free+0x11d/0x170
 slab_free_freelist_hook+0x87/0x150
 __kmem_cache_free+0xcb/0x180
 psi_trigger_destroy+0x2e8/0x310
 cgroup_file_release+0x4f/0xb0
 kernfs_drain_open_files+0x165/0x1f0
 kernfs_drain+0x162/0x1a0
 __kernfs_remove+0x1fb/0x310
 kernfs_remove_by_name_ns+0x95/0xe0
 cgroup_addrm_files+0x67f/0x700
 cgroup_destroy_locked+0x283/0x3c0
 cgroup_rmdir+0x29/0x100
 kernfs_iop_rmdir+0xd1/0x140
 vfs_rmdir+0xfe/0x240
 do_rmdir+0x13d/0x280
 __x64_sys_rmdir+0x2c/0x30
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

v3: updated commit message and the comment in the code
v2: updated commit message

Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Cc: stable@vger.kernel.org
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
---
 kernel/sched/psi.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8ac8b81bfee6..02e011cabe91 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
 
 	group = t->group;
 	/*
-	 * Wakeup waiters to stop polling. Can happen if cgroup is deleted
-	 * from under a polling process.
+	 * Wakeup waiters to stop polling and clear the queue to prevent it from
+	 * being accessed later. Can happen if cgroup is deleted from under a
+	 * polling process.
 	 */
-	wake_up_interruptible(&t->event_wait);
+	wake_up_pollfree(&t->event_wait);
 
 	mutex_lock(&group->trigger_lock);
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-14 18:13                   ` [PATCH v3] " Munehisa Kamata
@ 2023-02-14 18:28                     ` Suren Baghdasaryan
  2023-02-14 18:29                       ` Suren Baghdasaryan
  2023-02-14 18:55                     ` Eric Biggers
  1 sibling, 1 reply; 14+ messages in thread
From: Suren Baghdasaryan @ 2023-02-14 18:28 UTC (permalink / raw)
  To: Munehisa Kamata
  Cc: ebiggers, hannes, hdanton, linux-kernel, linux-mm, mengcc, stable

On Tue, Feb 14, 2023 at 10:13 AM Munehisa Kamata <kamatam@amazon.com> wrote:
>
> If a non-root cgroup gets removed when there is a thread that registered
> trigger and is polling on a pressure file within the cgroup, the polling
> waitqueue gets freed in the following path.
>
>  do_rmdir
>    cgroup_rmdir
>      kernfs_drain_open_files
>        cgroup_file_release
>          cgroup_pressure_release
>            psi_trigger_destroy
>
> However, the polling thread still has a reference to the pressure file and
> will access the freed waitqueue when the file is closed or upon exit.
>
>  fput
>    ep_eventpoll_release
>      ep_free
>        ep_remove_wait_queue
>          remove_wait_queue
>
> This results in use-after-free as pasted below.
>
> The fundamental problem here is that cgroup_file_release() (and
> consequently waitqueue's lifetime) is not tied to the file's real lifetime.
> Using wake_up_pollfree() here might be less than ideal, but it also is not
> fully contradicting the comment at commit 42288cb44c4b ("wait: add
> wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's
> one and can be considered as another special case. While this would be
> fixable by somehow making cgroup_file_release() be tied to the fput(), it
> would require sizable refactoring at cgroups or higher layer which might be
> more justifiable if we identify more cases like this.
>
>  BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
>  Write of size 4 at addr ffff88810e625328 by task a.out/4404
>
>  CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
>  Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
>  Call Trace:
>  <TASK>
>  dump_stack_lvl+0x73/0xa0
>  print_report+0x16c/0x4e0
>  ? _printk+0x59/0x80
>  ? __virt_addr_valid+0xb8/0x130
>  ? _raw_spin_lock_irqsave+0x60/0xc0
>  kasan_report+0xc3/0xf0
>  ? _raw_spin_lock_irqsave+0x60/0xc0
>  kasan_check_range+0x2d2/0x310
>  _raw_spin_lock_irqsave+0x60/0xc0
>  remove_wait_queue+0x1a/0xa0
>  ep_free+0x12c/0x170
>  ep_eventpoll_release+0x26/0x30
>  __fput+0x202/0x400
>  task_work_run+0x11d/0x170
>  do_exit+0x495/0x1130
>  ? update_cfs_rq_load_avg+0x2c2/0x2e0
>  do_group_exit+0x100/0x100
>  get_signal+0xd67/0xde0
>  ? finish_task_switch+0x15f/0x3a0
>  arch_do_signal_or_restart+0x2a/0x2b0
>  exit_to_user_mode_prepare+0x94/0x100
>  syscall_exit_to_user_mode+0x20/0x40
>  do_syscall_64+0x52/0x90
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>  RIP: 0033:0x7f8e392bfb91
>  Code: Unable to access opcode bytes at 0x7f8e392bfb67.
>  RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
>  RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
>  RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
>  RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
>  R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
>  R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
>  </TASK>
>
>  Allocated by task 4404:
>  kasan_set_track+0x3d/0x60
>  __kasan_kmalloc+0x85/0x90
>  psi_trigger_create+0x113/0x3e0
>  pressure_write+0x146/0x2e0
>  cgroup_file_write+0x11c/0x250
>  kernfs_fop_write_iter+0x186/0x220
>  vfs_write+0x3d8/0x5c0
>  ksys_write+0x90/0x110
>  do_syscall_64+0x43/0x90
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
>  Freed by task 4407:
>  kasan_set_track+0x3d/0x60
>  kasan_save_free_info+0x27/0x40
>  ____kasan_slab_free+0x11d/0x170
>  slab_free_freelist_hook+0x87/0x150
>  __kmem_cache_free+0xcb/0x180
>  psi_trigger_destroy+0x2e8/0x310
>  cgroup_file_release+0x4f/0xb0
>  kernfs_drain_open_files+0x165/0x1f0
>  kernfs_drain+0x162/0x1a0
>  __kernfs_remove+0x1fb/0x310
>  kernfs_remove_by_name_ns+0x95/0xe0
>  cgroup_addrm_files+0x67f/0x700
>  cgroup_destroy_locked+0x283/0x3c0
>  cgroup_rmdir+0x29/0x100
>  kernfs_iop_rmdir+0xd1/0x140
>  vfs_rmdir+0xfe/0x240
>  do_rmdir+0x13d/0x280
>  __x64_sys_rmdir+0x2c/0x30
>  do_syscall_64+0x43/0x90
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> v3: updated commit message and the comment in the code
> v2: updated commit message
>
> Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
> Fixes: 0e94682b73bf ("psi: introduce psi monitor")
> Cc: stable@vger.kernel.org
> Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
> Signed-off-by: Mengchi Cheng <mengcc@amazon.com>

Acked-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  kernel/sched/psi.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 8ac8b81bfee6..02e011cabe91 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
>
>         group = t->group;
>         /*
> -        * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> -        * from under a polling process.
> +        * Wakeup waiters to stop polling and clear the queue to prevent it from
> +        * being accessed later. Can happen if cgroup is deleted from under a
> +        * polling process.
>          */
> -       wake_up_interruptible(&t->event_wait);
> +       wake_up_pollfree(&t->event_wait);
>
>         mutex_lock(&group->trigger_lock);
>
> --
> 2.38.1
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-14 18:28                     ` Suren Baghdasaryan
@ 2023-02-14 18:29                       ` Suren Baghdasaryan
  0 siblings, 0 replies; 14+ messages in thread
From: Suren Baghdasaryan @ 2023-02-14 18:29 UTC (permalink / raw)
  To: Munehisa Kamata, Peter Zijlstra
  Cc: ebiggers, hannes, hdanton, linux-kernel, linux-mm, mengcc, stable

On Tue, Feb 14, 2023 at 10:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Feb 14, 2023 at 10:13 AM Munehisa Kamata <kamatam@amazon.com> wrote:
> >
> > If a non-root cgroup gets removed when there is a thread that registered
> > trigger and is polling on a pressure file within the cgroup, the polling
> > waitqueue gets freed in the following path.
> >
> >  do_rmdir
> >    cgroup_rmdir
> >      kernfs_drain_open_files
> >        cgroup_file_release
> >          cgroup_pressure_release
> >            psi_trigger_destroy
> >
> > However, the polling thread still has a reference to the pressure file and
> > will access the freed waitqueue when the file is closed or upon exit.
> >
> >  fput
> >    ep_eventpoll_release
> >      ep_free
> >        ep_remove_wait_queue
> >          remove_wait_queue
> >
> > This results in use-after-free as pasted below.
> >
> > The fundamental problem here is that cgroup_file_release() (and
> > consequently waitqueue's lifetime) is not tied to the file's real lifetime.
> > Using wake_up_pollfree() here might be less than ideal, but it also is not
> > fully contradicting the comment at commit 42288cb44c4b ("wait: add
> > wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's
> > one and can be considered as another special case. While this would be
> > fixable by somehow making cgroup_file_release() be tied to the fput(), it
> > would require sizable refactoring at cgroups or higher layer which might be
> > more justifiable if we identify more cases like this.
> >
> >  BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
> >  Write of size 4 at addr ffff88810e625328 by task a.out/4404
> >
> >  CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
> >  Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
> >  Call Trace:
> >  <TASK>
> >  dump_stack_lvl+0x73/0xa0
> >  print_report+0x16c/0x4e0
> >  ? _printk+0x59/0x80
> >  ? __virt_addr_valid+0xb8/0x130
> >  ? _raw_spin_lock_irqsave+0x60/0xc0
> >  kasan_report+0xc3/0xf0
> >  ? _raw_spin_lock_irqsave+0x60/0xc0
> >  kasan_check_range+0x2d2/0x310
> >  _raw_spin_lock_irqsave+0x60/0xc0
> >  remove_wait_queue+0x1a/0xa0
> >  ep_free+0x12c/0x170
> >  ep_eventpoll_release+0x26/0x30
> >  __fput+0x202/0x400
> >  task_work_run+0x11d/0x170
> >  do_exit+0x495/0x1130
> >  ? update_cfs_rq_load_avg+0x2c2/0x2e0
> >  do_group_exit+0x100/0x100
> >  get_signal+0xd67/0xde0
> >  ? finish_task_switch+0x15f/0x3a0
> >  arch_do_signal_or_restart+0x2a/0x2b0
> >  exit_to_user_mode_prepare+0x94/0x100
> >  syscall_exit_to_user_mode+0x20/0x40
> >  do_syscall_64+0x52/0x90
> >  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >  RIP: 0033:0x7f8e392bfb91
> >  Code: Unable to access opcode bytes at 0x7f8e392bfb67.
> >  RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
> >  RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
> >  RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
> >  RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
> >  R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
> >  R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
> >  </TASK>
> >
> >  Allocated by task 4404:
> >  kasan_set_track+0x3d/0x60
> >  __kasan_kmalloc+0x85/0x90
> >  psi_trigger_create+0x113/0x3e0
> >  pressure_write+0x146/0x2e0
> >  cgroup_file_write+0x11c/0x250
> >  kernfs_fop_write_iter+0x186/0x220
> >  vfs_write+0x3d8/0x5c0
> >  ksys_write+0x90/0x110
> >  do_syscall_64+0x43/0x90
> >  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
> >  Freed by task 4407:
> >  kasan_set_track+0x3d/0x60
> >  kasan_save_free_info+0x27/0x40
> >  ____kasan_slab_free+0x11d/0x170
> >  slab_free_freelist_hook+0x87/0x150
> >  __kmem_cache_free+0xcb/0x180
> >  psi_trigger_destroy+0x2e8/0x310
> >  cgroup_file_release+0x4f/0xb0
> >  kernfs_drain_open_files+0x165/0x1f0
> >  kernfs_drain+0x162/0x1a0
> >  __kernfs_remove+0x1fb/0x310
> >  kernfs_remove_by_name_ns+0x95/0xe0
> >  cgroup_addrm_files+0x67f/0x700
> >  cgroup_destroy_locked+0x283/0x3c0
> >  cgroup_rmdir+0x29/0x100
> >  kernfs_iop_rmdir+0xd1/0x140
> >  vfs_rmdir+0xfe/0x240
> >  do_rmdir+0x13d/0x280
> >  __x64_sys_rmdir+0x2c/0x30
> >  do_syscall_64+0x43/0x90
> >  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
> > v3: updated commit message and the comment in the code
> > v2: updated commit message
> >
> > Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
> > Fixes: 0e94682b73bf ("psi: introduce psi monitor")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
> > Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
>

Acked-by: Suren Baghdasaryan <surenb@google.com>

CC'ing PeterZ directly for inclusion into his tree.

>
> > ---
> >  kernel/sched/psi.c | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index 8ac8b81bfee6..02e011cabe91 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> >
> >         group = t->group;
> >         /*
> > -        * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > -        * from under a polling process.
> > +        * Wakeup waiters to stop polling and clear the queue to prevent it from
> > +        * being accessed later. Can happen if cgroup is deleted from under a
> > +        * polling process.
> >          */
> > -       wake_up_interruptible(&t->event_wait);
> > +       wake_up_pollfree(&t->event_wait);
> >
> >         mutex_lock(&group->trigger_lock);
> >
> > --
> > 2.38.1
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-14 18:13                   ` [PATCH v3] " Munehisa Kamata
  2023-02-14 18:28                     ` Suren Baghdasaryan
@ 2023-02-14 18:55                     ` Eric Biggers
  2023-02-14 19:13                       ` Suren Baghdasaryan
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Biggers @ 2023-02-14 18:55 UTC (permalink / raw)
  To: Munehisa Kamata
  Cc: surenb, hannes, hdanton, linux-kernel, linux-mm, mengcc, stable

On Tue, Feb 14, 2023 at 10:13:35AM -0800, Munehisa Kamata wrote:
> Using wake_up_pollfree() here might be less than ideal, but it also is not
> fully contradicting the comment at commit 42288cb44c4b ("wait: add
> wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's
> one and can be considered as another special case.

If that comment is outdated now, then please update the comment too.  We can do
better than "not fully contradicting".

- Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3] sched/psi: fix use-after-free in ep_remove_wait_queue()
  2023-02-14 18:55                     ` Eric Biggers
@ 2023-02-14 19:13                       ` Suren Baghdasaryan
  0 siblings, 0 replies; 14+ messages in thread
From: Suren Baghdasaryan @ 2023-02-14 19:13 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Munehisa Kamata, hannes, hdanton, linux-kernel, linux-mm, mengcc, stable

On Tue, Feb 14, 2023 at 10:55 AM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Tue, Feb 14, 2023 at 10:13:35AM -0800, Munehisa Kamata wrote:
> > Using wake_up_pollfree() here might be less than ideal, but it also is not
> > fully contradicting the comment at commit 42288cb44c4b ("wait: add
> > wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's
> > one and can be considered as another special case.
>
> If that comment is outdated now, then please update the comment too.  We can do
> better than "not fully contradicting".

Agree. I don't see it contradicting that comment.

>
> - Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-02-14 19:14 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com>
2023-02-02  3:00 ` [PATCH] sched/psi: fix use-after-free in ep_remove_wait_queue() Munehisa Kamata
2023-02-02  4:56   ` Eric Biggers
2023-02-02 21:11     ` Suren Baghdasaryan
2023-02-09 17:09       ` Suren Baghdasaryan
2023-02-09 18:46         ` Eric Biggers
2023-02-09 19:13           ` Suren Baghdasaryan
2023-02-13 23:50             ` Suren Baghdasaryan
2023-02-14  7:04               ` [PATCH v2] " Munehisa Kamata
2023-02-14 17:10                 ` Suren Baghdasaryan
2023-02-14 18:13                   ` [PATCH v3] " Munehisa Kamata
2023-02-14 18:28                     ` Suren Baghdasaryan
2023-02-14 18:29                       ` Suren Baghdasaryan
2023-02-14 18:55                     ` Eric Biggers
2023-02-14 19:13                       ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).