* [PATCH] exit: Retain nsproxy for exit_task_work() work entries @ 2021-12-08 18:05 Michal Koutný 2021-12-08 18:45 ` Eric W. Biederman 0 siblings, 1 reply; 13+ messages in thread From: Michal Koutný @ 2021-12-08 18:05 UTC (permalink / raw) To: linux-kernel Cc: Eric W . Biederman, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Tejun Heo The reported issue is an attempted write in a cgroup file, by a zombie who has/had acct(2) enabled. Such a write needs cgroup_ns for access checking. Ordinary acct_process() would not be affected by this as it is called well before exit_task_namespaces(). However, the reported NULL dereference is a different acct data writer: Call Trace: <TASK> kernfs_fop_write_iter+0x3b6/0x510 fs/kernfs/file.c:296 __kernel_write+0x5d1/0xaf0 fs/read_write.c:535 do_acct_process+0x112a/0x17b0 kernel/acct.c:518 acct_pin_kill+0x27/0x130 kernel/acct.c:173 pin_kill+0x2a6/0x940 fs/fs_pin.c:44 mnt_pin_kill+0xc1/0x170 fs/fs_pin.c:81 cleanup_mnt+0x4bc/0x510 fs/namespace.c:1130 task_work_run+0x146/0x1c0 kernel/task_work.c:164 exit_task_work include/linux/task_work.h:32 [inline] do_exit+0x705/0x24f0 kernel/exit.c:832 do_group_exit+0x168/0x2d0 kernel/exit.c:929 get_signal+0x16b0/0x2090 kernel/signal.c:2820 arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300 do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae i.e. called as one of task_work_run() entries. The historical commit 8aac62706ada ("move exit_task_namespaces() outside of exit_notify()") argues that exit_task_namespaces() must come before exit_task_work() because ipc_ns cleanup calls fput/task_work_add. There is accompanying commit e7b2c4069252 ("fput: task_work_add() can fail if the caller has passed exit_task_work()") in the original series that makes fput() robust in situations when task_work_add() cannot be used anymore. So in order to ensure that task_work_run() entries of the exiting task have the nsproxy still available, swap the order of exit_task_namespaces() and exit_task_work(). This change may appear like a partial revert of 8aac62706ada but this particular ordering change shouldn't matter with the fix from e7b2c4069252 and the other reason for 8aac62706ada (keeping exit_notify simpler) still holds. Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com Cc: Oleg Nesterov <oleg@redhat.com> Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option") Signed-off-by: Michal Koutný <mkoutny@suse.com> --- kernel/exit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) I wasn't able to reproduce the syzbot's crash manually so the effectiveness of the fix is only based on the reasoning. diff --git a/kernel/exit.c b/kernel/exit.c index f702a6a63686..0c2abdebb87c 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -828,8 +828,8 @@ void __noreturn do_exit(long code) exit_fs(tsk); if (group_dead) disassociate_ctty(1); - exit_task_namespaces(tsk); exit_task_work(tsk); + exit_task_namespaces(tsk); exit_thread(tsk); /* -- 2.33.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-08 18:05 [PATCH] exit: Retain nsproxy for exit_task_work() work entries Michal Koutný @ 2021-12-08 18:45 ` Eric W. Biederman 2021-12-08 19:06 ` Tejun Heo 2021-12-10 23:12 ` Michal Koutný 0 siblings, 2 replies; 13+ messages in thread From: Eric W. Biederman @ 2021-12-08 18:45 UTC (permalink / raw) To: Michal Koutný Cc: linux-kernel, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Tejun Heo, security, Andy Lutomirski, Jann Horn Adding the security list and a couple of other people because I feel like I just opened pandora's box, and I don't know how bad this issue is. TL;DR the cgroup file system is checking permissions at write time. Michal Koutný <mkoutny@suse.com> writes: > The reported issue is an attempted write in a cgroup file, by a zombie > who has/had acct(2) enabled. Such a write needs cgroup_ns for access > checking. Ordinary acct_process() would not be affected by this as it is > called well before exit_task_namespaces(). However, the reported NULL > dereference is a different acct data writer: > > Call Trace: > <TASK> > kernfs_fop_write_iter+0x3b6/0x510 fs/kernfs/file.c:296 > __kernel_write+0x5d1/0xaf0 fs/read_write.c:535 > do_acct_process+0x112a/0x17b0 kernel/acct.c:518 > acct_pin_kill+0x27/0x130 kernel/acct.c:173 > pin_kill+0x2a6/0x940 fs/fs_pin.c:44 > mnt_pin_kill+0xc1/0x170 fs/fs_pin.c:81 > cleanup_mnt+0x4bc/0x510 fs/namespace.c:1130 > task_work_run+0x146/0x1c0 kernel/task_work.c:164 > exit_task_work include/linux/task_work.h:32 [inline] > do_exit+0x705/0x24f0 kernel/exit.c:832 > do_group_exit+0x168/0x2d0 kernel/exit.c:929 > get_signal+0x16b0/0x2090 kernel/signal.c:2820 > arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868 > handle_signal_work kernel/entry/common.c:148 [inline] > exit_to_user_mode_loop kernel/entry/common.c:172 [inline] > exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207 > __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] > syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300 > do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86 > entry_SYSCALL_64_after_hwframe+0x44/0xae > > i.e. called as one of task_work_run() entries. > > The historical commit 8aac62706ada ("move exit_task_namespaces() outside > of exit_notify()") argues that exit_task_namespaces() must come before > exit_task_work() because ipc_ns cleanup calls fput/task_work_add. > > There is accompanying commit e7b2c4069252 ("fput: task_work_add() can > fail if the caller has passed exit_task_work()") in the original series > that makes fput() robust in situations when task_work_add() cannot be > used anymore. > > So in order to ensure that task_work_run() entries of the exiting task > have the nsproxy still available, swap the order of > exit_task_namespaces() and exit_task_work(). > > This change may appear like a partial revert of 8aac62706ada but this > particular ordering change shouldn't matter with the fix from > e7b2c4069252 and the other reason for 8aac62706ada (keeping exit_notify > simpler) still holds. I think I follow your reasoning and I think it will even fix the issue but no. That is completely very much not the correct fix. That permission check in cgroup_file_write is just plain wrong. Permission checks on files need to happen at open time, not at write time. It is all too easy to confuse something that writes to stdout to write to a file of your choosing to make permission checking at write time a good idea. It looks like the this issue was introduced in commit 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option"). I wish I knew where I was when that patch was posted for review so I could suggest a better implementation. That said I can't quite tell if the test should be moved into cgroup_file_open or if there is a permission entry that would work. Oh bleep! The likely named cgroup_procs_write_permission isn't a permission method at all. It is called from cgroup_procs_write which I believe is called from cgroup_file_write. I may be wrong but at first glance this looks like the cgroup code is going to need significant surgery to get the permission checks happening at open time where they belong. Please don't apply this patch. exit_task_work running after exit_task_namespaces is the messenger that just told us about something ugly. Eric > Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com > Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com > Cc: Oleg Nesterov <oleg@redhat.com> > Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option") > Signed-off-by: Michal Koutný <mkoutny@suse.com> > --- > kernel/exit.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > I wasn't able to reproduce the syzbot's crash manually so the effectiveness of > the fix is only based on the reasoning. > > diff --git a/kernel/exit.c b/kernel/exit.c > index f702a6a63686..0c2abdebb87c 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -828,8 +828,8 @@ void __noreturn do_exit(long code) > exit_fs(tsk); > if (group_dead) > disassociate_ctty(1); > - exit_task_namespaces(tsk); > exit_task_work(tsk); > + exit_task_namespaces(tsk); > exit_thread(tsk); > > /* ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-08 18:45 ` Eric W. Biederman @ 2021-12-08 19:06 ` Tejun Heo 2021-12-08 19:39 ` Linus Torvalds 2021-12-10 23:12 ` Michal Koutný 1 sibling, 1 reply; 13+ messages in thread From: Tejun Heo @ 2021-12-08 19:06 UTC (permalink / raw) To: Eric W. Biederman Cc: Michal Koutný, linux-kernel, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, security, Andy Lutomirski, Jann Horn Hello, Eric. On Wed, Dec 08, 2021 at 12:45:54PM -0600, Eric W. Biederman wrote: > Permission checks on files need to happen at open time, not at > write time. It is all too easy to confuse something that writes > to stdout to write to a file of your choosing to make permission > checking at write time a good idea. Whether a given operation is allowed or not is dependent on the content of the write. I don't see how it can be moved to open time. I'm not a fan of the fs based interface but they're not real files and it has been like this from the beginning. > That said I can't quite tell if the test should be moved into > cgroup_file_open or if there is a permission entry that would work. It can't. > I may be wrong but at first glance this looks like the cgroup code is > going to need significant surgery to get the permission checks happening > at open time where they belong. There's no way to change that without changing the interface drastically. > Please don't apply this patch. > > exit_task_work running after exit_task_namespaces is the messenger > that just told us about something ugly. I don't see good alternatives here. You got any? Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-08 19:06 ` Tejun Heo @ 2021-12-08 19:39 ` Linus Torvalds 2021-12-08 19:49 ` Tejun Heo 0 siblings, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2021-12-08 19:39 UTC (permalink / raw) To: Tejun Heo Cc: Eric W. Biederman, Michal Koutný, Linux Kernel Mailing List, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Security Officers, Andy Lutomirski, Jann Horn On Wed, Dec 8, 2021 at 11:06 AM Tejun Heo <tj@kernel.org> wrote: > > > That said I can't quite tell if the test should be moved into > > cgroup_file_open or if there is a permission entry that would work. > > It can't. Two options: (a) anybody doing "current process" permission checks at IO time should then also check that the credentials are still the same as when the file was opened. (b) alternatively, go ahead and do the permission check at IO time, but do it using "file->f_cred" (ie the open-time permission), not the current process ones. In the above, (a) and (b) are basically the same: it uses f_cred for permission checking. The only difference is that in (a) you may be using some function that _technically_ uses the implicit "current credentials" (there are many of them), and you just separately make sure that those current credentials are identical to what they were at open time. Obviously (b) is hugely preferred, but sometimes the code organization (ie "file or f_cred just isn't passed down deep enough") means that (a) might be the only realistic option. IOW, it's not *wrong* to do permission checking at IO time, but it absolutely needs to be done using the open-time credentials. Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-08 19:39 ` Linus Torvalds @ 2021-12-08 19:49 ` Tejun Heo 2021-12-08 23:07 ` Tejun Heo 0 siblings, 1 reply; 13+ messages in thread From: Tejun Heo @ 2021-12-08 19:49 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Michal Koutný, Linux Kernel Mailing List, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Security Officers, Andy Lutomirski, Jann Horn Hello, On Wed, Dec 08, 2021 at 11:39:43AM -0800, Linus Torvalds wrote: > (b) alternatively, go ahead and do the permission check at IO time, > but do it using "file->f_cred" (ie the open-time permission), not the > current process ones. > > In the above, (a) and (b) are basically the same: it uses f_cred for > permission checking. The only difference is that in (a) you may be > using some function that _technically_ uses the implicit "current > credentials" (there are many of them), and you just separately make > sure that those current credentials are identical to what they were at > open time. > > Obviously (b) is hugely preferred, but sometimes the code organization > (ie "file or f_cred just isn't passed down deep enough") means that > (a) might be the only realistic option. > > IOW, it's not *wrong* to do permission checking at IO time, but it > absolutely needs to be done using the open-time credentials. Yeah, (b) sounds good to me. Will look into it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-08 19:49 ` Tejun Heo @ 2021-12-08 23:07 ` Tejun Heo 2021-12-09 13:44 ` Michal Koutný 0 siblings, 1 reply; 13+ messages in thread From: Tejun Heo @ 2021-12-08 23:07 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Michal Koutný, Linux Kernel Mailing List, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Security Officers, Andy Lutomirski, Jann Horn Hello, On Wed, Dec 08, 2021 at 09:49:17AM -1000, Tejun Heo wrote: > > (b) alternatively, go ahead and do the permission check at IO time, > > but do it using "file->f_cred" (ie the open-time permission), not the > > current process ones. So, I have sth like the following. It builds and euid-open test case behaves as expected (can't evade delegation restrictions if the fd was opened with lesser euid). The namespace part is a bit more involved as f_cred doesn't capture them on open. I made cgroup file open path capture it and use that for all permission checks. Please let me know if anything looks weird. Otherwise, I'm gonna add selftests and prep the patchset. Thanks. Index: work/kernel/cgroup/cgroup-v1.c =================================================================== --- work.orig/kernel/cgroup/cgroup-v1.c +++ work/kernel/cgroup/cgroup-v1.c @@ -504,10 +504,11 @@ static ssize_t __cgroup1_procs_write(str goto out_unlock; /* - * Even if we're attaching all tasks in the thread group, we only - * need to check permissions on one of them. + * Even if we're attaching all tasks in the thread group, we only need + * to check permissions on one of them. Check permissions using the + * credentials from file open to protect against inherited fd attacks. */ - cred = current_cred(); + cred = of->file->f_cred; tcred = get_task_cred(task); if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && !uid_eq(cred->euid, tcred->uid) && Index: work/kernel/cgroup/cgroup.c =================================================================== --- work.orig/kernel/cgroup/cgroup.c +++ work/kernel/cgroup/cgroup.c @@ -3630,6 +3630,7 @@ static int cgroup_cpu_pressure_show(stru static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf, size_t nbytes, enum psi_res res) { + struct cgroup_file_ctx *ctx = of->priv; struct psi_trigger *new; struct cgroup *cgrp; struct psi_group *psi; @@ -3648,7 +3649,7 @@ static ssize_t cgroup_pressure_write(str return PTR_ERR(new); } - psi_trigger_replace(&of->priv, new); + psi_trigger_replace(&ctx->psi.trigger, new); cgroup_put(cgrp); @@ -3679,12 +3680,16 @@ static ssize_t cgroup_cpu_pressure_write static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, poll_table *pt) { - return psi_trigger_poll(&of->priv, of->file, pt); + struct cgroup_file_ctx *ctx = of->priv; + + return psi_trigger_poll(&ctx->psi.trigger, of->file, pt); } static void cgroup_pressure_release(struct kernfs_open_file *of) { - psi_trigger_replace(&of->priv, NULL); + struct cgroup_file_ctx *ctx = of->priv; + + psi_trigger_replace(&ctx->psi.trigger, NULL); } bool cgroup_psi_enabled(void) @@ -3811,24 +3816,42 @@ static ssize_t cgroup_kill_write(struct static int cgroup_file_open(struct kernfs_open_file *of) { struct cftype *cft = of_cft(of); + struct cgroup_file_ctx *ctx; + int ret; - if (cft->open) - return cft->open(of); - return 0; + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return -ENOMEM; + + ctx->ns = current->nsproxy->cgroup_ns; + get_cgroup_ns(ctx->ns); + + of->priv = ctx; + + if (!cft->open) + return 0; + + ret = cft->open(of); + if (ret) + kfree(ctx); + return ret; } static void cgroup_file_release(struct kernfs_open_file *of) { struct cftype *cft = of_cft(of); + struct cgroup_file_ctx *ctx = of->priv; if (cft->release) cft->release(of); + put_cgroup_ns(ctx->ns); + kfree(ctx); } static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { - struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; + struct cgroup_file_ctx *ctx = of->priv; struct cgroup *cgrp = of->kn->parent->priv; struct cftype *cft = of_cft(of); struct cgroup_subsys_state *css; @@ -3845,7 +3868,7 @@ static ssize_t cgroup_file_write(struct */ if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) && !(cft->flags & CFTYPE_NS_DELEGATABLE) && - ns != &init_cgroup_ns && ns->root_cset->dfl_cgrp == cgrp) + ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp) return -EPERM; if (cft->write) @@ -4751,21 +4774,23 @@ void css_task_iter_end(struct css_task_i static void cgroup_procs_release(struct kernfs_open_file *of) { - if (of->priv) { - css_task_iter_end(of->priv); - kfree(of->priv); + struct cgroup_file_ctx *ctx = of->priv; + + if (ctx->procs.it) { + css_task_iter_end(ctx->procs.it); + kfree(ctx->procs.it); } } static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos) { struct kernfs_open_file *of = s->private; - struct css_task_iter *it = of->priv; + struct cgroup_file_ctx *ctx = of->priv; if (pos) (*pos)++; - return css_task_iter_next(it); + return css_task_iter_next(ctx->procs.it); } static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos, @@ -4773,7 +4798,8 @@ static void *__cgroup_procs_start(struct { struct kernfs_open_file *of = s->private; struct cgroup *cgrp = seq_css(s)->cgroup; - struct css_task_iter *it = of->priv; + struct cgroup_file_ctx *ctx = of->priv; + struct css_task_iter *it = ctx->procs.it; /* * When a seq_file is seeked, it's always traversed sequentially @@ -4786,7 +4812,7 @@ static void *__cgroup_procs_start(struct it = kzalloc(sizeof(*it), GFP_KERNEL); if (!it) return ERR_PTR(-ENOMEM); - of->priv = it; + ctx->procs.it = it; css_task_iter_start(&cgrp->self, iter_flags, it); } else if (!(*pos)) { css_task_iter_end(it); @@ -4838,9 +4864,9 @@ static int cgroup_may_write(const struct static int cgroup_procs_write_permission(struct cgroup *src_cgrp, struct cgroup *dst_cgrp, - struct super_block *sb) + struct super_block *sb, + struct cgroup_namespace *ns) { - struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; struct cgroup *com_cgrp = src_cgrp; int ret; @@ -4869,11 +4895,12 @@ static int cgroup_procs_write_permission static int cgroup_attach_permissions(struct cgroup *src_cgrp, struct cgroup *dst_cgrp, - struct super_block *sb, bool threadgroup) + struct super_block *sb, bool threadgroup, + struct cgroup_namespace *ns) { int ret = 0; - ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, sb); + ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, sb, ns); if (ret) return ret; @@ -4890,8 +4917,10 @@ static int cgroup_attach_permissions(str static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, bool threadgroup) { + struct cgroup_file_ctx *ctx = of->priv; struct cgroup *src_cgrp, *dst_cgrp; struct task_struct *task; + const struct cred *saved_cred; ssize_t ret; bool locked; @@ -4909,9 +4938,16 @@ static ssize_t __cgroup_procs_write(stru src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root); spin_unlock_irq(&css_set_lock); - /* process and thread migrations follow same delegation rule */ + /* + * Process and thread migrations follow same delegation rule. Check + * permissions using the credentials from file open to protect against + * inherited fd attacks. + */ + saved_cred = override_creds(of->file->f_cred); ret = cgroup_attach_permissions(src_cgrp, dst_cgrp, - of->file->f_path.dentry->d_sb, threadgroup); + of->file->f_path.dentry->d_sb, + threadgroup, ctx->ns); + revert_creds(saved_cred); if (ret) goto out_finish; @@ -6130,7 +6166,8 @@ static int cgroup_css_set_fork(struct ke goto err; ret = cgroup_attach_permissions(cset->dfl_cgrp, dst_cgrp, sb, - !(kargs->flags & CLONE_THREAD)); + !(kargs->flags & CLONE_THREAD), + current->nsproxy->cgroup_ns); if (ret) goto err; Index: work/kernel/cgroup/cgroup-internal.h =================================================================== --- work.orig/kernel/cgroup/cgroup-internal.h +++ work/kernel/cgroup/cgroup-internal.h @@ -65,6 +65,20 @@ static inline struct cgroup_fs_context * return container_of(kfc, struct cgroup_fs_context, kfc); } +struct cgroup_file_ctx { + struct cgroup_namespace *ns; + + union { + struct { + struct css_task_iter *it; + } procs; + + struct { + void *trigger; + } psi; + }; +}; + /* * A cgroup can be associated with multiple css_sets as different tasks may * belong to different cgroups on different hierarchies. In the other ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-08 23:07 ` Tejun Heo @ 2021-12-09 13:44 ` Michal Koutný 2021-12-09 14:08 ` Christian Brauner 0 siblings, 1 reply; 13+ messages in thread From: Michal Koutný @ 2021-12-09 13:44 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Eric W. Biederman, Linux Kernel Mailing List, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Security Officers, Andy Lutomirski, Jann Horn, Christian Brauner On Wed, Dec 08, 2021 at 01:07:54PM -1000, Tejun Heo <tj@kernel.org> wrote: > + saved_cred = override_creds(of->file->f_cred); > ret = cgroup_attach_permissions(src_cgrp, dst_cgrp, > - of->file->f_path.dentry->d_sb, threadgroup); > + of->file->f_path.dentry->d_sb, > + threadgroup, ctx->ns); > + revert_creds(saved_cred); I wonder now whether such a wrap shouldn't also be around cgroup_kill() too (+ replacement of send_sig() with group_send_sig_info() [1])? This shouldn't break the use case of passing cgroup kill fd to a less privileged task for (auto)destruction purposes but on the other hand it would prevent subverting the fd to a more privileged confused task to kill otherwise disallowed processes. Thanks, Michal [1] https://lore.kernel.org/r/m1v97x6niq.fsf@fess.ebiederm.org/ ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-09 13:44 ` Michal Koutný @ 2021-12-09 14:08 ` Christian Brauner 2021-12-09 14:47 ` Michal Koutný 0 siblings, 1 reply; 13+ messages in thread From: Christian Brauner @ 2021-12-09 14:08 UTC (permalink / raw) To: Michal Koutný Cc: Tejun Heo, Linus Torvalds, Eric W. Biederman, Linux Kernel Mailing List, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Security Officers, Andy Lutomirski, Jann Horn On Thu, Dec 09, 2021 at 02:44:19PM +0100, Michal Koutný wrote: > On Wed, Dec 08, 2021 at 01:07:54PM -1000, Tejun Heo <tj@kernel.org> wrote: > > > + saved_cred = override_creds(of->file->f_cred); > > ret = cgroup_attach_permissions(src_cgrp, dst_cgrp, > > - of->file->f_path.dentry->d_sb, threadgroup); > > + of->file->f_path.dentry->d_sb, > > + threadgroup, ctx->ns); > > + revert_creds(saved_cred); > > I wonder now whether such a wrap shouldn't also be around cgroup_kill() > too (+ replacement of send_sig() with group_send_sig_info() [1])? send_sig() isn't used that was changed in response to a review. I'm confused. > > This shouldn't break the use case of passing cgroup kill fd to a less > privileged task for (auto)destruction purposes but on the other hand it > would prevent subverting the fd to a more privileged confused task to > kill otherwise disallowed processes. Kill and freeze only do time permission checking at open. Why would you introduce another write time check? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-09 14:08 ` Christian Brauner @ 2021-12-09 14:47 ` Michal Koutný 2021-12-09 15:06 ` Christian Brauner 0 siblings, 1 reply; 13+ messages in thread From: Michal Koutný @ 2021-12-09 14:47 UTC (permalink / raw) To: Christian Brauner Cc: Tejun Heo, Linus Torvalds, Eric W. Biederman, Linux Kernel Mailing List, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Security Officers, Andy Lutomirski, Jann Horn On Thu, Dec 09, 2021 at 03:08:26PM +0100, Christian Brauner <christian.brauner@ubuntu.com> wrote: > send_sig() isn't used that was changed in response to a review. I'm > confused. Sorry for ambiguity, I meant this instance [1]. > Kill and freeze only do time permission checking at open. Why would you > introduce another write time check? Let's have a cgroup G with tasks t1,...,tn (run by user u) and some monitoring tasks m1,...,mk belonging to a different user v != u. Currently u can kill also the tasks of v -- I'm not sure if that's intentional. My argument would apply if it wasn't -- it'd be suscebtible to similar abuse, i.e. passing the opened fd to a more privileged process to kill also v's tasks. (But if the intention is to be able to kill anyone in the cgroup, then it likely doesn't matter.) Michal [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup.c?h=v5.16-rc4#n3762 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-09 14:47 ` Michal Koutný @ 2021-12-09 15:06 ` Christian Brauner 2021-12-09 16:39 ` Michal Koutný 0 siblings, 1 reply; 13+ messages in thread From: Christian Brauner @ 2021-12-09 15:06 UTC (permalink / raw) To: Michal Koutný Cc: Tejun Heo, Linus Torvalds, Eric W. Biederman, Linux Kernel Mailing List, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Security Officers, Andy Lutomirski, Jann Horn On Thu, Dec 09, 2021 at 03:47:00PM +0100, Michal Koutný wrote: > On Thu, Dec 09, 2021 at 03:08:26PM +0100, Christian Brauner <christian.brauner@ubuntu.com> wrote: > > send_sig() isn't used that was changed in response to a review. I'm > > confused. > > Sorry for ambiguity, I meant this instance [1]. Sure, seems good. > > > Kill and freeze only do time permission checking at open. Why would you > > introduce another write time check? > > Let's have a cgroup G with tasks t1,...,tn (run by user u) and some > monitoring tasks m1,...,mk belonging to a different user v != u. > > Currently u can kill also the tasks of v -- I'm not sure if that's > intentional. My argument would apply if it wasn't -- it'd be suscebtible That was discussed and is intentional and is supposed to mirror the behavior of cgroup.freeze. Delegated killing was supposed to work and was one use-case. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-09 15:06 ` Christian Brauner @ 2021-12-09 16:39 ` Michal Koutný 0 siblings, 0 replies; 13+ messages in thread From: Michal Koutný @ 2021-12-09 16:39 UTC (permalink / raw) To: Christian Brauner Cc: Tejun Heo, Linus Torvalds, Eric W. Biederman, Linux Kernel Mailing List, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Security Officers, Andy Lutomirski, Jann Horn On Thu, Dec 09, 2021 at 04:06:55PM +0100, Christian Brauner <christian.brauner@ubuntu.com> wrote: > That was discussed and is intentional and is supposed to mirror the > behavior of cgroup.freeze. Delegated killing was supposed to work and > was one use-case. Thanks for the clarification. The cgroup_kill() doesn't need the override_creds() treating then (clearing my previous wondering). Michal ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-08 18:45 ` Eric W. Biederman 2021-12-08 19:06 ` Tejun Heo @ 2021-12-10 23:12 ` Michal Koutný 2021-12-13 15:24 ` Eric W. Biederman 1 sibling, 1 reply; 13+ messages in thread From: Michal Koutný @ 2021-12-10 23:12 UTC (permalink / raw) To: Eric W. Biederman Cc: linux-kernel, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Tejun Heo, security, Andy Lutomirski, Jann Horn On Wed, Dec 08, 2021 at 12:45:54PM -0600, "Eric W. Biederman" <ebiederm@xmission.com> wrote: > TL;DR the cgroup file system is checking permissions at write time. Thank you for bringing that up (handled in a separate thread now). > I think I follow your reasoning and I think it will even fix the issue > but no. FTR, part of Tejun's series [1] ensures that cgroup_ns is accessed directly without nsproxy and a reference to it is kept while the file is opened. I.e. that'd properly fix this particular crash reported by syzbot. > Please don't apply this patch. > > exit_task_work running after exit_task_namespaces is the messenger > that just told us about something ugly. In (my) theory some other task_work callbacks could (transitively) rely on the current->nsproxy which could still be cleared by exit_task_namespaces(). Is there another reason why to have exit_task_namespaces() before exit_task_work()? Thanks, Michal [1] https://lore.kernel.org/r/20211209214707.805617-4-tj@kernel.org/ ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries 2021-12-10 23:12 ` Michal Koutný @ 2021-12-13 15:24 ` Eric W. Biederman 0 siblings, 0 replies; 13+ messages in thread From: Eric W. Biederman @ 2021-12-13 15:24 UTC (permalink / raw) To: Michal Koutný Cc: linux-kernel, Jens Axboe, Kees Cook, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov, Tejun Heo, security, Andy Lutomirski, Jann Horn Michal Koutný <mkoutny@suse.com> writes: > On Wed, Dec 08, 2021 at 12:45:54PM -0600, "Eric W. Biederman" <ebiederm@xmission.com> wrote: >> TL;DR the cgroup file system is checking permissions at write time. > > Thank you for bringing that up (handled in a separate thread now). > >> I think I follow your reasoning and I think it will even fix the issue >> but no. > > FTR, part of Tejun's series [1] ensures that cgroup_ns is accessed > directly without nsproxy and a reference to it is kept while the file > is opened. I.e. that'd properly fix this particular crash reported by > syzbot. > >> Please don't apply this patch. >> >> exit_task_work running after exit_task_namespaces is the messenger >> that just told us about something ugly. > > In (my) theory some other task_work callbacks could (transitively) rely > on the current->nsproxy which could still be cleared by > exit_task_namespaces(). > Is there another reason why to have exit_task_namespaces() before > exit_task_work()? We already have the principle that things are going to be cleaned up before exit_task_work is called and exit_files depends upon that. So I think the burden is to find a good reason why exit_task_work should move not to defend it. If we don't want things cleaned up before exit_task_work it should come at the start of do_exit and exit_files and others need to stop depending upon it. Which seems like challenging change to make. Eric ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2021-12-13 15:25 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-12-08 18:05 [PATCH] exit: Retain nsproxy for exit_task_work() work entries Michal Koutný 2021-12-08 18:45 ` Eric W. Biederman 2021-12-08 19:06 ` Tejun Heo 2021-12-08 19:39 ` Linus Torvalds 2021-12-08 19:49 ` Tejun Heo 2021-12-08 23:07 ` Tejun Heo 2021-12-09 13:44 ` Michal Koutný 2021-12-09 14:08 ` Christian Brauner 2021-12-09 14:47 ` Michal Koutný 2021-12-09 15:06 ` Christian Brauner 2021-12-09 16:39 ` Michal Koutný 2021-12-10 23:12 ` Michal Koutný 2021-12-13 15:24 ` Eric W. Biederman
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.