[PATCH] exit: Retain nsproxy for exit_task_work() work entries

* [PATCH] exit: Retain nsproxy for exit_task_work() work entries
@ 2021-12-08 18:05 Michal Koutný
  2021-12-08 18:45 ` Eric W. Biederman
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Koutný @ 2021-12-08 18:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Eric W . Biederman, Jens Axboe, Kees Cook, Oleg Nesterov,
	Peter Zijlstra, Thomas Gleixner, Jim Newsome, Alexey Gladkov,
	Tejun Heo

The reported issue is an attempted write in a cgroup file, by a zombie
who has/had acct(2) enabled. Such a write needs cgroup_ns for access
checking. Ordinary acct_process() would not be affected by this as it is
called well before exit_task_namespaces(). However, the reported NULL
dereference is a different acct data writer:

	Call Trace:
	 <TASK>
	 kernfs_fop_write_iter+0x3b6/0x510 fs/kernfs/file.c:296
	 __kernel_write+0x5d1/0xaf0 fs/read_write.c:535
	 do_acct_process+0x112a/0x17b0 kernel/acct.c:518
	 acct_pin_kill+0x27/0x130 kernel/acct.c:173
	 pin_kill+0x2a6/0x940 fs/fs_pin.c:44
	 mnt_pin_kill+0xc1/0x170 fs/fs_pin.c:81
	 cleanup_mnt+0x4bc/0x510 fs/namespace.c:1130
	 task_work_run+0x146/0x1c0 kernel/task_work.c:164
	 exit_task_work include/linux/task_work.h:32 [inline]
	 do_exit+0x705/0x24f0 kernel/exit.c:832
	 do_group_exit+0x168/0x2d0 kernel/exit.c:929
	 get_signal+0x16b0/0x2090 kernel/signal.c:2820
	 arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
	 handle_signal_work kernel/entry/common.c:148 [inline]
	 exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
	 exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
	 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
	 syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
	 do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
	 entry_SYSCALL_64_after_hwframe+0x44/0xae

i.e. called as one of task_work_run() entries.

The historical commit 8aac62706ada ("move exit_task_namespaces() outside
of exit_notify()") argues that exit_task_namespaces() must come before
exit_task_work() because ipc_ns cleanup calls fput/task_work_add.

There is accompanying commit e7b2c4069252 ("fput: task_work_add() can
fail if the caller has passed exit_task_work()") in the original series
that makes fput() robust in situations when task_work_add() cannot be
used anymore.

So in order to ensure that task_work_run() entries of the exiting task
have the nsproxy still available, swap the order of
exit_task_namespaces() and exit_task_work().

This change may appear like a partial revert of 8aac62706ada but this
particular ordering change shouldn't matter with the fix from
e7b2c4069252 and the other reason for 8aac62706ada (keeping exit_notify
simpler) still holds.

Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com
Cc: Oleg Nesterov <oleg@redhat.com>
Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 kernel/exit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

I wasn't able to reproduce the syzbot's crash manually so the effectiveness of
the fix is only based on the reasoning.

diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686..0c2abdebb87c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -828,8 +828,8 @@ void __noreturn do_exit(long code)
 	exit_fs(tsk);
 	if (group_dead)
 		disassociate_ctty(1);
-	exit_task_namespaces(tsk);
 	exit_task_work(tsk);
+	exit_task_namespaces(tsk);
 	exit_thread(tsk);
 
 	/*
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread