* [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() @ 2024-04-24 10:52 Stas Sergeev 2024-04-24 10:52 ` [PATCH 1/2] fs: reorganize path_openat() Stas Sergeev ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Stas Sergeev @ 2024-04-24 10:52 UTC (permalink / raw) To: linux-kernel Cc: Stas Sergeev, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche This patch-set implements the OA2_INHERIT_CRED flag for openat2() syscall. It is needed to perform an open operation with the creds that were in effect when the dir_fd was opened. This allows the process to pre-open some dirs and switch eUID (and other UIDs/GIDs) to the less-privileged user, while still retaining the possibility to open/create files within the pre-opened directory set. The sand-boxing is security-oriented: symlinks leading outside of a sand-box are rejected. /proc magic links are rejected. The more detailed description (including security considerations) is available in the log messages of individual patches. Changes in v4: - add optimizations suggested by David Laight <David.Laight@ACULAB.COM> - move security checks to build_open_flags() - force RESOLVE_NO_MAGICLINKS as suggested by Andy Lutomirski <luto@kernel.org> Changes in v3: - partially revert v2 changes to avoid overriding capabilities. Only the bare minimum is overridden: fsuid, fsgid and group_info. Document the fact the full cred override is unwanted, as it may represent an unneeded security risk. Changes in v2: - capture full struct cred instead of just fsuid/fsgid. Suggested by Stefan Metzmacher <metze@samba.org> CC: Stefan Metzmacher <metze@samba.org> CC: Eric Biederman <ebiederm@xmission.com> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Andy Lutomirski <luto@kernel.org> CC: Christian Brauner <brauner@kernel.org> CC: Jan Kara <jack@suse.cz> CC: Jeff Layton <jlayton@kernel.org> CC: Chuck Lever <chuck.lever@oracle.com> CC: Alexander Aring <alex.aring@gmail.com> CC: David Laight <David.Laight@ACULAB.COM> CC: linux-fsdevel@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: linux-api@vger.kernel.org CC: Paolo Bonzini <pbonzini@redhat.com> CC: Christian Göttsche <cgzones@googlemail.com> -- 2.44.0 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 1/2] fs: reorganize path_openat() 2024-04-24 10:52 [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() Stas Sergeev @ 2024-04-24 10:52 ` Stas Sergeev 2024-04-25 8:13 ` kernel test robot 2024-04-24 10:52 ` [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag Stas Sergeev 2024-04-24 16:09 ` [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() Christian Brauner 2 siblings, 1 reply; 18+ messages in thread From: Stas Sergeev @ 2024-04-24 10:52 UTC (permalink / raw) To: linux-kernel Cc: Stas Sergeev, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche This patch moves the call to alloc_empty_file() below the call to path_init(). That changes is needed for the next patch, which adds a cred override for alloc_empty_file(). The needed cred info is only available after the call to path_init(). No functional changes are intended by that patch. Signed-off-by: Stas Sergeev <stsp2@yandex.ru> CC: Eric Biederman <ebiederm@xmission.com> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Christian Brauner <brauner@kernel.org> CC: Jan Kara <jack@suse.cz> CC: Andy Lutomirski <luto@kernel.org> CC: David Laight <David.Laight@ACULAB.COM> CC: linux-fsdevel@vger.kernel.org CC: linux-kernel@vger.kernel.org --- fs/namei.c | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index c5b2a25be7d0..413eef134234 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3781,23 +3781,30 @@ static struct file *path_openat(struct nameidata *nd, { struct file *file; int error; + u64 open_flags = op->open_flag; - file = alloc_empty_file(op->open_flag, current_cred()); - if (IS_ERR(file)) - return file; - - if (unlikely(file->f_flags & __O_TMPFILE)) { - error = do_tmpfile(nd, flags, op, file); - } else if (unlikely(file->f_flags & O_PATH)) { - error = do_o_path(nd, flags, file); + if (unlikely(open_flags & (__O_TMPFILE | O_PATH))) { + file = alloc_empty_file(open_flags, current_cred()); + if (IS_ERR(file)) + return file; + if (open_flags & __O_TMPFILE) + error = do_tmpfile(nd, flags, op, file); + else + error = do_o_path(nd, flags, file); } else { const char *s = path_init(nd, flags); - while (!(error = link_path_walk(s, nd)) && - (s = open_last_lookups(nd, file, op)) != NULL) - ; + file = alloc_empty_file(open_flags, current_cred()); + error = PTR_ERR_OR_ZERO(file); + if (!error) { + while (!(error = link_path_walk(s, nd)) && + (s = open_last_lookups(nd, file, op)) != NULL) + ; + } if (!error) error = do_open(nd, file, op); terminate_walk(nd); + if (IS_ERR(file)) + return file; } if (likely(!error)) { if (likely(file->f_mode & FMODE_OPENED)) -- 2.44.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] fs: reorganize path_openat() 2024-04-24 10:52 ` [PATCH 1/2] fs: reorganize path_openat() Stas Sergeev @ 2024-04-25 8:13 ` kernel test robot 0 siblings, 0 replies; 18+ messages in thread From: kernel test robot @ 2024-04-25 8:13 UTC (permalink / raw) To: Stas Sergeev Cc: oe-lkp, lkp, Eric Biederman, Alexander Viro, Christian Brauner, Jan Kara, Andy Lutomirski, David Laight, linux-fsdevel, linux-kernel, Stas Sergeev, Stefan Metzmacher, Jeff Layton, Chuck Lever, Alexander Aring, linux-api, Paolo Bonzini, Christian Göttsche, oliver.sang Hello, kernel test robot noticed "BUG:sleeping_function_called_from_invalid_context_at_include/linux/sched/mm.h" on: commit: 831d3c6cc6f05873e33f4aaebafbb9c27618ea0b ("[PATCH 1/2] fs: reorganize path_openat()") url: https://github.com/intel-lab-lkp/linux/commits/Stas-Sergeev/fs-reorganize-path_openat/20240424-185527 base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 9d1ddab261f3e2af7c384dc02238784ce0cf9f98 patch link: https://lore.kernel.org/all/20240424105248.189032-2-stsp2@yandex.ru/ patch subject: [PATCH 1/2] fs: reorganize path_openat() in testcase: boot compiler: clang-17 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) +-------------------------------------------------------------------------------+------------+------------+ | | 9d1ddab261 | 831d3c6cc6 | +-------------------------------------------------------------------------------+------------+------------+ | boot_successes | 6 | 0 | | boot_failures | 0 | 6 | | BUG:sleeping_function_called_from_invalid_context_at_include/linux/sched/mm.h | 0 | 6 | +-------------------------------------------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202404251525.39b4af4e-lkp@intel.com [ 0.591465][ T33] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:315 [ 0.592508][ T33] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 33, name: kworker/u8:2 [ 0.593515][ T33] preempt_count: 0, expected: 0 [ 0.594071][ T33] RCU nest depth: 1, expected: 0 [ 0.594633][ T33] CPU: 0 PID: 33 Comm: kworker/u8:2 Not tainted 6.9.0-rc5-00037-g831d3c6cc6f0 #1 [ 0.595637][ T33] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 0.596216][ T33] Workqueue: async async_run_entry_fn [ 0.596216][ T33] Call Trace: [ 0.596216][ T33] <TASK> [ 0.596216][ T33] dump_stack_lvl (lib/dump_stack.c:116) [ 0.596216][ T33] __might_resched (kernel/sched/core.c:10198) [ 0.596216][ T33] kmem_cache_alloc (include/linux/kernel.h:73 include/linux/sched/mm.h:315 mm/slub.c:3746 mm/slub.c:3827 mm/slub.c:3852) [ 0.596216][ T33] alloc_empty_file (fs/file_table.c:203) [ 0.596216][ T33] path_openat (fs/namei.c:3796) [ 0.596216][ T33] do_filp_open (fs/namei.c:3833) [ 0.596216][ T33] file_open_name (fs/open.c:1352) [ 0.596216][ T33] filp_open (fs/open.c:1371) [ 0.596216][ T33] do_name (init/initramfs.c:373) [ 0.596216][ T33] flush_buffer (init/initramfs.c:452 init/initramfs.c:464) [ 0.596216][ T33] ? __pfx_flush_buffer (init/initramfs.c:458) [ 0.596216][ T33] __gunzip (lib/decompress_inflate.c:161) [ 0.596216][ T33] ? __pfx_nofill (lib/decompress_inflate.c:37) [ 0.596216][ T33] unpack_to_rootfs (init/initramfs.c:520) [ 0.596216][ T33] ? __pfx_error (init/initramfs.c:59) [ 0.596216][ T33] do_populate_rootfs (init/initramfs.c:714) [ 0.596216][ T33] async_run_entry_fn (kernel/async.c:136) [ 0.596216][ T33] process_scheduled_works (kernel/workqueue.c:3259) [ 0.596216][ T33] worker_thread (include/linux/list.h:373 kernel/workqueue.c:955 kernel/workqueue.c:3417) [ 0.596216][ T33] ? __pfx_worker_thread (kernel/workqueue.c:3362) [ 0.596216][ T33] kthread (kernel/kthread.c:390) [ 0.596216][ T33] ? __pfx_kthread (kernel/kthread.c:341) [ 0.596216][ T33] ret_from_fork (arch/x86/kernel/process.c:153) [ 0.596216][ T33] ? __pfx_kthread (kernel/kthread.c:341) [ 0.596216][ T33] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) [ 0.596216][ T33] </TASK> [ 1.603321][ T33] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:315 [ 1.604448][ T33] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 33, name: kworker/u8:2 [ 1.605466][ T33] preempt_count: 0, expected: 0 [ 1.606028][ T33] RCU nest depth: 1, expected: 0 [ 1.606599][ T33] CPU: 0 PID: 33 Comm: kworker/u8:2 Tainted: G W 6.9.0-rc5-00037-g831d3c6cc6f0 #1 [ 1.607761][ T33] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 1.608136][ T33] Workqueue: async async_run_entry_fn [ 1.608136][ T33] Call Trace: [ 1.608136][ T33] <TASK> [ 1.608136][ T33] dump_stack_lvl (lib/dump_stack.c:116) [ 1.608136][ T33] __might_resched (kernel/sched/core.c:10198) [ 1.608136][ T33] kmem_cache_alloc (include/linux/kernel.h:73 include/linux/sched/mm.h:315 mm/slub.c:3746 mm/slub.c:3827 mm/slub.c:3852) [ 1.608136][ T33] alloc_empty_file (fs/file_table.c:203) [ 1.608136][ T33] path_openat (fs/namei.c:3796) [ 1.608136][ T33] do_filp_open (fs/namei.c:3833) [ 1.608136][ T33] file_open_name (fs/open.c:1352) [ 1.608136][ T33] filp_open (fs/open.c:1371) [ 1.608136][ T33] do_name (init/initramfs.c:373) [ 1.608136][ T33] flush_buffer (init/initramfs.c:452 init/initramfs.c:464) [ 1.608136][ T33] ? __pfx_flush_buffer (init/initramfs.c:458) [ 1.608136][ T33] __gunzip (lib/decompress_inflate.c:161) [ 1.608136][ T33] ? __pfx_nofill (lib/decompress_inflate.c:37) [ 1.608136][ T33] unpack_to_rootfs (init/initramfs.c:520) [ 1.608136][ T33] ? __pfx_error (init/initramfs.c:59) [ 1.608136][ T33] do_populate_rootfs (init/initramfs.c:714) [ 1.608136][ T33] async_run_entry_fn (kernel/async.c:136) [ 1.608136][ T33] process_scheduled_works (kernel/workqueue.c:3259) [ 1.608136][ T33] worker_thread (include/linux/list.h:373 kernel/workqueue.c:955 kernel/workqueue.c:3417) [ 1.608136][ T33] ? __pfx_worker_thread (kernel/workqueue.c:3362) [ 1.608136][ T33] kthread (kernel/kthread.c:390) [ 1.608136][ T33] ? __pfx_kthread (kernel/kthread.c:341) [ 1.608136][ T33] ret_from_fork (arch/x86/kernel/process.c:153) [ 1.608136][ T33] ? __pfx_kthread (kernel/kthread.c:341) [ 1.608136][ T33] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) [ 1.608136][ T33] </TASK> [ 2.602317][ T33] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:315 [ 2.603414][ T33] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 33, name: kworker/u8:2 [ 2.604433][ T33] preempt_count: 0, expected: 0 [ 2.604985][ T33] RCU nest depth: 1, expected: 0 [ 2.605547][ T33] CPU: 0 PID: 33 Comm: kworker/u8:2 Tainted: G W 6.9.0-rc5-00037-g831d3c6cc6f0 #1 [ 2.606689][ T33] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 2.607825][ T33] Workqueue: async async_run_entry_fn [ 2.608140][ T33] Call Trace: [ 2.608140][ T33] <TASK> [ 2.608140][ T33] dump_stack_lvl (lib/dump_stack.c:116) [ 2.608140][ T33] __might_resched (kernel/sched/core.c:10198) [ 2.608140][ T33] kmem_cache_alloc (include/linux/kernel.h:73 include/linux/sched/mm.h:315 mm/slub.c:3746 mm/slub.c:3827 mm/slub.c:3852) [ 2.608140][ T33] alloc_empty_file (fs/file_table.c:203) [ 2.608140][ T33] path_openat (fs/namei.c:3796) [ 2.608140][ T33] do_filp_open (fs/namei.c:3833) [ 2.608140][ T33] file_open_name (fs/open.c:1352) [ 2.608140][ T33] filp_open (fs/open.c:1371) [ 2.608140][ T33] do_name (init/initramfs.c:373) [ 2.608140][ T33] flush_buffer (init/initramfs.c:452 init/initramfs.c:464) [ 2.608140][ T33] ? __pfx_flush_buffer (init/initramfs.c:458) [ 2.608140][ T33] __gunzip (lib/decompress_inflate.c:161) [ 2.608140][ T33] ? __pfx_nofill (lib/decompress_inflate.c:37) [ 2.608140][ T33] unpack_to_rootfs (init/initramfs.c:520) [ 2.608140][ T33] ? __pfx_error (init/initramfs.c:59) [ 2.608140][ T33] do_populate_rootfs (init/initramfs.c:714) [ 2.608140][ T33] async_run_entry_fn (kernel/async.c:136) [ 2.608140][ T33] process_scheduled_works (kernel/workqueue.c:3259) [ 2.608140][ T33] worker_thread (include/linux/list.h:373 kernel/workqueue.c:955 kernel/workqueue.c:3417) [ 2.608140][ T33] ? __pfx_worker_thread (kernel/workqueue.c:3362) [ 2.608140][ T33] kthread (kernel/kthread.c:390) [ 2.608140][ T33] ? __pfx_kthread (kernel/kthread.c:341) [ 2.608140][ T33] ret_from_fork (arch/x86/kernel/process.c:153) [ 2.608140][ T33] ? __pfx_kthread (kernel/kthread.c:341) [ 2.608140][ T33] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) [ 2.608140][ T33] </TASK> [ 3.648001][ T33] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:315 [ 3.649103][ T33] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 33, name: kworker/u8:2 [ 3.650109][ T33] preempt_count: 0, expected: 0 [ 3.650660][ T33] RCU nest depth: 1, expected: 0 [ 3.651223][ T33] CPU: 0 PID: 33 Comm: kworker/u8:2 Tainted: G W 6.9.0-rc5-00037-g831d3c6cc6f0 #1 [ 3.651979][ T33] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 3.651979][ T33] Workqueue: async async_run_entry_fn [ 3.651979][ T33] Call Trace: [ 3.651979][ T33] <TASK> [ 3.651979][ T33] dump_stack_lvl (lib/dump_stack.c:116) [ 3.651979][ T33] __might_resched (kernel/sched/core.c:10198) [ 3.651979][ T33] kmem_cache_alloc (include/linux/kernel.h:73 include/linux/sched/mm.h:315 mm/slub.c:3746 mm/slub.c:3827 mm/slub.c:3852) [ 3.651979][ T33] alloc_empty_file (fs/file_table.c:203) [ 3.651979][ T33] path_openat (fs/namei.c:3796) [ 3.651979][ T33] do_filp_open (fs/namei.c:3833) [ 3.651979][ T33] file_open_name (fs/open.c:1352) [ 3.651979][ T33] filp_open (fs/open.c:1371) [ 3.651979][ T33] do_name (init/initramfs.c:373) [ 3.651979][ T33] flush_buffer (init/initramfs.c:452 init/initramfs.c:464) [ 3.651979][ T33] ? __pfx_flush_buffer (init/initramfs.c:458) [ 3.651979][ T33] __gunzip (lib/decompress_inflate.c:161) [ 3.651979][ T33] ? __pfx_nofill (lib/decompress_inflate.c:37) [ 3.651979][ T33] unpack_to_rootfs (init/initramfs.c:520) [ 3.651979][ T33] ? __pfx_error (init/initramfs.c:59) [ 3.651979][ T33] do_populate_rootfs (init/initramfs.c:714) [ 3.651979][ T33] async_run_entry_fn (kernel/async.c:136) [ 3.651979][ T33] process_scheduled_works (kernel/workqueue.c:3259) [ 3.651979][ T33] worker_thread (include/linux/list.h:373 kernel/workqueue.c:955 kernel/workqueue.c:3417) [ 3.651979][ T33] ? __pfx_worker_thread (kernel/workqueue.c:3362) [ 3.651979][ T33] kthread (kernel/kthread.c:390) [ 3.651979][ T33] ? __pfx_kthread (kernel/kthread.c:341) [ 3.651979][ T33] ret_from_fork (arch/x86/kernel/process.c:153) [ 3.651979][ T33] ? __pfx_kthread (kernel/kthread.c:341) [ 3.651979][ T33] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) [ 3.651979][ T33] </TASK> [ 3.705833][ T33] Freeing initrd memory: 185612K The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20240425/202404251525.39b4af4e-lkp@intel.com -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag 2024-04-24 10:52 [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() Stas Sergeev 2024-04-24 10:52 ` [PATCH 1/2] fs: reorganize path_openat() Stas Sergeev @ 2024-04-24 10:52 ` Stas Sergeev 2024-04-25 2:31 ` Al Viro ` (2 more replies) 2024-04-24 16:09 ` [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() Christian Brauner 2 siblings, 3 replies; 18+ messages in thread From: Stas Sergeev @ 2024-04-24 10:52 UTC (permalink / raw) To: linux-kernel Cc: Stas Sergeev, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche This flag performs the open operation with the fs credentials (fsuid, fsgid, group_info) that were in effect when dir_fd was opened. This allows the process to pre-open some directories and then change eUID (and all other UIDs/GIDs) to a less-privileged user, retaining the ability to open/create files within these directories. Design goal: The idea is to provide a very light-weight sandboxing, where the process, without the use of any heavy-weight techniques like chroot within namespaces, can restrict the access to the set of pre-opened directories. This patch is just a first step to such sandboxing. If things go well, in the future the same extension can be added to more syscalls. These should include at least unlinkat(), renameat2() and the not-yet-upstreamed setxattrat(). Security considerations: - Only the bare minimal set of credentials is overridden: fsuid, fsgid and group_info. The rest, for example capabilities, are not overridden to avoid unneeded security risks. - To avoid sandboxing escape, this patch makes sure the restricted lookup modes are used. Namely, RESOLVE_BENEATH or RESOLVE_IN_ROOT. - To avoid leaking creds across exec, this patch requires O_CLOEXEC flag on a directory. - Magic /proc symlinks are discarded, as suggested by Andy Lutomirski <luto@kernel.org> Use cases: Virtual machines that deal with untrusted code, can use that instead of a more heavy-weighted approaches. Currently the approach is being tested on a dosemu2 VM. Signed-off-by: Stas Sergeev <stsp2@yandex.ru> CC: Stefan Metzmacher <metze@samba.org> CC: Eric Biederman <ebiederm@xmission.com> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Andy Lutomirski <luto@kernel.org> CC: Christian Brauner <brauner@kernel.org> CC: Jan Kara <jack@suse.cz> CC: Jeff Layton <jlayton@kernel.org> CC: Chuck Lever <chuck.lever@oracle.com> CC: Alexander Aring <alex.aring@gmail.com> CC: linux-fsdevel@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: Paolo Bonzini <pbonzini@redhat.com> CC: Christian Göttsche <cgzones@googlemail.com> --- fs/internal.h | 2 +- fs/namei.c | 56 ++++++++++++++++++++++++++++++++++-- fs/open.c | 10 ++++++- include/linux/fcntl.h | 2 ++ include/uapi/linux/openat2.h | 3 ++ 5 files changed, 69 insertions(+), 4 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index 7ca738904e34..692b53b19aad 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -169,7 +169,7 @@ static inline void sb_end_ro_state_change(struct super_block *sb) * open.c */ struct open_flags { - int open_flag; + u64 open_flag; umode_t mode; int acc_mode; int intent; diff --git a/fs/namei.c b/fs/namei.c index 413eef134234..aeb9f504538e 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -586,6 +586,9 @@ struct nameidata { int dfd; vfsuid_t dir_vfsuid; umode_t dir_mode; + kuid_t dir_open_fsuid; + kgid_t dir_open_fsgid; + struct group_info *dir_open_groups; } __randomize_layout; #define ND_ROOT_PRESET 1 @@ -695,6 +698,8 @@ static void terminate_walk(struct nameidata *nd) nd->depth = 0; nd->path.mnt = NULL; nd->path.dentry = NULL; + if (nd->dir_open_groups) + put_group_info(nd->dir_open_groups); } /* path_put is needed afterwards regardless of success or failure */ @@ -2414,6 +2419,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) get_fs_pwd(current->fs, &nd->path); nd->inode = nd->path.dentry->d_inode; } + nd->dir_open_fsuid = current_cred()->fsuid; + nd->dir_open_fsgid = current_cred()->fsgid; + nd->dir_open_groups = get_current_groups(); } else { /* Caller must check execute permissions on the starting path component */ struct fd f = fdget_raw(nd->dfd); @@ -2437,6 +2445,10 @@ static const char *path_init(struct nameidata *nd, unsigned flags) path_get(&nd->path); nd->inode = nd->path.dentry->d_inode; } + nd->dir_open_fsuid = f.file->f_cred->fsuid; + nd->dir_open_fsgid = f.file->f_cred->fsgid; + nd->dir_open_groups = get_group_info( + f.file->f_cred->group_info); fdput(f); } @@ -3776,6 +3788,29 @@ static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file) return error; } +static const struct cred *openat2_override_creds(struct nameidata *nd) +{ + const struct cred *old_cred; + struct cred *override_cred; + + override_cred = prepare_creds(); + if (!override_cred) + return NULL; + + override_cred->fsuid = nd->dir_open_fsuid; + override_cred->fsgid = nd->dir_open_fsgid; + override_cred->group_info = nd->dir_open_groups; + + override_cred->non_rcu = 1; + + old_cred = override_creds(override_cred); + + /* override_cred() gets its own ref */ + put_cred(override_cred); + + return old_cred; +} + static struct file *path_openat(struct nameidata *nd, const struct open_flags *op, unsigned flags) { @@ -3793,8 +3828,23 @@ static struct file *path_openat(struct nameidata *nd, error = do_o_path(nd, flags, file); } else { const char *s = path_init(nd, flags); - file = alloc_empty_file(open_flags, current_cred()); - error = PTR_ERR_OR_ZERO(file); + const struct cred *old_cred = NULL; + + error = 0; + if (open_flags & OA2_INHERIT_CRED) { + /* Only work with O_CLOEXEC dirs. */ + if (!get_close_on_exec(nd->dfd)) + error = -EPERM; + + if (!error) + old_cred = openat2_override_creds(nd); + } + if (!error) { + file = alloc_empty_file(open_flags, current_cred()); + error = PTR_ERR_OR_ZERO(file); + } else { + file = ERR_PTR(error); + } if (!error) { while (!(error = link_path_walk(s, nd)) && (s = open_last_lookups(nd, file, op)) != NULL) @@ -3802,6 +3852,8 @@ static struct file *path_openat(struct nameidata *nd, } if (!error) error = do_open(nd, file, op); + if (old_cred) + revert_creds(old_cred); terminate_walk(nd); if (IS_ERR(file)) return file; diff --git a/fs/open.c b/fs/open.c index ee8460c83c77..c871ff8fc6e3 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1225,7 +1225,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) * values before calling build_open_flags(), but openat2(2) checks all * of its arguments. */ - if (flags & ~VALID_OPEN_FLAGS) + if (flags & ~VALID_OPENAT2_FLAGS) return -EINVAL; if (how->resolve & ~VALID_RESOLVE_FLAGS) return -EINVAL; @@ -1326,6 +1326,14 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) lookup_flags |= LOOKUP_CACHED; } + if (flags & OA2_INHERIT_CRED) { + /* Inherit creds only with scoped look-up modes. */ + if (!(lookup_flags & LOOKUP_IS_SCOPED)) + return -EPERM; + /* Reject /proc "magic" links if inheriting creds. */ + lookup_flags |= LOOKUP_NO_MAGICLINKS; + } + op->lookup_flags = lookup_flags; return 0; } diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index a332e79b3207..b71f8b162102 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -12,6 +12,8 @@ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE) +#define VALID_OPENAT2_FLAGS (VALID_OPEN_FLAGS | OA2_INHERIT_CRED) + /* List of all valid flags for the how->resolve argument: */ #define VALID_RESOLVE_FLAGS \ (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \ diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h index a5feb7604948..cdd676a10b62 100644 --- a/include/uapi/linux/openat2.h +++ b/include/uapi/linux/openat2.h @@ -40,4 +40,7 @@ struct open_how { return -EAGAIN if that's not possible. */ +/* openat2-specific flags go to upper 4 bytes. */ +#define OA2_INHERIT_CRED (1ULL << 32) + #endif /* _UAPI_LINUX_OPENAT2_H */ -- 2.44.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag 2024-04-24 10:52 ` [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag Stas Sergeev @ 2024-04-25 2:31 ` Al Viro 2024-04-25 7:24 ` stsp 2024-04-25 9:23 ` stsp 2024-04-25 13:50 ` kernel test robot 2024-04-25 14:02 ` Christian Brauner 2 siblings, 2 replies; 18+ messages in thread From: Al Viro @ 2024-04-25 2:31 UTC (permalink / raw) To: Stas Sergeev Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche On Wed, Apr 24, 2024 at 01:52:48PM +0300, Stas Sergeev wrote: > @@ -3793,8 +3828,23 @@ static struct file *path_openat(struct nameidata *nd, > error = do_o_path(nd, flags, file); > } else { > const char *s = path_init(nd, flags); > - file = alloc_empty_file(open_flags, current_cred()); > - error = PTR_ERR_OR_ZERO(file); > + const struct cred *old_cred = NULL; > + > + error = 0; > + if (open_flags & OA2_INHERIT_CRED) { > + /* Only work with O_CLOEXEC dirs. */ > + if (!get_close_on_exec(nd->dfd)) > + error = -EPERM; > + > + if (!error) > + old_cred = openat2_override_creds(nd); > + } > + if (!error) { > + file = alloc_empty_file(open_flags, current_cred()); Consider the following, currently absolutely harmless situation: * process is owned by luser:students. * descriptor 69 refers to root-opened root directory (O_RDONLY) What's the expected result of fcntl(69, F_SEFTD, O_CLOEXEC); opening "etc/shadow" with dirfd equal to 69 and your flag given subsequent read() from the resulting descriptor? At which point will the kernel say "go fuck yourself, I'm not letting you read that file", provided that attacker passes that new flag of yours? As a bonus question, how about opening it for _write_, seeing that this is an obvious instant roothole? Again, currently the setup that has a root-opened directory in descriptor table of a non-root process is safe. Incidentally, suppose you have the same process run with stdin opened (r/o) by root. F_SETFD it to O_CLOEXEC, then use your open with dirfd being 0, pathname - "" and flags - O_RDWR. AFAICS, without an explicit opt-in by the original opener it's a non-starter, and TBH I doubt that even with such opt-in (FMODE_CRED, whatever) it would be a good idea - it gives too much. NAKed-by: Al Viro <viro@zeniv.linux.org.uk> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag 2024-04-25 2:31 ` Al Viro @ 2024-04-25 7:24 ` stsp 2024-04-25 9:23 ` stsp 1 sibling, 0 replies; 18+ messages in thread From: stsp @ 2024-04-25 7:24 UTC (permalink / raw) To: Al Viro Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche 25.04.2024 05:31, Al Viro пишет: > Consider the following, currently absolutely harmless situation: > * process is owned by luser:students. > * descriptor 69 refers to root-opened root directory (O_RDONLY) > What's the expected result of > fcntl(69, F_SEFTD, O_CLOEXEC); > opening "etc/shadow" with dirfd equal to 69 and your flag given > subsequent read() from the resulting descriptor? > > At which point will the kernel say "go fuck yourself, I'm not letting you > read that file", provided that attacker passes that new flag of yours? > > As a bonus question, how about opening it for _write_, seeing that this > is an obvious instant roothole? > > Again, currently the setup that has a root-opened directory in descriptor > table of a non-root process is safe. > > Incidentally, suppose you have the same process run with stdin opened > (r/o) by root. F_SETFD it to O_CLOEXEC, then use your open with > dirfd being 0, pathname - "" and flags - O_RDWR. Ok, F_SETFD, how simple. :( > AFAICS, without an explicit opt-in by the original opener it's > a non-starter, and TBH I doubt that even with such opt-in (FMODE_CRED, > whatever) it would be a good idea - it gives too much. Yes, which is why I am quite sceptical to this FMODE_CRED idea. Please note that my O_CLOEXEC check actually meant to check that exactly this process have opened the dir. It just didn't happen that way, as you pointed. Can I replace the O_CLOEXEC check with some explicit check that makes sure the fd was opened by exactly that process? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag 2024-04-25 2:31 ` Al Viro 2024-04-25 7:24 ` stsp @ 2024-04-25 9:23 ` stsp 1 sibling, 0 replies; 18+ messages in thread From: stsp @ 2024-04-25 9:23 UTC (permalink / raw) To: Al Viro Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche 25.04.2024 05:31, Al Viro пишет: > Incidentally, suppose you have the same process run with stdin opened > (r/o) by root. F_SETFD it to O_CLOEXEC, then use your open with > dirfd being 0, pathname - "" and flags - O_RDWR. I actually checked this with the test-case. It seems to return ENOENT: Breakpoint 1, openat2 (dirfd=0, pathname=0x7fffffffdbee "", how=0x7fffffffd5e0, size=24) at tst.c:13 13 return syscall(SYS_openat2, dirfd, pathname, how, size); (gdb) fin Run till exit from #0 openat2 (dirfd=0, pathname=0x7fffffffdbee "", how=0x7fffffffd5e0, size=24) at tst.c:13 0x000000000040167b in main (argc=3, argv=0x7fffffffd7b8) at tst.c:140 140 fd = openat2(0, efile, &how1, sizeof(how1)); Value returned is $1 = -1 (gdb) list 135 err = fcntl(0, F_SETFD, O_CLOEXEC); 136 if (err) { 137 perror("fcntl(F_SETFD)"); 138 return EXIT_FAILURE; 139 } 140 fd = openat2(0, efile, &how1, sizeof(how1)); 141 if (fd == -1) { 142 perror("openat2(1)"); 143 // return EXIT_FAILURE; 144 } else { (gdb) p errno $2 = 2 So it seems the creds can't be stolen from a non-dir fd, but I wonder why ENOENT is returned instead of ENOTDIR. Such ENOENT is not dicumented in a man page of openat2(), so I guess there is some problem here even w/o my patch. :) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag 2024-04-24 10:52 ` [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag Stas Sergeev 2024-04-25 2:31 ` Al Viro @ 2024-04-25 13:50 ` kernel test robot 2024-04-25 14:02 ` Christian Brauner 2 siblings, 0 replies; 18+ messages in thread From: kernel test robot @ 2024-04-25 13:50 UTC (permalink / raw) To: Stas Sergeev Cc: oe-lkp, lkp, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, Paolo Bonzini, Christian Göttsche, linux-fsdevel, linux-kernel, Stas Sergeev, David Laight, linux-api, oliver.sang Hello, kernel test robot noticed "BUG:KASAN:wild-memory-access_in_terminate_walk" on: commit: 97bb54b42b1d6150e9ae11a7bf7833ed9f8c471d ("[PATCH 2/2] openat2: add OA2_INHERIT_CRED flag") url: https://github.com/intel-lab-lkp/linux/commits/Stas-Sergeev/fs-reorganize-path_openat/20240424-185527 base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 9d1ddab261f3e2af7c384dc02238784ce0cf9f98 patch link: https://lore.kernel.org/all/20240424105248.189032-3-stsp2@yandex.ru/ patch subject: [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag in testcase: boot compiler: clang-17 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) +---------------------------------------------------------------------------------------+------------+------------+ | | 831d3c6cc6 | 97bb54b42b | +---------------------------------------------------------------------------------------+------------+------------+ | BUG:KASAN:wild-memory-access_in_terminate_walk | 0 | 12 | | canonical_address#:#[##] | 0 | 12 | | RIP:terminate_walk | 0 | 12 | | Kernel_panic-not_syncing:Fatal_exception | 0 | 12 | +---------------------------------------------------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202404252107.3c18eed2-lkp@intel.com [ 2.555857][ T16] BUG: KASAN: wild-memory-access in terminate_walk (include/linux/instrumented.h:? include/linux/atomic/atomic-instrumented.h:400 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) [ 2.556181][ T16] Write of size 4 at addr aaaaaaaaaaaaaaaa by task kdevtmpfs/16 [ 2.556181][ T16] [ 2.556181][ T16] CPU: 0 PID: 16 Comm: kdevtmpfs Tainted: G T 6.9.0-rc5-00038-g97bb54b42b1d #1 c90cc2d91176f38ca16e85ead0a72934082854cd [ 2.556181][ T16] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 2.556181][ T16] Call Trace: [ 2.556181][ T16] <TASK> [ 2.556181][ T16] dump_stack_lvl (lib/dump_stack.c:116) [ 2.556181][ T16] print_report (mm/kasan/report.c:?) [ 2.556181][ T16] ? kasan_report (mm/kasan/report.c:214 mm/kasan/report.c:590) [ 2.556181][ T16] ? terminate_walk (include/linux/instrumented.h:? include/linux/atomic/atomic-instrumented.h:400 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) [ 2.556181][ T16] kasan_report (mm/kasan/report.c:603) [ 2.556181][ T16] ? terminate_walk (include/linux/instrumented.h:? include/linux/atomic/atomic-instrumented.h:400 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) [ 2.556181][ T16] kasan_check_range (mm/kasan/generic.c:?) [ 2.556181][ T16] terminate_walk (include/linux/instrumented.h:? include/linux/atomic/atomic-instrumented.h:400 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) [ 2.556181][ T16] path_lookupat (fs/namei.c:2515) [ 2.556181][ T16] filename_lookup (fs/namei.c:2526) [ 2.556181][ T16] kern_path (fs/namei.c:2634) [ 2.556181][ T16] init_mount (fs/init.c:22) [ 2.556181][ T16] devtmpfs_setup (drivers/base/devtmpfs.c:419) [ 2.556181][ T16] devtmpfsd (drivers/base/devtmpfs.c:436) [ 2.556181][ T16] kthread (kernel/kthread.c:390) [ 2.556181][ T16] ? vclkdev_alloc (drivers/base/devtmpfs.c:435) [ 2.556181][ T16] ? kthread_unuse_mm (kernel/kthread.c:341) [ 2.556181][ T16] ret_from_fork (arch/x86/kernel/process.c:153) [ 2.556181][ T16] ? kthread_unuse_mm (kernel/kthread.c:341) [ 2.556181][ T16] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) [ 2.556181][ T16] </TASK> [ 2.556181][ T16] ================================================================== [ 2.556184][ T16] Disabling lock debugging due to kernel taint [ 2.556901][ T16] general protection fault, probably for non-canonical address 0xaaaaaaaaaaaaaaaa: 0000 [#1] KASAN PTI [ 2.558131][ T16] CPU: 0 PID: 16 Comm: kdevtmpfs Tainted: G B T 6.9.0-rc5-00038-g97bb54b42b1d #1 c90cc2d91176f38ca16e85ead0a72934082854cd [ 2.559653][ T16] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 2.560181][ T16] RIP: 0010:terminate_walk (arch/x86/include/asm/atomic.h:103 include/linux/atomic/atomic-arch-fallback.h:949 include/linux/atomic/atomic-instrumented.h:401 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) [ 2.560181][ T16] Code: 03 43 80 3c 2e 00 74 08 4c 89 ff e8 01 61 f4 ff 49 8b 1f 48 85 db 74 41 48 89 df be 04 00 00 00 e8 dc 61 f4 ff b8 ff ff ff ff <0f> c1 03 83 f8 01 75 25 43 80 3c 2e 00 74 08 4c 89 ff e8 d0 60 f4 All code ======== 0: 03 43 80 add -0x80(%rbx),%eax 3: 3c 2e cmp $0x2e,%al 5: 00 74 08 4c add %dh,0x4c(%rax,%rcx,1) 9: 89 ff mov %edi,%edi b: e8 01 61 f4 ff call 0xfffffffffff46111 10: 49 8b 1f mov (%r15),%rbx 13: 48 85 db test %rbx,%rbx 16: 74 41 je 0x59 18: 48 89 df mov %rbx,%rdi 1b: be 04 00 00 00 mov $0x4,%esi 20: e8 dc 61 f4 ff call 0xfffffffffff46201 25: b8 ff ff ff ff mov $0xffffffff,%eax 2a:* 0f c1 03 xadd %eax,(%rbx) <-- trapping instruction 2d: 83 f8 01 cmp $0x1,%eax 30: 75 25 jne 0x57 32: 43 80 3c 2e 00 cmpb $0x0,(%r14,%r13,1) 37: 74 08 je 0x41 39: 4c 89 ff mov %r15,%rdi 3c: e8 .byte 0xe8 3d: d0 60 f4 shlb -0xc(%rax) Code starting with the faulting instruction =========================================== 0: 0f c1 03 xadd %eax,(%rbx) 3: 83 f8 01 cmp $0x1,%eax 6: 75 25 jne 0x2d 8: 43 80 3c 2e 00 cmpb $0x0,(%r14,%r13,1) d: 74 08 je 0x17 f: 4c 89 ff mov %r15,%rdi 12: e8 .byte 0xe8 13: d0 60 f4 shlb -0xc(%rax) [ 2.560181][ T16] RSP: 0000:ffffc9000010fc40 EFLAGS: 00010246 [ 2.560181][ T16] RAX: 00000000ffffffff RBX: aaaaaaaaaaaaaaaa RCX: ffffffff811e4a0f [ 2.560181][ T16] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffff8792adc0 [ 2.560181][ T16] RBP: 0000000000000011 R08: ffffffff8792adc7 R09: 1ffffffff0f255b8 [ 2.560181][ T16] R10: dffffc0000000000 R11: fffffbfff0f255b9 R12: 1ffff92000021fc4 [ 2.560181][ T16] R13: dffffc0000000000 R14: 1ffff92000021fc1 R15: ffffc9000010fe08 [ 2.560181][ T16] FS: 0000000000000000(0000) GS:ffffffff878dc000(0000) knlGS:0000000000000000 [ 2.560181][ T16] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2.560181][ T16] CR2: ffff88843ffff000 CR3: 000000000789c000 CR4: 00000000000406f0 [ 2.560181][ T16] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2.560181][ T16] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2.560181][ T16] Call Trace: [ 2.560181][ T16] <TASK> [ 2.560181][ T16] ? __die_body (arch/x86/kernel/dumpstack.c:421) [ 2.560181][ T16] ? die_addr (arch/x86/kernel/dumpstack.c:?) [ 2.560181][ T16] ? exc_general_protection (arch/x86/kernel/traps.c:?) [ 2.560181][ T16] ? end_report (arch/x86/include/asm/current.h:49 mm/kasan/report.c:240) [ 2.560181][ T16] ? asm_exc_general_protection (arch/x86/include/asm/idtentry.h:617) [ 2.560181][ T16] ? add_taint (arch/x86/include/asm/bitops.h:60 include/asm-generic/bitops/instrumented-atomic.h:29 kernel/panic.c:555) [ 2.560181][ T16] ? terminate_walk (arch/x86/include/asm/atomic.h:103 include/linux/atomic/atomic-arch-fallback.h:949 include/linux/atomic/atomic-instrumented.h:401 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) [ 2.560181][ T16] path_lookupat (fs/namei.c:2515) [ 2.560181][ T16] filename_lookup (fs/namei.c:2526) [ 2.560181][ T16] kern_path (fs/namei.c:2634) [ 2.560181][ T16] init_mount (fs/init.c:22) [ 2.560181][ T16] devtmpfs_setup (drivers/base/devtmpfs.c:419) [ 2.560181][ T16] devtmpfsd (drivers/base/devtmpfs.c:436) [ 2.560181][ T16] kthread (kernel/kthread.c:390) [ 2.560181][ T16] ? vclkdev_alloc (drivers/base/devtmpfs.c:435) [ 2.560181][ T16] ? kthread_unuse_mm (kernel/kthread.c:341) [ 2.560181][ T16] ret_from_fork (arch/x86/kernel/process.c:153) [ 2.560181][ T16] ? kthread_unuse_mm (kernel/kthread.c:341) [ 2.560181][ T16] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) [ 2.560181][ T16] </TASK> [ 2.560181][ T16] Modules linked in: [ 2.560183][ T16] ---[ end trace 0000000000000000 ]--- [ 2.560820][ T16] RIP: 0010:terminate_walk (arch/x86/include/asm/atomic.h:103 include/linux/atomic/atomic-arch-fallback.h:949 include/linux/atomic/atomic-instrumented.h:401 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) [ 2.561462][ T16] Code: 03 43 80 3c 2e 00 74 08 4c 89 ff e8 01 61 f4 ff 49 8b 1f 48 85 db 74 41 48 89 df be 04 00 00 00 e8 dc 61 f4 ff b8 ff ff ff ff <0f> c1 03 83 f8 01 75 25 43 80 3c 2e 00 74 08 4c 89 ff e8 d0 60 f4 All code ======== 0: 03 43 80 add -0x80(%rbx),%eax 3: 3c 2e cmp $0x2e,%al 5: 00 74 08 4c add %dh,0x4c(%rax,%rcx,1) 9: 89 ff mov %edi,%edi b: e8 01 61 f4 ff call 0xfffffffffff46111 10: 49 8b 1f mov (%r15),%rbx 13: 48 85 db test %rbx,%rbx 16: 74 41 je 0x59 18: 48 89 df mov %rbx,%rdi 1b: be 04 00 00 00 mov $0x4,%esi 20: e8 dc 61 f4 ff call 0xfffffffffff46201 25: b8 ff ff ff ff mov $0xffffffff,%eax 2a:* 0f c1 03 xadd %eax,(%rbx) <-- trapping instruction 2d: 83 f8 01 cmp $0x1,%eax 30: 75 25 jne 0x57 32: 43 80 3c 2e 00 cmpb $0x0,(%r14,%r13,1) 37: 74 08 je 0x41 39: 4c 89 ff mov %r15,%rdi 3c: e8 .byte 0xe8 3d: d0 60 f4 shlb -0xc(%rax) Code starting with the faulting instruction =========================================== 0: 0f c1 03 xadd %eax,(%rbx) 3: 83 f8 01 cmp $0x1,%eax 6: 75 25 jne 0x2d 8: 43 80 3c 2e 00 cmpb $0x0,(%r14,%r13,1) d: 74 08 je 0x17 f: 4c 89 ff mov %r15,%rdi 12: e8 .byte 0xe8 13: d0 60 f4 shlb -0xc(%rax) The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20240425/202404252107.3c18eed2-lkp@intel.com -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag 2024-04-24 10:52 ` [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag Stas Sergeev 2024-04-25 2:31 ` Al Viro 2024-04-25 13:50 ` kernel test robot @ 2024-04-25 14:02 ` Christian Brauner 2024-04-26 13:36 ` stsp 2 siblings, 1 reply; 18+ messages in thread From: Christian Brauner @ 2024-04-25 14:02 UTC (permalink / raw) To: Stas Sergeev Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche > struct open_flags { > - int open_flag; > + u64 open_flag; Btw, this change taken together with > +#define VALID_OPENAT2_FLAGS (VALID_OPEN_FLAGS | OA2_INHERIT_CRED) is also ripe to causes subtle bugs and security issues. This new VALID_OPENAT2_FLAGS define bypasses BUILD_BUG_ON_MSG(upper_32_bits(VALID_OPEN_FLAGS), "struct open_flags doesn't yet handle flags > 32 bits"); in build_open_flags(). And right now lookup_open(), open_last_lookups(), and do_open() just do: int open_flag = op->open_flag; Because op->open_flag was 32bit that was fine. But now ->open_flag is 64bit which means we truncate the upper 32bit including OA2_INHERIT_CRED or any other new flag in the upper 32bits in those functions. So as soon as there's an additional check in e.g., do_open() for OA2_INHERIT_CRED or in any of the other helpers that's security relevant we're fscked because that flag is never seen and no bot will help us here. And it's super easy to miss during review... ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag 2024-04-25 14:02 ` Christian Brauner @ 2024-04-26 13:36 ` stsp 0 siblings, 0 replies; 18+ messages in thread From: stsp @ 2024-04-26 13:36 UTC (permalink / raw) To: Christian Brauner Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche 25.04.2024 17:02, Christian Brauner пишет: >> struct open_flags { >> - int open_flag; >> + u64 open_flag; > Btw, this change taken together with All fixed in v5. I dropped u64 use. Other comments are addressed as well. Please let me know if I missed some. Thank you. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() 2024-04-24 10:52 [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() Stas Sergeev 2024-04-24 10:52 ` [PATCH 1/2] fs: reorganize path_openat() Stas Sergeev 2024-04-24 10:52 ` [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag Stas Sergeev @ 2024-04-24 16:09 ` Christian Brauner 2024-04-24 17:50 ` stsp 2 siblings, 1 reply; 18+ messages in thread From: Christian Brauner @ 2024-04-24 16:09 UTC (permalink / raw) To: Stas Sergeev Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche On Wed, Apr 24, 2024 at 01:52:46PM +0300, Stas Sergeev wrote: > This patch-set implements the OA2_INHERIT_CRED flag for openat2() syscall. > It is needed to perform an open operation with the creds that were in > effect when the dir_fd was opened. This allows the process to pre-open > some dirs and switch eUID (and other UIDs/GIDs) to the less-privileged > user, while still retaining the possibility to open/create files within > the pre-opened directory set. > > The sand-boxing is security-oriented: symlinks leading outside of a > sand-box are rejected. /proc magic links are rejected. > The more detailed description (including security considerations) > is available in the log messages of individual patches. > > Changes in v4: > - add optimizations suggested by David Laight <David.Laight@ACULAB.COM> > - move security checks to build_open_flags() > - force RESOLVE_NO_MAGICLINKS as suggested by Andy Lutomirski <luto@kernel.org> > > Changes in v3: > - partially revert v2 changes to avoid overriding capabilities. > Only the bare minimum is overridden: fsuid, fsgid and group_info. > Document the fact the full cred override is unwanted, as it may > represent an unneeded security risk. > > Changes in v2: > - capture full struct cred instead of just fsuid/fsgid. > Suggested by Stefan Metzmacher <metze@samba.org> This smells ripe enough to serve as an attack vector in non-obvious ways. And in general this has the potential to confuse the hell out unsuspecting userspace. They can now suddenly get sent such special-sauce files such as this that they have no way of recognizing as there's neither an FMODE_* flag nor is the OA2_* flag recorded so it's not available in F_GETFL. There's not even a way to restrict that new flag because no LSM ever sees it. So that behavior might break LSM assumptions as well. And it is effectively usable to steal credentials. If process A opens a directory with uid/gid 0 then sends that directory fd via AF_UNIX or something to process B then process B can inherit the uid/gid of process A by specifying OA2_* with no way for process A to prevent this - not even through an LSM. The permission checking model that we have right now is already baroque. I see zero reason to add more complexity for the sake of "lightweight sandboxing". We have LSMs and namespaces for stuff like this. NAK. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() 2024-04-24 16:09 ` [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() Christian Brauner @ 2024-04-24 17:50 ` stsp 2024-04-25 9:54 ` Christian Brauner 0 siblings, 1 reply; 18+ messages in thread From: stsp @ 2024-04-24 17:50 UTC (permalink / raw) To: Christian Brauner Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche 24.04.2024 19:09, Christian Brauner пишет: > This smells ripe enough to serve as an attack vector in non-obvious > ways. And in general this has the potential to confuse the hell out > unsuspecting userspace. Unsuspecting user-space will simply not use this flag. What do you mean? > They can now suddenly get sent such > special-sauce files There are no any special files. This flag helps you to open a file on which you currently have no perms to open, but had those in the past. > such as this that they have no way of recognizing as > there's neither an FMODE_* flag nor is the OA2_* flag recorded so it's > not available in F_GETFL. > > There's not even a way to restrict that new flag because no LSM ever > sees it. So that behavior might break LSM assumptions as well. > > And it is effectively usable to steal credentials. If process A opens a > directory with uid/gid 0 then sends that directory fd via AF_UNIX or > something to process B then process B can inherit the uid/gid of process No, it doesn't inherit anything. The inheritance happens only for a duration of an open() call, helping open() to succeed. The creds are reverted when open() completed. The only theoretically possible attack would be to open some file you'd never intended to open. Also note that a very minimal sed of creds is overridden: fsuid, fsgid, groupinfo. > A by specifying OA2_* with no way for process A to prevent this - not > even through an LSM. If process B doesn't use that flag, it inherits nothing, no matter what process A did or passed via a socket. So an unaware process that doesn't use that flag, is completely unaffected. > The permission checking model that we have right now is already baroque. > I see zero reason to add more complexity for the sake of "lightweight > sandboxing". We have LSMs and namespaces for stuff like this. > > NAK. I don't think it is fair to say NAK without actually reading the patch or asking its author for clarifications. Even though you didn't ask, I provided my clarifications above, as I find that a polite action. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() 2024-04-24 17:50 ` stsp @ 2024-04-25 9:54 ` Christian Brauner 2024-04-25 10:12 ` stsp 0 siblings, 1 reply; 18+ messages in thread From: Christian Brauner @ 2024-04-25 9:54 UTC (permalink / raw) To: stsp Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche On Wed, Apr 24, 2024 at 08:50:30PM +0300, stsp wrote: > 24.04.2024 19:09, Christian Brauner пишет: > > This smells ripe enough to serve as an attack vector in non-obvious > > ways. And in general this has the potential to confuse the hell out > > unsuspecting userspace. > > Unsuspecting user-space will simply > not use this flag. What do you mean? > > > > They can now suddenly get sent such > > special-sauce files > > There are no any special files. > This flag helps you to open a file on > which you currently have no perms > to open, but had those in the past. > > > > such as this that they have no way of recognizing as > > there's neither an FMODE_* flag nor is the OA2_* flag recorded so it's > > not available in F_GETFL. > > > > There's not even a way to restrict that new flag because no LSM ever > > sees it. So that behavior might break LSM assumptions as well. > > > > And it is effectively usable to steal credentials. If process A opens a > > directory with uid/gid 0 then sends that directory fd via AF_UNIX or > > something to process B then process B can inherit the uid/gid of process > > No, it doesn't inherit anything. > The inheritance happens only for > a duration of an open() call, helping > open() to succeed. The creds are > reverted when open() completed. > > The only theoretically possible attack > would be to open some file you'd never > intended to open. Also note that a > very minimal sed of creds is overridden: > fsuid, fsgid, groupinfo. > > > A by specifying OA2_* with no way for process A to prevent this - not > > even through an LSM. > > If process B doesn't use that flag, it > inherits nothing, no matter what process > A did or passed via a socket. > So an unaware process that doesn't > use that flag, is completely unaffected. The point is that the original opener has no way to prevent his creds being abused by a completely unrelated process later on. Something I've clearly explained in my mail. > > > The permission checking model that we have right now is already baroque. > > I see zero reason to add more complexity for the sake of "lightweight > > sandboxing". We have LSMs and namespaces for stuff like this. > > > > NAK. > > I don't think it is fair to say NAK > without actually reading the patch > or asking its author for clarifications. > Even though you didn't ask, I provided > my clarifications above, as I find that > a polite action. I'm not sure what you don't understand or why you need further clarification. Your patch allows any opener using your new flag to steal the uid/gid/whatever from the original opener. It was even worse in the first version where the whole struct cred of the original opener was used. It's obviously a glaring security hole that's opened up by this. Let alone that the justification "It's useful for some lightweight sandboxing" is absolutely not sufficient to justify substantial shifts in the permission model. The NAK stands. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() 2024-04-25 9:54 ` Christian Brauner @ 2024-04-25 10:12 ` stsp 2024-04-25 12:08 ` Christian Brauner 0 siblings, 1 reply; 18+ messages in thread From: stsp @ 2024-04-25 10:12 UTC (permalink / raw) To: Christian Brauner Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche 25.04.2024 12:54, Christian Brauner пишет: > I'm not sure what you don't understand or why you need further > clarification. Your patch allows any opener using your new flag to steal > the uid/gid/whatever from the original opener. No, absolutely impossible (see below). > It was even worse in the > first version where the whole struct cred of the original opener was > used. It's obviously a glaring security hole that's opened up by this. Well, it was the second version actually (first one only had fsuid/fsgid), but no, its the same thing either way. The creds are overridden for a diration of an openat2() syscall. It doesn't matter what uid/gid are there, because they are not used during openat2(), and are reverted back at the end. The only reason I decided to get back to fsuid/fsgid, were the capabilities, which I don't want to be raised during openat2(). > Let alone that the justification "It's useful for some lightweight > sandboxing" is absolutely not sufficient to justify substantial shifts > in the permission model. > > The NAK stands. But I am sure you still don't understand what exactly the patch does, so why not to ask instead of asserting? You say uid/gid can be stolen, but no, it can't: the creds are reverted. Only fsuid/fsgid (and caps in v2 of the patch) actually affect openat2(), but nothing is "leaked" after openat2() finished. That said, Viro already pointed to the actual problem, and the patch-testing bot did the same. So I have a valid complains already, and you don't have to add the invalid ones to them. :) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() 2024-04-25 10:12 ` stsp @ 2024-04-25 12:08 ` Christian Brauner 2024-04-25 12:39 ` stsp 0 siblings, 1 reply; 18+ messages in thread From: Christian Brauner @ 2024-04-25 12:08 UTC (permalink / raw) To: stsp Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche > But I am sure you still don't understand > what exactly the patch does, so why not > to ask instead of asserting? > You say uid/gid can be stolen, but no, > it can't: the creds are reverted. Only > fsuid/fsgid (and caps in v2 of the patch) > actually affect openat2(), but nothing is > "leaked" after openat2() finished. I say "stolen" because the original opener has no say in this. You're taking their fsuid/fsgid and groups and overriding creds for the duration of the lookup and open. Something the original opener never consented to. But let's call it "borrowed" if you're hung up on terminology here. But ultimately it's the same complaint: the original opener has no way of controlling this behavior. Once ignored in my first reply, and now again conveniently cut off again. Let alone all the other objections. And fwiw, the same way you somehow feel like I haven't read your patch it seems you to consistently underestimate the implications of this change. Which is really strange because it's pretty obvious. It's effectively temporary setuid/setgid with some light limitations. Leaking directory descriptors across security boundaries becomes a lot more interesting with this patch applied. Something which has happened multiple times already and heavily figures in container escapes. And the RESOLVE_BENEATH/IN_ROOT restrictions only stave off symlink (and magic link) attacks. If a directory fd is leaked a container can take the fsuid/fsgid/groups from the original opener of that directory and write to disk with whatever it resolves to in that namespace's idmapping. It's basically a nice way to puzzle together arbitrary fsuid/fsgid and idmapping pairs in whatever namespace the opener happens to be in. And again it also messes with LSMs because you're changing credentials behind their back. And to the "unsuspecting userspace" point you dismissed earlier. Providing a dirfd to a lesser privileged process isn't horrendously dangerous right now. But with this change it sure is. For stuff like libpathrs or systemd's fdstore this change has immediate security implications. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() 2024-04-25 12:08 ` Christian Brauner @ 2024-04-25 12:39 ` stsp 0 siblings, 0 replies; 18+ messages in thread From: stsp @ 2024-04-25 12:39 UTC (permalink / raw) To: Christian Brauner Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, David Laight, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche 25.04.2024 15:08, Christian Brauner пишет: >> But I am sure you still don't understand >> what exactly the patch does, so why not >> to ask instead of asserting? >> You say uid/gid can be stolen, but no, >> it can't: the creds are reverted. Only >> fsuid/fsgid (and caps in v2 of the patch) >> actually affect openat2(), but nothing is >> "leaked" after openat2() finished. > I say "stolen" because the original opener has no say in this. The initial idea was to keep this all within a single-process boundary. It wasn't coded that way though. :( > You're > taking their fsuid/fsgid and groups and overriding creds for the > duration of the lookup and open. Something the original opener never > consented to. But let's call it "borrowed" if you're hung up on > terminology here. Not a terminology: you were explicitly talking about uid/gid, blaming a v2 of my patch. But v2 was not any more harmful than others and uid/gid cannot be leaked even there. But I don't mind: if now you realize v2 is not a leak for uid/gid, then we are on the same track. > But ultimately it's the same complaint: the original opener has no way > of controlling this behavior. It wasn't clear if the opener should control that behaviour, or maybe instead that all should be limited within a single process? Andy Lutomirski's explanation made it clear that even if the openers are the same, the control is still needed. So now this argument is undeniable. > Once ignored in my first reply, and now > again conveniently cut off again. Let alone all the other objections. Sorry but your complains were about stealing uid/gid in v2, which were clearly wrong. But this is a matter of the past. > And fwiw, the same way you somehow feel like I haven't read your patch > it seems you to consistently underestimate the implications of this > change. Which is really strange because it's pretty obvious. It's > effectively temporary setuid/setgid with some light limitations. Temporary cred override is quite common within the current code AFAICS when I grep it for override_creds() call. > Leaking directory descriptors across security boundaries becomes a lot > more interesting with this patch applied. Something which has happened > multiple times already and heavily figures in container escapes. And the > RESOLVE_BENEATH/IN_ROOT restrictions only stave off symlink (and magic > link) attacks. If a directory fd is leaked a container can take the > fsuid/fsgid/groups from the original opener of that directory and write > to disk with whatever it resolves to in that namespace's idmapping. It's > basically a nice way to puzzle together arbitrary fsuid/fsgid and > idmapping pairs in whatever namespace the opener happens to be in. Yes, so the opt-in flag is undeniably needed. > And to the "unsuspecting userspace" point you dismissed earlier. > Providing a dirfd to a lesser privileged process isn't horrendously > dangerous right now. But with this change it sure is. For stuff like > libpathrs or systemd's fdstore this change has immediate security > implications. So am I getting your point correctly: - process A uses the opt-in flag (eg O_CAPTURE_CREDS) and passes the fd to "unsuspecting userspace" B. - process B is not going to use O_INHERIT_CREDS, but it now still can't drop his privs fully. Is this what you mean? ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v3 0/2] implement OA2_INHERIT_CRED flag for openat2() @ 2024-04-23 22:46 Stas Sergeev 2024-04-23 22:46 ` [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag Stas Sergeev 0 siblings, 1 reply; 18+ messages in thread From: Stas Sergeev @ 2024-04-23 22:46 UTC (permalink / raw) To: linux-kernel Cc: Stas Sergeev, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche This patch-set implements the OA2_INHERIT_CRED flag for openat2() syscall. It is needed to perform an open operation with the creds that were in effect when the dir_fd was opened. This allows the process to pre-open some dirs and switch eUID (and other UIDs/GIDs) to the less-privileged user, while still retaining the possibility to open/create files within the pre-opened directory set. The more detailed description (including security considerations) is available in the log messages of individual patches. Changes in v3: - partially revert v2 changes to avoid overriding capabilities. Only the bare minimum is overridden: fsuid, fsgid and group_info. Document the fact the full cred override is unwanted, as it may represent an unneeded security risk. Changes in v2: - capture full struct cred instead of just fsuid/fsgid. Suggested by Stefan Metzmacher <metze@samba.org> CC: Stefan Metzmacher <metze@samba.org> CC: Eric Biederman <ebiederm@xmission.com> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Andy Lutomirski <luto@kernel.org> CC: Christian Brauner <brauner@kernel.org> CC: Jan Kara <jack@suse.cz> CC: Jeff Layton <jlayton@kernel.org> CC: Chuck Lever <chuck.lever@oracle.com> CC: Alexander Aring <alex.aring@gmail.com> CC: linux-fsdevel@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: linux-api@vger.kernel.org CC: Paolo Bonzini <pbonzini@redhat.com> CC: Christian Göttsche <cgzones@googlemail.com> -- 2.44.0 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag 2024-04-23 22:46 [PATCH v3 " Stas Sergeev @ 2024-04-23 22:46 ` Stas Sergeev 0 siblings, 0 replies; 18+ messages in thread From: Stas Sergeev @ 2024-04-23 22:46 UTC (permalink / raw) To: linux-kernel Cc: Stas Sergeev, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche This flag performs the open operation with the fs credentials (fsuid, fsgid, group_info) that were in effect when dir_fd was opened. This allows the process to pre-open some directories and then change eUID (and all other UIDs/GIDs) to a less-privileged user, retaining the ability to open/create files within these directories. Design goal: The idea is to provide a very light-weight sandboxing, where the process, without the use of any heavy-weight techniques like chroot within namespaces, can restrict the access to the set of pre-opened directories. This patch is just a first step to such sandboxing. If things go well, in the future the same extension can be added to more syscalls. These should include at least unlinkat(), renameat2() and the not-yet-upstreamed setxattrat(). Security considerations: - Only the bare minimal set of credentials is overridden: fsuid, fsgid and group_info. The rest, for example capabilities, are not overridden to avoid unneeded security risks. - To avoid sandboxing escape, this patch makes sure the restricted lookup modes are used. Namely, RESOLVE_BENEATH or RESOLVE_IN_ROOT. - To avoid leaking creds across exec, this patch requires O_CLOEXEC flag on a directory. Use cases: Virtual machines that deal with untrusted code, can use that instead of a more heavy-weighted approaches. Currently the approach is being tested on a dosemu2 VM. Signed-off-by: Stas Sergeev <stsp2@yandex.ru> CC: Stefan Metzmacher <metze@samba.org> CC: Eric Biederman <ebiederm@xmission.com> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Andy Lutomirski <luto@kernel.org> CC: Christian Brauner <brauner@kernel.org> CC: Jan Kara <jack@suse.cz> CC: Jeff Layton <jlayton@kernel.org> CC: Chuck Lever <chuck.lever@oracle.com> CC: Alexander Aring <alex.aring@gmail.com> CC: linux-fsdevel@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: Paolo Bonzini <pbonzini@redhat.com> CC: Christian Göttsche <cgzones@googlemail.com> --- fs/internal.h | 2 +- fs/namei.c | 61 ++++++++++++++++++++++++++++++++++-- fs/open.c | 2 +- include/linux/fcntl.h | 2 ++ include/uapi/linux/openat2.h | 3 ++ 5 files changed, 66 insertions(+), 4 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index 7ca738904e34..692b53b19aad 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -169,7 +169,7 @@ static inline void sb_end_ro_state_change(struct super_block *sb) * open.c */ struct open_flags { - int open_flag; + u64 open_flag; umode_t mode; int acc_mode; int intent; diff --git a/fs/namei.c b/fs/namei.c index 2fde2c320ae9..f34ad2b296c7 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -586,6 +586,9 @@ struct nameidata { int dfd; vfsuid_t dir_vfsuid; umode_t dir_mode; + kuid_t dir_open_fsuid; + kgid_t dir_open_fsgid; + struct group_info *dir_open_groups; } __randomize_layout; #define ND_ROOT_PRESET 1 @@ -695,6 +698,8 @@ static void terminate_walk(struct nameidata *nd) nd->depth = 0; nd->path.mnt = NULL; nd->path.dentry = NULL; + if (nd->dir_open_groups) + put_group_info(nd->dir_open_groups); } /* path_put is needed afterwards regardless of success or failure */ @@ -2414,6 +2419,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) get_fs_pwd(current->fs, &nd->path); nd->inode = nd->path.dentry->d_inode; } + nd->dir_open_fsuid = current_cred()->fsuid; + nd->dir_open_fsgid = current_cred()->fsgid; + nd->dir_open_groups = get_current_groups(); } else { /* Caller must check execute permissions on the starting path component */ struct fd f = fdget_raw(nd->dfd); @@ -2437,6 +2445,10 @@ static const char *path_init(struct nameidata *nd, unsigned flags) path_get(&nd->path); nd->inode = nd->path.dentry->d_inode; } + nd->dir_open_fsuid = f.file->f_cred->fsuid; + nd->dir_open_fsgid = f.file->f_cred->fsgid; + nd->dir_open_groups = get_group_info( + f.file->f_cred->group_info); fdput(f); } @@ -3776,6 +3788,29 @@ static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file) return error; } +static const struct cred *openat2_override_creds(struct nameidata *nd) +{ + const struct cred *old_cred; + struct cred *override_cred; + + override_cred = prepare_creds(); + if (!override_cred) + return NULL; + + override_cred->fsuid = nd->dir_open_fsuid; + override_cred->fsgid = nd->dir_open_fsgid; + override_cred->group_info = nd->dir_open_groups; + + override_cred->non_rcu = 1; + + old_cred = override_creds(override_cred); + + /* override_cred() gets its own ref */ + put_cred(override_cred); + + return old_cred; +} + static struct file *path_openat(struct nameidata *nd, const struct open_flags *op, unsigned flags) { @@ -3794,8 +3829,28 @@ static struct file *path_openat(struct nameidata *nd, error = do_o_path(nd, flags, file); } else { const char *s = path_init(nd, flags); - file = alloc_empty_file(op->open_flag, current_cred()); - error = PTR_ERR_OR_ZERO(file); + const struct cred *old_cred = NULL; + + error = 0; + if (op->open_flag & OA2_INHERIT_CRED) { + /* Make sure to work only with restricted + * look-up modes. + */ + if (!(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) + error = -EPERM; + /* Only work with O_CLOEXEC dirs. */ + if (!get_close_on_exec(nd->dfd)) + error = -EPERM; + + if (!error) + old_cred = openat2_override_creds(nd); + } + if (!error) { + file = alloc_empty_file(op->open_flag, current_cred()); + error = PTR_ERR_OR_ZERO(file); + } else { + file = ERR_PTR(error); + } if (!error) { while (!(error = link_path_walk(s, nd)) && (s = open_last_lookups(nd, file, op)) != NULL) @@ -3803,6 +3858,8 @@ static struct file *path_openat(struct nameidata *nd, } if (!error) error = do_open(nd, file, op); + if (old_cred) + revert_creds(old_cred); terminate_walk(nd); if (IS_ERR(file)) return file; diff --git a/fs/open.c b/fs/open.c index ee8460c83c77..6be013182a35 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1225,7 +1225,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) * values before calling build_open_flags(), but openat2(2) checks all * of its arguments. */ - if (flags & ~VALID_OPEN_FLAGS) + if (flags & ~VALID_OPENAT2_FLAGS) return -EINVAL; if (how->resolve & ~VALID_RESOLVE_FLAGS) return -EINVAL; diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index a332e79b3207..b71f8b162102 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -12,6 +12,8 @@ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE) +#define VALID_OPENAT2_FLAGS (VALID_OPEN_FLAGS | OA2_INHERIT_CRED) + /* List of all valid flags for the how->resolve argument: */ #define VALID_RESOLVE_FLAGS \ (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \ diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h index a5feb7604948..cdd676a10b62 100644 --- a/include/uapi/linux/openat2.h +++ b/include/uapi/linux/openat2.h @@ -40,4 +40,7 @@ struct open_how { return -EAGAIN if that's not possible. */ +/* openat2-specific flags go to upper 4 bytes. */ +#define OA2_INHERIT_CRED (1ULL << 32) + #endif /* _UAPI_LINUX_OPENAT2_H */ -- 2.44.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v2 0/2] implement OA2_INHERIT_CRED flag for openat2() @ 2024-04-23 11:01 Stas Sergeev 2024-04-23 11:01 ` [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag Stas Sergeev 0 siblings, 1 reply; 18+ messages in thread From: Stas Sergeev @ 2024-04-23 11:01 UTC (permalink / raw) To: linux-kernel Cc: Stas Sergeev, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche This patch-set implements the OA2_INHERIT_CRED flag for openat2() syscall. It is needed to perform an open operation with the creds that were in effect when the dir_fd was opened. This allows the process to pre-open some dirs and switch eUID (and other UIDs/GIDs) to the less-privileged user, while still retaining the possibility to open/create files within the pre-opened directory set. Changes in v2: - capture full struct cred instead of just fsuid/fsgid. Suggested by Stefan Metzmacher <metze@samba.org> CC: Stefan Metzmacher <metze@samba.org> CC: Eric Biederman <ebiederm@xmission.com> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Andy Lutomirski <luto@kernel.org> CC: Christian Brauner <brauner@kernel.org> CC: Jan Kara <jack@suse.cz> CC: Jeff Layton <jlayton@kernel.org> CC: Chuck Lever <chuck.lever@oracle.com> CC: Alexander Aring <alex.aring@gmail.com> CC: linux-fsdevel@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: linux-api@vger.kernel.org CC: Paolo Bonzini <pbonzini@redhat.com> CC: Christian Göttsche <cgzones@googlemail.com> Stas Sergeev (2): fs: reorganize path_openat() openat2: add OA2_INHERIT_CRED flag fs/internal.h | 2 +- fs/namei.c | 52 +++++++++++++++++++++++++++++------- fs/open.c | 2 +- include/linux/fcntl.h | 2 ++ include/uapi/linux/openat2.h | 3 +++ 5 files changed, 50 insertions(+), 11 deletions(-) -- 2.44.0 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag 2024-04-23 11:01 [PATCH v2 0/2] implement OA2_INHERIT_CRED flag for openat2() Stas Sergeev @ 2024-04-23 11:01 ` Stas Sergeev 0 siblings, 0 replies; 18+ messages in thread From: Stas Sergeev @ 2024-04-23 11:01 UTC (permalink / raw) To: linux-kernel Cc: Stas Sergeev, Stefan Metzmacher, Eric Biederman, Alexander Viro, Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever, Alexander Aring, linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche This flag performs the open operation with the credentials that were in effect when dir_fd was opened. This allows the process to pre-open some directories and then change eUID (and all other UIDs/GIDs) to a less-privileged user, retaining the ability to open/create files within these directories. Design goal: The idea is to provide a very light-weight sandboxing, where the process, without the use of any heavy-weight techniques like chroot within namespaces, can restrict the access to the set of pre-opened directories. This patch is just a first step to such sandboxing. If things go well, in the future the same extension can be added to more syscalls. These should include at least unlinkat(), renameat2() and the not-yet-upstreamed setxattrat(). Security considerations: To avoid sandboxing escape, this patch makes sure the restricted lookup modes are used. Namely, RESOLVE_BENEATH or RESOLVE_IN_ROOT. To avoid leaking creds across exec, this patch requires O_CLOEXEC flag on a directory. Use cases: Virtual machines that deal with untrusted code, can use that instead of a more heavy-weighted approaches. Currently the approach is being tested on a dosemu2 VM. Signed-off-by: Stas Sergeev <stsp2@yandex.ru> CC: Stefan Metzmacher <metze@samba.org> CC: Eric Biederman <ebiederm@xmission.com> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Andy Lutomirski <luto@kernel.org> CC: Christian Brauner <brauner@kernel.org> CC: Jan Kara <jack@suse.cz> CC: Jeff Layton <jlayton@kernel.org> CC: Chuck Lever <chuck.lever@oracle.com> CC: Alexander Aring <alex.aring@gmail.com> CC: linux-fsdevel@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: Paolo Bonzini <pbonzini@redhat.com> CC: Christian Göttsche <cgzones@googlemail.com> --- fs/internal.h | 2 +- fs/namei.c | 30 ++++++++++++++++++++++++++++-- fs/open.c | 2 +- include/linux/fcntl.h | 2 ++ include/uapi/linux/openat2.h | 3 +++ 5 files changed, 35 insertions(+), 4 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index 7ca738904e34..692b53b19aad 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -169,7 +169,7 @@ static inline void sb_end_ro_state_change(struct super_block *sb) * open.c */ struct open_flags { - int open_flag; + u64 open_flag; umode_t mode; int acc_mode; int intent; diff --git a/fs/namei.c b/fs/namei.c index 2fde2c320ae9..0e0f2e32ef02 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -586,6 +586,7 @@ struct nameidata { int dfd; vfsuid_t dir_vfsuid; umode_t dir_mode; + const struct cred *dir_open_cred; } __randomize_layout; #define ND_ROOT_PRESET 1 @@ -695,6 +696,7 @@ static void terminate_walk(struct nameidata *nd) nd->depth = 0; nd->path.mnt = NULL; nd->path.dentry = NULL; + put_cred(nd->dir_open_cred); } /* path_put is needed afterwards regardless of success or failure */ @@ -2414,6 +2416,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags) get_fs_pwd(current->fs, &nd->path); nd->inode = nd->path.dentry->d_inode; } + nd->dir_open_cred = get_current_cred(); } else { /* Caller must check execute permissions on the starting path component */ struct fd f = fdget_raw(nd->dfd); @@ -2437,6 +2440,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags) path_get(&nd->path); nd->inode = nd->path.dentry->d_inode; } + nd->dir_open_cred = get_cred(f.file->f_cred); fdput(f); } @@ -3794,8 +3798,28 @@ static struct file *path_openat(struct nameidata *nd, error = do_o_path(nd, flags, file); } else { const char *s = path_init(nd, flags); - file = alloc_empty_file(op->open_flag, current_cred()); - error = PTR_ERR_OR_ZERO(file); + const struct cred *old_cred = NULL; + + error = 0; + if (op->open_flag & OA2_INHERIT_CRED) { + /* Make sure to work only with restricted + * look-up modes. + */ + if (!(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) + error = -EPERM; + /* Only work with O_CLOEXEC dirs. */ + if (!get_close_on_exec(nd->dfd)) + error = -EPERM; + + if (!error) + old_cred = override_creds(nd->dir_open_cred); + } + if (!error) { + file = alloc_empty_file(op->open_flag, current_cred()); + error = PTR_ERR_OR_ZERO(file); + } else { + file = ERR_PTR(error); + } if (!error) { while (!(error = link_path_walk(s, nd)) && (s = open_last_lookups(nd, file, op)) != NULL) @@ -3803,6 +3827,8 @@ static struct file *path_openat(struct nameidata *nd, } if (!error) error = do_open(nd, file, op); + if (old_cred) + revert_creds(old_cred); terminate_walk(nd); if (IS_ERR(file)) return file; diff --git a/fs/open.c b/fs/open.c index ee8460c83c77..6be013182a35 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1225,7 +1225,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) * values before calling build_open_flags(), but openat2(2) checks all * of its arguments. */ - if (flags & ~VALID_OPEN_FLAGS) + if (flags & ~VALID_OPENAT2_FLAGS) return -EINVAL; if (how->resolve & ~VALID_RESOLVE_FLAGS) return -EINVAL; diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index a332e79b3207..b71f8b162102 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -12,6 +12,8 @@ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE) +#define VALID_OPENAT2_FLAGS (VALID_OPEN_FLAGS | OA2_INHERIT_CRED) + /* List of all valid flags for the how->resolve argument: */ #define VALID_RESOLVE_FLAGS \ (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \ diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h index a5feb7604948..cdd676a10b62 100644 --- a/include/uapi/linux/openat2.h +++ b/include/uapi/linux/openat2.h @@ -40,4 +40,7 @@ struct open_how { return -EAGAIN if that's not possible. */ +/* openat2-specific flags go to upper 4 bytes. */ +#define OA2_INHERIT_CRED (1ULL << 32) + #endif /* _UAPI_LINUX_OPENAT2_H */ -- 2.44.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
end of thread, other threads:[~2024-04-26 13:36 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-04-24 10:52 [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() Stas Sergeev 2024-04-24 10:52 ` [PATCH 1/2] fs: reorganize path_openat() Stas Sergeev 2024-04-25 8:13 ` kernel test robot 2024-04-24 10:52 ` [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag Stas Sergeev 2024-04-25 2:31 ` Al Viro 2024-04-25 7:24 ` stsp 2024-04-25 9:23 ` stsp 2024-04-25 13:50 ` kernel test robot 2024-04-25 14:02 ` Christian Brauner 2024-04-26 13:36 ` stsp 2024-04-24 16:09 ` [PATCH v4 0/2] implement OA2_INHERIT_CRED flag for openat2() Christian Brauner 2024-04-24 17:50 ` stsp 2024-04-25 9:54 ` Christian Brauner 2024-04-25 10:12 ` stsp 2024-04-25 12:08 ` Christian Brauner 2024-04-25 12:39 ` stsp -- strict thread matches above, loose matches on Subject: below -- 2024-04-23 22:46 [PATCH v3 " Stas Sergeev 2024-04-23 22:46 ` [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag Stas Sergeev 2024-04-23 11:01 [PATCH v2 0/2] implement OA2_INHERIT_CRED flag for openat2() Stas Sergeev 2024-04-23 11:01 ` [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag Stas Sergeev
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).