* [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
@ 2019-12-30 5:20 Aleksa Sarai
  2019-12-30 5:20 ` [PATCH RFC 1/1] " Aleksa Sarai
  2019-12-30 5:44 ` [PATCH RFC 0/1] " Al Viro
  0 siblings, 2 replies; 92+ messages in thread
From: Aleksa Sarai @ 2019-12-30 5:20 UTC
To: Al Viro, David Howells, Eric Biederman, Linus Torvalds
Cc: Aleksa Sarai, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel

An undocumented feature of the mount interface was that it was possible
to mount over a symlink (even with the old mount API) by mounting over
/proc/self/fd/$n -- where the corresponding file descriptor was opened
with (O_PATH|O_NOFOLLOW). This didn't work with traditional "new" mounts
(for a variety of reasons), but MS_BIND worked without issue. With the
new mount API it was even easier.

A reasonably detailed explanation of the issues is provided in the patch
itself, but the full traces produced by both the oopses and deadlocks
are included below (it makes little sense to include them in the commit
since we are disabling this feature, not directly fixing the bugs
themselves).

I've posted this as an RFC on whether this feature should be allowed at
all (and if anyone knows of legitimate uses for it), or if we should
work on fixing these other kernel bugs that it exposes.

Oops on NULL dereference:

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 8000000181b1f067 P4D 8000000181b1f067 PUD 24829c067 PMD 0
Oops: 0010 [#1] SMP PTI
CPU: 6 PID: 20796 Comm: mount_to_symlin Tainted: G OE 5.5.0-rc1+openat2~v18+ #123
Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffbc7d87e1bcb0 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffffa0c28cb633c0 RCX: 000000000000ae5a
RDX: 0000000000000089 RSI: ffffa0c0eece8840 RDI: ffffa0c0eb8843b0
RBP: ffffa0c0eb8843b0 R08: ffffdc7d7fbbb770 R09: ffffa0c0ca333000
R10: 0000000000000000 R11: 808080807fffffff R12: ffffa0c0eece8840
R13: 0000000000000089 R14: ffffbc7d87e1bdb0 R15: 0000000000000080
FS: 00007fd921508540(0000) GS:ffffa0c3cf580000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 000000018878a003 CR4: 00000000003606e0
Call Trace:
 __lookup_slow+0x94/0x160
 lookup_slow+0x36/0x50
 path_mountpoint+0x1be/0x350
 filename_mountpoint+0xa5/0x150
 ? __lookup_hash+0xa0/0xa0
 ksys_umount+0x78/0x490
 __x64_sys_umount+0x12/0x20
 do_syscall_64+0x64/0x240
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fd92143f4e7
Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe98c89cc8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd92143f4e7
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000000167a330
RBP: 00007ffe98c89da0 R08: 0000000000000000 R09: 000000000000000f
R10: 00000000004004c6 R11: 0000000000000202 R12: 00000000004010c0
R13: 00007ffe98c89e80 R14: 0000000000000000 R15: 0000000000000000
CR2: 0000000000000000

Oops on kernel address:

BUG: unable to handle page fault for address: ffffbc7d87e1bcc0
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 107d4a067 P4D 107d4a067 PUD 107d4b067 PMD 46d753067 PTE 0
Oops: 0002 [#2] SMP PTI
CPU: 4 PID: 20975 Comm: mount_to_symlin Tainted: G D OE 5.5.0-rc1+openat2~v18+ #123
Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
RIP: 0010:_raw_spin_lock_irqsave+0x28/0x50
Code: 00 00 0f 1f 44 00 00 41 54 53 48 89 fb 9c 58 0f 1f 44 00 00 49 89 c4 fa 66 0f 1f 44 00 00 e8 3f 55 82 ff 31 c0 ba 01 00 00 00 <f0> 0f b1 13 75 07 4c 89 e0 5b 41 5c c3 89 c6 48 89 df e8 01 52 77
RSP: 0018:ffffbc7d90067bd8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffbc7d87e1bcc0 RCX: 0000000200000000
RDX: 0000000000000001 RSI: ffffbc7d90067c50 RDI: ffffbc7d87e1bcc0
RBP: ffffbc7d87e1bcc0 R08: 0000000000000001 R09: 0000000000000003
R10: 0000000000000000 R11: 808080807fffffff R12: 0000000000000246
R13: ffffa0c28cb633c0 R14: ffffbc7d90067db0 R15: ffffa0c0eece8898
FS: 00007f4b80214540(0000) GS:ffffa0c3cf500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffbc7d87e1bcc0 CR3: 000000026d4d0002 CR4: 00000000003606e0
Call Trace:
 add_wait_queue+0x15/0x40
 d_alloc_parallel+0x36d/0x480
 ? get_acl+0x1a/0x160
 ? wake_up_q+0xa0/0xa0
 __lookup_slow+0x6b/0x160
 lookup_slow+0x36/0x50
 path_mountpoint+0x1be/0x350
 filename_mountpoint+0xa5/0x150
 ? __lookup_hash+0xa0/0xa0
 ksys_umount+0x78/0x490
 __x64_sys_umount+0x12/0x20
 do_syscall_64+0x64/0x240
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4b8014b4e7
Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffee8041b28 EFLAGS: 00000206 ORIG_RAX: 00000000000000a6
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4b8014b4e7
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000019c8330
RBP: 00007ffee8041c00 R08: 0000000000000000 R09: 000000000000000f
R10: 00000000004004c6 R11: 0000000000000206 R12: 00000000004010c0
R13: 00007ffee8041ce0 R14: 0000000000000000 R15: 0000000000000000
CR2: ffffbc7d87e1bcc0

Apparent deadlock in d_alloc_parallel:

watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mount_to_symlin:21285]
CPU: 0 PID: 21285 Comm: mount_to_symlin Tainted: G D OE 5.5.0-rc1+openat2~v18+ #123
Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
RSP: 0018:ffffbc7d90547be8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000101 RBX: ffffffffbac7ac60 RCX: 0000000000000018
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa0c0eece8898
RBP: ffffa0c0eece8898 R08: 00000000006f6f66 R09: 0000000000000003
R10: 0000000000000000 R11: 808080807fffffff R12: 00000000e25b3c73
R13: ffffa0c28cb633c0 R14: ffffbc7d90547db0 R15: ffffa0c0eece8898
FS: 00007fbb1fd30540(0000) GS:ffffa0c3cf400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbb1fbd25a0 CR3: 0000000181ace005 CR4: 00000000003606f0
Call Trace:
 _raw_spin_lock+0x1a/0x20
 lockref_get_not_dead+0x4f/0x90
 d_alloc_parallel+0x1a8/0x480
 ? get_acl+0x1a/0x160
 __lookup_slow+0x6b/0x160
 lookup_slow+0x36/0x50
 path_mountpoint+0x1be/0x350
 filename_mountpoint+0xa5/0x150
 ? __lookup_hash+0xa0/0xa0
 ksys_umount+0x78/0x490
 __x64_sys_umount+0x12/0x20
 do_syscall_64+0x64/0x240
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fbb1fc674e7
Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd75fcb858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbb1fc674e7
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000f6c330
RBP: 00007ffd75fcb930 R08: 0000000000000000 R09: 000000000000000f
R10: 00000000004004a6 R11: 0000000000000202 R12: 00000000004010b0
R13: 00007ffd75fcba10 R14: 0000000000000000 R15: 0000000000000000

RCU stall when trying to grab /proc/$pid/stack for the stuck process:

rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 0-....: (15000 ticks this GP) idle=2c6/1/0x4000000000000002 softirq=1172554/1172554 fqs=6849
(t=15001 jiffies g=1935177 q=25734)
NMI backtrace for cpu 0
CPU: 0 PID: 21285 Comm: mount_to_symlin Tainted: G D OEL 5.5.0-rc1+openat2~v18+ #123
Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
Call Trace:
 <IRQ>
 dump_stack+0x8f/0xd0
 ? lapic_can_unplug_cpu.cold+0x3e/0x3e
 nmi_cpu_backtrace.cold+0x14/0x52
 nmi_trigger_cpumask_backtrace+0xf6/0xf8
 rcu_dump_cpu_stacks+0x8f/0xbd
 rcu_sched_clock_irq.cold+0x1b2/0x39f
 update_process_times+0x24/0x50
 tick_sched_handle+0x22/0x60
 tick_sched_timer+0x38/0x80
 ? tick_sched_do_timer+0x60/0x60
 __hrtimer_run_queues+0xf6/0x270
 hrtimer_interrupt+0x10e/0x240
 smp_apic_timer_interrupt+0x6c/0x130
 apic_timer_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
RSP: 0018:ffffbc7d90547be8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000101 RBX: ffffffffbac7ac60 RCX: 0000000000000018
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa0c0eece8898
RBP: ffffa0c0eece8898 R08: 00000000006f6f66 R09: 0000000000000003
R10: 0000000000000000 R11: 808080807fffffff R12: 00000000e25b3c73
R13: ffffa0c28cb633c0 R14: ffffbc7d90547db0 R15: ffffa0c0eece8898
 _raw_spin_lock+0x1a/0x20
 lockref_get_not_dead+0x4f/0x90
 d_alloc_parallel+0x1a8/0x480
 ? get_acl+0x1a/0x160
 __lookup_slow+0x6b/0x160
 lookup_slow+0x36/0x50
 path_mountpoint+0x1be/0x350
 filename_mountpoint+0xa5/0x150
 ? __lookup_hash+0xa0/0xa0
 ksys_umount+0x78/0x490
 __x64_sys_umount+0x12/0x20
 do_syscall_64+0x64/0x240
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fbb1fc674e7
Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd75fcb858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbb1fc674e7
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000f6c330
RBP: 00007ffd75fcb930 R08: 0000000000000000 R09: 000000000000000f
R10: 00000000004004a6 R11: 0000000000000202 R12: 00000000004010b0
R13: 00007ffd75fcba10 R14: 0000000000000000 R15: 0000000000000000

Deadlock on lock_mount after a successful umount().
The watchdog does trigger, but the only record of this stall I could
find in my logs comes from trying to suspend the system:

Freezing of tasks failed after 20.010 seconds (2 tasks refusing to freeze, wq_busy=0):
mount_to_symlin D 0 5850 5849 0x00000004
Call Trace:
 ? __schedule+0x2dd/0x770
 schedule+0x4a/0xb0
 rwsem_down_write_slowpath+0x256/0x500
 lock_mount+0x22/0xf0
 do_mount+0x4b7/0x9f0
 ksys_mount+0x7e/0xc0
 __x64_sys_mount+0x21/0x30
 do_syscall_64+0x64/0x240
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f86e6355fda
Code: Bad RIP value.
RSP: 002b:00007ffc36f952d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f86e6355fda
RDX: 0000000000402099 RSI: 00000000019a5310 RDI: 00007ffc36f96ee1
RBP: 00007ffc36f953b0 R08: 0000000000402099 R09: 000000000000000f
R10: 0000000000001000 R11: 0000000000000206 R12: 00000000004010c0
R13: 00007ffc36f95490 R14: 0000000000000000 R15: 0000000000000000

Cc: stable@vger.kernel.org # pre-git
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Aleksa Sarai (1):
  mount: universally disallow mounting over symlinks

 fs/namespace.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

base-commit: fd6988496e79a6a4bdb514a4655d2920209eb85d
--
2.24.1
* [PATCH RFC 1/1] mount: universally disallow mounting over symlinks
  2019-12-30 5:20 [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Aleksa Sarai
@ 2019-12-30 5:20 ` Aleksa Sarai
  2019-12-30 7:34 ` Linus Torvalds
  2019-12-30 5:44 ` [PATCH RFC 0/1] " Al Viro
  1 sibling, 1 reply; 92+ messages in thread
From: Aleksa Sarai @ 2019-12-30 5:20 UTC
To: Al Viro, David Howells, Eric Biederman, Linus Torvalds
Cc: Aleksa Sarai, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel

An undocumented feature of the mount interface was that it was possible
to mount over a symlink (even with the old mount API) by mounting over
/proc/self/fd/$n -- where the corresponding file descriptor was opened
with (O_PATH|O_NOFOLLOW). This didn't work with traditional "new" mounts
(for a variety of reasons), but MS_BIND worked without issue. With the
new mount API it was even easier.

From userspace's perspective, this capability is only really useful as
an attack vector. Until the introduction of openat2(RESOLVE_NO_XDEV),
there was no trivial way to detect if a bind-mount was present. In the
container runtime context (in a similar vein to CVE-2019-19921), this
could result in a privileged process being unable to detect that a
configuration resulted in magic-link usage operating on the wrong
magic-links. Additionally, the API to use this feature was incredibly
strange -- in order to umount, you would have to go through
/proc/self/fd/$n again (umounting the path would result in the
*underlying* symlink being followed).

Which brings us to the issues on the kernel side. When umounting a mount
on top of a symlink, several oopses (both NULL and garbage kernel
address dereferences) and deadlocks could be triggered incredibly
trivially. Because this works in user namespaces, an unprivileged user
could trigger these oopses just as easily.

While these bugs could be fixed separately, it seems much cleaner to
disable a "feature" which clearly was not intentional (and is not used
-- otherwise we would've seen bug reports about it breaking on umount).
Note that because the util-linux mount(1) helper will expand paths
containing symlinks in user-space, only users which used the mount(2)
syscall directly could possibly have seen this behaviour.

Cc: stable@vger.kernel.org # pre-git
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namespace.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index be601d3a8008..01a62bce105f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2172,8 +2172,12 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
 	if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER)
 		return -EINVAL;

+	if (d_is_symlink(mp->m_dentry) ||
+	    d_is_symlink(mnt->mnt.mnt_root))
+		return -EINVAL;
+
 	if (d_is_dir(mp->m_dentry) !=
-	      d_is_dir(mnt->mnt.mnt_root))
+	    d_is_dir(mnt->mnt.mnt_root))
 		return -ENOTDIR;

 	return attach_recursive_mnt(mnt, p, mp, false);
@@ -2251,6 +2255,9 @@ static struct mount *__do_loopback(struct path *old_path, int recurse)
 	if (IS_MNT_UNBINDABLE(old))
 		return mnt;

+	if (d_is_symlink(old_path->dentry))
+		return mnt;
+
 	if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
 		return mnt;

@@ -2635,6 +2642,10 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	if (old_path->dentry != old_path->mnt->mnt_root)
 		goto out;

+	if (d_is_symlink(new_path->dentry) ||
+	    d_is_symlink(old_path->dentry))
+		goto out;
+
 	if (d_is_dir(new_path->dentry) !=
 	    d_is_dir(old_path->dentry))
 		goto out;
@@ -2726,10 +2737,6 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
 	    path->mnt->mnt_root == path->dentry)
 		goto unlock;

-	err = -EINVAL;
-	if (d_is_symlink(newmnt->mnt.mnt_root))
-		goto unlock;
-
 	newmnt->mnt.mnt_flags = mnt_flags;
 	err = graft_tree(newmnt, parent, mp);

--
2.24.1
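As a rough userspace counterpart to the d_is_symlink() checks in the
patch, a caller can test whether an O_PATH descriptor refers to a
symlink before passing /proc/self/fd/$n to mount(2). The following is a
minimal sketch, assuming Linux 3.6 or later (where fstat(2) works on
O_PATH descriptors). The helper names open_path_nofollow() and
fd_is_symlink() are hypothetical and are not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open @path without following a trailing symlink.
 * With O_PATH | O_NOFOLLOW the returned fd refers to the symlink itself
 * rather than to its target.
 */
int open_path_nofollow(const char *path)
{
	assert(path != NULL);
	return open(path, O_PATH | O_NOFOLLOW | O_CLOEXEC);
}

/*
 * Hypothetical helper: returns 1 if @fd refers to a symlink, 0 if it
 * refers to anything else, and -1 if fstat(2) fails.
 */
int fd_is_symlink(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	return S_ISLNK(st.st_mode) ? 1 : 0;
}
```

A caller that wants to refuse symlink mount targets in userspace could
bail out whenever fd_is_symlink() returns anything other than 0.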
* Re: [PATCH RFC 1/1] mount: universally disallow mounting over symlinks
  2019-12-30 5:20 ` [PATCH RFC 1/1] " Aleksa Sarai
@ 2019-12-30 7:34 ` Linus Torvalds
  2019-12-30 8:28 ` Aleksa Sarai
  0 siblings, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2019-12-30 7:34 UTC
To: Aleksa Sarai
Cc: Al Viro, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List

On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> +       if (d_is_symlink(mp->m_dentry) ||
> +           d_is_symlink(mnt->mnt.mnt_root))
> +               return -EINVAL;

So I don't hate this kind of check in general - overmounting a symlink
sounds odd, but at the same time I get the feeling that the real issue
is that something went wrong earlier.

Yeah, the mount target kind of _is_ a path, but at the same time, we
most definitely want to have the permission to really open the
directory in question, don't we, and I don't see that we should accept
a O_PATH file descriptor.

I feel like the only valid use of "O_PATH" files is to then use them
as the base for an openat() and friends (ie fchmodat/execveat() etc).

But maybe I'm completely wrong, and people really do want O_PATH
handling exactly for mounting too. It does sound a bit odd. By
definition, mounting wants permissions to the mount-point, so what's
the point of using O_PATH?

So instead of saying "don't overmount symlinks", I would feel like
it's the mount system call that should use a proper file descriptor
that isn't FMODE_PATH.

Is it really the symlink that is the issue? Because if it's the
symlink that is the issue then I feel like O_NOFOLLOW should have
triggered it, but your other email seems to say that you really need
O_PATH | O_NOFOLLOW.

So I'm not saying that this patch is wrong, but it really smells a bit
like it's papering over the more fundamental issue.

For example, is the problem that when you do a proper

  fd = open("somepath", O_PATH);

in one process, and then another thread does

  fd = open("/proc/<pid>/fd/<opathfd>", O_RDWR);

then we get confused and do bad things on that *second* open? Because
now the second open doesn't have O_PATH, and doesn't get marked
FMODE_PATH, but the underlying file descriptor is one of those limited
"is really only useful for openat() and friends".

I dunno. I haven't thought through the whole thing. But the oopses you
quote seem like we're really doing something wrong, and it really does
feel like your patch in no way _fixes_ the wrong thing we're doing,
it's just hiding the symptoms.

               Linus
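The "base for an openat() and friends" use of O_PATH descriptors
mentioned in the message above can be sketched as follows. This is only
an illustration of that pattern, not code from the thread; open_under()
is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open @name relative to the directory @dir.
 * The directory is only opened with O_PATH, so no read permission on
 * it is required and its contents are never read through this fd; the
 * descriptor serves purely as the dirfd argument of openat(2).
 * Returns a new fd, or -1 with errno set.
 */
int open_under(const char *dir, const char *name, int flags)
{
	int dirfd, fd, saved_errno;

	assert(dir != NULL && name != NULL);

	dirfd = open(dir, O_PATH | O_DIRECTORY | O_CLOEXEC);
	if (dirfd < 0)
		return -1;

	fd = openat(dirfd, name, flags | O_CLOEXEC);

	/* close(2) may clobber errno, so preserve openat(2)'s value */
	saved_errno = errno;
	close(dirfd);
	errno = saved_errno;
	return fd;
}
```

For example, open_under("/tmp", "somefile", O_RDONLY) yields an ordinary
readable descriptor, even though the directory descriptor itself could
not be passed to read(2) or getdents64(2).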
* Re: [PATCH RFC 1/1] mount: universally disallow mounting over symlinks
  2019-12-30 7:34 ` Linus Torvalds
@ 2019-12-30 8:28 ` Aleksa Sarai
  2020-01-08 4:39 ` Andy Lutomirski
  0 siblings, 1 reply; 92+ messages in thread
From: Aleksa Sarai @ 2019-12-30 8:28 UTC
To: Linus Torvalds
Cc: Al Viro, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List

On 2019-12-29, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > +       if (d_is_symlink(mp->m_dentry) ||
> > +           d_is_symlink(mnt->mnt.mnt_root))
> > +               return -EINVAL;
>
> So I don't hate this kind of check in general - overmounting a symlink
> sounds odd, but at the same time I get the feeling that the real issue
> is that something went wrong earlier.
>
> Yeah, the mount target kind of _is_ a path, but at the same time, we
> most definitely want to have the permission to really open the
> directory in question, don't we, and I don't see that we should accept
> a O_PATH file descriptor.

The new mount API uses O_PATH under the hood (which is a good thing
since some files you'd like to avoid actually opening -- FIFOs are the
obvious example) so I'm not sure that's something we could really avoid.
But if we block O_PATH for mounts this will achieve the same thing,
because the only way to get a file descriptor that references a symlink
is through (O_PATH | O_NOFOLLOW).

> I feel like the only valid use of "O_PATH" files is to then use them
> as the base for an openat() and friends (ie fchmodat/execveat() etc).

See below, we use this for all sorts of dirty^Wclever tricks.

> But maybe I'm completely wrong, and people really do want O_PATH
> handling exactly for mounting too. It does sound a bit odd. By
> definition, mounting wants permissions to the mount-point, so what's
> the point of using O_PATH?

When you go through O_PATH, you still get a proper 'struct path' which
means that for operations such as mount (or open) you will operate on
the *real* underlying file. This is part of what makes magic-links so
useful (but also quite terrifying).

> For example, is the problem that when you do a proper
>
>   fd = open("somepath", O_PATH);
>
> in one process, and then another thread does
>
>   fd = open("/proc/<pid>/fd/<opathfd>", O_RDWR);
>
> then we get confused and do bad things on that *second* open? Because
> now the second open doesn't have O_PATH, and doesn't get marked
> FMODE_PATH, but the underlying file descriptor is one of those limited
> "is really only useful for openat() and friends".

Actually, this isn't true (for the same reason as above) -- when you do
a re-open through /proc/$pid/fd/$n you get a real-as-a-heart-attack file
descriptor. We make lots of use of this in container runtimes in order
to do some dirty^Wfun tricks that help us harden the runtime against
malicious container processes.

You might recall that when I was posting the earlier revisions of
openat2(), I also included a patch for O_EMPTYPATH (which basically did
a re-open of /proc/self/fd/$dfd but without needing /proc). That had
precisely the same semantics so that you could do the same operation
without procfs. That patch was dropped before Al merged openat2(), but I
am probably going to revive it for the reasons I outlined below.

> I dunno. I haven't thought through the whole thing. But the oopses you
> quote seem like we're really doing something wrong, and it really does
> feel like your patch in no way _fixes_ the wrong thing we're doing,
> it's just hiding the symptoms.

That's fair enough.

I'll be honest, the real reason I don't want mounts over symlinks to be
possible is an entirely different one. I'm working on a safe path
resolution library to accompany openat2()[1] -- and one of the things I
want to do is to harden all of our uses of procfs (such that if we are
running in a context where procfs has been messed with -- such as having
files bind-mounted -- we can detect it and abort).

The issue with symlinks is that we need to be able to operate on
magic-links (such as /proc/self/fd/$n and /proc/self/exe) -- and if it's
possible to bind-mount over those magic-links then we can't detect it at
all. openat2(RESOLVE_NO_XDEV) would block it, but it also blocks going
through magic-links which change your mount (which would almost always
be true). You can't trust /proc/self/mountinfo by definition -- not just
because of the TOCTOU race but also because you can't depend on /proc to
harden against a "bad" /proc. All other options such as
umount2(MNT_EXPIRE) won't help with magic-links because we cannot take
an O_PATH to a magic-link and follow it -- O_PATHs of symlinks are
completely stunted in this respect.

If bind-mounts over symlinks are allowed (which I don't have a problem
with really), it just means we'll need a few more kernel pieces to get
this hardening to work. But these features would be useful outside of
the problems I'm dealing with (O_EMPTYPATH and some kind of pidfd-based
interface to grab the equivalent of /proc/self/exe and a few other such
magic-link targets).

[1]: https://github.com/openSUSE/libpathrs

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
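The re-open through /proc/$pid/fd/$n described in the message above can
be sketched as follows. This is a minimal illustration, assuming procfs
is mounted at /proc and has not been tampered with (which is exactly the
assumption the message says cannot be made in a hostile container);
reopen_fd() is a hypothetical helper name, not an existing API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: re-open the file that @fd refers to by following
 * the /proc/self/fd/<n> magic-link. If @fd was opened with O_PATH, the
 * result is a regular descriptor with the access mode given in @flags,
 * subject to the usual permission checks on the underlying file.
 * Returns a new fd, or -1 with errno set.
 */
int reopen_fd(int fd, int flags)
{
	char fdpath[64];

	assert(fd >= 0);
	snprintf(fdpath, sizeof(fdpath), "/proc/self/fd/%d", fd);
	return open(fdpath, flags | O_CLOEXEC);
}
```

A read(2) on an O_PATH descriptor fails, while the same read on the
descriptor returned by reopen_fd(fd, O_RDONLY) succeeds, because the
second open produces a descriptor that is not marked FMODE_PATH.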
* Re: [PATCH RFC 1/1] mount: universally disallow mounting over symlinks
  2019-12-30 8:28 ` Aleksa Sarai
@ 2020-01-08 4:39 ` Andy Lutomirski
  0 siblings, 0 replies; 92+ messages in thread
From: Andy Lutomirski @ 2020-01-08 4:39 UTC
To: Aleksa Sarai
Cc: Linus Torvalds, Al Viro, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List

On Mon, Dec 30, 2019 at 12:29 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2019-12-29, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> If bind-mounts over symlinks are allowed (which I don't have a
> problem with really), it just means we'll need a few more kernel pieces
> to get this hardening to work. But these features would be useful
> outside of the problems I'm dealing with (O_EMPTYPATH and some kind of
> pidfd-based interface to grab the equivalent of /proc/self/exe and a few
> other such magic-link targets).

As one data point, I would use this ability in virtme: this would
allow me to more reliably mount over /etc/resolv.conf even when it's a
symlink.

(Perhaps I should use overlayfs instead. Hmm.)
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2019-12-30 5:20 [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Aleksa Sarai
  2019-12-30 5:20 ` [PATCH RFC 1/1] " Aleksa Sarai
@ 2019-12-30 5:44 ` Al Viro
  2019-12-30 5:49 ` Aleksa Sarai
  1 sibling, 1 reply; 92+ messages in thread
From: Al Viro @ 2019-12-30 5:44 UTC
To: Aleksa Sarai
Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel

On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:

> A reasonably detailed explanation of the issues is provided in the patch
> itself, but the full traces produced by both the oopses and deadlocks are
> included below (it makes little sense to include them in the commit since we
> are disabling this feature, not directly fixing the bugs themselves).
>
> I've posted this as an RFC on whether this feature should be allowed at
> all (and if anyone knows of legitimate uses for it), or if we should
> work on fixing these other kernel bugs that it exposes.

Umm... Are all of those traces
	a) reproducible on mainline and
	b) reproducible as the first oopsen?

As it is, quite a few might be secondary results of earlier memory
corruption...
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2019-12-30 5:44 ` [PATCH RFC 0/1] " Al Viro
@ 2019-12-30 5:49 ` Aleksa Sarai
  2019-12-30 7:29 ` Aleksa Sarai
  0 siblings, 1 reply; 92+ messages in thread
From: Aleksa Sarai @ 2019-12-30 5:49 UTC
To: Al Viro
Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel

On 2019-12-30, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:
>
> > A reasonably detailed explanation of the issues is provided in the patch
> > itself, but the full traces produced by both the oopses and deadlocks are
> > included below (it makes little sense to include them in the commit since we
> > are disabling this feature, not directly fixing the bugs themselves).
> >
> > I've posted this as an RFC on whether this feature should be allowed at
> > all (and if anyone knows of legitimate uses for it), or if we should
> > work on fixing these other kernel bugs that it exposes.
>
> Umm... Are all of those traces
> 	a) reproducible on mainline and

This was on viro/for-next, I'll retry it on v5.5-rc4.

> 	b) reproducible as the first oopsen?

The NULL and garbage pointer derefs are reproducible as the first oops.
Looking at my logs, it looks like the deadlocks were always triggered
after the oops, but that might just have been a mistake on my part while
testing things.

> As it is, quite a few might be secondary results of earlier memory
> corruption...

Yeah, I thought that might be the case but decided to include them
anyway (the /proc/self/stack RCU stall is definitely the result of other
corruption and stalls).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2019-12-30 5:49 ` Aleksa Sarai @ 2019-12-30 7:29 ` Aleksa Sarai 2019-12-30 7:53 ` Linus Torvalds 2020-01-01 0:43 ` Al Viro 0 siblings, 2 replies; 92+ messages in thread From: Aleksa Sarai @ 2019-12-30 7:29 UTC (permalink / raw) To: Al Viro Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel [-- Attachment #1.1: Type: text/plain, Size: 4818 bytes --] On 2019-12-30, Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2019-12-30, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote: > > > > > A reasonably detailed explanation of the issues is provided in the patch > > > itself, but the full traces produced by both the oopses and deadlocks is > > > included below (it makes little sense to include them in the commit since we > > > are disabling this feature, not directly fixing the bugs themselves). > > > > > > I've posted this as an RFC on whether this feature should be allowed at > > > all (and if anyone knows of legitimate uses for it), or if we should > > > work on fixing these other kernel bugs that it exposes. > > > > Umm... Are all of those traces > > a) reproducible on mainline and > > This was on viro/for-next, I'll retry it on v5.5-rc4. The NULL deref oops is reproducible on v5.5-rc4. Strangely it seems harder to reproduce than on viro/for-next (I kept reproducing it there by accident), but I'll double-check if that really is the case. The simplest reproducer is (using the attached programs and .config): ln -s . link sudo ./umount_symlink link There's also a few other whacky behaviours where you get -ELOOP or -EACCES in cases where you shouldn't -- which results in MNT_DETACH failing and the mount being impossible to get rid of. A good example is sudo ./mount_to_symlink /proc/self/exe link sudo ./umount_symlink link # -EACCES Or ln -s . 
link1 ln -s . link2 sudo ./mount_to_symlink link1 link2 sudo ./umount_symlink link1 # -ELOOP sudo ./umount_symlink link2 # -ELOOP But I am trying to find a reproducer for the "umount of a mount triggering an Oops" issue. On another note -- I guess this is considered a feature which should "just work" and not a bug? BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page PGD 80000003c6fca067 P4D 80000003c6fca067 PUD 3c6f42067 PMD 0 Oops: 0010 [#1] SMP PTI CPU: 4 PID: 4486 Comm: umount_symlink Tainted: G E 5.5.0-rc4-cyphar #126 Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018 RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ffffb70b82963cc0 EFLAGS: 00010206 RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0 RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000 R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0 R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080 FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0 Call Trace: __lookup_slow+0x94/0x160 lookup_slow+0x36/0x50 path_mountpoint+0x1be/0x360 filename_mountpoint+0xa5/0x150 ? 
__lookup_hash+0xa0/0xa0 ksys_umount+0x78/0x490 __x64_sys_umount+0x12/0x20 do_syscall_64+0x64/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fbc2a8274e7 Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd1da9b3f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbc2a8274e7 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000001300310 RBP: 00007ffd1da9b4c0 R08: 0000000000000000 R09: 000000000000000f R10: 00007fbc2a92f800 R11: 0000000000000202 R12: 0000000000401090 R13: 00007ffd1da9b5a0 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: [snip] CR2: 0000000000000000 ---[ end trace ae473813e34e641d ]--- RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ffffb70b82963cc0 EFLAGS: 00010206 RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0 RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000 R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0 R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080 FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0 -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #1.2: .config --] [-- Type: application/x-config, Size: 237462 bytes --] [-- Attachment #1.3: mount_to_symlink.c --] [-- Type: text/x-c, Size: 1318 bytes --] #define _GNU_SOURCE #include <sys/types.h> #include <sys/mount.h> #include <sys/types.h> #include <sys/stat.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <errno.h> #define bail(msg) \ do { printf("mount_to_symlink: %s: %m\n", msg); 
exit(1); } while (0)

int is_symlink(const char *path)
{
	struct stat stat = {};

	if (lstat(path, &stat) < 0)
		bail("lstat(<path>)");
	return S_ISLNK(stat.st_mode);
}

int main(int argc, char **argv)
{
	struct stat stat = {};
	char *src, *dst, *src_fdpath, *dst_fdpath;
	int src_fd, dst_fd;

	if (argc != 3)
		bail("usage: mount_to_symlink <src> <dst>");

	src_fdpath = src = argv[1];
	dst_fdpath = dst = argv[2];

	if (is_symlink(src)) {
		// open source fd
		src_fd = open(src, O_PATH | O_CLOEXEC | O_NOFOLLOW);
		if (src_fd < 0)
			bail("open(<src>, O_PATH|O_NOFOLLOW)");
		// construct fd path
		asprintf(&src_fdpath, "/proc/self/fd/%d", src_fd);
	}

	if (is_symlink(dst)) {
		// open target fd
		dst_fd = open(dst, O_PATH | O_CLOEXEC | O_NOFOLLOW);
		if (dst_fd < 0)
			bail("open(<dst>, O_PATH|O_NOFOLLOW)");
		// construct fd path
		asprintf(&dst_fdpath, "/proc/self/fd/%d", dst_fd);
	}

	// try to mount
	mount(src_fdpath, dst_fdpath, "", MS_BIND, "");
	printf("mount(%s, %s, MS_BIND) = %m (%d)\n", src, dst, -errno);
	return 0;
}

[-- Attachment #1.4: umount_symlink.c --]
[-- Type: text/x-c, Size: 795 bytes --]

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

#define bail(msg) \
	do { printf("umount_symlink: %s: %m\n", msg); exit(1); } while (0)

int main(int argc, char **argv)
{
	struct stat stat = {};
	char *mnt, *mnt_fdpath;
	int mnt_fd;

	if (argc != 2)
		bail("need <mount> argument");

	mnt = argv[1];

	// open mountpoint fd
	mnt_fd = open(mnt, O_PATH | O_CLOEXEC | O_NOFOLLOW);
	if (mnt_fd < 0)
		bail("open(<mnt>, O_PATH|O_NOFOLLOW)");

	// get fdpath
	asprintf(&mnt_fdpath, "/proc/self/fd/%d", mnt_fd);

	// try to unmount
	umount2(mnt_fdpath, MNT_DETACH);
	printf("umount2(%s, MNT_DETACH) = %m (%d)\n", mnt, -errno);
	return 0;
}

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2019-12-30 7:29 ` Aleksa Sarai @ 2019-12-30 7:53 ` Linus Torvalds 2019-12-30 8:32 ` Aleksa Sarai 2020-01-01 0:43 ` Al Viro 1 sibling, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2019-12-30 7:53 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Al Viro, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List

On Sun, Dec 29, 2019 at 11:30 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000

Would you mind building with debug info, and then running the oops through

    scripts/decode_stacktrace.sh

which makes those addresses much more legible.

> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page

Somebody jumped through a NULL pointer.

> RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc
> RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0
> RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000
> R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0
> R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080
> FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0
> Call Trace:
> __lookup_slow+0x94/0x160

And "__lookup_slow()" has two indirect calls (they aren't obvious with retpoline, but look for something like

    call __x86_indirect_thunk_rax

which is the modern sad way of doing "call *%rax"). One is for revalidating an old dentry, but the one I _suspect_ you trigger is this one:

    old = inode->i_op->lookup(inode, dentry, flags);

but I thought we only could get here if we know it's a directory.

How did we miss the "d_can_lookup()", which is what should check that yes, we can call that ->lookup() routine.
This is why I have that suspicion that it's somehow that O_PATH fd opened in another process without O_PATH causes confusion... So what I think has happened is that because of the O_PATH thing, we've ended up with an inode that has never been truly opened (because O_PATH skips that part), but then with the /proc/<pid>/fd/xyz open, we now have a file descriptor that _looks_ like it is valid, and we're treating that inode as if it can be used. But I'm handwaving. Linus ^ permalink raw reply [flat|nested] 92+ messages in thread
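The failure mode suspected above, an indirect call through a NULL ->lookup method, can be illustrated with a small user-space sketch. Everything here is a toy stand-in: `toy_inode`, `toy_inode_ops`, `toy_can_lookup()` and `toy_lookup()` are hypothetical names invented for illustration, not the kernel's types or its actual `d_can_lookup()` implementation. The only point is that a method table entry must be checked before it is called.

```c
#include <stddef.h>

/* Toy stand-ins for an inode and its method table; NOT the kernel types. */
struct toy_inode;

struct toy_inode_ops {
	/* Only directory-like objects provide lookup; otherwise it is NULL. */
	int (*lookup)(struct toy_inode *dir, const char *name);
};

struct toy_inode {
	const struct toy_inode_ops *i_op;
};

/* Hypothetical analogue of a "can we look up in this?" check. */
int toy_can_lookup(const struct toy_inode *inode)
{
	return inode->i_op != NULL && inode->i_op->lookup != NULL;
}

/*
 * Guarded lookup: returns -1 instead of jumping through a NULL pointer
 * when the object has no lookup method (e.g. a symlink).
 */
int toy_lookup(struct toy_inode *inode, const char *name)
{
	if (!toy_can_lookup(inode))
		return -1;
	return inode->i_op->lookup(inode, name);
}

/* A pretend directory lookup: any non-empty name is "found". */
int toy_dir_lookup(struct toy_inode *dir, const char *name)
{
	(void)dir;
	return name[0] != '\0';
}

const struct toy_inode_ops toy_dir_ops = { .lookup = toy_dir_lookup };
const struct toy_inode_ops toy_symlink_ops = { .lookup = NULL };
```

Without the `toy_can_lookup()` guard, calling `lookup` on an object using `toy_symlink_ops` would transfer control to address 0, which is the user-space equivalent of the "RIP: 0010:0x0" seen in the trace.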
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2019-12-30 7:53 ` Linus Torvalds @ 2019-12-30 8:32 ` Aleksa Sarai 2020-01-02 8:58 ` David Laight 0 siblings, 1 reply; 92+ messages in thread From: Aleksa Sarai @ 2019-12-30 8:32 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 2682 bytes --] On 2019-12-29, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, Dec 29, 2019 at 11:30 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > > Would you mind building with debug info, and then running the oops through > > scripts/decode_stacktrace.sh > > which makes those addresses much more legible. Will do. > > #PF: supervisor instruction fetch in kernel mode > > #PF: error_code(0x0010) - not-present page > > Somebody jumped through a NULL pointer. > > > RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc > > RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0 > > RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000 > > R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0 > > R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080 > > FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0 > > Call Trace: > > __lookup_slow+0x94/0x160 > > And "__lookup_slow()" has two indirect calls (they aren't obvious with > retpoline, but look for something like > > call __x86_indirect_thunk_rax > > which is the modern sad way of doing "call *%rax"). 
One is for
> revalidating an old dentry, but the one I _suspect_ you trigger is
> this one:
>
>     old = inode->i_op->lookup(inode, dentry, flags);
>
> but I thought we only could get here if we know it's a directory.
>
> How did we miss the "d_can_lookup()", which is what should check that
> yes, we can call that ->lookup() routine.

I'll try applying a trivial patch to add d_can_lookup() to see if it fixes the immediate issue.

> This is why I have that suspicion that it's somehow that O_PATH fd
> opened in another process without O_PATH causes confusion...
>
> So what I think has happened is that because of the O_PATH thing,
> we've ended up with an inode that has never been truly opened (because
> O_PATH skips that part), but then with the /proc/<pid>/fd/xyz open, we
> now have a file descriptor that _looks_ like it is valid, and we're
> treating that inode as if it can be used.

I'm not sure I agree -- as I mentioned in my other mail, re-opening through /proc/self/fd/$n works *very* well and has for a long time (in fact, both LXC and runc depend on this working).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply [flat|nested] 92+ messages in thread
* RE: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2019-12-30 8:32 ` Aleksa Sarai @ 2020-01-02 8:58 ` David Laight 2020-01-02 9:09 ` Aleksa Sarai 0 siblings, 1 reply; 92+ messages in thread From: David Laight @ 2020-01-02 8:58 UTC (permalink / raw) To: 'Aleksa Sarai', Linus Torvalds Cc: Al Viro, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List From: Aleksa Sarai > Sent: 30 December 2019 08:32 ... > I'm not sure I agree -- as I mentioned in my other mail, re-opening > through /proc/self/fd/$n works *very* well and has for a long time (in > fact, both LXC and runc depend on this working). I thought it was marginally broken because it is followed as a symlink? On, for example, NetBSD /proc/<n>/fd/<n> is a real reference to the filesystem inode and can be used to link the file back into the filesystem if all the directory entries have been removed. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-02 8:58 ` David Laight @ 2020-01-02 9:09 ` Aleksa Sarai 0 siblings, 0 replies; 92+ messages in thread From: Aleksa Sarai @ 2020-01-02 9:09 UTC (permalink / raw) To: David Laight Cc: Linus Torvalds, Al Viro, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 937 bytes --] On 2020-01-02, David Laight <David.Laight@ACULAB.COM> wrote: > From: Aleksa Sarai > > Sent: 30 December 2019 08:32 > ... > > I'm not sure I agree -- as I mentioned in my other mail, re-opening > > through /proc/self/fd/$n works *very* well and has for a long time (in > > fact, both LXC and runc depend on this working). > > I thought it was marginally broken because it is followed as a symlink? > On, for example, NetBSD /proc/<n>/fd/<n> is a real reference to the > filesystem inode and can be used to link the file back into the filesystem > if all the directory entries have been removed. That is also the case on Linux. It (strictly speaking) isn't a symlink in the normal sense of the word, it's a magic-link (nd_jump_link switches the nd->path to the actual 'struct file' in the case of /proc/self/fd/$n). -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 92+ messages in thread
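The re-opening pattern discussed in this sub-thread (take an O_PATH handle, then open /proc/self/fd/$n to get a usable descriptor) can be sketched as follows. This is a minimal illustration, not code from LXC or runc; the helper name `reopen_via_procfd()` is hypothetical, and it assumes procfs is mounted at /proc.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Open `path` as an O_PATH handle, then re-open that handle through the
 * /proc/self/fd/<n> magic-link with `flags`.
 * Returns the new file descriptor, or -1 with errno set on failure.
 */
int reopen_via_procfd(const char *path, int flags)
{
	char fdpath[64];
	int path_fd, fd, saved_errno;

	/* O_PATH gives a handle to the object without really opening it. */
	path_fd = open(path, O_PATH | O_CLOEXEC);
	if (path_fd < 0)
		return -1;

	snprintf(fdpath, sizeof(fdpath), "/proc/self/fd/%d", path_fd);

	/* Following the magic-link jumps straight to the handle's target. */
	fd = open(fdpath, flags | O_CLOEXEC);

	/* Preserve errno from the second open() across close(). */
	saved_errno = errno;
	close(path_fd);
	errno = saved_errno;

	return fd;
}
```

A caller might use `reopen_via_procfd("/some/file", O_RDONLY)` to upgrade a path handle to a readable descriptor. Note that this sketch does not pass O_NOFOLLOW on the first open(), so it never ends up holding a handle to a symlink itself, which is the case the reproducers in this thread exercise.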
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2019-12-30 7:29 ` Aleksa Sarai 2019-12-30 7:53 ` Linus Torvalds @ 2020-01-01 0:43 ` Al Viro 2020-01-01 0:54 ` Al Viro 1 sibling, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-01 0:43 UTC (permalink / raw) To: Aleksa Sarai Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel On Mon, Dec 30, 2019 at 06:29:59PM +1100, Aleksa Sarai wrote: > On 2019-12-30, Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2019-12-30, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote: > > > > > > > A reasonably detailed explanation of the issues is provided in the patch > > > > itself, but the full traces produced by both the oopses and deadlocks is > > > > included below (it makes little sense to include them in the commit since we > > > > are disabling this feature, not directly fixing the bugs themselves). > > > > > > > > I've posted this as an RFC on whether this feature should be allowed at > > > > all (and if anyone knows of legitimate uses for it), or if we should > > > > work on fixing these other kernel bugs that it exposes. > > > > > > Umm... Are all of those traces > > > a) reproducible on mainline and > > > > This was on viro/for-next, I'll retry it on v5.5-rc4. > > The NULL deref oops is reproducible on v5.5-rc4. Strangely it seems > harder to reproduce than on viro/for-next (I kept reproducing it there > by accident), but I'll double-check if that really is the case. > > The simplest reproducer is (using the attached programs and .config): > > ln -s . link > sudo ./umount_symlink link FWIW, the problem with that reproducer is that we *CAN'T* resolve that path. Look: you have /proc/self/fd/3 resolve to ./link. OK, you've asked to follow that. Got ./link, which is a symlink, so we need to follow it further. Relative to what, though? 
The meaning of symlink is dependent upon the directory you find it in. And we don't have any here.

The bug is in mountpoint_last() - we have

	if (unlikely(nd->last_type != LAST_NORM)) {
		error = handle_dots(nd, nd->last_type);
		if (error)
			return error;
		path.dentry = dget(nd->path.dentry);
	} else {
		path.dentry = d_lookup(dir, &nd->last);
		if (!path.dentry) {
			/*
			 * No cached dentry. Mounted dentries are pinned in the
			 * cache, so that means that this dentry is probably
			 * a symlink or the path doesn't actually point
			 * to a mounted dentry.
			 */
			path.dentry = lookup_slow(&nd->last, dir,
					     nd->flags | LOOKUP_NO_REVAL);
			if (IS_ERR(path.dentry))
				return PTR_ERR(path.dentry);
		}
	}
	if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) {
		dput(path.dentry);
		return -ENOENT;
	}
	path.mnt = nd->path.mnt;
	return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0);

in there, and that ends up with step_into() called in case of LAST_DOT/LAST_DOTDOT (where it's harmless) *AND* in case of LAST_BIND. Where it very much isn't.

I'm not sure if you have caught anything else, but we really, really should *NOT* consider the LAST_BIND as "maybe we should follow the result" material. So at least the following is needed; could you check if anything else remains with that applied?

diff --git a/fs/namei.c b/fs/namei.c
index d6c91d1e88cb..d4fbbda8a7ff 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2656,10 +2656,7 @@ mountpoint_last(struct nameidata *nd)
 	nd->flags &= ~LOOKUP_PARENT;
 
 	if (unlikely(nd->last_type != LAST_NORM)) {
-		error = handle_dots(nd, nd->last_type);
-		if (error)
-			return error;
-		path.dentry = dget(nd->path.dentry);
+		return handle_dots(nd, nd->last_type);
 	} else {
 		path.dentry = d_lookup(dir, &nd->last);
 		if (!path.dentry) {

^ permalink raw reply related [flat|nested] 92+ messages in thread
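The observation that a relative symlink only has meaning relative to the directory it was found in can be shown with a purely textual user-space sketch. The helper `symlink_target_path()` below is a hypothetical name, and it does plain string joining with no normalization; it is only meant to show that, with no base directory, a relative target such as "." cannot be resolved at all.

```c
#include <stdio.h>
#include <string.h>

/*
 * Combine a symlink's target with the directory the symlink lives in.
 * Returns 0 on success, or -1 if a relative target has no base directory
 * or the result does not fit in `out`.
 */
int symlink_target_path(const char *base_dir, const char *target,
			char *out, size_t outlen)
{
	int n;

	if (target[0] == '/')		/* absolute: the base is irrelevant */
		n = snprintf(out, outlen, "%s", target);
	else if (base_dir == NULL)	/* relative, but relative to what? */
		return -1;
	else
		n = snprintf(out, outlen, "%s/%s", base_dir, target);

	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The same target "." yields "/home/user/." when the link lives in /home/user and "/tmp/." when it lives in /tmp. Reaching the link through /proc/self/fd/3 corresponds to the `base_dir == NULL` case: the walk arrives at the symlink without knowing its parent.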
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-01 0:43 ` Al Viro @ 2020-01-01 0:54 ` Al Viro 2020-01-01 3:08 ` Al Viro 0 siblings, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-01 0:54 UTC (permalink / raw) To: Aleksa Sarai Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel On Wed, Jan 01, 2020 at 12:43:24AM +0000, Al Viro wrote: > I'm not sure if you have caught anything else, but we really, really should *NOT* > consider the LAST_BIND as "maybe we should follow the result" material. So > at least the following is needed; could you check if anything else remains > with that applied? > > diff --git a/fs/namei.c b/fs/namei.c > index d6c91d1e88cb..d4fbbda8a7ff 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -2656,10 +2656,7 @@ mountpoint_last(struct nameidata *nd) > nd->flags &= ~LOOKUP_PARENT; > > if (unlikely(nd->last_type != LAST_NORM)) { > - error = handle_dots(nd, nd->last_type); > - if (error) > - return error; > - path.dentry = dget(nd->path.dentry); > + return handle_dots(nd, nd->last_type); > } else { > path.dentry = d_lookup(dir, &nd->last); > if (!path.dentry) { Note, BTW, that lookup_last() (aka walk_component()) does just that - we only hit step_into() on LAST_NORM. The same goes for do_last(). mountpoint_last() not doing the same is _not_ intentional - it's definitely a bug. Consider your testcase; link points to . here. So the only thing you could expect from trying to follow it would be the directory 'link' lives in. And you don't have it when you reach the fscker via /proc/self/fd/3; what happens instead is nd->path set to ./link (by nd_jump_link()) *AND* step_into() called, pushing the same ./link onto stack. It violates all kinds of assumptions made by fs/namei.c - when pushing a symlink onto stack nd->path is expected to contain the base directory for resolving it. 
I'm fairly sure that this is the cause of at least some of the insanity you've caught; there always could be something else, of course, but this hole needs to be closed in any case. ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-01 0:54 ` Al Viro @ 2020-01-01 3:08 ` Al Viro 2020-01-01 14:44 ` Aleksa Sarai 0 siblings, 1 reply; 92+ messages in thread
From: Al Viro @ 2020-01-01 3:08 UTC (permalink / raw)
To: Aleksa Sarai
Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel

On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
> Note, BTW, that lookup_last() (aka walk_component()) does just
> that - we only hit step_into() on LAST_NORM. The same goes
> for do_last(). mountpoint_last() not doing the same is _not_
> intentional - it's definitely a bug.
>
> Consider your testcase; link points to . here. So the only
> thing you could expect from trying to follow it would be
> the directory 'link' lives in. And you don't have it
> when you reach the fscker via /proc/self/fd/3; what happens
> instead is nd->path set to ./link (by nd_jump_link()) *AND*
> step_into() called, pushing the same ./link onto stack.
> It violates all kinds of assumptions made by fs/namei.c -
> when pushing a symlink onto stack nd->path is expected to
> contain the base directory for resolving it.
>
> I'm fairly sure that this is the cause of at least some
> of the insanity you've caught; there always could be
> something else, of course, but this hole needs to be
> closed in any case.

... and with removal of now unused local variable, that's

mountpoint_last(): fix the treatment of LAST_BIND

step_into() should be attempted only in LAST_NORM case, when we have the parent directory (in nd->path). We get away with that for LAST_DOT and LAST_DOTDOT, since those can't be symlinks, making step_into() an equivalent of path_to_nameidata() - we do a bit of useless work, but that's it. For LAST_BIND (i.e. the case when we'd just followed a procfs-style symlink) we really can't go there - result might be a symlink and we really can't attempt following it.

lookup_last() and do_last() do handle that properly; mountpoint_last() should do the same.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/namei.c b/fs/namei.c
index d6c91d1e88cb..13f9f973722b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2643,7 +2643,6 @@ EXPORT_SYMBOL(user_path_at_empty);
 static int
 mountpoint_last(struct nameidata *nd)
 {
-	int error = 0;
 	struct dentry *dir = nd->path.dentry;
 	struct path path;
 
@@ -2656,10 +2655,7 @@ mountpoint_last(struct nameidata *nd)
 	nd->flags &= ~LOOKUP_PARENT;
 
 	if (unlikely(nd->last_type != LAST_NORM)) {
-		error = handle_dots(nd, nd->last_type);
-		if (error)
-			return error;
-		path.dentry = dget(nd->path.dentry);
+		return handle_dots(nd, nd->last_type);
 	} else {
 		path.dentry = d_lookup(dir, &nd->last);
 		if (!path.dentry) {

^ permalink raw reply related [flat|nested] 92+ messages in thread
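The reasoning in the commit message above can be restated as a small truth table in user-space C. This is only a toy model of the argument, not kernel code: the `toy_` enum and helpers are hypothetical names, and the properties assigned to each case are taken from the prose of the commit message rather than derived from fs/namei.c.

```c
#include <stdbool.h>

/* Toy mirror of the "type of last component" cases named in the thread. */
enum toy_last_type {
	TOY_LAST_NORM,		/* ordinary name, looked up in a parent dir */
	TOY_LAST_ROOT,
	TOY_LAST_DOT,
	TOY_LAST_DOTDOT,
	TOY_LAST_BIND,		/* result of following a procfs-style link */
};

/* "." and ".." (and the root) can't be symlinks; the other cases can. */
bool toy_may_be_symlink(enum toy_last_type t)
{
	return t == TOY_LAST_NORM || t == TOY_LAST_BIND;
}

/* Only a normal lookup leaves the parent directory available. */
bool toy_have_parent_dir(enum toy_last_type t)
{
	return t == TOY_LAST_NORM;
}

/*
 * Attempting to follow the result is safe if we either know the parent
 * directory, or the result cannot be a symlink in the first place.
 */
bool toy_step_into_is_safe(enum toy_last_type t)
{
	return toy_have_parent_dir(t) || !toy_may_be_symlink(t);
}

/* Behaviour after the fix: step into the result only for LAST_NORM. */
bool toy_fixed_steps_into(enum toy_last_type t)
{
	return t == TOY_LAST_NORM;
}
```

In this model the only unsafe case is TOY_LAST_BIND, which matches the claim that the pre-fix code "got away with" LAST_DOT and LAST_DOTDOT but not with LAST_BIND.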
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-01 3:08 ` Al Viro @ 2020-01-01 14:44 ` Aleksa Sarai 2020-01-01 23:40 ` Al Viro 2020-01-04 5:52 ` Andy Lutomirski 0 siblings, 2 replies; 92+ messages in thread From: Aleksa Sarai @ 2020-01-01 14:44 UTC (permalink / raw) To: Al Viro Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 3198 bytes --] On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote: > > Note, BTW, that lookup_last() (aka walk_component()) does just > > that - we only hit step_into() on LAST_NORM. The same goes > > for do_last(). mountpoint_last() not doing the same is _not_ > > intentional - it's definitely a bug. > > > > Consider your testcase; link points to . here. So the only > > thing you could expect from trying to follow it would be > > the directory 'link' lives in. And you don't have it > > when you reach the fscker via /proc/self/fd/3; what happens > > instead is nd->path set to ./link (by nd_jump_link()) *AND* > > step_into() called, pushing the same ./link onto stack. > > It violates all kinds of assumptions made by fs/namei.c - > > when pushing a symlink onto stack nd->path is expected to > > contain the base directory for resolving it. > > > > I'm fairly sure that this is the cause of at least some > > of the insanity you've caught; there always could be > > something else, of course, but this hole needs to be > > closed in any case. > > ... and with removal of now unused local variable, that's > > mountpoint_last(): fix the treatment of LAST_BIND > > step_into() should be attempted only in LAST_NORM > case, when we have the parent directory (in nd->path). 
> We get away with that for LAST_DOT and LAST_DOTDOT,
> since those can't be symlinks, making step_into() an
> equivalent of path_to_nameidata() - we do a bit of
> useless work, but that's it. For LAST_BIND (i.e.
> the case when we'd just followed a procfs-style
> symlink) we really can't go there - result might
> be a symlink and we really can't attempt following
> it.
>
> lookup_last() and do_last() do handle that properly;
> mountpoint_last() should do the same.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Thanks, this fixes the issue for me (and also fixes another reproducer I found -- mounting a symlink on top of itself then trying to umount it).

Reported-by: Aleksa Sarai <cyphar@cyphar.com>
Tested-by: Aleksa Sarai <cyphar@cyphar.com>

As for the original topic of bind-mounting symlinks -- given this is a supported feature, would you be okay with me sending an updated O_EMPTYPATH series?

> ---
> diff --git a/fs/namei.c b/fs/namei.c
> index d6c91d1e88cb..13f9f973722b 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2643,7 +2643,6 @@ EXPORT_SYMBOL(user_path_at_empty);
>  static int
>  mountpoint_last(struct nameidata *nd)
>  {
> -	int error = 0;
>  	struct dentry *dir = nd->path.dentry;
>  	struct path path;
>
> @@ -2656,10 +2655,7 @@ mountpoint_last(struct nameidata *nd)
>  	nd->flags &= ~LOOKUP_PARENT;
>
>  	if (unlikely(nd->last_type != LAST_NORM)) {
> -		error = handle_dots(nd, nd->last_type);
> -		if (error)
> -			return error;
> -		path.dentry = dget(nd->path.dentry);
> +		return handle_dots(nd, nd->last_type);
>  	} else {
>  		path.dentry = d_lookup(dir, &nd->last);
>  		if (!path.dentry) {

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-01 14:44 ` Aleksa Sarai @ 2020-01-01 23:40 ` Al Viro 2020-01-02 3:59 ` Aleksa Sarai 2020-01-04 5:52 ` Andy Lutomirski 1 sibling, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-01 23:40 UTC (permalink / raw) To: Aleksa Sarai Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote: > Thanks, this fixes the issue for me (and also fixes another reproducer I > found -- mounting a symlink on top of itself then trying to umount it). > > Reported-by: Aleksa Sarai <cyphar@cyphar.com> > Tested-by: Aleksa Sarai <cyphar@cyphar.com> Pushed into #fixes. > As for the original topic of bind-mounting symlinks -- given this is a > supported feature, would you be okay with me sending an updated > O_EMPTYPATH series? Post it on fsdevel; I'll need to reread it anyway to say anything useful... ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-01 23:40 ` Al Viro @ 2020-01-02 3:59 ` Aleksa Sarai 2020-01-03 1:49 ` Al Viro 0 siblings, 1 reply; 92+ messages in thread
From: Aleksa Sarai @ 2020-01-02 3:59 UTC (permalink / raw)
To: Al Viro
Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 855 bytes --]

On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
>
> > Thanks, this fixes the issue for me (and also fixes another reproducer I
> > found -- mounting a symlink on top of itself then trying to umount it).
> >
> > Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> > Tested-by: Aleksa Sarai <cyphar@cyphar.com>
>
> Pushed into #fixes.

Thanks. One other thing I noticed is that umount applies to the underlying symlink rather than the mountpoint on top. So, for example (using the same scripts I posted in the thread):

  # ln -s /tmp/foo link
  # ./mount_to_symlink /etc/passwd link
  # umount -l link # will attempt to unmount "/tmp/foo"

Is that intentional?

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-02 3:59 ` Aleksa Sarai @ 2020-01-03 1:49 ` Al Viro 2020-01-04 4:46 ` Ian Kent ` (2 more replies) 0 siblings, 3 replies; 92+ messages in thread From: Al Viro @ 2020-01-03 1:49 UTC (permalink / raw) To: Aleksa Sarai Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel, Ian Kent On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote: > On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote: > > > > > Thanks, this fixes the issue for me (and also fixes another reproducer I > > > found -- mounting a symlink on top of itself then trying to umount it). > > > > > > Reported-by: Aleksa Sarai <cyphar@cyphar.com> > > > Tested-by: Aleksa Sarai <cyphar@cyphar.com> > > > > Pushed into #fixes. > > Thanks. One other thing I noticed is that umount applies to the > underlying symlink rather than the mountpoint on top. So, for example > (using the same scripts I posted in the thread): > > # ln -s /tmp/foo link > # ./mount_to_symlink /etc/passwd link > # umount -l link # will attempt to unmount "/tmp/foo" > > Is that intentional? It's a mess, again in mountpoint_last(). FWIW, at some point I proposed to have nd_jump_link() to fail with -ELOOP if the target was a symlink; Linus asked for reasons deeper than my dislike of the semantics, I looked around and hadn't spotted anything. And there hadn't been at the time, but when four months later umount_lookup_last() went in I failed to look for that source of potential problems in it ;-/ I've looked at that area again now. Aside of usual cursing at do_last() horrors (yes, its control flow is a horror; yes, it needs serious massage; no, it's not a good idea to get sidetracked into that right now), there are several fun questions: * d_manage() and d_automount(). 
We almost certainly don't want those for autofs on the final component of pathname in umount, including the trailing symlinks. But do we want those on usual access via /proc/*/fd/*? I.e. suppose somebody does open() (O_PATH or not) in autofs; do we want ->d_manage()/->d_automount() called when resolving /proc/self/fd/<whatever>/foo/bar? We do not; is that correct from autofs point of view? I suspect that refusing to do ->d_automount() is correct, but I don't understand ->d_manage() purpose well enough to tell.
	* I really hope that the weird "trailing / forces automount even in cases when we normally wouldn't trigger it" (stat /mnt/foo vs. stat /mnt/foo/) is not meant to extend to umount. I'd like Ian's confirmation, though.
	* do we want ->d_manage() on following .. into overmounted directory? Again, autofs question...

The minimal fix to mountpoint_last() would be to have follow_mount() done in LAST_NORM case. However, I'd like to understand (and hopefully regularize) the rules for follow_mount()/follow_managed().

Additional scary question is nfsd interplay with automount. For nfs4 exports it's potentially interesting...

Ian, could you comment on the autofs questions above? I'd rather avoid doing changes in that area without your input - it's subtle and breakage in automount-related behaviour can be mysterious as hell.

^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-03 1:49 ` Al Viro @ 2020-01-04 4:46 ` Ian Kent 2020-01-08 3:13 ` Al Viro 2020-01-10 23:19 ` [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Al Viro 2 siblings, 0 replies; 92+ messages in thread From: Ian Kent @ 2020-01-04 4:46 UTC (permalink / raw) To: Al Viro, Aleksa Sarai Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel It may be a bit off-topic here but, in autofs symlinks can be used in place of mounts. That mechanism can be used (mostly nowadays) with amd map format maps. If I'm using symlinks instead of mounts (where I can) I definitely don't want these to be over mounted by a mount. I haven't seen problems like that happening but if it did happen that would be a bug in automount or user mis-use of some sort. On Fri, 2020-01-03 at 01:49 +0000, Al Viro wrote: > On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote: > > On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote: > > > > > > > Thanks, this fixes the issue for me (and also fixes another > > > > reproducer I > > > > found -- mounting a symlink on top of itself then trying to > > > > umount it). > > > > > > > > Reported-by: Aleksa Sarai <cyphar@cyphar.com> > > > > Tested-by: Aleksa Sarai <cyphar@cyphar.com> > > > > > > Pushed into #fixes. > > > > Thanks. One other thing I noticed is that umount applies to the > > underlying symlink rather than the mountpoint on top. So, for > > example > > (using the same scripts I posted in the thread): > > > > # ln -s /tmp/foo link > > # ./mount_to_symlink /etc/passwd link > > # umount -l link # will attempt to unmount "/tmp/foo" > > > > Is that intentional? > > It's a mess, again in mountpoint_last(). 
FWIW, at some point I > proposed > to have nd_jump_link() to fail with -ELOOP if the target was a > symlink; > Linus asked for reasons deeper than my dislike of the semantics, I > looked > around and hadn't spotted anything. And there hadn't been at the > time, > but when four months later umount_lookup_last() went in I failed to > look > for that source of potential problems in it ;-/ > > I've looked at that area again now. Aside of usual cursing at > do_last() > horrors (yes, its control flow is a horror; yes, it needs serious > massage; > no, it's not a good idea to get sidetracked into that right now), > there > are several fun questions: > * d_manage() and d_automount(). We almost certainly don't > want those for autofs on the final component of pathname in umount, > including the trailing symlinks. But do we want those on usual > access > via /proc/*/fd/*? I.e. suppose somebody does open() (O_PATH or not) > in autofs; do we want ->d_manage()/->d_automount() called when > resolving /proc/self/fd/<whatever>/foo/bar? We do not; is that > correct from autofs point of view? I suspect that refusing to > do ->d_automount() is correct, but I don't understand ->d_manage() > purpose well enough to tell. Yes, we don't want those on the final component of the path in umount. The following of a symlink will give us a new path of some sort so the rules would change to the usual ones for the new path. The semantics of following a symlink, be the source a proc entry or not (I think) should always be the same. If the follow takes us to an autofs file system (be it a trigger mount or an indirect mount in an autofs file system) the behaviour should be that of the autofs file system when we arrive there, from an auto-mount POV. The original intent of ->d_manage() was to prevent walks into an under-construction mount and that might not be as simple as mounting a source on a mount point.
For example take the case of an automount indirect mount map entry like this: test /some/path/one server:/source/path1 \ /some/path/two server2:/source/path2 \ /some/other/path server:/source/path3 \ /some/other/path/three server:/source/path4 This entry has no mount at the root of the tree (so called root-less multi-mount) but walks need to block when it's under construction as the topology isn't known until the directory tree and any associated mounts (usually trigger mounts) have been completed. In this case it's needed to go to ref-walk mode and block until it's done. The other (perhaps not so obvious) use of ->d_manage() is to detect expire to mount races. When an automount is expiring at the same time a process (that would cause an automount) is traversing the path. The base (I'll not say root, since the root of the expire might not be the root of the tree) needs to block the walk until the expire is done. These multi-mounts are meant to provide a "mount as you go" mechanism so that only portions of the tree of mounts are mounted or expired at any one time. For example, the offsets in the above entry are /some/path/one, /some/path/two, /some/other/path and /some/other/path/three. On access to <autofs mount>/test automount is meant to mount trigger mounts for offsets /some/path/one, /some/path/two and /some/other/path and mount an offset trigger for /some/other/path/three into the mount for /some/other/path when it's accessed and that might not happen during the initial mount of the tree. The reverse being done on umount in sub-trees of mounts when a nesting point like /some/other/path is encountered. But that's something of an aside because in all cases below the root there will be an actual mount preventing walks into the tree under nesting point mounts being constructed or expired. 
Anyway, returning to the topic at hand, the answer to whether we want ->d_manage()/->d_automount() after a symlink has been followed is yes, I think, because at that point we could be within a file system that has automounts of some sort. But perhaps I'm missing something about the description of the case above ... > * I really hope that the weird "trailing / forces automount > even in cases when we normally wouldn't trigger it" (stat /mnt/foo > vs. stat /mnt/foo/) is not meant to extend to umount. I'd like > Ian's confirmation, though. I can't see any way that the trailing "/" can relate to umount. It has always been meant to be used to trigger a mount on something that would otherwise not be mounted and that's the only case I'm aware of. > * do we want ->d_manage() on following .. into overmounted > directory? Again, autofs question... I think that amounts to asking "can the target of the ../ be in the process of being constructed or expired at this time" and that's probably yes. A root-less multi-mount would be one case where this could happen (although it's not strictly an over-mounted directory). > > The minimal fix to mountpoint_last() would be to have > follow_mount() done in LAST_NORM case. However, I'd like to > understand > (and hopefully regularize) the rules for > follow_mount()/follow_managed(). > Additional scary question is nfsd interplay with automount. For nfs4 > exports it's potentially interesting... I'm not sure about nfs (and other cross mounting file systems). Automounting in file systems other than autofs always has a real mount as the target (AFAIK) so there's an implied blocking that occurs on crossing the mount point. That's always made the nfs automounting case simpler to my thinking anyway. The real problem with nfs automount trees is when the topology of the exports tree changes while parts of it are in use.
People that have any idea of how nfs cross mounting (and mount dependencies in general) work shouldn't do that but they do it and then wonder why things go wrong ... > > Ian, could you comment on the autofs questions above? > I'd rather avoid doing changes in that area without your input - > it's subtle and breakage in automount-related behaviour can be > mysterious as hell. Thanks for the heads up. As always I can run tests on changes you want to do. Fortunately that's generally worked out ok for us in the past. Ian ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-03 1:49 ` Al Viro 2020-01-04 4:46 ` Ian Kent @ 2020-01-08 3:13 ` Al Viro 2020-01-08 3:54 ` Linus Torvalds 2020-01-10 23:19 ` [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Al Viro 2 siblings, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-08 3:13 UTC (permalink / raw) To: Aleksa Sarai Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel, Ian Kent On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote: > It's a mess, again in mountpoint_last(). FWIW, at some point I proposed > to have nd_jump_link() to fail with -ELOOP if the target was a symlink; > Linus asked for reasons deeper than my dislike of the semantics, I looked > around and hadn't spotted anything. And there hadn't been at the time, > but when four months later umount_lookup_last() went in I failed to look > for that source of potential problems in it ;-/ > > I've looked at that area again now. Aside of usual cursing at do_last() > horrors (yes, its control flow is a horror; yes, it needs serious massage; > no, it's not a good idea to get sidetracked into that right now), there > are several fun questions: > * d_manage() and d_automount(). We almost certainly don't > want those for autofs on the final component of pathname in umount, > including the trailing symlinks. But do we want those on usual access > via /proc/*/fd/*? I.e. suppose somebody does open() (O_PATH or not) > in autofs; do we want ->d_manage()/->d_automount() called when > resolving /proc/self/fd/<whatever>/foo/bar? We do not; is that > correct from autofs point of view? I suspect that refusing to > do ->d_automount() is correct, but I don't understand ->d_manage() > purpose well enough to tell. > * I really hope that the weird "trailing / forces automount > even in cases when we normally wouldn't trigger it" (stat /mnt/foo > vs. 
stat /mnt/foo/) is not meant to extend to umount. I'd like > Ian's confirmation, though. > * do we want ->d_manage() on following .. into overmounted > directory? Again, autofs question... FWIW, I suspect that we want to do something along the following lines: 1) make build_open_flags() treat O_CREAT | O_EXCL as if there had been O_NOFOLLOW in the mix. Reason: if there is a trailing symlink, we want to fail with EEXIST anyway. Benefit: this fragment in do_last() error = follow_managed(&path, nd); if (unlikely(error < 0)) return error; /* * create/update audit record if it already exists. */ audit_inode(nd->name, path.dentry, 0); if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) { path_to_nameidata(&path, nd); return -EEXIST; } seq = 0; /* out of RCU mode, so the value doesn't matter */ inode = d_backing_inode(path.dentry); finish_lookup: error = step_into(nd, &path, 0, inode, seq); if (unlikely(error)) return error; can become error = follow_managed(&path, nd); if (unlikely(error < 0)) return error; seq = 0; /* out of RCU mode, so the value doesn't matter */ inode = d_backing_inode(path.dentry); finish_lookup: error = step_into(nd, &path, 0, inode, seq); if (unlikely(error)) return error; if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) { audit_inode(nd->name, nd->path.dentry, 0); return -EEXIST; } Equivalent transformation, since the only goto finish_lookup; is under if (!(open_flag & O_CREAT)). What it buys us is more regular structure of follow_managed() callers. 2) make follow_managed() take &inode and &seq.
Look: follow_managed() never returns 0 (we have if (ret == -EISDIR || !ret) ret = 1; on the way to the only return in it) and the callers are err = follow_managed(path, nd); if (likely(err > 0)) *inode = d_backing_inode(path->dentry); return err; in lookup_fast(), err = follow_managed(&path, nd); if (unlikely(err < 0)) return err; seq = 0; /* we are already out of RCU mode */ inode = d_backing_inode(path.dentry); in walk_component(), err = follow_managed(&path, nd); if (unlikely(err < 0)) return err; inode = d_backing_inode(path.dentry); seq = 0; in handle_lookup_down() and (after the previous change) error = follow_managed(&path, nd); if (unlikely(error < 0)) return error; seq = 0; /* out of RCU mode, so the value doesn't matter */ inode = d_backing_inode(path.dentry); in do_last(). That's begging to fold those followups into follow_managed() itself, doesn't it? And having *seqp = 0; equivalent added in lookup_fast() is not going to hurt the performance - in all callers it's an address of local variable, right next to the one whose address is passed as inodep. Which we'd just dirtied, and the cacheline is not going to have been shared anyway. Note that after that the arguments for follow_managed() become identical to those for __follow_mount_rcu(). Which makes a lot of sense, since the latter is RCU-mode counterpart of the former. 3) have the followup to failing __follow_mount_rcu() taken into it. After (2) we have this in lookup_fast(): *seqp = seq; status = d_revalidate(dentry, nd->flags); if (likely(status > 0)) { /* * Note: do negative dentry check after revalidation in * case that drops it. */ if (unlikely(negative)) return -ENOENT; path->mnt = mnt; path->dentry = dentry; if (likely(__follow_mount_rcu(nd, path, inode, seqp))) return 1; } if (unlazy_child(nd, dentry, seq)) return -ECHILD; if (unlikely(status == -ECHILD)) /* we'd been told to redo it in non-rcu mode */ status = d_revalidate(dentry, nd->flags); } else { ... 
} if (unlikely(status <= 0)) { if (!status) d_invalidate(dentry); dput(dentry); return status; } path->mnt = mnt; path->dentry = dentry; return follow_managed(path, nd, inode, seqp); Suppose __follow_mount_rcu() returns false; what follows is if (unlazy_child(nd, dentry, seq)) return -ECHILD; seq here is equal to *seqp here, dentry - the value of path->dentry at the time of __follow_mount_rcu() call. if (unlikely(status == -ECHILD)) .... not taken - we know that status must have been positive if (unlikely(status <= 0)) { ... } ditto path->mnt = mnt; path->dentry = dentry; return follow_managed(path, nd, inode, seqp); we return *path to original and call follow_managed(). IOW, we could bloody well do all of that in the __follow_mount_rcu() itself, having it return 1 when the original would've returned true and doing that "revert *path, call unlazy_child() and fall back to follow_mount_rcu() in case of success" in cases when the original would've returned false. The caller turns into /* * Note: do negative dentry check after revalidation in * case that drops it. */ if (unlikely(negative)) return -ENOENT; path->mnt = mnt; path->dentry = dentry; return __follow_mount_rcu(nd, path, inode, seqp); 4) fold __follow_mount_rcu() into follow_managed(), using the latter both in RCU and non-RCU cases. 5) take the calls of follow_managed() out of lookup_fast() into its callers. 
That would be err = lookup_fast(nd, &path, &inode, &seq); if (unlikely(err <= 0)) { if (err < 0) return err; path.dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags); if (IS_ERR(path.dentry)) return PTR_ERR(path.dentry); path.mnt = nd->path.mnt; err = follow_managed(&path, nd, &inode, &seq); if (unlikely(err < 0)) return err; } turning into err = lookup_fast(nd, &path, &inode, &seq); if (unlikely(err <= 0)) { if (err < 0) return err; path.dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags); if (IS_ERR(path.dentry)) return PTR_ERR(path.dentry); path.mnt = nd->path.mnt; } err = follow_managed(&path, nd, &inode, &seq); if (unlikely(err < 0)) return err; in walk_component() and error = lookup_fast(nd, &path, &inode, &seq); if (likely(error > 0)) goto finish_lookup; ... error = follow_managed(&path, nd, &inode, &seq); if (unlikely(error < 0)) return error; finish_lookup: turning into error = lookup_fast(nd, &path, &inode, &seq); if (likely(error > 0)) goto finish_lookup; ... finish_lookup: error = follow_managed(&path, nd, &inode, &seq); if (unlikely(error < 0)) return error; in do_last(). 6) after that we have 3 callers of step_into(); the ones in walk_component() and in do_last() would be immediately preceded by the calls of follow_managed(). The last one is in mountpoint_last(). That's if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) { dput(path.dentry); return -ENOENT; } path.mnt = nd->path.mnt; return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0); And that's where we are missing the mountpoint traversal in symlink case - sure, the caller does follow_mount(), but it doesn't catch the case when we have a symlink overmounted - we run into step_into() before that. Note that smp_load_acquire + d_flags_negative is what we would've done in follow_managed(), as well as getting d_backing_inode(). So here we also have an open-coded bastardized variant of follow_managed(). 
The difference is, we don't want to trigger ->d_automount() and ->d_manage() in that one. And at that point the only call of follow_managed() *NOT* followed by step_into() is in handle_lookup_down(). What it is followed by is path_to_nameidata(&path, nd); nd->inode = inode; nd->seq = seq; And that's a piece of step_into(): if (likely(!d_is_symlink(path->dentry)) || !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW)) { /* not a symlink or should not follow */ path_to_nameidata(path, nd); nd->inode = inode; nd->seq = seq; return 0; } is the normal path through that sucker. What's more, we are guaranteed that this will _not_ be a symlink (it's the starting point of pathwalk, and path_init() would've told us to sod off were it not a directory). So if we manage to convert the damn thing in mountpoint_last() into follow_managed(), we could fold follow_managed() into step_into(). Which suggests the way to do that - note that step_into() takes an argument containing ORed WALK_... constants. So we can simply add WALK_NOAUTOMOUNT and put a check for it into if (flags & DCACHE_MANAGE_TRANSIT) { and if (flags & DCACHE_NEED_AUTOMOUNT) { bodies, so that they would be ignored if that's passed to follow_mount()/step_into() hybrid. At that point we have one primitive for moving into child, handling both the mountpoint traversals and keeping track of symlinks. Moreover, there's a fairly strong argument for using it in case of .. as well. As it is, if the parent is overmounted, we cross into whatever is mounted on top of it. And we ignore ->d_manage/->d_automount on the damn thing. Which is not an issue for anything other than autofs (nobody else has ->d_manage() and nfs/afs/cifs automount points don't have children) and for autofs we *want* those called; that's not something likely to be encountered, but it's an impossible setup (autofs direct mount set on an ancestor of somebody's current directory) and autofs does count upon not walking into something being set up by the daemon.
I'll put together such series and see how well does it work; it would fix the idiocies in user_path_mountpoint_at() and make the pathwalk machinery easier to follow - the boilerplate around mountpoint crossing and symlink handling is demonstrably easy to get wrong. If that works and doesn't cause observable slowdown, I'll put it into -next, either stepping around the changes done by openat2() series, or rebasing it on top of that. Another interesting question is whether we want O_PATH open to trigger automounts. The thing is, we do *NOT* trigger them (or traverse mountpoints) at the starting point of lookups. I believe it's a mistake (and mine, at that), but I doubt that there's anything that can be done about it at that point. It's a user-visible behaviour and I can easily imagine a custom /init that ends up relying upon it ;-/ mkdir /root, mount the final root there, chdir /root, mount --move . /, remove everything on initramfs using absolute pathnames and chroot to "." to finish... Traversing mounts at the beginning of pathwalk would break the hell out of that, potentially with root filesystem contents wiped out... ;-/ I wish we could change that, but I'm afraid that's cast in stone by now (and had been for 20 years or so). As it is, we have an unpleasant side effect - O_PATH open does *NOT* trigger automounts. So if you do that to e.g. referral point and try to do ...at() syscalls with that as the origin, you'll get an unpleasant surprise - automount won't trigger at all. I think the easiest way to handle that is to have O_PATH turn LOOKUP_AUTOMOUNT, same as the normal open() does. That's trivial to do, but that changes user-visible behaviour. OTOH, with the current behaviour nobody can rely upon automount not getting triggered by somebody else just as they are entering their open(dir, O_PATH), so I think that's not a problem. Linus, do you have any objections to such O_PATH semantics change? 
PS: I think I see how to untangle the control flow horrors in do_last() with this massage done, but I'm not going there until this is sorted out - by previous experience touching the damn thing can easily turn into several weeks of digging through the nfs/gfs2/etc. guts trying to verify something, with a couple of detours into fixing something in there found in process... ;-/ ^ permalink raw reply [flat|nested] 92+ messages in thread
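The user-visible rule behind step 1 in the message above (a trailing symlink makes O_CREAT | O_EXCL fail with EEXIST instead of being followed) can be checked from user space. What follows is only a minimal sketch, assuming a writable /tmp; the helper name excl_create_through_symlink() and the scratch names "link" and "target" are made up for illustration and are not part of any kernel or libc interface. It says nothing about the proposed do_last() restructuring itself, only about the semantics that restructuring relies on.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Create a dangling symlink in a scratch directory and try
 * open(O_CREAT | O_EXCL) through it.
 *
 * Returns the errno reported by open(), 0 if the open unexpectedly
 * succeeded, or -1 if the scratch setup failed.  The /tmp template and
 * the names "link" and "target" are illustrative only.
 */
static int excl_create_through_symlink(void)
{
	char dir[] = "/tmp/excl-demo-XXXXXX";
	char link_path[64], target_path[64];
	int fd, n, err = 0;

	if (!mkdtemp(dir))
		return -1;
	n = snprintf(link_path, sizeof(link_path), "%s/link", dir);
	assert(n > 0 && (size_t)n < sizeof(link_path));
	n = snprintf(target_path, sizeof(target_path), "%s/target", dir);
	assert(n > 0 && (size_t)n < sizeof(target_path));

	/* "link" points at "target", which does not exist */
	if (symlink(target_path, link_path) != 0) {
		rmdir(dir);
		return -1;
	}

	/* With O_EXCL the trailing symlink is not followed */
	fd = open(link_path, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fd < 0)
		err = errno;
	else
		close(fd);

	/* Clean up; removing "target" only matters if the open followed the link */
	unlink(target_path);
	unlink(link_path);
	rmdir(dir);
	return err;
}
```

Calling excl_create_through_symlink() from a test program and comparing the result against EEXIST is enough to see the behaviour; dropping O_EXCL from the flags would instead create "target" through the link.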
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-08 3:13 ` Al Viro @ 2020-01-08 3:54 ` Linus Torvalds 2020-01-08 21:34 ` Al Viro 2020-01-10 21:07 ` Aleksa Sarai 0 siblings, 2 replies; 92+ messages in thread From: Linus Torvalds @ 2020-01-08 3:54 UTC (permalink / raw) To: Al Viro Cc: Aleksa Sarai, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent On Tue, Jan 7, 2020 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > FWIW, I suspect that we want to do something along the following lines: > > 1) make build_open_flags() treat O_CREAT | O_EXCL as if there had been > O_NOFOLLOW in the mix. My reaction to that is "Whee, that's a big change". But: > Benefit: this fragment in do_last() you're right. That's the semantics we have right now (and I think it's the correct safe semantics when I think about it). But when I first looked at your email I without thinking more about it actually thought we followed the symlink, and then did the O_CREAT | O_EXCL on the target (and potentially succeeded). So I agree - making O_CREAT | O_EXCL imply O_NOFOLLOW seems to be the right thing to do, and not only should simplify our code, it's much more descriptive of what the real semantics are. Even if my first reaction was that it would act differently. Slash-and-burn approach to your explanatory subsequent steps: > 2) make follow_managed() take &inode and &seq. > 3) have the followup to failing __follow_mount_rcu() taken into it. > 4) fold __follow_mount_rcu() into follow_managed(), using the latter both in > RCU and non-RCU cases. > 5) take the calls of follow_managed() out of lookup_fast() into its callers. > 6) after that we have 3 callers of step_into(); [..] > So if we manage to convert the damn thing in mountpoint_last() into > follow_managed(), we could fold follow_managed() into step_into(). I think that all makes sense. 
I didn't go to look at the source, but from the email contents your steps look reasonable to me. > Another interesting question is whether we want O_PATH open > to trigger automounts. It does sound like they shouldn't, but as you say: > The thing is, we do *NOT* trigger them > (or traverse mountpoints) at the starting point of lookups. > I believe it's a mistake (and mine, at that), but I doubt that > there's anything that can be done about it at that point. > It's a user-visible behaviour [..] Hmm. I wonder how set in stone that is. We may have two decades of history of not doing it at start point of lookups, but we do *not* have two decades of history of O_PATH. So what I think we agree would be sane behavior would be for O_PATH opens to not trigger automounts (unless there's a slash at the end, whatever), but _do_ add the mount-point traversal to the beginning of lookups. But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case. That way we maintain original behavior: if somebody overmounts your cwd, you still see the pre-mount directory on lookups, because your cwd is "under" the mount. But if you open a file with O_PATH, and somebody does a mount _afterwards_, the openat() will see that later mount and/or do the automount. Don't you think that would be the more sane/obvious semantics of how O_PATH should work? > I think the easiest way to handle that is to have O_PATH > turn LOOKUP_AUTOMOUNT, same as the normal open() does. That's > trivial to do, but that changes user-visible behaviour. OTOH, > with the current behaviour nobody can rely upon automount not > getting triggered by somebody else just as they are entering > their open(dir, O_PATH), so I think that's not a problem. > > Linus, do you have any objections to such O_PATH semantics > change? See above: I think I'd prefer the O_PATH behavior the other way around. 
That seems to be more of a consistent behavior of what "O_PATH" means - it means "don't really open, we'll do it only when you use it as a directory". But I don't have any _strong_ opinions. If you have a good reason to tell me that I'm being stupid, go ahead and do so and override my stupidity. Linus ^ permalink raw reply [flat|nested] 92+ messages in thread
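For readers less familiar with the "O_PATH descriptor as the origin of ...at() syscalls" pattern being argued about above, here is a small user-space sketch of the plain mechanics. It is an illustration under assumptions: a writable /tmp, and a helper name (opath_origin_reaches_same_file) and file name ("foo") invented for the example. It deliberately does not involve any mount or automount point, so it cannot show either of the semantics under discussion, only what "use it as a directory" means for such a descriptor.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open a directory with O_PATH and use the descriptor purely as the
 * starting point for openat().
 *
 * Returns 1 if the file reached through the O_PATH descriptor is the
 * same inode as the one named by the full pathname, 0 if it is not,
 * and -1 if any step of the setup failed.
 */
static int opath_origin_reaches_same_file(void)
{
	char dir[] = "/tmp/opath-demo-XXXXXX";
	char file_path[64];
	struct stat by_path, by_fd;
	int origin = -1, fd = -1, n, same = -1;

	if (!mkdtemp(dir))
		return -1;
	n = snprintf(file_path, sizeof(file_path), "%s/foo", dir);
	assert(n > 0 && (size_t)n < sizeof(file_path));

	fd = open(file_path, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fd < 0)
		goto out;
	close(fd);
	fd = -1;

	/* O_PATH: nothing is opened for I/O, the descriptor only names a location */
	origin = open(dir, O_PATH | O_DIRECTORY);
	if (origin < 0)
		goto out;

	/* The lookup of "foo" starts from wherever the descriptor points */
	fd = openat(origin, "foo", O_RDONLY);
	if (fd < 0)
		goto out;

	if (stat(file_path, &by_path) != 0 || fstat(fd, &by_fd) != 0)
		goto out;
	same = by_path.st_dev == by_fd.st_dev &&
	       by_path.st_ino == by_fd.st_ino;
out:
	if (fd >= 0)
		close(fd);
	if (origin >= 0)
		close(origin);
	unlink(file_path);
	rmdir(dir);
	return same;
}
```

The question in the thread is what should happen in the openat() step when something has been mounted on the directory (or an automount would be triggered there) after the O_PATH open; the sketch only covers the case where nothing of the sort is present.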
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-08 3:54 ` Linus Torvalds @ 2020-01-08 21:34 ` Al Viro 2020-01-10 0:08 ` Linus Torvalds 2020-01-10 21:07 ` Aleksa Sarai 1 sibling, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-08 21:34 UTC (permalink / raw) To: Linus Torvalds Cc: Aleksa Sarai, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent On Tue, Jan 07, 2020 at 07:54:02PM -0800, Linus Torvalds wrote: > > Another interesting question is whether we want O_PATH open > > to trigger automounts. > > It does sound like they shouldn't, but as you say: > > > The thing is, we do *NOT* trigger them > > (or traverse mountpoints) at the starting point of lookups. > > I believe it's a mistake (and mine, at that), but I doubt that > > there's anything that can be done about it at that point. > > It's a user-visible behaviour [..] > > Hmm. I wonder how set in stone that is. We may have two decades of > history of not doing it at start point of lookups, but we do *not* > have two decades of history of O_PATH. > > So what I think we agree would be sane behavior would be for O_PATH > opens to not trigger automounts (unless there's a slash at the end, > whatever), but _do_ add the mount-point traversal to the beginning of > lookups. > > But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case. > > That way we maintain original behavior: if somebody overmounts your > cwd, you still see the pre-mount directory on lookups, because your > cwd is "under" the mount. > > But if you open a file with O_PATH, and somebody does a mount > _afterwards_, the openat() will see that later mount and/or do the > automount. > > Don't you think that would be the more sane/obvious semantics of how > O_PATH should work? Maybe, but... note that we do not (and AFAICS never had) follow mounts on /proc/self/cwd, /proc/self/fd/42, etc. 
And there are very good reasons for that. First of all, if your stdin is from /tmp/foo, you'd better get that file when you open /dev/stdin, even if somebody has done mount --bind /tmp/bar /tmp/foo; another issue is with the use of stat("/proc/self/fd/42", &buf) - it should be an equivalent of fstat(42, &buf), even if somebody has overmounted that. BTW, for similar reason after link(".", "foo"); fd = open("foo", O_PATH); // return 42 we really should (and do) have resolution of /proc/self/fd/42 stop at foo, not . Reason: consistency of stat() behaviour... The point is, we'd never followed mounts on /proc/self/cwd et.al. I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me) is that way. Actually, scratch that - 2.0 behaves the same way (mountpoint crossing is done in iget() there; is that Minix influence or straight from the Lions' book?) Hmm... Looking through the history, we have (for reference) v7: mount traversal in iget() (forward) and namei() (back); due to the way it's done, forward traversal happens * at starting point * after any component (. and .. included) * on results of forward traversal (due to a loop in iget()). Back traversal (to covered on .. from root directory) is also to unlimited depth. 0.01: no mount handling 0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry() (not by Lions' Book, then - v6 didn't do back traversals at all). Forward traversal * after any component (. and .. included) No traversal on starting point, no traversal on result of traversal. OTOH, mount(2) refuses to mount on top of root, so the lack of the last one is not an issue. 0.12: symlinks added; no mount traversal on starting point of those either. We start at the process' root for absolute ones, even if it happens to be overmounted, and we start from parent for relative ones. The latter matters only if we were in the beginning of the pathwalk, since anything else would've traversed mounts back when we'd picked said parent. 
Mount traversal takes precedence over symlink traversal, but that's not an issue since mount follows links on mountpoint. It does not, at that point, reject fs image with symlink for root, but that actually more or less works. 0.97.3: same, with addition of procfs symlinks. No mount crossing on their targets (for normal symlinks we don't do mount crossing in the beginning and any component inside triggers mount crossing as usual; for procfs ones there's no components inside) Situation remains essentially unchanged until 2.1.42. Next few kernels are in flux, to put it politely - initial merge had been insane and it took until 2.1.44 or so for the things to get more or less working. At 2.1.44: forward traversal in fs/namei.c:lookup(), back traversal in fs/namei.c:reserved_lookup(). Otherwise the same behaviour as pre-dcache (wrt mount traversals, that is). 2.1.51pre1: forward traversal moved into real_lookup() and __d_lookup(). Forward traversal happens *ONLY* after normal components - not after . or .. 2.1.61: forward traversal moved into follow_mount(), behaviour reverted to pre-dcache one. Previous is from reading through the historical trees; my involvement started circa 2.1.120-something. 2.3.50pre3: call of follow_mount() moved a bit, reverting to 2.1.51pre1 behaviour (nor traversal on . or ..) *again*. Not sure whose idea had that been - might've been mine, but unlike the other patch that went into fs/namei.c in the same release, I hadn't been able to find anything related to that one. If your memories (or mail archives) are better... 2.3.99pre4-5: massive surgery in there. Preparations to allowing mount on top of mount; forward traversal adjusted accordingly, back traversal still isn't. 2.3.99pre7-1: more surgery, back traversals are also to unlimited depth now and mount on top of mount has been allowed. 2.3.99pre9-4: mount --bind taught to mount non-directories on top of non-directories. 
At that point it does *NOT* follow trailing symlinks, so mounting of symlinks and mounting on top of symlinks becomes possible. Mount traversal still takes precedence over symlink traversal, symlink traversal of mount traversal result still generally works, even though it's not something I considered at the time. v2.4.5.2: mount --bind started to follow symlinks. So that source of mounting of and on the symlinks was no more. 2.5.0.5: forward mount traversal is done after .. (in handle_dotdot()). That brings back the pre-dcache behaviour for those suckers. Still no forward traversal after ., though. At about the same time I'd been getting rid of the early-boot incestous relationships with fs/namespace.c (initramfs work) and that was probably the last time we could realistically switch to following mounts at starting point; I considered trying to do that, but decided not to. Pity, that... 2.6.5-rc2: normal mount now checks for corrupt fs with symlink for root. Since it has always been following symlinks for mountpoint, the remaining source of mounting of and on symlinks was gone; that lasted until after O_PATH introduction. 2.6.39-rc1: mount traps support - instead of abusing ->follow_link() for automounting, we have an explicit pair of methods that can be called at the same places where we traverse mounts. None too consistent - we don't do that on .. results. That was Dave Howells and Ian Kent. 2.6.39-rc1: O_PATH introduced and, later in the same series, allowed for symlinks. That has changed things - now procfs symlink targets could be symlinks themselves. Originally an attempt to follow those would blow up with -ELOOP (there's simply no good way to follow such beast; it's either "stop even if we are asked to follow" or "give an error"). 3.6.0-rc1: nd_jump_link() introduction (hch) had unnoticed side effects - we'd switched from "fail traversal with -ELOOP" to "stop there". 
Mostly it doesn't change behaviour, but it has opened a way to mount symlinks and mount on top of symlinks. Which generally worked. circa 3.8--3.9: side effects had been noticed; my first reaction had been "let's make nd_jump_link() return an error, then", but I hadn't been able to find good reasons when challenged to do so. Did an audit, found no obvious problems, went "oh, well - whether it works by accident or by design, it doesn't break anything". 3.12.0-rc1: lookups for umount(2) are different - we don't want revalidate on the last component. Which had been handled by introduction of path_umountat()/umount_lookup_last(), parallel to path_lookupat(). Which has gotten quite a few things wrong - it *did* try to follow symlinks obtained by following procfs ones (and blew up big way) and it didn't follow mounts on overmounted trailing symlinks. Nobody noticed for 6 years, until folks actually tried to play with mount-on-symlink... Patches were by Jeff Layton, neither he nor I have spotted the problem back then. And I should have, since it had been only a few months since the audit for exactly that kind of problems... AFAICS, there'd been no serious semantical changes since then. What we have right now: * no mount traversal on the starting point * mount traversal after any component other than "." * symlink traversal consists of possibly jumping to given point plus following a given (possibly empty) series of components. It can be both - e.g. symlink to "/foo/bar" is 'jump to root, then traverse "foo", then traverse "bar"'. Procfs "magic" symlinks are not really magical - they behave as symlinks to "/" as far as the pathwalk semantics is concerned. The only difference is that the jump might not be to the process' root. * mount traversal takes precedence over symlink traversal. * jump (if any) in symlink traversal is treated the same as the starting point - it's not followed by mount traversal.
It's also not followed by symlink traversal, even if we are jumping into a symlink. Of course, in any position other than the end of pathname that's an instant error. That's also not different from the starting point treatment - if ...at(2) is given a symlink for starting point, it leaves it as-is if AT_EMPTY_PATH is given and fails with -ENOTDIR otherwise. * umount(2) handles the final component differently - for one thing, it does not do revalidate, for another - its mount traversal (if any) does not include automount-related parts. And there we *do* want mount traversal at the final point, for obvious reasons. > > I think the easiest way to handle that is to have O_PATH > > turn LOOKUP_AUTOMOUNT, same as the normal open() does. That's > > trivial to do, but that changes user-visible behaviour. OTOH, > > with the current behaviour nobody can rely upon automount not > > getting triggered by somebody else just as they are entering > > their open(dir, O_PATH), so I think that's not a problem. > > > > Linus, do you have any objections to such O_PATH semantics > > change? > > See above: I think I'd prefer the O_PATH behavior the other way > around. That seems to be more of a consistent behavior of what > "O_PATH" means - it means "don't really open, we'll do it only when > you use it as a directory". How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ) vs. faccessat(42, "foo", MAY_READ)? The latter would trigger automount, the former would not... Or would you extend that to "traverse mounts upon following procfs links, if the file in question had been opened with O_PATH"? We could do that (give nd_jump_link() an extra argument telling if we want mount traversal), but I'm not sure if the resulting semantics is sane... Note, BTW, that O_PATH users really can't rely upon automounts _not_ being triggered - all it takes is a lookup on bogus path with such prefix by anybody who can reach that place... 
We are not opening anything, really, but we are not able to ignore automounts triggered by somebody else. ^ permalink raw reply [flat|nested] 92+ messages in thread
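For readers who want to poke at the two lookups compared above from userspace, here is a minimal sketch. It is not code from this thread: the helper name compare_lookups() is made up for illustration, the descriptor number is whatever open() returns rather than 42, and the userspace spelling of the kernel-internal MAY_READ is R_OK. On an ordinary directory both forms should agree; the question raised above is only about what happens when the directory is an automount trigger.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Fallback for builds where _GNU_SOURCE was not in effect early enough;
 * this is the generic Linux value (a few architectures differ). */
#ifndef O_PATH
#define O_PATH 010000000
#endif

/*
 * Hypothetical helper: check readability of "name" under "dir" in the
 * two ways discussed above.  Returns 0 if both checks were attempted,
 * -1 if the directory could not be opened or the path did not fit.
 * The raw results of the two calls are stored through the out pointers.
 */
int compare_lookups(const char *dir, const char *name,
		    int *via_procfs, int *via_dirfd)
{
	char path[4096];
	int fd, n;

	assert(dir && name && via_procfs && via_dirfd);

	fd = open(dir, O_PATH | O_DIRECTORY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	n = snprintf(path, sizeof(path), "/proc/self/fd/%d/%s", fd, name);
	if (n < 0 || (size_t)n >= sizeof(path)) {
		close(fd);
		return -1;
	}

	/* Form 1: walk through the procfs symlink for the descriptor. */
	*via_procfs = access(path, R_OK);

	/* Form 2: use the descriptor itself as the starting point. */
	*via_dirfd = faccessat(fd, name, R_OK, 0);

	close(fd);
	return 0;
}
```

Calling compare_lookups("/mnt/trigger", "foo", &a, &b) on an autofs trigger (the path is only an example) and comparing a with b is one way to observe the difference being argued about; form 1 needs procfs mounted at /proc.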
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-08 21:34 ` Al Viro @ 2020-01-10 0:08 ` Linus Torvalds 2020-01-10 4:15 ` Al Viro 0 siblings, 1 reply; 92+ messages in thread From: Linus Torvalds @ 2020-01-10 0:08 UTC (permalink / raw) To: Al Viro Cc: Aleksa Sarai, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent On Wed, Jan 8, 2020 at 1:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > The point is, we'd never followed mounts on /proc/self/cwd et.al. > I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me) > is that way. Hmm. If that's the case, maybe they should be marked implicitly as O_PATH when opened? > Actually, scratch that - 2.0 behaves the same way > (mountpoint crossing is done in iget() there; is that Minix influence > or straight from the Lions' book?) I don't think I ever had access to Lions' - I've _seen_ a printout of it later, and obviously maybe others did, More likely it's from Maurice Bach: the Design of the Unix Operating System. I'm pretty sure that's where a lot of the FS layer stuff came from. Certainly the bad old buffer head interfaces, and quite likely the iget() stuff too. > 0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry() Whee, you _really_ went back in time. So I did too. And looking at that code in iget(), I doubt it came from anywhere. Christ. It's just looping over a fixed-size array, both when finding the inode, and finding the superblock. Cute, but unbelievably stupid. It was a more innocent time. In other words, I think you can chalk it up to just me, because blaming anybody else for that garbage would be very very unfair indeed ;) > How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ) > vs. faccessat(42, "foo", MAY_READ)? I think that in a perfect world, the O_PATH'ness of '42' would be the deciding factor. 
Wouldn't those be the best and most consistent semantics? And then 'cwd'/'root' always have the O_PATH behavior. > The latter would trigger automount, > the former would not... Or would you extend that to "traverse mounts > upon following procfs links, if the file in question had been opened with > O_PATH"? Exactly. But you know what? I do not believe this is all that important, and I doubt it will matter to anybody. So what matters most is what makes the most sense to the VFS layer, and what makes the most sense to _you_. Because my reaction from this thread is that not only have you thought about this issue and followed the history a whole lot more than I would ever have done, it's also that I trust you to DTRT. I think it would be good to have some self-consistency, but at the same time clearly we already don't really, and our behavior here has subtly changed over the years (and not so subtly - if you go back sufficiently far, /proc behavior wrt file descriptors has had both "dup()" behavior and "make a new file descriptor with the same inode" behavior, afaik). Linus ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-10 0:08 ` Linus Torvalds @ 2020-01-10 4:15 ` Al Viro 2020-01-10 5:03 ` Linus Torvalds 2020-01-10 6:20 ` Ian Kent 0 siblings, 2 replies; 92+ messages in thread From: Al Viro @ 2020-01-10 4:15 UTC (permalink / raw) To: Linus Torvalds Cc: Aleksa Sarai, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent On Thu, Jan 09, 2020 at 04:08:16PM -0800, Linus Torvalds wrote: > On Wed, Jan 8, 2020 at 1:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > The point is, we'd never followed mounts on /proc/self/cwd et.al. > > I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me) > > is that way. > > Hmm. If that's the case, maybe they should be marked implicitly as > O_PATH when opened? I thought you wanted O_PATH as starting point to have mounts traversed? Confused... > > Actually, scratch that - 2.0 behaves the same way > > (mountpoint crossing is done in iget() there; is that Minix influence > > or straight from the Lions' book?) > > I don't think I ever had access to Lions' - I've _seen_ a printout of > it later, and obviously maybe others did, > > More likely it's from Maurice Bach: the Design of the Unix Operating > System. I'm pretty sure that's where a lot of the FS layer stuff came > from. Certainly the bad old buffer head interfaces, and quite likely > the iget() stuff too. > > > 0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry() > > Whee, you _really_ went back in time. > > So I did too. > > And looking at that code in iget(), I doubt it came from anywhere. > Christ. It's just looping over a fixed-size array, both when finding > the inode, and finding the superblock. > > Cute, but unbelievably stupid. It was a more innocent time. 
> > In other words, I think you can chalk it up to just me, because > blaming anybody else for that garbage would be very very unfair indeed > ;) See https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/iget.c Exactly the same algorithm, complete with linear searches over those fixed-sized array. <grabs Bach> Right, he simply transcribes v7 iget(). So I suspect that you are right - your variant of iget was pretty much one-to-one implementation of Bach's description of v7 iget. Your namei wasn't - Bach has 'if the entry points to root and you are in the root and name is "..", find mount table entry (by device number), drop your directory inode, grab the inode of mountpount and restart the search for ".." in there', which gives back traversals to arbitrary depth. And v7 namei() (as Bach mentions) uses iget() for starting point as well as for each component. You kept pointers instead, which is where the other difference has come from (no mount traversal at the starting point)... Actually, I've misread your code in 0.10 - it does unlimited forward traversals; it's back traversals that go only one level. The forward ones got limited to one level in 0.95, but then mount-over-root had been banned all along. I'd read the pre-dcache variant of iget(), seen it go pretty much all the way back to beginning and hadn't sorted out the 0.12 -> 0.95 transition... > > How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ) > > vs. faccessat(42, "foo", MAY_READ)? > > I think that in a perfect world, the O_PATH'ness of '42' would be the > deciding factor. Wouldn't those be the best and most consistent > semantics? > > And then 'cwd'/'root' always have the O_PATH behavior. See above - unless I'm misparsing you, you wanted mount traversals in the starting point if it's ...at() with O_PATH fd. With O_PATH open() not doing them. For cwd and root the situation is opposite - we do NOT traverse mounts for those. And that's really too late to change. 
> > The latter would trigger automount, > > the former would not... Or would you extend that to "traverse mounts > > upon following procfs links, if the file in question had been opened with > > O_PATH"? > > Exactly. > > But you know what? I do not believe this is all that important, and I > doubt it will matter to anybody. FWIW, digging through the automount-related parts of that stuff has caught several fun issues. One (and I'm rather embarrassed by it) should've been caught back in commit 8aef18845266 (VFS: Fix vfsmount overput on simultaneous automount). To quote the commit message: The problem is that lock_mount() drops the caller's reference to the mountpoint's vfsmount in the case where it finds something already mounted on the mountpoint as it transits to the mounted filesystem and replaces path->mnt with the new mountpoint vfsmount. During a pathwalk, however, we don't take a reference on the vfsmount if it is the same as the one in the nameidata struct, but do_add_mount() doesn't know this. At which point I should've gone "what the fuck?" - lock_mount() does, indeed, drop path->mnt in this situation and replaces it with the whatever's come to cover it. For mount(2) that's the right thing to do - we _want_ to mount on top of whatever we have at the mountpoint. For automounts we very much don't want that - it's either "mount right on top of the automount trigger" or discard whatever we'd been about to mount and walk into whatever's got mounted there (presumably the same thing triggered by another process). We kinda-sorta get that effect, but in a very convoluted way: do_add_mount() will refuse to mount something on top of itself - /* Refuse the same filesystem on the same mount point */ err = -EBUSY; if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && path->mnt->mnt_root == path->dentry) goto unlock; which will end up with -EBUSY returned (and recognized by follow_automount()). First of all, that's unreliable. 
If somebody not only has triggered that automount, but managed to _mount_
something else on top (for example, has triggered it by lookup of
mountpoint-to-be in mount(2)), we'll end up not triggering that check. In
which case we'll get something like nfs referral point under nfs
automounted there under tmpfs from explicit overmount under same nfs mount
we'd automounted there - identical to what's been buried under tmpfs. It's
hard to hit, but not impossibly so.

What's more, the whole solution is a kludge - the root of problem is
that lock_mount() is the wrong thing to do in case of finish_automount().
We don't want to go into whatever's overmounting us there, both for
the reasons above *and* because it's a PITA for the caller. So the
right solution is
	* lift lock_mount() call from do_add_mount() into its callers
(all 2 of them); while we are at it, lift unlock_mount() as well
(makes for simpler failure exits in do_add_mount()).
	* replace the call of lock_mount() in finish_automount()
with variant that doesn't do "unlock, walk deeper and retry locking",
returning ERR_PTR(-EBUSY) in such case.
	* get rid of the kludge introduced in that commit. Better
yet, don't bother with traversing into the covering mount in case
of success - let the caller of follow_automount() do that. Which
eliminates the need to pass need_mntput to the sucker and suggests
an even better solution - have this analogue of lock_mount()
return NULL instead of ERR_PTR(-EBUSY) and treat it in finish_automount()
as "OK, discard what we wanted to mount and return 0". That gets
rid of the entire
	err = finish_automount(mnt, path);
	switch (err) {
	case -EBUSY:
		/* Someone else made a mount here whilst we were busy */
		return 0;
	case 0:
		path_put(path);
		path->mnt = mnt;
		path->dentry = dget(mnt->mnt_root);
		return 0;
	default:
		return err;
	}
chunk in follow_automount() - it would just be
	return finish_automount(mnt, path);

Another thing (in the same area) is not a bug per se, but...
after the call of ->d_automount() we have this:
	if (IS_ERR(mnt)) {
		/*
		 * The filesystem is allowed to return -EISDIR here to indicate
		 * it doesn't want to automount. For instance, autofs would do
		 * this so that its userspace daemon can mount on this dentry.
		 *
		 * However, we can only permit this if it's a terminal point in
		 * the path being looked up; if it wasn't then the remainder of
		 * the path is inaccessible and we should say so.
		 */
		if (PTR_ERR(mnt) == -EISDIR && (nd->flags & LOOKUP_PARENT))
			return -EREMOTE;
		return PTR_ERR(mnt);
	}
Except that not a single instance of ->d_automount() has ever returned
-EISDIR. Certainly not autofs one, despite what the comment says.
That chunk has come from dhowells, back when the whole mount trap series
had been merged. After talking that thing over (fun: trying to figure
out what had been intended nearly 9 years ago, when people involved are
in UK, US east coast and AU west coast respectively. The only way it
could suck more would've been if I were on the west coast - then all
timezone deltas would be 8-hour ones)... looks like it's a rudiment
of plans that got superseded during the series development, nobody
quite remembers exact details. Conclusion: it's not even dead, it's
stillborn; bury it.

Unfortunately, there are other interesting questions related to
autofs-specific bits (->d_manage()) and the timezone-related fun
is, of course, still there. I hope to sort that out today or
tomorrow, at least enough to do a reasonable set of backportable
fixes to put in front of follow_managed()/step_into() queue.
Oh, well...

^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-10 4:15 ` Al Viro @ 2020-01-10 5:03 ` Linus Torvalds 2020-01-10 6:20 ` Ian Kent 1 sibling, 0 replies; 92+ messages in thread From: Linus Torvalds @ 2020-01-10 5:03 UTC (permalink / raw) To: Al Viro Cc: Aleksa Sarai, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent On Thu, Jan 9, 2020 at 8:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > Hmm. If that's the case, maybe they should be marked implicitly as > > O_PATH when opened? > > I thought you wanted O_PATH as starting point to have mounts traversed? > Confused... No, I'm confused. I meant "non-O_PATH", just got the rules reversed in my mind. So cwd/root would always act as it non-O_PATH, and only using an actual fd would look at the O_PATH flag, and if it was set would walk the mountpoints. > <grabs Bach> Right, he simply transcribes v7 iget(). > > So I suspect that you are right - your variant of iget was pretty much > one-to-one implementation of Bach's description of v7 iget. Ok, that makes sense. My copy of Bach literally had the system call list "marked off" when I implemented them back when. I may still have that paperbook copy somewhere. I don't _think_ I'd have thrown it out, it has sentimental value. > > I think that in a perfect world, the O_PATH'ness of '42' would be the > > deciding factor. Wouldn't those be the best and most consistent > > semantics? > > > > And then 'cwd'/'root' always have the O_PATH behavior. > > See above - unless I'm misparsing you, you wanted mount traversals in the > starting point if it's ...at() with O_PATH fd. .. and see above, it was just my confusion about the sense of O_PATH. > For cwd and root the situation is opposite - we do NOT traverse mounts > for those. And that's really too late to change. Oh, absolutely. [ snip some more about your automount digging. 
Looks about right, but I'm not going to make a peep after getting O_PATH reversed ;) ] Linus ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-10 4:15 ` Al Viro 2020-01-10 5:03 ` Linus Torvalds @ 2020-01-10 6:20 ` Ian Kent 2020-01-12 21:33 ` Al Viro 1 sibling, 1 reply; 92+ messages in thread From: Ian Kent @ 2020-01-10 6:20 UTC (permalink / raw) To: Al Viro, Linus Torvalds Cc: Aleksa Sarai, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List On Fri, 2020-01-10 at 04:15 +0000, Al Viro wrote: > On Thu, Jan 09, 2020 at 04:08:16PM -0800, Linus Torvalds wrote: > > On Wed, Jan 8, 2020 at 1:34 PM Al Viro <viro@zeniv.linux.org.uk> > > wrote: > > > The point is, we'd never followed mounts on /proc/self/cwd et.al. > > > I hadn't checked 2.0, but 2.1.100 ('97, before any changes from > > > me) > > > is that way. > > > > Hmm. If that's the case, maybe they should be marked implicitly as > > O_PATH when opened? > > I thought you wanted O_PATH as starting point to have mounts > traversed? > Confused... > > > > Actually, scratch that - 2.0 behaves the same way > > > (mountpoint crossing is done in iget() there; is that Minix > > > influence > > > or straight from the Lions' book?) > > > > I don't think I ever had access to Lions' - I've _seen_ a printout > > of > > it later, and obviously maybe others did, > > > > More likely it's from Maurice Bach: the Design of the Unix > > Operating > > System. I'm pretty sure that's where a lot of the FS layer stuff > > came > > from. Certainly the bad old buffer head interfaces, and quite > > likely > > the iget() stuff too. > > > > > 0.10: forward traversal in iget(), back traversal in > > > fs/namei.c:find_entry() > > > > Whee, you _really_ went back in time. > > > > So I did too. > > > > And looking at that code in iget(), I doubt it came from anywhere. > > Christ. It's just looping over a fixed-size array, both when > > finding > > the inode, and finding the superblock. 
> > > > Cute, but unbelievably stupid. It was a more innocent time. > > > > In other words, I think you can chalk it up to just me, because > > blaming anybody else for that garbage would be very very unfair > > indeed > > ;) > > See > https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/iget.c > Exactly the same algorithm, complete with linear searches over those > fixed-sized array. > > <grabs Bach> Right, he simply transcribes v7 iget(). > > So I suspect that you are right - your variant of iget was pretty > much > one-to-one implementation of Bach's description of v7 iget. > > Your namei wasn't - Bach has 'if the entry points to root and you are > in the root and name is "..", find mount table entry (by device > number), > drop your directory inode, grab the inode of mountpount and restart > the search for ".." in there', which gives back traversals to > arbitrary > depth. And v7 namei() (as Bach mentions) uses iget() for starting > point > as well as for each component. You kept pointers instead, which is > where > the other difference has come from (no mount traversal at the > starting > point)... > > Actually, I've misread your code in 0.10 - it does unlimited forward > traversals; it's back traversals that go only one level. The forward > ones got limited to one level in 0.95, but then mount-over-root had > been banned all along. I'd read the pre-dcache variant of iget(), > seen it go pretty much all the way back to beginning and hadn't > sorted out the 0.12 -> 0.95 transition... > > > > How would your proposal deal with access("/proc/self/fd/42/foo", > > > MAY_READ) > > > vs. faccessat(42, "foo", MAY_READ)? > > > > I think that in a perfect world, the O_PATH'ness of '42' would be > > the > > deciding factor. Wouldn't those be the best and most consistent > > semantics? > > > > And then 'cwd'/'root' always have the O_PATH behavior. 
> > See above - unless I'm misparsing you, you wanted mount traversals in > the > starting point if it's ...at() with O_PATH fd. With O_PATH open() > not > doing them. > > For cwd and root the situation is opposite - we do NOT traverse > mounts > for those. And that's really too late to change. > > > > The latter would trigger automount, > > > the former would not... Or would you extend that to "traverse > > > mounts > > > upon following procfs links, if the file in question had been > > > opened with > > > O_PATH"? > > > > Exactly. > > > > But you know what? I do not believe this is all that important, and > > I > > doubt it will matter to anybody. > > FWIW, digging through the automount-related parts of that stuff has > caught several fun issues. One (and I'm rather embarrassed by it) > should've been caught back in commit 8aef18845266 (VFS: Fix vfsmount > overput on simultaneous automount). To quote the commit message: > The problem is that lock_mount() drops the caller's reference to > the > mountpoint's vfsmount in the case where it finds something > already mounted on > the mountpoint as it transits to the mounted filesystem and > replaces path->mnt > with the new mountpoint vfsmount. > > During a pathwalk, however, we don't take a reference on the > vfsmount if it is > the same as the one in the nameidata struct, but do_add_mount() > doesn't know > this. > At which point I should've gone "what the fuck?" - lock_mount() does, > indeed, > drop path->mnt in this situation and replaces it with the whatever's > come to > cover it. For mount(2) that's the right thing to do - we _want_ to > mount > on top of whatever we have at the mountpoint. For automounts we very > much > don't want that - it's either "mount right on top of the automount > trigger" > or discard whatever we'd been about to mount and walk into whatever's > got > mounted there (presumably the same thing triggered by another > process). 
> We kinda-sorta get that effect, but in a very convoluted way: > do_add_mount() > will refuse to mount something on top of itself - > /* Refuse the same filesystem on the same mount point */ > err = -EBUSY; > if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && > path->mnt->mnt_root == path->dentry) > goto unlock; > which will end up with -EBUSY returned (and recognized by > follow_automount()). > > First of all, that's unreliable. If somebody not only has triggered > that > automount, but managed to _mount_ something else on top (for example, > has triggered it by lookup of mountpoint-to-be in mount(2)), we'll > end > up not triggering that check. In which case we'll get something like > nfs referral point under nfs automounted there under tmpfs from > explicit > overmount under same nfs mount we'd automounted there - identical to > what's > been buried under tmpfs. It's hard to hit, but not impossibly so. > > What's more, the whole solution is a kludge - the root of problem is > that lock_mount() is the wrong thing to do in case of > finish_automount(). > We don't want to go into whatever's overmounting us there, both for > the reasons above *and* because it's a PITA for the caller. So the > right solution is > * lift lock_mount() call from do_add_mount() into its callers > (all 2 of them); while we are at it, lift unlock_mount() as well > (makes for simpler failure exits in do_add_mount()). > * replace the call of lock_mount() in finish_automount() > with variant that doesn't do "unlock, walk deeper and retry locking", > returning ERR_PTR(-EBUSY) in such case. > * get rid of the kludge introduced in that commit. Better > yet, don't bother with traversing into the covering mount in case > of success - let the caller of follow_automount() do that. 
Which > eliminates the need to pass need_mntput to the sucker and suggests > an even better solution - have this analogue of lock_mount() > return NULL instead of ERR_PTR(-EBUSY) and treat it in > finish_automount() > as "OK, discard what we wanted to mount and return 0". That gets > rid of the entire > err = finish_automount(mnt, path); > switch (err) { > case -EBUSY: > /* Someone else made a mount here whilst we were busy > */ > return 0; > case 0: > path_put(path); > path->mnt = mnt; > path->dentry = dget(mnt->mnt_root); > return 0; > default: > return err; > } > chunk in follow_automount() - it would just be > return finish_automount(mnt, path); > > Another thing (in the same area) is not a bug per se, but... > after the call of ->d_automount() we have this: > if (IS_ERR(mnt)) { > /* > * The filesystem is allowed to return -EISDIR here > to indicate > * it doesn't want to automount. For instance, > autofs would do > * this so that its userspace daemon can mount on > this dentry. > * > * However, we can only permit this if it's a > terminal point in > * the path being looked up; if it wasn't then the > remainder of > * the path is inaccessible and we should say so. > */ > if (PTR_ERR(mnt) == -EISDIR && (nd->flags & > LOOKUP_PARENT)) > return -EREMOTE; > return PTR_ERR(mnt); > } > Except that not a single instance of ->d_automount() has ever > returned > -EISDIR. Certainly not autofs one, despite the what the comment > says. > That chunk has come from dhowells, back when the whole mount trap > series > had been merged. After talking that thing over (fun: trying to > figure > out what had been intended nearly 9 years ago, when people involved > are > in UK, US east coast and AU west coast respectively. The only way it > could suck more would've been if I were on the west coast - then all > timezone deltas would be 8-hour ones)... looks like it's a rudiment > of plans that got superseded during the series development, nobody > quite remembers exact details. 
> Conclusion: it's not even dead, it's
> stillborn; bury it.

Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time
we get there it's not relevant any more, so that check looks
redundant. I'm not aware of any other fs automount implementation
that needs that EISDIR pass-thru function.

I didn't notice it at the time of the merge, sorry about that.

While we're at it, that:
	if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
		return -EREMOTE;

at the top of follow_automount() isn't going to be relevant
for autofs because ->d_automount() really must always be defined
for it.

But, at the time of the merge, I didn't object to it because
there were (are) other file systems that use the VFS automount
function which may accidentally not define the method.

>
> Unfortunately, there are other interesting questions related to
> autofs-specific bits (->d_manage()) and the timezone-related fun
> is, of course, still there. I hope to sort that out today or
> tomorrow, at least enough to do a reasonable set of backportable
> fixes to put in front of follow_managed()/step_into() queue.
> Oh, well...

Yeah, I know it slows you down but I kink-off like having a chance
to look at what's going on and think about your questions before trying
to answer them, rather than replying prematurely, as I usually do ...

It's been a bit of a busy day so far but I'm getting to look into
the questions you've asked.

Ian

^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-10 6:20 ` Ian Kent @ 2020-01-12 21:33 ` Al Viro 2020-01-13 2:59 ` Ian Kent 0 siblings, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-12 21:33 UTC (permalink / raw) To: Ian Kent Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List On Fri, Jan 10, 2020 at 02:20:55PM +0800, Ian Kent wrote: > Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time > we get there it's not relevant any more, so that check looks > redundant. I'm not aware of any other fs automount implementation > that needs that EISDIR pass-thru function. > > I didn't notice it at the time of the merge, sorry about that. > > While we're at it that: > if (!path->dentry->d_op || !path->dentry->d_op->d_automount) > return -EREMOTE; > > at the top of follow_automount() isn't going to be be relevant > for autofs because ->d_automount() really must always be defined > for it. > > But, at the time of the merge, I didn't object to it because > there were (are) other file systems that use the VFS automount > function which may accidentally not define the method. OK... > > Unfortunately, there are other interesting questions related to > > autofs-specific bits (->d_manage()) and the timezone-related fun > > is, of course, still there. I hope to sort that out today or > > tomorrow, at least enough to do a reasonable set of backportable > > fixes to put in front of follow_managed()/step_into() queue. > > Oh, well... > > Yeah, I know it slows you down but I kink-off like having a chance Nice typo, that ;-) > to look at what's going and think about your questions before trying > to answer them, rather than replying prematurely, as I usually do ... > > It's been a bit of a busy day so far but I'm getting to look into > the questions you've asked. 
Here's a bit more of those (I might've missed some of your replies on IRC; my apologies if that's the case): 1) AFAICS, -EISDIR from ->d_manage() actually means "don't even try ->d_automount() here". If its effect can be delayed until the decision to call ->d_automount(), the things seem to get simpler. Is it ever returned in situation when the sucker _is_ overmounted? 2) can autofs_d_automount() ever be called for a daemon? Looks like it shouldn't be... 3) is _anything_ besides root directory ever created in direct autofs superblocks by anyone? If not, why does autofs_lookup() even bother to do anything there? IOW, why not have it return ERR_PTR(-ENOENT) immediately for direct ones? Or am I missing something and it is, in fact, possible to have the daemon create something in those? 4) Symlinks look like they should qualify for parent being non-empty; at least autofs_d_manage() seems to think so (simple_empty() use). So shouldn't we remove the trap from its parent on symlink/restore on unlink if parent gets empty? For version 4 or earlier, that is. Or is it simply that daemon only creates symlinks in root directory? Anyway, intermediate state of the series is in #work.namei right now, and some _very_ interesting possibilities open up. It definitely needs more massage around __follow_mount_rcu() (as it is, the fastpath in there is still too twisted). Said that * call graph is less convoluted * follow_managed() calls are folded into step_into(). Interface: int step_into(nd, flags, dentry, inode, seq), with inode/seq used only if we are in RCU mode. * ".." still doesn't use that; it probably ought to. * lookup_fast() doesn't take path - nd, &inode, &seq and returns dentry * lookup_open() and fs/namei.c:atomic_open() get similar treatment - don't take path, return dentry. * calls of follow_managed()/step_into() combination returning 1 are always followed by get_link(), and very shortly, at that. 
So much that we can realistically merge pick_link() (in the end of
step_into()) with get_link(). That merge is NOT done in this branch yet.

The last one promises to get rid of a rather unpleasant group of calling
conventions. Right now we have several functions (step_into()/
walk_component()/lookup_last()/do_last()) with the following calling
conventions:
	-E... => error
	0 => non-symlink or symlink not followed; nd->path points to it
	1 => picked a symlink to follow; its mount/dentry/seq has been pushed
	     on nd->stack[]; its inode is stashed into nd->link_inode for
	     subsequent get_link() to pick. nd->path is left unchanged.

That way all of those become
	ERR_PTR(-E...) => error
	NULL => non-symlink, symlink not followed or a pure jump (bare "/"
		or procfs ones); nd->path points to where we end up
	string => symlink being followed; the sucker's pushed to stack,
		initial jump (if any) has been handled and the string
		returned is what we need to traverse.

IMO it's less arbitrary that way. More importantly, the separation between
step_into() committing to symlink traversal and (inevitably following)
get_link() is gone - it's one operation after that change. No
nd->link_inode either - it's only needed to carry the information from
pick_link() to the next get_link().

Loops turn into
	while (!(err = link_path_walk(nd, s)) &&
	       (s = lookup_last(nd)) != NULL)
		;
and
	while (!(err = link_path_walk(nd, s)) &&
	       (s = do_last(nd, file, op)) != NULL)
		;

trailing_symlink() goes away (folded into pick_link()/get_link() combo,
conditional upon nd->depth at the entry).
And in link_path_walk() we'll have
		if (unlikely(!*name)) {
			/* pathname body, done */
			if (!nd->depth)
				return 0;
			name = nd->stack[nd->depth - 1].name;
			/* trailing symlink, done */
			if (!name)
				return 0;
			/* last component of nested symlink */
			s = walk_component(nd, WALK_FOLLOW);
		} else {
			/* not the last component */
			s = walk_component(nd, WALK_FOLLOW | WALK_MORE);
		}
		if (s) {
			if (IS_ERR(s))
				return PTR_ERR(s);
			/* a symlink to follow */
			nd->stack[nd->depth - 1].name = name;
			name = s;
			continue;
		}

Anyway, before I try that one I'm going to fold path_openat2() into
that series - that step is definitely going to require some massage
there; it's too close to get_link() changes done in Aleksa's series.

If we do that, we get a single primitive for "here's the result of
lookup; traverse mounts and either move into the result or, if
it's a symlink that needs to be traversed, start the symlink
traversal - jump into the base position for it (if needed) and
return the pathname that needs to be handled". As it is, mainline
has that logic spread over about a dozen locations...

Diffstat at the moment:
 fs/autofs/dev-ioctl.c |   6 +-
 fs/internal.h         |   1 -
 fs/namei.c            | 460 ++++++++++++++------------------------------------
 fs/namespace.c        |  97 +++++++----
 fs/nfs/nfstrace.h     |   2 -
 fs/open.c             |   4 +-
 include/linux/namei.h |   3 +-
 7 files changed, 197 insertions(+), 376 deletions(-)

In the current form the sucker appears to work (so far - about 30%
into the usual xfstests run) without visible slowdowns...

^ permalink raw reply	[flat|nested] 92+ messages in thread
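The proposed three-way return convention (ERR_PTR(-E...) for an error, NULL when the walk has arrived, a string when there is a symlink body still to traverse) can be tried out in userspace. What follows is only a minimal sketch under stated assumptions, not code from #work.namei: ERR_PTR()/IS_ERR()/PTR_ERR() are re-implemented locally along the lines of the kernel helpers, and toy_fs, toy_nd, toy_step() and toy_walk() are hypothetical names standing in for the dentry tree, struct nameidata and the step_into()/lookup_last() family.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins for the kernel's ERR_PTR helpers (assumption: same idea,
 * the top 4095 pointer values are reserved for encoded errno values). */
#define MAX_ERRNO 4095
static inline const char *ERR_PTR(long err) { return (const char *)(intptr_t)err; }
static inline long PTR_ERR(const char *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const char *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical toy "filesystem": a name is either a plain file
 * (target == NULL) or a symlink whose body is another name. */
struct toy_entry {
	const char *name;
	const char *target;
};

static const struct toy_entry toy_fs[] = {
	{ "file",  NULL   },
	{ "link",  "file" },
	{ "link2", "link" },
	{ "loop",  "loop" },
};

#define TOY_MAX_DEPTH 8

/* Hypothetical stand-in for struct nameidata. */
struct toy_nd {
	const char *where;	/* where the walk ended up */
	int depth;		/* number of symlinks followed so far */
};

/*
 * One step, using the pointer-returning convention:
 *   ERR_PTR(-E...) => error
 *   NULL           => not a symlink; nd->where points to the result
 *   string         => symlink picked; the string is what to traverse next
 */
static const char *toy_step(struct toy_nd *nd, const char *name)
{
	for (size_t i = 0; i < sizeof(toy_fs) / sizeof(toy_fs[0]); i++) {
		if (strcmp(toy_fs[i].name, name) != 0)
			continue;
		if (!toy_fs[i].target) {
			nd->where = toy_fs[i].name;
			return NULL;
		}
		if (++nd->depth > TOY_MAX_DEPTH)
			return ERR_PTR(-ELOOP);
		return toy_fs[i].target;
	}
	return ERR_PTR(-ENOENT);
}

/* The caller's loop: keep stepping while a string comes back. */
int toy_walk(struct toy_nd *nd, const char *name)
{
	const char *s = name;

	nd->where = NULL;
	nd->depth = 0;
	while ((s = toy_step(nd, s)) != NULL) {
		if (IS_ERR(s))
			return (int)PTR_ERR(s);
	}
	return 0;
}
```

With this shape the caller never needs a side channel (such as nd->link_inode in the description above) to learn what to follow next; the return value itself carries either the error, the "done" signal, or the next pathname.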
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-12 21:33 ` Al Viro @ 2020-01-13 2:59 ` Ian Kent 2020-01-14 0:25 ` Ian Kent 0 siblings, 1 reply; 92+ messages in thread From: Ian Kent @ 2020-01-13 2:59 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List On Sun, 2020-01-12 at 21:33 +0000, Al Viro wrote: > On Fri, Jan 10, 2020 at 02:20:55PM +0800, Ian Kent wrote: > > > Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time > > we get there it's not relevant any more, so that check looks > > redundant. I'm not aware of any other fs automount implementation > > that needs that EISDIR pass-thru function. > > > > I didn't notice it at the time of the merge, sorry about that. > > > > While we're at it that: > > if (!path->dentry->d_op || !path->dentry->d_op->d_automount) > > return -EREMOTE; > > > > at the top of follow_automount() isn't going to be be relevant > > for autofs because ->d_automount() really must always be defined > > for it. > > > > But, at the time of the merge, I didn't object to it because > > there were (are) other file systems that use the VFS automount > > function which may accidentally not define the method. > > OK... > > > > Unfortunately, there are other interesting questions related to > > > autofs-specific bits (->d_manage()) and the timezone-related fun > > > is, of course, still there. I hope to sort that out today or > > > tomorrow, at least enough to do a reasonable set of backportable > > > fixes to put in front of follow_managed()/step_into() queue. > > > Oh, well... > > > > Yeah, I know it slows you down but I kink-off like having a chance > > Nice typo, that ;-) > > > to look at what's going and think about your questions before > > trying > > to answer them, rather than replying prematurely, as I usually do > > ... 
> >
> > It's been a bit of a busy day so far but I'm getting to look into
> > the questions you've asked.
>
> Here's a bit more of those (I might've missed some of your replies on
> IRC; my apologies if that's the case):
>
> 1) AFAICS, -EISDIR from ->d_manage() actually means "don't even try
> ->d_automount() here". If its effect can be delayed until the
> decision
> to call ->d_automount(), the things seem to get simpler. Is it ever
> returned in situation when the sucker _is_ overmounted?

In theory it shouldn't need to be returned when there is an actual
mount there.

If there is a real mount at this point that should be enough to
prevent walks into that mount until its mount is complete. The
whole idea of -EISDIR is to prevent processes from walking into
a directory tree that "doesn't have a real mount at its base"
(the so-called multi-mount map construct).

>
> 2) can autofs_d_automount() ever be called for a daemon? Looks like
> it
> shouldn't be...

Can't do that, it will lead to deadlock very quickly.

>
> 3) is _anything_ besides root directory ever created in direct autofs
> superblocks by anyone? If not, why does autofs_lookup() even bother
> to
> do anything there? IOW, why not have it return ERR_PTR(-ENOENT)
> immediately
> for direct ones? Or am I missing something and it is, in fact,
> possible
> to have the daemon create something in those?

Short answer is no, longer answer is directories "shouldn't" ever
be created inside direct mount points.

The thing is that the multi-mount map construct can be used with
direct mounts too, but they must always have a real mount at the
base because they are direct mounts. So processes should not be
able to walk into them while they are being mounted (constructed).

But I'm pretty sure it's rare (maybe not done at all) that this
map construct is used with direct mounts.

>
> 4) Symlinks look like they should qualify for parent being non-empty;
> at least autofs_d_manage() seems to think so (simple_empty() use).
> So shouldn't we remove the trap from its parent on symlink/restore on
> unlink if parent gets empty? For version 4 or earlier, that is. Or
> is
> it simply that daemon only creates symlinks in root directory?

Yes, they have to be empty.

If a symlink is to be used (based on autofs config or map option)
and the "browse" option is used for the indirect mount (browse only
makes sense for indirect autofs managed mounts) then the mount point
directory has to be removed and a symlink created, so it must be
empty for this to make sense.

If it's a "nobrowse" autofs mount then nothing should already exist,
it just gets created.

The catch is that a map entry for which a symlink is to be used
instead of a mount can't be a multi-mount. I'm pretty sure I don't
have sufficient error checking for that in the daemon but I also
haven't had reports of problems with it either.

For a very long time the use of symlinks was not common but when
the amd format map parser was added it made sense to use symlinks
in some cases for those. That was partly to reduce the number of
mounts needed and because I deliberately don't support amd map
entries that provide the multi-mount construct. The way amd did
this looked ugly to me, very much a hack to add a Sun format mount
feature.

As far as keeping the trap flags up to date, I don't.

It seemed so much simpler to just leave the flags in place but,
at that time, symlinks were not used (although it was possible
to do so). Now that that's changed, fiddling with the flags might
make sense.

As I said on IRC:
"DCACHE_NEED_AUTOMOUNT is set on symlink dentries because, when
->lookup() is called the dentry may trigger a callback to the
daemon that will either create a directory (since, in this case,
one does not already exist) and attempt to mount on it or create
a symlink if the autofs config/map requires it.

I didn't think there would be potential simplification by setting
and clearing the DCACHE_NEED_AUTOMOUNT flag based on it being a
directory (mountpoint) or a symlink so the flag is always left set.

Although, as you point out, symlinks won't actually trigger mounts
so the flag being left set when the dentry is a symlink is due to
laziness, since there's nothing to gain.

If you can see potential simplification in the VFS code by managing
this flag better then that would be worthwhile."

>
>
> Anyway, intermediate state of the series is in #work.namei right now,
> and some _very_ interesting possibilities open up. It definitely
> needs more massage around __follow_mount_rcu() (as it is, the
> fastpath in there is still too twisted). Said that
> 	* call graph is less convoluted
> 	* follow_managed() calls are folded into
> step_into(). Interface:
> int step_into(nd, flags, dentry, inode, seq), with inode/seq used
> only
> if we are in RCU mode.
> 	* ".." still doesn't use that; it probably ought to.
> 	* lookup_fast() doesn't take path - nd, &inode, &seq and
> returns dentry
> 	* lookup_open() and fs/namei.c:atomic_open() get similar
> treatment
> - don't take path, return dentry.
> 	* calls of follow_managed()/step_into() combination returning 1
> are always followed by get_link(), and very shortly, at that. So
> much
> that we can realistically merge pick_link() (in the end of
> step_into()) with get_link(). That merge is NOT done in this branch
> yet.
>
> The last one promises to get rid of a rather unpleasant group of
> calling
> conventions. Right now we have several functions (step_into()/
> walk_component()/lookup_last()/do_last()) with the following calling
> conventions:
> 	-E... => error
> 	0 => non-symlink or symlink not followed; nd->path points to it
> 	1 => picked a symlink to follow; its mount/dentry/seq has been
> pushed on nd->stack[]; its inode is stashed into nd->link_inode for
> subsequent get_link() to pick. nd->path is left unchanged.
> > That way all of those become > ERR_PTR(-E...) => error > NULL => non-symlink, symlink not followed or a > pure > jump (bare "/" or procfs ones); nd->path points to where we end up > string => symlink being followed; the sucker's > pushed > to stack, initial jump (if any) has been handled and the string > returned > is what we need to traverse. > > IMO it's less arbitrary that way. More importantly, the separation > between > step_into() committing to symlink traversal and (inevitably > following) > get_link() is gone - it's one operation after that change. No nd- > >link_inode > either - it's only needed to carry the information from pick_link() > to the > next get_link(). > > Loops turn into > while (!(err = link_path_walk(nd, s)) && > (s = lookup_last(nd)) != NULL) > ; > and > while (!(err = link_path_walk(nd, s)) && > (s = do_last(nd, file, op)) != NULL) > ; > > trailing_symlink() goes away (folded into pick_link()/get_link() > combo, > conditional upon nd->depth at the entry). And in link_path_walk() > we'll > have > if (unlikely(!*name)) { > /* pathname body, done */ > if (!nd->depth) > return 0; > name = nd->stack[nd->depth - 1].name; > /* trailing symlink, done */ > if (!name) > return 0; > /* last component of nested symlink */ > s = walk_component(nd, WALK_FOLLOW); > } else { > /* not the last component */ > s = walk_component(nd, WALK_FOLLOW | > WALK_MORE); > } > if (s) { > if (IS_ERR(s)) > return PTR_ERR(s); > /* a symlink to follow */ > nd->stack[nd->depth - 1].name = name; > name = s; > continue; > } > > Anyway, before I try that one I'm going to fold path_openat2() into > that series - that step is definitely going to require some massage > there; it's too close to get_link() changes done in Aleksa's series. 
> > If we do that, we get a single primitive for "here's the result of > lookup; traverse mounts and either move into the result or, if > it's a symlink that needs to be traversed, start the symlink > traversal - jump into the base position for it (if needed) and > return the pathname that needs to be handled". As it is, mainline > has that logics spread over about a dozen locations... > > Diffstat at the moment: > fs/autofs/dev-ioctl.c | 6 +- > fs/internal.h | 1 - > fs/namei.c | 460 ++++++++++++++------------------------ > ------------ > fs/namespace.c | 97 +++++++---- > fs/nfs/nfstrace.h | 2 - > fs/open.c | 4 +- > include/linux/namei.h | 3 +- > 7 files changed, 197 insertions(+), 376 deletions(-) > > In the current form the sucker appears to work (so far - about 30% > into the usual xfstests run) without visible slowdowns... Ok, I'll have a look at that branch, ;) Ian ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-13 2:59 ` Ian Kent @ 2020-01-14 0:25 ` Ian Kent 2020-01-14 4:39 ` Al Viro 0 siblings, 1 reply; 92+ messages in thread From: Ian Kent @ 2020-01-14 0:25 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List On Mon, 2020-01-13 at 10:59 +0800, Ian Kent wrote: > > > 3) is _anything_ besides root directory ever created in direct > > autofs > > superblocks by anyone? If not, why does autofs_lookup() even > > bother > > to > > do anything there? IOW, why not have it return ERR_PTR(-ENOENT) > > immediately > > for direct ones? Or am I missing something and it is, in fact, > > possible > > to have the daemon create something in those? > > Short answer is no, longer answer is directories "shouldn't" ever > be created inside direct mount points. > > The thing is that the multi-mount map construct can be used with > direct mounts too, but they must always have a real mount at the > base because they are direct mounts. So processes should not be > able to walk into them while they are being mounted (constructed). > > But I'm pretty sure it's rare (maybe not done at all) that this > map construct is used with direct mounts. This isn't right. There's actually nothing stopping a user from using a direct map entry that's a multi-mount without an actual mount at its root. So there could be directories created under these, it's just not usually done. I'm pretty sure I don't check and disallow this. Ian ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-14 0:25 ` Ian Kent @ 2020-01-14 4:39 ` Al Viro 2020-01-14 5:01 ` Ian Kent 0 siblings, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-14 4:39 UTC (permalink / raw) To: Ian Kent Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote: > This isn't right. > > There's actually nothing stopping a user from using a direct map > entry that's a multi-mount without an actual mount at its root. > So there could be directories created under these, it's just not > usually done. > > I'm pretty sure I don't check and disallow this. IDGI... How the hell will that work in v5? Who will set _any_ traps outside the one in root in that scenario? autofs_lookup() won't (there it's conditional upon indirect mount). Neither will autofs_dir_mkdir() (conditional upon version being less than 5). Who will, then? Confused... ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-14 4:39 ` Al Viro @ 2020-01-14 5:01 ` Ian Kent 2020-01-14 5:59 ` Ian Kent 0 siblings, 1 reply; 92+ messages in thread From: Ian Kent @ 2020-01-14 5:01 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List On Tue, 2020-01-14 at 04:39 +0000, Al Viro wrote: > On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote: > > > This isn't right. > > > > There's actually nothing stopping a user from using a direct map > > entry that's a multi-mount without an actual mount at its root. > > So there could be directories created under these, it's just not > > usually done. > > > > I'm pretty sure I don't check and disallow this. > > IDGI... How the hell will that work in v5? Who will set _any_ > traps outside the one in root in that scenario? autofs_lookup() > won't (there it's conditional upon indirect mount). Neither > will autofs_dir_mkdir() (conditional upon version being less > than 5). Who will, then? > > Confused... It's easy to miss. For autofs type direct and offset mounts the flags are set at fill super time. They have to be set then because they are direct mounts and offset mounts behave the same as direct mounts so they need to be set then too. So, like direct mounts, offset mounts are each distinct autofs (trigger) mounts. I could check for this construct and refuse it if that's really needed. I'm pretty sure this map construct isn't much used by people using direct mounts. Ian ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-14 5:01 ` Ian Kent
@ 2020-01-14 5:59 ` Ian Kent
  0 siblings, 0 replies; 92+ messages in thread
From: Ian Kent @ 2020-01-14 5:59 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API,
	linux-fsdevel, Linux Kernel Mailing List

On Tue, 2020-01-14 at 13:01 +0800, Ian Kent wrote:
> On Tue, 2020-01-14 at 04:39 +0000, Al Viro wrote:
> > On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:
> >
> > > This isn't right.
> > >
> > > There's actually nothing stopping a user from using a direct map
> > > entry that's a multi-mount without an actual mount at its root.
> > > So there could be directories created under these, it's just not
> > > usually done.
> > >
> > > I'm pretty sure I don't check and disallow this.
> >
> > IDGI... How the hell will that work in v5? Who will set _any_
> > traps outside the one in root in that scenario? autofs_lookup()
> > won't (there it's conditional upon indirect mount). Neither
> > will autofs_dir_mkdir() (conditional upon version being less
> > than 5). Who will, then?
> >
> > Confused...
>
> It's easy to miss.
>
> For autofs type direct and offset mounts the flags are set at fill
> super time.
>
> They have to be set then because they are direct mounts and offset
> mounts behave the same as direct mounts so they need to be set then
> too. So, like direct mounts, offset mounts are each distinct autofs
> (trigger) mounts.
>
> I could check for this construct and refuse it if that's really
> needed. I'm pretty sure this map construct isn't much used by
> people using direct mounts.

Ok, once again I'm not exactly accurate in some of what I said.

It turns out that the autofs connectathon tests, one of the test
suites that I use, do test direct mounts with offsets both with
and without a real mount at the base of the mount.
Based on that, I have to say this map construct is meant to be supported with Sun format maps of autofs (even though I think it's probably not used much). So not allowing it is probably the wrong thing to do. OTOH initial testing with the #work.namei branch shows these are functioning as required. Ian ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-08 3:54 ` Linus Torvalds 2020-01-08 21:34 ` Al Viro @ 2020-01-10 21:07 ` Aleksa Sarai 2020-01-14 4:57 ` Al Viro 1 sibling, 1 reply; 92+ messages in thread From: Aleksa Sarai @ 2020-01-10 21:07 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent [-- Attachment #1: Type: text/plain, Size: 2739 bytes --] On 2020-01-07, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jan 7, 2020 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > Another interesting question is whether we want O_PATH open > > to trigger automounts. > > It does sound like they shouldn't, but as you say: > > > The thing is, we do *NOT* trigger them > > (or traverse mountpoints) at the starting point of lookups. > > I believe it's a mistake (and mine, at that), but I doubt that > > there's anything that can be done about it at that point. > > It's a user-visible behaviour [..] > > Hmm. I wonder how set in stone that is. We may have two decades of > history of not doing it at start point of lookups, but we do *not* > have two decades of history of O_PATH. > > So what I think we agree would be sane behavior would be for O_PATH > opens to not trigger automounts (unless there's a slash at the end, > whatever), but _do_ add the mount-point traversal to the beginning of > lookups. > > But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case. > > That way we maintain original behavior: if somebody overmounts your > cwd, you still see the pre-mount directory on lookups, because your > cwd is "under" the mount. > > But if you open a file with O_PATH, and somebody does a mount > _afterwards_, the openat() will see that later mount and/or do the > automount. 
> > Don't you think that would be the more sane/obvious semantics of how > O_PATH should work? If I'm understanding this proposal correctly, this would be a problem for the libpathrs use-case -- if this is done then there's no way to avoid a TOCTOU with someone mounting and the userspace program checking whether something is a mountpoint (unless you have Linux >5.6 and RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE: 1. Open the candidate directory. 2. umount2(MNT_EXPIRE) the fd. * -EINVAL means it wasn't a mountpoint when we got the fd, and the fd is a stable handle to the underlying directory. * -EAGAIN or -EBUSY means that it was a mountpoint or became a mountpoint after the fd was opened (we don't care about that, but fail-safe is better here). 3. Use the fd from (1) for all operations. Don't get me wrong, I want to fix this issue *properly* by adding some new kernel features that allow us to avoid worrying about mounts-over-magiclinks -- but on old kernels (which libpathrs cares about) I would be worried about changes like this being backported resulting in it being not possible to implement the hardening I mentioned up-thread. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 92+ messages in thread
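The three steps listed above can be sketched in C. This is only a minimal sketch under stated assumptions, not libpathrs code: umount2(2) takes a path rather than a descriptor, so the sketch assumes the fd is handed over as /proc/self/fd/<n>, and classify_expire_errno() and check_fd_mountpoint() are hypothetical names. The caller needs CAP_SYS_ADMIN (otherwise the call fails with EPERM), and the probe is not free of side effects on a real mount: a first MNT_EXPIRE call on an unused mount marks it expired, and a second one unmounts it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>

enum mnt_check {
	NOT_MOUNTPOINT,		/* fd is a stable handle to the underlying directory */
	MAYBE_MOUNTPOINT,	/* was, or has become, a mountpoint: fail safe */
	CHECK_FAILED,		/* anything else, e.g. EPERM without CAP_SYS_ADMIN */
};

/* Step 2, decision part: map the errno of a failed
 * umount2(..., MNT_EXPIRE) onto the outcomes listed in the mail. */
enum mnt_check classify_expire_errno(int err)
{
	switch (err) {
	case EINVAL:
		return NOT_MOUNTPOINT;
	case EAGAIN:
	case EBUSY:
		return MAYBE_MOUNTPOINT;
	default:
		return CHECK_FAILED;
	}
}

/* Step 2, syscall part. Assumption: the already-opened fd from step 1
 * is referenced through its /proc/self/fd/<n> entry. */
enum mnt_check check_fd_mountpoint(int fd)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	if (umount2(path, MNT_EXPIRE) == 0) {
		/* A previously expired mount has just been unmounted;
		 * there was a mount here, so treat it as the unsafe case. */
		return MAYBE_MOUNTPOINT;
	}
	return classify_expire_errno(errno);
}
```

Only a NOT_MOUNTPOINT result would let the caller go on to step 3 and use the fd for all further operations; every other outcome is treated as a refusal.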
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-10 21:07 ` Aleksa Sarai
@ 2020-01-14 4:57 ` Al Viro
  2020-01-14 5:12 ` Al Viro
  ` (2 more replies)
  0 siblings, 3 replies; 92+ messages in thread
From: Al Viro @ 2020-01-14 4:57 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:

> If I'm understanding this proposal correctly, this would be a problem
> for the libpathrs use-case -- if this is done then there's no way to
> avoid a TOCTOU with someone mounting and the userspace program checking
> whether something is a mountpoint (unless you have Linux >5.6 and
> RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
>
>   1. Open the candidate directory.
>   2. umount2(MNT_EXPIRE) the fd.
>     * -EINVAL means it wasn't a mountpoint when we got the fd, and the
>       fd is a stable handle to the underlying directory.
>     * -EAGAIN or -EBUSY means that it was a mountpoint or became a
>       mountpoint after the fd was opened (we don't care about that, but
>       fail-safe is better here).
>   3. Use the fd from (1) for all operations.

... except that foo/../bar *WILL* cross into the covering mount, on any
kernel that supports ...at(2) at all, so I would be very cautious about
any kind of "hardening" claims in that case.

I'm not sure about Linus' proposal - it looks rather convoluted and we
get a hard to describe twist of semantics in an area (procfs symlinks
vs. mount traversal) on top of everything else in there...

Anyway, a couple of questions:
1) do you see any problems on your testcases with the current #fixes?
That's commit 7a955b7363b8 as branch tip.
2) do you have any updates you would like to fold into stuff in
#work.openat2?
Right now I have a local variant of #work.namei (with
fairly cosmetic changes compared to vfs.git one) that merges cleanly
with #work.openat2; I would like to do any updates/fold-ins/etc.
of #work.openat2 *before* doing a merge and continuing to work on
top of the merge results...

^ permalink raw reply	[flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-14 4:57 ` Al Viro @ 2020-01-14 5:12 ` Al Viro 2020-01-14 20:01 ` Aleksa Sarai 2020-01-15 13:57 ` [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Aleksa Sarai 2 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-14 5:12 UTC (permalink / raw) To: Aleksa Sarai Cc: Linus Torvalds, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent On Tue, Jan 14, 2020 at 04:57:33AM +0000, Al Viro wrote: > On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote: > > > If I'm understanding this proposal correctly, this would be a problem > > for the libpathrs use-case -- if this is done then there's no way to > > avoid a TOCTOU with someone mounting and the userspace program checking > > whether something is a mountpoint (unless you have Linux >5.6 and > > RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE: > > > > 1. Open the candidate directory. > > 2. umount2(MNT_EXPIRE) the fd. > > * -EINVAL means it wasn't a mountpoint when we got the fd, and the > > fd is a stable handle to the underlying directory. > > * -EAGAIN or -EBUSY means that it was a mountpoint or became a > > mountpoint after the fd was opened (we don't care about that, but > > fail-safe is better here). > > 3. Use the fd from (1) for all operations. > > ... except that foo/../bar *WILL* cross into the covering mount, on any > kernel that supports ...at(2) at all, so I would be very cautious about > any kind "hardening" claims in that case. > > I'm not sure about Linus' proposal - it looks rather convoluted and we > get a hard to describe twist of semantics in an area (procfs symlinks > vs. mount traversal) on top of everything else in there... PS: one thing that might be interesting is exposing LOOKUP_DOWN via AT_... 
flag - it would allow requesting mount traversals at the starting point
explicitly. Pretty much all code needed for that is already there;
all it would take is checking the flag in path_openat() and path_parentat()
and having handle_lookup_down() called there, same as in path_lookupat().

A tricky question is whether such a flag should affect absolute symlinks -
i.e.

	chdir /foo
	ln -s /bar barf
	overmount /
	do lookup with that flag for /bar/splat
	do lookup with that flag for barf/splat

Do we want the same results in both calls? The first one would traverse
mounts on / and walk into /bar/splat in overmounting; the second - see
no mounts whatsoever on current directory (/foo in old root), see the
symlink to "/bar", jump to process' root and proceed from there, first
for "bar", then "splat" in it...

^ permalink raw reply	[flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-14 4:57 ` Al Viro 2020-01-14 5:12 ` Al Viro @ 2020-01-14 20:01 ` Aleksa Sarai 2020-01-15 14:25 ` Al Viro 2020-01-15 13:57 ` [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Aleksa Sarai 2 siblings, 1 reply; 92+ messages in thread From: Aleksa Sarai @ 2020-01-14 20:01 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent [-- Attachment #1: Type: text/plain, Size: 2839 bytes --] On 2020-01-14, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote: > > > If I'm understanding this proposal correctly, this would be a problem > > for the libpathrs use-case -- if this is done then there's no way to > > avoid a TOCTOU with someone mounting and the userspace program checking > > whether something is a mountpoint (unless you have Linux >5.6 and > > RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE: > > > > 1. Open the candidate directory. > > 2. umount2(MNT_EXPIRE) the fd. > > * -EINVAL means it wasn't a mountpoint when we got the fd, and the > > fd is a stable handle to the underlying directory. > > * -EAGAIN or -EBUSY means that it was a mountpoint or became a > > mountpoint after the fd was opened (we don't care about that, but > > fail-safe is better here). > > 3. Use the fd from (1) for all operations. > > ... except that foo/../bar *WILL* cross into the covering mount, on any > kernel that supports ...at(2) at all, so I would be very cautious about > any kind "hardening" claims in that case. In the use-case I have, we would have full control over what the path being opened is (and thus you wouldn't open "foo/../bar"). But I agree that generally the MNT_EXPIRE solution is really non-ideal anyway. 
Not to mention that we're still screwed when it comes to using magic-links (because if someone bind-mounts a magic-link over a magic-link there's absolutely no race-free way to be sure that we're traversing the right magic-link -- for that we'll need to have a different solution). > I'm not sure about Linus' proposal - it looks rather convoluted and we > get a hard to describe twist of semantics in an area (procfs symlinks > vs. mount traversal) on top of everything else in there... Yeah, I agree. > 1) do you see any problems on your testcases with the current #fixes? > That's commit 7a955b7363b8 as branch tip. I will take a quick look later today, but I'm currently at a conference. > 2) do you have any updates you would like to fold into stuff in > #work.openat2? Right now I have a local variant of #work.namei (with > fairly cosmetical change compared to vfs.git one) that merges clean > with #work.openat2; I would like to do any updates/fold-ins/etc. > of #work.openat2 *before* doing a merge and continuing to work on > top of the merge results... Yes, there were two patches I sent a while ago[1]. I can re-send them if you like. The second patch switches open_how->mode to a u64, but I'm still on the fence about whether that makes sense to do... [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/ -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-14 20:01 ` Aleksa Sarai
@ 2020-01-15 14:25 ` Al Viro
  2020-01-15 14:29 ` Aleksa Sarai
  0 siblings, 1 reply; 92+ messages in thread
From: Al Viro @ 2020-01-15 14:25 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:

> Yes, there were two patches I sent a while ago[1]. I can re-send them if
> you like. The second patch switches open_how->mode to a u64, but I'm
> still on the fence about whether that makes sense to do...

IMO plain __u64 is better than games with __aligned_u64 - all sizes are
fixed, so...

> [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/

Do you want that series folded into "open: introduce openat2(2) syscall"
and "selftests: add openat2(2) selftests" or would you rather have them
appended at the end of the series? Personally I'd go for "fold them in"
if it had been about my code, but it's really up to you.

^ permalink raw reply	[flat|nested] 92+ messages in thread
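The remark about plain __u64 can be checked with a small layout sketch. Assumptions, stated up front: struct how_sketch is a hypothetical stand-in and not the struct open_how from the series, uint64_t stands in for __u64, and only the field name mode comes from the thread (flags and resolve are assumed names). The idea is that when every member is a 64-bit integer, member offsets and total size come out the same whether the ABI aligns 64-bit integers to 4 or to 8 bytes, which is the difference __aligned_u64 exists to paper over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument struct made only of 64-bit members. */
struct how_sketch {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* No padding can appear between or after the members: each one ends on
 * a multiple of 8, which satisfies both 4-byte and 8-byte alignment. */
_Static_assert(offsetof(struct how_sketch, flags) == 0, "flags at 0");
_Static_assert(offsetof(struct how_sketch, mode) == 8, "mode at 8");
_Static_assert(offsetof(struct how_sketch, resolve) == 16, "resolve at 16");
_Static_assert(sizeof(struct how_sketch) == 24, "three u64s, no padding");
```

A struct that mixed a 16-bit mode with 64-bit neighbours would not have this property by itself, since the padding in front of the next 64-bit member would then depend on the ABI's alignment for 64-bit integers.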
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-15 14:25 ` Al Viro @ 2020-01-15 14:29 ` Aleksa Sarai 2020-01-15 14:34 ` Aleksa Sarai 0 siblings, 1 reply; 92+ messages in thread From: Aleksa Sarai @ 2020-01-15 14:29 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent [-- Attachment #1: Type: text/plain, Size: 1028 bytes --] On 2020-01-15, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote: > > > Yes, there were two patches I sent a while ago[1]. I can re-send them if > > you like. The second patch switches open_how->mode to a u64, but I'm > > still on the fence about whether that makes sense to do... > > IMO plain __u64 is better than games with __aligned_u64 - all sizes are > fixed, so... > > > [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/ > > Do you want that series folded into "open: introduce openat2(2) syscall" > and "selftests: add openat2(2) selftests" or would you rather have them > appended at the end of the series. Personally I'd go for "fold them in" > if it had been about my code, but it's really up to you. "fold them in" would probably be better to avoid making the mainline history confusing afterwards. Thanks. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-15 14:29 ` Aleksa Sarai @ 2020-01-15 14:34 ` Aleksa Sarai 2020-01-15 14:48 ` Al Viro 0 siblings, 1 reply; 92+ messages in thread From: Aleksa Sarai @ 2020-01-15 14:34 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent [-- Attachment #1: Type: text/plain, Size: 1205 bytes --] On 2020-01-16, Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2020-01-15, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote: > > > > > Yes, there were two patches I sent a while ago[1]. I can re-send them if > > > you like. The second patch switches open_how->mode to a u64, but I'm > > > still on the fence about whether that makes sense to do... > > > > IMO plain __u64 is better than games with __aligned_u64 - all sizes are > > fixed, so... > > > > > [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/ > > > > Do you want that series folded into "open: introduce openat2(2) syscall" > > and "selftests: add openat2(2) selftests" or would you rather have them > > appended at the end of the series. Personally I'd go for "fold them in" > > if it had been about my code, but it's really up to you. > > "fold them in" would probably be better to avoid making the mainline > history confusing afterwards. Thanks. Also (if you prefer) I can send a v3 which uses u64s rather than aligned_u64s. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-15 14:34 ` Aleksa Sarai @ 2020-01-15 14:48 ` Al Viro 2020-01-18 12:07 ` [PATCH v3 0/2] openat2: minor uapi cleanups Aleksa Sarai 0 siblings, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-15 14:48 UTC (permalink / raw) To: Aleksa Sarai Cc: Linus Torvalds, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent On Thu, Jan 16, 2020 at 01:34:59AM +1100, Aleksa Sarai wrote: > On 2020-01-16, Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2020-01-15, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote: > > > > > > > Yes, there were two patches I sent a while ago[1]. I can re-send them if > > > > you like. The second patch switches open_how->mode to a u64, but I'm > > > > still on the fence about whether that makes sense to do... > > > > > > IMO plain __u64 is better than games with __aligned_u64 - all sizes are > > > fixed, so... > > > > > > > [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/ > > > > > > Do you want that series folded into "open: introduce openat2(2) syscall" > > > and "selftests: add openat2(2) selftests" or would you rather have them > > > appended at the end of the series. Personally I'd go for "fold them in" > > > if it had been about my code, but it's really up to you. > > > > "fold them in" would probably be better to avoid making the mainline > > history confusing afterwards. Thanks. > > Also (if you prefer) I can send a v3 which uses u64s rather than > aligned_u64s. <mode "lazy bastard"> Could you fold and resend the results of folding (i.e. replacements for two commits in question)? </mode> The hard part is, of course, in updating commit messages ;-) ^ permalink raw reply [flat|nested] 92+ messages in thread
* [PATCH v3 0/2] openat2: minor uapi cleanups 2020-01-15 14:48 ` Al Viro @ 2020-01-18 12:07 ` Aleksa Sarai 2020-01-18 12:07 ` [PATCH v3 1/2] open: introduce openat2(2) syscall Aleksa Sarai ` (2 more replies) 0 siblings, 3 replies; 92+ messages in thread
From: Aleksa Sarai @ 2020-01-18 12:07 UTC (permalink / raw)
To: Alexander Viro, Jeff Layton, J. Bruce Fields, Shuah Khan
Cc: Aleksa Sarai, Florian Weimer, David Laight, Christian Brauner, quae, dev, containers, libc-alpha, linux-api, linux-fsdevel, linux-kernel, linux-kselftest

Patch changelog:
 v3:
  * Merge changes into the original patches to make Al's life easier. [Al Viro]
 v2:
  * Add include <linux/types.h> to openat2.h. [Florian Weimer]
  * Move OPEN_HOW_SIZE_* constants out of UAPI. [Florian Weimer]
  * Switch from __aligned_u64 to __u64 since the explicit alignment isn't necessary. [David Laight]
 v1: <https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/>

While openat2(2) is still not yet in Linus's tree, we can take this opportunity to iron out some small warts that weren't noticed earlier:

 * A fix was suggested by Florian Weimer, to separate the openat2 definitions so glibc can use the header directly. I've put the maintainership under VFS but let me know if you'd prefer it belong to the fcntl folks.

 * Having heterogeneous field sizes in an extensible struct results in "padding hole" problems when adding new fields (in addition, the correct error to use for non-zero padding isn't entirely clear). The simplest solution is to just copy clone3(2)'s model -- always use u64s. It will waste a little more space in the struct, but it removes a possible future headache.

This series is intended to replace the corresponding patches in Al's #work.openat2 tree (and *will not* apply on Linus' tree).

@Al: I will send some additional patches later, but they will require proper design review since they're ABI-related features (namely, adding a way to check what features a syscall supports as I outlined in my talk here[1]).
[1]: https://youtu.be/ggD-eb3yPVs Aleksa Sarai (2): open: introduce openat2(2) syscall selftests: add openat2(2) selftests CREDITS | 4 +- MAINTAINERS | 1 + arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/open.c | 147 +++-- include/linux/fcntl.h | 16 +- include/linux/syscalls.h | 3 + include/uapi/asm-generic/unistd.h | 5 +- include/uapi/linux/fcntl.h | 2 +- include/uapi/linux/openat2.h | 39 ++ tools/testing/selftests/Makefile | 1 + tools/testing/selftests/openat2/.gitignore | 1 + tools/testing/selftests/openat2/Makefile | 8 + tools/testing/selftests/openat2/helpers.c | 109 ++++ tools/testing/selftests/openat2/helpers.h | 106 ++++ .../testing/selftests/openat2/openat2_test.c | 312 +++++++++++ .../selftests/openat2/rename_attack_test.c | 160 ++++++ .../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++ 34 files changed, 1418 insertions(+), 39 deletions(-) create mode 100644 include/uapi/linux/openat2.h create mode 100644 tools/testing/selftests/openat2/.gitignore create mode 100644 tools/testing/selftests/openat2/Makefile create mode 100644 tools/testing/selftests/openat2/helpers.c create mode 100644 tools/testing/selftests/openat2/helpers.h create mode 100644 tools/testing/selftests/openat2/openat2_test.c create 
mode 100644 tools/testing/selftests/openat2/rename_attack_test.c create mode 100644 tools/testing/selftests/openat2/resolve_test.c -- 2.24.1
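To make the "padding hole" point from the cover letter concrete, here is a minimal sketch comparing a mixed-width layout with an all-u64 one. The struct and function names (demo_how_mixed, demo_how_u64, demo_mixed_hole) are made up for illustration and are not part of the series; only the field names and the 24-byte size come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixed-width layout: mode kept as a 16-bit field. */
struct demo_how_mixed {
	uint64_t flags;
	uint16_t mode;
	uint64_t resolve;
};

/* Layout in the style of the series: every field is a 64-bit integer. */
struct demo_how_u64 {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* With only u64 fields the size and offsets do not depend on the ABI. */
static_assert(sizeof(struct demo_how_u64) == 24, "three u64 fields");
static_assert(offsetof(struct demo_how_u64, resolve) == 16, "no hole");

/*
 * Number of compiler-inserted padding bytes between mode and resolve in
 * the mixed layout. The value depends on the ABI's alignment of uint64_t.
 */
static size_t demo_mixed_hole(void)
{
	return offsetof(struct demo_how_mixed, resolve)
	     - offsetof(struct demo_how_mixed, mode)
	     - sizeof(uint16_t);
}
```

In the mixed layout the hole is not nameable by userspace, so it is unclear who zeroes it and what the kernel should do if it is non-zero; with all-u64 fields there is simply no hole to argue about.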
* [PATCH v3 1/2] open: introduce openat2(2) syscall 2020-01-18 12:07 ` [PATCH v3 0/2] openat2: minor uapi cleanups Aleksa Sarai @ 2020-01-18 12:07 ` Aleksa Sarai 2020-01-18 12:08 ` [PATCH v3 2/2] selftests: add openat2(2) selftests Aleksa Sarai 2020-01-18 15:28 ` [PATCH v3 0/2] openat2: minor uapi cleanups Al Viro 2 siblings, 0 replies; 92+ messages in thread From: Aleksa Sarai @ 2020-01-18 12:07 UTC (permalink / raw) To: Alexander Viro, Jeff Layton, J. Bruce Fields, Shuah Khan Cc: Aleksa Sarai, Christian Brauner, Florian Weimer, David Laight, quae, dev, containers, libc-alpha, linux-api, linux-fsdevel, linux-kernel, linux-kselftest /* Background. */ For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Userspace also has a hard time figuring out whether a particular flag is supported on a particular kernel. While it is now possible with contemporary kernels (thanks to [3]), older kernels will expose unknown flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during openat(2) time matches modern syscall designs and is far more fool-proof. In addition, the newly-added path resolution restriction LOOKUP flags (which we would like to expose to user-space) don't feel related to the pre-existing O_* flag set -- they affect all components of path lookup. We'd therefore like to add a new flag argument. 
Adding a new syscall allows us to finally fix the flag-ignoring problem, and we can make it extensible enough so that we will hopefully never need an openat3(2). /* Syscall Prototype. */ /* * open_how is an extensible structure (similar in interface to * clone3(2) or sched_setattr(2)). The size parameter must be set to * sizeof(struct open_how), to allow for future extensions. All future * extensions will be appended to open_how, with their zero value * acting as a no-op default. */ struct open_how { /* ... */ }; int openat2(int dfd, const char *pathname, struct open_how *how, size_t size); /* Description. */ The initial version of 'struct open_how' contains the following fields: flags Used to specify openat(2)-style flags. However, any unknown flag bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR) will result in -EINVAL. In addition, this field is 64-bits wide to allow for more O_ flags than currently permitted with openat(2). mode The file mode for O_CREAT or O_TMPFILE. Must be set to zero if flags does not contain O_CREAT or O_TMPFILE. resolve Restrict path resolution (in contrast to O_* flags they affect all path components). The current set of flags are as follows (at the moment, all of the RESOLVE_ flags are implemented as just passing the corresponding LOOKUP_ flag). RESOLVE_NO_XDEV => LOOKUP_NO_XDEV RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS RESOLVE_BENEATH => LOOKUP_BENEATH RESOLVE_IN_ROOT => LOOKUP_IN_ROOT open_how does not contain an embedded size field, because it is of little benefit (userspace can figure out the kernel open_how size at runtime fairly easily without it). It also only contains u64s (even though ->mode arguably should be a u16) to avoid having padding fields which are never used in the future. Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE is no longer permitted for openat(2). 
As far as I can tell, this has always been a bug and appears to not be used by userspace (and I've not seen any problems on my machines by disallowing it). If it turns out this breaks something, we can special-case it and only permit it for openat(2) but not openat2(2). After input from Florian Weimer, the new open_how and flag definitions are inside a separate header from uapi/linux/fcntl.h, to avoid problems that glibc has with importing that header. /* Testing. */ In a follow-up patch there are over 200 selftests which ensure that this syscall has the correct semantics and will correctly handle several attack scenarios. In addition, I've written a userspace library[4] which provides convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care must be taken when using RESOLVE_IN_ROOT'd file descriptors with other syscalls). During the development of this patch, I've run numerous verification tests using libpathrs (showing that the API is reasonably usable by userspace). /* Future Work. */ Additional RESOLVE_ flags have been suggested during the review period. These can be easily implemented separately (such as blocking auto-mount during resolution). Furthermore, there are some other proposed changes to the openat(2) interface (the most obvious example is magic-link hardening[5]) which would be a good opportunity to add a way for userspace to restrict how O_PATH file descriptors can be re-opened. Another possible avenue of future work would be some kind of CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace which openat2(2) flags and fields are supported by the current kernel (to avoid userspace having to go through several guesses to figure it out). 
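As an illustration of the extensible-struct convention described under "Syscall Prototype" (new fields are appended, and their zero value is a no-op), the following is a rough userspace model of the size handling. It is only a sketch: demo_copy_struct and demo_open_how are hypothetical names, and the code approximates the behaviour expected from copy_struct_from_user() rather than reproducing the kernel implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's current view of the struct. */
struct demo_open_how {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/*
 * Copy a caller-supplied struct of usize bytes into a destination of
 * ksize bytes:
 *  - caller's struct is smaller: the missing tail is treated as zero;
 *  - caller's struct is larger: accepted only if the extra bytes are all
 *    zero (i.e. no unknown feature is requested), otherwise -E2BIG.
 */
static int demo_copy_struct(void *dst, size_t ksize,
			    const void *src, size_t usize)
{
	const unsigned char *p = src;
	size_t size = usize < ksize ? usize : ksize;

	for (size_t i = ksize; i < usize; i++) {
		if (p[i] != 0)
			return -E2BIG;
	}

	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

Under this model an old program running on a new kernel gets the new fields zero-filled, and a new program running on an old kernel works as long as it leaves the fields the kernel does not know about set to zero.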
[1]: https://lwn.net/Articles/588444/ [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com [3]: commit 629e014bb834 ("fs: completely ignore unknown open flags") [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/ [6]: https://youtu.be/ggD-eb3yPVs Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- CREDITS | 4 +- MAINTAINERS | 1 + arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/open.c | 147 +++++++++++++++----- include/linux/fcntl.h | 16 ++- include/linux/syscalls.h | 3 + include/uapi/asm-generic/unistd.h | 5 +- include/uapi/linux/fcntl.h | 2 +- include/uapi/linux/openat2.h | 39 ++++++ 26 files changed, 198 insertions(+), 39 deletions(-) create mode 100644 include/uapi/linux/openat2.h diff --git a/CREDITS b/CREDITS index 9602b0fa1c95..a97d3280a627 100644 --- a/CREDITS +++ b/CREDITS @@ -3302,7 +3302,9 @@ S: France N: Aleksa Sarai E: cyphar@cyphar.com W: https://www.cyphar.com/ -D: `pids` cgroup subsystem +D: /sys/fs/cgroup/pids +D: openat2(2) +S: Sydney, Australia N: Dipankar Sarma E: 
dipankar@in.ibm.com diff --git a/MAINTAINERS b/MAINTAINERS index bd5847e802de..737ada377ac3 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6397,6 +6397,7 @@ F: fs/* F: include/linux/fs.h F: include/linux/fs_types.h F: include/uapi/linux/fs.h +F: include/uapi/linux/openat2.h FINTEK F75375S HARDWARE MONITOR AND FAN CONTROLLER DRIVER M: Riku Voipio <riku.voipio@iki.fi> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 8e13b0b2928d..4d7f2ffa957c 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -475,3 +475,4 @@ 543 common fspick sys_fspick 544 common pidfd_open sys_pidfd_open # 545 reserved for clone3 +547 common openat2 sys_openat2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 6da7dc4d79cc..4ba54bc7e19a 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -449,3 +449,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 435 common clone3 sys_clone3 +437 common openat2 sys_openat2 diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index 2629a68b8724..8aa00ccb0b96 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -38,7 +38,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) -#define __NR_compat_syscalls 436 +#define __NR_compat_syscalls 438 #endif #define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index 94ab29cf4f00..57f6f592d460 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick) __SYSCALL(__NR_pidfd_open, sys_pidfd_open) #define __NR_clone3 435 __SYSCALL(__NR_clone3, sys_clone3) +#define __NR_openat2 437 +__SYSCALL(__NR_openat2, sys_openat2) /* * Please add new compat syscalls above this comment and update diff --git 
a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index 36d5faf4c86c..8d36f2e2dc89 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -356,3 +356,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open # 435 reserved for clone3 +437 common openat2 sys_openat2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index a88a285a0e5f..2559925f1924 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -435,3 +435,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open # 435 reserved for clone3 +437 common openat2 sys_openat2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 09b0cd7dab0a..c04385e60833 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -441,3 +441,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 435 common clone3 sys_clone3 +437 common openat2 sys_openat2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index e7c5ab38e403..68c9ec06851f 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -374,3 +374,4 @@ 433 n32 fspick sys_fspick 434 n32 pidfd_open sys_pidfd_open 435 n32 clone3 __sys_clone3 +437 n32 openat2 sys_openat2 diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index 13cd66581f3b..42a72d010050 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -350,3 +350,4 @@ 433 n64 fspick sys_fspick 434 n64 pidfd_open sys_pidfd_open 435 n64 clone3 __sys_clone3 +437 n64 openat2 sys_openat2 diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index 353539ea4140..f114c4aed0ed 100644 --- 
a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -423,3 +423,4 @@ 433 o32 fspick sys_fspick 434 o32 pidfd_open sys_pidfd_open 435 o32 clone3 __sys_clone3 +437 o32 openat2 sys_openat2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index 285ff516150c..b550ae9a7fea 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -433,3 +433,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 435 common clone3 sys_clone3_wrapper +437 common openat2 sys_openat2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 43f736ed47f2..a8b5ecb5b602 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -517,3 +517,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 435 nospu clone3 ppc_clone3 +437 common openat2 sys_openat2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 3054e9c035a3..16b571c06161 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -438,3 +438,4 @@ 433 common fspick sys_fspick sys_fspick 434 common pidfd_open sys_pidfd_open sys_pidfd_open 435 common clone3 sys_clone3 sys_clone3 +437 common openat2 sys_openat2 sys_openat2 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index b5ed26c4c005..a7185cc18626 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -438,3 +438,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open # 435 reserved for clone3 +437 common openat2 sys_openat2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index 8c8cc7537fb2..b11c19552022 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -481,3 +481,4 
@@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open # 435 reserved for clone3 +437 common openat2 sys_openat2 diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 15908eb9b17e..d22a8b5c3fab 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -440,3 +440,4 @@ 433 i386 fspick sys_fspick __ia32_sys_fspick 434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open 435 i386 clone3 sys_clone3 __ia32_sys_clone3 +437 i386 openat2 sys_openat2 __ia32_sys_openat2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index c29976eca4a8..9035647ef236 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -357,6 +357,7 @@ 433 common fspick __x64_sys_fspick 434 common pidfd_open __x64_sys_pidfd_open 435 common clone3 __x64_sys_clone3/ptregs +437 common openat2 __x64_sys_openat2 # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 25f4de729a6d..f0a68013c038 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -406,3 +406,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 435 common clone3 sys_clone3 +437 common openat2 sys_openat2 diff --git a/fs/open.c b/fs/open.c index b62f5c0923a8..8cdb2b675867 100644 --- a/fs/open.c +++ b/fs/open.c @@ -955,48 +955,84 @@ struct file *open_with_fake_path(const struct path *path, int flags, } EXPORT_SYMBOL(open_with_fake_path); -static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op) +#define WILL_CREATE(flags) (flags & (O_CREAT | __O_TMPFILE)) +#define O_PATH_FLAGS (O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC) + +static inline struct open_how build_open_how(int flags, umode_t mode) +{ + struct open_how how = { + .flags = flags & 
VALID_OPEN_FLAGS, + .mode = mode & S_IALLUGO, + }; + + /* O_PATH beats everything else. */ + if (how.flags & O_PATH) + how.flags &= O_PATH_FLAGS; + /* Modes should only be set for create-like flags. */ + if (!WILL_CREATE(how.flags)) + how.mode = 0; + return how; +} + +static inline int build_open_flags(const struct open_how *how, + struct open_flags *op) { + int flags = how->flags; int lookup_flags = 0; int acc_mode = ACC_MODE(flags); + /* Must never be set by userspace */ + flags &= ~(FMODE_NONOTIFY | O_CLOEXEC); + /* - * Clear out all open flags we don't know about so that we don't report - * them in fcntl(F_GETFD) or similar interfaces. + * Older syscalls implicitly clear all of the invalid flags or argument + * values before calling build_open_flags(), but openat2(2) checks all + * of its arguments. */ - flags &= VALID_OPEN_FLAGS; + if (flags & ~VALID_OPEN_FLAGS) + return -EINVAL; + if (how->resolve & ~VALID_RESOLVE_FLAGS) + return -EINVAL; - if (flags & (O_CREAT | __O_TMPFILE)) - op->mode = (mode & S_IALLUGO) | S_IFREG; - else + /* Deal with the mode. */ + if (WILL_CREATE(flags)) { + if (how->mode & ~S_IALLUGO) + return -EINVAL; + op->mode = how->mode | S_IFREG; + } else { + if (how->mode != 0) + return -EINVAL; op->mode = 0; - - /* Must never be set by userspace */ - flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC; + } /* - * O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only - * check for O_DSYNC if the need any syncing at all we enforce it's - * always set instead of having to deal with possibly weird behaviour - * for malicious applications setting only __O_SYNC. + * In order to ensure programs get explicit errors when trying to use + * O_TMPFILE on old kernels, O_TMPFILE is implemented such that it + * looks like (O_DIRECTORY|O_RDWR & ~O_CREAT) to old kernels. But we + * have to require userspace to explicitly set it. 
*/ - if (flags & __O_SYNC) - flags |= O_DSYNC; - if (flags & __O_TMPFILE) { if ((flags & O_TMPFILE_MASK) != O_TMPFILE) return -EINVAL; if (!(acc_mode & MAY_WRITE)) return -EINVAL; - } else if (flags & O_PATH) { - /* - * If we have O_PATH in the open flag. Then we - * cannot have anything other than the below set of flags - */ - flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH; + } + if (flags & O_PATH) { + /* O_PATH only permits certain other flags to be set. */ + if (flags & ~O_PATH_FLAGS) + return -EINVAL; acc_mode = 0; } + /* + * O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only + * check for O_DSYNC if the need any syncing at all we enforce it's + * always set instead of having to deal with possibly weird behaviour + * for malicious applications setting only __O_SYNC. + */ + if (flags & __O_SYNC) + flags |= O_DSYNC; + op->open_flag = flags; /* O_TRUNC implies we need access checks for write permissions */ @@ -1022,6 +1058,18 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o lookup_flags |= LOOKUP_DIRECTORY; if (!(flags & O_NOFOLLOW)) lookup_flags |= LOOKUP_FOLLOW; + + if (how->resolve & RESOLVE_NO_XDEV) + lookup_flags |= LOOKUP_NO_XDEV; + if (how->resolve & RESOLVE_NO_MAGICLINKS) + lookup_flags |= LOOKUP_NO_MAGICLINKS; + if (how->resolve & RESOLVE_NO_SYMLINKS) + lookup_flags |= LOOKUP_NO_SYMLINKS; + if (how->resolve & RESOLVE_BENEATH) + lookup_flags |= LOOKUP_BENEATH; + if (how->resolve & RESOLVE_IN_ROOT) + lookup_flags |= LOOKUP_IN_ROOT; + op->lookup_flags = lookup_flags; return 0; } @@ -1040,8 +1088,11 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o struct file *file_open_name(struct filename *name, int flags, umode_t mode) { struct open_flags op; - int err = build_open_flags(flags, mode, &op); - return err ? 
ERR_PTR(err) : do_filp_open(AT_FDCWD, name, &op); + struct open_how how = build_open_how(flags, mode); + int err = build_open_flags(&how, &op); + if (err) + return ERR_PTR(err); + return do_filp_open(AT_FDCWD, name, &op); } /** @@ -1072,17 +1123,19 @@ struct file *file_open_root(struct dentry *dentry, struct vfsmount *mnt, const char *filename, int flags, umode_t mode) { struct open_flags op; - int err = build_open_flags(flags, mode, &op); + struct open_how how = build_open_how(flags, mode); + int err = build_open_flags(&how, &op); if (err) return ERR_PTR(err); return do_file_open_root(dentry, mnt, filename, &op); } EXPORT_SYMBOL(file_open_root); -long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) +static long do_sys_openat2(int dfd, const char __user *filename, + struct open_how *how) { struct open_flags op; - int fd = build_open_flags(flags, mode, &op); + int fd = build_open_flags(how, &op); struct filename *tmp; if (fd) @@ -1092,7 +1145,7 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) if (IS_ERR(tmp)) return PTR_ERR(tmp); - fd = get_unused_fd_flags(flags); + fd = get_unused_fd_flags(how->flags); if (fd >= 0) { struct file *f = do_filp_open(dfd, tmp, &op); if (IS_ERR(f)) { @@ -1107,12 +1160,16 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) return fd; } -SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) +long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) { - if (force_o_largefile()) - flags |= O_LARGEFILE; + struct open_how how = build_open_how(flags, mode); + return do_sys_openat2(dfd, filename, &how); +} - return do_sys_open(AT_FDCWD, filename, flags, mode); + +SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) +{ + return ksys_open(filename, flags, mode); } SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags, @@ -1120,10 +1177,32 @@ 
SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags, { if (force_o_largefile()) flags |= O_LARGEFILE; - return do_sys_open(dfd, filename, flags, mode); } +SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename, + struct open_how __user *, how, size_t, usize) +{ + int err; + struct open_how tmp; + + BUILD_BUG_ON(sizeof(struct open_how) < OPEN_HOW_SIZE_VER0); + BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_LATEST); + + if (unlikely(usize < OPEN_HOW_SIZE_VER0)) + return -EINVAL; + + err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize); + if (err) + return err; + + /* O_LARGEFILE is only allowed for non-O_PATH. */ + if (!(tmp.flags & O_PATH) && force_o_largefile()) + tmp.flags |= O_LARGEFILE; + + return do_sys_openat2(dfd, filename, &tmp); +} + #ifdef CONFIG_COMPAT /* * Exactly like sys_open(), except that it doesn't set the diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index d019df946cb2..7bcdcf4f6ab2 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -2,15 +2,29 @@ #ifndef _LINUX_FCNTL_H #define _LINUX_FCNTL_H +#include <linux/stat.h> #include <uapi/linux/fcntl.h> -/* list of all valid flags for the open/openat flags argument: */ +/* List of all valid flags for the open/openat flags argument: */ #define VALID_OPEN_FLAGS \ (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \ O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE) +/* List of all valid flags for the how->upgrade_mask argument: */ +#define VALID_UPGRADE_FLAGS \ + (UPGRADE_NOWRITE | UPGRADE_NOREAD) + +/* List of all valid flags for the how->resolve argument: */ +#define VALID_RESOLVE_FLAGS \ + (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \ + RESOLVE_BENEATH | RESOLVE_IN_ROOT) + +/* List of all open_how "versions". 
*/ +#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */ +#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0 + #ifndef force_o_largefile #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T)) #endif diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index d0391cc2dae9..cd9f27cbc567 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct rseq; union bpf_attr; struct io_uring_params; struct clone_args; +struct open_how; #include <linux/types.h> #include <linux/aio_abi.h> @@ -439,6 +440,8 @@ asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user, asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group); asmlinkage long sys_openat(int dfd, const char __user *filename, int flags, umode_t mode); +asmlinkage long sys_openat2(int dfd, const char __user *filename, + struct open_how *how, size_t size); asmlinkage long sys_close(unsigned int fd); asmlinkage long sys_vhangup(void); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 1fc8faa6e973..d4122c091472 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -851,8 +851,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open) __SYSCALL(__NR_clone3, sys_clone3) #endif +#define __NR_openat2 437 +__SYSCALL(__NR_openat2, sys_openat2) + #undef __NR_syscalls -#define __NR_syscalls 436 +#define __NR_syscalls 438 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 1f97b33c840e..ca88b7bce553 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -3,6 +3,7 @@ #define _UAPI_LINUX_FCNTL_H #include <asm/fcntl.h> +#include <linux/openat2.h> #define F_SETLEASE (F_LINUX_SPECIFIC_BASE + 0) #define F_GETLEASE (F_LINUX_SPECIFIC_BASE + 1) @@ -100,5 +101,4 @@ #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ - #endif /* _UAPI_LINUX_FCNTL_H */ diff --git 
a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h new file mode 100644 index 000000000000..58b1eb711360 --- /dev/null +++ b/include/uapi/linux/openat2.h @@ -0,0 +1,39 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_OPENAT2_H +#define _UAPI_LINUX_OPENAT2_H + +#include <linux/types.h> + +/* + * Arguments for how openat2(2) should open the target path. If only @flags and + * @mode are non-zero, then openat2(2) operates very similarly to openat(2). + * + * However, unlike openat(2), unknown or invalid bits in @flags result in + * -EINVAL rather than being silently ignored. @mode must be zero unless one of + * {O_CREAT, O_TMPFILE} is set. + * + * @flags: O_* flags. + * @mode: O_CREAT/O_TMPFILE file mode. + * @resolve: RESOLVE_* flags. + */ +struct open_how { + __u64 flags; + __u64 mode; + __u64 resolve; +}; + +/* how->resolve flags for openat2(2). */ +#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings + (includes bind-mounts). */ +#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style + "magic-links". */ +#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks + (implies RESOLVE_NO_MAGICLINKS) */ +#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like + "..", symlinks, and absolute + paths which escape the dirfd. */ +#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".." + be scoped inside the dirfd + (similar to chroot(2)). */ + +#endif /* _UAPI_LINUX_OPENAT2_H */ -- 2.24.1
* [PATCH v3 2/2] selftests: add openat2(2) selftests
  2020-01-18 12:07 ` [PATCH v3 0/2] openat2: minor uapi cleanups Aleksa Sarai
  2020-01-18 12:07 ` [PATCH v3 1/2] open: introduce openat2(2) syscall Aleksa Sarai
@ 2020-01-18 12:08 ` Aleksa Sarai
  2020-01-18 15:28 ` [PATCH v3 0/2] openat2: minor uapi cleanups Al Viro
  2 siblings, 0 replies; 92+ messages in thread
From: Aleksa Sarai @ 2020-01-18 12:08 UTC (permalink / raw)
To: Alexander Viro, Jeff Layton, J. Bruce Fields, Shuah Khan
Cc: Aleksa Sarai, Florian Weimer, David Laight, Christian Brauner, quae, dev, containers, libc-alpha, linux-api, linux-fsdevel, linux-kernel, linux-kselftest

Test all of the various openat2(2) flags. A small stress-test of a symlink-rename attack is included to show that the protections against ".."-based attacks are sufficient.

The main things these self-tests are enforcing are:

 * The struct+usize ABI for openat2(2) and copy_struct_from_user() to ensure that upgrades will be handled gracefully (in addition, ensuring that misaligned structures are also handled correctly).

 * The -EINVAL checks for openat2(2) are all correctly handled to avoid userspace passing unknown or conflicting flag sets (most importantly, ensuring that invalid flag combinations are checked).

 * All of the RESOLVE_* semantics (including errno values) are correctly handled with various combinations of paths and flags.

 * RESOLVE_IN_ROOT correctly protects against the symlink rename(2) attack that has been responsible for several CVEs (and likely will be responsible for several more).
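The struct+usize behaviour named in the first point above can be modeled in plain userspace C. The following is only a minimal sketch under stated assumptions, not the kernel implementation: `struct sketch_open_how`, `SKETCH_HOW_SIZE_VER0` and `sketch_copy_how()` are hypothetical names introduced for illustration, and the logic mirrors only the size checks that the struct tests in this patch expect (a size below OPEN_HOW_SIZE_VER0 yields -EINVAL, non-zero bytes past the known struct yield -E2BIG, and a larger all-zero struct is accepted).

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for OPEN_HOW_SIZE_VER0 from the patch. */
#define SKETCH_HOW_SIZE_VER0 24

/* Hypothetical mirror of struct open_how (three 64-bit fields). */
struct sketch_open_how {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/*
 * Userspace model of the size handling: returns 0 on success or a
 * negative errno value. This is an illustration, not kernel code.
 */
static int sketch_copy_how(struct sketch_open_how *dst,
			   const void *src, size_t usize)
{
	const unsigned char *bytes = src;
	size_t ksize = sizeof(*dst);
	size_t size = usize < ksize ? usize : ksize;

	/* Anything smaller than the first published struct is rejected. */
	if (usize < SKETCH_HOW_SIZE_VER0)
		return -EINVAL;

	/* Bytes beyond what we understand must all be zero. */
	for (size_t i = ksize; i < usize; i++)
		if (bytes[i] != 0)
			return -E2BIG;

	/* Zero-fill first so that fields missing from a short struct read as 0. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

The zero-fill before the copy is what would let an older, shorter struct keep working after the known struct grows, while the trailing-byte check is what would let a newer, larger struct work as long as the fields this side does not understand are unused.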
Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/openat2/.gitignore | 1 + tools/testing/selftests/openat2/Makefile | 8 + tools/testing/selftests/openat2/helpers.c | 109 ++++ tools/testing/selftests/openat2/helpers.h | 106 ++++ .../testing/selftests/openat2/openat2_test.c | 312 +++++++++++ .../selftests/openat2/rename_attack_test.c | 160 ++++++ .../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++ 8 files changed, 1220 insertions(+) create mode 100644 tools/testing/selftests/openat2/.gitignore create mode 100644 tools/testing/selftests/openat2/Makefile create mode 100644 tools/testing/selftests/openat2/helpers.c create mode 100644 tools/testing/selftests/openat2/helpers.h create mode 100644 tools/testing/selftests/openat2/openat2_test.c create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c create mode 100644 tools/testing/selftests/openat2/resolve_test.c diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index b001c602414b..4f502448dc7e 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -40,6 +40,7 @@ TARGETS += powerpc TARGETS += proc TARGETS += pstore TARGETS += ptrace +TARGETS += openat2 TARGETS += rseq TARGETS += rtc TARGETS += seccomp diff --git a/tools/testing/selftests/openat2/.gitignore b/tools/testing/selftests/openat2/.gitignore new file mode 100644 index 000000000000..bd68f6c3fd07 --- /dev/null +++ b/tools/testing/selftests/openat2/.gitignore @@ -0,0 +1 @@ +/*_test diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile new file mode 100644 index 000000000000..4b93b1417b86 --- /dev/null +++ b/tools/testing/selftests/openat2/Makefile @@ -0,0 +1,8 @@ +# SPDX-License-Identifier: GPL-2.0-or-later + +CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined 
+TEST_GEN_PROGS := openat2_test resolve_test rename_attack_test + +include ../lib.mk + +$(TEST_GEN_PROGS): helpers.c diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c new file mode 100644 index 000000000000..e9a6557ab16f --- /dev/null +++ b/tools/testing/selftests/openat2/helpers.c @@ -0,0 +1,109 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai <cyphar@cyphar.com> + * Copyright (C) 2018-2019 SUSE LLC. + */ + +#define _GNU_SOURCE +#include <errno.h> +#include <fcntl.h> +#include <stdbool.h> +#include <string.h> +#include <syscall.h> +#include <limits.h> + +#include "helpers.h" + +bool needs_openat2(const struct open_how *how) +{ + return how->resolve != 0; +} + +int raw_openat2(int dfd, const char *path, void *how, size_t size) +{ + int ret = syscall(__NR_openat2, dfd, path, how, size); + return ret >= 0 ? ret : -errno; +} + +int sys_openat2(int dfd, const char *path, struct open_how *how) +{ + return raw_openat2(dfd, path, how, sizeof(*how)); +} + +int sys_openat(int dfd, const char *path, struct open_how *how) +{ + int ret = openat(dfd, path, how->flags, how->mode); + return ret >= 0 ? ret : -errno; +} + +int sys_renameat2(int olddirfd, const char *oldpath, + int newdirfd, const char *newpath, unsigned int flags) +{ + int ret = syscall(__NR_renameat2, olddirfd, oldpath, + newdirfd, newpath, flags); + return ret >= 0 ? 
ret : -errno; +} + +int touchat(int dfd, const char *path) +{ + int fd = openat(dfd, path, O_CREAT, 0700); + if (fd >= 0) + close(fd); + return fd; +} + +char *fdreadlink(int fd) +{ + char *target, *tmp; + + E_asprintf(&tmp, "/proc/self/fd/%d", fd); + + target = malloc(PATH_MAX); + if (!target) + ksft_exit_fail_msg("fdreadlink: malloc failed\n"); + memset(target, 0, PATH_MAX); + + E_readlink(tmp, target, PATH_MAX); + free(tmp); + return target; +} + +bool fdequal(int fd, int dfd, const char *path) +{ + char *fdpath, *dfdpath, *other; + bool cmp; + + fdpath = fdreadlink(fd); + dfdpath = fdreadlink(dfd); + + if (!path) + E_asprintf(&other, "%s", dfdpath); + else if (*path == '/') + E_asprintf(&other, "%s", path); + else + E_asprintf(&other, "%s/%s", dfdpath, path); + + cmp = !strcmp(fdpath, other); + + free(fdpath); + free(dfdpath); + free(other); + return cmp; +} + +bool openat2_supported = false; + +void __attribute__((constructor)) init(void) +{ + struct open_how how = {}; + int fd; + + BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER0); + + /* Check openat2(2) support. */ + fd = sys_openat2(AT_FDCWD, ".", &how); + openat2_supported = (fd >= 0); + + if (fd >= 0) + close(fd); +} diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h new file mode 100644 index 000000000000..a6ea27344db2 --- /dev/null +++ b/tools/testing/selftests/openat2/helpers.h @@ -0,0 +1,106 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai <cyphar@cyphar.com> + * Copyright (C) 2018-2019 SUSE LLC.
+ */ + +#ifndef __RESOLVEAT_H__ +#define __RESOLVEAT_H__ + +#define _GNU_SOURCE +#include <stdint.h> +#include <errno.h> +#include <linux/types.h> +#include "../kselftest.h" + +#define ARRAY_LEN(X) (sizeof (X) / sizeof (*(X))) +#define BUILD_BUG_ON(e) ((void)(sizeof(struct { int:(-!!(e)); }))) + +#ifndef SYS_openat2 +#ifndef __NR_openat2 +#define __NR_openat2 437 +#endif /* __NR_openat2 */ +#define SYS_openat2 __NR_openat2 +#endif /* SYS_openat2 */ + +/* + * Arguments for how openat2(2) should open the target path. If @resolve is + * zero, then openat2(2) operates very similarly to openat(2). + * + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather + * than being silently ignored. @mode must be zero unless one of {O_CREAT, + * O_TMPFILE} is set. + * + * @flags: O_* flags. + * @mode: O_CREAT/O_TMPFILE file mode. + * @resolve: RESOLVE_* flags. + */ +struct open_how { + __u64 flags; + __u64 mode; + __u64 resolve; +}; + +#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */ +#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0 + +bool needs_openat2(const struct open_how *how); + +#ifndef RESOLVE_IN_ROOT +/* how->resolve flags for openat2(2). */ +#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings + (includes bind-mounts). */ +#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style + "magic-links". */ +#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks + (implies RESOLVE_NO_MAGICLINKS) */ +#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like + "..", symlinks, and absolute + paths which escape the dirfd. */ +#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".." + be scoped inside the dirfd + (similar to chroot(2)). */ +#endif /* RESOLVE_IN_ROOT */ + +#define E_func(func, ...) \ + do { \ + if (func(__VA_ARGS__) < 0) \ + ksft_exit_fail_msg("%s:%d %s failed\n", \ + __FILE__, __LINE__, #func);\ + } while (0) + +#define E_asprintf(...)
E_func(asprintf, __VA_ARGS__) +#define E_chmod(...) E_func(chmod, __VA_ARGS__) +#define E_dup2(...) E_func(dup2, __VA_ARGS__) +#define E_fchdir(...) E_func(fchdir, __VA_ARGS__) +#define E_fstatat(...) E_func(fstatat, __VA_ARGS__) +#define E_kill(...) E_func(kill, __VA_ARGS__) +#define E_mkdirat(...) E_func(mkdirat, __VA_ARGS__) +#define E_mount(...) E_func(mount, __VA_ARGS__) +#define E_prctl(...) E_func(prctl, __VA_ARGS__) +#define E_readlink(...) E_func(readlink, __VA_ARGS__) +#define E_setresuid(...) E_func(setresuid, __VA_ARGS__) +#define E_symlinkat(...) E_func(symlinkat, __VA_ARGS__) +#define E_touchat(...) E_func(touchat, __VA_ARGS__) +#define E_unshare(...) E_func(unshare, __VA_ARGS__) + +#define E_assert(expr, msg, ...) \ + do { \ + if (!(expr)) \ + ksft_exit_fail_msg("ASSERT(%s:%d) failed (%s): " msg "\n", \ + __FILE__, __LINE__, #expr, ##__VA_ARGS__); \ + } while (0) + +int raw_openat2(int dfd, const char *path, void *how, size_t size); +int sys_openat2(int dfd, const char *path, struct open_how *how); +int sys_openat(int dfd, const char *path, struct open_how *how); +int sys_renameat2(int olddirfd, const char *oldpath, + int newdirfd, const char *newpath, unsigned int flags); + +int touchat(int dfd, const char *path); +char *fdreadlink(int fd); +bool fdequal(int fd, int dfd, const char *path); + +extern bool openat2_supported; + +#endif /* __RESOLVEAT_H__ */ diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c new file mode 100644 index 000000000000..b386367c606b --- /dev/null +++ b/tools/testing/selftests/openat2/openat2_test.c @@ -0,0 +1,312 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai <cyphar@cyphar.com> + * Copyright (C) 2018-2019 SUSE LLC. 
+ */ + +#define _GNU_SOURCE +#include <fcntl.h> +#include <sched.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <sys/mount.h> +#include <stdlib.h> +#include <stdbool.h> +#include <string.h> + +#include "../kselftest.h" +#include "helpers.h" + +/* + * O_LARGEFILE is set to 0 by glibc. + * XXX: This is wrong on {mips, parisc, powerpc, sparc}. + */ +#undef O_LARGEFILE +#define O_LARGEFILE 0x8000 + +struct open_how_ext { + struct open_how inner; + uint32_t extra1; + char pad1[128]; + uint32_t extra2; + char pad2[128]; + uint32_t extra3; +}; + +struct struct_test { + const char *name; + struct open_how_ext arg; + size_t size; + int err; +}; + +#define NUM_OPENAT2_STRUCT_TESTS 7 +#define NUM_OPENAT2_STRUCT_VARIATIONS 13 + +void test_openat2_struct(void) +{ + int misalignments[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 17, 87 }; + + struct struct_test tests[] = { + /* Normal struct. */ + { .name = "normal struct", + .arg.inner.flags = O_RDONLY, + .size = sizeof(struct open_how) }, + /* Bigger struct, with zeroed out end. */ + { .name = "bigger struct (zeroed out)", + .arg.inner.flags = O_RDONLY, + .size = sizeof(struct open_how_ext) }, + + /* TODO: Once expanded, check zero-padding. */ + + /* Smaller than version-0 struct. */ + { .name = "zero-sized 'struct'", + .arg.inner.flags = O_RDONLY, .size = 0, .err = -EINVAL }, + { .name = "smaller-than-v0 struct", + .arg.inner.flags = O_RDONLY, + .size = OPEN_HOW_SIZE_VER0 - 1, .err = -EINVAL }, + + /* Bigger struct, with non-zero trailing bytes. 
+ */ + { .name = "bigger struct (non-zero data in first 'future field')", + .arg.inner.flags = O_RDONLY, .arg.extra1 = 0xdeadbeef, + .size = sizeof(struct open_how_ext), .err = -E2BIG }, + { .name = "bigger struct (non-zero data in middle of 'future fields')", + .arg.inner.flags = O_RDONLY, .arg.extra2 = 0xfeedcafe, + .size = sizeof(struct open_how_ext), .err = -E2BIG }, + { .name = "bigger struct (non-zero data at end of 'future fields')", + .arg.inner.flags = O_RDONLY, .arg.extra3 = 0xabad1dea, + .size = sizeof(struct open_how_ext), .err = -E2BIG }, + }; + + BUILD_BUG_ON(ARRAY_LEN(misalignments) != NUM_OPENAT2_STRUCT_VARIATIONS); + BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_STRUCT_TESTS); + + for (int i = 0; i < ARRAY_LEN(tests); i++) { + struct struct_test *test = &tests[i]; + struct open_how_ext how_ext = test->arg; + + for (int j = 0; j < ARRAY_LEN(misalignments); j++) { + int fd, misalign = misalignments[j]; + char *fdpath = NULL; + bool failed; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + + void *copy = NULL, *how_copy = &how_ext; + + if (!openat2_supported) { + ksft_print_msg("openat2(2) unsupported\n"); + resultfn = ksft_test_result_skip; + goto skip; + } + + if (misalign) { + /* + * Explicitly misalign the structure by copying it at the given + * (mis)alignment offset. The other data is set to be non-zero to + * make sure that non-zero bytes outside the struct aren't checked. + * + * This is effectively to check that is_zeroed_user() works.
+ */ + copy = malloc(misalign + sizeof(how_ext)); + how_copy = copy + misalign; + memset(copy, 0xff, misalign); + memcpy(how_copy, &how_ext, sizeof(how_ext)); + } + + fd = raw_openat2(AT_FDCWD, ".", how_copy, test->size); + if (test->err >= 0) + failed = (fd < 0); + else + failed = (fd != test->err); + if (fd >= 0) { + fdpath = fdreadlink(fd); + close(fd); + } + + if (failed) { + resultfn = ksft_test_result_fail; + + ksft_print_msg("openat2 unexpectedly returned "); + if (fdpath) + ksft_print_msg("%d['%s']\n", fd, fdpath); + else + ksft_print_msg("%d (%s)\n", fd, strerror(-fd)); + } + +skip: + if (test->err >= 0) + resultfn("openat2 with %s argument [misalign=%d] succeeds\n", + test->name, misalign); + else + resultfn("openat2 with %s argument [misalign=%d] fails with %d (%s)\n", + test->name, misalign, test->err, + strerror(-test->err)); + + free(copy); + free(fdpath); + fflush(stdout); + } + } +} + +struct flag_test { + const char *name; + struct open_how how; + int err; +}; + +#define NUM_OPENAT2_FLAG_TESTS 23 + +void test_openat2_flags(void) +{ + struct flag_test tests[] = { + /* O_TMPFILE is incompatible with O_PATH and O_CREAT. */ + { .name = "incompatible flags (O_TMPFILE | O_PATH)", + .how.flags = O_TMPFILE | O_PATH | O_RDWR, .err = -EINVAL }, + { .name = "incompatible flags (O_TMPFILE | O_CREAT)", + .how.flags = O_TMPFILE | O_CREAT | O_RDWR, .err = -EINVAL }, + + /* O_PATH only permits certain other flags to be set ... */ + { .name = "compatible flags (O_PATH | O_CLOEXEC)", + .how.flags = O_PATH | O_CLOEXEC }, + { .name = "compatible flags (O_PATH | O_DIRECTORY)", + .how.flags = O_PATH | O_DIRECTORY }, + { .name = "compatible flags (O_PATH | O_NOFOLLOW)", + .how.flags = O_PATH | O_NOFOLLOW }, + /* ... and others are absolutely not permitted. 
*/ + { .name = "incompatible flags (O_PATH | O_RDWR)", + .how.flags = O_PATH | O_RDWR, .err = -EINVAL }, + { .name = "incompatible flags (O_PATH | O_CREAT)", + .how.flags = O_PATH | O_CREAT, .err = -EINVAL }, + { .name = "incompatible flags (O_PATH | O_EXCL)", + .how.flags = O_PATH | O_EXCL, .err = -EINVAL }, + { .name = "incompatible flags (O_PATH | O_NOCTTY)", + .how.flags = O_PATH | O_NOCTTY, .err = -EINVAL }, + { .name = "incompatible flags (O_PATH | O_DIRECT)", + .how.flags = O_PATH | O_DIRECT, .err = -EINVAL }, + { .name = "incompatible flags (O_PATH | O_LARGEFILE)", + .how.flags = O_PATH | O_LARGEFILE, .err = -EINVAL }, + + /* ->mode must only be set with O_{CREAT,TMPFILE}. */ + { .name = "non-zero how.mode and O_RDONLY", + .how.flags = O_RDONLY, .how.mode = 0600, .err = -EINVAL }, + { .name = "non-zero how.mode and O_PATH", + .how.flags = O_PATH, .how.mode = 0600, .err = -EINVAL }, + { .name = "valid how.mode and O_CREAT", + .how.flags = O_CREAT, .how.mode = 0600 }, + { .name = "valid how.mode and O_TMPFILE", + .how.flags = O_TMPFILE | O_RDWR, .how.mode = 0600 }, + /* ->mode must only contain 0777 bits. */ + { .name = "invalid how.mode and O_CREAT", + .how.flags = O_CREAT, + .how.mode = 0xFFFF, .err = -EINVAL }, + { .name = "invalid (very large) how.mode and O_CREAT", + .how.flags = O_CREAT, + .how.mode = 0xC000000000000000ULL, .err = -EINVAL }, + { .name = "invalid how.mode and O_TMPFILE", + .how.flags = O_TMPFILE | O_RDWR, + .how.mode = 0x1337, .err = -EINVAL }, + { .name = "invalid (very large) how.mode and O_TMPFILE", + .how.flags = O_TMPFILE | O_RDWR, + .how.mode = 0x0000A00000000000ULL, .err = -EINVAL }, + + /* ->resolve must only contain RESOLVE_* flags. 
*/ + { .name = "invalid how.resolve and O_RDONLY", + .how.flags = O_RDONLY, + .how.resolve = 0x1337, .err = -EINVAL }, + { .name = "invalid how.resolve and O_CREAT", + .how.flags = O_CREAT, + .how.resolve = 0x1337, .err = -EINVAL }, + { .name = "invalid how.resolve and O_TMPFILE", + .how.flags = O_TMPFILE | O_RDWR, + .how.resolve = 0x1337, .err = -EINVAL }, + { .name = "invalid how.resolve and O_PATH", + .how.flags = O_PATH, + .how.resolve = 0x1337, .err = -EINVAL }, + }; + + BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_FLAG_TESTS); + + for (int i = 0; i < ARRAY_LEN(tests); i++) { + int fd, fdflags = -1; + char *path, *fdpath = NULL; + bool failed = false; + struct flag_test *test = &tests[i]; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + + if (!openat2_supported) { + ksft_print_msg("openat2(2) unsupported\n"); + resultfn = ksft_test_result_skip; + goto skip; + } + + path = (test->how.flags & O_CREAT) ? "/tmp/ksft.openat2_tmpfile" : "."; + unlink(path); + + fd = sys_openat2(AT_FDCWD, path, &test->how); + if (test->err >= 0) + failed = (fd < 0); + else + failed = (fd != test->err); + if (fd >= 0) { + int otherflags; + + fdpath = fdreadlink(fd); + fdflags = fcntl(fd, F_GETFL); + otherflags = fcntl(fd, F_GETFD); + close(fd); + + E_assert(fdflags >= 0, "fcntl F_GETFL of new fd"); + E_assert(otherflags >= 0, "fcntl F_GETFD of new fd"); + + /* O_CLOEXEC isn't shown in F_GETFL. */ + if (otherflags & FD_CLOEXEC) + fdflags |= O_CLOEXEC; + /* O_CREAT is hidden from F_GETFL. 
*/ + if (test->how.flags & O_CREAT) + fdflags |= O_CREAT; + if (!(test->how.flags & O_LARGEFILE)) + fdflags &= ~O_LARGEFILE; + failed |= (fdflags != test->how.flags); + } + + if (failed) { + resultfn = ksft_test_result_fail; + + ksft_print_msg("openat2 unexpectedly returned "); + if (fdpath) + ksft_print_msg("%d['%s'] with %X (!= %X)\n", + fd, fdpath, fdflags, + test->how.flags); + else + ksft_print_msg("%d (%s)\n", fd, strerror(-fd)); + } + +skip: + if (test->err >= 0) + resultfn("openat2 with %s succeeds\n", test->name); + else + resultfn("openat2 with %s fails with %d (%s)\n", + test->name, test->err, strerror(-test->err)); + + free(fdpath); + fflush(stdout); + } +} + +#define NUM_TESTS (NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \ + NUM_OPENAT2_FLAG_TESTS) + +int main(int argc, char **argv) +{ + ksft_print_header(); + ksft_set_plan(NUM_TESTS); + + test_openat2_struct(); + test_openat2_flags(); + + if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0) + ksft_exit_fail(); + else + ksft_exit_pass(); +} diff --git a/tools/testing/selftests/openat2/rename_attack_test.c b/tools/testing/selftests/openat2/rename_attack_test.c new file mode 100644 index 000000000000..0a770728b436 --- /dev/null +++ b/tools/testing/selftests/openat2/rename_attack_test.c @@ -0,0 +1,160 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai <cyphar@cyphar.com> + * Copyright (C) 2018-2019 SUSE LLC. 
+ */ + +#define _GNU_SOURCE +#include <errno.h> +#include <fcntl.h> +#include <sched.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <sys/mount.h> +#include <sys/mman.h> +#include <sys/prctl.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <stdbool.h> +#include <string.h> +#include <syscall.h> +#include <limits.h> +#include <unistd.h> + +#include "../kselftest.h" +#include "helpers.h" + +/* Construct a test directory with the following structure: + * + * root/ + * |-- a/ + * | `-- c/ + * `-- b/ + */ +int setup_testdir(void) +{ + int dfd; + char dirname[] = "/tmp/ksft-openat2-rename-attack.XXXXXX"; + + /* Make the top-level directory. */ + if (!mkdtemp(dirname)) + ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n"); + dfd = open(dirname, O_PATH | O_DIRECTORY); + if (dfd < 0) + ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n"); + + E_mkdirat(dfd, "a", 0755); + E_mkdirat(dfd, "b", 0755); + E_mkdirat(dfd, "a/c", 0755); + + return dfd; +} + +/* Swap @dirfd/@a and @dirfd/@b constantly. Parent must kill this process. */ +pid_t spawn_attack(int dirfd, char *a, char *b) +{ + pid_t child = fork(); + if (child != 0) + return child; + + /* If the parent (the test process) dies, kill ourselves too. */ + E_prctl(PR_SET_PDEATHSIG, SIGKILL); + + /* Swap @a and @b. */ + for (;;) + renameat2(dirfd, a, dirfd, b, RENAME_EXCHANGE); + exit(1); +} + +#define NUM_RENAME_TESTS 2 +#define ROUNDS 400000 + +const char *flagname(int resolve) +{ + switch (resolve) { + case RESOLVE_IN_ROOT: + return "RESOLVE_IN_ROOT"; + case RESOLVE_BENEATH: + return "RESOLVE_BENEATH"; + } + return "(unknown)"; +} + +void test_rename_attack(int resolve) +{ + int dfd, afd; + pid_t child; + void (*resultfn)(const char *msg, ...) 
= ksft_test_result_pass; + int escapes = 0, other_errs = 0, exdevs = 0, eagains = 0, successes = 0; + + struct open_how how = { + .flags = O_PATH, + .resolve = resolve, + }; + + if (!openat2_supported) { + how.resolve = 0; + ksft_print_msg("openat2(2) unsupported -- using openat(2) instead\n"); + } + + dfd = setup_testdir(); + afd = openat(dfd, "a", O_PATH); + if (afd < 0) + ksft_exit_fail_msg("test_rename_attack: failed to open 'a'\n"); + + child = spawn_attack(dfd, "a/c", "b"); + + for (int i = 0; i < ROUNDS; i++) { + int fd; + char *victim_path = "c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../.."; + + if (openat2_supported) + fd = sys_openat2(afd, victim_path, &how); + else + fd = sys_openat(afd, victim_path, &how); + + if (fd < 0) { + if (fd == -EAGAIN) + eagains++; + else if (fd == -EXDEV) + exdevs++; + else if (fd == -ENOENT) + escapes++; /* escaped outside and got ENOENT... */ + else + other_errs++; /* unexpected error */ + } else { + if (fdequal(fd, afd, NULL)) + successes++; + else + escapes++; /* we got an unexpected fd */ + } + close(fd); + } + + if (escapes > 0) + resultfn = ksft_test_result_fail; + ksft_print_msg("non-escapes: EAGAIN=%d EXDEV=%d E<other>=%d success=%d\n", + eagains, exdevs, other_errs, successes); + resultfn("rename attack with %s (%d runs, got %d escapes)\n", + flagname(resolve), ROUNDS, escapes); + + /* Should be killed anyway, but might as well make sure. 
+ */ + E_kill(child, SIGKILL); +} + +#define NUM_TESTS NUM_RENAME_TESTS + +int main(int argc, char **argv) +{ + ksft_print_header(); + ksft_set_plan(NUM_TESTS); + + test_rename_attack(RESOLVE_BENEATH); + test_rename_attack(RESOLVE_IN_ROOT); + + if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0) + ksft_exit_fail(); + else + ksft_exit_pass(); +} diff --git a/tools/testing/selftests/openat2/resolve_test.c b/tools/testing/selftests/openat2/resolve_test.c new file mode 100644 index 000000000000..7a94b1da8e7b --- /dev/null +++ b/tools/testing/selftests/openat2/resolve_test.c @@ -0,0 +1,523 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai <cyphar@cyphar.com> + * Copyright (C) 2018-2019 SUSE LLC. + */ + +#define _GNU_SOURCE +#include <fcntl.h> +#include <sched.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <sys/mount.h> +#include <stdlib.h> +#include <stdbool.h> +#include <string.h> + +#include "../kselftest.h" +#include "helpers.h" + +/* + * Construct a test directory with the following structure: + * + * root/ + * |-- procexe -> /proc/self/exe + * |-- procroot -> /proc/self/root + * |-- root/ + * |-- mnt/ [mountpoint] + * | |-- self -> ../mnt/ + * | `-- absself -> /mnt/ + * |-- etc/ + * | `-- passwd + * |-- creatlink -> /newfile3 + * |-- reletc -> etc/ + * |-- relsym -> etc/passwd + * |-- absetc -> /etc/ + * |-- abssym -> /etc/passwd + * |-- abscheeky -> /cheeky + * `-- cheeky/ + * |-- absself -> / + * |-- self -> ../../root/ + * |-- garbageself -> /../../root/ + * |-- passwd -> ../cheeky/../etc/../etc/passwd + * |-- abspasswd -> /../cheeky/../etc/../etc/passwd + * |-- dotdotlink -> ../../../../../../../../../../../../../../etc/passwd + * `-- garbagelink -> /../../../../../../../../../../../../../../etc/passwd + */ +int setup_testdir(void) +{ + int dfd, tmpfd; + char dirname[] = "/tmp/ksft-openat2-testdir.XXXXXX"; + + /* Unshare the mount namespace and make the /tmp mount private.
*/ + E_unshare(CLONE_NEWNS); + E_mount("", "/tmp", "", MS_PRIVATE, ""); + + /* Make the top-level directory. */ + if (!mkdtemp(dirname)) + ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n"); + dfd = open(dirname, O_PATH | O_DIRECTORY); + if (dfd < 0) + ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n"); + + /* A sub-directory which is actually used for tests. */ + E_mkdirat(dfd, "root", 0755); + tmpfd = openat(dfd, "root", O_PATH | O_DIRECTORY); + if (tmpfd < 0) + ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n"); + close(dfd); + dfd = tmpfd; + + E_symlinkat("/proc/self/exe", dfd, "procexe"); + E_symlinkat("/proc/self/root", dfd, "procroot"); + E_mkdirat(dfd, "root", 0755); + + /* There is no mountat(2), so use chdir. */ + E_mkdirat(dfd, "mnt", 0755); + E_fchdir(dfd); + E_mount("tmpfs", "./mnt", "tmpfs", MS_NOSUID | MS_NODEV, ""); + E_symlinkat("../mnt/", dfd, "mnt/self"); + E_symlinkat("/mnt/", dfd, "mnt/absself"); + + E_mkdirat(dfd, "etc", 0755); + E_touchat(dfd, "etc/passwd"); + + E_symlinkat("/newfile3", dfd, "creatlink"); + E_symlinkat("etc/", dfd, "reletc"); + E_symlinkat("etc/passwd", dfd, "relsym"); + E_symlinkat("/etc/", dfd, "absetc"); + E_symlinkat("/etc/passwd", dfd, "abssym"); + E_symlinkat("/cheeky", dfd, "abscheeky"); + + E_mkdirat(dfd, "cheeky", 0755); + + E_symlinkat("/", dfd, "cheeky/absself"); + E_symlinkat("../../root/", dfd, "cheeky/self"); + E_symlinkat("/../../root/", dfd, "cheeky/garbageself"); + + E_symlinkat("../cheeky/../etc/../etc/passwd", dfd, "cheeky/passwd"); + E_symlinkat("/../cheeky/../etc/../etc/passwd", dfd, "cheeky/abspasswd"); + + E_symlinkat("../../../../../../../../../../../../../../etc/passwd", + dfd, "cheeky/dotdotlink"); + E_symlinkat("/../../../../../../../../../../../../../../etc/passwd", + dfd, "cheeky/garbagelink"); + + return dfd; +} + +struct basic_test { + const char *name; + const char *dir; + const char *path; + struct open_how how; + bool pass; + union { + int err; + const char 
*path; + } out; +}; + +#define NUM_OPENAT2_OPATH_TESTS 88 + +void test_openat2_opath_tests(void) +{ + int rootfd, hardcoded_fd; + char *procselfexe, *hardcoded_fdpath; + + E_asprintf(&procselfexe, "/proc/%d/exe", getpid()); + rootfd = setup_testdir(); + + hardcoded_fd = open("/dev/null", O_RDONLY); + E_assert(hardcoded_fd >= 0, "open fd to hardcode"); + E_asprintf(&hardcoded_fdpath, "self/fd/%d", hardcoded_fd); + + struct basic_test tests[] = { + /** RESOLVE_BENEATH **/ + /* Attempts to cross dirfd should be blocked. */ + { .name = "[beneath] jump to /", + .path = "/", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] absolute link to $root", + .path = "cheeky/absself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] chained absolute links to $root", + .path = "abscheeky/absself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] jump outside $root", + .path = "..", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] temporary jump outside $root", + .path = "../root/", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] symlink temporary jump outside $root", + .path = "cheeky/self", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] chained symlink temporary jump outside $root", + .path = "abscheeky/self", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] garbage links to $root", + .path = "cheeky/garbageself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] chained garbage links to $root", + .path = "abscheeky/garbageself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + /* Only relative paths that stay inside dirfd should work. 
*/ + { .name = "[beneath] ordinary path to 'root'", + .path = "root", .how.resolve = RESOLVE_BENEATH, + .out.path = "root", .pass = true }, + { .name = "[beneath] ordinary path to 'etc'", + .path = "etc", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc", .pass = true }, + { .name = "[beneath] ordinary path to 'etc/passwd'", + .path = "etc/passwd", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc/passwd", .pass = true }, + { .name = "[beneath] relative symlink inside $root", + .path = "relsym", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc/passwd", .pass = true }, + { .name = "[beneath] chained-'..' relative symlink inside $root", + .path = "cheeky/passwd", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc/passwd", .pass = true }, + { .name = "[beneath] absolute symlink component outside $root", + .path = "abscheeky/passwd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] absolute symlink target outside $root", + .path = "abssym", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] absolute path outside $root", + .path = "/etc/passwd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] cheeky absolute path outside $root", + .path = "cheeky/abspasswd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] chained cheeky absolute path outside $root", + .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + /* Tricky paths should fail. 
*/ + { .name = "[beneath] tricky '..'-chained symlink outside $root", + .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] tricky absolute + '..'-chained symlink outside $root", + .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] tricky garbage link outside $root", + .path = "cheeky/garbagelink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] tricky absolute + garbage link outside $root", + .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + + /** RESOLVE_IN_ROOT **/ + /* All attempts to cross the dirfd will be scoped-to-root. */ + { .name = "[in_root] jump to /", + .path = "/", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .name = "[in_root] absolute symlink to /root", + .path = "cheeky/absself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .name = "[in_root] chained absolute symlinks to /root", + .path = "abscheeky/absself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .name = "[in_root] '..' at root", + .path = "..", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .name = "[in_root] '../root' at root", + .path = "../root/", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .name = "[in_root] relative symlink containing '..' 
above root", + .path = "cheeky/self", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .name = "[in_root] garbage link to /root", + .path = "cheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .name = "[in_root] chainged garbage links to /root", + .path = "abscheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .name = "[in_root] relative path to 'root'", + .path = "root", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .name = "[in_root] relative path to 'etc'", + .path = "etc", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc", .pass = true }, + { .name = "[in_root] relative path to 'etc/passwd'", + .path = "etc/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] relative symlink to 'etc/passwd'", + .path = "relsym", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] chained-'..' relative symlink to 'etc/passwd'", + .path = "cheeky/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] chained-'..' 
absolute + relative symlink to 'etc/passwd'", + .path = "abscheeky/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] absolute symlink to 'etc/passwd'", + .path = "abssym", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] absolute path 'etc/passwd'", + .path = "/etc/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] cheeky absolute path 'etc/passwd'", + .path = "cheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] chained cheeky absolute path 'etc/passwd'", + .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky '..'-chained symlink outside $root", + .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky absolute + '..'-chained symlink outside $root", + .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky absolute path + absolute + '..'-chained symlink outside $root", + .path = "/../../../../abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky garbage link outside $root", + .path = "cheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky absolute + garbage link outside $root", + .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky absolute path + absolute + garbage link outside $root", + .path = "/../../../../abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + /* O_CREAT should handle trailing symlinks correctly. 
*/ + { .name = "[in_root] O_CREAT of relative path inside $root", + .path = "newfile1", .how.flags = O_CREAT, + .how.mode = 0700, + .how.resolve = RESOLVE_IN_ROOT, + .out.path = "newfile1", .pass = true }, + { .name = "[in_root] O_CREAT of absolute path", + .path = "/newfile2", .how.flags = O_CREAT, + .how.mode = 0700, + .how.resolve = RESOLVE_IN_ROOT, + .out.path = "newfile2", .pass = true }, + { .name = "[in_root] O_CREAT of tricky symlink outside root", + .path = "/creatlink", .how.flags = O_CREAT, + .how.mode = 0700, + .how.resolve = RESOLVE_IN_ROOT, + .out.path = "newfile3", .pass = true }, + + /** RESOLVE_NO_XDEV **/ + /* Crossing *down* into a mountpoint is disallowed. */ + { .name = "[no_xdev] cross into $mnt", + .path = "mnt", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] cross into $mnt/", + .path = "mnt/", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] cross into $mnt/.", + .path = "mnt/.", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + /* Crossing *up* out of a mountpoint is disallowed. */ + { .name = "[no_xdev] goto mountpoint root", + .dir = "mnt", .path = ".", .how.resolve = RESOLVE_NO_XDEV, + .out.path = "mnt", .pass = true }, + { .name = "[no_xdev] cross up through '..'", + .dir = "mnt", .path = "..", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] temporary cross up through '..'", + .dir = "mnt", .path = "../mnt", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] temporary relative symlink cross up", + .dir = "mnt", .path = "self", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] temporary absolute symlink cross up", + .dir = "mnt", .path = "absself", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + /* Jumping to "/" is ok, but later components cannot cross. 
*/ + { .name = "[no_xdev] jump to / directly", + .dir = "mnt", .path = "/", .how.resolve = RESOLVE_NO_XDEV, + .out.path = "/", .pass = true }, + { .name = "[no_xdev] jump to / (from /) directly", + .dir = "/", .path = "/", .how.resolve = RESOLVE_NO_XDEV, + .out.path = "/", .pass = true }, + { .name = "[no_xdev] jump to / then proc", + .path = "/proc/1", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] jump to / then tmp", + .path = "/tmp", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + /* Magic-links are blocked since they can switch vfsmounts. */ + { .name = "[no_xdev] cross through magic-link to self/root", + .dir = "/proc", .path = "self/root", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] cross through magic-link to self/cwd", + .dir = "/proc", .path = "self/cwd", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + /* Except magic-link jumps inside the same vfsmount. */ + { .name = "[no_xdev] jump through magic-link to same procfs", + .dir = "/proc", .path = hardcoded_fdpath, .how.resolve = RESOLVE_NO_XDEV, + .out.path = "/proc", .pass = true, }, + + /** RESOLVE_NO_MAGICLINKS **/ + /* Regular symlinks should work. */ + { .name = "[no_magiclinks] ordinary relative symlink", + .path = "relsym", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.path = "etc/passwd", .pass = true }, + /* Magic-links should not work. 
*/ + { .name = "[no_magiclinks] symlink to magic-link", + .path = "procexe", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_magiclinks] normal path to magic-link", + .path = "/proc/self/exe", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_magiclinks] normal path to magic-link with O_NOFOLLOW", + .path = "/proc/self/exe", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.path = procselfexe, .pass = true }, + { .name = "[no_magiclinks] symlink to magic-link path component", + .path = "procroot/etc", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_magiclinks] magic-link path component", + .path = "/proc/self/root/etc", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_magiclinks] magic-link path component with O_NOFOLLOW", + .path = "/proc/self/root/etc", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + + /** RESOLVE_NO_SYMLINKS **/ + /* Normal paths should work. */ + { .name = "[no_symlinks] ordinary path to '.'", + .path = ".", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = NULL, .pass = true }, + { .name = "[no_symlinks] ordinary path to 'root'", + .path = "root", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "root", .pass = true }, + { .name = "[no_symlinks] ordinary path to 'etc'", + .path = "etc", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "etc", .pass = true }, + { .name = "[no_symlinks] ordinary path to 'etc/passwd'", + .path = "etc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "etc/passwd", .pass = true }, + /* Regular symlinks are blocked. 
*/ + { .name = "[no_symlinks] relative symlink target", + .path = "relsym", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] relative symlink component", + .path = "reletc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] absolute symlink target", + .path = "abssym", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] absolute symlink component", + .path = "absetc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] cheeky garbage link", + .path = "cheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] cheeky absolute + garbage link", + .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] cheeky absolute + absolute symlink", + .path = "abscheeky/absself", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + /* Trailing symlinks with NO_FOLLOW. 
*/ + { .name = "[no_symlinks] relative symlink with O_NOFOLLOW", + .path = "relsym", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "relsym", .pass = true }, + { .name = "[no_symlinks] absolute symlink with O_NOFOLLOW", + .path = "abssym", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "abssym", .pass = true }, + { .name = "[no_symlinks] trailing symlink with O_NOFOLLOW", + .path = "cheeky/garbagelink", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "cheeky/garbagelink", .pass = true }, + { .name = "[no_symlinks] multiple symlink components with O_NOFOLLOW", + .path = "abscheeky/absself", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] multiple symlink (and garbage link) components with O_NOFOLLOW", + .path = "abscheeky/garbagelink", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + }; + + BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_OPATH_TESTS); + + for (int i = 0; i < ARRAY_LEN(tests); i++) { + int dfd, fd; + char *fdpath = NULL; + bool failed; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + struct basic_test *test = &tests[i]; + + if (!openat2_supported) { + ksft_print_msg("openat2(2) unsupported\n"); + resultfn = ksft_test_result_skip; + goto skip; + } + + /* Auto-set O_PATH. 
*/ + if (!(test->how.flags & O_CREAT)) + test->how.flags |= O_PATH; + + if (test->dir) + dfd = openat(rootfd, test->dir, O_PATH | O_DIRECTORY); + else + dfd = dup(rootfd); + E_assert(dfd, "failed to openat root '%s': %m", test->dir); + + E_dup2(dfd, hardcoded_fd); + + fd = sys_openat2(dfd, test->path, &test->how); + if (test->pass) + failed = (fd < 0 || !fdequal(fd, rootfd, test->out.path)); + else + failed = (fd != test->out.err); + if (fd >= 0) { + fdpath = fdreadlink(fd); + close(fd); + } + close(dfd); + + if (failed) { + resultfn = ksft_test_result_fail; + + ksft_print_msg("openat2 unexpectedly returned "); + if (fdpath) + ksft_print_msg("%d['%s']\n", fd, fdpath); + else + ksft_print_msg("%d (%s)\n", fd, strerror(-fd)); + } + +skip: + if (test->pass) + resultfn("%s gives path '%s'\n", test->name, + test->out.path ?: "."); + else + resultfn("%s fails with %d (%s)\n", test->name, + test->out.err, strerror(-test->out.err)); + + fflush(stdout); + free(fdpath); + } + + free(procselfexe); + close(rootfd); + + free(hardcoded_fdpath); + close(hardcoded_fd); +} + +#define NUM_TESTS NUM_OPENAT2_OPATH_TESTS + +int main(int argc, char **argv) +{ + ksft_print_header(); + ksft_set_plan(NUM_TESTS); + + /* NOTE: We should be checking for CAP_SYS_ADMIN here... */ + if (geteuid() != 0) + ksft_exit_skip("all tests require euid == 0\n"); + + test_openat2_opath_tests(); + + if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0) + ksft_exit_fail(); + else + ksft_exit_pass(); +} -- 2.24.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
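For readers who want to try one of the RESOLVE_BENEATH cases from the table above outside the kselftest harness, the following is a minimal userspace sketch, not part of the patch. The `demo_*` names are hypothetical, the struct is a local mirror of the three-u64 `struct open_how`, and the fallback syscall number 437 is an assumption that holds on x86-64 and the architectures sharing the unified syscall table.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback for older libc headers (assumption: unified syscall table). */
#ifndef SYS_openat2
#define SYS_openat2 437
#endif

/* Hypothetical local mirror of the UAPI struct open_how: three u64 fields. */
struct demo_open_how {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* Same value as RESOLVE_BENEATH in the UAPI header. */
#define DEMO_RESOLVE_BENEATH 0x08

static_assert(sizeof(struct demo_open_how) == 24,
	      "three u64 fields, no padding");

/*
 * Open 'path' relative to 'dirfd' as an O_PATH descriptor, refusing any
 * resolution that would leave the tree rooted at 'dirfd'.
 * Returns a file descriptor on success or -errno on failure.
 */
static int demo_open_beneath(int dirfd, const char *path)
{
	struct demo_open_how how = {
		.flags = O_PATH | O_CLOEXEC,
		.resolve = DEMO_RESOLVE_BENEATH,
	};
	long fd = syscall(SYS_openat2, dirfd, path, &how, sizeof(how));

	return fd < 0 ? -errno : (int)fd;
}
```

On a kernel with openat2(2), calling `demo_open_beneath(dfd, "/")` should fail with -EXDEV, matching the "[beneath] jump to /" entry in the test table; on older kernels the call is expected to return -ENOSYS instead.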
* Re: [PATCH v3 0/2] openat2: minor uapi cleanups
  2020-01-18 12:07 ` [PATCH v3 0/2] openat2: minor uapi cleanups Aleksa Sarai
  2020-01-18 12:07   ` [PATCH v3 1/2] open: introduce openat2(2) syscall Aleksa Sarai
  2020-01-18 12:08   ` [PATCH v3 2/2] selftests: add openat2(2) selftests Aleksa Sarai
@ 2020-01-18 15:28   ` Al Viro
  2020-01-18 18:09     ` Al Viro
  2 siblings, 1 reply; 92+ messages in thread
From: Al Viro @ 2020-01-18 15:28 UTC
To: Aleksa Sarai
Cc: Jeff Layton, J. Bruce Fields, Shuah Khan, Florian Weimer, David Laight, Christian Brauner, quae, dev, containers, libc-alpha, linux-api, linux-fsdevel, linux-kernel, linux-kselftest

On Sat, Jan 18, 2020 at 11:07:58PM +1100, Aleksa Sarai wrote:
> Patch changelog:
>  v3:
>   * Merge changes into the original patches to make Al's life easier.
>     [Al Viro]
>  v2:
>   * Add include <linux/types.h> to openat2.h. [Florian Weimer]
>   * Move OPEN_HOW_SIZE_* constants out of UAPI. [Florian Weimer]
>   * Switch from __aligned_u64 to __u64 since it isn't necessary.
>     [David Laight]
>  v1: <https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/>
>
> While openat2(2) is still not yet in Linus's tree, we can take this
> opportunity to iron out some small warts that weren't noticed earlier:
>
>   * A fix was suggested by Florian Weimer, to separate the openat2
>     definitions so glibc can use the header directly. I've put the
>     maintainership under VFS but let me know if you'd prefer it belong
>     to the fcntl folks.
>
>   * Having heterogeneous field sizes in an extensible struct results in
>     "padding hole" problems when adding new fields (in addition, the
>     correct error to use for non-zero padding isn't entirely clear).
>     The simplest solution is to just copy clone(3)'s model -- always use
>     u64s. It will waste a little more space in the struct, but it
>     removes a possible future headache.
>
> This patch is intended to replace the corresponding patches in Al's
> #work.openat2 tree (and *will not* apply on Linus' tree).
>
> @Al: I will send some additional patches later, but they will require
>      proper design review since they're ABI-related features (namely,
>      adding a way to check what features a syscall supports as I
>      outlined in my talk here[1]).

#work.openat2 updated, #for-next rebuilt and force-pushed. There's
a massive update of #work.namei as well, also pushed out; not in
#for-next yet, will post the patch series for review later today.
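The "padding hole" concern in the quoted cover letter can be illustrated with a small layout comparison. This is only a sketch: `demo_how_mixed` is a hypothetical mixed-width layout invented for illustration (it is not claimed to be any earlier revision of the struct), while `demo_how_u64` mirrors the all-u64 shape the cover letter proposes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixed-width layout: the compiler must pad after 'mode'. */
struct demo_how_mixed {
	uint64_t flags;
	uint16_t mode;
	/* implicit padding here; its size depends on the ABI */
	uint64_t resolve;
};

/* All-u64 layout: every byte of the struct belongs to a named field. */
struct demo_how_u64 {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

static_assert(offsetof(struct demo_how_u64, resolve) == 16,
	      "fields are packed back to back");

/* Bytes in the mixed layout that are not covered by any named field. */
static size_t demo_padding_mixed(void)
{
	return sizeof(struct demo_how_mixed)
	     - (sizeof(uint64_t) + sizeof(uint16_t) + sizeof(uint64_t));
}

/* Same computation for the all-u64 layout; expected to be zero. */
static size_t demo_padding_u64(void)
{
	return sizeof(struct demo_how_u64) - 3 * sizeof(uint64_t);
}
```

With the all-u64 layout there are no unnamed bytes whose contents the kernel would have to validate, which is the headache the cover letter wants to avoid when the struct is later extended.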
* Re: [PATCH v3 0/2] openat2: minor uapi cleanups
  2020-01-18 15:28 ` [PATCH v3 0/2] openat2: minor uapi cleanups Al Viro
@ 2020-01-18 18:09   ` Al Viro
  2020-01-18 23:03     ` Aleksa Sarai
  0 siblings, 1 reply; 92+ messages in thread
From: Al Viro @ 2020-01-18 18:09 UTC
To: Aleksa Sarai
Cc: Jeff Layton, J. Bruce Fields, Shuah Khan, Florian Weimer, David Laight, Christian Brauner, quae, dev, containers, libc-alpha, linux-api, linux-fsdevel, linux-kernel, linux-kselftest

On Sat, Jan 18, 2020 at 03:28:33PM +0000, Al Viro wrote:

> #work.openat2 updated, #for-next rebuilt and force-pushed. There's
> a massive update of #work.namei as well, also pushed out; not in
> #for-next yet, will post the patch series for review later today.

BTW, looking through that code again, how could this

static bool legitimize_root(struct nameidata *nd)
{
	/*
	 * For scoped-lookups (where nd->root has been zeroed), we need to
	 * restart the whole lookup from scratch -- because set_root() is wrong
	 * for these lookups (nd->dfd is the root, not the filesystem root).
	 */
	if (!nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED))
		return false;

possibly trigger? The only things that ever clean ->root.mnt are

1) failing legitimize_path(nd, &nd->root, nd->root_seq) in
legitimize_root() itself. If *ANY* legitimize_path() has failed, we are
through - RCU pathwalk is given up. In particular, if you look at the
call chains leading to legitimize_root(), you'll see that it's called by
unlazy_walk() or unlazy_child() and failure has either of those bugger
off immediately. The same goes for their callers; fail any of those and
we are done; the very next thing that will be done with that nameidata
is going to be terminate_walk(). We don't look at its fields, etc. -
just return to the top level ASAP and call terminate_walk() on it.
Which is where we run into

	if (nd->flags & LOOKUP_ROOT_GRABBED) {
		path_put(&nd->root);
		nd->flags &= ~LOOKUP_ROOT_GRABBED;
	}

paired with setting LOOKUP_ROOT_GRABBED just before the attempt to
legitimize in legitimize_root(). The next thing *after* terminate_walk()
is either path_init() or the end of life for that struct nameidata
instance.

This is really, really fundamental for understanding the whole thing -
a failure of unlazy_walk/unlazy_child means that we are through with
that attempt.

2) complete_walk() doing

	if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED)))
		nd->root.mnt = NULL;

Can't happen with LOOKUP_IS_SCOPED in flags, obviously.

3) path_init(). Where it's followed either by leaving through

	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
		....
	}

(and LOOKUP_IS_SCOPED includes LOOKUP_IN_ROOT) or with a failure exit
(no calls of *anything* but terminate_walk() after that) or with

	if (flags & LOOKUP_IS_SCOPED) {
		nd->root = nd->path;
		...

and that makes damn sure nd->root.mnt is not NULL.

And neither of the LOOKUP_IS_SCOPED bits ever gets changed in nd->flags -
they remain as path_init() has set them.

The same, BTW, goes for the check you've added in the beginning of
set_root() - set_root() is called only with NULL nd->root.mnt (trivial to
prove) and that is incompatible with LOOKUP_IS_SCOPED. I'm kinda-sorta
OK with having WARN_ON() there for a while, but IMO the check in the
beginning of legitimize_root() should go away - this kind of defensive
programming only makes it harder to reason about the behaviour of the
entire thing. And fs/namei.c is too convoluted as it is...
* Re: [PATCH v3 0/2] openat2: minor uapi cleanups
  2020-01-18 18:09 ` Al Viro
@ 2020-01-18 23:03   ` Aleksa Sarai
  2020-01-19  1:12     ` Al Viro
  0 siblings, 1 reply; 92+ messages in thread
From: Aleksa Sarai @ 2020-01-18 23:03 UTC
To: Al Viro
Cc: Jeff Layton, J. Bruce Fields, Shuah Khan, Florian Weimer, David Laight, Christian Brauner, quae, dev, containers, libc-alpha, linux-api, linux-fsdevel, linux-kernel, linux-kselftest

On 2020-01-18, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jan 18, 2020 at 03:28:33PM +0000, Al Viro wrote:
>
> > #work.openat2 updated, #for-next rebuilt and force-pushed. There's
> > a massive update of #work.namei as well, also pushed out; not in
> > #for-next yet, will post the patch series for review later today.
>
> BTW, looking through that code again, how could this
> static bool legitimize_root(struct nameidata *nd)
> {
>	/*
>	 * For scoped-lookups (where nd->root has been zeroed), we need to
>	 * restart the whole lookup from scratch -- because set_root() is wrong
>	 * for these lookups (nd->dfd is the root, not the filesystem root).
>	 */
>	if (!nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED))
>		return false;
>
> possibly trigger? The only things that ever clean ->root.mnt are

You're quite right -- the codepath I was worried about was pick_link()
failing (which *does* clear nd->path.mnt, and I must've misread it at
the time as nd->root.mnt).

We can drop this check, though now complete_walk()'s main defence
against a NULL nd->root.mnt is that path_is_under() will fail and
trigger -EXDEV (or set_root() will fail at some point in the future).
However, as you pointed out, a NULL nd->root.mnt won't happen with
things as they stand today -- I might be a little too paranoid. :P

> This is really, really fundamental for understanding the whole
> thing - a failure of unlazy_walk/unlazy_child means that we are through
> with that attempt.

Yup -- see above, the worry was about pick_link(), not about how the
RCU-walk and REF-walk dances operate.

> The same, BTW, goes for the check you've added in the beginning of
> set_root() - set_root() is called only with NULL nd->root.mnt (trivial to
> prove) and that is incompatible with LOOKUP_IS_SCOPED. I'm kinda-sorta
> OK with having WARN_ON() there for a while, but IMO the check in the
> beginning of legitimize_root() should go away -

You're quite right about dropping the legitimize_root() check, but I'd
like to keep the WARN_ON() in set_root(). The main reason being that it
makes us very damn sure that a future change won't accidentally break
the nd->root contract which all of the LOOKUP_IS_SCOPED changes rely on.
Then again, this might be my paranoia popping up again.

> this kind of defensive programming only makes it harder to reason about
> the behaviour of the entire thing. And fs/namei.c is too convoluted
> as it is...

If you feel that dropping some of these more defensive checks is better
for the codebase as a whole, then I defer to your judgement. I
completely agree that namei is a pretty complicated chunk of code.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
* Re: [PATCH v3 0/2] openat2: minor uapi cleanups
  2020-01-18 23:03 ` Aleksa Sarai
@ 2020-01-19  1:12   ` Al Viro
  0 siblings, 0 replies; 92+ messages in thread
From: Al Viro @ 2020-01-19 1:12 UTC
To: Aleksa Sarai
Cc: Jeff Layton, J. Bruce Fields, Shuah Khan, Florian Weimer, David Laight, Christian Brauner, quae, dev, containers, libc-alpha, linux-api, linux-fsdevel, linux-kernel, linux-kselftest

On Sun, Jan 19, 2020 at 10:03:13AM +1100, Aleksa Sarai wrote:

> > possibly trigger? The only things that ever clean ->root.mnt are
>
> You're quite right -- the codepath I was worried about was pick_link()
> failing (which *does* clear nd->path.mnt, and I must've misread it at
> the time as nd->root.mnt).

pick_link() (allocation failure of external stack in RCU case, followed
by failure to legitimize the link) is, unfortunately, subtle and nasty.
We *must* path_put() the link; if we'd managed to legitimize the mount
and failed on dentry, the mount needs to be dropped. No way around it.
And while everything else there can be left for soon-to-be-reached
terminate_walk(), this cannot. We have no good way to pass what we need
to drop to the place where that eventual terminate_walk() drops
rcu_read_lock(). So we end up having to do what terminate_walk()
would've done and do it right there, so we could do that path_put(link)
before we bugger off.

I'm not happy about that, but I don't see cleaner solutions, more's the
pity. However, it doesn't mess with ->root - nor should it, since we
don't have LOOKUP_ROOT_GRABBED (not in RCU mode), so it can and should
be left alone.

> We can drop this check, though now complete_walk()'s main defence
> against a NULL nd->root.mnt is that path_is_under() will fail and
> trigger -EXDEV (or set_root() will fail at some point in the future).
> However, as you pointed out, a NULL nd->root.mnt won't happen with
> things as they stand today -- I might be a little too paranoid. :P

The only reason why complete_walk() zeroes nd->root in some cases is
microoptimization - we *know* we won't be using it later, so we don't
care whether it's stale or not and can spare unlazy_walk() a bit of
work. That's all there is to that one.

I don't see any reason for adding code that would clear nd->root in
later work; if such a thing does get added (again, I don't see what
purpose that could possibly serve), we'll need to watch out for a lot
of things. Starting with the LOOKUP_ROOT case... It's not something
likely to slip in unnoticed.
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-14  4:57 ` Al Viro
  2020-01-14  5:12   ` Al Viro
  2020-01-14 20:01   ` Aleksa Sarai
@ 2020-01-15 13:57   ` Aleksa Sarai
  2020-01-19  3:14     ` [RFC][PATCHSET][CFT] pathwalk cleanups and fixes Al Viro
  2 siblings, 1 reply; 92+ messages in thread
From: Aleksa Sarai @ 2020-01-15 13:57 UTC
To: Al Viro
Cc: Linus Torvalds, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On 2020-01-14, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 1) do you see any problems on your testcases with the current #fixes?
> That's commit 7a955b7363b8 as branch tip.

I just finished testing the few cases I reported earlier and they both
appear to be fixed with the current #work.namei branch. And I don't have
any troubles booting whatsoever.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
* [RFC][PATCHSET][CFT] pathwalk cleanups and fixes 2020-01-15 13:57 ` [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Aleksa Sarai @ 2020-01-19 3:14 ` Al Viro 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro 2020-01-19 14:33 ` [RFC][PATCHSET][CFT] pathwalk cleanups and fixes Ian Kent 0 siblings, 2 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:14 UTC (permalink / raw) To: Aleksa Sarai Cc: Linus Torvalds, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent OK, vfs.git #work.namei seems to survive xfstests. I think it cleans the things quite a bit, but it obviously needs more review and testing. Review and testing would be _very_ welcome; it does a lot of massage, so there had been a plenty of opportunities to fuck up and fail to spot that. The same goes for profiling - it doesn't seem to slow the things down, but that needs to be verified. It does include #work.openat2. Topology: 17 commits, followed by clean merge with #work.openat2, followed by 9 followups. The part is #work.openat2 is as posted by Aleksa; I can repost it, but I don't see much point. Description of the rest follows; patches themselves will be in followups. part 1: follow_automount() cleanups and fixes. Quite a bit of that function had been about working around the wrong calling conventions of finish_automount(). The problem is that finish_automount() misuses the primitive intended for mount(2) and friends, where we want to mount on top of the pile, even if something has managed to add to that while we'd been trying to lock the namespace. For automount that's not the right thing to do - there we want to discard whatever it was going to attach and just cross into what got mounted there in the meanwhile (most likely - the results of the same automount triggered by somebody else). 
Current mainline kinda-sorta manages to do that, but it's unreliable and very convoluted. Much simpler approach is to stop using lock_mount() in finish_automount() and have it bail out if something turns out to have been mounted on top where we wanted to attach. That allows to get rid of a lot of PITA in the caller. Another simplification comes from not trying to cross into the results of automount - simply ride through the next iteration of the loop and let it move into overmount. Another thing in the same series is divorcing follow_automount() from nameidata; that'll play later when we get to unifying follow_down() with the guts of follow_managed(). 4 commits, the second one fixes a hard-to-hit race. The first is a prereq for it. 1/17 do_add_mount(): lift lock_mount/unlock_mount into callers 2/17 fix automount/automount race properly 3/17 follow_automount(): get rid of dead^Wstillborn code 4/17 follow_automount() doesn't need the entire nameidata part 2: unifying mount traversals in pathwalk. Handling of mount traversal (follow_managed()) is currently called in a bunch of places. Each of them is shortly followed by a call of step_into() or an open-coded equivalent thereof. However, the locations of those step_into() calls are far from preceding follow_managed(); moreover, that preceding call might happen on different paths that converge to given step_into() call. It's harder to analyse that it should be (especially when it comes to liveness analysis) and it forces rather ugly calling conventions on lookup_fast()/atomic_open()/lookup_open(). The series below massages the code to the point when the calls of follow_managed() (and __follow_mount_rcu()) move into the beginning of step_into(). 5/17 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW gets EEXIST handling in do_last() past the step_into() call there. 
6/17 handle_mounts(): start building a sane wrapper for follow_managed()
	rather than mangling follow_managed() itself (and creating conflicts with openat2 series), add a wrapper that will absorb the required interface changes.
7/17 atomic_open(): saner calling conventions (return dentry on success)
	struct path passed to it is pure out parameter; only dentry part ever varies, though - mnt is always nd->path.mnt. Just return the dentry on success, and ERR_PTR(-E...) on failure.
8/17 lookup_open(): saner calling conventions (return dentry on success)
	propagate the same change one level up the call chain.
9/17 do_last(): collapse the call of path_to_nameidata()
	struct path filled in lookup_open() call is eventually given to handle_mounts(); the only use it has before that is path_to_nameidata() call in "->atomic_open() has actually opened it" case, and there path_to_nameidata() is an overkill - we are guaranteed to replace only nd->path.dentry. So have the struct path filled only immediately prior to handle_mounts().
10/17 handle_mounts(): pass dentry in, turn path into a pure out argument
	now all callers of handle_mounts() are directly preceded by filling struct path it gets. path->mnt is nd->path.mnt in all cases, so we can pass just the dentry instead and fill path in handle_mounts() itself. Some boilerplate gone, path is pure out argument of handle_mounts() now.
11/17 lookup_fast(): consolidate the RCU success case
	massage to gather what will become an RCU case equivalent of handle_mounts(); basically, that's what we do if revalidate succeeds in RCU case of lookup_fast(), including unlazy and fallback to handle_mounts() if __follow_mount_rcu() says "it's too tricky".
12/17 teach handle_mounts() to handle RCU mode
	... and take that into handle_mounts() itself. The other caller of __follow_mount_rcu() is fine with the same fallback (it just didn't bother since it's in the very beginning of pathwalk), switched to handle_mounts() as well.
13/17 lookup_fast(): take mount traversal into callers
	Now we are getting somewhere - both RCU and non-RCU success cases of lookup_fast() end with the same return handle_mounts(...); move that to the callers - there it will merge with the identical calls that had been on the paths where we had to do slow lookups. lookup_fast() returns dentry now.
14/17 new step_into() flag: WALK_NOFOLLOW
	use step_into() instead of open-coding it in handle_lookup_down(). Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW" for that (and eventually, I hope, for .. handling). Now *all* calls of handle_mounts() and step_into() are right next to each other.
15/17 fold handle_mounts() into step_into()
	... and we can move the call of handle_mounts() into step_into(), getting slightly saner calling conventions out of that.
16/17 LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()
	another payoff from 14/17 - we can teach path_lookupat() to do what path_mountpointat() used to. And kill the latter, along with its wrappers.
17/17 expand the only remaining call of path_lookup_conditional()
	minor cleanup - RIP path_lookup_conditional(). Only one caller left.

At that point we run out of things that can be done without textual conflicts with openat2 series. Changes so far:
* mount traversal is taken into step_into().
* lookup_fast(), atomic_open() and lookup_open() calling conventions are slightly changed. All of them return dentry now, instead of returning an int and filling struct path on success. For lookup_fast() the old "0 for cache miss, 1 for cache hit" is replaced with "NULL stands for cache miss, dentry - for hit".
* step_into() can be called in RCU mode as well. Takes nameidata, WALK_... flags, dentry and, in RCU case, corresponding inode and seq value. Handles mount traversals, decides whether it's a symlink to be followed.
  Error => returns -E...; symlink to follow => returns 1, puts symlink on stack; non-symlink or symlink not to follow => returns 0, moves nd->path to new location.
* LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends became calls of user_path_at() et al. with LOOKUP_MOUNTPOINT in flags.

Next comes the merge with Aleksa's openat2 patchset; everything up to that point had been non-conflicting with it. That patchset has been posted earlier; it's in #work.openat2. The next series comes on top of the merge.

part 3: untangling the symlink handling.

Right now when we decide to follow a symlink it happens this way:
* step_into() decides that it has been given a symlink that needs to be followed.
* it calls pick_link(), which pushes the symlink on stack and returns 1 on success / -E... on error. Symlink's mount/dentry/seq is stored on stack and the inode is stashed in nd->link_inode.
* step_into() passes that 1 to its callers, which proceed to pass it up the call chain for several layers. In all cases we get to get_link() call shortly afterwards.
* get_link() is called, picks the inode stashed in nd->link_inode by the pick_link(), does some checks, touches the atime, etc.
* get_link() either picks the link body out of inode or calls ->get_link(). If it's an absolute symlink, we move to the root and return the relative portion of the body; if it's a relative one - just return the body. If it's a procfs-style one, the call of nd_jump_link() has been made and we'd moved to whatever location is desired. And return NULL, same as we do for symlink to "/".
* the caller proceeds to deal with the string returned to it.

The sequence is the same in all cases (nested symlink, trailing symlink on lookup, trailing symlink on open), but its pieces are not close to each other and the bit between the call of pick_link() and (inevitable) call of get_link() afterwards is not easy to follow.
Moreover, a bunch of functions (walk_component/lookup_last/do_last) ends up with the same conventions for return values as step_into(). And those conventions (see above) are not pretty - 0/1/-E... is asking for mistakes, especially when returned 1 is used only to direct control flow on a rather twisted way to matching get_link() call.

And that path can be seriously twisted. E.g. when we are trying to open /dev/stdin, we get the following sequence:
* path_init() has put us into root and returned "/dev/stdin"
* link_path_walk() has eventually reached /dev and left <LAST_NORM, "stdin"> in nd->last_type/nd->last
* we call do_last(), which sees that we have LAST_NORM and calls lookup_fast(). Let's assume that everything is in dcache; we get the dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the damn thing. Into the stack it goes and we return 1.
* do_last() sees 1 and returns it.
* trailing_symlink() is called (in the top-level loop) and it calls get_link(). OK, we get "/proc/self/fd/0" for body, move to root again and return "proc/self/fd/0".
* link_path_walk() is given that string, eventually leading us into /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
* do_last() is called, and similar to the previous case we eventually reach the call of step_into() with dentry of /proc/self/fd/0.
* _now_ we can discard /dev/stdin from the stack (we'd been using its body until now). It's dropped (from step_into()) and we get to look at what we'd been given. A symlink to follow, so on the stack it goes and we return 1.
* again, do_last() passes 1 to caller
* trailing_symlink() is called and calls get_link().
* this time it's a procfs symlink and its ->get_link() method moves us to the mount/dentry of our stdin. And returns NULL. But the fun doesn't stop yet.
* trailing_symlink() returns "" to the caller
* link_path_walk() is called on that and does nothing whatsoever.
* do_last() is called and sees LAST_BIND left by the get_link(). It calls handle_dots()
* handle_dots() drops the symlink from stack and returns
* do_last() *FINALLY* proceeds to the point after its call of step_into() (finish_open:) and gets around to opening the damn thing.

Making sense of the control flow through all of that is not fun, to put it mildly; debugging anything in that area can be a massive PITA, and this example has touched only one of 3 cases. Arguably, the worst one, but... Anyway, it turns out that this code can be massaged to considerably saner shape - both in terms of control flow and wrt calling conventions.

1/9 merging pick_link() with get_link(), part 1
	prep work: move the "hardening" crap from trailing_symlink() into get_link() (conditional on the absence of LOOKUP_PARENT in nd->flags). We'll be moving the calls of get_link() around quite a bit through that series, and the next step will be to eliminate trailing_symlink().
2/9 merging pick_link() with get_link(), part 2
	fold trailing_symlink() into lookup_last() and do_last(). Now these are returning strings; it's not the final calling conventions, but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...) => old -E..., we'd failed. string => old 1, and the string is the symlink body to follow. Just as for trailing_symlink(), "/" and procfs ones (where get_link() returns NULL) yield "", so the ugly song and dance with no-op trip through link_path_walk()/handle_dots() still remains.
3/9 merging pick_link() with get_link(), part 3
	elimination of that round-trip. In *all* cases having get_link() return NULL on such symlinks means that we'll proceed to drop the symlink from stack and get back to the point near that get_link() call - basically, where we would be if it hadn't been a symlink at all.
	The path by which we are getting there depends upon the call site; the end result is the same in all cases - such symlinks (procfs ones and symlink to "/") are fully processed by the time get_link() returns, so we could as well drop them from the stack right in get_link(). Makes life simpler in terms of control flow analysis... And now the calling conventions for do_last() and lookup_last() have reached the final shape - ERR_PTR(-E...) for error, NULL for "we are done", string for "traverse this".
4/9 merging pick_link() with get_link(), part 4
	now all calls of walk_component() are followed by the same boilerplate - "if it has returned 1, call get_link() and if that has returned NULL treat that as if walk_component() has returned 0". Eliminate by folding that into walk_component() itself. Now walk_component() return value conventions have joined those of do_last()/lookup_last().
5/9 merging pick_link() with get_link(), part 5
	same as for the previous, only this time the boilerplate migrates one level down, into step_into(). Only one caller of get_link() left, step_into() has joined the same return value conventions.
6/9 merging pick_link() with get_link(), part 6
	move that thing into pick_link(). Now all traces of "return 1 if we are following a symlink" are gone.
7/9 finally fold get_link() into pick_link()
	ta-da - expand get_link() into the only caller. As a side benefit, we get rid of stashing the inode in nd->link_inode - it was done only to carry that piece of information from pick_link() to eventual get_link(). That's not the main benefit, though - the control flow became considerably easier to reason about.

For what it's worth, the example above (/dev/stdin) becomes
* path_init() has put us into root and returned "/dev/stdin"
* link_path_walk() has eventually reached /dev and left <LAST_NORM, "stdin"> in nd->last_type/nd->last
* we call do_last(), which sees that we have LAST_NORM and calls lookup_fast().
  Let's assume that everything is in dcache; we get the dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the damn thing. On the stack it goes and we get its body. Which is "/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
* do_last() sees non-NULL and returns it - whether it's an error or a pathname to traverse, we hadn't reached something we'll be opening.
* link_path_walk() is given that string, eventually leading us into /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
* do_last() is called, and similar to the previous case we eventually reach the call of step_into() with dentry of /proc/self/fd/0.
* _now_ we can discard /dev/stdin from the stack (we'd been using its body until now). It's dropped (from step_into()) and we get to look at what we'd been given. A symlink to follow, so on the stack it goes. This time it's a procfs symlink and its ->get_link() method moves us to the mount/dentry of our stdin. And returns NULL. So we drop symlink from stack and return that NULL to caller.
* that NULL is returned by step_into(), same as if we had just moved to a non-symlink.
* do_last() proceeds to open the damn thing.

part 4. some mount traversal cleanups.

8/9 massage __follow_mount_rcu() a bit
	make it more similar to non-RCU counterpart
9/9 new helper: traverse_mounts()
	the guts of follow_managed() are very similar to follow_down(). The calling conventions are different (follow_managed() works with nameidata, follow_down() - with standalone struct path), but the core loop is pretty much the same in both. Turned that loop into a common helper (traverse_mounts()) and since follow_managed() becomes a very thin wrapper around it, expand follow_managed() at its only call site (in handle_mounts()).

That's where the series stands right now.
FWIW, at 5.5-rc1 fs/namei.c had been 4867 lines, at the tip of #work.openat2 - 4998, at the tip of #work.namei (containing #work.openat2) - 4730... And IMO the thing has become considerably easier to follow. What's more, it might be possible to untangle the control flow in do_last() now. Probably a separate series, though - do_last() is one hell of a tarpit, so I'm not stepping into it for the rest of this cycle...
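The "ERR_PTR(-E...) for error, NULL for we are done, string for traverse this" convention described in part 3 can be illustrated with a small userspace sketch. Nothing below is kernel code: the `toy_*` names, the link table and the simplified pointer-encoding helpers are all hypothetical stand-ins, and the limit of 40 only mirrors the `total_link_count` check visible elsewhere in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Simplified userspace stand-ins for ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define TOY_MAX_ERRNO 4095

static inline const char *toy_err_ptr(long err)
{
	return (const char *)(intptr_t)err;
}

static inline int toy_is_err(const char *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

static inline long toy_ptr_err(const char *p)
{
	return (long)(intptr_t)p;
}

/* Hypothetical table of "symlinks": name -> body. */
struct toy_link {
	const char *name;
	const char *body;
};

static const struct toy_link toy_links[] = {
	{ "stdin", "fd0"  },	/* stdin -> fd0 */
	{ "fd0",   "tty"  },	/* fd0 -> tty */
	{ "loop",  "loop" },	/* points at itself */
	{ NULL,    NULL   },
};

/*
 * Handle the last component:
 *   NULL        => we are done, name is not a symlink
 *   ERR_PTR(-E) => we have failed
 *   string      => symlink body the caller should traverse next
 */
static const char *toy_last(const char *name)
{
	const struct toy_link *l;

	if (!*name)
		return toy_err_ptr(-ENOENT);
	for (l = toy_links; l->name; l++)
		if (!strcmp(l->name, name))
			return l->body;
	return NULL;
}

/* Top-level loop: keep traversing for as long as we are handed a string. */
static int toy_walk(const char *name, const char **result)
{
	const char *s;
	int depth = 0;

	assert(result != NULL);
	while ((s = toy_last(name)) != NULL) {
		if (toy_is_err(s))
			return (int)toy_ptr_err(s);
		if (++depth >= 40)
			return -ELOOP;
		name = s;
	}
	*result = name;
	return 0;
}
```

With a single pointer-valued result the caller needs only one test per outcome, and there is no separate "returned 1, now go find the matching get_link() call" state to carry up the call chain.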
* [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers
  2020-01-19  3:14 ` [RFC][PATCHSET][CFT] pathwalk cleanups and fixes Al Viro
@ 2020-01-19  3:17   ` Al Viro
  2020-01-19  3:17     ` [PATCH 02/17] fix automount/automount race properly Al Viro
                       ` (25 more replies)
  2020-01-19 14:33   ` [RFC][PATCHSET][CFT] pathwalk cleanups and fixes Ian Kent
  1 sibling, 26 replies; 92+ messages in thread
From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw)
To: linux-fsdevel
Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro

From: Al Viro <viro@zeniv.linux.org.uk>

preparation to finish_automount() fix (next commit)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 47 ++++++++++++++++++++++++-----------------------
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2fd0c8bcb8c1..5f0a80f17651 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2697,45 +2697,32 @@ static int do_move_mount_old(struct path *path, const char *old_name)
 /*
  * add a mount into a namespace's mount tree
  */
-static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
+static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
+			struct path *path, int mnt_flags)
 {
-	struct mountpoint *mp;
-	struct mount *parent;
-	int err;
+	struct mount *parent = real_mount(path->mnt);
 
 	mnt_flags &= ~MNT_INTERNAL_FLAGS;
 
-	mp = lock_mount(path);
-	if (IS_ERR(mp))
-		return PTR_ERR(mp);
-
-	parent = real_mount(path->mnt);
-	err = -EINVAL;
 	if (unlikely(!check_mnt(parent))) {
 		/* that's acceptable only for automounts done in private ns */
 		if (!(mnt_flags & MNT_SHRINKABLE))
-			goto unlock;
+			return -EINVAL;
 		/* ... and for those we'd better have mountpoint still alive */
 		if (!parent->mnt_ns)
-			goto unlock;
+			return -EINVAL;
 	}
 
 	/* Refuse the same filesystem on the same mount point */
-	err = -EBUSY;
 	if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
 	    path->mnt->mnt_root == path->dentry)
-		goto unlock;
+		return -EBUSY;
 
-	err = -EINVAL;
 	if (d_is_symlink(newmnt->mnt.mnt_root))
-		goto unlock;
+		return -EINVAL;
 
 	newmnt->mnt.mnt_flags = mnt_flags;
-	err = graft_tree(newmnt, parent, mp);
-
-unlock:
-	unlock_mount(mp);
-	return err;
+	return graft_tree(newmnt, parent, mp);
 }
 
 static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags);
@@ -2748,6 +2735,7 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
 			   unsigned int mnt_flags)
 {
 	struct vfsmount *mnt;
+	struct mountpoint *mp;
 	struct super_block *sb = fc->root->d_sb;
 	int error;
 
@@ -2768,7 +2756,13 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
 
 	mnt_warn_timestamp_expiry(mountpoint, mnt);
 
-	error = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+	mp = lock_mount(mountpoint);
+	if (IS_ERR(mp)) {
+		mntput(mnt);
+		return PTR_ERR(mp);
+	}
+	error = do_add_mount(real_mount(mnt), mp, mountpoint, mnt_flags);
+	unlock_mount(mp);
 	if (error < 0)
 		mntput(mnt);
 	return error;
@@ -2830,6 +2824,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
 int finish_automount(struct vfsmount *m, struct path *path)
 {
 	struct mount *mnt = real_mount(m);
+	struct mountpoint *mp;
 	int err;
 	/* The new mount record should have at least 2 refs to prevent it being
 	 * expired before we get a chance to add it
@@ -2842,7 +2837,13 @@ int finish_automount(struct vfsmount *m, struct path *path)
 		goto fail;
 	}
 
-	err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+	mp = lock_mount(path);
+	if (IS_ERR(mp)) {
+		err = PTR_ERR(mp);
+		goto fail;
+	}
+	err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+	unlock_mount(mp);
 	if (!err)
 		return 0;
 fail:
-- 
2.20.1
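The shape of this refactoring (callers take and drop the lock, the callee assumes it is held and can use plain returns instead of a `goto unlock` ladder) can be sketched in userspace as follows. This is only an illustration of the pattern: `struct toy_mountpoint` and every `toy_*` function are hypothetical, and the "lock" is a boolean flag rather than anything resembling lock_mount().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a mountpoint guarded by a lock. */
struct toy_mountpoint {
	bool locked;
	int nr_mounts;
};

static int toy_lock(struct toy_mountpoint *mp)
{
	if (mp->locked)
		return -EBUSY;
	mp->locked = true;
	return 0;
}

static void toy_unlock(struct toy_mountpoint *mp)
{
	assert(mp->locked);
	mp->locked = false;
}

/*
 * Callee after the lift: it never touches the lock itself, so every
 * failure is a plain return with no cleanup label needed.
 */
static int toy_add_mount(struct toy_mountpoint *mp, bool root_is_symlink)
{
	assert(mp->locked);	/* the caller must hold the lock */
	if (root_is_symlink)
		return -EINVAL;
	mp->nr_mounts++;
	return 0;
}

/* Caller: lock, call, unlock. Each caller is now free to lock its own way. */
static int toy_new_mount(struct toy_mountpoint *mp, bool root_is_symlink)
{
	int err = toy_lock(mp);

	if (err)
		return err;
	err = toy_add_mount(mp, root_is_symlink);
	toy_unlock(mp);
	return err;
}
```

The point of the lift is the last comment: once locking lives in the callers, one of them can replace it with a different locking sequence without touching the shared callee.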
* [PATCH 02/17] fix automount/automount race properly
  2020-01-19  3:17   ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro
@ 2020-01-19  3:17     ` Al Viro
  2020-01-30 14:34       ` Christian Brauner
  2020-01-19  3:17     ` [PATCH 03/17] follow_automount(): get rid of dead^Wstillborn code Al Viro
                       ` (24 subsequent siblings)
  25 siblings, 1 reply; 92+ messages in thread
From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw)
To: linux-fsdevel
Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro

From: Al Viro <viro@zeniv.linux.org.uk>

Protection against automount/automount races (two threads hitting the same
referral point at the same time) is based upon do_add_mount() prevention of
identical overmounts - trying to overmount the root of mounted tree with
the same tree fails with -EBUSY. It's unreliable (the other thread might've
mounted something on top of the automount it has triggered) *and* causes
no end of headache for follow_automount() and its caller, since
finish_automount() behaves like do_new_mount() - if the mountpoint to be is
overmounted, it mounts on top what's overmounting it. It's not only wrong
(we want to go into what's overmounting the automount point and quietly
discard what we planned to mount there), it introduces the possibility of
original parent mount getting dropped. That's what 8aef18845266 (VFS: Fix
vfsmount overput on simultaneous automount) deals with, but it can't do
anything about the reliability of conflict detection - if something had
been overmounted the other thread's automount (e.g. that other thread
having stepped into automount in mount(2)), we don't get that -EBUSY and
the result is
	referral point under automounted NFS under explicit overmount
	under another copy of automounted NFS

What we need is finish_automount() *NOT* digging into overmounts - if it
finds one, it should just quietly discard the thing it was asked to mount.
And don't bother with actually crossing into the results of finish_automount() -
the same loop that calls follow_automount() will do that just fine on the
next iteration.

IOW, instead of calling lock_mount() have finish_automount() do it manually,
_without_ the "move into overmount and retry" part. And leave crossing into
the results to the caller of follow_automount(), which simplifies it a lot.

Moral: if you end up with a lot of glue working around the calling conventions
of something, perhaps these calling conventions are simply wrong...

Fixes: 8aef18845266 (VFS: Fix vfsmount overput on simultaneous automount)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namei.c     | 29 ++++-------------------------
 fs/namespace.c | 41 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 38 insertions(+), 32 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index d2720dc71d0e..bd036dfdb0d9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1133,11 +1133,9 @@ EXPORT_SYMBOL(follow_up);
  * - return -EISDIR to tell follow_managed() to stop and return the path we
  *   were called with.
  */
-static int follow_automount(struct path *path, struct nameidata *nd,
-			    bool *need_mntput)
+static int follow_automount(struct path *path, struct nameidata *nd)
 {
 	struct vfsmount *mnt;
-	int err;
 
 	if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
 		return -EREMOTE;
@@ -1178,29 +1176,10 @@ static int follow_automount(struct path *path, struct nameidata *nd,
 		return PTR_ERR(mnt);
 	}
 
-	if (!mnt) /* mount collision */
-		return 0;
-
-	if (!*need_mntput) {
-		/* lock_mount() may release path->mnt on error */
-		mntget(path->mnt);
-		*need_mntput = true;
-	}
-	err = finish_automount(mnt, path);
-
-	switch (err) {
-	case -EBUSY:
-		/* Someone else made a mount here whilst we were busy */
+	if (!mnt)
 		return 0;
-	case 0:
-		path_put(path);
-		path->mnt = mnt;
-		path->dentry = dget(mnt->mnt_root);
-		return 0;
-	default:
-		return err;
-	}
 
+	return finish_automount(mnt, path);
 }
 
 /*
@@ -1258,7 +1237,7 @@ static int follow_managed(struct path *path, struct nameidata *nd)
 
 		/* Handle an automount point */
 		if (flags & DCACHE_NEED_AUTOMOUNT) {
-			ret = follow_automount(path, nd, &need_mntput);
+			ret = follow_automount(path, nd);
 			if (ret < 0)
 				break;
 			continue;
diff --git a/fs/namespace.c b/fs/namespace.c
index 5f0a80f17651..f1817eb5f87d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2823,6 +2823,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
 
 int finish_automount(struct vfsmount *m, struct path *path)
 {
+	struct dentry *dentry = path->dentry;
 	struct mount *mnt = real_mount(m);
 	struct mountpoint *mp;
 	int err;
@@ -2832,21 +2833,47 @@ int finish_automount(struct vfsmount *m, struct path *path)
 	BUG_ON(mnt_get_count(mnt) < 2);
 
 	if (m->mnt_sb == path->mnt->mnt_sb &&
-	    m->mnt_root == path->dentry) {
+	    m->mnt_root == dentry) {
 		err = -ELOOP;
-		goto fail;
+		goto discard;
 	}
 
-	mp = lock_mount(path);
+	/*
+	 * we don't want to use lock_mount() - in this case finding something
+	 * that overmounts our mountpoint to be means "quitely drop what we've
+	 * got", not "try to mount it on top".
+	 */
+	inode_lock(dentry->d_inode);
+	if (unlikely(cant_mount(dentry))) {
+		err = -ENOENT;
+		goto discard1;
+	}
+	namespace_lock();
+	rcu_read_lock();
+	if (unlikely(__lookup_mnt(path->mnt, dentry))) {
+		rcu_read_unlock();
+		err = 0;
+		goto discard2;
+	}
+	rcu_read_unlock();
+	mp = get_mountpoint(dentry);
 	if (IS_ERR(mp)) {
 		err = PTR_ERR(mp);
-		goto fail;
+		goto discard2;
 	}
+
 	err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
 	unlock_mount(mp);
-	if (!err)
-		return 0;
-fail:
+	if (unlikely(err))
+		goto discard;
+	mntput(m);
+	return 0;
+
+discard2:
+	namespace_unlock();
+discard1:
+	inode_unlock(dentry->d_inode);
+discard:
 	/* remove m from any expiration list it may be on */
 	if (!list_empty(&mnt->mnt_expire)) {
 		namespace_lock();
-- 
2.20.1
* Re: [PATCH 02/17] fix automount/automount race properly
  2020-01-19  3:17 ` [PATCH 02/17] fix automount/automount race properly Al Viro
@ 2020-01-30 14:34 ` Christian Brauner
  0 siblings, 0 replies; 92+ messages in thread
From: Christian Brauner @ 2020-01-30 14:34 UTC (permalink / raw)
To: Al Viro
Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman

On Sun, Jan 19, 2020 at 03:17:14AM +0000, Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
>
> Protection against automount/automount races (two threads hitting the same
> referral point at the same time) is based upon do_add_mount() prevention of
> identical overmounts - trying to overmount the root of mounted tree with
> the same tree fails with -EBUSY. It's unreliable (the other thread might've
> mounted something on top of the automount it has triggered) *and* causes
> no end of headache for follow_automount() and its caller, since
> finish_automount() behaves like do_new_mount() - if the mountpoint to be is
> overmounted, it mounts on top what's overmounting it. It's not only wrong
> (we want to go into what's overmounting the automount point and quietly
> discard what we planned to mount there), it introduces the possibility of
> original parent mount getting dropped. That's what 8aef18845266 (VFS: Fix
> vfsmount overput on simultaneous automount) deals with, but it can't do
> anything about the reliability of conflict detection - if something had
> been overmounted the other thread's automount (e.g. that other thread
> having stepped into automount in mount(2)), we don't get that -EBUSY and
> the result is
> 	referral point under automounted NFS under explicit overmount
> 	under another copy of automounted NFS
>
> What we need is finish_automount() *NOT* digging into overmounts - if it
> finds one, it should just quietly discard the thing it was asked to mount.
> And don't bother with actually crossing into the results of finish_automount() -
> the same loop that calls follow_automount() will do that just fine on the
> next iteration.
>
> IOW, instead of calling lock_mount() have finish_automount() do it manually,
> _without_ the "move into overmount and retry" part. And leave crossing into
> the results to the caller of follow_automount(), which simplifies it a lot.
>
> Moral: if you end up with a lot of glue working around the calling conventions
> of something, perhaps these calling conventions are simply wrong...
>
> Fixes: 8aef18845266 (VFS: Fix vfsmount overput on simultaneous automount)
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

I mean, just reading this is awfully complicated but the code seems fine.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

> ---
>  fs/namei.c     | 29 ++++-------------------------
>  fs/namespace.c | 41 ++++++++++++++++++++++++++++++++++-------
>  2 files changed, 38 insertions(+), 32 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index d2720dc71d0e..bd036dfdb0d9 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1133,11 +1133,9 @@ EXPORT_SYMBOL(follow_up);
>   * - return -EISDIR to tell follow_managed() to stop and return the path we
>   *   were called with.
>   */
> -static int follow_automount(struct path *path, struct nameidata *nd,
> -			    bool *need_mntput)
> +static int follow_automount(struct path *path, struct nameidata *nd)
>  {
>  	struct vfsmount *mnt;
> -	int err;
>  
>  	if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
>  		return -EREMOTE;
> @@ -1178,29 +1176,10 @@ static int follow_automount(struct path *path, struct nameidata *nd,
>  		return PTR_ERR(mnt);
>  	}
>  
> -	if (!mnt) /* mount collision */
> -		return 0;
> -
> -	if (!*need_mntput) {
> -		/* lock_mount() may release path->mnt on error */
> -		mntget(path->mnt);
> -		*need_mntput = true;
> -	}
> -	err = finish_automount(mnt, path);
> -
> -	switch (err) {
> -	case -EBUSY:
> -		/* Someone else made a mount here whilst we were busy */
> +	if (!mnt)
>  		return 0;
> -	case 0:
> -		path_put(path);
> -		path->mnt = mnt;
> -		path->dentry = dget(mnt->mnt_root);
> -		return 0;
> -	default:
> -		return err;
> -	}
>  
> +	return finish_automount(mnt, path);
>  }
>  
>  /*
> @@ -1258,7 +1237,7 @@ static int follow_managed(struct path *path, struct nameidata *nd)
>  
>  		/* Handle an automount point */
>  		if (flags & DCACHE_NEED_AUTOMOUNT) {
> -			ret = follow_automount(path, nd, &need_mntput);
> +			ret = follow_automount(path, nd);
>  			if (ret < 0)
>  				break;
>  			continue;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 5f0a80f17651..f1817eb5f87d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2823,6 +2823,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
>  
>  int finish_automount(struct vfsmount *m, struct path *path)
>  {
> +	struct dentry *dentry = path->dentry;
>  	struct mount *mnt = real_mount(m);
>  	struct mountpoint *mp;
>  	int err;
> @@ -2832,21 +2833,47 @@ int finish_automount(struct vfsmount *m, struct path *path)
>  	BUG_ON(mnt_get_count(mnt) < 2);
>  
>  	if (m->mnt_sb == path->mnt->mnt_sb &&
> -	    m->mnt_root == path->dentry) {
> +	    m->mnt_root == dentry) {
>  		err = -ELOOP;
> -		goto fail;
> +		goto discard;
>  	}
>  
> -	mp = lock_mount(path);
> +	/*
> +	 * we don't want to use lock_mount() - in this case finding something
> +	 * that overmounts our mountpoint to be means "quitely drop what we've
> +	 * got", not "try to mount it on top".
> +	 */
> +	inode_lock(dentry->d_inode);
> +	if (unlikely(cant_mount(dentry))) {
> +		err = -ENOENT;
> +		goto discard1;
> +	}
> +	namespace_lock();
> +	rcu_read_lock();
> +	if (unlikely(__lookup_mnt(path->mnt, dentry))) {

That means someone has already performed that mount in the meantime, I
take it.

> +		rcu_read_unlock();
> +		err = 0;
> +		goto discard2;
> +	}
> +	rcu_read_unlock();
> +	mp = get_mountpoint(dentry);
>  	if (IS_ERR(mp)) {
>  		err = PTR_ERR(mp);
> -		goto fail;
> +		goto discard2;
>  	}
> +
>  	err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
>  	unlock_mount(mp);
> -	if (!err)
> -		return 0;
> -fail:
> +	if (unlikely(err))
> +		goto discard;
> +	mntput(m);

Probably being dense here but better safe than sorry: this mntput()
corresponds to the get_mountpoint() above, right?
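The behavioural difference the patch is after (mount(2)-style "climb to the top of the pile and attach there" versus automount-style "if something is already there, quietly discard ours and report success") can be shown with a toy model. Everything here is a hypothetical userspace sketch: `struct toy_mnt`, the `child` link standing in for "mounted on top of", and the `discarded` flag are inventions for illustration, and no locking or refcounting is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mount record: child is whatever is mounted on top of us. */
struct toy_mnt {
	const char *name;
	struct toy_mnt *child;
	int discarded;
};

/* mount(2)-style attach: move into overmounts, then mount on top of the pile. */
static void toy_mount_on_top(struct toy_mnt *point, struct toy_mnt *m)
{
	while (point->child)
		point = point->child;
	point->child = m;
}

/*
 * automount-style attach: finding something already mounted on the
 * mountpoint means "drop what we have got and succeed"; the caller's
 * loop will cross into the existing mount on its next iteration.
 */
static int toy_finish_automount(struct toy_mnt *point, struct toy_mnt *m)
{
	if (point->child) {
		m->discarded = 1;
		return 0;
	}
	point->child = m;
	return 0;
}
```

Two racing automounts on the same referral point then leave exactly one mount attached; only the explicit mount(2)-style call stacks a second one on top.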
* [PATCH 03/17] follow_automount(): get rid of dead^Wstillborn code
  2020-01-19  3:17   ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro
  2020-01-19  3:17     ` [PATCH 02/17] fix automount/automount race properly Al Viro
@ 2020-01-19  3:17     ` Al Viro
  2020-01-30 14:38       ` Christian Brauner
  2020-01-19  3:17     ` [PATCH 04/17] follow_automount() doesn't need the entire nameidata Al Viro
                       ` (23 subsequent siblings)
  25 siblings, 1 reply; 92+ messages in thread
From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw)
To: linux-fsdevel
Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro

From: Al Viro <viro@zeniv.linux.org.uk>

1) no instances of ->d_automount() have ever made use of the "return
ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's
a rudiment of plans that got superseded before the thing went into
the tree. Despite the comment in follow_automount(), autofs has
never done that.

2) if there's no ->d_automount() in dentry_operations, filesystems
should not set DCACHE_NEED_AUTOMOUNT in the first place. None have
ever done so...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namei.c     | 28 +++-------------------------
 fs/namespace.c |  9 ++++++++-
 2 files changed, 11 insertions(+), 26 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index bd036dfdb0d9..d30a74a18da9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1135,10 +1135,7 @@ EXPORT_SYMBOL(follow_up);
  */
 static int follow_automount(struct path *path, struct nameidata *nd)
 {
-	struct vfsmount *mnt;
-
-	if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
-		return -EREMOTE;
+	struct dentry *dentry = path->dentry;
 
 	/* We don't want to mount if someone's just doing a stat -
 	 * unless they're stat'ing a directory and appended a '/' to
@@ -1153,33 +1150,14 @@ static int follow_automount(struct path *path, struct nameidata *nd)
 	 */
 	if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
 			   LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
-	    path->dentry->d_inode)
+	    dentry->d_inode)
 		return -EISDIR;
 
 	nd->total_link_count++;
 	if (nd->total_link_count >= 40)
 		return -ELOOP;
 
-	mnt = path->dentry->d_op->d_automount(path);
-	if (IS_ERR(mnt)) {
-		/*
-		 * The filesystem is allowed to return -EISDIR here to indicate
-		 * it doesn't want to automount. For instance, autofs would do
-		 * this so that its userspace daemon can mount on this dentry.
-		 *
-		 * However, we can only permit this if it's a terminal point in
-		 * the path being looked up; if it wasn't then the remainder of
-		 * the path is inaccessible and we should say so.
-		 */
-		if (PTR_ERR(mnt) == -EISDIR && (nd->flags & LOOKUP_PARENT))
-			return -EREMOTE;
-		return PTR_ERR(mnt);
-	}
-
-	if (!mnt)
-		return 0;
-
-	return finish_automount(mnt, path);
+	return finish_automount(dentry->d_op->d_automount(path), path);
 }
 
 /*
diff --git a/fs/namespace.c b/fs/namespace.c
index f1817eb5f87d..b37dc59bfa05 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2824,9 +2824,16 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
 int finish_automount(struct vfsmount *m, struct path *path)
 {
 	struct dentry *dentry = path->dentry;
-	struct mount *mnt = real_mount(m);
 	struct mountpoint *mp;
+	struct mount *mnt;
 	int err;
+
+	if (!m)
+		return 0;
+	if (IS_ERR(m))
+		return PTR_ERR(m);
+
+	mnt = real_mount(m);
 	/* The new mount record should have at least 2 refs to prevent it being
 	 * expired before we get a chance to add it
 	 */
-- 
2.20.1
* Re: [PATCH 03/17] follow_automount(): get rid of dead^Wstillborn code 2020-01-19 3:17 ` [PATCH 03/17] follow_automount(): get rid of dead^Wstillborn code Al Viro @ 2020-01-30 14:38 ` Christian Brauner 0 siblings, 0 replies; 92+ messages in thread From: Christian Brauner @ 2020-01-30 14:38 UTC (permalink / raw) To: Al Viro Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman On Sun, Jan 19, 2020 at 03:17:15AM +0000, Al Viro wrote: > From: Al Viro <viro@zeniv.linux.org.uk> > > 1) no instances of ->d_automount() have ever made use of the "return > ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's > a rudiment of plans that got superseded before the thing went into > the tree. Despite the comment in follow_automount(), autofs has > never done that. > > 2) if there's no ->d_automount() in dentry_operations, filesystems > should not set DCACHE_NEED_AUTOMOUNT in the first place. None have > ever done so... > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> I can't speak to 1) but code seems correct: Acked-by: Christian Brauner <christian.brauner@ubuntu.com> ^ permalink raw reply [flat|nested] 92+ messages in thread
* [PATCH 04/17] follow_automount() doesn't need the entire nameidata 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro 2020-01-19 3:17 ` [PATCH 02/17] fix automount/automount race properly Al Viro 2020-01-19 3:17 ` [PATCH 03/17] follow_automount(): get rid of dead^Wstillborn code Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-30 14:45 ` Christian Brauner 2020-01-19 3:17 ` [PATCH 05/17] make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW Al Viro ` (22 subsequent siblings) 25 siblings, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> only the address of ->total_link_count and the flags Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index d30a74a18da9..3b6f60c02f8a 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1133,7 +1133,7 @@ EXPORT_SYMBOL(follow_up); * - return -EISDIR to tell follow_managed() to stop and return the path we * were called with. */ -static int follow_automount(struct path *path, struct nameidata *nd) +static int follow_automount(struct path *path, int *count, unsigned lookup_flags) { struct dentry *dentry = path->dentry; @@ -1148,13 +1148,12 @@ static int follow_automount(struct path *path, struct nameidata *nd) * as being automount points. These will need the attentions * of the daemon to instantiate them before they can be used. 
*/ - if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY | + if (!(lookup_flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY | LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) && dentry->d_inode) return -EISDIR; - nd->total_link_count++; - if (nd->total_link_count >= 40) + if (count && *count++ >= 40) return -ELOOP; return finish_automount(dentry->d_op->d_automount(path), path); @@ -1215,7 +1214,8 @@ static int follow_managed(struct path *path, struct nameidata *nd) /* Handle an automount point */ if (flags & DCACHE_NEED_AUTOMOUNT) { - ret = follow_automount(path, nd); + ret = follow_automount(path, &nd->total_link_count, + nd->flags); if (ret < 0) break; continue; -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* Re: [PATCH 04/17] follow_automount() doesn't need the entire nameidata 2020-01-19 3:17 ` [PATCH 04/17] follow_automount() doesn't need the entire nameidata Al Viro @ 2020-01-30 14:45 ` Christian Brauner 2020-01-30 15:38 ` Al Viro 0 siblings, 1 reply; 92+ messages in thread From: Christian Brauner @ 2020-01-30 14:45 UTC (permalink / raw) To: Al Viro Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman On Sun, Jan 19, 2020 at 03:17:16AM +0000, Al Viro wrote: > From: Al Viro <viro@zeniv.linux.org.uk> > > only the address of ->total_link_count and the flags > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/namei.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index d30a74a18da9..3b6f60c02f8a 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -1133,7 +1133,7 @@ EXPORT_SYMBOL(follow_up); > * - return -EISDIR to tell follow_managed() to stop and return the path we > * were called with. > */ > -static int follow_automount(struct path *path, struct nameidata *nd) > +static int follow_automount(struct path *path, int *count, unsigned lookup_flags) > { > struct dentry *dentry = path->dentry; > > @@ -1148,13 +1148,12 @@ static int follow_automount(struct path *path, struct nameidata *nd) > * as being automount points. These will need the attentions > * of the daemon to instantiate them before they can be used. > */ > - if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY | > + if (!(lookup_flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY | > LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) && > dentry->d_inode) > return -EISDIR; > > - nd->total_link_count++; > - if (nd->total_link_count >= 40) > + if (count && *count++ >= 40) He, side-effects galore. :) Isn't this incrementing the address but you want to increment the counter? 
Seems like this should be

	if (count && (*count)++ >= 40)

and even then, not incrementing at all once we have hit the limit seems
more natural to me?

Christian

^ permalink raw reply [flat|nested] 92+ messages in thread
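The precedence point raised in this review can be checked in isolation. The following is a minimal userspace sketch, not kernel code: the helper names are made up for illustration, and only the two conditions are taken from the thread. Postfix ++ binds tighter than unary *, so *count++ means *(count++) and advances the local pointer copy rather than the counter.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Condition as posted in the patch.  It parses as *(count++): the old
 * pointer is dereferenced for the comparison and only the local copy
 * of the pointer is advanced, so the caller's counter never changes.
 */
static int limit_hit_as_posted(int *count)
{
	if (count && *count++ >= 40)
		return 1;
	return 0;
}

/* Condition as suggested in the review: the counter itself is bumped. */
static int limit_hit_as_suggested(int *count)
{
	if (count && (*count)++ >= 40)
		return 1;
	return 0;
}

/* Hypothetical driver showing the difference between the two forms. */
static void precedence_demo(void)
{
	int a = 0, b = 0;
	int i;

	/* The posted form never reaches the limit, however often it runs. */
	for (i = 0; i < 100; i++)
		assert(!limit_hit_as_posted(&a));
	assert(a == 0);

	/* The suggested form lets 40 calls through and trips on the 41st. */
	for (i = 0; i < 40; i++)
		assert(!limit_hit_as_suggested(&b));
	assert(limit_hit_as_suggested(&b));
	assert(b == 41);
}
```

Calling precedence_demo() from any test harness exercises both forms; with the posted condition the loop limit can never fire, which is why the check silently stopped working.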
* Re: [PATCH 04/17] follow_automount() doesn't need the entire nameidata 2020-01-30 14:45 ` Christian Brauner @ 2020-01-30 15:38 ` Al Viro 2020-01-30 15:55 ` Al Viro 0 siblings, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-30 15:38 UTC (permalink / raw) To: Christian Brauner Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman

On Thu, Jan 30, 2020 at 03:45:20PM +0100, Christian Brauner wrote:

> > - nd->total_link_count++;
> > - if (nd->total_link_count >= 40)
> > + if (count && *count++ >= 40)
>
> He, side-effects galore. :)
> Isn't this incrementing the address but you want to increment the
> counter?
> Seems like this should be
>
> if (count && (*count)++ >= 40)

Nice catch; incidentally, it means that the usual testsuites (xfstests,
LTP) are missing the automount loop detection. Hmm... Something like

export $FOO over nfs4 to localhost
mkdir $FOO/sub
touch $FOO/a
mount $SCRATCH_DEV $FOO/sub
touch $FOO/sub/a
cd $BAR
mkdir nfs
mount -t nfs localhost:$FOO nfs
for i in `seq 40`; do ln -s l`expr $i - 1` l$i; done
for i in `seq 40`; do ln -s m`expr $i - 1` m$i; done
ln -s nfs/sub/a l0
ln -s nfs/a m0
for i in `seq 40`; do
	umount nfs/sub 2>/dev/null
	cat l$i m$i
done

BTW, the check of pre-increment value is more correct - it's accidental,
but it does give consistency with the normal symlink following. We do
allow up to 40 symlinks over the pathname resolution, not up to 39.

The thing above should produce

cat: l39: Too many levels of symbolic links
cat: l40: Too many levels of symbolic links
cat: m40: Too many levels of symbolic links

Here l<n> and m<n> go through n + 1 symlinks, ending at nfs/sub/a and
nfs/a resp.; the former does trigger an automount, the latter does not.
On mainline it actually starts to complain about l38, l39, l40 and m40,
due to that off-by-one in follow_automount().

^ permalink raw reply [flat|nested] 92+ messages in thread
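The off-by-one described above can be modelled with a bare counter. This is only a sketch under simplifying assumptions: it counts traversals against a limit of 40 in isolation and ignores how symlink and automount steps share nd->total_link_count in the real walk. The function names and the LINK_LIMIT macro are hypothetical.

```c
#include <assert.h>

#define LINK_LIMIT 40	/* hypothetical name for the literal 40 in the patch */

/*
 * Old follow_automount() ordering: bump the counter first, then compare
 * the new value.  The 40th traversal is already refused.
 */
static int traversals_allowed_increment_first(void)
{
	int count = 0, allowed = 0;

	for (;;) {
		count++;
		if (count >= LINK_LIMIT)
			break;
		allowed++;
	}
	return allowed;
}

/*
 * Ordering after the fix: compare the value from before the increment.
 * Forty traversals pass and the 41st is refused.
 */
static int traversals_allowed_compare_first(void)
{
	int count = 0, allowed = 0;

	for (;;) {
		if (count++ >= LINK_LIMIT)
			break;
		allowed++;
	}
	return allowed;
}

/* Hypothetical self-check tying the two orderings to the numbers above. */
static void limit_demo(void)
{
	assert(traversals_allowed_increment_first() == 39);
	assert(traversals_allowed_compare_first() == 40);
}
```

The one-step difference between the two orderings matches the report that mainline starts complaining one link earlier on the path that triggers an automount.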
* Re: [PATCH 04/17] follow_automount() doesn't need the entire nameidata 2020-01-30 15:38 ` Al Viro @ 2020-01-30 15:55 ` Al Viro 0 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-30 15:55 UTC (permalink / raw) To: Christian Brauner Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman On Thu, Jan 30, 2020 at 03:38:25PM +0000, Al Viro wrote: > On Thu, Jan 30, 2020 at 03:45:20PM +0100, Christian Brauner wrote: > > > - nd->total_link_count++; > > > - if (nd->total_link_count >= 40) > > > + if (count && *count++ >= 40) > > > > He, side-effects galore. :) > > Isn't this incrementing the address but you want to increment the > > counter? > > Seems like this should be > > > > if (count && (*count)++ >= 40) > > Nice catch; incidentally, it means that usual testsuites (xfstests, > LTP) are missing the automount loop detection. Hmm... Fix folded and pushed (the series in #next.namei now, on top of #work.openat2 + #fixes) ^ permalink raw reply [flat|nested] 92+ messages in thread
* [PATCH 05/17] make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (2 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 04/17] follow_automount() doesn't need the entire nameidata Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 06/17] handle_mounts(): start building a sane wrapper for follow_managed() Al Viro ` (21 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink". As it is, we might or might not have LOOKUP_FOLLOW in op->intent in that case - that depends upon having O_NOFOLLOW in open flags. It doesn't matter, since we won't be checking it in that case - do_last() bails out earlier. However, making sure it's not set (i.e. acting as if we had an explicit O_NOFOLLOW) makes the behaviour more explicit and allows to reorder the check for O_CREAT | O_EXCL in do_last() with the call of step_into() immediately following it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 15 +++++---------- fs/open.c | 4 +++- 2 files changed, 8 insertions(+), 11 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 3b6f60c02f8a..c19b458f66da 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3262,22 +3262,17 @@ static int do_last(struct nameidata *nd, if (unlikely(error < 0)) return error; - /* - * create/update audit record if it already exists. 
- */ - audit_inode(nd->name, path.dentry, 0); - - if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) { - path_to_nameidata(&path, nd); - return -EEXIST; - } - seq = 0; /* out of RCU mode, so the value doesn't matter */ inode = d_backing_inode(path.dentry); finish_lookup: error = step_into(nd, &path, 0, inode, seq); if (unlikely(error)) return error; + + if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) { + audit_inode(nd->name, nd->path.dentry, 0); + return -EEXIST; + } finish_open: /* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */ error = complete_walk(nd); diff --git a/fs/open.c b/fs/open.c index b62f5c0923a8..ba7009a5dd1a 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1014,8 +1014,10 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o if (flags & O_CREAT) { op->intent |= LOOKUP_CREATE; - if (flags & O_EXCL) + if (flags & O_EXCL) { op->intent |= LOOKUP_EXCL; + flags |= O_NOFOLLOW; + } } if (flags & O_DIRECTORY) -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
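The userspace-visible rule this patch relies on (O_CREAT | O_EXCL fails with EEXIST on a trailing symlink, whatever the link points to) can be observed without touching the kernel. Below is a minimal sketch assuming a POSIX system with a writable /tmp; the helper names and the temporary path template are illustrative only.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Create a dangling symlink in a fresh temporary directory and try to
 * open it with O_CREAT | O_EXCL.  Returns the errno of the failed
 * open(), 0 if the open unexpectedly succeeded, or -1 if the setup
 * itself failed.
 */
static int excl_open_on_dangling_symlink(void)
{
	char dir[] = "/tmp/excl-demo-XXXXXX";
	char target[64], link_path[64];
	int fd, err;

	if (!mkdtemp(dir))
		return -1;
	snprintf(target, sizeof(target), "%s/target", dir);
	snprintf(link_path, sizeof(link_path), "%s/link", dir);
	if (symlink(target, link_path) < 0) {
		rmdir(dir);
		return -1;
	}

	/* The trailing symlink must not be followed here. */
	fd = open(link_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	err = fd < 0 ? errno : 0;
	if (fd >= 0)
		close(fd);

	/* Clean up; the target only exists if the link was followed. */
	unlink(target);
	unlink(link_path);
	rmdir(dir);
	return err;
}

/* Hypothetical self-check for the behaviour described in the commit. */
static void excl_demo(void)
{
	assert(excl_open_on_dangling_symlink() == EEXIST);
}
```

Because the link is dangling, a lookup that followed it would have created the target and succeeded; getting EEXIST shows the open behaved as if O_NOFOLLOW had been given.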
* [PATCH 06/17] handle_mounts(): start building a sane wrapper for follow_managed() 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (3 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 05/17] make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 07/17] atomic_open(): saner calling conventions (return dentry on success) Al Viro ` (20 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> All callers of follow_managed() follow it on success with the same steps - d_backing_inode(path->dentry) is calculated and stored into some struct inode * variable and, in all but one case, an unsigned variable (nd->seq to be) is zeroed. The single exception is lookup_fast() and there zeroing is the correct thing to do - not doing it is a pointless microoptimization. Add a wrapper for follow_managed() that would do that combination. It's mostly a vehicle for code massage - it will be changing quite a bit, and the current calling conventions are by no means final. Right now it takes path, nameidata and (as out params) inode and seq, similar to __follow_mount_rcu(). Which will soon get folded into it... 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index c19b458f66da..4c867d0970d5 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1304,6 +1304,18 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, !(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT); } +static inline int handle_mounts(struct path *path, struct nameidata *nd, + struct inode **inode, unsigned int *seqp) +{ + int ret = follow_managed(path, nd); + + if (likely(ret >= 0)) { + *inode = d_backing_inode(path->dentry); + *seqp = 0; /* out of RCU mode, so the value doesn't matter */ + } + return ret; +} + static int follow_dotdot_rcu(struct nameidata *nd) { struct inode *inode = nd->inode; @@ -1514,7 +1526,6 @@ static int lookup_fast(struct nameidata *nd, struct vfsmount *mnt = nd->path.mnt; struct dentry *dentry, *parent = nd->path.dentry; int status = 1; - int err; /* * Rename seqlock is not required here because in the off chance @@ -1584,10 +1595,7 @@ static int lookup_fast(struct nameidata *nd, path->mnt = mnt; path->dentry = dentry; - err = follow_managed(path, nd); - if (likely(err > 0)) - *inode = d_backing_inode(path->dentry); - return err; + return handle_mounts(path, nd, inode, seqp); } /* Fast lookup failed, do it the slow way */ @@ -1761,12 +1769,9 @@ static int walk_component(struct nameidata *nd, int flags) return PTR_ERR(path.dentry); path.mnt = nd->path.mnt; - err = follow_managed(&path, nd); + err = handle_mounts(&path, nd, &inode, &seq); if (unlikely(err < 0)) return err; - - seq = 0; /* we are already out of RCU mode */ - inode = d_backing_inode(path.dentry); } return step_into(nd, &path, flags, inode, seq); @@ -2233,11 +2238,9 @@ static int handle_lookup_down(struct nameidata *nd) return -ECHILD; } else { dget(path.dentry); - err = follow_managed(&path, nd); + err = handle_mounts(&path, nd, &inode, &seq); if (unlikely(err < 0)) 
return err; - inode = d_backing_inode(path.dentry); - seq = 0; } path_to_nameidata(&path, nd); nd->inode = inode; @@ -3258,12 +3261,9 @@ static int do_last(struct nameidata *nd, got_write = false; } - error = follow_managed(&path, nd); + error = handle_mounts(&path, nd, &inode, &seq); if (unlikely(error < 0)) return error; - - seq = 0; /* out of RCU mode, so the value doesn't matter */ - inode = d_backing_inode(path.dentry); finish_lookup: error = step_into(nd, &path, 0, inode, seq); if (unlikely(error)) -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 07/17] atomic_open(): saner calling conventions (return dentry on success) 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (4 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 06/17] handle_mounts(): start building a sane wrapper for follow_managed() Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 08/17] lookup_open(): " Al Viro ` (19 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> Currently it either returns -E... or puts (nd->path.mnt,dentry) into *path and returns 0. Make it return ERR_PTR(-E...) or dentry; adjust the caller. Fewer arguments and it's easier to keep track of *path contents that way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 37 ++++++++++++++++++++----------------- 1 file changed, 20 insertions(+), 17 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 4c867d0970d5..9d8837432a7b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2955,10 +2955,10 @@ static int may_o_create(const struct path *dir, struct dentry *dentry, umode_t m * * Returns an error code otherwise. 
*/ -static int atomic_open(struct nameidata *nd, struct dentry *dentry, - struct path *path, struct file *file, - const struct open_flags *op, - int open_flag, umode_t mode) +static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry, + struct file *file, + const struct open_flags *op, + int open_flag, umode_t mode) { struct dentry *const DENTRY_NOT_SET = (void *) -1UL; struct inode *dir = nd->path.dentry->d_inode; @@ -2999,17 +2999,15 @@ static int atomic_open(struct nameidata *nd, struct dentry *dentry, } if (file->f_mode & FMODE_CREATED) fsnotify_create(dir, dentry); - if (unlikely(d_is_negative(dentry))) { + if (unlikely(d_is_negative(dentry))) error = -ENOENT; - } else { - path->dentry = dentry; - path->mnt = nd->path.mnt; - return 0; - } } } - dput(dentry); - return error; + if (error) { + dput(dentry); + dentry = ERR_PTR(error); + } + return dentry; } /* @@ -3104,11 +3102,16 @@ static int lookup_open(struct nameidata *nd, struct path *path, } if (dir_inode->i_op->atomic_open) { - error = atomic_open(nd, dentry, path, file, op, open_flag, - mode); - if (unlikely(error == -ENOENT) && create_error) - error = create_error; - return error; + dentry = atomic_open(nd, dentry, file, op, open_flag, mode); + if (IS_ERR(dentry)) { + error = PTR_ERR(dentry); + if (unlikely(error == -ENOENT) && create_error) + error = create_error; + return error; + } + path->mnt = nd->path.mnt; + path->dentry = dentry; + return 0; } no_open: -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
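The calling-convention change in this patch (an int return plus an out argument becoming a single pointer return that encodes errors) can be sketched in userspace. The three helpers below re-implement the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() idea for illustration only; struct item, the table and both lookup functions are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins for the kernel helpers of the same names. */
static inline void *ERR_PTR(intptr_t error)
{
	return (void *)error;
}

static inline intptr_t PTR_ERR(const void *ptr)
{
	return (intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct item {			/* hypothetical object being looked up */
	int key;
};

static struct item table[] = { { 1 }, { 2 } };

/* Old style: 0 or -E... as the return value, result via *out. */
static int find_item_old(int key, struct item **out)
{
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (table[i].key == key) {
			*out = &table[i];
			return 0;
		}
	}
	return -ENOENT;
}

/* New style: the object on success, ERR_PTR(-E...) on failure. */
static struct item *find_item_new(int key)
{
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (table[i].key == key)
			return &table[i];
	}
	return ERR_PTR(-ENOENT);
}
```

With the second form the caller tests IS_ERR() once and has nothing else to initialise, which is the "fewer arguments and easier to keep track of *path contents" benefit the commit message describes.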
* [PATCH 08/17] lookup_open(): saner calling conventions (return dentry on success) 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (5 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 07/17] atomic_open(): saner calling conventions (return dentry on success) Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 09/17] do_last(): collapse the call of path_to_nameidata() Al Viro ` (18 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> same story as for atomic_open() in the previous commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 39 ++++++++++++++++++--------------------- 1 file changed, 18 insertions(+), 21 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 9d8837432a7b..30503f114142 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3025,10 +3025,9 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry, * * An error code is returned on failure. 
*/ -static int lookup_open(struct nameidata *nd, struct path *path, - struct file *file, - const struct open_flags *op, - bool got_write) +static struct dentry *lookup_open(struct nameidata *nd, struct file *file, + const struct open_flags *op, + bool got_write) { struct dentry *dir = nd->path.dentry; struct inode *dir_inode = dir->d_inode; @@ -3039,7 +3038,7 @@ static int lookup_open(struct nameidata *nd, struct path *path, DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); if (unlikely(IS_DEADDIR(dir_inode))) - return -ENOENT; + return ERR_PTR(-ENOENT); file->f_mode &= ~FMODE_CREATED; dentry = d_lookup(dir, &nd->last); @@ -3047,7 +3046,7 @@ static int lookup_open(struct nameidata *nd, struct path *path, if (!dentry) { dentry = d_alloc_parallel(dir, &nd->last, &wq); if (IS_ERR(dentry)) - return PTR_ERR(dentry); + return dentry; } if (d_in_lookup(dentry)) break; @@ -3063,7 +3062,7 @@ static int lookup_open(struct nameidata *nd, struct path *path, } if (dentry->d_inode) { /* Cached positive dentry: will open in f_op->open */ - goto out_no_open; + return dentry; } /* @@ -3104,14 +3103,10 @@ static int lookup_open(struct nameidata *nd, struct path *path, if (dir_inode->i_op->atomic_open) { dentry = atomic_open(nd, dentry, file, op, open_flag, mode); if (IS_ERR(dentry)) { - error = PTR_ERR(dentry); - if (unlikely(error == -ENOENT) && create_error) - error = create_error; - return error; + if (dentry == ERR_PTR(-ENOENT) && create_error) + dentry = ERR_PTR(create_error); } - path->mnt = nd->path.mnt; - path->dentry = dentry; - return 0; + return dentry; } no_open: @@ -3147,14 +3142,11 @@ static int lookup_open(struct nameidata *nd, struct path *path, error = create_error; goto out_dput; } -out_no_open: - path->dentry = dentry; - path->mnt = nd->path.mnt; - return 0; + return dentry; out_dput: dput(dentry); - return error; + return ERR_PTR(error); } /* @@ -3171,6 +3163,7 @@ static int do_last(struct nameidata *nd, unsigned seq; struct inode *inode; struct path path; + struct dentry 
*dentry; int error; nd->flags &= ~LOOKUP_PARENT; @@ -3227,14 +3220,18 @@ static int do_last(struct nameidata *nd, inode_lock(dir->d_inode); else inode_lock_shared(dir->d_inode); - error = lookup_open(nd, &path, file, op, got_write); + dentry = lookup_open(nd, file, op, got_write); if (open_flag & O_CREAT) inode_unlock(dir->d_inode); else inode_unlock_shared(dir->d_inode); - if (error) + if (IS_ERR(dentry)) { + error = PTR_ERR(dentry); goto out; + } + path.mnt = nd->path.mnt; + path.dentry = dentry; if (file->f_mode & FMODE_OPENED) { if ((file->f_mode & FMODE_CREATED) || -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 09/17] do_last(): collapse the call of path_to_nameidata() 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (6 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 08/17] lookup_open(): " Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 10/17] handle_mounts(): pass dentry in, turn path into a pure out argument Al Viro ` (17 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> ... and shift filling struct path to just before the call of handle_mounts(). All callers of handle_mounts() are immediately preceded by path->mnt = nd->path.mnt now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 30503f114142..f66553ef436a 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3230,8 +3230,6 @@ static int do_last(struct nameidata *nd, error = PTR_ERR(dentry); goto out; } - path.mnt = nd->path.mnt; - path.dentry = dentry; if (file->f_mode & FMODE_OPENED) { if ((file->f_mode & FMODE_CREATED) || @@ -3247,7 +3245,8 @@ static int do_last(struct nameidata *nd, open_flag &= ~O_TRUNC; will_truncate = false; acc_mode = 0; - path_to_nameidata(&path, nd); + dput(nd->path.dentry); + nd->path.dentry = dentry; goto finish_open_created; } @@ -3261,6 +3260,8 @@ static int do_last(struct nameidata *nd, got_write = false; } + path.mnt = nd->path.mnt; + path.dentry = dentry; error = handle_mounts(&path, nd, &inode, &seq); if (unlikely(error < 0)) return error; -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 10/17] handle_mounts(): pass dentry in, turn path into a pure out argument 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (7 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 09/17] do_last(): collapse the call of path_to_nameidata() Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 11/17] lookup_fast(): consolidate the RCU success case Al Viro ` (16 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> All callers are equivalent to path->dentry = dentry; path->mnt = nd->path.mnt; err = handle_mounts(path, ...) Pass dentry as an explicit argument, fill *path in handle_mounts() itself. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 37 ++++++++++++++++++------------------- 1 file changed, 18 insertions(+), 19 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index f66553ef436a..f95c072bad03 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1304,11 +1304,15 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, !(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT); } -static inline int handle_mounts(struct path *path, struct nameidata *nd, - struct inode **inode, unsigned int *seqp) +static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, + struct path *path, struct inode **inode, + unsigned int *seqp) { - int ret = follow_managed(path, nd); + int ret; + path->mnt = nd->path.mnt; + path->dentry = dentry; + ret = follow_managed(path, nd); if (likely(ret >= 0)) { *inode = d_backing_inode(path->dentry); *seqp = 0; /* out of RCU mode, so the value doesn't matter */ @@ -1592,10 +1596,7 @@ static int lookup_fast(struct nameidata *nd, dput(dentry); return status; } - - path->mnt = mnt; - path->dentry = dentry; - return 
handle_mounts(path, nd, inode, seqp); + return handle_mounts(nd, dentry, path, inode, seqp); } /* Fast lookup failed, do it the slow way */ @@ -1745,6 +1746,7 @@ static inline int step_into(struct nameidata *nd, struct path *path, static int walk_component(struct nameidata *nd, int flags) { struct path path; + struct dentry *dentry; struct inode *inode; unsigned seq; int err; @@ -1763,13 +1765,11 @@ static int walk_component(struct nameidata *nd, int flags) if (unlikely(err <= 0)) { if (err < 0) return err; - path.dentry = lookup_slow(&nd->last, nd->path.dentry, - nd->flags); - if (IS_ERR(path.dentry)) - return PTR_ERR(path.dentry); + dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags); + if (IS_ERR(dentry)) + return PTR_ERR(dentry); - path.mnt = nd->path.mnt; - err = handle_mounts(&path, nd, &inode, &seq); + err = handle_mounts(nd, dentry, &path, &inode, &seq); if (unlikely(err < 0)) return err; } @@ -2223,7 +2223,7 @@ static inline int lookup_last(struct nameidata *nd) static int handle_lookup_down(struct nameidata *nd) { - struct path path = nd->path; + struct path path; struct inode *inode = nd->inode; unsigned seq = nd->seq; int err; @@ -2234,11 +2234,12 @@ static int handle_lookup_down(struct nameidata *nd) * at the very beginning of walk, so we lose nothing * if we simply redo everything in non-RCU mode */ + path = nd->path; if (unlikely(!__follow_mount_rcu(nd, &path, &inode, &seq))) return -ECHILD; } else { - dget(path.dentry); - err = handle_mounts(&path, nd, &inode, &seq); + dget(nd->path.dentry); + err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq); if (unlikely(err < 0)) return err; } @@ -3260,9 +3261,7 @@ static int do_last(struct nameidata *nd, got_write = false; } - path.mnt = nd->path.mnt; - path.dentry = dentry; - error = handle_mounts(&path, nd, &inode, &seq); + error = handle_mounts(nd, dentry, &path, &inode, &seq); if (unlikely(error < 0)) return error; finish_lookup: -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ 
messages in thread
* [PATCH 11/17] lookup_fast(): consolidate the RCU success case 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (8 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 10/17] handle_mounts(): pass dentry in, turn path into a pure out argument Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 12/17] teach handle_mounts() to handle RCU mode Al Viro ` (15 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> 1) in case of __follow_mount_rcu() failure, lookup_fast() proceeds to call unlazy_child() and, should it succeed, handle_mounts(). Note that we have status > 0 (or we wouldn't be calling __follow_mount_rcu() at all), so all stuff conditional upon non-positive status won't even be touched. Consolidate just that sequence after the call of __follow_mount_rcu(). 2) calling d_is_negative() and keeping its result is pointless - we either don't get past checking ->d_seq (and don't use the results of d_is_negative() at all), or we are guaranteed that ->d_inode and type bits of ->d_flags had been consistent at the time of d_is_negative() call. IOW, we could only get to the use of its result if it's equal to !inode. The same ->d_seq check guarantees that after that point this CPU won't observe ->d_flags values older than ->d_inode update. So 'negative' variable is completely pointless these days. 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index f95c072bad03..2e416bd8ee26 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1538,7 +1538,6 @@ static int lookup_fast(struct nameidata *nd, */ if (nd->flags & LOOKUP_RCU) { unsigned seq; - bool negative; dentry = __d_lookup_rcu(parent, &nd->last, &seq); if (unlikely(!dentry)) { if (unlazy_walk(nd)) @@ -1551,7 +1550,6 @@ static int lookup_fast(struct nameidata *nd, * the dentry name information from lookup. */ *inode = d_backing_inode(dentry); - negative = d_is_negative(dentry); if (unlikely(read_seqcount_retry(&dentry->d_seq, seq))) return -ECHILD; @@ -1572,12 +1570,15 @@ static int lookup_fast(struct nameidata *nd, * Note: do negative dentry check after revalidation in * case that drops it. */ - if (unlikely(negative)) + if (unlikely(!inode)) return -ENOENT; path->mnt = mnt; path->dentry = dentry; if (likely(__follow_mount_rcu(nd, path, inode, seqp))) return 1; + if (unlazy_child(nd, dentry, seq)) + return -ECHILD; + return handle_mounts(nd, dentry, path, inode, seqp); } if (unlazy_child(nd, dentry, seq)) return -ECHILD; -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 12/17] teach handle_mounts() to handle RCU mode 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (9 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 11/17] lookup_fast(): consolidate the RCU success case Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 13/17] lookup_fast(): take mount traversal into callers Al Viro ` (14 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> ... and make the callers of __follow_mount_rcu() use handle_mounts(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 46 +++++++++++++++++----------------------------- 1 file changed, 17 insertions(+), 29 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 2e416bd8ee26..a3bed1307a4b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1312,6 +1312,18 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, path->mnt = nd->path.mnt; path->dentry = dentry; + if (nd->flags & LOOKUP_RCU) { + unsigned int seq = *seqp; + if (unlikely(!*inode)) + return -ENOENT; + if (likely(__follow_mount_rcu(nd, path, inode, seqp))) + return 1; + if (unlazy_child(nd, dentry, seq)) + return -ECHILD; + // *path might've been clobbered by __follow_mount_rcu() + path->mnt = nd->path.mnt; + path->dentry = dentry; + } ret = follow_managed(path, nd); if (likely(ret >= 0)) { *inode = d_backing_inode(path->dentry); @@ -1527,7 +1539,6 @@ static int lookup_fast(struct nameidata *nd, struct path *path, struct inode **inode, unsigned *seqp) { - struct vfsmount *mnt = nd->path.mnt; struct dentry *dentry, *parent = nd->path.dentry; int status = 1; @@ -1565,21 +1576,8 @@ static int lookup_fast(struct nameidata *nd, *seqp = seq; status = d_revalidate(dentry, nd->flags); - if (likely(status > 0)) { - 
/* - * Note: do negative dentry check after revalidation in - * case that drops it. - */ - if (unlikely(!inode)) - return -ENOENT; - path->mnt = mnt; - path->dentry = dentry; - if (likely(__follow_mount_rcu(nd, path, inode, seqp))) - return 1; - if (unlazy_child(nd, dentry, seq)) - return -ECHILD; + if (likely(status > 0)) return handle_mounts(nd, dentry, path, inode, seqp); - } if (unlazy_child(nd, dentry, seq)) return -ECHILD; if (unlikely(status == -ECHILD)) @@ -2229,21 +2227,11 @@ static int handle_lookup_down(struct nameidata *nd) unsigned seq = nd->seq; int err; - if (nd->flags & LOOKUP_RCU) { - /* - * don't bother with unlazy_walk on failure - we are - * at the very beginning of walk, so we lose nothing - * if we simply redo everything in non-RCU mode - */ - path = nd->path; - if (unlikely(!__follow_mount_rcu(nd, &path, &inode, &seq))) - return -ECHILD; - } else { + if (!(nd->flags & LOOKUP_RCU)) dget(nd->path.dentry); - err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq); - if (unlikely(err < 0)) - return err; - } + err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq); + if (unlikely(err < 0)) + return err; path_to_nameidata(&path, nd); nd->inode = inode; nd->seq = seq; -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 13/17] lookup_fast(): take mount traversal into callers 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (10 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 12/17] teach handle_mounts() to handle RCU mode Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 14/17] new step_into() flag: WALK_NOFOLLOW Al Viro ` (13 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> Current calling conventions: -E... on error, 0 on cache miss, result of handle_mounts(nd, dentry, path, inode, seqp) on success. Turn that into returning ERR_PTR(-E...), NULL and dentry resp.; deal with handle_mounts() in the callers. The thing is, they already do that in cache miss handling case, so we just need to supply dentry to them and unify the mount traversal in those cases. Fewer arguments that way, and we get closer to merging handle_mounts() and step_into(). 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 50 ++++++++++++++++++++++++-------------------------- 1 file changed, 24 insertions(+), 26 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index a3bed1307a4b..d529c1e138ff 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1535,9 +1535,9 @@ static struct dentry *__lookup_hash(const struct qstr *name, return dentry; } -static int lookup_fast(struct nameidata *nd, - struct path *path, struct inode **inode, - unsigned *seqp) +static struct dentry *lookup_fast(struct nameidata *nd, + struct inode **inode, + unsigned *seqp) { struct dentry *dentry, *parent = nd->path.dentry; int status = 1; @@ -1552,8 +1552,8 @@ static int lookup_fast(struct nameidata *nd, dentry = __d_lookup_rcu(parent, &nd->last, &seq); if (unlikely(!dentry)) { if (unlazy_walk(nd)) - return -ECHILD; - return 0; + return ERR_PTR(-ECHILD); + return NULL; } /* @@ -1562,7 +1562,7 @@ static int lookup_fast(struct nameidata *nd, */ *inode = d_backing_inode(dentry); if (unlikely(read_seqcount_retry(&dentry->d_seq, seq))) - return -ECHILD; + return ERR_PTR(-ECHILD); /* * This sequence count validates that the parent had no @@ -1572,30 +1572,30 @@ static int lookup_fast(struct nameidata *nd, * enough, we can use __read_seqcount_retry here. 
*/ if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq))) - return -ECHILD; + return ERR_PTR(-ECHILD); *seqp = seq; status = d_revalidate(dentry, nd->flags); if (likely(status > 0)) - return handle_mounts(nd, dentry, path, inode, seqp); + return dentry; if (unlazy_child(nd, dentry, seq)) - return -ECHILD; + return ERR_PTR(-ECHILD); if (unlikely(status == -ECHILD)) /* we'd been told to redo it in non-rcu mode */ status = d_revalidate(dentry, nd->flags); } else { dentry = __d_lookup(parent, &nd->last); if (unlikely(!dentry)) - return 0; + return NULL; status = d_revalidate(dentry, nd->flags); } if (unlikely(status <= 0)) { if (!status) d_invalidate(dentry); dput(dentry); - return status; + return ERR_PTR(status); } - return handle_mounts(nd, dentry, path, inode, seqp); + return dentry; } /* Fast lookup failed, do it the slow way */ @@ -1760,19 +1760,18 @@ static int walk_component(struct nameidata *nd, int flags) put_link(nd); return err; } - err = lookup_fast(nd, &path, &inode, &seq); - if (unlikely(err <= 0)) { - if (err < 0) - return err; + dentry = lookup_fast(nd, &inode, &seq); + if (IS_ERR(dentry)) + return PTR_ERR(dentry); + if (unlikely(!dentry)) { dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags); if (IS_ERR(dentry)) return PTR_ERR(dentry); - - err = handle_mounts(nd, dentry, &path, &inode, &seq); - if (unlikely(err < 0)) - return err; } + err = handle_mounts(nd, dentry, &path, &inode, &seq); + if (unlikely(err < 0)) + return err; return step_into(nd, &path, flags, inode, seq); } @@ -3170,13 +3169,12 @@ static int do_last(struct nameidata *nd, if (nd->last.name[nd->last.len]) nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; /* we _can_ be in RCU mode here */ - error = lookup_fast(nd, &path, &inode, &seq); - if (likely(error > 0)) + dentry = lookup_fast(nd, &inode, &seq); + if (IS_ERR(dentry)) + return PTR_ERR(dentry); + if (likely(dentry)) goto finish_lookup; - if (error < 0) - return error; - BUG_ON(nd->inode != dir->d_inode); 
BUG_ON(nd->flags & LOOKUP_RCU); } else { @@ -3250,10 +3248,10 @@ static int do_last(struct nameidata *nd, got_write = false; } +finish_lookup: error = handle_mounts(nd, dentry, &path, &inode, &seq); if (unlikely(error < 0)) return error; -finish_lookup: error = step_into(nd, &path, 0, inode, seq); if (unlikely(error)) return error; -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
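A note on the calling-convention change in 13/17, for readers less used to the idiom: the old lookup_fast() packed three outcomes into an int (-E..., 0, >0), while the new one packs them into a pointer (ERR_PTR(-E...), NULL, dentry). Below is a minimal userspace sketch of that three-way return. It is not kernel code: the ERR_PTR/PTR_ERR/IS_ERR helpers are simplified stand-ins for the ones in include/linux/err.h, and struct toy_dentry, toy_lookup_fast() and toy_walk() are hypothetical names made up for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified userspace stand-ins for the kernel's ERR_PTR helpers.
 * Assumption: errno values fit in the top 4095 addresses, as in the kernel.
 */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for struct dentry. */
struct toy_dentry {
	const char *name;
};

static struct toy_dentry cached = { "cached" };

/*
 * Toy lookup using the pointer convention:
 *   key <  0 -> ERR_PTR(-ECHILD)  (error)
 *   key == 0 -> NULL              (cache miss)
 *   key >  0 -> valid pointer     (cache hit)
 */
static struct toy_dentry *toy_lookup_fast(int key)
{
	if (key < 0)
		return ERR_PTR(-ECHILD);
	if (key == 0)
		return NULL;
	return &cached;
}

/* Caller-side dispatch, shaped like the new code in walk_component(). */
static int toy_walk(int key)
{
	struct toy_dentry *dentry = toy_lookup_fast(key);

	if (IS_ERR(dentry))
		return (int)PTR_ERR(dentry);
	if (!dentry)
		return 0;	/* a real caller would fall back to the slow path */
	return 1;		/* a real caller would go on to mount traversal */
}
```

The point of the pointer form is that the hit case hands the dentry straight back, so the caller no longer needs a separate struct path out-argument just to learn what was found.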
* [PATCH 14/17] new step_into() flag: WALK_NOFOLLOW 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (11 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 13/17] lookup_fast(): take mount traversal into callers Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 15/17] fold handle_mounts() into step_into() Al Viro ` (12 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> Tells step_into() not to follow symlinks, regardless of LOOKUP_FOLLOW. This allows switching handle_lookup_down() over to step_into(), so that all follow_managed() and step_into() calls are paired. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index d529c1e138ff..44634643475d 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1713,7 +1713,7 @@ static int pick_link(struct nameidata *nd, struct path *link, return 1; } -enum {WALK_FOLLOW = 1, WALK_MORE = 2}; +enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4}; /* * Do we need to follow links?
We _really_ want to be able @@ -1727,7 +1727,8 @@ static inline int step_into(struct nameidata *nd, struct path *path, if (!(flags & WALK_MORE) && nd->depth) put_link(nd); if (likely(!d_is_symlink(path->dentry)) || - !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW)) { + !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW) || + flags & WALK_NOFOLLOW) { /* not a symlink or should not follow */ path_to_nameidata(path, nd); nd->inode = inode; @@ -2231,10 +2232,7 @@ static int handle_lookup_down(struct nameidata *nd) err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq); if (unlikely(err < 0)) return err; - path_to_nameidata(&path, nd); - nd->inode = inode; - nd->seq = seq; - return 0; + return step_into(nd, &path, WALK_NOFOLLOW, inode, seq); } /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */ -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
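A note on the condition 14/17 adds to step_into(): it may help to see it as a plain predicate. The sketch below is a userspace model, not kernel code. The WALK_* values are taken from the patch and LOOKUP_FOLLOW is 0x0001 as in include/linux/namei.h; the function name stops_here() and its boolean is_symlink parameter are hypothetical simplifications standing in for d_is_symlink(path->dentry).

```c
#include <stdbool.h>

/* WALK_* values as in the patch; LOOKUP_FOLLOW as in include/linux/namei.h. */
enum { WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4 };
#define LOOKUP_FOLLOW 0x0001

/*
 * Returns true when step_into() would take the "not a symlink or should
 * not follow" branch, i.e. stop at this dentry instead of picking the link.
 */
static bool stops_here(bool is_symlink, int walk_flags, unsigned nd_flags)
{
	return !is_symlink ||
	       !((walk_flags & WALK_FOLLOW) || (nd_flags & LOOKUP_FOLLOW)) ||
	       (walk_flags & WALK_NOFOLLOW);
}
```

Written this way it is easy to see that WALK_NOFOLLOW wins over both WALK_FOLLOW and LOOKUP_FOLLOW, which is what handle_lookup_down() appears to rely on when it passes WALK_NOFOLLOW.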
* [PATCH 15/17] fold handle_mounts() into step_into() 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (12 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 14/17] new step_into() flag: WALK_NOFOLLOW Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 16/17] LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat() Al Viro ` (11 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> The following is true: * calls of handle_mounts() and step_into() are always paired in sequences like err = handle_mounts(nd, dentry, &path, &inode, &seq); if (unlikely(err < 0)) return err; err = step_into(nd, &path, flags, inode, seq); * in all such sequences path is uninitialized before and unused after this pair of calls * in all such sequences inode and seq are unused afterwards. So the call of handle_mounts() can be shifted inside step_into(), turning 'path' into a local variable in the combined function. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 41 +++++++++++++++-------------------------- 1 file changed, 15 insertions(+), 26 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 44634643475d..6c28b969f4d1 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1721,31 +1721,35 @@ enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4}; * so we keep a cache of "no, this doesn't need follow_link" * for the common case. 
*/ -static inline int step_into(struct nameidata *nd, struct path *path, - int flags, struct inode *inode, unsigned seq) +static int step_into(struct nameidata *nd, int flags, + struct dentry *dentry, struct inode *inode, unsigned seq) { + struct path path; + int err = handle_mounts(nd, dentry, &path, &inode, &seq); + + if (err < 0) + return err; if (!(flags & WALK_MORE) && nd->depth) put_link(nd); - if (likely(!d_is_symlink(path->dentry)) || + if (likely(!d_is_symlink(path.dentry)) || !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW) || flags & WALK_NOFOLLOW) { /* not a symlink or should not follow */ - path_to_nameidata(path, nd); + path_to_nameidata(&path, nd); nd->inode = inode; nd->seq = seq; return 0; } /* make sure that d_is_symlink above matches inode */ if (nd->flags & LOOKUP_RCU) { - if (read_seqcount_retry(&path->dentry->d_seq, seq)) + if (read_seqcount_retry(&path.dentry->d_seq, seq)) return -ECHILD; } - return pick_link(nd, path, inode, seq); + return pick_link(nd, &path, inode, seq); } static int walk_component(struct nameidata *nd, int flags) { - struct path path; struct dentry *dentry; struct inode *inode; unsigned seq; @@ -1769,11 +1773,7 @@ static int walk_component(struct nameidata *nd, int flags) if (IS_ERR(dentry)) return PTR_ERR(dentry); } - - err = handle_mounts(nd, dentry, &path, &inode, &seq); - if (unlikely(err < 0)) - return err; - return step_into(nd, &path, flags, inode, seq); + return step_into(nd, flags, dentry, inode, seq); } /* @@ -2222,17 +2222,10 @@ static inline int lookup_last(struct nameidata *nd) static int handle_lookup_down(struct nameidata *nd) { - struct path path; - struct inode *inode = nd->inode; - unsigned seq = nd->seq; - int err; - if (!(nd->flags & LOOKUP_RCU)) dget(nd->path.dentry); - err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq); - if (unlikely(err < 0)) - return err; - return step_into(nd, &path, WALK_NOFOLLOW, inode, seq); + return step_into(nd, WALK_NOFOLLOW, + nd->path.dentry, nd->inode, 
nd->seq); } /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */ @@ -3149,7 +3142,6 @@ static int do_last(struct nameidata *nd, int acc_mode = op->acc_mode; unsigned seq; struct inode *inode; - struct path path; struct dentry *dentry; int error; @@ -3247,10 +3239,7 @@ static int do_last(struct nameidata *nd, } finish_lookup: - error = handle_mounts(nd, dentry, &path, &inode, &seq); - if (unlikely(error < 0)) - return error; - error = step_into(nd, &path, 0, inode, seq); + error = step_into(nd, 0, dentry, inode, seq); if (unlikely(error)) return error; -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 16/17] LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat() 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (13 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 15/17] fold handle_mounts() into step_into() Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 17/17] expand the only remaining call of path_lookup_conditional() Al Viro ` (10 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> New LOOKUP flag, telling path_lookupat() to act as path_mountpointat(). IOW, traverse mounts at the final point and skip revalidation of the location where it ends up. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/autofs/dev-ioctl.c | 6 +-- fs/internal.h | 1 - fs/namei.c | 89 +++---------------------------------------- fs/namespace.c | 4 +- include/linux/namei.h | 2 +- 5 files changed, 12 insertions(+), 90 deletions(-) diff --git a/fs/autofs/dev-ioctl.c b/fs/autofs/dev-ioctl.c index a3cdb0036c5d..f3a0f412b43b 100644 --- a/fs/autofs/dev-ioctl.c +++ b/fs/autofs/dev-ioctl.c @@ -186,7 +186,7 @@ static int find_autofs_mount(const char *pathname, struct path path; int err; - err = kern_path_mountpoint(AT_FDCWD, pathname, &path, 0); + err = kern_path(pathname, LOOKUP_MOUNTPOINT, &path); if (err) return err; err = -ENOENT; @@ -519,8 +519,8 @@ static int autofs_dev_ioctl_ismountpoint(struct file *fp, if (!fp || param->ioctlfd == -1) { if (autofs_type_any(type)) - err = kern_path_mountpoint(AT_FDCWD, - name, &path, LOOKUP_FOLLOW); + err = kern_path(name, LOOKUP_FOLLOW | LOOKUP_MOUNTPOINT, + &path); else err = find_autofs_mount(name, &path, test_by_type, &type); diff --git a/fs/internal.h b/fs/internal.h index 4a7da1df573d..07695e0f56fe 100644 --- a/fs/internal.h +++ 
b/fs/internal.h @@ -61,7 +61,6 @@ extern int finish_clean_context(struct fs_context *fc); */ extern int filename_lookup(int dfd, struct filename *name, unsigned flags, struct path *path, struct path *root); -extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *); extern int vfs_path_lookup(struct dentry *, struct vfsmount *, const char *, unsigned int, struct path *); long do_mknodat(int dfd, const char __user *filename, umode_t mode, diff --git a/fs/namei.c b/fs/namei.c index 6c28b969f4d1..6852a0dcb25d 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2250,6 +2250,10 @@ static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path if (!err && nd->flags & LOOKUP_DIRECTORY) if (!d_can_lookup(nd->path.dentry)) err = -ENOTDIR; + if (!err && unlikely(nd->flags & LOOKUP_MOUNTPOINT)) { + err = handle_lookup_down(nd); + nd->flags &= ~LOOKUP_JUMPED; // no d_weak_revalidate(), please... + } if (!err) { *path = nd->path; nd->path.mnt = NULL; @@ -2278,7 +2282,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, retval = path_lookupat(&nd, flags | LOOKUP_REVAL, path); if (likely(!retval)) - audit_inode(name, path->dentry, 0); + audit_inode(name, path->dentry, + flags & LOOKUP_MOUNTPOINT ? AUDIT_INODE_NOEVAL : 0); restore_nameidata(); putname(name); return retval; @@ -2556,88 +2561,6 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags, } EXPORT_SYMBOL(user_path_at_empty); -/** - * path_mountpoint - look up a path to be umounted - * @nd: lookup context - * @flags: lookup flags - * @path: pointer to container for result - * - * Look up the given name, but don't attempt to revalidate the last component. - * Returns 0 and "path" will be valid on success; Returns error otherwise. 
- */ -static int -path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path) -{ - const char *s = path_init(nd, flags); - int err; - - while (!(err = link_path_walk(s, nd)) && - (err = lookup_last(nd)) > 0) { - s = trailing_symlink(nd); - } - if (!err && (nd->flags & LOOKUP_RCU)) - err = unlazy_walk(nd); - if (!err) - err = handle_lookup_down(nd); - if (!err) { - *path = nd->path; - nd->path.mnt = NULL; - nd->path.dentry = NULL; - } - terminate_walk(nd); - return err; -} - -static int -filename_mountpoint(int dfd, struct filename *name, struct path *path, - unsigned int flags) -{ - struct nameidata nd; - int error; - if (IS_ERR(name)) - return PTR_ERR(name); - set_nameidata(&nd, dfd, name); - error = path_mountpoint(&nd, flags | LOOKUP_RCU, path); - if (unlikely(error == -ECHILD)) - error = path_mountpoint(&nd, flags, path); - if (unlikely(error == -ESTALE)) - error = path_mountpoint(&nd, flags | LOOKUP_REVAL, path); - if (likely(!error)) - audit_inode(name, path->dentry, AUDIT_INODE_NOEVAL); - restore_nameidata(); - putname(name); - return error; -} - -/** - * user_path_mountpoint_at - lookup a path from userland in order to umount it - * @dfd: directory file descriptor - * @name: pathname from userland - * @flags: lookup flags - * @path: pointer to container to hold result - * - * A umount is a special case for path walking. We're not actually interested - * in the inode in this situation, and ESTALE errors can be a problem. We - * simply want track down the dentry and vfsmount attached at the mountpoint - * and avoid revalidating the last component. - * - * Returns 0 and populates "path" on success. 
- */ -int -user_path_mountpoint_at(int dfd, const char __user *name, unsigned int flags, - struct path *path) -{ - return filename_mountpoint(dfd, getname(name), path, flags); -} - -int -kern_path_mountpoint(int dfd, const char *name, struct path *path, - unsigned int flags) -{ - return filename_mountpoint(dfd, getname_kernel(name), path, flags); -} -EXPORT_SYMBOL(kern_path_mountpoint); - int __check_sticky(struct inode *dir, struct inode *inode) { kuid_t fsuid = current_fsuid(); diff --git a/fs/namespace.c b/fs/namespace.c index b37dc59bfa05..b31a75782a59 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1669,7 +1669,7 @@ int ksys_umount(char __user *name, int flags) struct path path; struct mount *mnt; int retval; - int lookup_flags = 0; + int lookup_flags = LOOKUP_MOUNTPOINT; if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW)) return -EINVAL; @@ -1680,7 +1680,7 @@ int ksys_umount(char __user *name, int flags) if (!(flags & UMOUNT_NOFOLLOW)) lookup_flags |= LOOKUP_FOLLOW; - retval = user_path_mountpoint_at(AT_FDCWD, name, lookup_flags, &path); + retval = user_path_at(AT_FDCWD, name, lookup_flags, &path); if (retval) goto out; mnt = real_mount(path.mnt); diff --git a/include/linux/namei.h b/include/linux/namei.h index 07bfb0874033..df3549de1cd1 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -22,6 +22,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_AUTOMOUNT 0x0004 /* force terminal automount */ #define LOOKUP_EMPTY 0x4000 /* accept empty path [user_... 
only] */ #define LOOKUP_DOWN 0x8000 /* follow mounts in the starting point */ +#define LOOKUP_MOUNTPOINT 0x0080 /* follow mounts in the end */ #define LOOKUP_REVAL 0x0020 /* tell ->d_revalidate() to trust no cache */ #define LOOKUP_RCU 0x0040 /* RCU pathwalk mode; semi-internal */ @@ -54,7 +55,6 @@ extern struct dentry *kern_path_create(int, const char *, struct path *, unsigne extern struct dentry *user_path_create(int, const char __user *, struct path *, unsigned int); extern void done_path_create(struct path *, struct dentry *); extern struct dentry *kern_path_locked(const char *, struct path *); -extern int kern_path_mountpoint(int, const char *, struct path *, unsigned int); extern struct dentry *try_lookup_one_len(const char *, struct dentry *, int); extern struct dentry *lookup_one_len(const char *, struct dentry *, int); -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
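A note on the ksys_umount() hunk in 16/17: after the change, the umount flags map onto lookup flags roughly as in the userspace sketch below. This only models the flag arithmetic shown in the diff. The MNT_* and UMOUNT_NOFOLLOW values are the umount2(2) ones, LOOKUP_MOUNTPOINT is the 0x0080 introduced by the patch, and umount_lookup_flags() is a hypothetical helper name; the real function goes on to do the lookup and the unmount itself.

```c
#include <errno.h>

/* umount2(2) flag values, defined locally so the sketch is self-contained. */
#define MNT_FORCE         0x1
#define MNT_DETACH        0x2
#define MNT_EXPIRE        0x4
#define UMOUNT_NOFOLLOW   0x8

/* Lookup flags: LOOKUP_FOLLOW from namei.h, LOOKUP_MOUNTPOINT from the patch. */
#define LOOKUP_FOLLOW     0x0001
#define LOOKUP_MOUNTPOINT 0x0080

/*
 * Returns the lookup flags umount would use for the given umount flags,
 * or -EINVAL if unknown umount flags are set.
 */
static int umount_lookup_flags(int flags)
{
	int lookup_flags = LOOKUP_MOUNTPOINT;

	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
		return -EINVAL;
	if (!(flags & UMOUNT_NOFOLLOW))
		lookup_flags |= LOOKUP_FOLLOW;
	return lookup_flags;
}
```

So every umount lookup now carries LOOKUP_MOUNTPOINT, and a trailing symlink is followed unless the caller asked for UMOUNT_NOFOLLOW.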
* [PATCH 17/17] expand the only remaining call of path_lookup_conditional() 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (14 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 16/17] LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat() Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 1/9] merging pick_link() with get_link(), part 1 Al Viro ` (9 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 6852a0dcb25d..e840472ab9bf 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -816,13 +816,6 @@ static void set_root(struct nameidata *nd) } } -static void path_put_conditional(struct path *path, struct nameidata *nd) -{ - dput(path->dentry); - if (path->mnt != nd->path.mnt) - mntput(path->mnt); -} - static inline void path_to_nameidata(const struct path *path, struct nameidata *nd) { @@ -1233,8 +1226,11 @@ static int follow_managed(struct path *path, struct nameidata *nd) ret = 1; if (ret > 0 && unlikely(d_flags_negative(flags))) ret = -ENOENT; - if (unlikely(ret < 0)) - path_put_conditional(path, nd); + if (unlikely(ret < 0)) { + dput(path->dentry); + if (path->mnt != nd->path.mnt) + mntput(path->mnt); + } return ret; } -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 1/9] merging pick_link() with get_link(), part 1 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (15 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 17/17] expand the only remaining call of path_lookup_conditional() Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 2/9] merging pick_link() with get_link(), part 2 Al Viro ` (8 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> Move restoring LOOKUP_PARENT and zeroing nd->stack[0].name past the call of get_link() (nothing _currently_ uses them in there). That allows us to move the call of may_follow_link() into get_link() as well, since now the presence of LOOKUP_PARENT distinguishes the callers from each other (link_path_walk() has it, trailing_symlink() doesn't). Preparations for folding trailing_symlink() into callers (lookup_last() and do_last()) and changing the calling conventions of those. Next stage after that will have get_link() call migrate into walk_component(), then - into step_into(). It's tricky enough to warrant doing that in stages, unfortunately...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index f9fa8579cf6a..45cedbe267ab 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1114,6 +1114,12 @@ const char *get_link(struct nameidata *nd) int error; const char *res; + if (!(nd->flags & LOOKUP_PARENT)) { + error = may_follow_link(nd); + if (unlikely(error)) + return ERR_PTR(error); + } + if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS)) return ERR_PTR(-ELOOP); @@ -2328,13 +2334,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) static const char *trailing_symlink(struct nameidata *nd) { - const char *s; - int error = may_follow_link(nd); - if (unlikely(error)) - return ERR_PTR(error); + const char *s = get_link(nd); nd->flags |= LOOKUP_PARENT; nd->stack[0].name = NULL; - s = get_link(nd); return s ? s : ""; } -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 2/9] merging pick_link() with get_link(), part 2 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (16 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 1/9] merging pick_link() with get_link(), part 1 Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 3/9] merging pick_link() with get_link(), part 3 Al Viro ` (7 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> Fold trailing_symlink() into lookup_last() and do_last(), change the calling conventions of those two. Rules change: success, we are done => NULL instead of 0 error => ERR_PTR(-E...) instead of -E... got a symlink to follow => return the path to be followed instead of 1 The loops calling those (in path_lookupat() and path_openat()) adjusted. A subtle change of control flow here: originally a pure-jump trailing symlink ("/" or procfs one) would've passed through the upper level loop once more, with "" for path to traverse. That would've brought us back to the lookup_last/do_last entry and we would've hit LAST_BIND case (LAST_BIND left from get_link() called by trailing_symlink()) and pretty much skip to the point right after where we'd left the sucker back when we picked that trailing symlink. Now we don't bother with that extra pass through the upper level loop - if get_link() says "I've just done a pure jump, nothing else to do", we just treat that as non-symlink case. Boilerplate added on that step will go away shortly - it'll migrate into walk_component() and then to step_into(), collapsing into the change of calling conventions for those. 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 68 ++++++++++++++++++++++++++++++++---------------------- 1 file changed, 40 insertions(+), 28 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 45cedbe267ab..d93e155caded 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2332,21 +2332,26 @@ static const char *path_init(struct nameidata *nd, unsigned flags) return s; } -static const char *trailing_symlink(struct nameidata *nd) -{ - const char *s = get_link(nd); - nd->flags |= LOOKUP_PARENT; - nd->stack[0].name = NULL; - return s ? s : ""; -} - -static inline int lookup_last(struct nameidata *nd) +static inline const char *lookup_last(struct nameidata *nd) { + int err; if (nd->last_type == LAST_NORM && nd->last.name[nd->last.len]) nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; nd->flags &= ~LOOKUP_PARENT; - return walk_component(nd, 0); + err = walk_component(nd, 0); + if (unlikely(err)) { + const char *s; + if (err < 0) + return ERR_PTR(err); + s = get_link(nd); + if (s) { + nd->flags |= LOOKUP_PARENT; + nd->stack[0].name = NULL; + return s; + } + } + return NULL; } static int handle_lookup_down(struct nameidata *nd) @@ -2369,10 +2374,9 @@ static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path s = ERR_PTR(err); } - while (!(err = link_path_walk(s, nd)) - && ((err = lookup_last(nd)) > 0)) { - s = trailing_symlink(nd); - } + while (!(err = link_path_walk(s, nd)) && + (s = lookup_last(nd)) != NULL) + ; if (!err) err = complete_walk(nd); @@ -3184,7 +3188,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, /* * Handle the last step of open() */ -static int do_last(struct nameidata *nd, +static const char *do_last(struct nameidata *nd, struct file *file, const struct open_flags *op) { struct dentry *dir = nd->path.dentry; @@ -3203,7 +3207,7 @@ static int do_last(struct nameidata *nd, if (nd->last_type != LAST_NORM) { error = handle_dots(nd, nd->last_type); if (unlikely(error)) - return error; + return
ERR_PTR(error); goto finish_open; } @@ -3213,7 +3217,7 @@ static int do_last(struct nameidata *nd, /* we _can_ be in RCU mode here */ dentry = lookup_fast(nd, &inode, &seq); if (IS_ERR(dentry)) - return PTR_ERR(dentry); + return ERR_CAST(dentry); if (likely(dentry)) goto finish_lookup; @@ -3228,12 +3232,12 @@ static int do_last(struct nameidata *nd, */ error = complete_walk(nd); if (error) - return error; + return ERR_PTR(error); audit_inode(nd->name, dir, AUDIT_INODE_PARENT); /* trailing slashes? */ if (unlikely(nd->last.name[nd->last.len])) - return -EISDIR; + return ERR_PTR(-EISDIR); } if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { @@ -3292,18 +3296,28 @@ static int do_last(struct nameidata *nd, finish_lookup: error = step_into(nd, 0, dentry, inode, seq); - if (unlikely(error)) - return error; + if (unlikely(error)) { + const char *s; + if (error < 0) + return ERR_PTR(error); + s = get_link(nd); + if (s) { + nd->flags |= LOOKUP_PARENT; + nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL); + nd->stack[0].name = NULL; + return s; + } + } if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) { audit_inode(nd->name, nd->path.dentry, 0); - return -EEXIST; + return ERR_PTR(-EEXIST); } finish_open: /* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... 
*/ error = complete_walk(nd); if (error) - return error; + return ERR_PTR(error); audit_inode(nd->name, nd->path.dentry, 0); if (open_flag & O_CREAT) { error = -EISDIR; @@ -3345,7 +3359,7 @@ static int do_last(struct nameidata *nd, } if (got_write) mnt_drop_write(nd->path.mnt); - return error; + return ERR_PTR(error); } struct dentry *vfs_tmpfile(struct dentry *dentry, umode_t mode, int open_flag) @@ -3448,10 +3462,8 @@ static struct file *path_openat(struct nameidata *nd, } else { const char *s = path_init(nd, flags); while (!(error = link_path_walk(s, nd)) && - (error = do_last(nd, file, op)) > 0) { - nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL); - s = trailing_symlink(nd); - } + (s = do_last(nd, file, op)) != NULL) + ; terminate_walk(nd); } if (likely(!error)) { -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
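A note on the loop shape introduced in 2/9: with lookup_last()/do_last() returning a string, both "follow this symlink" and "error" come back as non-NULL pointers, and the error only surfaces on the next pass, when the walk function is handed an ERR_PTR. The userspace sketch below models that control flow under stated assumptions: that link_path_walk() returns the encoded error when given an ERR_PTR (the context line "s = ERR_PTR(err);" in path_lookupat() suggests as much), and a made-up three-entry symlink table. All toy_* names, TOY_MAX_HOPS and the table contents are hypothetical; the ERR_PTR helpers are simplified stand-ins for include/linux/err.h.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

#define TOY_MAX_HOPS 40	/* arbitrary cap on symlinks followed */

struct toy_nd {
	const char *last;	/* final component being resolved */
	int hops;		/* symlinks followed so far */
};

/* Stand-in for link_path_walk(): passes an ERR_PTR name through as an error. */
static int toy_link_path_walk(const char *name, struct toy_nd *nd)
{
	if (IS_ERR(name))
		return (int)PTR_ERR(name);
	nd->last = name;
	return 0;
}

/*
 * Stand-in for lookup_last() with the new convention:
 *   NULL         -> done
 *   ERR_PTR(-E)  -> error
 *   other string -> symlink body to walk next
 */
static const char *toy_lookup_last(struct toy_nd *nd)
{
	const char *target;

	if (!strcmp(nd->last, "file"))
		return NULL;
	if (!strcmp(nd->last, "link1"))
		target = "link2";
	else if (!strcmp(nd->last, "link2"))
		target = "file";
	else if (!strcmp(nd->last, "self"))
		target = "self";	/* symlink pointing at itself */
	else
		return ERR_PTR(-ENOENT);
	if (++nd->hops > TOY_MAX_HOPS)
		return ERR_PTR(-ELOOP);
	return target;
}

/* Same loop shape as the new path_lookupat(); reports hops via *hops. */
static int toy_path_lookupat(const char *name, int *hops)
{
	struct toy_nd nd = { NULL, 0 };
	const char *s = name;
	int err;

	while (!(err = toy_link_path_walk(s, &nd)) &&
	       (s = toy_lookup_last(&nd)) != NULL)
		;
	*hops = nd.hops;
	return err;
}
```

In this model "link1" resolves after two hops, while "missing" and "self" leave the loop with -ENOENT and -ELOOP respectively, each one iteration after the lookup step produced the error pointer.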
* [PATCH 3/9] merging pick_link() with get_link(), part 3 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (17 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 2/9] merging pick_link() with get_link(), part 2 Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 4/9] merging pick_link() with get_link(), part 4 Al Viro ` (6 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> After a pure jump ("/" or procfs-style symlink) we don't need to hold the link anymore. link_path_walk() dropped it if such a case had been detected, lookup_last/do_last() (i.e. old trailing_symlink()) left it on the stack - it ended up calling terminate_walk() shortly anyway, which would've purged the entire stack. Do it in get_link() itself instead. Simpler logic that way...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index d93e155caded..fe03e4d1144b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1153,7 +1153,9 @@ const char *get_link(struct nameidata *nd) } else { res = get(dentry, inode, &last->done); } - if (IS_ERR_OR_NULL(res)) + if (!res) + goto all_done; + if (IS_ERR(res)) return res; } if (*res == '/') { @@ -1163,9 +1165,11 @@ const char *get_link(struct nameidata *nd) while (unlikely(*++res == '/')) ; } - if (!*res) - res = NULL; - return res; + if (*res) + return res; +all_done: // pure jump + put_link(nd); + return NULL; } /* @@ -2210,11 +2214,7 @@ static int link_path_walk(const char *name, struct nameidata *nd) if (IS_ERR(s)) return PTR_ERR(s); - err = 0; - if (unlikely(!s)) { - /* jumped */ - put_link(nd); - } else { + if (likely(s)) { nd->stack[nd->depth - 1].name = name; name = s; continue; -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 4/9] merging pick_link() with get_link(), part 4 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (18 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 3/9] merging pick_link() with get_link(), part 3 Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 5/9] merging pick_link() with get_link(), part 5 Al Viro ` (5 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> Move the call of get_link() into walk_component(). Change the calling conventions for walk_component() to returning the link body to follow (if any). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 60 ++++++++++++++++++++++++------------------------------ 1 file changed, 27 insertions(+), 33 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index fe03e4d1144b..2c7778d95d32 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1867,7 +1867,7 @@ static int step_into(struct nameidata *nd, int flags, return pick_link(nd, &path, inode, seq); } -static int walk_component(struct nameidata *nd, int flags) +static const char *walk_component(struct nameidata *nd, int flags) { struct dentry *dentry; struct inode *inode; @@ -1882,17 +1882,23 @@ static int walk_component(struct nameidata *nd, int flags) err = handle_dots(nd, nd->last_type); if (!(flags & WALK_MORE) && nd->depth) put_link(nd); - return err; + return ERR_PTR(err); } dentry = lookup_fast(nd, &inode, &seq); if (IS_ERR(dentry)) - return PTR_ERR(dentry); + return ERR_CAST(dentry); if (unlikely(!dentry)) { dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags); if (IS_ERR(dentry)) - return PTR_ERR(dentry); + return ERR_CAST(dentry); } - return step_into(nd, flags, dentry, inode, seq); + err = step_into(nd, flags, dentry, inode, seq); + if (!err) + 
return NULL; + else if (err > 0) + return get_link(nd); + else + return ERR_PTR(err); } /* @@ -2144,6 +2150,7 @@ static int link_path_walk(const char *name, struct nameidata *nd) /* At this point we know we have a real path component. */ for(;;) { + const char *link; u64 hash_len; int type; @@ -2201,24 +2208,18 @@ static int link_path_walk(const char *name, struct nameidata *nd) if (!name) return 0; /* last component of nested symlink */ - err = walk_component(nd, WALK_FOLLOW); + link = walk_component(nd, WALK_FOLLOW); } else { /* not the last component */ - err = walk_component(nd, WALK_FOLLOW | WALK_MORE); + link = walk_component(nd, WALK_FOLLOW | WALK_MORE); } - if (err < 0) - return err; - - if (err) { - const char *s = get_link(nd); - - if (IS_ERR(s)) - return PTR_ERR(s); - if (likely(s)) { - nd->stack[nd->depth - 1].name = name; - name = s; - continue; - } + if (unlikely(link)) { + if (IS_ERR(link)) + return PTR_ERR(link); + /* a symlink to follow */ + nd->stack[nd->depth - 1].name = name; + name = link; + continue; } if (unlikely(!d_can_lookup(nd->path.dentry))) { if (nd->flags & LOOKUP_RCU) { @@ -2334,24 +2335,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags) static inline const char *lookup_last(struct nameidata *nd) { - int err; + const char *link; if (nd->last_type == LAST_NORM && nd->last.name[nd->last.len]) nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; nd->flags &= ~LOOKUP_PARENT; - err = walk_component(nd, 0); - if (unlikely(err)) { - const char *s; - if (err < 0) - return PTR_ERR(err); - s = get_link(nd); - if (s) { - nd->flags |= LOOKUP_PARENT; - nd->stack[0].name = NULL; - return s; - } + link = walk_component(nd, 0); + if (link) { + nd->flags |= LOOKUP_PARENT; + nd->stack[0].name = NULL; } - return NULL; + return link; } static int handle_lookup_down(struct nameidata *nd) -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 5/9] merging pick_link() with get_link(), part 5 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (19 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 4/9] merging pick_link() with get_link(), part 4 Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 6/9] merging pick_link() with get_link(), part 6 Al Viro ` (4 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> move get_link() call into step_into(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 45 +++++++++++++++++++-------------------------- 1 file changed, 19 insertions(+), 26 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 2c7778d95d32..ad6de8b4167e 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1840,14 +1840,14 @@ enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4}; * so we keep a cache of "no, this doesn't need follow_link" * for the common case. 
*/ -static int step_into(struct nameidata *nd, int flags, +static const char *step_into(struct nameidata *nd, int flags, struct dentry *dentry, struct inode *inode, unsigned seq) { struct path path; int err = handle_mounts(nd, dentry, &path, &inode, &seq); if (err < 0) - return err; + return ERR_PTR(err); if (!(flags & WALK_MORE) && nd->depth) put_link(nd); if (likely(!d_is_symlink(path.dentry)) || @@ -1857,14 +1857,18 @@ static int step_into(struct nameidata *nd, int flags, path_to_nameidata(&path, nd); nd->inode = inode; nd->seq = seq; - return 0; + return NULL; } /* make sure that d_is_symlink above matches inode */ if (nd->flags & LOOKUP_RCU) { if (read_seqcount_retry(&path.dentry->d_seq, seq)) - return -ECHILD; + return ERR_PTR(-ECHILD); } - return pick_link(nd, &path, inode, seq); + err = pick_link(nd, &path, inode, seq); + if (err > 0) + return get_link(nd); + else + return ERR_PTR(err); } static const char *walk_component(struct nameidata *nd, int flags) @@ -1892,13 +1896,7 @@ static const char *walk_component(struct nameidata *nd, int flags) if (IS_ERR(dentry)) return ERR_CAST(dentry); } - err = step_into(nd, flags, dentry, inode, seq); - if (!err) - return NULL; - else if (err > 0) - return get_link(nd); - else - return ERR_PTR(err); + return step_into(nd, flags, dentry, inode, seq); } /* @@ -2352,8 +2350,8 @@ static int handle_lookup_down(struct nameidata *nd) { if (!(nd->flags & LOOKUP_RCU)) dget(nd->path.dentry); - return step_into(nd, WALK_NOFOLLOW, - nd->path.dentry, nd->inode, nd->seq); + return PTR_ERR(step_into(nd, WALK_NOFOLLOW, + nd->path.dentry, nd->inode, nd->seq)); } /* Returns 0 and nd will be valid on success; Retuns error, otherwise. 
*/ @@ -3193,6 +3191,7 @@ static const char *do_last(struct nameidata *nd, unsigned seq; struct inode *inode; struct dentry *dentry; + const char *link; int error; nd->flags &= ~LOOKUP_PARENT; @@ -3289,18 +3288,12 @@ static const char *do_last(struct nameidata *nd, } finish_lookup: - error = step_into(nd, 0, dentry, inode, seq); - if (unlikely(error)) { - const char *s; - if (error < 0) - return ERR_PTR(error); - s = get_link(nd); - if (s) { - nd->flags |= LOOKUP_PARENT; - nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL); - nd->stack[0].name = NULL; - return s; - } + link = step_into(nd, 0, dentry, inode, seq); + if (unlikely(link)) { + nd->flags |= LOOKUP_PARENT; + nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL); + nd->stack[0].name = NULL; + return link; } if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) { -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 6/9] merging pick_link() with get_link(), part 6 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (20 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 5/9] merging pick_link() with get_link(), part 5 Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 7/9] finally fold get_link() into pick_link() Al Viro ` (3 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> move the only remaining call of get_link() into pick_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index ad6de8b4167e..adb573e0f424 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1792,14 +1792,14 @@ static inline int handle_dots(struct nameidata *nd, int type) return 0; } -static int pick_link(struct nameidata *nd, struct path *link, +static const char *pick_link(struct nameidata *nd, struct path *link, struct inode *inode, unsigned seq) { int error; struct saved *last; if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) { path_to_nameidata(link, nd); - return -ELOOP; + return ERR_PTR(-ELOOP); } if (!(nd->flags & LOOKUP_RCU)) { if (link->mnt == nd->path.mnt) @@ -1820,7 +1820,7 @@ static int pick_link(struct nameidata *nd, struct path *link, } if (error) { path_put(link); - return error; + return ERR_PTR(error); } } @@ -1829,7 +1829,7 @@ static int pick_link(struct nameidata *nd, struct path *link, clear_delayed_call(&last->done); nd->link_inode = inode; last->seq = seq; - return 1; + return get_link(nd); } enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4}; @@ -1864,11 +1864,7 @@ static const char *step_into(struct nameidata *nd, int flags, if 
(read_seqcount_retry(&path.dentry->d_seq, seq)) return ERR_PTR(-ECHILD); } - err = pick_link(nd, &path, inode, seq); - if (err > 0) - return get_link(nd); - else - return ERR_PTR(err); + return pick_link(nd, &path, inode, seq); } static const char *walk_component(struct nameidata *nd, int flags) -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 7/9] finally fold get_link() into pick_link() 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (21 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 6/9] merging pick_link() with get_link(), part 6 Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 8/9] massage __follow_mount_rcu() a bit Al Viro ` (2 subsequent siblings) 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> kill nd->link_inode, while we are at it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 135 ++++++++++++++++++++++++----------------------------- 1 file changed, 61 insertions(+), 74 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index adb573e0f424..40263f89a54f 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -503,7 +503,6 @@ struct nameidata { } *stack, internal[EMBEDDED_LEVELS]; struct filename *name; struct nameidata *saved; - struct inode *link_inode; unsigned root_seq; int dfd; } __randomize_layout; @@ -962,9 +961,8 @@ int sysctl_protected_regular __read_mostly; * * Returns 0 if following the symlink is allowed, -ve on error. */ -static inline int may_follow_link(struct nameidata *nd) +static inline int may_follow_link(struct nameidata *nd, const struct inode *inode) { - const struct inode *inode; const struct inode *parent; kuid_t puid; @@ -972,7 +970,6 @@ static inline int may_follow_link(struct nameidata *nd) return 0; /* Allowed if owner and follower match. 
*/ - inode = nd->link_inode; if (uid_eq(current_cred()->fsuid, inode->i_uid)) return 0; @@ -1105,73 +1102,6 @@ static int may_create_in_sticky(struct dentry * const dir, return 0; } -static __always_inline -const char *get_link(struct nameidata *nd) -{ - struct saved *last = nd->stack + nd->depth - 1; - struct dentry *dentry = last->link.dentry; - struct inode *inode = nd->link_inode; - int error; - const char *res; - - if (!(nd->flags & LOOKUP_PARENT)) { - error = may_follow_link(nd); - if (unlikely(error)) - return ERR_PTR(error); - } - - if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS)) - return ERR_PTR(-ELOOP); - - if (!(nd->flags & LOOKUP_RCU)) { - touch_atime(&last->link); - cond_resched(); - } else if (atime_needs_update(&last->link, inode)) { - if (unlikely(unlazy_walk(nd))) - return ERR_PTR(-ECHILD); - touch_atime(&last->link); - } - - error = security_inode_follow_link(dentry, inode, - nd->flags & LOOKUP_RCU); - if (unlikely(error)) - return ERR_PTR(error); - - nd->last_type = LAST_BIND; - res = READ_ONCE(inode->i_link); - if (!res) { - const char * (*get)(struct dentry *, struct inode *, - struct delayed_call *); - get = inode->i_op->get_link; - if (nd->flags & LOOKUP_RCU) { - res = get(NULL, inode, &last->done); - if (res == ERR_PTR(-ECHILD)) { - if (unlikely(unlazy_walk(nd))) - return ERR_PTR(-ECHILD); - res = get(dentry, inode, &last->done); - } - } else { - res = get(dentry, inode, &last->done); - } - if (!res) - goto all_done; - if (IS_ERR(res)) - return res; - } - if (*res == '/') { - error = nd_jump_root(nd); - if (unlikely(error)) - return ERR_PTR(error); - while (unlikely(*++res == '/')) - ; - } - if (*res) - return res; -all_done: // pure jump - put_link(nd); - return NULL; -} - /* * follow_up - Find the mountpoint of path's vfsmount * @@ -1795,8 +1725,10 @@ static inline int handle_dots(struct nameidata *nd, int type) static const char *pick_link(struct nameidata *nd, struct path *link, struct inode *inode, unsigned seq) { - int error; struct 
saved *last; + const char *res; + int error; + if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) { path_to_nameidata(link, nd); return ERR_PTR(-ELOOP); @@ -1827,9 +1759,64 @@ static const char *pick_link(struct nameidata *nd, struct path *link, last = nd->stack + nd->depth++; last->link = *link; clear_delayed_call(&last->done); - nd->link_inode = inode; last->seq = seq; - return get_link(nd); + + if (!(nd->flags & LOOKUP_PARENT)) { + error = may_follow_link(nd, inode); + if (unlikely(error)) + return ERR_PTR(error); + } + + if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS)) + return ERR_PTR(-ELOOP); + + if (!(nd->flags & LOOKUP_RCU)) { + touch_atime(&last->link); + cond_resched(); + } else if (atime_needs_update(&last->link, inode)) { + if (unlikely(unlazy_walk(nd))) + return ERR_PTR(-ECHILD); + touch_atime(&last->link); + } + + error = security_inode_follow_link(link->dentry, inode, + nd->flags & LOOKUP_RCU); + if (unlikely(error)) + return ERR_PTR(error); + + nd->last_type = LAST_BIND; + res = READ_ONCE(inode->i_link); + if (!res) { + const char * (*get)(struct dentry *, struct inode *, + struct delayed_call *); + get = inode->i_op->get_link; + if (nd->flags & LOOKUP_RCU) { + res = get(NULL, inode, &last->done); + if (res == ERR_PTR(-ECHILD)) { + if (unlikely(unlazy_walk(nd))) + return ERR_PTR(-ECHILD); + res = get(link->dentry, inode, &last->done); + } + } else { + res = get(link->dentry, inode, &last->done); + } + if (!res) + goto all_done; + if (IS_ERR(res)) + return res; + } + if (*res == '/') { + error = nd_jump_root(nd); + if (unlikely(error)) + return ERR_PTR(error); + while (unlikely(*++res == '/')) + ; + } + if (*res) + return res; +all_done: // pure jump + put_link(nd); + return NULL; } enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4}; -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
* [PATCH 8/9] massage __follow_mount_rcu() a bit 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (22 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 7/9] finally fold get_link() into pick_link() Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-19 3:17 ` [PATCH 9/9] new helper: traverse_mounts() Al Viro 2020-01-30 14:13 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Christian Brauner 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> make the loop more similar to that in follow_managed(), with explicit tracking of flags, etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 70 +++++++++++++++++++++++++++--------------------------- 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 40263f89a54f..310c5ccddf42 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1268,12 +1268,6 @@ int follow_down_one(struct path *path) } EXPORT_SYMBOL(follow_down_one); -static inline int managed_dentry_rcu(const struct path *path) -{ - return (path->dentry->d_flags & DCACHE_MANAGE_TRANSIT) ? - path->dentry->d_op->d_manage(path, true) : 0; -} - /* * Try to skip to top of mountpoint pile in rcuwalk mode. Fail if * we meet a managed dentry that would need blocking. 
@@ -1281,43 +1275,49 @@ static inline int managed_dentry_rcu(const struct path *path) static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, struct inode **inode, unsigned *seqp) { + struct dentry *dentry = path->dentry; + unsigned int flags = dentry->d_flags; + + if (likely(!(flags & DCACHE_MANAGED_DENTRY))) + return true; + + if (unlikely(nd->flags & LOOKUP_NO_XDEV)) + return false; + for (;;) { - struct mount *mounted; /* * Don't forget we might have a non-mountpoint managed dentry * that wants to block transit. */ - switch (managed_dentry_rcu(path)) { - case -ECHILD: - default: - return false; - case -EISDIR: - return true; - case 0: - break; + if (unlikely(flags & DCACHE_MANAGE_TRANSIT)) { + int res = dentry->d_op->d_manage(path, true); + if (res) + return res == -EISDIR; + flags = dentry->d_flags; } - if (!d_mountpoint(path->dentry)) - return !(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT); - - mounted = __lookup_mnt(path->mnt, path->dentry); - if (!mounted) - break; - if (unlikely(nd->flags & LOOKUP_NO_XDEV)) - return false; - path->mnt = &mounted->mnt; - path->dentry = mounted->mnt.mnt_root; - nd->flags |= LOOKUP_JUMPED; - *seqp = read_seqcount_begin(&path->dentry->d_seq); - /* - * Update the inode too. We don't need to re-check the - * dentry sequence number here after this d_inode read, - * because a mount-point is always pinned. - */ - *inode = path->dentry->d_inode; + if (flags & DCACHE_MOUNTED) { + struct mount *mounted = __lookup_mnt(path->mnt, dentry); + if (mounted) { + path->mnt = &mounted->mnt; + dentry = path->dentry = mounted->mnt.mnt_root; + nd->flags |= LOOKUP_JUMPED; + *seqp = read_seqcount_begin(&dentry->d_seq); + *inode = dentry->d_inode; + /* + * We don't need to re-check ->d_seq after this + * ->d_inode read - there will be an RCU delay + * between mount hash removal and ->mnt_root + * becoming unpinned. 
+ */ + flags = dentry->d_flags; + continue; + } + if (read_seqretry(&mount_lock, nd->m_seq)) + return false; + } + return !(flags & DCACHE_NEED_AUTOMOUNT); } - return !read_seqretry(&mount_lock, nd->m_seq) && - !(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT); } static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
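Editorial sketch, not part of the posted patch: the reworked loop keeps a local copy of the flags, returns early when nothing is managed, and re-reads the flags each time it moves to a new dentry. The userspace model below mirrors only that control flow. struct node, the F_* flag names, the manage callbacks and skip_to_top() are all hypothetical stand-ins; the mount_lock sequence recheck, the LOOKUP_NO_XDEV test and the seq/inode updates of the real function are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the DCACHE_* bits used by the patch. */
#define F_MANAGE_TRANSIT	0x1u
#define F_MOUNTED		0x2u
#define F_NEED_AUTOMOUNT	0x4u
#define F_MANAGED	(F_MANAGE_TRANSIT | F_MOUNTED | F_NEED_AUTOMOUNT)

struct node {
	unsigned int flags;
	int (*manage)(const struct node *);	/* models ->d_manage(path, true) */
	struct node *mounted;			/* root of what is mounted here, if any */
};

/* Example callbacks: "treat as a plain directory" and "would block". */
static int manage_isdir(const struct node *n) { (void)n; return -EISDIR; }
static int manage_block(const struct node *n) { (void)n; return -ECHILD; }

/*
 * Returns true if *pos now points at the top of the mount pile and the
 * caller may proceed without blocking; false means "fall back to the
 * slow, blocking path".
 */
static bool skip_to_top(struct node **pos)
{
	struct node *n;
	unsigned int flags;

	assert(pos != NULL && *pos != NULL);
	n = *pos;
	flags = n->flags;

	if (!(flags & F_MANAGED))
		return true;			/* common case: nothing to do */

	for (;;) {
		if (flags & F_MANAGE_TRANSIT) {
			int res = n->manage(n);
			if (res)
				return res == -EISDIR;
			flags = n->flags;	/* callback may have changed them */
		}
		if ((flags & F_MOUNTED) && n->mounted) {
			n = *pos = n->mounted;	/* cross into the mount */
			flags = n->flags;
			continue;
		}
		return !(flags & F_NEED_AUTOMOUNT);
	}
}
```

The point of carrying flags in a local is that every decision inside one iteration is made against a single snapshot, and the snapshot is refreshed exactly where the current dentry changes or a callback may have modified it.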
* [PATCH 9/9] new helper: traverse_mounts() 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (23 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 8/9] massage __follow_mount_rcu() a bit Al Viro @ 2020-01-19 3:17 ` Al Viro 2020-01-30 14:13 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Christian Brauner 25 siblings, 0 replies; 92+ messages in thread From: Al Viro @ 2020-01-19 3:17 UTC (permalink / raw) To: linux-fsdevel Cc: Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman, Christian Brauner, Al Viro From: Al Viro <viro@zeniv.linux.org.uk> common guts of follow_down() and follow_managed() taken to a new helper - traverse_mounts(). The remnants of follow_managed() are folded into its sole remaining caller (handle_mounts()). Calling conventions of handle_mounts() slightly sanitized - instead of the weird "1 for success, -E... for failure" that used to be imposed by the calling conventions of walk_component() et.al. we can use the normal "0 for success, -E... for failure". Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namei.c | 177 ++++++++++++++++++++++------------------------------- 1 file changed, 72 insertions(+), 105 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 310c5ccddf42..d3172e2c7f7f 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1167,91 +1167,79 @@ static int follow_automount(struct path *path, int *count, unsigned lookup_flags } /* - * Handle a dentry that is managed in some way. - * - Flagged for transit management (autofs) - * - Flagged as mountpoint - * - Flagged as automount point - * - * This may only be called in refwalk mode. - * On success path->dentry is known positive. - * - * Serialization is taken care of in namespace.c + * mount traversal - out-of-line part. One note on ->d_flags accesses - + * dentries are pinned but not locked here, so negative dentry can go + * positive right under us. 
Use of smp_load_acquire() provides a barrier + * sufficient for ->d_inode and ->d_flags consistency. */ -static int follow_managed(struct path *path, struct nameidata *nd) +static int __traverse_mounts(struct path *path, unsigned flags, bool *jumped, + int *count, unsigned lookup_flags) { - struct vfsmount *mnt = path->mnt; /* held by caller, must be left alone */ - unsigned flags; + struct vfsmount *mnt = path->mnt; bool need_mntput = false; int ret = 0; - /* Given that we're not holding a lock here, we retain the value in a - * local variable for each dentry as we look at it so that we don't see - * the components of that value change under us */ - while (flags = smp_load_acquire(&path->dentry->d_flags), - unlikely(flags & DCACHE_MANAGED_DENTRY)) { + while (flags & DCACHE_MANAGED_DENTRY) { /* Allow the filesystem to manage the transit without i_mutex * being held. */ if (flags & DCACHE_MANAGE_TRANSIT) { - BUG_ON(!path->dentry->d_op); - BUG_ON(!path->dentry->d_op->d_manage); ret = path->dentry->d_op->d_manage(path, false); flags = smp_load_acquire(&path->dentry->d_flags); if (ret < 0) break; } - /* Transit to a mounted filesystem. */ - if (flags & DCACHE_MOUNTED) { + if (flags & DCACHE_MOUNTED) { // something's mounted on it.. struct vfsmount *mounted = lookup_mnt(path); - if (mounted) { + if (mounted) { // ... 
in our namespace dput(path->dentry); if (need_mntput) mntput(path->mnt); path->mnt = mounted; path->dentry = dget(mounted->mnt_root); + // here we know it's positive + flags = path->dentry->d_flags; need_mntput = true; continue; } - - /* Something is mounted on this dentry in another - * namespace and/or whatever was mounted there in this - * namespace got unmounted before lookup_mnt() could - * get it */ } - /* Handle an automount point */ - if (flags & DCACHE_NEED_AUTOMOUNT) { - ret = follow_automount(path, &nd->total_link_count, - nd->flags); - if (ret < 0) - break; - continue; - } + if (!(flags & DCACHE_NEED_AUTOMOUNT)) + break; - /* We didn't change the current path point */ - break; + // uncovered automount point + ret = follow_automount(path, count, lookup_flags); + flags = smp_load_acquire(&path->dentry->d_flags); + if (ret < 0) + break; } - if (need_mntput) { - if (path->mnt == mnt) - mntput(path->mnt); - if (unlikely(nd->flags & LOOKUP_NO_XDEV)) - ret = -EXDEV; - else - nd->flags |= LOOKUP_JUMPED; - } - if (ret == -EISDIR || !ret) - ret = 1; - if (ret > 0 && unlikely(d_flags_negative(flags))) + if (ret == -EISDIR) + ret = 0; + // possible if you race with several mount --move + if (need_mntput && path->mnt == mnt) + mntput(path->mnt); + if (!ret && unlikely(d_flags_negative(flags))) ret = -ENOENT; - if (unlikely(ret < 0)) { - dput(path->dentry); - if (path->mnt != nd->path.mnt) - mntput(path->mnt); - } + *jumped = need_mntput; return ret; } +static inline int traverse_mounts(struct path *path, bool *jumped, + int *count, unsigned lookup_flags) +{ + unsigned flags = smp_load_acquire(&path->dentry->d_flags); + + /* fastpath */ + if (likely(!(flags & DCACHE_MANAGED_DENTRY))) { + *jumped = false; + if (unlikely(d_flags_negative(flags))) + return -ENOENT; + return 0; + } + return __traverse_mounts(path, flags, jumped, count, lookup_flags); +} + int follow_down_one(struct path *path) { struct vfsmount *mounted; @@ -1268,6 +1256,23 @@ int follow_down_one(struct 
path *path) } EXPORT_SYMBOL(follow_down_one); +/* + * Follow down to the covering mount currently visible to userspace. At each + * point, the filesystem owning that dentry may be queried as to whether the + * caller is permitted to proceed or not. + */ +int follow_down(struct path *path) +{ + struct vfsmount *mnt = path->mnt; + bool jumped; + int ret = traverse_mounts(path, &jumped, NULL, 0); + + if (path->mnt != mnt) + mntput(mnt); + return ret; +} +EXPORT_SYMBOL(follow_down); + /* * Try to skip to top of mountpoint pile in rcuwalk mode. Fail if * we meet a managed dentry that would need blocking. @@ -1324,6 +1329,7 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, struct path *path, struct inode **inode, unsigned int *seqp) { + bool jumped; int ret; path->mnt = nd->path.mnt; @@ -1333,15 +1339,25 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, if (unlikely(!*inode)) return -ENOENT; if (likely(__follow_mount_rcu(nd, path, inode, seqp))) - return 1; + return 0; if (unlazy_child(nd, dentry, seq)) return -ECHILD; // *path might've been clobbered by __follow_mount_rcu() path->mnt = nd->path.mnt; path->dentry = dentry; } - ret = follow_managed(path, nd); - if (likely(ret >= 0)) { + ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags); + if (jumped) { + if (unlikely(nd->flags & LOOKUP_NO_XDEV)) + ret = -EXDEV; + else + nd->flags |= LOOKUP_JUMPED; + } + if (unlikely(ret)) { + dput(path->dentry); + if (path->mnt != nd->path.mnt) + mntput(path->mnt); + } else { *inode = d_backing_inode(path->dentry); *seqp = 0; /* out of RCU mode, so the value doesn't matter */ } @@ -1409,55 +1425,6 @@ static int follow_dotdot_rcu(struct nameidata *nd) return 0; } -/* - * Follow down to the covering mount currently visible to userspace. At each - * point, the filesystem owning that dentry may be queried as to whether the - * caller is permitted to proceed or not. 
- */ -int follow_down(struct path *path) -{ - unsigned managed; - int ret; - - while (managed = READ_ONCE(path->dentry->d_flags), - unlikely(managed & DCACHE_MANAGED_DENTRY)) { - /* Allow the filesystem to manage the transit without i_mutex - * being held. - * - * We indicate to the filesystem if someone is trying to mount - * something here. This gives autofs the chance to deny anyone - * other than its daemon the right to mount on its - * superstructure. - * - * The filesystem may sleep at this point. - */ - if (managed & DCACHE_MANAGE_TRANSIT) { - BUG_ON(!path->dentry->d_op); - BUG_ON(!path->dentry->d_op->d_manage); - ret = path->dentry->d_op->d_manage(path, false); - if (ret < 0) - return ret == -EISDIR ? 0 : ret; - } - - /* Transit to a mounted filesystem. */ - if (managed & DCACHE_MOUNTED) { - struct vfsmount *mounted = lookup_mnt(path); - if (!mounted) - break; - dput(path->dentry); - mntput(path->mnt); - path->mnt = mounted; - path->dentry = dget(mounted->mnt_root); - continue; - } - - /* Don't handle automount points here */ - break; - } - return 0; -} -EXPORT_SYMBOL(follow_down); - /* * Skip to top of mountpoint pile in refwalk mode for follow_dotdot() */ -- 2.20.1 ^ permalink raw reply related [flat|nested] 92+ messages in thread
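Editorial sketch, not part of the posted patch: the shape introduced here is an inline fast path that handles the unmanaged case, an out-of-line slow path for everything else, a plain "0 on success, -E... on failure" return value, and a separate "jumped" out-parameter so that policy such as LOOKUP_NO_XDEV stays in the caller. The userspace model below shows that split under simplifying assumptions. struct node, the F_* flags, traverse(), traverse_slow() and step() are hypothetical names; reference counting, automounts and ->d_manage() are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define F_MOUNTED	0x1u
#define F_NEGATIVE	0x2u	/* stand-in for d_flags_negative() */

struct node {
	unsigned int flags;
	struct node *mounted;	/* root of what is mounted here, if any */
};

/* Out-of-line part: only reached when something is mounted on *pos. */
static int traverse_slow(struct node **pos, unsigned int flags, bool *jumped)
{
	bool moved = false;

	while ((flags & F_MOUNTED) && (*pos)->mounted) {
		*pos = (*pos)->mounted;
		flags = (*pos)->flags;
		moved = true;
	}
	*jumped = moved;
	return (flags & F_NEGATIVE) ? -ENOENT : 0;
}

/* Inline fast path: 0 on success, -E... on failure, never "1". */
static inline int traverse(struct node **pos, bool *jumped)
{
	unsigned int flags;

	assert(pos != NULL && *pos != NULL && jumped != NULL);
	flags = (*pos)->flags;
	if (!(flags & F_MOUNTED)) {
		*jumped = false;
		return (flags & F_NEGATIVE) ? -ENOENT : 0;
	}
	return traverse_slow(pos, flags, jumped);
}

/* Caller-side policy, in the spirit of handle_mounts(). */
static int step(struct node **pos, bool no_xdev)
{
	bool jumped;
	int ret = traverse(pos, &jumped);

	if (jumped && no_xdev)
		ret = -EXDEV;	/* crossing mounts was forbidden */
	return ret;
}
```

Because the traversal helper no longer knows about the caller's flags, the same helper can serve a caller that wants the cross-mount check and one (like follow_down() in the patch) that does not.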
* Re: [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro ` (24 preceding siblings ...) 2020-01-19 3:17 ` [PATCH 9/9] new helper: traverse_mounts() Al Viro @ 2020-01-30 14:13 ` Christian Brauner 25 siblings, 0 replies; 92+ messages in thread From: Christian Brauner @ 2020-01-30 14:13 UTC (permalink / raw) To: Al Viro Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Aleksa Sarai, David Howells, Eric Biederman

On Sun, Jan 19, 2020 at 03:17:13AM +0000, Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
>
> preparation to finish_automount() fix (next commit)
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Just a naming nit below.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

> ---
> fs/namespace.c | 47 ++++++++++++++++++++++++-----------------------
> 1 file changed, 24 insertions(+), 23 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 2fd0c8bcb8c1..5f0a80f17651 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2697,45 +2697,32 @@ static int do_move_mount_old(struct path *path, const char *old_name)
> /*
> * add a mount into a namespace's mount tree
> */
> -static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
> +static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
> + struct path *path, int mnt_flags)

Maybe this should now be named do_add_mount_locked() so callers know
that they need to do locking themselves? But that's bikeshedding...

Christian

^ permalink raw reply [flat|nested] 92+ messages in thread
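Editorial sketch, not part of the thread: the naming nit is about making a locking precondition visible in the function name. The userspace model below shows the usual pattern of a *_locked() worker that expects the caller to hold the lock, plus a thin wrapper for callers that do not. Everything in it is hypothetical: struct mount_tree and its boolean "locked" field are a toy stand-in for the real namespace locking (it is not a working lock and gives no mutual exclusion), and add_mount()/add_mount_locked() are not kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a lockable mount tree; not a real lock. */
struct mount_tree {
	bool locked;
	int nr_mounts;
	int max_mounts;
};

static void tree_lock(struct mount_tree *t)
{
	assert(!t->locked);
	t->locked = true;
}

static void tree_unlock(struct mount_tree *t)
{
	assert(t->locked);
	t->locked = false;
}

/* Caller must hold the tree lock; the suffix documents that. */
static int add_mount_locked(struct mount_tree *t)
{
	assert(t->locked);
	if (t->nr_mounts >= t->max_mounts)
		return -ENOSPC;
	t->nr_mounts++;
	return 0;
}

/* Convenience wrapper for callers with no other work to do under the lock. */
static int add_mount(struct mount_tree *t)
{
	int err;

	tree_lock(t);
	err = add_mount_locked(t);
	tree_unlock(t);
	return err;
}
```

Lifting the lock into the callers, as the patch does, lets a caller such as the automount path take the lock in its own way and bail out early, while the worker keeps a single, checkable precondition.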
* Re: [RFC][PATCHSET][CFT] pathwalk cleanups and fixes 2020-01-19 3:14 ` [RFC][PATCHSET][CFT] pathwalk cleanups and fixes Al Viro 2020-01-19 3:17 ` [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers Al Viro @ 2020-01-19 14:33 ` Ian Kent 1 sibling, 0 replies; 92+ messages in thread From: Ian Kent @ 2020-01-19 14:33 UTC (permalink / raw) To: Al Viro, Aleksa Sarai Cc: Linus Torvalds, David Howells, Eric Biederman, stable, Christian Brauner, Serge Hallyn, dev, Linux Containers, Linux API, linux-fsdevel, Linux Kernel Mailing List On Sun, 2020-01-19 at 03:14 +0000, Al Viro wrote: > OK, vfs.git #work.namei seems to survive xfstests. I think > it cleans the things quite a bit, but it obviously needs more > review and testing. > > Review and testing would be _very_ welcome; it does a lot > of massage, so there had been a plenty of opportunities to fuck up > and fail to spot that. The same goes for profiling - it doesn't > seem to slow the things down, but that needs to be verified. I have run my usual tests (the second run of my submount-test is still going) and they have run through fine. I spend what time I can looking through the series tomorrow but will probably need to complete that when I return from my trip to Albany (Western Australia) some time on Friday. > > It does include #work.openat2. Topology: 17 commits, followed > by clean merge with #work.openat2, followed by 9 followups. The > part is #work.openat2 is as posted by Aleksa; I can repost it, but > I don't see much point. Description of the rest follows; patches > themselves will be in followups. > > part 1: follow_automount() cleanups and fixes. > > Quite a bit of that function had been about working around the > wrong calling conventions of finish_automount(). 
The problem is that > finish_automount() misuses the primitive intended for mount(2) and > friends, where we want to mount on top of the pile, even if something > has managed to add to that while we'd been trying to lock the > namespace. > For automount that's not the right thing to do - there we want to > discard > whatever it was going to attach and just cross into what got mounted > there in the meanwhile (most likely - the results of the same > automount > triggered by somebody else). Current mainline kinda-sorta manages to > do > that, but it's unreliable and very convoluted. Much simpler approach > is to stop using lock_mount() in finish_automount() and have it bail > out if something turns out to have been mounted on top where we > wanted > to attach. That allows to get rid of a lot of PITA in the caller. > Another simplification comes from not trying to cross into the > results > of automount - simply ride through the next iteration of the loop and > let it move into overmount. > > Another thing in the same series is divorcing > follow_automount() > from nameidata; that'll play later when we get to unifying > follow_down() > with the guts of follow_managed(). > > 4 commits, the second one fixes a hard-to-hit race. The first > is a prereq for it. > > 1/17 do_add_mount(): lift lock_mount/unlock_mount into callers > 2/17 fix automount/automount race properly > 3/17 follow_automount(): get rid of dead^Wstillborn code > 4/17 follow_automount() doesn't need the entire nameidata > > part 2: unifying mount traversals in pathwalk. > > Handling of mount traversal (follow_managed()) is currently > called > in a bunch of places. Each of them is shortly followed by a call of > step_into() or an open-coded equivalent thereof. However, the > locations > of those step_into() calls are far from preceding follow_managed(); > moreover, that preceding call might happen on different paths that > converge to given step_into() call. 
It's harder to analyse than it > should > be (especially when it comes to liveness analysis) and it forces > rather > ugly calling conventions on > lookup_fast()/atomic_open()/lookup_open(). > The series below massages the code to the point when the calls of > follow_managed() (and __follow_mount_rcu()) move into the beginning > of > step_into(). > > 5/17 make build_open_flags() treat O_CREAT | O_EXCL as implying > O_NOFOLLOW > gets EEXIST handling in do_last() past the step_into() call > there. > 6/17 handle_mounts(): start building a sane wrapper for > follow_managed() > rather than mangling follow_managed() itself (and creating > conflicts > with openat2 series), add a wrapper that will absorb the > required > interface changes. > 7/17 atomic_open(): saner calling conventions (return dentry on > success) > struct path passed to it is pure out parameter; only dentry > part > ever varies, though - mnt is always nd->path.mnt. Just return > the dentry on success, and ERR_PTR(-E...) on failure. > 8/17 lookup_open(): saner calling conventions (return dentry on > success) > propagate the same change one level up the call chain. > 9/17 do_last(): collapse the call of path_to_nameidata() > struct path filled in lookup_open() call is eventually given to > handle_mounts(); the only use it has before that is > path_to_nameidata() > call in "->atomic_open() has actually opened it" case, and > there > path_to_nameidata() is an overkill - we are guaranteed to > replace > only nd->path.dentry. So have the struct path filled only > immediately > prior to handle_mounts(). > 10/17 handle_mounts(): pass dentry in, turn path into a pure out > argument > now all callers of handle_mount() are directly preceded by > filling > struct path it gets. path->mnt is nd->path.mnt in all cases, > so we can > pass just the dentry instead and fill path in handle_mount() > itself. > Some boilerplate gone, path is pure out argument of > handle_mount() > now.
> 11/17 lookup_fast(): consolidate the RCU success case > massage to gather what will become an RCU case equivalent of > handle_mounts(); basically, that's what we do if revalidate > succeeds > in RCU case of lookup_fast(), including unlazy and fallback to > handle_mounts() if __follow_mount_rcu() says "it's too tricky". > 12/17 teach handle_mounts() to handle RCU mode > ... and take that into handle_mount() itself. The other caller > of > __follow_mount_rcu() is fine with the same fallback (it just > didn't > bother since it's in the very beginning of pathwalk), switched > to > handle_mount() as well. > 13/17 lookup_fast(): take mount traversal into callers > Now we are getting somewhere - both RCU and non-RCU success > cases of > lookup_fast() are ended with the same return > handle_mounts(...); > move that to the callers - there it will merge with the > identical calls > that had been on the paths where we had to do slow lookups. > lookup_fast() returns dentry now. > 14/17 new step_into() flag: WALK_NOFOLLOW > use step_into() instead of open-coding it in > handle_lookup_down(). > Add a flag for "don't follow symlinks regardless of > LOOKUP_FOLLOW" for > that (and eventually, I hope, for .. handling). > Now *all* calls of handle_mounts() and step_into() are right > next to > each other. > 15/17 fold handle_mounts() into step_into() > ... and we can move the call of handle_mounts() into > step_into(), > getting a slightly saner calling conventions out of that. > 16/17 LOOKUP_MOUNTPOINT: fold path_mountpointat() into > path_lookupat() > another payoff from 14/17 - we can teach path_lookupat() to do > what path_mountpointat() used to. And kill the latter, along > with > its wrappers. > 17/17 expand the only remaining call of path_lookup_conditional() > minor cleanup - RIP path_lookup_conditional(). Only one caller > left. > > At that point we run out of things that can be done without textual > conflicts > with openat2 series. 
Changes so far: > * mount traversal is taken into step_into(). > * lookup_fast(), atomic_open() and lookup_open() calling > conventions > are slightly changed. All of them return dentry now, instead of > returning > an int and filling struct path on success. For lookup_fast() the old > "0 for cache miss, 1 for cache hit" is replaced with "NULL stands for > cache > miss, dentry - for hit". > * step_into() can be called in RCU mode as well. Takes > nameidata, > WALK_... flags, dentry and, in RCU case, corresponding inode and seq > value. > Handles mount traversals, decides whether it's a symlink to be > followed. > Error => returns -E...; symlink to follow => returns 1, puts symlink > on stack; > non-symlink or symlink not to follow => returns 0, moves nd->path to > new location. > * LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and > friends > became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in > flags. > > Next comes the merge with Aleksa's openat2 patchset; everything up to > that point > had been non-conflicting with it. That patchset has been posted > earlier; > it's in #work.openat2. The next series comes on top of the merge. > > part 3: untangling the symlink handling. > > Right now when we decide to follow a symlink it happens this > way: > * step_into() decides that it has been given a symlink that > needs to > be followed. > * it calls pick_link(), which pushes the symlink on stack and > returns 1 on success / -E... on error. Symlink's mount/dentry/seq is > stored on stack and the inode is stashed in nd->link_inode. > * step_into() passes that 1 to its callers, which proceed to > pass it > up the call chain for several layers. In all cases we get to > get_link() > call shortly afterwards. > * get_link() is called, picks the inode stashed in nd- > >link_inode > by the pick_link(), does some checks, touches the atime, etc. > * get_link() either picks the link body out of inode or calls > ->get_link(). 
If it's an absolute symlink, we move to the root and > return > the relative portion of the body; if it's a relative one - just > return the > body. If it's a procfs-style one, the call of nd_jump_link() has > been > made and we'd moved to whatever location is desired. And return > NULL, > same as we do for symlink to "/". > * the caller proceeds to deal with the string returned to it. > > The sequence is the same in all cases (nested symlink, trailing > symlink on lookup, trailing symlink on open), but its pieces are not > close > to each other and the bit between the call of pick_link() and > (inevitable) > call of get_link() afterwards is not easy to follow. Moreover, a > bunch > of functions (walk_component/lookup_last/do_last) ends up with the > same > conventions for return values as step_into(). And those conventions > (see above) are not pretty - 0/1/-E... is asking for mistakes, > especially > when returned 1 is used only to direct control flow on a rather > twisted > way to matching get_link() call. And that path can be seriously > twisted. > E.g. when we are trying to open /dev/stdin, we get the following > sequence: > * path_init() has put us into root and returned "/dev/stdin" > * link_path_walk() has eventually reached /dev and left > <LAST_NORM, "stdin"> in nd->last_type/nd->last > * we call do_last(), which sees that we have LAST_NORM and > calls > lookup_fast(). Let's assume that everything is in dcache; we get the > dentry of /dev/stdin and proceed to finish_lookup:, where we call > step_into() > * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick > the > damn thing. Into the stack it goes and we return 1. > * do_last() sees 1 and returns it. > * trailing_symlink() is called (in the top-level loop) and it > calls get_link(). OK, we get "/proc/self/fd/0" for body, move to > root again and return "proc/self/fd/0". 
> * link_path_walk() is given that string, eventually leading us > into > /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle. > * do_last() is called, and similar to the previous case we > eventually reach the call of step_into() with dentry of > /proc/self/fd/0. > * _now_ we can discard /dev/stdin from the stack (we'd been > using its body until now). It's dropped (from step_into()) and we > get > to look at what we'd been given. A symlink to follow, so on the > stack > it goes and we return 1. > * again, do_last() passes 1 to caller > * trailing_symlink() is called and calls get_link(). > * this time it's a procfs symlink and its ->get_link() method > moves us to the mount/dentry of our stdin. And returns NULL. But > the > fun doesn't stop yet. > * trailing_symlink() returns "" to the caller > * link_path_walk() is called on that and does nothing > whatsoever. > * do_last() is called and sees LAST_BIND left by the > get_link(). > It calls handle_dots() > * handle_dots() drops the symlink from stack and returns > * do_last() *FINALLY* proceeds to the point after its call of > step_into() (finish_open:) and gets around to opening the damn thing. > > Making sense of the control flow through all of that is not > fun, > to put it mildly; debugging anything in that area can be a massive > PITA, > and this example has touched only one of 3 cases. Arguably, the > worst > one, but... Anyway, it turns out that this code can be massaged to > considerably saner shape - both in terms of control flow and wrt > calling > conventions. > > 1/9 merging pick_link() with get_link(), part 1 > prep work: move the "hardening" crap from trailing_symlink() > into > get_link() (conditional on the absence of LOOKUP_PARENT in nd- > >flags). > We'll be moving the calls of get_link() around quite a bit through > that > series, and the next step will be to eliminate trailing_symlink().
> 2/9 merging pick_link() with get_link(), part 2 > fold trailing_symlink() into lookup_last() and do_last(). > Now these are returning strings; it's not the final calling > conventions, > but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...) > => > old -E..., we'd failed. string => old 1, and the string is the > symlink > body to follow. Just as for trailing_symlink(), "/" and procfs ones > (where get_link() returns NULL) yield "", so the ugly song and dance > with no-op trip through link_path_walk()/handle_dots() still remains. > 3/9 merging pick_link() with get_link(), part 3 > elimination of that round-trip. In *all* cases having > get_link() return NULL on such symlinks means that we'll proceed to > drop the symlink from stack and get back to the point near that > get_link() call - basically, where we would be if it hadn't been > a symlink at all. The path by which we are getting there depends > upon the call site; the end result is the same in all cases - such > symlinks (procfs ones and symlink to "/") are fully processed by > the time get_link() returns, so we could as well drop them from the > stack right in get_link(). Makes life simpler in terms of control > flow analysis... > And now the calling conventions for do_last() and lookup_last() > have reached the final shape - ERR_PTR(-E...) for error, NULL for > "we are done", string for "traverse this". > 4/9 merging pick_link() with get_link(), part 4 > now all calls of walk_component() are followed by the same > boilerplate - "if it has returned 1, call get_link() and if that > has returned NULL treat that as if walk_component() has returned 0". > Eliminate by folding that into walk_component() itself. Now > walk_component() return value conventions have joined those of > do_last()/lookup_last(). > 5/9 merging pick_link() with get_link(), part 5 > same as for the previous, only this time the boilerplate > migrates one level down, into step_into(). 
Only one caller of > get_link() left, step_into() has joined the same return value > conventions. > 6/9 merging pick_link() with get_link(), part 6 > move that thing into pick_link(). Now all traces of > "return 1 if we are following a symlink" are gone. > 7/9 finally fold get_link() into pick_link() > ta-da - expand get_link() into the only caller. As a side > benefit, we get rid of stashing the inode in nd->link_inode - it > was done only to carry that piece of information from pick_link() > to eventual get_link(). That's not the main benefit, though - the > control flow became considerably easier to reason about. > > For what it's worth, the example above (/dev/stdin) becomes > * path_init() has put us into root and returned "/dev/stdin" > * link_path_walk() has eventually reached /dev and left > <LAST_NORM, "stdin"> in nd->last_type/nd->last > * we call do_last(), which sees that we have LAST_NORM and > calls > lookup_fast(). Let's assume that everything is in dcache; we get the > dentry of /dev/stdin and proceed to finish_lookup:, where we call > step_into() > * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick > the > damn thing. On the stack it goes and we get its body. Which is > "/proc/self/fd/0", so we move to root and return "proc/self/fd/0". > * do_last() sees non-NULL and returns it - whether it's an > error > or a pathname to traverse, we hadn't reached something we'll be > opening. > * link_path_walk() is given that string, eventually leading us > into > /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle. > * do_last() is called, and similar to the previous case we > eventually reach the call of step_into() with dentry of > /proc/self/fd/0. > * _now_ we can discard /dev/stdin from the stack (we'd been > using its body until now). It's dropped (from step_into()) and we > get > to look at what we'd been given. A symlink to follow, so on the > stack > it goes. 
This time it's a procfs symlink and its ->get_link() > method > moves us to the mount/dentry of our stdin. And returns NULL. So we > drop symlink from stack and return that NULL to caller. > * that NULL is returned by step_into(), same as if we had just > moved to a non-symlink. > * do_last() proceeds to open the damn thing. > > part 4. some mount traversal cleanups. > > 8/9 massage __follow_mount_rcu() a bit > make it more similar to non-RCU counterpart > 9/9 new helper: traverse_mounts() > the guts of follow_managed() are very similar to > follow_down(). The calling conventions are different > (follow_managed() > works with nameidata, follow_down() - with standalone struct path), > but the core loop is pretty much the same in both. Turned that loop > into a common helper (traverse_mounts()) and since follow_managed() > becomes a very thin wrapper around it, expand follow_managed() at its > only call site (in handle_mounts()), > > That's where the series stands right now. FWIW, at 5.5-rc1 > fs/namei.c > had been 4867 lines, at the tip of #work.openat2 - 4998, at the > tip of #work.namei (containing #work.openat2) - 4730... And IMO > the thing has become considerably easier to follow. > > What's more, it might be possible to untangle the control flow in > do_last() now. Probably a separate series, though - do_last() is > one hell of a tarpit, so I'm not stepping into it for the rest > of this cycle... > ^ permalink raw reply [flat|nested] 92+ messages in thread
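For readers following the calling-convention change described in the quoted cover letter (old "0 / 1 / -E..." integers replaced by "NULL / string / ERR_PTR(-E...)" pointers), here is a minimal userspace sketch of how such a three-way pointer return can be consumed. It is not code from the series: ERR_PTR(), PTR_ERR() and IS_ERR() are re-created here as userspace stand-ins for the kernel helpers of the same names, and toy_names, toy_step() and toy_resolve() are hypothetical names invented for this illustration, with a flat three-entry table standing in for real path lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical toy namespace: a name is either a symlink or not. */
struct toy_entry {
	const char *name;
	const char *link_body;	/* NULL means "not a symlink" */
};

static const struct toy_entry toy_names[] = {
	{ "stdin", "fd0"  },	/* symlink whose body is "fd0" */
	{ "fd0",   NULL   },	/* ordinary object, nothing to follow */
	{ "loop",  "loop" },	/* symlink pointing at itself */
};

/*
 * Hypothetical analogue of the reworked convention:
 *   ERR_PTR(-E...) => error
 *   NULL           => we are done, nothing further to traverse
 *   string         => symlink body the caller should traverse next
 */
static const char *toy_step(const char *name)
{
	assert(name != NULL);
	for (size_t i = 0; i < sizeof(toy_names) / sizeof(toy_names[0]); i++)
		if (strcmp(toy_names[i].name, name) == 0)
			return toy_names[i].link_body;
	return ERR_PTR(-ENOENT);
}

/* Caller loop: keep traversing while a string comes back. */
int toy_resolve(const char *name, int max_links)
{
	const char *s = name;

	while ((s = toy_step(s)) != NULL) {
		if (IS_ERR(s))
			return (int)PTR_ERR(s);	/* negative errno */
		if (max_links-- <= 0)
			return -ELOOP;		/* too many symlinks followed */
	}
	return 0;
}
```

The point of the sketch is only that the caller needs one test for "done", one for "error", and otherwise already holds the string to walk next, with no separate integer code to carry up the call chain.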
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-03 1:49 ` Al Viro 2020-01-04 4:46 ` Ian Kent 2020-01-08 3:13 ` Al Viro @ 2020-01-10 23:19 ` Al Viro 2020-01-13 1:48 ` Ian Kent 2 siblings, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-10 23:19 UTC (permalink / raw) To: Aleksa Sarai Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel, Ian Kent On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote: > On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote: > > On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote: > > > > > > > Thanks, this fixes the issue for me (and also fixes another reproducer I > > > > found -- mounting a symlink on top of itself then trying to umount it). > > > > > > > > Reported-by: Aleksa Sarai <cyphar@cyphar.com> > > > > Tested-by: Aleksa Sarai <cyphar@cyphar.com> > > > > > > Pushed into #fixes. > > > > Thanks. One other thing I noticed is that umount applies to the > > underlying symlink rather than the mountpoint on top. So, for example > > (using the same scripts I posted in the thread): > > > > # ln -s /tmp/foo link > > # ./mount_to_symlink /etc/passwd link > > # umount -l link # will attempt to unmount "/tmp/foo" > > > > Is that intentional? > > It's a mess, again in mountpoint_last(). FWIW, at some point I proposed > to have nd_jump_link() to fail with -ELOOP if the target was a symlink; > Linus asked for reasons deeper than my dislike of the semantics, I looked > around and hadn't spotted anything. 
And there hadn't been at the time, > but when four months later umount_lookup_last() went in I failed to look > for that source of potential problems in it ;-/ FWIW, since Ian appears to agree that we want ->d_manage() on the mount crossing at the end of umount(2) lookup, here's a much simpler solution - kill mountpoint_last() and switch to using lookup_last(). As a side benefit, LOOKUP_NO_REVAL also goes away. It's possible to trim the things even more (path_mountpoint() is very similar to path_lookupat() at that point, and it's not hard to make the differences conditional on something like LOOKUP_UMOUNT); I would rather do that part in the cleanups series - the one below is easier to backport. Aleksa, Ian - could you see if the patch below works for you? commit e56b43b971a7c08762fceab330a52b7245041dbc Author: Al Viro <viro@zeniv.linux.org.uk> Date: Fri Jan 10 17:17:19 2020 -0500 reimplement path_mountpoint() with less magic ... and get rid of a bunch of bugs in it. Background: the reason for path_mountpoint() is that umount() really doesn't want attempts to revalidate the root of what it's trying to umount. The thing we want to avoid actually happen from complete_walk(); solution was to do something parallel to normal path_lookupat() and it both went overboard and got the boilerplate subtly (and not so subtly) wrong. A better solution is to do pretty much what the normal path_lookupat() does, but instead of complete_walk() do unlazy_walk(). All it takes to avoid that ->d_weak_revalidate() call... mountpoint_last() goes away, along with everything it got wrong, and so does the magic around LOOKUP_NO_REVAL. Another source of bugs is that when we traverse mounts at the final location (and we need to do that - umount . expects to get whatever's overmounting ., if any, out of the lookup) we really ought to take care of ->d_manage() - as it is, manual umount of autofs automount in progress can lead to unpleasant surprises for the daemon. 
Easily solved by using handle_lookup_down() instead of follow_mount(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff --git a/fs/namei.c b/fs/namei.c index d6c91d1e88cb..1793661c3342 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1649,17 +1649,15 @@ static struct dentry *__lookup_slow(const struct qstr *name, if (IS_ERR(dentry)) return dentry; if (unlikely(!d_in_lookup(dentry))) { - if (!(flags & LOOKUP_NO_REVAL)) { - int error = d_revalidate(dentry, flags); - if (unlikely(error <= 0)) { - if (!error) { - d_invalidate(dentry); - dput(dentry); - goto again; - } + int error = d_revalidate(dentry, flags); + if (unlikely(error <= 0)) { + if (!error) { + d_invalidate(dentry); dput(dentry); - dentry = ERR_PTR(error); + goto again; } + dput(dentry); + dentry = ERR_PTR(error); } } else { old = inode->i_op->lookup(inode, dentry, flags); @@ -2618,72 +2616,6 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags, EXPORT_SYMBOL(user_path_at_empty); /** - * mountpoint_last - look up last component for umount - * @nd: pathwalk nameidata - currently pointing at parent directory of "last" - * - * This is a special lookup_last function just for umount. In this case, we - * need to resolve the path without doing any revalidation. - * - * The nameidata should be the result of doing a LOOKUP_PARENT pathwalk. Since - * mountpoints are always pinned in the dcache, their ancestors are too. Thus, - * in almost all cases, this lookup will be served out of the dcache. The only - * cases where it won't are if nd->last refers to a symlink or the path is - * bogus and it doesn't exist. - * - * Returns: - * -error: if there was an error during lookup. This includes -ENOENT if the - * lookup found a negative dentry. - * - * 0: if we successfully resolved nd->last and found it to not to be a - * symlink that needs to be followed. - * - * 1: if we successfully resolved nd->last and found it to be a symlink - * that needs to be followed. 
- */ -static int -mountpoint_last(struct nameidata *nd) -{ - int error = 0; - struct dentry *dir = nd->path.dentry; - struct path path; - - /* If we're in rcuwalk, drop out of it to handle last component */ - if (nd->flags & LOOKUP_RCU) { - if (unlazy_walk(nd)) - return -ECHILD; - } - - nd->flags &= ~LOOKUP_PARENT; - - if (unlikely(nd->last_type != LAST_NORM)) { - error = handle_dots(nd, nd->last_type); - if (error) - return error; - path.dentry = dget(nd->path.dentry); - } else { - path.dentry = d_lookup(dir, &nd->last); - if (!path.dentry) { - /* - * No cached dentry. Mounted dentries are pinned in the - * cache, so that means that this dentry is probably - * a symlink or the path doesn't actually point - * to a mounted dentry. - */ - path.dentry = lookup_slow(&nd->last, dir, - nd->flags | LOOKUP_NO_REVAL); - if (IS_ERR(path.dentry)) - return PTR_ERR(path.dentry); - } - } - if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) { - dput(path.dentry); - return -ENOENT; - } - path.mnt = nd->path.mnt; - return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0); -} - -/** * path_mountpoint - look up a path to be umounted * @nd: lookup context * @flags: lookup flags @@ -2699,14 +2631,17 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path) int err; while (!(err = link_path_walk(s, nd)) && - (err = mountpoint_last(nd)) > 0) { + (err = lookup_last(nd)) > 0) { s = trailing_symlink(nd); } + if (!err) + err = unlazy_walk(nd); + if (!err) + err = handle_lookup_down(nd); if (!err) { *path = nd->path; nd->path.mnt = NULL; nd->path.dentry = NULL; - follow_mount(path); } terminate_walk(nd); return err; diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h index f64a33d2a1d1..2a82dcce5fc1 100644 --- a/fs/nfs/nfstrace.h +++ b/fs/nfs/nfstrace.h @@ -206,7 +206,6 @@ TRACE_DEFINE_ENUM(LOOKUP_AUTOMOUNT); TRACE_DEFINE_ENUM(LOOKUP_PARENT); TRACE_DEFINE_ENUM(LOOKUP_REVAL); TRACE_DEFINE_ENUM(LOOKUP_RCU); -TRACE_DEFINE_ENUM(LOOKUP_NO_REVAL); 
TRACE_DEFINE_ENUM(LOOKUP_OPEN); TRACE_DEFINE_ENUM(LOOKUP_CREATE); TRACE_DEFINE_ENUM(LOOKUP_EXCL); @@ -224,7 +223,6 @@ TRACE_DEFINE_ENUM(LOOKUP_DOWN); { LOOKUP_PARENT, "PARENT" }, \ { LOOKUP_REVAL, "REVAL" }, \ { LOOKUP_RCU, "RCU" }, \ - { LOOKUP_NO_REVAL, "NO_REVAL" }, \ { LOOKUP_OPEN, "OPEN" }, \ { LOOKUP_CREATE, "CREATE" }, \ { LOOKUP_EXCL, "EXCL" }, \ diff --git a/include/linux/namei.h b/include/linux/namei.h index 7fe7b87a3ded..07bfb0874033 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -34,7 +34,6 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; /* internal use only */ #define LOOKUP_PARENT 0x0010 -#define LOOKUP_NO_REVAL 0x0080 #define LOOKUP_JUMPED 0x1000 #define LOOKUP_ROOT 0x2000 #define LOOKUP_ROOT_GRABBED 0x0008 ^ permalink raw reply related [flat|nested] 92+ messages in thread
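The mount_to_symlink helper used in the reproducer quoted in the message above is not included in this part of the thread. Going only by the cover letter's description (bind-mounting over /proc/self/fd/$n where the descriptor was opened with O_PATH|O_NOFOLLOW), a rough sketch of what such a helper might do is given below. The function names fd_proc_path() and bind_mount_over_symlink() are hypothetical, the real script may differ, and whether the mount succeeds depends on the kernel in use and on having the privileges mount(2) requires.

```c
#define _GNU_SOURCE		/* for O_PATH */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

/* Format "/proc/self/fd/<fd>" into buf; 0 on success, -1 if it won't fit. */
static int fd_proc_path(char *buf, size_t len, int fd)
{
	int n = snprintf(buf, len, "/proc/self/fd/%d", fd);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: bind-mount `source` on top of the symlink `link`
 * itself rather than on whatever it points to.
 * Returns 0 on success, -1 with errno set on failure.
 */
int bind_mount_over_symlink(const char *source, const char *link)
{
	char target[64];
	int fd, ret, saved_errno;

	assert(source != NULL && link != NULL);

	/* O_NOFOLLOW with O_PATH yields a descriptor for the symlink itself. */
	fd = open(link, O_PATH | O_NOFOLLOW | O_CLOEXEC);
	if (fd < 0)
		return -1;

	if (fd_proc_path(target, sizeof(target), fd) < 0) {
		close(fd);
		errno = ENAMETOOLONG;
		return -1;
	}

	/* Per the cover letter, MS_BIND via the procfs path is what worked. */
	ret = mount(source, target, NULL, MS_BIND, NULL);
	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);
	errno = saved_errno;
	return ret;
}
```

With such a helper, the quoted sequence would correspond to calling bind_mount_over_symlink("/etc/passwd", "link") and then running umount -l link to see which mount the kernel picks.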
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-10 23:19 ` [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Al Viro @ 2020-01-13 1:48 ` Ian Kent 2020-01-13 3:54 ` Al Viro 0 siblings, 1 reply; 92+ messages in thread From: Ian Kent @ 2020-01-13 1:48 UTC (permalink / raw) To: Al Viro, Aleksa Sarai Cc: David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel On Fri, 2020-01-10 at 23:19 +0000, Al Viro wrote: > On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote: > > On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote: > > > On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote: > > > > > > > > > Thanks, this fixes the issue for me (and also fixes another > > > > > reproducer I > > > > > found -- mounting a symlink on top of itself then trying to > > > > > umount it). > > > > > > > > > > Reported-by: Aleksa Sarai <cyphar@cyphar.com> > > > > > Tested-by: Aleksa Sarai <cyphar@cyphar.com> > > > > > > > > Pushed into #fixes. > > > > > > Thanks. One other thing I noticed is that umount applies to the > > > underlying symlink rather than the mountpoint on top. So, for > > > example > > > (using the same scripts I posted in the thread): > > > > > > # ln -s /tmp/foo link > > > # ./mount_to_symlink /etc/passwd link > > > # umount -l link # will attempt to unmount "/tmp/foo" > > > > > > Is that intentional? > > > > It's a mess, again in mountpoint_last(). FWIW, at some point I > > proposed > > to have nd_jump_link() to fail with -ELOOP if the target was a > > symlink; > > Linus asked for reasons deeper than my dislike of the semantics, I > > looked > > around and hadn't spotted anything. 
And there hadn't been at the > > time, > > but when four months later umount_lookup_last() went in I failed to > > look > > for that source of potential problems in it ;-/ > > FWIW, since Ian appears to agree that we want ->d_manage() on the > mount > crossing at the end of umount(2) lookup, here's a much simpler > solution - > kill mountpoint_last() and switch to using lookup_last(). As a side > benefit, LOOKUP_NO_REVAL also goes away. It's possible to trim the > things even more (path_mountpoint() is very similar to > path_lookupat() > at that point, and it's not hard to make the differences conditional > on > something like LOOKUP_UMOUNT); I would rather do that part in the > cleanups series - the one below is easier to backport. > > Aleksa, Ian - could you see if the patch below works for you? I did try this patch and I was trying to work out why it didn't work. But thought I'd let you know what I saw. Applying it to current Linus tree systemd stops at switch root. Not sure what causes that, I couldn't see any reason for it. I see you have a development branch in your repo. I'll have a look at that rather than continue with this. > > commit e56b43b971a7c08762fceab330a52b7245041dbc > Author: Al Viro <viro@zeniv.linux.org.uk> > Date: Fri Jan 10 17:17:19 2020 -0500 > > reimplement path_mountpoint() with less magic > > ... and get rid of a bunch of bugs in it. Background: > the reason for path_mountpoint() is that umount() really doesn't > want attempts to revalidate the root of what it's trying to > umount. > The thing we want to avoid actually happen from complete_walk(); > solution was to do something parallel to normal path_lookupat() > and it both went overboard and got the boilerplate subtly > (and not so subtly) wrong. > > A better solution is to do pretty much what the normal > path_lookupat() > does, but instead of complete_walk() do unlazy_walk(). All it > takes > to avoid that ->d_weak_revalidate() call... 
mountpoint_last() > goes > away, along with everything it got wrong, and so does the magic > around > LOOKUP_NO_REVAL. > > Another source of bugs is that when we traverse mounts at the > final > location (and we need to do that - umount . expects to get > whatever's > overmounting ., if any, out of the lookup) we really ought to > take > care of ->d_manage() - as it is, manual umount of autofs > automount > in progress can lead to unpleasant surprises for the > daemon. Easily > solved by using handle_lookup_down() instead of follow_mount(). > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > > diff --git a/fs/namei.c b/fs/namei.c > index d6c91d1e88cb..1793661c3342 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -1649,17 +1649,15 @@ static struct dentry *__lookup_slow(const > struct qstr *name, > if (IS_ERR(dentry)) > return dentry; > if (unlikely(!d_in_lookup(dentry))) { > - if (!(flags & LOOKUP_NO_REVAL)) { > - int error = d_revalidate(dentry, flags); > - if (unlikely(error <= 0)) { > - if (!error) { > - d_invalidate(dentry); > - dput(dentry); > - goto again; > - } > + int error = d_revalidate(dentry, flags); > + if (unlikely(error <= 0)) { > + if (!error) { > + d_invalidate(dentry); > dput(dentry); > - dentry = ERR_PTR(error); > + goto again; > } > + dput(dentry); > + dentry = ERR_PTR(error); > } > } else { > old = inode->i_op->lookup(inode, dentry, flags); > @@ -2618,72 +2616,6 @@ int user_path_at_empty(int dfd, const char > __user *name, unsigned flags, > EXPORT_SYMBOL(user_path_at_empty); > > /** > - * mountpoint_last - look up last component for umount > - * @nd: pathwalk nameidata - currently pointing at parent > directory of "last" > - * > - * This is a special lookup_last function just for umount. In this > case, we > - * need to resolve the path without doing any revalidation. > - * > - * The nameidata should be the result of doing a LOOKUP_PARENT > pathwalk. Since > - * mountpoints are always pinned in the dcache, their ancestors are > too. 
Thus, > - * in almost all cases, this lookup will be served out of the > dcache. The only > - * cases where it won't are if nd->last refers to a symlink or the > path is > - * bogus and it doesn't exist. > - * > - * Returns: > - * -error: if there was an error during lookup. This includes > -ENOENT if the > - * lookup found a negative dentry. > - * > - * 0: if we successfully resolved nd->last and found it to not > to be a > - * symlink that needs to be followed. > - * > - * 1: if we successfully resolved nd->last and found it to be a > symlink > - * that needs to be followed. > - */ > -static int > -mountpoint_last(struct nameidata *nd) > -{ > - int error = 0; > - struct dentry *dir = nd->path.dentry; > - struct path path; > - > - /* If we're in rcuwalk, drop out of it to handle last component > */ > - if (nd->flags & LOOKUP_RCU) { > - if (unlazy_walk(nd)) > - return -ECHILD; > - } > - > - nd->flags &= ~LOOKUP_PARENT; > - > - if (unlikely(nd->last_type != LAST_NORM)) { > - error = handle_dots(nd, nd->last_type); > - if (error) > - return error; > - path.dentry = dget(nd->path.dentry); > - } else { > - path.dentry = d_lookup(dir, &nd->last); > - if (!path.dentry) { > - /* > - * No cached dentry. Mounted dentries are > pinned in the > - * cache, so that means that this dentry is > probably > - * a symlink or the path doesn't actually point > - * to a mounted dentry. 
> - */ > - path.dentry = lookup_slow(&nd->last, dir, > - nd->flags | > LOOKUP_NO_REVAL); > - if (IS_ERR(path.dentry)) > - return PTR_ERR(path.dentry); > - } > - } > - if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) > { > - dput(path.dentry); > - return -ENOENT; > - } > - path.mnt = nd->path.mnt; > - return step_into(nd, &path, 0, d_backing_inode(path.dentry), > 0); > -} > - > -/** > * path_mountpoint - look up a path to be umounted > * @nd: lookup context > * @flags: lookup flags > @@ -2699,14 +2631,17 @@ path_mountpoint(struct nameidata *nd, > unsigned flags, struct path *path) > int err; > > while (!(err = link_path_walk(s, nd)) && > - (err = mountpoint_last(nd)) > 0) { > + (err = lookup_last(nd)) > 0) { > s = trailing_symlink(nd); > } > + if (!err) > + err = unlazy_walk(nd); > + if (!err) > + err = handle_lookup_down(nd); > if (!err) { > *path = nd->path; > nd->path.mnt = NULL; > nd->path.dentry = NULL; > - follow_mount(path); > } > terminate_walk(nd); > return err; > diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h > index f64a33d2a1d1..2a82dcce5fc1 100644 > --- a/fs/nfs/nfstrace.h > +++ b/fs/nfs/nfstrace.h > @@ -206,7 +206,6 @@ TRACE_DEFINE_ENUM(LOOKUP_AUTOMOUNT); > TRACE_DEFINE_ENUM(LOOKUP_PARENT); > TRACE_DEFINE_ENUM(LOOKUP_REVAL); > TRACE_DEFINE_ENUM(LOOKUP_RCU); > -TRACE_DEFINE_ENUM(LOOKUP_NO_REVAL); > TRACE_DEFINE_ENUM(LOOKUP_OPEN); > TRACE_DEFINE_ENUM(LOOKUP_CREATE); > TRACE_DEFINE_ENUM(LOOKUP_EXCL); > @@ -224,7 +223,6 @@ TRACE_DEFINE_ENUM(LOOKUP_DOWN); > { LOOKUP_PARENT, "PARENT" }, \ > { LOOKUP_REVAL, "REVAL" }, \ > { LOOKUP_RCU, "RCU" }, \ > - { LOOKUP_NO_REVAL, "NO_REVAL" }, \ > { LOOKUP_OPEN, "OPEN" }, \ > { LOOKUP_CREATE, "CREATE" }, \ > { LOOKUP_EXCL, "EXCL" }, \ > diff --git a/include/linux/namei.h b/include/linux/namei.h > index 7fe7b87a3ded..07bfb0874033 100644 > --- a/include/linux/namei.h > +++ b/include/linux/namei.h > @@ -34,7 +34,6 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, > LAST_BIND}; > > /* internal use only 
*/ > #define LOOKUP_PARENT 0x0010 > -#define LOOKUP_NO_REVAL 0x0080 > #define LOOKUP_JUMPED 0x1000 > #define LOOKUP_ROOT 0x2000 > #define LOOKUP_ROOT_GRABBED 0x0008 ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-13 1:48 ` Ian Kent @ 2020-01-13 3:54 ` Al Viro 2020-01-13 6:00 ` Ian Kent 0 siblings, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-13 3:54 UTC (permalink / raw) To: Ian Kent Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote: > I did try this patch and I was trying to work out why it didn't > work. But thought I'd let you know what I saw. > > Applying it to current Linus tree systemd stops at switch root. > > Not sure what causes that, I couldn't see any reason for it. Wait a minute... So you are seeing problems early in the boot, before any autofs ioctls might come into play? Sigh... Guess I'll have to dig that Fedora KVM image out and try to see what it's about... ;-/ Here comes a couple of hours of build... ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-13 3:54 ` Al Viro @ 2020-01-13 6:00 ` Ian Kent 2020-01-13 6:03 ` Ian Kent 0 siblings, 1 reply; 92+ messages in thread From: Ian Kent @ 2020-01-13 6:00 UTC (permalink / raw) To: Al Viro Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel On Mon, 2020-01-13 at 03:54 +0000, Al Viro wrote: > On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote: > > > I did try this patch and I was trying to work out why it didn't > > work. But thought I'd let you know what I saw. > > > > Applying it to current Linus tree systemd stops at switch root. > > > > Not sure what causes that, I couldn't see any reason for it. > > Wait a minute... So you are seeing problems early in the boot, > before any autofs ioctls might come into play? I did, then I checked it booted without the patch, then tried building from scratch with the patch twice and same thing happened each time. Looked like this, such as it is: [ OK ] Reached target Switch Root. [ OK ] Started Plymouth switch root service. Starting Switch Root... I don't have any evidence but thought it might be this: https://github.com/karelzak/util-linux/blob/master/sys-utils/switch_root.c Mind you, that's not the actual systemd repo. either I probably need to look a lot deeper (and at the actual systemd repo) to work out what's actually being called. > > Sigh... Guess I'll have to dig that Fedora KVM image out and > try to see what it's about... ;-/ Here comes a couple of hours > of build... ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-13 6:00 ` Ian Kent @ 2020-01-13 6:03 ` Ian Kent 2020-01-13 13:30 ` Al Viro 0 siblings, 1 reply; 92+ messages in thread From: Ian Kent @ 2020-01-13 6:03 UTC (permalink / raw) To: Al Viro Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel On Mon, 2020-01-13 at 14:00 +0800, Ian Kent wrote: > On Mon, 2020-01-13 at 03:54 +0000, Al Viro wrote: > > On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote: > > > > > I did try this patch and I was trying to work out why it didn't > > > work. But thought I'd let you know what I saw. > > > > > > Applying it to current Linus tree systemd stops at switch root. > > > > > > Not sure what causes that, I couldn't see any reason for it. > > > > Wait a minute... So you are seeing problems early in the boot, > > before any autofs ioctls might come into play? > > I did, then I checked it booted without the patch, then tried > building from scratch with the patch twice and same thing > happened each time. > > Looked like this, such as it is: > [ OK ] Reached target Switch Root. > [ OK ] Started Plymouth switch root service. > Starting Switch Root... > > I don't have any evidence but thought it might be this: > https://github.com/karelzak/util-linux/blob/master/sys-utils/switch_root.c Oh wait, for systemd I was actually looking at: https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c > > Mind you, that's not the actual systemd repo. either I probably > need to look a lot deeper (and at the actual systemd repo) to > work out what's actually being called. > > > Sigh... Guess I'll have to dig that Fedora KVM image out and > > try to see what it's about... ;-/ Here comes a couple of hours > > of build... ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-13 6:03 ` Ian Kent @ 2020-01-13 13:30 ` Al Viro 2020-01-14 7:25 ` Ian Kent 0 siblings, 1 reply; 92+ messages in thread From: Al Viro @ 2020-01-13 13:30 UTC (permalink / raw) To: Ian Kent Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote: > Oh wait, for systemd I was actually looking at: > https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c > > > > > Mind you, that's not the actual systemd repo. either I probably > > need to look a lot deeper (and at the actual systemd repo) to > > work out what's actually being called. > > > > > Sigh... Guess I'll have to dig that Fedora KVM image out and > > > try to see what it's about... ;-/ Here comes a couple of hours > > > of build... D'oh... And yes, that would've been a bisect hazard - switch to path_lookupat() later in the series gets rid of that. Incremental (to be folded, of course): diff --git a/fs/namei.c b/fs/namei.c index 1793661c3342..204677c37751 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path) (err = lookup_last(nd)) > 0) { s = trailing_symlink(nd); } - if (!err) + if (!err && (nd->flags & LOOKUP_RCU)) err = unlazy_walk(nd); if (!err) err = handle_lookup_down(nd); ^ permalink raw reply related [flat|nested] 92+ messages in thread
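For readers skimming the incremental: the change makes the unlazy_walk() call conditional on the walk still being in RCU mode. Below is a toy userspace model of that guarded sequence, not kernel code. struct toy_nd, TOY_LOOKUP_RCU and the toy_* functions are stand-ins invented here for illustration, and the idea that the real helper must not be called outside RCU mode is an assumption inferred from the fix, not something stated in the message.

```c
/* Toy model only: none of these names exist in the kernel. */
#define TOY_LOOKUP_RCU 0x0040	/* stand-in flag bit; the value is arbitrary here */

struct toy_nd {
	unsigned int flags;	/* toy lookup flags */
	int unlazy_calls;	/* how many times we left RCU mode */
};

/*
 * Stand-in for unlazy_walk(): leave RCU mode.
 * Returns 0 on success, -1 if called while not in RCU mode
 * (assumed to be a misuse in this model).
 */
int toy_unlazy_walk(struct toy_nd *nd)
{
	if (!(nd->flags & TOY_LOOKUP_RCU))
		return -1;
	nd->flags &= ~TOY_LOOKUP_RCU;
	nd->unlazy_calls++;
	return 0;
}

/*
 * Tail of the toy lookup, shaped like the patched hunk:
 * only try to leave RCU mode if we are still in it and
 * nothing has failed so far.
 */
int toy_finish_lookup(struct toy_nd *nd, int err)
{
	if (!err && (nd->flags & TOY_LOOKUP_RCU))
		err = toy_unlazy_walk(nd);
	return err;
}
```

In this model a walk that has already dropped out of RCU mode passes through toy_finish_lookup() untouched, which is the behaviour the guard in the incremental appears to be after.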
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-13 13:30 ` Al Viro @ 2020-01-14 7:25 ` Ian Kent 2020-01-14 12:17 ` Ian Kent 0 siblings, 1 reply; 92+ messages in thread From: Ian Kent @ 2020-01-14 7:25 UTC (permalink / raw) To: Al Viro Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel On Mon, 2020-01-13 at 13:30 +0000, Al Viro wrote: > On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote: > > > Oh wait, for systemd I was actually looking at: > > https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c > > > > > Mind you, that's not the actual systemd repo. either I probably > > > need to look a lot deeper (and at the actual systemd repo) to > > > work out what's actually being called. > > > > > > > Sigh... Guess I'll have to dig that Fedora KVM image out and > > > > try to see what it's about... ;-/ Here comes a couple of hours > > > > of build... > > D'oh... And yes, that would've been a bisect hazard - switch to > path_lookupat() later in the series gets rid of that. Incremental > (to be folded, of course): > > diff --git a/fs/namei.c b/fs/namei.c > index 1793661c3342..204677c37751 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd, unsigned > flags, struct path *path) > (err = lookup_last(nd)) > 0) { > s = trailing_symlink(nd); > } > - if (!err) > + if (!err && (nd->flags & LOOKUP_RCU)) > err = unlazy_walk(nd); > if (!err) > err = handle_lookup_down(nd); Ok, so I've tested with the updated patch. The autofs connectathon tests I use function fine. I also tested sending a SIGKILL to the daemon with about 180 active mounts and restarted the daemon to test the function of the ioctls that Al was concerned about. While the connectathon test expired everything I had 3 mounts left after allowing sufficient expire time with the SIGKILL test.
Those mounts correspond to one map entry that has a mix of NFS vers=3 and vers=2 mount options and NFSv2 isn't supported by the servers I use in testing. I'm inclined to think this is a bug in the automount mount tree re-connection code rather than a problem with this patch since all the other mounts, some simple and others with not so simple constructs, expired fine after automount re-connected to them. There are two other map entries that have an NFS vers=2 option but they are simple mounts that will fail on attempting the automount because the server doesn't support v2 so they don't end up with mounts to reconnect to. This particular map entry, having a mix of NFS vers=3 and vers=2 in the offsets of the entry, will lead to a partial mount of the map entry which is probably not being handled properly by automount when re-connecting to the mounts in the tree. So I think the patch here is fine from an autofs POV. Ian ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-14 7:25 ` Ian Kent @ 2020-01-14 12:17 ` Ian Kent 0 siblings, 0 replies; 92+ messages in thread From: Ian Kent @ 2020-01-14 12:17 UTC (permalink / raw) To: Al Viro Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel On Tue, 2020-01-14 at 15:25 +0800, Ian Kent wrote: > On Mon, 2020-01-13 at 13:30 +0000, Al Viro wrote: > > On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote: > > > > > Oh wait, for systemd I was actually looking at: > > > https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c > > > > > > > Mind you, that's not the actual systemd repo. either I probably > > > > need to look a lot deeper (and at the actual systemd repo) to > > > > work out what's actually being called. > > > > > > > > > Sigh... Guess I'll have to dig that Fedora KVM image out and > > > > > try to see what it's about... ;-/ Here comes a couple of > > > > > hours > > > > > of build... > > > > D'oh... And yes, that would've been a bisect hazard - switch to > > path_lookupat() later in the series gets rid of that. Incremental > > (to be folded, of course): > > > > diff --git a/fs/namei.c b/fs/namei.c > > index 1793661c3342..204677c37751 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd, > > unsigned > > flags, struct path *path) > > (err = lookup_last(nd)) > 0) { > > s = trailing_symlink(nd); > > } > > - if (!err) > > + if (!err && (nd->flags & LOOKUP_RCU)) > > err = unlazy_walk(nd); > > if (!err) > > err = handle_lookup_down(nd); > > Ok, so I've tested with the updated patch. > > The autofs connectathon tests I use function fine. > > I also tested sending a SIGKILL to the daemon with about 180 active > mounts and restarted the daemon to test the function of the ioctls > that Al was concerned about.
> > While the connectathon test expired everything I had 3 mounts left > after allowing sufficient expire time with the SIGKILL test. > > Those mounts correspond to one map entry that has a mix of NFS > vers=3 and vers=2 mount options and NFSv2 isn't supported by the > servers I use in testing. > > I'm inclined to think this is a bug in the automount mount tree > re-connection code rather than a problem with this patch since > all the other mounts, some simple and others with not so simple > constructs, expired fine after automount re-connected to them. > > There are two other map entries that have an NFS vers=2 option but > they are simple mounts that will fail on attempting the automount > because the server doesn't support v2 so they don't end up with > mounts to reconnect to. > > This particular map entry, having a mix of NFS vers=3 and vers=2 > in the offsets of the entry, will lead to a partial mount of the > map entry which is probably not being handled properly by automount > when re-connecting to the mounts in the tree. > > So I think the patch here is fine from an autofs POV. Umm ... unfortunately further testing shows an autofs problem. It appears to be present in the current kernel (so far I've only been able to check the current git head and an earlier kernel but can't remember the version and can't check) so I must have missed it. I'm attempting to bisect now but managed to trash the root file system on my VM. I'll get this done as quickly as I can. Ian ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks 2020-01-01 14:44 ` Aleksa Sarai 2020-01-01 23:40 ` Al Viro @ 2020-01-04 5:52 ` Andy Lutomirski 1 sibling, 0 replies; 92+ messages in thread From: Andy Lutomirski @ 2020-01-04 5:52 UTC (permalink / raw) To: Aleksa Sarai Cc: Al Viro, David Howells, Eric Biederman, Linus Torvalds, stable, Christian Brauner, Serge Hallyn, dev, containers, linux-api, linux-fsdevel, linux-kernel > On Jan 1, 2020, at 11:44 PM, Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote: >>> On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote: >>> Note, BTW, that lookup_last() (aka walk_component()) does just >>> that - we only hit step_into() on LAST_NORM. The same goes >>> for do_last(). mountpoint_last() not doing the same is _not_ >>> intentional - it's definitely a bug. >>> >>> Consider your testcase; link points to . here. So the only >>> thing you could expect from trying to follow it would be >>> the directory 'link' lives in. And you don't have it >>> when you reach the fscker via /proc/self/fd/3; what happens >>> instead is nd->path set to ./link (by nd_jump_link()) *AND* >>> step_into() called, pushing the same ./link onto stack. >>> It violates all kinds of assumptions made by fs/namei.c - >>> when pushing a symlink onto stack nd->path is expected to >>> contain the base directory for resolving it. >>> >>> I'm fairly sure that this is the cause of at least some >>> of the insanity you've caught; there always could be >>> something else, of course, but this hole needs to be >>> closed in any case. >> >> ... and with removal of now unused local variable, that's >> >> mountpoint_last(): fix the treatment of LAST_BIND >> >> step_into() should be attempted only in LAST_NORM >> case, when we have the parent directory (in nd->path). 
>> We get away with that for LAST_DOT and LAST_DOTDOT, >> since those can't be symlinks, making step_into() an >> equivalent of path_to_nameidata() - we do a bit of >> useless work, but that's it. For LAST_BIND (i.e. >> the case when we'd just followed a procfs-style >> symlink) we really can't go there - result might >> be a symlink and we really can't attempt following >> it. >> >> lookup_last() and do_last() do handle that properly; >> mountpoint_last() should do the same. >> >> Cc: stable@vger.kernel.org >> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > > Thanks, this fixes the issue for me (and also fixes another reproducer I > found -- mounting a symlink on top of itself then trying to umount it). > > Reported-by: Aleksa Sarai <cyphar@cyphar.com> > Tested-by: Aleksa Sarai <cyphar@cyphar.com> > > As for the original topic of bind-mounting symlinks -- given this is a > supported feature, would you be okay with me sending an updated > O_EMPTYPATH series? FWIW, I have an actual use case for mounting over a symlink: replacing /etc/resolv.conf. My virtme tool is presented with somewhat arbitrary crud in /etc, where /etc/resolv.conf might be a plain file or a symlink, but, regardless, has inappropriate contents. If it’s a file, I can mount a new file over it. If it’s a symlink and the kernel properly supported it, I could also mount over it. Yes, I could also use overlayfs. Maybe I should regardless. ^ permalink raw reply [flat|nested] 92+ messages in thread
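For context, the /proc/self/fd route described in the cover letter (open the symlink itself with O_PATH|O_NOFOLLOW, then bind-mount over /proc/self/fd/$n) could look roughly like the userspace sketch below. This is a minimal sketch, not code from anyone in the thread: the helper names open_symlink_nofollow(), proc_fd_path() and bind_over_symlink() are hypothetical, the bind mount needs CAP_SYS_ADMIN, and whether the mount succeeds or behaves sanely depends on the kernel, which is exactly what this thread is about.

```c
#define _GNU_SOURCE		/* for O_PATH */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

/*
 * Hypothetical helper: get an O_PATH handle on the symlink itself,
 * not on whatever it points to. Returns an fd, or -1 with errno set.
 */
int open_symlink_nofollow(const char *link)
{
	return open(link, O_PATH | O_NOFOLLOW | O_CLOEXEC);
}

/*
 * Hypothetical helper: format the /proc/self/fd/<n> path for an fd.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int proc_fd_path(int fd, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/proc/self/fd/%d", fd);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: try to bind-mount src over the symlink at link
 * by going through /proc/self/fd. Returns 0 on success, -1 on error
 * with errno preserved from the failing call.
 */
int bind_over_symlink(const char *src, const char *link)
{
	char target[64];
	int fd, ret, saved;

	fd = open_symlink_nofollow(link);
	if (fd < 0)
		return -1;
	if (proc_fd_path(fd, target, sizeof(target)) < 0) {
		close(fd);
		errno = ENAMETOOLONG;
		return -1;
	}
	ret = mount(src, target, NULL, MS_BIND, NULL);
	saved = errno;		/* close() may clobber errno */
	close(fd);
	errno = saved;
	return ret;
}
```

For the resolv.conf case above, a caller would do something like bind_over_symlink("/run/my-resolv.conf", "/etc/resolv.conf"), where the source path is made up for the example.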