Stable Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
@ 2019-12-30  5:20 Aleksa Sarai
  2019-12-30  5:20 ` [PATCH RFC 1/1] " Aleksa Sarai
  2019-12-30  5:44 ` [PATCH RFC 0/1] " Al Viro
  0 siblings, 2 replies; 53+ messages in thread
From: Aleksa Sarai @ 2019-12-30  5:20 UTC (permalink / raw)
  To: Al Viro, David Howells, Eric Biederman, Linus Torvalds
  Cc: Aleksa Sarai, stable, Christian Brauner, Serge Hallyn, dev,
	containers, linux-api, linux-fsdevel, linux-kernel

An undocumented feature of the mount interface was that it was possible
to mount over a symlink (even with the old mount API) by mounting over
/proc/self/fd/$n -- where the corresponding file descrpitor was opened
with (O_PATH|O_NOFOLLOW). This didn't work with traditional "new" mounts
(for a variety of reasons), but MS_BIND worked without issue. With the
new mount API it was even easier.

A reasonably detailed explanation of the issues is provided in the patch
itself, but the full traces produced by both the oopses and deadlocks is
included below (it makes little sense to include them in the commit since we
are disabling this feature, not directly fixing the bugs themselves).

I've posted this as an RFC on whether this feature should be allowed at
all (and if anyone knows of legitimate uses for it), or if we should
work on fixing these other kernel bugs that it exposes.

Oops on NULL dereference:
    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    PGD 8000000181b1f067 P4D 8000000181b1f067 PUD 24829c067 PMD 0
    Oops: 0010 [#1] SMP PTI
    CPU: 6 PID: 20796 Comm: mount_to_symlin Tainted: G           OE     5.5.0-rc1+openat2~v18+ #123
    Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0018:ffffbc7d87e1bcb0 EFLAGS: 00010206
    RAX: 0000000000000000 RBX: ffffa0c28cb633c0 RCX: 000000000000ae5a
    RDX: 0000000000000089 RSI: ffffa0c0eece8840 RDI: ffffa0c0eb8843b0
    RBP: ffffa0c0eb8843b0 R08: ffffdc7d7fbbb770 R09: ffffa0c0ca333000
    R10: 0000000000000000 R11: 808080807fffffff R12: ffffa0c0eece8840
    R13: 0000000000000089 R14: ffffbc7d87e1bdb0 R15: 0000000000000080
    FS:  00007fd921508540(0000) GS:ffffa0c3cf580000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 000000018878a003 CR4: 00000000003606e0
    Call Trace:
     __lookup_slow+0x94/0x160
     lookup_slow+0x36/0x50
     path_mountpoint+0x1be/0x350
     filename_mountpoint+0xa5/0x150
     ? __lookup_hash+0xa0/0xa0
     ksys_umount+0x78/0x490
     __x64_sys_umount+0x12/0x20
     do_syscall_64+0x64/0x240
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fd92143f4e7
    Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09
          00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01
          f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffe98c89cc8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd92143f4e7
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000000167a330
    RBP: 00007ffe98c89da0 R08: 0000000000000000 R09: 000000000000000f
    R10: 00000000004004c6 R11: 0000000000000202 R12: 00000000004010c0
    R13: 00007ffe98c89e80 R14: 0000000000000000 R15: 0000000000000000
    CR2: 0000000000000000

Oops on kernel address:
    BUG: unable to handle page fault for address: ffffbc7d87e1bcc0
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 107d4a067 P4D 107d4a067 PUD 107d4b067 PMD 46d753067 PTE 0
    Oops: 0002 [#2] SMP PTI
    CPU: 4 PID: 20975 Comm: mount_to_symlin Tainted: G      D    OE     5.5.0-rc1+openat2~v18+ #123
    Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
    RIP: 0010:_raw_spin_lock_irqsave+0x28/0x50
    Code: 00 00 0f 1f 44 00 00 41 54 53 48 89 fb 9c 58 0f 1f 44 00 00 49 89 c4
          fa 66 0f 1f 44 00 00 e8 3f 55 82 ff 31 c0 ba 01 00 00 00 <f0> 0f b1
          13 75 07 4c 89 e0 5b 41 5c c3 89 c6 48 89 df e8 01 52 77
    RSP: 0018:ffffbc7d90067bd8 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffffbc7d87e1bcc0 RCX: 0000000200000000
    RDX: 0000000000000001 RSI: ffffbc7d90067c50 RDI: ffffbc7d87e1bcc0
    RBP: ffffbc7d87e1bcc0 R08: 0000000000000001 R09: 0000000000000003
    R10: 0000000000000000 R11: 808080807fffffff R12: 0000000000000246
    R13: ffffa0c28cb633c0 R14: ffffbc7d90067db0 R15: ffffa0c0eece8898
    FS:  00007f4b80214540(0000) GS:ffffa0c3cf500000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffbc7d87e1bcc0 CR3: 000000026d4d0002 CR4: 00000000003606e0
    Call Trace:
     add_wait_queue+0x15/0x40
     d_alloc_parallel+0x36d/0x480
     ? get_acl+0x1a/0x160
     ? wake_up_q+0xa0/0xa0
     __lookup_slow+0x6b/0x160
     lookup_slow+0x36/0x50
     path_mountpoint+0x1be/0x350
     filename_mountpoint+0xa5/0x150
     ? __lookup_hash+0xa0/0xa0
     ksys_umount+0x78/0x490
     __x64_sys_umount+0x12/0x20
     do_syscall_64+0x64/0x240
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f4b8014b4e7
    Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09
          00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01
          f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffee8041b28 EFLAGS: 00000206 ORIG_RAX: 00000000000000a6
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4b8014b4e7
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000019c8330
    RBP: 00007ffee8041c00 R08: 0000000000000000 R09: 000000000000000f
    R10: 00000000004004c6 R11: 0000000000000206 R12: 00000000004010c0
    R13: 00007ffee8041ce0 R14: 0000000000000000 R15: 0000000000000000
    CR2: ffffbc7d87e1bcc0

Apparent deadlock in d_alloc_parallel:
    watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mount_to_symlin:21285]
    CPU: 0 PID: 21285 Comm: mount_to_symlin Tainted: G      D    OE     5.5.0-rc1+openat2~v18+ #123
    Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
    RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
    Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0
          a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84
          c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
    RSP: 0018:ffffbc7d90547be8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
    RAX: 0000000000000101 RBX: ffffffffbac7ac60 RCX: 0000000000000018
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa0c0eece8898
    RBP: ffffa0c0eece8898 R08: 00000000006f6f66 R09: 0000000000000003
    R10: 0000000000000000 R11: 808080807fffffff R12: 00000000e25b3c73
    R13: ffffa0c28cb633c0 R14: ffffbc7d90547db0 R15: ffffa0c0eece8898
    FS:  00007fbb1fd30540(0000) GS:ffffa0c3cf400000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fbb1fbd25a0 CR3: 0000000181ace005 CR4: 00000000003606f0
    Call Trace:
     _raw_spin_lock+0x1a/0x20
     lockref_get_not_dead+0x4f/0x90
     d_alloc_parallel+0x1a8/0x480
     ? get_acl+0x1a/0x160
     __lookup_slow+0x6b/0x160
     lookup_slow+0x36/0x50
     path_mountpoint+0x1be/0x350
     filename_mountpoint+0xa5/0x150
     ? __lookup_hash+0xa0/0xa0
     ksys_umount+0x78/0x490
     __x64_sys_umount+0x12/0x20
     do_syscall_64+0x64/0x240
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fbb1fc674e7
    Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09
          00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01
          f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffd75fcb858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbb1fc674e7
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000f6c330
    RBP: 00007ffd75fcb930 R08: 0000000000000000 R09: 000000000000000f
    R10: 00000000004004a6 R11: 0000000000000202 R12: 00000000004010b0
    R13: 00007ffd75fcba10 R14: 0000000000000000 R15: 0000000000000000

RCU stall when trying to grab /proc/$pid/stack for the stuck process:
    rcu: INFO: rcu_sched self-detected stall on CPU
    rcu:         0-....: (15000 ticks this GP) idle=2c6/1/0x4000000000000002 softirq=1172554/1172554 fqs=6849
            (t=15001 jiffies g=1935177 q=25734)
    NMI backtrace for cpu 0
    CPU: 0 PID: 21285 Comm: mount_to_symlin Tainted: G      D    OEL    5.5.0-rc1+openat2~v18+ #123
    Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
    Call Trace:
     <IRQ>
     dump_stack+0x8f/0xd0
     ? lapic_can_unplug_cpu.cold+0x3e/0x3e
     nmi_cpu_backtrace.cold+0x14/0x52
     nmi_trigger_cpumask_backtrace+0xf6/0xf8
     rcu_dump_cpu_stacks+0x8f/0xbd
     rcu_sched_clock_irq.cold+0x1b2/0x39f
     update_process_times+0x24/0x50
     tick_sched_handle+0x22/0x60
     tick_sched_timer+0x38/0x80
     ? tick_sched_do_timer+0x60/0x60
     __hrtimer_run_queues+0xf6/0x270
     hrtimer_interrupt+0x10e/0x240
     smp_apic_timer_interrupt+0x6c/0x130
     apic_timer_interrupt+0xf/0x20
     </IRQ>
    RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
    Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0
         a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0
         75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
    RSP: 0018:ffffbc7d90547be8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
    RAX: 0000000000000101 RBX: ffffffffbac7ac60 RCX: 0000000000000018
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa0c0eece8898
    RBP: ffffa0c0eece8898 R08: 00000000006f6f66 R09: 0000000000000003
    R10: 0000000000000000 R11: 808080807fffffff R12: 00000000e25b3c73
    R13: ffffa0c28cb633c0 R14: ffffbc7d90547db0 R15: ffffa0c0eece8898
     _raw_spin_lock+0x1a/0x20
     lockref_get_not_dead+0x4f/0x90
     d_alloc_parallel+0x1a8/0x480
     ? get_acl+0x1a/0x160
     __lookup_slow+0x6b/0x160
     lookup_slow+0x36/0x50
     path_mountpoint+0x1be/0x350
     filename_mountpoint+0xa5/0x150
     ? __lookup_hash+0xa0/0xa0
     ksys_umount+0x78/0x490
     __x64_sys_umount+0x12/0x20
     do_syscall_64+0x64/0x240
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fbb1fc674e7
    Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09
          00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01
          f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffd75fcb858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbb1fc674e7
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000f6c330
    RBP: 00007ffd75fcb930 R08: 0000000000000000 R09: 000000000000000f
    R10: 00000000004004a6 R11: 0000000000000202 R12: 00000000004010b0
    R13: 00007ffd75fcba10 R14: 0000000000000000 R15: 0000000000000000

Deadlock on lock_mount after a successful umount(). The watchdog does trigger,
but I could only find this stall when trying to suspend the system in my logs:
    Freezing of tasks failed after 20.010 seconds (2 tasks refusing to freeze, wq_busy=0):
    mount_to_symlin D    0  5850   5849 0x00000004
    Call Trace:
     ? __schedule+0x2dd/0x770
     schedule+0x4a/0xb0
     rwsem_down_write_slowpath+0x256/0x500
     lock_mount+0x22/0xf0
     do_mount+0x4b7/0x9f0
     ksys_mount+0x7e/0xc0
     __x64_sys_mount+0x21/0x30
     do_syscall_64+0x64/0x240
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f86e6355fda
    Code: Bad RIP value.
    RSP: 002b:00007ffc36f952d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f86e6355fda
    RDX: 0000000000402099 RSI: 00000000019a5310 RDI: 00007ffc36f96ee1
    RBP: 00007ffc36f953b0 R08: 0000000000402099 R09: 000000000000000f
    R10: 0000000000001000 R11: 0000000000000206 R12: 00000000004010c0
    R13: 00007ffc36f95490 R14: 0000000000000000 R15: 0000000000000000

Cc: stable@vger.kernel.org # pre-git
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Aleksa Sarai (1):
  mount: universally disallow mounting over symlinks

 fs/namespace.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)


base-commit: fd6988496e79a6a4bdb514a4655d2920209eb85d
-- 
2.24.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH RFC 1/1] mount: universally disallow mounting over symlinks
  2019-12-30  5:20 [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Aleksa Sarai
@ 2019-12-30  5:20 ` " Aleksa Sarai
  2019-12-30  7:34   ` Linus Torvalds
  2019-12-30  5:44 ` [PATCH RFC 0/1] " Al Viro
  1 sibling, 1 reply; 53+ messages in thread
From: Aleksa Sarai @ 2019-12-30  5:20 UTC (permalink / raw)
  To: Al Viro, David Howells, Eric Biederman, Linus Torvalds
  Cc: Aleksa Sarai, stable, Christian Brauner, Serge Hallyn, dev,
	containers, linux-api, linux-fsdevel, linux-kernel

An undocumented feature of the mount interface was that it was possible
to mount over a symlink (even with the old mount API) by mounting over
/proc/self/fd/$n -- where the corresponding file descrpitor was opened
with (O_PATH|O_NOFOLLOW). This didn't work with traditional "new" mounts
(for a variety of reasons), but MS_BIND worked without issue. With the
new mount API it was even easier.

From userspace's perspective, this capability is only really useful as
an attack vector. Until the introduction of openat2(RESOLVE_NO_XDEV),
there was no trivial way to detect if a bind-mount was present. In the
container runtime context (in a similar vein to CVE-2019-19921), this
could result in a privileged process being unable to detect that a
configuration resulted in magic-link usage operating on the wrong
magic-links. Additionally, the API to use this feature was incredibly
strange -- in order to umount, you would have go through
/proc/self/fd/$n again (umounting the path would result in the
*underlying* symlink being followed).

Which brings us to the issues on the kernel side. When umounting a mount
on top of a symlink, several oopses (both NULL and garbage kernel
address dereferences) and deadlocks could be triggered incredibly
trivially. Note that because this works in user namespaces, an
unprivileged user could trigger these oopses incredibly trivially. While
these bugs could be fixed separately, it seems much cleaner to disable a
"feature" which clearly was not intentional (and is not used --
otherwise we would've seen bug reports about it breaking on umount).

Note that because the linux-utils mount(1) helper will expand paths
containing symlinks in user-space, only users which used the mount(2)
syscall directly could possibly have seen this behaviour.

Cc: stable@vger.kernel.org # pre-git
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namespace.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index be601d3a8008..01a62bce105f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2172,8 +2172,12 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
 	if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER)
 		return -EINVAL;
 
+	if (d_is_symlink(mp->m_dentry) ||
+	    d_is_symlink(mnt->mnt.mnt_root))
+		return -EINVAL;
+
 	if (d_is_dir(mp->m_dentry) !=
-	      d_is_dir(mnt->mnt.mnt_root))
+	    d_is_dir(mnt->mnt.mnt_root))
 		return -ENOTDIR;
 
 	return attach_recursive_mnt(mnt, p, mp, false);
@@ -2251,6 +2255,9 @@ static struct mount *__do_loopback(struct path *old_path, int recurse)
 	if (IS_MNT_UNBINDABLE(old))
 		return mnt;
 
+	if (d_is_symlink(old_path->dentry))
+		return mnt;
+
 	if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
 		return mnt;
 
@@ -2635,6 +2642,10 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	if (old_path->dentry != old_path->mnt->mnt_root)
 		goto out;
 
+	if (d_is_symlink(new_path->dentry) ||
+	    d_is_symlink(old_path->dentry))
+		goto out;
+
 	if (d_is_dir(new_path->dentry) !=
 	    d_is_dir(old_path->dentry))
 		goto out;
@@ -2726,10 +2737,6 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
 	    path->mnt->mnt_root == path->dentry)
 		goto unlock;
 
-	err = -EINVAL;
-	if (d_is_symlink(newmnt->mnt.mnt_root))
-		goto unlock;
-
 	newmnt->mnt.mnt_flags = mnt_flags;
 	err = graft_tree(newmnt, parent, mp);
 
-- 
2.24.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2019-12-30  5:20 [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Aleksa Sarai
  2019-12-30  5:20 ` [PATCH RFC 1/1] " Aleksa Sarai
@ 2019-12-30  5:44 ` " Al Viro
  2019-12-30  5:49   ` Aleksa Sarai
  1 sibling, 1 reply; 53+ messages in thread
From: Al Viro @ 2019-12-30  5:44 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel

On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:

> A reasonably detailed explanation of the issues is provided in the patch
> itself, but the full traces produced by both the oopses and deadlocks is
> included below (it makes little sense to include them in the commit since we
> are disabling this feature, not directly fixing the bugs themselves).
> 
> I've posted this as an RFC on whether this feature should be allowed at
> all (and if anyone knows of legitimate uses for it), or if we should
> work on fixing these other kernel bugs that it exposes.

Umm...  Are all of those traces
	a) reproducible on mainline and
	b) reproducible as the first oopsen?

As it is, quite a few might be secondary results of earlier memory
corruption...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2019-12-30  5:44 ` [PATCH RFC 0/1] " Al Viro
@ 2019-12-30  5:49   ` Aleksa Sarai
  2019-12-30  7:29     ` Aleksa Sarai
  0 siblings, 1 reply; 53+ messages in thread
From: Aleksa Sarai @ 2019-12-30  5:49 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1431 bytes --]

On 2019-12-30, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:
> 
> > A reasonably detailed explanation of the issues is provided in the patch
> > itself, but the full traces produced by both the oopses and deadlocks is
> > included below (it makes little sense to include them in the commit since we
> > are disabling this feature, not directly fixing the bugs themselves).
> > 
> > I've posted this as an RFC on whether this feature should be allowed at
> > all (and if anyone knows of legitimate uses for it), or if we should
> > work on fixing these other kernel bugs that it exposes.
> 
> Umm...  Are all of those traces
> 	a) reproducible on mainline and

This was on viro/for-next, I'll retry it on v5.5-rc4.

> 	b) reproducible as the first oopsen?

The NULL and garbage pointer derefs are reproducible as the first oops.
Looking at my logs, it looks like the deadlocks were always triggered
after the oops, but that might just have been a mistake on my part while
testing things.

> As it is, quite a few might be secondary results of earlier memory
> corruption...

Yeah, I thought that might be the case but decided to include them
anyway (the /proc/self/stack RCU stall is definitely the result of other
corruption and stalls).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2019-12-30  5:49   ` Aleksa Sarai
@ 2019-12-30  7:29     ` Aleksa Sarai
  2019-12-30  7:53       ` Linus Torvalds
  2020-01-01  0:43       ` Al Viro
  0 siblings, 2 replies; 53+ messages in thread
From: Aleksa Sarai @ 2019-12-30  7:29 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel

[-- Attachment #1.1: Type: text/plain, Size: 4818 bytes --]

On 2019-12-30, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-12-30, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:
> > 
> > > A reasonably detailed explanation of the issues is provided in the patch
> > > itself, but the full traces produced by both the oopses and deadlocks is
> > > included below (it makes little sense to include them in the commit since we
> > > are disabling this feature, not directly fixing the bugs themselves).
> > > 
> > > I've posted this as an RFC on whether this feature should be allowed at
> > > all (and if anyone knows of legitimate uses for it), or if we should
> > > work on fixing these other kernel bugs that it exposes.
> > 
> > Umm...  Are all of those traces
> > 	a) reproducible on mainline and
> 
> This was on viro/for-next, I'll retry it on v5.5-rc4.

The NULL deref oops is reproducible on v5.5-rc4. Strangely it seems
harder to reproduce than on viro/for-next (I kept reproducing it there
by accident), but I'll double-check if that really is the case.

The simplest reproducer is (using the attached programs and .config):

  ln -s . link
  sudo ./umount_symlink link

There's also a few other whacky behaviours where you get -ELOOP or
-EACCES in cases where you shouldn't -- which results in MNT_DETACH
failing and the mount being impossible to get rid of. A good example is

  sudo ./mount_to_symlink /proc/self/exe link
  sudo ./umount_symlink link # -EACCES

Or

  ln -s . link1
  ln -s . link2
  sudo ./mount_to_symlink link1 link2
  sudo ./umount_symlink link1 # -ELOOP
  sudo ./umount_symlink link2 # -ELOOP

But I am trying to find a reproducer for the "umount of a mount
triggering an Oops" issue.

On another note -- I guess this is considered a feature which should
"just work" and not a bug?

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    PGD 80000003c6fca067 P4D 80000003c6fca067 PUD 3c6f42067 PMD 0
    Oops: 0010 [#1] SMP PTI
    CPU: 4 PID: 4486 Comm: umount_symlink Tainted: G            E     5.5.0-rc4-cyphar #126
    Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0018:ffffb70b82963cc0 EFLAGS: 00010206
    RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc
    RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0
    RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000
    R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0
    R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080
    FS:  00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0
    Call Trace:
     __lookup_slow+0x94/0x160
     lookup_slow+0x36/0x50
     path_mountpoint+0x1be/0x360
     filename_mountpoint+0xa5/0x150
     ? __lookup_hash+0xa0/0xa0
     ksys_umount+0x78/0x490
     __x64_sys_umount+0x12/0x20
     do_syscall_64+0x64/0x240
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fbc2a8274e7
	Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09
		  00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0
		  ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffd1da9b3f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbc2a8274e7
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000001300310
    RBP: 00007ffd1da9b4c0 R08: 0000000000000000 R09: 000000000000000f
    R10: 00007fbc2a92f800 R11: 0000000000000202 R12: 0000000000401090
    R13: 00007ffd1da9b5a0 R14: 0000000000000000 R15: 0000000000000000
    Modules linked in: [snip]
    CR2: 0000000000000000
    ---[ end trace ae473813e34e641d ]---
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0018:ffffb70b82963cc0 EFLAGS: 00010206
    RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc
    RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0
    RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000
    R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0
    R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080
    FS:  00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #1.2: .config --]
[-- Type: application/x-config, Size: 237462 bytes --]

[-- Attachment #1.3: mount_to_symlink.c --]
[-- Type: text/x-c, Size: 1318 bytes --]

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

#define bail(msg) \
	do { printf("mount_to_symlink: %s: %m\n", msg); exit(1); } while (0)

int is_symlink(const char *path)
{
	struct stat stat = {};
	if (lstat(path, &stat) < 0)
		bail("lstat(<path>)");
	return S_ISLNK(stat.st_mode);
}

int main(int argc, char **argv)
{
	struct stat stat = {};
	char *src, *dst, *src_fdpath, *dst_fdpath;
	int src_fd, dst_fd;

	if (argc != 3)
		bail("usage: mount_to_symlink <src> <dst>");

	src_fdpath = src = argv[1];
	dst_fdpath = dst = argv[2];

	if (is_symlink(src)) {
		// open source fd
		src_fd = open(src, O_PATH | O_CLOEXEC | O_NOFOLLOW);
		if (src_fd < 0)
			bail("open(<src>, O_PATH|O_NOFOLLOW)");
		// construct fd path
		asprintf(&src_fdpath, "/proc/self/fd/%d", src_fd);
	}

	if (is_symlink(dst)) {
		// open target fd
		dst_fd = open(dst, O_PATH | O_CLOEXEC | O_NOFOLLOW);
		if (dst_fd < 0)
			bail("open(<dst>, O_PATH|O_NOFOLLOW)");
		// construct fd path
		asprintf(&dst_fdpath, "/proc/self/fd/%d", dst_fd);
	}

	// try to mount
	mount(src_fdpath, dst_fdpath, "", MS_BIND, "");
	printf("mount(%s, %s, MS_BIND) = %m (%d)\n", src, dst, -errno);
	return 0;
}

[-- Attachment #1.4: umount_symlink.c --]
[-- Type: text/x-c, Size: 795 bytes --]

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

#define bail(msg) \
	do { printf("mount_to_symlink: %s: %m\n", msg); exit(1); } while (0)

int main(int argc, char **argv)
{
	struct stat stat = {};
	char *mnt, *mnt_fdpath;
	int mnt_fd;

	if (argc != 2)
		bail("need <mount> argument");

	mnt = argv[1];

	// open mountpoint fd
	mnt_fd = open(mnt, O_PATH | O_CLOEXEC | O_NOFOLLOW);
	if (mnt_fd < 0)
		bail("open(<dst>, O_PATH|O_NOFOLLOW)");

	// get fdpaths
	asprintf(&mnt_fdpath, "/proc/self/fd/%d", mnt_fd);

	// try to mount
	umount2(mnt_fdpath, MNT_DETACH);
	printf("umount2(%s, MNT_DETACH) = %m (%d)\n", mnt, -errno);
	return 0;
}

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 1/1] mount: universally disallow mounting over symlinks
  2019-12-30  5:20 ` [PATCH RFC 1/1] " Aleksa Sarai
@ 2019-12-30  7:34   ` Linus Torvalds
  2019-12-30  8:28     ` Aleksa Sarai
  0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2019-12-30  7:34 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> +       if (d_is_symlink(mp->m_dentry) ||
> +           d_is_symlink(mnt->mnt.mnt_root))
> +               return -EINVAL;

So I don't hate this kind of check in general - overmounting a symlink
sounds odd, but at the same time I get the feeling that the real issue
is that something went wrong earlier.

Yeah, the mount target kind of _is_ a path, but at the same time, we
most definitely want to have the permission to really open the
directory in question, don't we, and I don't see that we should accept
a O_PATH file descriptor.

I feel like the only valid use of "O_PATH" files is to then use them
as the base for an openat() and friends (ie fchmodat/execveat() etc).

But maybe I'm completely wrong, and people really do want O_PATH
handling exactly for mounting too. It does sound a bit odd. By
definition, mounting wants permissions to the mount-point, so what's
the point of using O_PATH?

So instead of saying "don't overmount symlinks", I would feel like
it's the mount system call that should use a proper file descriptor
that isn't FMODE_PATH.

Is it really the symlink that is the issue? Because if it's the
symlink that is the issue then I feel like O_NOFOLLOW should have
triggered it, but your other email seems to say that you really need
O_PATH | O_SYMLINK.

So I'm not sayng that this patch is wrong, but it really smells a bit
like it's papering over the more fundamental issue.

For example, is the problem that when you do a proper

  fd = open("somepath", O_PATH);

in one process, and then another thread does

   fd = open("/proc/<pid>/fd/<opathfd>", O_RDWR);

then we get confused and do bad things on that *second* open? Because
now the second open doesn't have O_PATH, and doesn't ghet marked
FMODE_PATH, but the underlying file descriptor is one of those limited
"is really only useful for openat() and friends".

I dunno. I haven't thought through the whole thing. But the oopses you
quote seem like we're really doing something wrong, and it really does
feel like your patch in no way _fixes_ the wrong thing we're doing,
it's just hiding the symptoms.

               Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2019-12-30  7:29     ` Aleksa Sarai
@ 2019-12-30  7:53       ` Linus Torvalds
  2019-12-30  8:32         ` Aleksa Sarai
  2020-01-01  0:43       ` Al Viro
  1 sibling, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2019-12-30  7:53 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

On Sun, Dec 29, 2019 at 11:30 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
>     BUG: kernel NULL pointer dereference, address: 0000000000000000

Would you mind building with debug info, and then running the oops through

 scripts/decode_stacktrace.sh

which makes those addresses much more legible.

>     #PF: supervisor instruction fetch in kernel mode
>     #PF: error_code(0x0010) - not-present page

Somebody jumped through a NULL pointer.

>     RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc
>     RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0
>     RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000
>     R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0
>     R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080
>     FS:  00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000
>     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0
>     Call Trace:
>      __lookup_slow+0x94/0x160

And "__lookup_slow()" has two indirect calls (they aren't obvious with
retpoline, but look for something  like

        call __x86_indirect_thunk_rax

which is the modern sad way of doing "call *%rax"). One is for
revalidatinging an old dentry, but the one I _suspect_ you trigger is
this one:

                old = inode->i_op->lookup(inode, dentry, flags);

but I thought we only could get here if we know it's a directory.

How did we miss the "d_can_lookup()", which is what should check that
yes, we can call that ->lookup() routine.

This is why I have that suspicion that it's somehow that O_PATH fd
opened in another process without O_PATH causes confusion...

So what I think has happened is that because of the O_PATH thing,
we've ended up with an inode that has never been truly opened (because
O_PATH skips that part), but then with the /proc/<pid>/fd/xyz open, we
now have a file descriptor that _looks_ like it is valid, and we're
treating that inode as if it can be used.

But I'm handwaving.

             Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 1/1] mount: universally disallow mounting over symlinks
  2019-12-30  7:34   ` Linus Torvalds
@ 2019-12-30  8:28     ` Aleksa Sarai
  2020-01-08  4:39       ` Andy Lutomirski
  0 siblings, 1 reply; 53+ messages in thread
From: Aleksa Sarai @ 2019-12-30  8:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 5028 bytes --]

On 2019-12-29, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > +       if (d_is_symlink(mp->m_dentry) ||
> > +           d_is_symlink(mnt->mnt.mnt_root))
> > +               return -EINVAL;
> 
> So I don't hate this kind of check in general - overmounting a symlink
> sounds odd, but at the same time I get the feeling that the real issue
> is that something went wrong earlier.
> 
> Yeah, the mount target kind of _is_ a path, but at the same time, we
> most definitely want to have the permission to really open the
> directory in question, don't we, and I don't see that we should accept
> a O_PATH file descriptor.

The new mount API uses O_PATH under the hood (which is a good thing
since some files you'd like to avoid actually opening -- FIFOs are the
obvious example) so I'm not sure that's something we could really avoid.

But if we block O_PATH for mounts this will achieve the same thing,
because the only way to get a file descriptor that references a symlink
is through (O_PATH | O_NOFOLLOW).

> I feel like the only valid use of "O_PATH" files is to then use them
> as the base for an openat() and friends (ie fchmodat/execveat() etc).

See below, we use this for all sorts of dirty^Wclever tricks.

> But maybe I'm completely wrong, and people really do want O_PATH
> handling exactly for mounting too. It does sound a bit odd. By
> definition, mounting wants permissions to the mount-point, so what's
> the point of using O_PATH?

When you go through O_PATH, you still get a proper 'struct path' which
means that for operations such as mount (or open) you will operate on
the *real* underlying file.

This is part of what makes magic-links so useful (but also quite
terrifying).

> For example, is the problem that when you do a proper
> 
>   fd = open("somepath", O_PATH);
> 
> in one process, and then another thread does
> 
>    fd = open("/proc/<pid>/fd/<opathfd>", O_RDWR);
> 
> then we get confused and do bad things on that *second* open? Because
> now the second open doesn't have O_PATH, and doesn't ghet marked
> FMODE_PATH, but the underlying file descriptor is one of those limited
> "is really only useful for openat() and friends".

Actually, this isn't true (for the same reason as above) -- when you do
a re-open through /proc/$pid/fd/$n you get a real-as-a-heart-attack file
descriptor. We make lots of use of this in container runtimes in order
to do some dirty^Wfun tricks that help us harden the runtime against
malicious container processes.

You might recall that when I was posting the earlier revisions of
openat2(), I also included a patch for O_EMPTYPATH (which basically did
a re-open of /proc/self/fd/$dfd but without needing /proc). That had
precisely the same semantics so that you could do the same operation
without procfs. That patch was dropped before Al merged openat2(), but I
am probably going to revive it for the reasons I outlined below.

> I dunno. I haven't thought through the whole thing. But the oopses you
> quote seem like we're really doing something wrong, and it really does
> feel like your patch in no way _fixes_ the wrong thing we're doing,
> it's just hiding the symptoms.

That's fair enough.

I'll be honest, the real reason why I don't want mounts over symlinks to
be possible is for an entirely different reason. I'm working on a safe
path resolution library to accompany openat2()[1] -- and one of the
things I want to do is to harden all of our uses of procfs (such that if
we are running in a context where procfs has been messed with -- such as
having files bind-mounted -- we can detect it and abort). The issue with
symlinks is that we need to be able to operate on magic-links (such as
/proc/self/fd/$n and /proc/self/exe) -- and if it's possible bind-mount
over those magic-links then we can't detect it at all.

openat2(RESOLVE_NO_XDEV) would block it, but it also blocks going
through magic-links which change your mount (which would almost always
be true). You can't trust /proc/self/mountinfo by definition -- not just
because of the TOCTOU race but also because you can't depend on /proc to
harden against a "bad" /proc. All other options such as
umount2(MNT_EXPIRE) won't help with magic-links because we cannot take
an O_PATH to a magic-link and follow it -- O_PATHs of symlinks are
completely stunted in this respect.

If allowing bind-mounts over symlinks is allowed (which I don't have a
problem with really), it just means we'll need a few more kernel pieces
to get this hardening to work. But these features would be useful
outside of the problems I'm dealing with (O_EMPTYPATH and some kind of
pidfd-based interface to grab the equivalent of /proc/self/exe and a few
other such magic-link targets).

[1]: https://github.com/openSUSE/libpathrs

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2019-12-30  7:53       ` Linus Torvalds
@ 2019-12-30  8:32         ` Aleksa Sarai
  2020-01-02  8:58           ` David Laight
  0 siblings, 1 reply; 53+ messages in thread
From: Aleksa Sarai @ 2019-12-30  8:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 2682 bytes --]

On 2019-12-29, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, Dec 29, 2019 at 11:30 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> >     BUG: kernel NULL pointer dereference, address: 0000000000000000
> 
> Would you mind building with debug info, and then running the oops through
> 
>  scripts/decode_stacktrace.sh
> 
> which makes those addresses much more legible.

Will do.

> >     #PF: supervisor instruction fetch in kernel mode
> >     #PF: error_code(0x0010) - not-present page
> 
> Somebody jumped through a NULL pointer.
> 
> >     RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc
> >     RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0
> >     RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000
> >     R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0
> >     R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080
> >     FS:  00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000
> >     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >     CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0
> >     Call Trace:
> >      __lookup_slow+0x94/0x160
> 
> And "__lookup_slow()" has two indirect calls (they aren't obvious with
> retpoline, but look for something  like
> 
>         call __x86_indirect_thunk_rax
> 
> which is the modern sad way of doing "call *%rax"). One is for
> revalidatinging an old dentry, but the one I _suspect_ you trigger is
> this one:
> 
>                 old = inode->i_op->lookup(inode, dentry, flags);
> 
> but I thought we only could get here if we know it's a directory.
> 
> How did we miss the "d_can_lookup()", which is what should check that
> yes, we can call that ->lookup() routine.

I'll try applying a trivial patch to add d_can_lookup() to see if it
fixes the immediate issue.

> This is why I have that suspicion that it's somehow that O_PATH fd
> opened in another process without O_PATH causes confusion...
> 
> So what I think has happened is that because of the O_PATH thing,
> we've ended up with an inode that has never been truly opened (because
> O_PATH skips that part), but then with the /proc/<pid>/fd/xyz open, we
> now have a file descriptor that _looks_ like it is valid, and we're
> treating that inode as if it can be used.

I'm not sure I agree -- as I mentioned in my other mail, re-opening
through /proc/self/fd/$n works *very* well and has for a long time (in
fact, both LXC and runc depend on this working).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2019-12-30  7:29     ` Aleksa Sarai
  2019-12-30  7:53       ` Linus Torvalds
@ 2020-01-01  0:43       ` Al Viro
  2020-01-01  0:54         ` Al Viro
  1 sibling, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-01  0:43 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel

On Mon, Dec 30, 2019 at 06:29:59PM +1100, Aleksa Sarai wrote:
> On 2019-12-30, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > On 2019-12-30, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:
> > > 
> > > > A reasonably detailed explanation of the issues is provided in the patch
> > > > itself, but the full traces produced by both the oopses and deadlocks is
> > > > included below (it makes little sense to include them in the commit since we
> > > > are disabling this feature, not directly fixing the bugs themselves).
> > > > 
> > > > I've posted this as an RFC on whether this feature should be allowed at
> > > > all (and if anyone knows of legitimate uses for it), or if we should
> > > > work on fixing these other kernel bugs that it exposes.
> > > 
> > > Umm...  Are all of those traces
> > > 	a) reproducible on mainline and
> > 
> > This was on viro/for-next, I'll retry it on v5.5-rc4.
> 
> The NULL deref oops is reproducible on v5.5-rc4. Strangely it seems
> harder to reproduce than on viro/for-next (I kept reproducing it there
> by accident), but I'll double-check if that really is the case.
> 
> The simplest reproducer is (using the attached programs and .config):
> 
>   ln -s . link
>   sudo ./umount_symlink link

FWIW, the problem with that reproducer is that we *CAN'T* resolve that
path.  Look: you have /proc/self/fd/3 resolve to ./link.  OK, you've
asked to follow that.  Got ./link, which is a symlink, so we need to
follow it further.  Relative to what, though?

The meaning of symlink is dependent upon the directory you find it in.
And we don't have any here.

The bug is in mountpoint_last() - we have
        if (unlikely(nd->last_type != LAST_NORM)) {
                error = handle_dots(nd, nd->last_type);
                if (error)
                        return error;
                path.dentry = dget(nd->path.dentry);
        } else {
                path.dentry = d_lookup(dir, &nd->last);
                if (!path.dentry) {
                        /*
                         * No cached dentry. Mounted dentries are pinned in the
                         * cache, so that means that this dentry is probably
                         * a symlink or the path doesn't actually point
                         * to a mounted dentry.
                         */
                        path.dentry = lookup_slow(&nd->last, dir,
                                             nd->flags | LOOKUP_NO_REVAL);
                        if (IS_ERR(path.dentry))
                                return PTR_ERR(path.dentry);
                }
        }
        if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) {
                dput(path.dentry);
                return -ENOENT;
        }
        path.mnt = nd->path.mnt;
        return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0);
in there, and that ends up with step_into() called in case of LAST_DOT/LAST_DOTDOT
(where it's harmless) *AND* in case of LAST_BIND.  Where it very much isn't.

I'm not sure if you have caught anything else, but we really, really should *NOT*
consider the LAST_BIND as "maybe we should follow the result" material.  So
at least the following is needed; could you check if anything else remains
with that applied?

diff --git a/fs/namei.c b/fs/namei.c
index d6c91d1e88cb..d4fbbda8a7ff 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2656,10 +2656,7 @@ mountpoint_last(struct nameidata *nd)
 	nd->flags &= ~LOOKUP_PARENT;
 
 	if (unlikely(nd->last_type != LAST_NORM)) {
-		error = handle_dots(nd, nd->last_type);
-		if (error)
-			return error;
-		path.dentry = dget(nd->path.dentry);
+		return handle_dots(nd, nd->last_type);
 	} else {
 		path.dentry = d_lookup(dir, &nd->last);
 		if (!path.dentry) {

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-01  0:43       ` Al Viro
@ 2020-01-01  0:54         ` Al Viro
  2020-01-01  3:08           ` Al Viro
  0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-01  0:54 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel

On Wed, Jan 01, 2020 at 12:43:24AM +0000, Al Viro wrote:
> I'm not sure if you have caught anything else, but we really, really should *NOT*
> consider the LAST_BIND as "maybe we should follow the result" material.  So
> at least the following is needed; could you check if anything else remains
> with that applied?
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index d6c91d1e88cb..d4fbbda8a7ff 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2656,10 +2656,7 @@ mountpoint_last(struct nameidata *nd)
>  	nd->flags &= ~LOOKUP_PARENT;
>  
>  	if (unlikely(nd->last_type != LAST_NORM)) {
> -		error = handle_dots(nd, nd->last_type);
> -		if (error)
> -			return error;
> -		path.dentry = dget(nd->path.dentry);
> +		return handle_dots(nd, nd->last_type);
>  	} else {
>  		path.dentry = d_lookup(dir, &nd->last);
>  		if (!path.dentry) {

Note, BTW, that lookup_last() (aka walk_component()) does just
that - we only hit step_into() on LAST_NORM.  The same goes
for do_last().  mountpoint_last() not doing the same is _not_
intentional - it's definitely a bug.

Consider your testcase; link points to . here.  So the only
thing you could expect from trying to follow it would be
the directory 'link' lives in.  And you don't have it
when you reach the fscker via /proc/self/fd/3; what happens
instead is nd->path set to ./link (by nd_jump_link()) *AND*
step_into() called, pushing the same ./link onto stack.
It violates all kinds of assumptions made by fs/namei.c -
when pushing a symlink onto stack nd->path is expected to
contain the base directory for resolving it.

I'm fairly sure that this is the cause of at least some
of the insanity you've caught; there always could be
something else, of course, but this hole needs to be
closed in any case.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-01  0:54         ` Al Viro
@ 2020-01-01  3:08           ` Al Viro
  2020-01-01 14:44             ` Aleksa Sarai
  0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-01  3:08 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel

On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
> Note, BTW, that lookup_last() (aka walk_component()) does just
> that - we only hit step_into() on LAST_NORM.  The same goes
> for do_last().  mountpoint_last() not doing the same is _not_
> intentional - it's definitely a bug.
> 
> Consider your testcase; link points to . here.  So the only
> thing you could expect from trying to follow it would be
> the directory 'link' lives in.  And you don't have it
> when you reach the fscker via /proc/self/fd/3; what happens
> instead is nd->path set to ./link (by nd_jump_link()) *AND*
> step_into() called, pushing the same ./link onto stack.
> It violates all kinds of assumptions made by fs/namei.c -
> when pushing a symlink onto stack nd->path is expected to
> contain the base directory for resolving it.
> 
> I'm fairly sure that this is the cause of at least some
> of the insanity you've caught; there always could be
> something else, of course, but this hole needs to be
> closed in any case.

... and with removal of now unused local variable, that's

mountpoint_last(): fix the treatment of LAST_BIND

step_into() should be attempted only in LAST_NORM
case, when we have the parent directory (in nd->path).
We get away with that for LAST_DOT and LOST_DOTDOT,
since those can't be symlinks, making step_init() and
equivalent of path_to_nameidata() - we do a bit of
useless work, but that's it.  For LAST_BIND (i.e.
the case when we'd just followed a procfs-style
symlink) we really can't go there - result might
be a symlink and we really can't attempt following
it.

lookup_last() and do_last() do handle that properly;
mountpoint_last() should do the same.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/namei.c b/fs/namei.c
index d6c91d1e88cb..13f9f973722b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2643,7 +2643,6 @@ EXPORT_SYMBOL(user_path_at_empty);
 static int
 mountpoint_last(struct nameidata *nd)
 {
-	int error = 0;
 	struct dentry *dir = nd->path.dentry;
 	struct path path;
 
@@ -2656,10 +2655,7 @@ mountpoint_last(struct nameidata *nd)
 	nd->flags &= ~LOOKUP_PARENT;
 
 	if (unlikely(nd->last_type != LAST_NORM)) {
-		error = handle_dots(nd, nd->last_type);
-		if (error)
-			return error;
-		path.dentry = dget(nd->path.dentry);
+		return handle_dots(nd, nd->last_type);
 	} else {
 		path.dentry = d_lookup(dir, &nd->last);
 		if (!path.dentry) {

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-01  3:08           ` Al Viro
@ 2020-01-01 14:44             ` Aleksa Sarai
  2020-01-01 23:40               ` Al Viro
  2020-01-04  5:52               ` Andy Lutomirski
  0 siblings, 2 replies; 53+ messages in thread
From: Aleksa Sarai @ 2020-01-01 14:44 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3198 bytes --]

On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
> > Note, BTW, that lookup_last() (aka walk_component()) does just
> > that - we only hit step_into() on LAST_NORM.  The same goes
> > for do_last().  mountpoint_last() not doing the same is _not_
> > intentional - it's definitely a bug.
> > 
> > Consider your testcase; link points to . here.  So the only
> > thing you could expect from trying to follow it would be
> > the directory 'link' lives in.  And you don't have it
> > when you reach the fscker via /proc/self/fd/3; what happens
> > instead is nd->path set to ./link (by nd_jump_link()) *AND*
> > step_into() called, pushing the same ./link onto stack.
> > It violates all kinds of assumptions made by fs/namei.c -
> > when pushing a symlink onto stack nd->path is expected to
> > contain the base directory for resolving it.
> > 
> > I'm fairly sure that this is the cause of at least some
> > of the insanity you've caught; there always could be
> > something else, of course, but this hole needs to be
> > closed in any case.
> 
> ... and with removal of now unused local variable, that's
> 
> mountpoint_last(): fix the treatment of LAST_BIND
> 
> step_into() should be attempted only in LAST_NORM
> case, when we have the parent directory (in nd->path).
> We get away with that for LAST_DOT and LOST_DOTDOT,
> since those can't be symlinks, making step_init() and
> equivalent of path_to_nameidata() - we do a bit of
> useless work, but that's it.  For LAST_BIND (i.e.
> the case when we'd just followed a procfs-style
> symlink) we really can't go there - result might
> be a symlink and we really can't attempt following
> it.
> 
> lookup_last() and do_last() do handle that properly;
> mountpoint_last() should do the same.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Thanks, this fixes the issue for me (and also fixes another reproducer I
found -- mounting a symlink on top of itself then trying to umount it).

Reported-by: Aleksa Sarai <cyphar@cyphar.com>
Tested-by: Aleksa Sarai <cyphar@cyphar.com>

As for the original topic of bind-mounting symlinks -- given this is a
supported feature, would you be okay with me sending an updated
O_EMPTYPATH series?

> ---
> diff --git a/fs/namei.c b/fs/namei.c
> index d6c91d1e88cb..13f9f973722b 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2643,7 +2643,6 @@ EXPORT_SYMBOL(user_path_at_empty);
>  static int
>  mountpoint_last(struct nameidata *nd)
>  {
> -	int error = 0;
>  	struct dentry *dir = nd->path.dentry;
>  	struct path path;
>  
> @@ -2656,10 +2655,7 @@ mountpoint_last(struct nameidata *nd)
>  	nd->flags &= ~LOOKUP_PARENT;
>  
>  	if (unlikely(nd->last_type != LAST_NORM)) {
> -		error = handle_dots(nd, nd->last_type);
> -		if (error)
> -			return error;
> -		path.dentry = dget(nd->path.dentry);
> +		return handle_dots(nd, nd->last_type);
>  	} else {
>  		path.dentry = d_lookup(dir, &nd->last);
>  		if (!path.dentry) {


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-01 14:44             ` Aleksa Sarai
@ 2020-01-01 23:40               ` Al Viro
  2020-01-02  3:59                 ` Aleksa Sarai
  2020-01-04  5:52               ` Andy Lutomirski
  1 sibling, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-01 23:40 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel

On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:

> Thanks, this fixes the issue for me (and also fixes another reproducer I
> found -- mounting a symlink on top of itself then trying to umount it).
> 
> Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> Tested-by: Aleksa Sarai <cyphar@cyphar.com>

Pushed into #fixes.

> As for the original topic of bind-mounting symlinks -- given this is a
> supported feature, would you be okay with me sending an updated
> O_EMPTYPATH series?

Post it on fsdevel; I'll need to reread it anyway to say anything useful...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-01 23:40               ` Al Viro
@ 2020-01-02  3:59                 ` Aleksa Sarai
  2020-01-03  1:49                   ` Al Viro
  0 siblings, 1 reply; 53+ messages in thread
From: Aleksa Sarai @ 2020-01-02  3:59 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 855 bytes --]

On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> 
> > Thanks, this fixes the issue for me (and also fixes another reproducer I
> > found -- mounting a symlink on top of itself then trying to umount it).
> > 
> > Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> > Tested-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> Pushed into #fixes.

Thanks. One other thing I noticed is that umount applies to the
underlying symlink rather than the mountpoint on top. So, for example
(using the same scripts I posted in the thread):

  # ln -s /tmp/foo link
  # ./mount_to_symlink /etc/passwd link
  # umount -l link # will attempt to unmount "/tmp/foo"

Is that intentional?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2019-12-30  8:32         ` Aleksa Sarai
@ 2020-01-02  8:58           ` David Laight
  2020-01-02  9:09             ` Aleksa Sarai
  0 siblings, 1 reply; 53+ messages in thread
From: David Laight @ 2020-01-02  8:58 UTC (permalink / raw)
  To: Aleksa Sarai, Linus Torvalds
  Cc: Al Viro, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

From: Aleksa Sarai
> Sent: 30 December 2019 08:32
...
> I'm not sure I agree -- as I mentioned in my other mail, re-opening
> through /proc/self/fd/$n works *very* well and has for a long time (in
> fact, both LXC and runc depend on this working).

I thought it was marginally broken because it is followed as a symlink?
On, for example, NetBSD /proc/<n>/fd/<n> is a real reference to the
filesystem inode and can be used to link the file back into the filesystem
if all the directory entries have been removed.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-02  8:58           ` David Laight
@ 2020-01-02  9:09             ` Aleksa Sarai
  0 siblings, 0 replies; 53+ messages in thread
From: Aleksa Sarai @ 2020-01-02  9:09 UTC (permalink / raw)
  To: David Laight
  Cc: Linus Torvalds, Al Viro, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 937 bytes --]

On 2020-01-02, David Laight <David.Laight@ACULAB.COM> wrote:
> From: Aleksa Sarai
> > Sent: 30 December 2019 08:32
> ...
> > I'm not sure I agree -- as I mentioned in my other mail, re-opening
> > through /proc/self/fd/$n works *very* well and has for a long time (in
> > fact, both LXC and runc depend on this working).
> 
> I thought it was marginally broken because it is followed as a symlink?
> On, for example, NetBSD /proc/<n>/fd/<n> is a real reference to the
> filesystem inode and can be used to link the file back into the filesystem
> if all the directory entries have been removed.

That is also the case on Linux. It (strictly speaking) isn't a symlink
in the normal sense of the word, it's a magic-link (nd_jump_link
switches the nd->path to the actual 'struct file' in the case of
/proc/self/fd/$n).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-02  3:59                 ` Aleksa Sarai
@ 2020-01-03  1:49                   ` Al Viro
  2020-01-04  4:46                     ` Ian Kent
                                       ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Al Viro @ 2020-01-03  1:49 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel, Ian Kent

On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
> On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> > 
> > > Thanks, this fixes the issue for me (and also fixes another reproducer I
> > > found -- mounting a symlink on top of itself then trying to umount it).
> > > 
> > > Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> > > Tested-by: Aleksa Sarai <cyphar@cyphar.com>
> > 
> > Pushed into #fixes.
> 
> Thanks. One other thing I noticed is that umount applies to the
> underlying symlink rather than the mountpoint on top. So, for example
> (using the same scripts I posted in the thread):
> 
>   # ln -s /tmp/foo link
>   # ./mount_to_symlink /etc/passwd link
>   # umount -l link # will attempt to unmount "/tmp/foo"
> 
> Is that intentional?

It's a mess, again in mountpoint_last().  FWIW, at some point I proposed
to have nd_jump_link() to fail with -ELOOP if the target was a symlink;
Linus asked for reasons deeper than my dislike of the semantics, I looked
around and hadn't spotted anything.  And there hadn't been at the time,
but when four months later umount_lookup_last() went in I failed to look
for that source of potential problems in it ;-/

I've looked at that area again now.  Aside of usual cursing at do_last()
horrors (yes, its control flow is a horror; yes, it needs serious massage;
no, it's not a good idea to get sidetracked into that right now), there
are several fun questions:
	* d_manage() and d_automount().  We almost certainly don't
want those for autofs on the final component of pathname in umount,
including the trailing symlinks.  But do we want those on usual access
via /proc/*/fd/*?  I.e. suppose somebody does open() (O_PATH or not)
in autofs; do we want ->d_manage()/->d_automount() called when
resolving /proc/self/fd/<whatever>/foo/bar?  We do not; is that
correct from autofs point of view?  I suspect that refusing to
do ->d_automount() is correct, but I don't understand ->d_manage()
purpose well enough to tell.
	* I really hope that the weird "trailing / forces automount
even in cases when we normally wouldn't trigger it" (stat /mnt/foo
vs. stat /mnt/foo/) is not meant to extend to umount.  I'd like
Ian's confirmation, though.
	* do we want ->d_manage() on following .. into overmounted
directory?  Again, autofs question...

	The minimal fix to mountpoint_last() would be to have
follow_mount() done in LAST_NORM case.  However, I'd like to understand
(and hopefully regularize) the rules for follow_mount()/follow_managed().
Additional scary question is nfsd iterplay with automount.  For nfs4
exports it's potentially interesting...

	Ian, could you comment on the autofs questions above?
I'd rather avoid doing changes in that area without your input -
it's subtle and breakage in automount-related behaviour can be
mysterious as hell.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-03  1:49                   ` Al Viro
@ 2020-01-04  4:46                     ` Ian Kent
  2020-01-08  3:13                     ` Al Viro
  2020-01-10 23:19                     ` [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Al Viro
  2 siblings, 0 replies; 53+ messages in thread
From: Ian Kent @ 2020-01-04  4:46 UTC (permalink / raw)
  To: Al Viro, Aleksa Sarai
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel


It may be a bit off-topic here but, in autofs symlinks can be used
in place of mounts. That mechanism can be used (mostly nowadays) with
amd map format maps.

If I'm using symlinks instead of mounts (where I can) I definitely
don't want these to be over mounted by a mount.

I haven't seen problems like that happening but if it did happen
that would be a bug in automount or user mis-use of some sort.

On Fri, 2020-01-03 at 01:49 +0000, Al Viro wrote:
> On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
> > On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> > > 
> > > > Thanks, this fixes the issue for me (and also fixes another
> > > > reproducer I
> > > > found -- mounting a symlink on top of itself then trying to
> > > > umount it).
> > > > 
> > > > Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> > > > Tested-by: Aleksa Sarai <cyphar@cyphar.com>
> > > 
> > > Pushed into #fixes.
> > 
> > Thanks. One other thing I noticed is that umount applies to the
> > underlying symlink rather than the mountpoint on top. So, for
> > example
> > (using the same scripts I posted in the thread):
> > 
> >   # ln -s /tmp/foo link
> >   # ./mount_to_symlink /etc/passwd link
> >   # umount -l link # will attempt to unmount "/tmp/foo"
> > 
> > Is that intentional?
> 
> It's a mess, again in mountpoint_last().  FWIW, at some point I
> proposed
> to have nd_jump_link() to fail with -ELOOP if the target was a
> symlink;
> Linus asked for reasons deeper than my dislike of the semantics, I
> looked
> around and hadn't spotted anything.  And there hadn't been at the
> time,
> but when four months later umount_lookup_last() went in I failed to
> look
> for that source of potential problems in it ;-/
> 
> I've looked at that area again now.  Aside of usual cursing at
> do_last()
> horrors (yes, its control flow is a horror; yes, it needs serious
> massage;
> no, it's not a good idea to get sidetracked into that right now),
> there
> are several fun questions:
> 	* d_manage() and d_automount().  We almost certainly don't
> want those for autofs on the final component of pathname in umount,
> including the trailing symlinks.  But do we want those on usual
> access
> via /proc/*/fd/*?  I.e. suppose somebody does open() (O_PATH or not)
> in autofs; do we want ->d_manage()/->d_automount() called when
> resolving /proc/self/fd/<whatever>/foo/bar?  We do not; is that
> correct from autofs point of view?  I suspect that refusing to
> do ->d_automount() is correct, but I don't understand ->d_manage()
> purpose well enough to tell.

Yes, we don't want those on the final component of the path in
umount. The following of a symlink will give use a new path of
some sort so the rules would change to the usual ones for the
new path.

The semantics of following a symlink, be the source a proc entry
or not (I think) should always be the same. If the follow takes
us to an autofs file system (be it a trigger mount or an indirect
mount in an autofs file system) the behaviour should be that of
the autofs file system when we arrive there, from an auto-mount
POV.

The original intent of ->d_manage() was to prevent walks into an
under construction mount and that might not be as simple as mounting
a source on a mount point.

For example take the case of an automount indirect mount map entry
like this:

test    /some/path/one  server:/source/path1 \
        /some/path/two  server2:/source/path2 \
        /some/other/path server:/source/path3 \
        /some/other/path/three server:/source/path4

This entry has no mount at the root of the tree (so called root-less
multi-mount) but walks need to block when it's under construction as
the topology isn't known until the directory tree and any associated
mounts (usually trigger mounts) have been completed.

In this case it's needed to go to ref-walk mode and block until it's
done.

The other (perhaps not so obvious) use of ->d_manage() is to detect
expire to mount races. When an automount is expiring at the same time
a process (that would cause an automount) is traversing the path. The
base (I'll not say root, since the root of the expire might not be the
root of the tree) needs to block the walk until the expire is done.

These multi-mounts are meant to provide a "mount as you go" mechanism
so that only portions of the tree of mounts are mounted or expired at
any one time.

For example, the offsets in the above entry are /some/path/one,
/some/path/two, /some/other/path and /some/other/path/three.

On access to <autofs mount>/test automount is meant to mount trigger
mounts for offsets /some/path/one, /some/path/two and /some/other/path
and mount an offset trigger for /some/other/path/three into the mount
for /some/other/path when it's accessed and that might not happen
during the initial mount of the tree. The reverse being done on umount
in sub-trees of mounts when a nesting point like /some/other/path is
encountered.

But that's something of an aside because in all cases below the root
there will be an actual mount preventing walks into the tree under
nesting point mounts being constructed or expired.

Anyway, returning to the topic at hand, the answer to whether we want
->d_manage()/->d_automount() after a symlink has been followed is
yes, I think, because at that point we could be within a file system
that has automounts of some sort.

But perhaps I'm missing something about the description of the case
above ...

> 	* I really hope that the weird "trailing / forces automount
> even in cases when we normally wouldn't trigger it" (stat /mnt/foo
> vs. stat /mnt/foo/) is not meant to extend to umount.  I'd like
> Ian's confirmation, though.

I can't see any way that the trailing "/" can realte to umount.

It has always been meant to be used to trigger a mount on something
that would otherwise not be mounted and that's the only case I'm
aware of.

> 	* do we want ->d_manage() on following .. into overmounted
> directory?  Again, autofs question...

I think that amounts to asking "can the target of the ../ be in the
process of being constructed or expired at this time" and that's
probably yes. A root-less multi-mount would be one case where this
could happen (although it's not strictly an over-mounted directory).

> 
> 	The minimal fix to mountpoint_last() would be to have
> follow_mount() done in LAST_NORM case.  However, I'd like to
> understand
> (and hopefully regularize) the rules for
> follow_mount()/follow_managed().
> Additional scary question is nfsd iterplay with automount.  For nfs4
> exports it's potentially interesting...

I'm not sure about nfs (and other cross mounting file systems). The
automounting in file systems other than autofs always have a real
mount as the target (AFAIK) so there's an implied blocking that occurs
on crossing the mount point. That's always made the nfs automounting
case simpler to my thinking anyway.

The real problem with nfs automount trees is when the topology of
the exports tree changes while parts of it are in use. People that
have any idea of how nfs cross mounting (and mount dependencies in
general) work shouldn't do that but they do it and then wonder why
things go wrong ...

> 
> 	Ian, could you comment on the autofs questions above?
> I'd rather avoid doing changes in that area without your input -
> it's subtle and breakage in automount-related behaviour can be
> mysterious as hell.

Thanks for the heads up.

As always I can run tests on changes you want to do.
Fortunately that's generally worked out ok for us in the past.

Ian


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-01 14:44             ` Aleksa Sarai
  2020-01-01 23:40               ` Al Viro
@ 2020-01-04  5:52               ` Andy Lutomirski
  1 sibling, 0 replies; 53+ messages in thread
From: Andy Lutomirski @ 2020-01-04  5:52 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel


> On Jan 1, 2020, at 11:44 PM, Aleksa Sarai <cyphar@cyphar.com> wrote:
> 
> On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>> On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
>>> Note, BTW, that lookup_last() (aka walk_component()) does just
>>> that - we only hit step_into() on LAST_NORM.  The same goes
>>> for do_last().  mountpoint_last() not doing the same is _not_
>>> intentional - it's definitely a bug.
>>> 
>>> Consider your testcase; link points to . here.  So the only
>>> thing you could expect from trying to follow it would be
>>> the directory 'link' lives in.  And you don't have it
>>> when you reach the fscker via /proc/self/fd/3; what happens
>>> instead is nd->path set to ./link (by nd_jump_link()) *AND*
>>> step_into() called, pushing the same ./link onto stack.
>>> It violates all kinds of assumptions made by fs/namei.c -
>>> when pushing a symlink onto stack nd->path is expected to
>>> contain the base directory for resolving it.
>>> 
>>> I'm fairly sure that this is the cause of at least some
>>> of the insanity you've caught; there always could be
>>> something else, of course, but this hole needs to be
>>> closed in any case.
>> 
>> ... and with removal of now unused local variable, that's
>> 
>> mountpoint_last(): fix the treatment of LAST_BIND
>> 
>> step_into() should be attempted only in LAST_NORM
>> case, when we have the parent directory (in nd->path).
>> We get away with that for LAST_DOT and LOST_DOTDOT,
>> since those can't be symlinks, making step_init() and
>> equivalent of path_to_nameidata() - we do a bit of
>> useless work, but that's it.  For LAST_BIND (i.e.
>> the case when we'd just followed a procfs-style
>> symlink) we really can't go there - result might
>> be a symlink and we really can't attempt following
>> it.
>> 
>> lookup_last() and do_last() do handle that properly;
>> mountpoint_last() should do the same.
>> 
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> Thanks, this fixes the issue for me (and also fixes another reproducer I
> found -- mounting a symlink on top of itself then trying to umount it).
> 
> Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> Tested-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> As for the original topic of bind-mounting symlinks -- given this is a
> supported feature, would you be okay with me sending an updated
> O_EMPTYPATH series?

FWIW, I have an actual use case for mounting over a symlink: replacing /etc/resolv.conf.  My virtme tool is presented with somewhat arbitrary crud in /etc, where /etc/resolv.conf might be a plain file or a symlink, but, regardless, has inappropriate contents. If it’s a file, I can mount a new file over it. If it’s a symlink and the kernel properly supported it, I could also mount over it.

Yes, I could also use overlayfs.  Maybe I should regardless.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-03  1:49                   ` Al Viro
  2020-01-04  4:46                     ` Ian Kent
@ 2020-01-08  3:13                     ` Al Viro
  2020-01-08  3:54                       ` Linus Torvalds
  2020-01-10 23:19                     ` [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Al Viro
  2 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-08  3:13 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel, Ian Kent

On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote:

> It's a mess, again in mountpoint_last().  FWIW, at some point I proposed
> to have nd_jump_link() to fail with -ELOOP if the target was a symlink;
> Linus asked for reasons deeper than my dislike of the semantics, I looked
> around and hadn't spotted anything.  And there hadn't been at the time,
> but when four months later umount_lookup_last() went in I failed to look
> for that source of potential problems in it ;-/
> 
> I've looked at that area again now.  Aside of usual cursing at do_last()
> horrors (yes, its control flow is a horror; yes, it needs serious massage;
> no, it's not a good idea to get sidetracked into that right now), there
> are several fun questions:
> 	* d_manage() and d_automount().  We almost certainly don't
> want those for autofs on the final component of pathname in umount,
> including the trailing symlinks.  But do we want those on usual access
> via /proc/*/fd/*?  I.e. suppose somebody does open() (O_PATH or not)
> in autofs; do we want ->d_manage()/->d_automount() called when
> resolving /proc/self/fd/<whatever>/foo/bar?  We do not; is that
> correct from autofs point of view?  I suspect that refusing to
> do ->d_automount() is correct, but I don't understand ->d_manage()
> purpose well enough to tell.
> 	* I really hope that the weird "trailing / forces automount
> even in cases when we normally wouldn't trigger it" (stat /mnt/foo
> vs. stat /mnt/foo/) is not meant to extend to umount.  I'd like
> Ian's confirmation, though.
> 	* do we want ->d_manage() on following .. into overmounted
> directory?  Again, autofs question...

FWIW, I suspect that we want to do something along the following lines:

1) make build_open_flags() treat O_CREAT | O_EXCL as if there had been
O_NOFOLLOW in the mix.  Reason: if there is a trailing symlink, we want
to fail with EEXIST anyway.  Benefit: this fragment in do_last()
        error = follow_managed(&path, nd);
        if (unlikely(error < 0))
                return error;

        /*
         * create/update audit record if it already exists.
         */
        audit_inode(nd->name, path.dentry, 0);

        if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
                path_to_nameidata(&path, nd);
                return -EEXIST;
        }

        seq = 0;        /* out of RCU mode, so the value doesn't matter */
        inode = d_backing_inode(path.dentry);
finish_lookup:  
        error = step_into(nd, &path, 0, inode, seq);
        if (unlikely(error))
                return error;
can become
        error = follow_managed(&path, nd);
        if (unlikely(error < 0))
                return error;

        seq = 0;        /* out of RCU mode, so the value doesn't matter */
        inode = d_backing_inode(path.dentry);
finish_lookup:  
        error = step_into(nd, &path, 0, inode, seq);
        if (unlikely(error))
                return error;

        if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
		audit_inode(nd->name, nd->path.dentry, 0);
                return -EEXIST;
        }
Equivalent transformation, since the the only goto finish_lookup; is under
if (!(open_flag & O_CREAT)).  What it buys us is more regular structure
of follow_managed() callers.

2) make follow_managed() take &inode and &seq.  Look: follow_managed() never
returns 0 (we have
        if (ret == -EISDIR || !ret)
                ret = 1;
on the way to the only return in it) and the callers are
        err = follow_managed(path, nd);
        if (likely(err > 0))
                *inode = d_backing_inode(path->dentry);
        return err;
in lookup_fast(),
                err = follow_managed(&path, nd);
                if (unlikely(err < 0))
                        return err;

                seq = 0;        /* we are already out of RCU mode */
                inode = d_backing_inode(path.dentry);
in walk_component(),
                err = follow_managed(&path, nd);
                if (unlikely(err < 0))
                        return err;
                inode = d_backing_inode(path.dentry);
                seq = 0;
in handle_lookup_down() and (after the previous change)
        error = follow_managed(&path, nd);
        if (unlikely(error < 0))
                return error;

        seq = 0;        /* out of RCU mode, so the value doesn't matter */
        inode = d_backing_inode(path.dentry);
in do_last().  That's begging to fold those followups into follow_managed()
itself, doesn't it?  And having *seqp = 0; equivalent added in lookup_fast()
is not going to hurt the performance - in all callers it's an address of
local variable, right next to the one whose address is passed as inodep.
Which we'd just dirtied, and the cacheline is not going to have been shared
anyway.

Note that after that the arguments for follow_managed() become identical
to those for __follow_mount_rcu().  Which makes a lot of sense, since
the latter is RCU-mode counterpart of the former.

3) have the followup to failing __follow_mount_rcu() taken into it.
After (2) we have this in lookup_fast():

                *seqp = seq;
                status = d_revalidate(dentry, nd->flags);
                if (likely(status > 0)) {
                        /*
                         * Note: do negative dentry check after revalidation in
                         * case that drops it.
                         */
                        if (unlikely(negative))
                                return -ENOENT;
                        path->mnt = mnt;
                        path->dentry = dentry;
                        if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
                                return 1;
                }
                if (unlazy_child(nd, dentry, seq))
                        return -ECHILD;
                if (unlikely(status == -ECHILD))
                        /* we'd been told to redo it in non-rcu mode */
                        status = d_revalidate(dentry, nd->flags);
        } else {   
		...
	}
        if (unlikely(status <= 0)) {
                if (!status)
                        d_invalidate(dentry);
                dput(dentry);
                return status;
        }

        path->mnt = mnt;
        path->dentry = dentry;
        return follow_managed(path, nd, inode, seqp);

Suppose __follow_mount_rcu() returns false; what follows is
                if (unlazy_child(nd, dentry, seq))
                        return -ECHILD;
seq here is equal to *seqp here, dentry - the value of path->dentry at the
time of __follow_mount_rcu() call.
                if (unlikely(status == -ECHILD))
			....
not taken - we know that status must have been positive
        if (unlikely(status <= 0)) {
		...
	}
ditto
        path->mnt = mnt;
        path->dentry = dentry;
        return follow_managed(path, nd, inode, seqp);
we return *path to original and call follow_managed().  IOW, we could bloody
well do all of that in the __follow_mount_rcu() itself, having it return 1
when the original would've returned true and doing that "revert *path,
call unlazy_child() and fall back to follow_mount_rcu() in case of success"
in cases when the original would've returned false.  The caller turns into
                        /*
                         * Note: do negative dentry check after revalidation in
                         * case that drops it.
                         */
                        if (unlikely(negative))
                                return -ENOENT;
                        path->mnt = mnt;
                        path->dentry = dentry;
                        return __follow_mount_rcu(nd, path, inode, seqp);

4) fold __follow_mount_rcu() into follow_managed(), using the latter both in
RCU and non-RCU cases.

5) take the calls of follow_managed() out of lookup_fast() into its callers.
That would be
        err = lookup_fast(nd, &path, &inode, &seq);
        if (unlikely(err <= 0)) {
                if (err < 0)
                        return err;
                path.dentry = lookup_slow(&nd->last, nd->path.dentry,
                                          nd->flags);
                if (IS_ERR(path.dentry))
                        return PTR_ERR(path.dentry);

                path.mnt = nd->path.mnt;
                err = follow_managed(&path, nd, &inode, &seq);
                if (unlikely(err < 0))
                        return err;
        }
turning into
        err = lookup_fast(nd, &path, &inode, &seq);
        if (unlikely(err <= 0)) {
                if (err < 0)
                        return err;
                path.dentry = lookup_slow(&nd->last, nd->path.dentry,
                                          nd->flags);
                if (IS_ERR(path.dentry))
                        return PTR_ERR(path.dentry);

                path.mnt = nd->path.mnt;
	}
	err = follow_managed(&path, nd, &inode, &seq);
        if (unlikely(err < 0))
		return err;
in walk_component() and
                error = lookup_fast(nd, &path, &inode, &seq);
                if (likely(error > 0))
                        goto finish_lookup;
...
        error = follow_managed(&path, nd, &inode, &seq);
        if (unlikely(error < 0))
                return error;
finish_lookup:  
turning into
                error = lookup_fast(nd, &path, &inode, &seq);
                if (likely(error > 0))
                        goto finish_lookup;
...
finish_lookup:
        error = follow_managed(&path, nd, &inode, &seq);
        if (unlikely(error < 0))
                return error;
in do_last().

6) after that we have 3 callers of step_into(); the ones in
walk_component() and in do_last() would be immediately preceded
by the calls of follow_managed().  The last one is in
mountpoint_last().  That's
        if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) {
                dput(path.dentry);
                return -ENOENT;
        }
        path.mnt = nd->path.mnt;
        return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0);
And that's where we are missing the mountpoint traversal in symlink case -
sure, the caller does follow_mount(), but it doesn't catch the case when
we have a symlink overmounted - we run into step_into() before that.
Note that smp_load_acquire + d_flags_negative is what we would've done
in follow_managed(), as well as getting d_backing_inode().  So here
we also have an open-coded bastardized variant of follow_managed().
The difference is, we don't want to trigger ->d_automount() and ->d_manage()
in that one.

And at that point the only call of follow_managed() *NOT* followed by
step_into() is in handle_lookup_down().  What it is followed by is
        path_to_nameidata(&path, nd);
        nd->inode = inode;
        nd->seq = seq;
And that's a piece of step_into():
        if (likely(!d_is_symlink(path->dentry)) ||
           !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW)) {
                /* not a symlink or should not follow */
                path_to_nameidata(path, nd);
                nd->inode = inode;
                nd->seq = seq;
                return 0;
        }
is the normal path through that sucker.  What's more, we are guaranteed
that this will _not_ be a symlink (it's the starting point of pathwalk,
and path_init() would've told us to sod off were it not a directory).

So if we manage to convert the damn thing in mountpoint_last() into
follow_managed(), we could fold follow_managed() into step_into().
Which suggests the way to do that - not that step_into() takes an
argument containing ORed WALK_... constants.  So we can simply add
WALK_NOAUTOMOUNT and put a check for it into
                if (flags & DCACHE_MANAGE_TRANSIT) {
and
                if (flags & DCACHE_NEED_AUTOMOUNT) {
bodies, so that they would be ignored if that's passed to
follow_mount()/step_into() hybrid.

At that point we have one primitive for moving into child, handling
both the mountpoint traversals and keeping track of symlinks.  Moreover,
there's a fairly strong argument for using it in case of .. as well.
As it is, if the parent is overmounted, we cross into whatever is
mounted on top of it.  And we ignore ->d_manage/->d_automount on
the damn thing.  Which is not an issue for anything other than
autofs (nobody else has ->d_manage() and nfs/afs/cifs automount
points don't have children) and for autofs we *want* those called;
that's not something likely to be encountered, but it's an impossible
setup (autofs direct mount set on an ancestor of somebody's current
directory) and autofs does count upon not walking into something
being set up by the daemon.

I'll put together such series and see how well does it work; it would
fix the idiocies in user_path_mountpoint_at() and make the pathwalk
machinery easier to follow - the boilerplate around mountpoint
crossing and symlink handling is demonstrably easy to get wrong.
If that works and doesn't cause observable slowdown, I'll put it
into -next, either stepping around the changes done by openat2()
series, or rebasing it on top of that.

Another interesting question is whether we want O_PATH open
to trigger automounts.  The thing is, we do *NOT* trigger them
(or traverse mountpoints) at the starting point of lookups.
I believe it's a mistake (and mine, at that), but I doubt that
there's anything that can be done about it at that point.
It's a user-visible behaviour and I can easily imagine
a custom /init that ends up relying upon it ;-/  mkdir /root,
mount the final root there, chdir /root, mount --move . /,
remove everything on initramfs using absolute pathnames
and chroot to "." to finish...  Traversing mounts at the
beginning of pathwalk would break the hell out of that,
potentially with root filesystem contents wiped out... ;-/

I wish we could change that, but I'm afraid that's cast
in stone by now (and had been for 20 years or so).  As it is,
we have an unpleasant side effect - O_PATH open does *NOT*
trigger automounts.  So if you do that to e.g. referral point
and try to do ...at() syscalls with that as the origin, you'll
get an unpleasant surprise - automount won't trigger at all.

I think the easiest way to handle that is to have O_PATH
turn LOOKUP_AUTOMOUNT, same as the normal open() does.  That's
trivial to do, but that changes user-visible behaviour.  OTOH,
with the current behaviour nobody can rely upon automount not
getting triggered by somebody else just as they are entering
their open(dir, O_PATH), so I think that's not a problem.

Linus, do you have any objections to such O_PATH semantics
change?

PS: I think I see how to untangle the control flow horrors
in do_last() with this massage done, but I'm not going there
until this is sorted out - by previous experience touching
the damn thing can easily turn into several weeks of digging
through the nfs/gfs2/etc. guts trying to verify something,
with a couple of detours into fixing something in there
found in process... ;-/

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-08  3:13                     ` Al Viro
@ 2020-01-08  3:54                       ` Linus Torvalds
  2020-01-08 21:34                         ` Al Viro
  2020-01-10 21:07                         ` Aleksa Sarai
  0 siblings, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2020-01-08  3:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On Tue, Jan 7, 2020 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, I suspect that we want to do something along the following lines:
>
> 1) make build_open_flags() treat O_CREAT | O_EXCL as if there had been
> O_NOFOLLOW in the mix.

My reaction to that is "Whee, that's a big change".

But:

> Benefit: this fragment in do_last()

you're right.

That's the semantics we have right now (and I think it's the correct
safe semantics when I think about it). But when I first looked at your
email I without thinking more about it actually thought we followed
the symlink, and then did the O_CREAT | O_EXCL on the target (and
potentially succeeded).

So I agree - making O_CREAT | O_EXCL imply O_NOFOLLOW seems to be the
right thing to do, and not only should simplify our code, it's much
more descriptive of what the real semantics are.

Even if my first reaction was that it would act differently.

Slash-and-burn approach to your explanatory subsequent steps:

> 2) make follow_managed() take &inode and &seq.
> 3) have the followup to failing __follow_mount_rcu() taken into it.
> 4) fold __follow_mount_rcu() into follow_managed(), using the latter both in
> RCU and non-RCU cases.
> 5) take the calls of follow_managed() out of lookup_fast() into its callers.
> 6) after that we have 3 callers of step_into(); [..]
> So if we manage to convert the damn thing in mountpoint_last() into
> follow_managed(), we could fold follow_managed() into step_into().

I think that all makes sense. I didn't go to look at the source, but
from the email contents your steps look reasonable to me.

> Another interesting question is whether we want O_PATH open
> to trigger automounts.

It does sound like they shouldn't, but as you say:

>     The thing is, we do *NOT* trigger them
> (or traverse mountpoints) at the starting point of lookups.
> I believe it's a mistake (and mine, at that), but I doubt that
> there's anything that can be done about it at that point.
> It's a user-visible behaviour [..]

Hmm. I wonder how set in stone that is. We may have two decades of
history of not doing it at start point of lookups, but we do *not*
have two decades of history of O_PATH.

So what I think we agree would be sane behavior would be for O_PATH
opens to not trigger automounts (unless there's a slash at the end,
whatever), but _do_ add the mount-point traversal to the beginning of
lookups.

But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.

That way we maintain original behavior: if somebody overmounts your
cwd, you still see the pre-mount directory on lookups, because your
cwd is "under" the mount.

But if you open a file with O_PATH, and somebody does a mount
_afterwards_, the openat() will see that later mount and/or do the
automount.

Don't you think that would be the more sane/obvious semantics of how
O_PATH should work?

> I think the easiest way to handle that is to have O_PATH
> turn LOOKUP_AUTOMOUNT, same as the normal open() does.  That's
> trivial to do, but that changes user-visible behaviour.  OTOH,
> with the current behaviour nobody can rely upon automount not
> getting triggered by somebody else just as they are entering
> their open(dir, O_PATH), so I think that's not a problem.
>
> Linus, do you have any objections to such O_PATH semantics
> change?

See above: I think I'd prefer the O_PATH behavior the other way
around. That seems to be more of a consistent behavior of what
"O_PATH" means - it means "don't really open, we'll do it only when
you use it as a directory".

But I don't have any _strong_ opinions. If you have a good reason to
tell me that I'm being stupid, go ahead and do so and override my
stupidity.

             Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 1/1] mount: universally disallow mounting over symlinks
  2019-12-30  8:28     ` Aleksa Sarai
@ 2020-01-08  4:39       ` Andy Lutomirski
  0 siblings, 0 replies; 53+ messages in thread
From: Andy Lutomirski @ 2020-01-08  4:39 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Linus Torvalds, Al Viro, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

On Mon, Dec 30, 2019 at 12:29 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2019-12-29, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> If allowing bind-mounts over symlinks is allowed (which I don't have a
> problem with really), it just means we'll need a few more kernel pieces
> to get this hardening to work. But these features would be useful
> outside of the problems I'm dealing with (O_EMPTYPATH and some kind of
> pidfd-based interface to grab the equivalent of /proc/self/exe and a few
> other such magic-link targets).

As one data point, I would use this ability in virtme: this would
allow me to more reliably mount over /etc/resolve.conf even when it's
a symlink.

(Perhaps I should use overlayfs instead.  Hmm.)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-08  3:54                       ` Linus Torvalds
@ 2020-01-08 21:34                         ` Al Viro
  2020-01-10  0:08                           ` Linus Torvalds
  2020-01-10 21:07                         ` Aleksa Sarai
  1 sibling, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-08 21:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Aleksa Sarai, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On Tue, Jan 07, 2020 at 07:54:02PM -0800, Linus Torvalds wrote:

> > Another interesting question is whether we want O_PATH open
> > to trigger automounts.
> 
> It does sound like they shouldn't, but as you say:
> 
> >     The thing is, we do *NOT* trigger them
> > (or traverse mountpoints) at the starting point of lookups.
> > I believe it's a mistake (and mine, at that), but I doubt that
> > there's anything that can be done about it at that point.
> > It's a user-visible behaviour [..]
> 
> Hmm. I wonder how set in stone that is. We may have two decades of
> history of not doing it at start point of lookups, but we do *not*
> have two decades of history of O_PATH.
> 
> So what I think we agree would be sane behavior would be for O_PATH
> opens to not trigger automounts (unless there's a slash at the end,
> whatever), but _do_ add the mount-point traversal to the beginning of
> lookups.
> 
> But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.
> 
> That way we maintain original behavior: if somebody overmounts your
> cwd, you still see the pre-mount directory on lookups, because your
> cwd is "under" the mount.
> 
> But if you open a file with O_PATH, and somebody does a mount
> _afterwards_, the openat() will see that later mount and/or do the
> automount.
> 
> Don't you think that would be the more sane/obvious semantics of how
> O_PATH should work?

Maybe, but... note that we do not (and AFAICS never had) follow mounts
on /proc/self/cwd, /proc/self/fd/42, etc.  And there are very good
reasons for that.  First of all, if your stdin is from /tmp/foo,
you'd better get that file when you open /dev/stdin, even if somebody
has done mount --bind /tmp/bar /tmp/foo; another issue is with
the use of stat("/proc/self/fd/42", &buf) - it should be an equivalent
of fstat(42, &buf), even if somebody has overmounted that.  BTW, for
similar reason after
	link(".", "foo");
	fd = open("foo", O_PATH);	// return 42
we really should (and do) have resolution of /proc/self/fd/42 stop at
foo, not .   Reason: consistency of stat() behaviour...

The point is, we'd never followed mounts on /proc/self/cwd et.al.
I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me)
is that way.  Actually, scratch that - 2.0 behaves the same way
(mountpoint crossing is done in iget() there; is that Minix influence
or straight from the Lions' book?)

Hmm...  Looking through the history, we have

(for reference) v7: mount traversal in iget()
(forward) and namei() (back); due to the way it's done, forward
traversal happens
	* at starting point
	* after any component (. and .. included)
	* on results of forward traversal (due to a loop in iget()).
Back traversal (to covered on .. from root directory) is also
to unlimited depth.

0.01: no mount handling

0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()
(not by Lions' Book, then - v6 didn't do back traversals at all).
Forward traversal
	* after any component (. and .. included)
No traversal on starting point, no traversal on result of traversal.
OTOH, mount(2) refuses to mount on top of root, so the lack of the last
one is not an issue.

0.12: symlinks added; no mount traversal on starting point of those either.
We start at the process' root for absolute ones, even if it happens to be
overmounted, and we start from parent for relative ones.  The latter matters
only if we were in the beginning of the pathwalk, since anything else would've
traversed mounts back when we'd picked said parent.  Mount traversal takes
precedence over symlink traversal, but that's not an issue since mount follows
links on mountpoint.  It does not, at that point, reject fs image with
symlink for root, but that actually more or less works.

0.97.3: same, with addition of procfs symlinks.  No mount crossing on their
targets (for normal symlinks we don't do mount crossing in the beginning
and any component inside triggers mount crossing as usual; for procfs ones
there's no components inside)

Situation remains essentially unchanged until 2.1.42.  Next few kernels
are in flux, to put it politely - initial merge had been insane and it
took until 2.1.44 or so for the things to get more or less working.

At 2.1.44: forward traversal in fs/namei.c:lookup(), back traversal in
fs/namei.c:reserved_lookup().  Otherwise the same behaviour as pre-dcache
(wrt mount traversals, that is).

2.1.51pre1: forward traversal moved into real_lookup() and __d_lookup().
Forward traversal happens *ONLY* after normal components - not after . or ..

2.1.61: forward traversal moved into follow_mount(), behaviour reverted to
pre-dcache one.

Previous is from reading through the historical trees; my involvement started
circa 2.1.120-something.

2.3.50pre3: call of follow_mount() moved a bit, reverting to 2.1.51pre1
behaviour (nor traversal on . or ..) *again*.  Not sure whose idea had that
been - might've been mine, but unlike the other patch that went into fs/namei.c
in the same release, I hadn't been able to find anything related to that
one.  If your memories (or mail archives) are better...

2.3.99pre4-5: massive surgery in there.  Preparations to allowing mount on top
of mount; forward traversal adjusted accordingly, back traversal still isn't.

2.3.99pre7-1: more surgery, back traversals are also to unlimited depth now
and mount on top of mount has been allowed.

2.3.99pre9-4: mount --bind taught to mount non-directories on top of
non-directories.  At that point it does *NOT* follow trailing symlinks, so
mounting of symlinks and mounting on top of symlinks becomes possible.
Mount traversal still takes precedence over symlink traversal, symlink traversal
of mount traversal result still generally works, even though it's not something
I considered at the time.

v2.4.5.2: mount --bind started to follow symlinks.  So that source of mounting
of and on the symlinks was no more.

2.5.0.5: forward mount traversal is done after .. (in handle_dotdot()).
That brings back the pre-dcache behaviour for those suckers.  Still no
forward traversal after ., though.

At about the same time I'd been getting rid of the early-boot incestous
relationships with fs/namespace.c (initramfs work) and that was probably
the last time we could realistically switch to following mounts at starting
point; I considered trying to do that, but decided not to.  Pity, that...

2.6.5-rc2: normal mount now checks for corrupt fs with symlink for root.
Since it has always been following symlinks for mountpoint, the remaining
source of mounting of and on symlinks was gone; that lasted until
after O_PATH introduction.

2.6.39-rc1: mount traps support - instead of abusing ->follow_link()
for automounting, we have an explicit pair of methods that can be
called at the same places where we traverse mounts.  None too consistent -
we don't do that on .. results.  That was Dave Howells and Ian Kent.

2.6.39-rc1: O_PATH introduced and, later in the same series, allowed for
symlinks.  That has changed things - now procfs symlink targets could
be symlinks themselves.  Originally an attempt to follow those would
blow up with -ELOOP (there's simply no good way to follow such beast;
it's either "stop even if we are asked to follow" or "give an error").

3.6.0-rc1: nd_jump_link() introduction (hch) had unnoticed side effects -
we'd switched from "fail traversal with -ELOOP" to "stop there".  Mostly it
doesn't change behaviour, but it has opened a way to mount symlinks and
mount on top of symlinks.  Which generally worked.

circa 3.8--3.9: side effects had been noticed; my first reaction had been
"let's make nd_jump_link() return an error, then", but I hadn't been
able to find good reasons when challenged to do so.  Did an audit,
found no obvious problems, went "oh, well - whether it works by accident
or by design, it doesn't break anything".

3.12.0-rc1: lookups for umount(2) are different - we don't want
revalidate on the last component.  Which had been handled by
introduction of path_umountat()/umount_lookup_last(), parallel to
path_lookupat().  Which has gotten quite a few things wrong -
it *did* try to follow symlinks obtained by following procfs
ones (and blew up big way) and it didn't follow mounts on
overmounted trailing symlinks.  Nobody noticed for 6 years,
until folks actually tried to play with mount-on-symlink...
Patches were by Jeff Layton, neither he nor I have spotted the
problem back then.  And I should have, since it had been only
a few months since the audit for exactly that kind of problems...

AFAICS, there'd been no serious semantical changes since then.  What we
have right now:
	* no mount traversal on the starting point
	* mount traversal after any component other than "."
	* symlink traversal consists of possibly jumping to given
point plus following a given (possibly empty) series of components.
It can be both - e.g. symlink to "/foo/bar" is 'jump to root,
then traverse "foo", then traverse "bar"'.  Procfs "magic" symlinks
are not really magical - they behave as symlinks to "/" as far as
the pathwalk semantics is concerned.  The only differences is that
jump might be not to process' root.
	* mount traversal takes precedence over symlink traversal.
	* jump (if any) in symlink traversal is treated the same
as the starting point - it's not followed by mount traversal.
It's also not followed by symlink traversal, even if we are jumping
into a symlink.  Of course, in any position other than the end of
pathname that's an instant error.  That's also not different from
the starting point treatment - if ...at(2) is given a symlink for
starting point, it leaves it as-is if AT_EMPTY_PATH is given and
fails with -ENOTDIR otherwise.
	* umount(2) handles the final component differently -
for one thing, it does not do revalidate, for another - its
mount traversal (if any) does not include automount-related
parts.  And there we *do* want mount traversal at the final
point, for obvious reasons.

> > I think the easiest way to handle that is to have O_PATH
> > turn LOOKUP_AUTOMOUNT, same as the normal open() does.  That's
> > trivial to do, but that changes user-visible behaviour.  OTOH,
> > with the current behaviour nobody can rely upon automount not
> > getting triggered by somebody else just as they are entering
> > their open(dir, O_PATH), so I think that's not a problem.
> >
> > Linus, do you have any objections to such O_PATH semantics
> > change?
> 
> See above: I think I'd prefer the O_PATH behavior the other way
> around. That seems to be more of a consistent behavior of what
> "O_PATH" means - it means "don't really open, we'll do it only when
> you use it as a directory".

How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ)
vs. faccessat(42, "foo", MAY_READ)?  The latter would trigger automount,
the former would not...  Or would you extend that to "traverse mounts
upon following procfs links, if the file in question had been opened with
O_PATH"?  We could do that (give nd_jump_link() an extra argument telling
if we want mount traversal), but I'm not sure if the resulting semantics
is sane...

Note, BTW, that O_PATH users really can't rely upon automounts _not_
being triggered - all it takes is a lookup on bogus path with such prefix
by anybody who can reach that place...  We are not opening anything,
really, but we are not able to ignore automounts triggered by somebody
else.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-08 21:34                         ` Al Viro
@ 2020-01-10  0:08                           ` Linus Torvalds
  2020-01-10  4:15                             ` Al Viro
  0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2020-01-10  0:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On Wed, Jan 8, 2020 at 1:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> The point is, we'd never followed mounts on /proc/self/cwd et.al.
> I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me)
> is that way.

Hmm. If that's the case, maybe they should be marked implicitly as
O_PATH when opened?

> Actually, scratch that - 2.0 behaves the same way
> (mountpoint crossing is done in iget() there; is that Minix influence
> or straight from the Lions' book?)

I don't think I ever had access to Lions' - I've _seen_ a printout of
it later, and obviously maybe others did,

More likely it's from Maurice Bach: the Design of the Unix Operating
System. I'm pretty sure that's where a lot of the FS layer stuff came
from.  Certainly the bad old buffer head interfaces, and quite likely
the iget() stuff too.

> 0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()

Whee, you _really_ went back in time.

So I did too.

And looking at that code in iget(), I doubt it came from anywhere.
Christ. It's just looping over a fixed-size array, both when finding
the inode, and finding the superblock.

Cute, but unbelievably stupid. It was a more innocent time.

In other words, I think you can chalk it up to just me, because
blaming anybody else for that garbage would be very very unfair indeed
;)

> How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ)
> vs. faccessat(42, "foo", MAY_READ)?

I think that in a perfect world, the O_PATH'ness of '42' would be the
deciding factor. Wouldn't those be the best and most consistent
semantics?

And then 'cwd'/'root' always have the O_PATH behavior.

> The latter would trigger automount,
> the former would not...  Or would you extend that to "traverse mounts
> upon following procfs links, if the file in question had been opened with
> O_PATH"?

Exactly.

But you know what? I do not believe this is all that important, and I
doubt it will matter to anybody.

So what matters most is what makes the most sense to the VFS layer,
and what makes the most sense to _you_.

Because my reaction from this thread is that not only have you thought
about this issue and followed the history a whole lot more than I
would ever have done, it's also that I trust you to DTRT.

I think it would be good to have some self-consistency, but at the
same time clearly we already don't really, and our behavior here has
subtly changed over the years (and not so subtly - if you go back
sufficiently far, /proc behavior wrt file descriptors has had both
"dup()" behavior and "make a new file descriptor with the same inode"
behavior, afaik).

               Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-10  0:08                           ` Linus Torvalds
@ 2020-01-10  4:15                             ` Al Viro
  2020-01-10  5:03                               ` Linus Torvalds
  2020-01-10  6:20                               ` Ian Kent
  0 siblings, 2 replies; 53+ messages in thread
From: Al Viro @ 2020-01-10  4:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Aleksa Sarai, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On Thu, Jan 09, 2020 at 04:08:16PM -0800, Linus Torvalds wrote:
> On Wed, Jan 8, 2020 at 1:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > The point is, we'd never followed mounts on /proc/self/cwd et.al.
> > I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me)
> > is that way.
> 
> Hmm. If that's the case, maybe they should be marked implicitly as
> O_PATH when opened?

I thought you wanted O_PATH as starting point to have mounts traversed?
Confused...

> > Actually, scratch that - 2.0 behaves the same way
> > (mountpoint crossing is done in iget() there; is that Minix influence
> > or straight from the Lions' book?)
> 
> I don't think I ever had access to Lions' - I've _seen_ a printout of
> it later, and obviously maybe others did,
> 
> More likely it's from Maurice Bach: the Design of the Unix Operating
> System. I'm pretty sure that's where a lot of the FS layer stuff came
> from.  Certainly the bad old buffer head interfaces, and quite likely
> the iget() stuff too.
>
> > 0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()
> 
> Whee, you _really_ went back in time.
> 
> So I did too.
> 
> And looking at that code in iget(), I doubt it came from anywhere.
> Christ. It's just looping over a fixed-size array, both when finding
> the inode, and finding the superblock.
> 
> Cute, but unbelievably stupid. It was a more innocent time.
> 
> In other words, I think you can chalk it up to just me, because
> blaming anybody else for that garbage would be very very unfair indeed
> ;)

See https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/iget.c
Exactly the same algorithm, complete with linear searches over those
fixed-sized array.

<grabs Bach> Right, he simply transcribes v7 iget().

So I suspect that you are right - your variant of iget was pretty much
one-to-one implementation of Bach's description of v7 iget.

Your namei wasn't - Bach has 'if the entry points to root and you are
in the root and name is "..", find mount table entry (by device number),
drop your directory inode, grab the inode of mountpount and restart
the search for ".." in there', which gives back traversals to arbitrary
depth.  And v7 namei() (as Bach mentions) uses iget() for starting point
as well as for each component.  You kept pointers instead, which is where
the other difference has come from (no mount traversal at the starting
point)...

Actually, I've misread your code in 0.10 - it does unlimited forward
traversals; it's back traversals that go only one level.  The forward
ones got limited to one level in 0.95, but then mount-over-root had
been banned all along.  I'd read the pre-dcache variant of iget(),
seen it go pretty much all the way back to beginning and hadn't
sorted out the 0.12 -> 0.95 transition...

> > How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ)
> > vs. faccessat(42, "foo", MAY_READ)?
> 
> I think that in a perfect world, the O_PATH'ness of '42' would be the
> deciding factor. Wouldn't those be the best and most consistent
> semantics?
> 
> And then 'cwd'/'root' always have the O_PATH behavior.

See above - unless I'm misparsing you, you wanted mount traversals in the
starting point if it's ...at() with O_PATH fd.  With O_PATH open() not
doing them.

For cwd and root the situation is opposite - we do NOT traverse mounts
for those.  And that's really too late to change.

> > The latter would trigger automount,
> > the former would not...  Or would you extend that to "traverse mounts
> > upon following procfs links, if the file in question had been opened with
> > O_PATH"?
> 
> Exactly.
> 
> But you know what? I do not believe this is all that important, and I
> doubt it will matter to anybody.

FWIW, digging through the automount-related parts of that stuff has
caught several fun issues.  One (and I'm rather embarrassed by it)
should've been caught back in commit 8aef18845266 (VFS: Fix vfsmount
overput on simultaneous automount).  To quote the commit message:
    The problem is that lock_mount() drops the caller's reference to the
    mountpoint's vfsmount in the case where it finds something already mounted on
    the mountpoint as it transits to the mounted filesystem and replaces path->mnt
    with the new mountpoint vfsmount.
    
    During a pathwalk, however, we don't take a reference on the vfsmount if it is
    the same as the one in the nameidata struct, but do_add_mount() doesn't know
    this.
At which point I should've gone "what the fuck?" - lock_mount() does, indeed,
drop path->mnt in this situation and replaces it with the whatever's come to
cover it.  For mount(2) that's the right thing to do - we _want_ to mount
on top of whatever we have at the mountpoint.  For automounts we very much
don't want that - it's either "mount right on top of the automount trigger"
or discard whatever we'd been about to mount and walk into whatever's got
mounted there (presumably the same thing triggered by another process).
We kinda-sorta get that effect, but in a very convoluted way: do_add_mount()
will refuse to mount something on top of itself -
        /* Refuse the same filesystem on the same mount point */
        err = -EBUSY;
        if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
            path->mnt->mnt_root == path->dentry)
                goto unlock;
which will end up with -EBUSY returned (and recognized by follow_automount()).

First of all, that's unreliable.  If somebody not only has triggered that
automount, but managed to _mount_ something else on top (for example,
has triggered it by lookup of mountpoint-to-be in mount(2)), we'll end
up not triggering that check.  In which case we'll get something like
nfs referral point under nfs automounted there under tmpfs from explicit
overmount under same nfs mount we'd automounted there - identical to what's
been buried under tmpfs.  It's hard to hit, but not impossibly so.

What's more, the whole solution is a kludge - the root of problem is
that lock_mount() is the wrong thing to do in case of finish_automount().
We don't want to go into whatever's overmounting us there, both for
the reasons above *and* because it's a PITA for the caller.  So the
right solution is
	* lift lock_mount() call from do_add_mount() into its callers
(all 2 of them); while we are at it, lift unlock_mount() as well
(makes for simpler failure exits in do_add_mount()).
	* replace the call of lock_mount() in finish_automount()
with variant that doesn't do "unlock, walk deeper and retry locking",
returning ERR_PTR(-EBUSY) in such case.
	* get rid of the kludge introduced in that commit.  Better
yet, don't bother with traversing into the covering mount in case
of success - let the caller of follow_automount() do that.  Which
eliminates the need to pass need_mntput to the sucker and suggests
an even better solution - have this analogue of lock_mount()
return NULL instead of ERR_PTR(-EBUSY) and treat it in finish_automount()
as "OK, discard what we wanted to mount and return 0".  That gets
rid of the entire
        err = finish_automount(mnt, path);
        switch (err) {
        case -EBUSY:
                /* Someone else made a mount here whilst we were busy */
                return 0;
        case 0:
                path_put(path);
                path->mnt = mnt;
                path->dentry = dget(mnt->mnt_root);
                return 0;
        default:
                return err;
        }
chunk in follow_automount() - it would just be
	return finish_automount(mnt, path);

Another thing (in the same area) is not a bug per se, but...
after the call of ->d_automount() we have this:
        if (IS_ERR(mnt)) {
                /*
                 * The filesystem is allowed to return -EISDIR here to indicate
                 * it doesn't want to automount.  For instance, autofs would do
                 * this so that its userspace daemon can mount on this dentry.
                 *
                 * However, we can only permit this if it's a terminal point in
                 * the path being looked up; if it wasn't then the remainder of
                 * the path is inaccessible and we should say so.
                 */
                if (PTR_ERR(mnt) == -EISDIR && (nd->flags & LOOKUP_PARENT))
                        return -EREMOTE;
                return PTR_ERR(mnt);
	}
Except that not a single instance of ->d_automount() has ever returned
-EISDIR.  Certainly not autofs one, despite the what the comment says.
That chunk has come from dhowells, back when the whole mount trap series
had been merged.  After talking that thing over (fun: trying to figure
out what had been intended nearly 9 years ago, when people involved are
in UK, US east coast and AU west coast respectively.  The only way it
could suck more would've been if I were on the west coast - then all
timezone deltas would be 8-hour ones)...  looks like it's a rudiment
of plans that got superseded during the series development, nobody
quite remembers exact details.  Conclusion: it's not even dead, it's
stillborn; bury it.

Unfortunately, there are other interesting questions related to
autofs-specific bits (->d_manage()) and the timezone-related fun
is, of course, still there.  I hope to sort that out today or
tomorrow, at least enough to do a reasonable set of backportable
fixes to put in front of follow_managed()/step_into() queue.
Oh, well...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-10  4:15                             ` Al Viro
@ 2020-01-10  5:03                               ` Linus Torvalds
  2020-01-10  6:20                               ` Ian Kent
  1 sibling, 0 replies; 53+ messages in thread
From: Linus Torvalds @ 2020-01-10  5:03 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On Thu, Jan 9, 2020 at 8:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Hmm. If that's the case, maybe they should be marked implicitly as
> > O_PATH when opened?
>
> I thought you wanted O_PATH as starting point to have mounts traversed?
> Confused...

No, I'm confused.  I meant "non-O_PATH", just got the rules reversed in my mind.

So cwd/root would always act as it non-O_PATH, and only using an
actual fd would look at the O_PATH flag, and if it was set would walk
the mountpoints.

> <grabs Bach> Right, he simply transcribes v7 iget().
>
> So I suspect that you are right - your variant of iget was pretty much
> one-to-one implementation of Bach's description of v7 iget.

Ok, that makes sense. My copy of Bach literally had the system call
list "marked off" when I implemented them back when.

I may still have that paperbook copy somewhere. I don't _think_ I'd
have thrown it out, it has sentimental value.

> > I think that in a perfect world, the O_PATH'ness of '42' would be the
> > deciding factor. Wouldn't those be the best and most consistent
> > semantics?
> >
> > And then 'cwd'/'root' always have the O_PATH behavior.
>
> See above - unless I'm misparsing you, you wanted mount traversals in the
> starting point if it's ...at() with O_PATH fd.

.. and see above, it was just my confusion about the sense of O_PATH.

> For cwd and root the situation is opposite - we do NOT traverse mounts
> for those.  And that's really too late to change.

Oh, absolutely.

[ snip some more about your automount digging. Looks about right, but
I'm not going to make a peep after getting O_PATH reversed ;) ]

            Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-10  4:15                             ` Al Viro
  2020-01-10  5:03                               ` Linus Torvalds
@ 2020-01-10  6:20                               ` Ian Kent
  2020-01-12 21:33                                 ` Al Viro
  1 sibling, 1 reply; 53+ messages in thread
From: Ian Kent @ 2020-01-10  6:20 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds
  Cc: Aleksa Sarai, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

On Fri, 2020-01-10 at 04:15 +0000, Al Viro wrote:
> On Thu, Jan 09, 2020 at 04:08:16PM -0800, Linus Torvalds wrote:
> > On Wed, Jan 8, 2020 at 1:34 PM Al Viro <viro@zeniv.linux.org.uk>
> > wrote:
> > > The point is, we'd never followed mounts on /proc/self/cwd et.al.
> > > I hadn't checked 2.0, but 2.1.100 ('97, before any changes from
> > > me)
> > > is that way.
> > 
> > Hmm. If that's the case, maybe they should be marked implicitly as
> > O_PATH when opened?
> 
> I thought you wanted O_PATH as starting point to have mounts
> traversed?
> Confused...
> 
> > > Actually, scratch that - 2.0 behaves the same way
> > > (mountpoint crossing is done in iget() there; is that Minix
> > > influence
> > > or straight from the Lions' book?)
> > 
> > I don't think I ever had access to Lions' - I've _seen_ a printout
> > of
> > it later, and obviously maybe others did,
> > 
> > More likely it's from Maurice Bach: the Design of the Unix
> > Operating
> > System. I'm pretty sure that's where a lot of the FS layer stuff
> > came
> > from.  Certainly the bad old buffer head interfaces, and quite
> > likely
> > the iget() stuff too.
> > 
> > > 0.10: forward traversal in iget(), back traversal in
> > > fs/namei.c:find_entry()
> > 
> > Whee, you _really_ went back in time.
> > 
> > So I did too.
> > 
> > And looking at that code in iget(), I doubt it came from anywhere.
> > Christ. It's just looping over a fixed-size array, both when
> > finding
> > the inode, and finding the superblock.
> > 
> > Cute, but unbelievably stupid. It was a more innocent time.
> > 
> > In other words, I think you can chalk it up to just me, because
> > blaming anybody else for that garbage would be very very unfair
> > indeed
> > ;)
> 
> See 
> https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/iget.c
> Exactly the same algorithm, complete with linear searches over those
> fixed-sized array.
> 
> <grabs Bach> Right, he simply transcribes v7 iget().
> 
> So I suspect that you are right - your variant of iget was pretty
> much
> one-to-one implementation of Bach's description of v7 iget.
> 
> Your namei wasn't - Bach has 'if the entry points to root and you are
> in the root and name is "..", find mount table entry (by device
> number),
> drop your directory inode, grab the inode of mountpount and restart
> the search for ".." in there', which gives back traversals to
> arbitrary
> depth.  And v7 namei() (as Bach mentions) uses iget() for starting
> point
> as well as for each component.  You kept pointers instead, which is
> where
> the other difference has come from (no mount traversal at the
> starting
> point)...
> 
> Actually, I've misread your code in 0.10 - it does unlimited forward
> traversals; it's back traversals that go only one level.  The forward
> ones got limited to one level in 0.95, but then mount-over-root had
> been banned all along.  I'd read the pre-dcache variant of iget(),
> seen it go pretty much all the way back to beginning and hadn't
> sorted out the 0.12 -> 0.95 transition...
> 
> > > How would your proposal deal with access("/proc/self/fd/42/foo",
> > > MAY_READ)
> > > vs. faccessat(42, "foo", MAY_READ)?
> > 
> > I think that in a perfect world, the O_PATH'ness of '42' would be
> > the
> > deciding factor. Wouldn't those be the best and most consistent
> > semantics?
> > 
> > And then 'cwd'/'root' always have the O_PATH behavior.
> 
> See above - unless I'm misparsing you, you wanted mount traversals in
> the
> starting point if it's ...at() with O_PATH fd.  With O_PATH open()
> not
> doing them.
> 
> For cwd and root the situation is opposite - we do NOT traverse
> mounts
> for those.  And that's really too late to change.
> 
> > > The latter would trigger automount,
> > > the former would not...  Or would you extend that to "traverse
> > > mounts
> > > upon following procfs links, if the file in question had been
> > > opened with
> > > O_PATH"?
> > 
> > Exactly.
> > 
> > But you know what? I do not believe this is all that important, and
> > I
> > doubt it will matter to anybody.
> 
> FWIW, digging through the automount-related parts of that stuff has
> caught several fun issues.  One (and I'm rather embarrassed by it)
> should've been caught back in commit 8aef18845266 (VFS: Fix vfsmount
> overput on simultaneous automount).  To quote the commit message:
>     The problem is that lock_mount() drops the caller's reference to
> the
>     mountpoint's vfsmount in the case where it finds something
> already mounted on
>     the mountpoint as it transits to the mounted filesystem and
> replaces path->mnt
>     with the new mountpoint vfsmount.
>     
>     During a pathwalk, however, we don't take a reference on the
> vfsmount if it is
>     the same as the one in the nameidata struct, but do_add_mount()
> doesn't know
>     this.
> At which point I should've gone "what the fuck?" - lock_mount() does,
> indeed,
> drop path->mnt in this situation and replaces it with the whatever's
> come to
> cover it.  For mount(2) that's the right thing to do - we _want_ to
> mount
> on top of whatever we have at the mountpoint.  For automounts we very
> much
> don't want that - it's either "mount right on top of the automount
> trigger"
> or discard whatever we'd been about to mount and walk into whatever's
> got
> mounted there (presumably the same thing triggered by another
> process).
> We kinda-sorta get that effect, but in a very convoluted way:
> do_add_mount()
> will refuse to mount something on top of itself -
>         /* Refuse the same filesystem on the same mount point */
>         err = -EBUSY;
>         if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
>             path->mnt->mnt_root == path->dentry)
>                 goto unlock;
> which will end up with -EBUSY returned (and recognized by
> follow_automount()).
> 
> First of all, that's unreliable.  If somebody not only has triggered
> that
> automount, but managed to _mount_ something else on top (for example,
> has triggered it by lookup of mountpoint-to-be in mount(2)), we'll
> end
> up not triggering that check.  In which case we'll get something like
> nfs referral point under nfs automounted there under tmpfs from
> explicit
> overmount under same nfs mount we'd automounted there - identical to
> what's
> been buried under tmpfs.  It's hard to hit, but not impossibly so.
> 
> What's more, the whole solution is a kludge - the root of problem is
> that lock_mount() is the wrong thing to do in case of
> finish_automount().
> We don't want to go into whatever's overmounting us there, both for
> the reasons above *and* because it's a PITA for the caller.  So the
> right solution is
> 	* lift lock_mount() call from do_add_mount() into its callers
> (all 2 of them); while we are at it, lift unlock_mount() as well
> (makes for simpler failure exits in do_add_mount()).
> 	* replace the call of lock_mount() in finish_automount()
> with variant that doesn't do "unlock, walk deeper and retry locking",
> returning ERR_PTR(-EBUSY) in such case.
> 	* get rid of the kludge introduced in that commit.  Better
> yet, don't bother with traversing into the covering mount in case
> of success - let the caller of follow_automount() do that.  Which
> eliminates the need to pass need_mntput to the sucker and suggests
> an even better solution - have this analogue of lock_mount()
> return NULL instead of ERR_PTR(-EBUSY) and treat it in
> finish_automount()
> as "OK, discard what we wanted to mount and return 0".  That gets
> rid of the entire
>         err = finish_automount(mnt, path);
>         switch (err) {
>         case -EBUSY:
>                 /* Someone else made a mount here whilst we were busy
> */
>                 return 0;
>         case 0:
>                 path_put(path);
>                 path->mnt = mnt;
>                 path->dentry = dget(mnt->mnt_root);
>                 return 0;
>         default:
>                 return err;
>         }
> chunk in follow_automount() - it would just be
> 	return finish_automount(mnt, path);
> 
> Another thing (in the same area) is not a bug per se, but...
> after the call of ->d_automount() we have this:
>         if (IS_ERR(mnt)) {
>                 /*
>                  * The filesystem is allowed to return -EISDIR here
> to indicate
>                  * it doesn't want to automount.  For instance,
> autofs would do
>                  * this so that its userspace daemon can mount on
> this dentry.
>                  *
>                  * However, we can only permit this if it's a
> terminal point in
>                  * the path being looked up; if it wasn't then the
> remainder of
>                  * the path is inaccessible and we should say so.
>                  */
>                 if (PTR_ERR(mnt) == -EISDIR && (nd->flags &
> LOOKUP_PARENT))
>                         return -EREMOTE;
>                 return PTR_ERR(mnt);
> 	}
> Except that not a single instance of ->d_automount() has ever
> returned
> -EISDIR.  Certainly not autofs one, despite the what the comment
> says.
> That chunk has come from dhowells, back when the whole mount trap
> series
> had been merged.  After talking that thing over (fun: trying to
> figure
> out what had been intended nearly 9 years ago, when people involved
> are
> in UK, US east coast and AU west coast respectively.  The only way it
> could suck more would've been if I were on the west coast - then all
> timezone deltas would be 8-hour ones)...  looks like it's a rudiment
> of plans that got superseded during the series development, nobody
> quite remembers exact details.  Conclusion: it's not even dead, it's
> stillborn; bury it.

Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time
we get there it's not relevant any more, so that check looks
redundant. I'm not aware of any other fs automount implementation
that needs that EISDIR pass-thru function.

I didn't notice it at the time of the merge, sorry about that.

While we're at it that:
   if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
       return -EREMOTE;

at the top of follow_automount() isn't going to be be relevant
for autofs because ->d_automount() really must always be defined
for it.

But, at the time of the merge, I didn't object to it because
there were (are) other file systems that use the VFS automount
function which may accidentally not define the method.

> 
> Unfortunately, there are other interesting questions related to
> autofs-specific bits (->d_manage()) and the timezone-related fun
> is, of course, still there.  I hope to sort that out today or
> tomorrow, at least enough to do a reasonable set of backportable
> fixes to put in front of follow_managed()/step_into() queue.
> Oh, well...

Yeah, I know it slows you down but I kink-off like having a chance
to look at what's going and think about your questions before trying
to answer them, rather than replying prematurely, as I usually do ...

It's been a bit of a busy day so far but I'm getting to look into
the questions you've asked.

Ian


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-08  3:54                       ` Linus Torvalds
  2020-01-08 21:34                         ` Al Viro
@ 2020-01-10 21:07                         ` Aleksa Sarai
  2020-01-14  4:57                           ` Al Viro
  1 sibling, 1 reply; 53+ messages in thread
From: Aleksa Sarai @ 2020-01-10 21:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

[-- Attachment #1: Type: text/plain, Size: 2739 bytes --]

On 2020-01-07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, Jan 7, 2020 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > Another interesting question is whether we want O_PATH open
> > to trigger automounts.
> 
> It does sound like they shouldn't, but as you say:
> 
> >     The thing is, we do *NOT* trigger them
> > (or traverse mountpoints) at the starting point of lookups.
> > I believe it's a mistake (and mine, at that), but I doubt that
> > there's anything that can be done about it at that point.
> > It's a user-visible behaviour [..]
> 
> Hmm. I wonder how set in stone that is. We may have two decades of
> history of not doing it at start point of lookups, but we do *not*
> have two decades of history of O_PATH.
> 
> So what I think we agree would be sane behavior would be for O_PATH
> opens to not trigger automounts (unless there's a slash at the end,
> whatever), but _do_ add the mount-point traversal to the beginning of
> lookups.
> 
> But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.
> 
> That way we maintain original behavior: if somebody overmounts your
> cwd, you still see the pre-mount directory on lookups, because your
> cwd is "under" the mount.
> 
> But if you open a file with O_PATH, and somebody does a mount
> _afterwards_, the openat() will see that later mount and/or do the
> automount.
> 
> Don't you think that would be the more sane/obvious semantics of how
> O_PATH should work?

If I'm understanding this proposal correctly, this would be a problem
for the libpathrs use-case -- if this is done then there's no way to
avoid a TOCTOU with someone mounting and the userspace program checking
whether something is a mountpoint (unless you have Linux >5.6 and
RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:

  1. Open the candidate directory.
  2. umount2(MNT_EXPIRE) the fd.
    * -EINVAL means it wasn't a mountpoint when we got the fd, and the
	  fd is a stable handle to the underlying directory.
	* -EAGAIN or -EBUSY means that it was a mountpoint or became a
	  mountpoint after the fd was opened (we don't care about that, but
	  fail-safe is better here).
  3. Use the fd from (1) for all operations.

Don't get me wrong, I want to fix this issue *properly* by adding some
new kernel features that allow us to avoid worrying about
mounts-over-magiclinks -- but on old kernels (which libpathrs cares
about) I would be worried about changes like this being backported
resulting in it being not possible to implement the hardening I
mentioned up-thread.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-03  1:49                   ` Al Viro
  2020-01-04  4:46                     ` Ian Kent
  2020-01-08  3:13                     ` Al Viro
@ 2020-01-10 23:19                     ` Al Viro
  2020-01-13  1:48                       ` Ian Kent
  2 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-10 23:19 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel, Ian Kent

On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote:
> On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
> > On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> > > 
> > > > Thanks, this fixes the issue for me (and also fixes another reproducer I
> > > > found -- mounting a symlink on top of itself then trying to umount it).
> > > > 
> > > > Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> > > > Tested-by: Aleksa Sarai <cyphar@cyphar.com>
> > > 
> > > Pushed into #fixes.
> > 
> > Thanks. One other thing I noticed is that umount applies to the
> > underlying symlink rather than the mountpoint on top. So, for example
> > (using the same scripts I posted in the thread):
> > 
> >   # ln -s /tmp/foo link
> >   # ./mount_to_symlink /etc/passwd link
> >   # umount -l link # will attempt to unmount "/tmp/foo"
> > 
> > Is that intentional?
> 
> It's a mess, again in mountpoint_last().  FWIW, at some point I proposed
> to have nd_jump_link() to fail with -ELOOP if the target was a symlink;
> Linus asked for reasons deeper than my dislike of the semantics, I looked
> around and hadn't spotted anything.  And there hadn't been at the time,
> but when four months later umount_lookup_last() went in I failed to look
> for that source of potential problems in it ;-/

FWIW, since Ian appears to agree that we want ->d_manage() on the mount
crossing at the end of umount(2) lookup, here's a much simpler solution -
kill mountpoint_last() and switch to using lookup_last().  As a side
benefit, LOOKUP_NO_REVAL also goes away.  It's possible to trim the
things even more (path_mountpoint() is very similar to path_lookupat()
at that point, and it's not hard to make the differences conditional on
something like LOOKUP_UMOUNT); I would rather do that part in the
cleanups series - the one below is easier to backport.

Aleksa, Ian - could you see if the patch below works for you?

commit e56b43b971a7c08762fceab330a52b7245041dbc
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Fri Jan 10 17:17:19 2020 -0500

    reimplement path_mountpoint() with less magic
    
    ... and get rid of a bunch of bugs in it.  Background:
    the reason for path_mountpoint() is that umount() really doesn't
    want attempts to revalidate the root of what it's trying to umount.
    The thing we want to avoid actually happen from complete_walk();
    solution was to do something parallel to normal path_lookupat()
    and it both went overboard and got the boilerplate subtly
    (and not so subtly) wrong.
    
    A better solution is to do pretty much what the normal path_lookupat()
    does, but instead of complete_walk() do unlazy_walk().  All it takes
    to avoid that ->d_weak_revalidate() call...  mountpoint_last() goes
    away, along with everything it got wrong, and so does the magic around
    LOOKUP_NO_REVAL.
    
    Another source of bugs is that when we traverse mounts at the final
    location (and we need to do that - umount . expects to get whatever's
    overmounting ., if any, out of the lookup) we really ought to take
    care of ->d_manage() - as it is, manual umount of autofs automount
    in progress can lead to unpleasant surprises for the daemon.  Easily
    solved by using handle_lookup_down() instead of follow_mount().
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/namei.c b/fs/namei.c
index d6c91d1e88cb..1793661c3342 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1649,17 +1649,15 @@ static struct dentry *__lookup_slow(const struct qstr *name,
 	if (IS_ERR(dentry))
 		return dentry;
 	if (unlikely(!d_in_lookup(dentry))) {
-		if (!(flags & LOOKUP_NO_REVAL)) {
-			int error = d_revalidate(dentry, flags);
-			if (unlikely(error <= 0)) {
-				if (!error) {
-					d_invalidate(dentry);
-					dput(dentry);
-					goto again;
-				}
+		int error = d_revalidate(dentry, flags);
+		if (unlikely(error <= 0)) {
+			if (!error) {
+				d_invalidate(dentry);
 				dput(dentry);
-				dentry = ERR_PTR(error);
+				goto again;
 			}
+			dput(dentry);
+			dentry = ERR_PTR(error);
 		}
 	} else {
 		old = inode->i_op->lookup(inode, dentry, flags);
@@ -2618,72 +2616,6 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
 EXPORT_SYMBOL(user_path_at_empty);
 
 /**
- * mountpoint_last - look up last component for umount
- * @nd:   pathwalk nameidata - currently pointing at parent directory of "last"
- *
- * This is a special lookup_last function just for umount. In this case, we
- * need to resolve the path without doing any revalidation.
- *
- * The nameidata should be the result of doing a LOOKUP_PARENT pathwalk. Since
- * mountpoints are always pinned in the dcache, their ancestors are too. Thus,
- * in almost all cases, this lookup will be served out of the dcache. The only
- * cases where it won't are if nd->last refers to a symlink or the path is
- * bogus and it doesn't exist.
- *
- * Returns:
- * -error: if there was an error during lookup. This includes -ENOENT if the
- *         lookup found a negative dentry.
- *
- * 0:      if we successfully resolved nd->last and found it to not to be a
- *         symlink that needs to be followed.
- *
- * 1:      if we successfully resolved nd->last and found it to be a symlink
- *         that needs to be followed.
- */
-static int
-mountpoint_last(struct nameidata *nd)
-{
-	int error = 0;
-	struct dentry *dir = nd->path.dentry;
-	struct path path;
-
-	/* If we're in rcuwalk, drop out of it to handle last component */
-	if (nd->flags & LOOKUP_RCU) {
-		if (unlazy_walk(nd))
-			return -ECHILD;
-	}
-
-	nd->flags &= ~LOOKUP_PARENT;
-
-	if (unlikely(nd->last_type != LAST_NORM)) {
-		error = handle_dots(nd, nd->last_type);
-		if (error)
-			return error;
-		path.dentry = dget(nd->path.dentry);
-	} else {
-		path.dentry = d_lookup(dir, &nd->last);
-		if (!path.dentry) {
-			/*
-			 * No cached dentry. Mounted dentries are pinned in the
-			 * cache, so that means that this dentry is probably
-			 * a symlink or the path doesn't actually point
-			 * to a mounted dentry.
-			 */
-			path.dentry = lookup_slow(&nd->last, dir,
-					     nd->flags | LOOKUP_NO_REVAL);
-			if (IS_ERR(path.dentry))
-				return PTR_ERR(path.dentry);
-		}
-	}
-	if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) {
-		dput(path.dentry);
-		return -ENOENT;
-	}
-	path.mnt = nd->path.mnt;
-	return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0);
-}
-
-/**
  * path_mountpoint - look up a path to be umounted
  * @nd:		lookup context
  * @flags:	lookup flags
@@ -2699,14 +2631,17 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path)
 	int err;
 
 	while (!(err = link_path_walk(s, nd)) &&
-		(err = mountpoint_last(nd)) > 0) {
+		(err = lookup_last(nd)) > 0) {
 		s = trailing_symlink(nd);
 	}
+	if (!err)
+		err = unlazy_walk(nd);
+	if (!err)
+		err = handle_lookup_down(nd);
 	if (!err) {
 		*path = nd->path;
 		nd->path.mnt = NULL;
 		nd->path.dentry = NULL;
-		follow_mount(path);
 	}
 	terminate_walk(nd);
 	return err;
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index f64a33d2a1d1..2a82dcce5fc1 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -206,7 +206,6 @@ TRACE_DEFINE_ENUM(LOOKUP_AUTOMOUNT);
 TRACE_DEFINE_ENUM(LOOKUP_PARENT);
 TRACE_DEFINE_ENUM(LOOKUP_REVAL);
 TRACE_DEFINE_ENUM(LOOKUP_RCU);
-TRACE_DEFINE_ENUM(LOOKUP_NO_REVAL);
 TRACE_DEFINE_ENUM(LOOKUP_OPEN);
 TRACE_DEFINE_ENUM(LOOKUP_CREATE);
 TRACE_DEFINE_ENUM(LOOKUP_EXCL);
@@ -224,7 +223,6 @@ TRACE_DEFINE_ENUM(LOOKUP_DOWN);
 			{ LOOKUP_PARENT, "PARENT" }, \
 			{ LOOKUP_REVAL, "REVAL" }, \
 			{ LOOKUP_RCU, "RCU" }, \
-			{ LOOKUP_NO_REVAL, "NO_REVAL" }, \
 			{ LOOKUP_OPEN, "OPEN" }, \
 			{ LOOKUP_CREATE, "CREATE" }, \
 			{ LOOKUP_EXCL, "EXCL" }, \
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 7fe7b87a3ded..07bfb0874033 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -34,7 +34,6 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 
 /* internal use only */
 #define LOOKUP_PARENT		0x0010
-#define LOOKUP_NO_REVAL		0x0080
 #define LOOKUP_JUMPED		0x1000
 #define LOOKUP_ROOT		0x2000
 #define LOOKUP_ROOT_GRABBED	0x0008

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-10  6:20                               ` Ian Kent
@ 2020-01-12 21:33                                 ` Al Viro
  2020-01-13  2:59                                   ` Ian Kent
  0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-12 21:33 UTC (permalink / raw)
  To: Ian Kent
  Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman,
	stable, Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

On Fri, Jan 10, 2020 at 02:20:55PM +0800, Ian Kent wrote:

> Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time
> we get there it's not relevant any more, so that check looks
> redundant. I'm not aware of any other fs automount implementation
> that needs that EISDIR pass-thru function.
> 
> I didn't notice it at the time of the merge, sorry about that.
> 
> While we're at it that:
>    if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
>        return -EREMOTE;
> 
> at the top of follow_automount() isn't going to be be relevant
> for autofs because ->d_automount() really must always be defined
> for it.
> 
> But, at the time of the merge, I didn't object to it because
> there were (are) other file systems that use the VFS automount
> function which may accidentally not define the method.

OK...

> > Unfortunately, there are other interesting questions related to
> > autofs-specific bits (->d_manage()) and the timezone-related fun
> > is, of course, still there.  I hope to sort that out today or
> > tomorrow, at least enough to do a reasonable set of backportable
> > fixes to put in front of follow_managed()/step_into() queue.
> > Oh, well...
> 
> Yeah, I know it slows you down but I kink-off like having a chance

Nice typo, that ;-)

> to look at what's going and think about your questions before trying
> to answer them, rather than replying prematurely, as I usually do ...
> 
> It's been a bit of a busy day so far but I'm getting to look into
> the questions you've asked.

Here's a bit more of those (I might've missed some of your replies on
IRC; my apologies if that's the case):

1) AFAICS, -EISDIR from ->d_manage() actually means "don't even try
->d_automount() here".  If its effect can be delayed until the decision
to call ->d_automount(), the things seem to get simpler.  Is it ever
returned in situation when the sucker _is_ overmounted?

2) can autofs_d_automount() ever be called for a daemon?  Looks like it
shouldn't be...

3) is _anything_ besides root directory ever created in direct autofs
superblocks by anyone?  If not, why does autofs_lookup() even bother to
do anything there?  IOW, why not have it return ERR_PTR(-ENOENT) immediately
for direct ones?  Or am I missing something and it is, in fact, possible
to have the daemon create something in those?

4) Symlinks look like they should qualify for parent being non-empty;
at least autofs_d_manage() seems to think so (simple_empty() use).
So shouldn't we remove the trap from its parent on symlink/restore on
unlink if parent gets empty?  For version 4 or earlier, that is.  Or is
it simply that daemon only creates symlinks in root directory?


Anyway, intermediate state of the series is in #work.namei right now,
and some _very_ interesting possibilities open up.  It definitely
needs more massage around __follow_mount_rcu() (as it is, the
fastpath in there is still too twisted).  Said that
	* call graph is less convoluted
	* follow_managed() calls are folded into step_into().  Interface:
int step_into(nd, flags, dentry, inode, seq), with inode/seq used only
if we are in RCU mode.
	* ".." still doesn't use that; it probably ought to.
	* lookup_fast() doesn't take path - nd, &inode, &seq and returns dentry
	* lookup_open() and fs/namei.c:atomic_open() get similar treatment
- don't take path, return dentry.
	* calls of follow_managed()/step_into() combination returning 1
are always followed by get_link(), and very shortly, at that.  So much
that we can realistically merge pick_link() (in the end of
step_into()) with get_link().  That merge is NOT done in this branch yet.

The last one promises to get rid of a rather unpleasant group of calling
conventions.  Right now we have several functions (step_into()/
walk_component()/lookup_last()/do_last()) with the following calling
conventions:
	-E...	=> error
	0	=> non-symlink or symlink not followed; nd->path points to it
	1	=> picked a symlink to follow; its mount/dentry/seq has been
pushed on nd->stack[]; its inode is stashed into nd->link_inode for
subsequent get_link() to pick.  nd->path is left unchanged.

That way all of those become
	ERR_PTR(-E...)	=> error
	NULL		=> non-symlink, symlink not followed or a pure
jump (bare "/" or procfs ones); nd->path points to where we end up
        string		=> symlink being followed; the sucker's pushed
to stack, initial jump (if any) has been handled and the string returned
is what we need to traverse.

IMO it's less arbitrary that way.  More importantly, the separation between
step_into() committing to symlink traversal and (inevitably following)
get_link() is gone - it's one operation after that change.  No nd->link_inode
either - it's only needed to carry the information from pick_link() to the
next get_link().

Loops turn into
	while (!(err = link_path_walk(nd, s)) &&
	       (s = lookup_last(nd)) != NULL)
		;
and
	while (!(err = link_path_walk(nd, s)) &&
	       (s = do_last(nd, file, op)) != NULL)
		;

trailing_symlink() goes away (folded into pick_link()/get_link() combo,
conditional upon nd->depth at the entry).  And in link_path_walk() we'll
have
                if (unlikely(!*name)) {
                        /* pathname body, done */
                        if (!nd->depth)
                                return 0;
                        name = nd->stack[nd->depth - 1].name;
                        /* trailing symlink, done */
                        if (!name)
                                return 0;
                        /* last component of nested symlink */
                        s = walk_component(nd, WALK_FOLLOW);
                } else {
                        /* not the last component */
                        s = walk_component(nd, WALK_FOLLOW | WALK_MORE);
                }
                if (s) {
                        if (IS_ERR(s))
                                return PTR_ERR(s);
			/* a symlink to follow */
			nd->stack[nd->depth - 1].name = name;
                        name = s;
                        continue;
                }

Anyway, before I try that one I'm going to fold path_openat2() into
that series - that step is definitely going to require some massage
there; it's too close to get_link() changes done in Aleksa's series.

If we do that, we get a single primitive for "here's the result of
lookup; traverse mounts and either move into the result or, if
it's a symlink that needs to be traversed, start the symlink
traversal - jump into the base position for it (if needed) and
return the pathname that needs to be handled".  As it is, mainline
has that logics spread over about a dozen locations...

Diffstat at the moment:
 fs/autofs/dev-ioctl.c |   6 +-
 fs/internal.h         |   1 -
 fs/namei.c            | 460 ++++++++++++++------------------------------------
 fs/namespace.c        |  97 +++++++----
 fs/nfs/nfstrace.h     |   2 -
 fs/open.c             |   4 +-
 include/linux/namei.h |   3 +-
 7 files changed, 197 insertions(+), 376 deletions(-)

In the current form the sucker appears to work (so far - about 30%
into the usual xfstests run) without visible slowdowns...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-10 23:19                     ` [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Al Viro
@ 2020-01-13  1:48                       ` Ian Kent
  2020-01-13  3:54                         ` Al Viro
  0 siblings, 1 reply; 53+ messages in thread
From: Ian Kent @ 2020-01-13  1:48 UTC (permalink / raw)
  To: Al Viro, Aleksa Sarai
  Cc: David Howells, Eric Biederman, Linus Torvalds, stable,
	Christian Brauner, Serge Hallyn, dev, containers, linux-api,
	linux-fsdevel, linux-kernel

On Fri, 2020-01-10 at 23:19 +0000, Al Viro wrote:
> On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote:
> > On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
> > > On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> > > > 
> > > > > Thanks, this fixes the issue for me (and also fixes another
> > > > > reproducer I
> > > > > found -- mounting a symlink on top of itself then trying to
> > > > > umount it).
> > > > > 
> > > > > Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> > > > > Tested-by: Aleksa Sarai <cyphar@cyphar.com>
> > > > 
> > > > Pushed into #fixes.
> > > 
> > > Thanks. One other thing I noticed is that umount applies to the
> > > underlying symlink rather than the mountpoint on top. So, for
> > > example
> > > (using the same scripts I posted in the thread):
> > > 
> > >   # ln -s /tmp/foo link
> > >   # ./mount_to_symlink /etc/passwd link
> > >   # umount -l link # will attempt to unmount "/tmp/foo"
> > > 
> > > Is that intentional?
> > 
> > It's a mess, again in mountpoint_last().  FWIW, at some point I
> > proposed
> > to have nd_jump_link() to fail with -ELOOP if the target was a
> > symlink;
> > Linus asked for reasons deeper than my dislike of the semantics, I
> > looked
> > around and hadn't spotted anything.  And there hadn't been at the
> > time,
> > but when four months later umount_lookup_last() went in I failed to
> > look
> > for that source of potential problems in it ;-/
> 
> FWIW, since Ian appears to agree that we want ->d_manage() on the
> mount
> crossing at the end of umount(2) lookup, here's a much simpler
> solution -
> kill mountpoint_last() and switch to using lookup_last().  As a side
> benefit, LOOKUP_NO_REVAL also goes away.  It's possible to trim the
> things even more (path_mountpoint() is very similar to
> path_lookupat()
> at that point, and it's not hard to make the differences conditional
> on
> something like LOOKUP_UMOUNT); I would rather do that part in the
> cleanups series - the one below is easier to backport.
> 
> Aleksa, Ian - could you see if the patch below works for you?

I did try this patch and I was trying to work out why it didn't
work. But thought I'd let you know what I saw.

Applying it to current Linus tree systemd stops at switch root.

Not sure what causes that, I couldn't see any reason for it.

I see you have a development branch in your repo. I'll have a look
at that rather than continue with this.

> 
> commit e56b43b971a7c08762fceab330a52b7245041dbc
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date:   Fri Jan 10 17:17:19 2020 -0500
> 
>     reimplement path_mountpoint() with less magic
>     
>     ... and get rid of a bunch of bugs in it.  Background:
>     the reason for path_mountpoint() is that umount() really doesn't
>     want attempts to revalidate the root of what it's trying to
> umount.
>     The thing we want to avoid actually happen from complete_walk();
>     solution was to do something parallel to normal path_lookupat()
>     and it both went overboard and got the boilerplate subtly
>     (and not so subtly) wrong.
>     
>     A better solution is to do pretty much what the normal
> path_lookupat()
>     does, but instead of complete_walk() do unlazy_walk().  All it
> takes
>     to avoid that ->d_weak_revalidate() call...  mountpoint_last()
> goes
>     away, along with everything it got wrong, and so does the magic
> around
>     LOOKUP_NO_REVAL.
>     
>     Another source of bugs is that when we traverse mounts at the
> final
>     location (and we need to do that - umount . expects to get
> whatever's
>     overmounting ., if any, out of the lookup) we really ought to
> take
>     care of ->d_manage() - as it is, manual umount of autofs
> automount
>     in progress can lead to unpleasant surprises for the
> daemon.  Easily
>     solved by using handle_lookup_down() instead of follow_mount().
>     
>     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index d6c91d1e88cb..1793661c3342 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1649,17 +1649,15 @@ static struct dentry *__lookup_slow(const
> struct qstr *name,
>  	if (IS_ERR(dentry))
>  		return dentry;
>  	if (unlikely(!d_in_lookup(dentry))) {
> -		if (!(flags & LOOKUP_NO_REVAL)) {
> -			int error = d_revalidate(dentry, flags);
> -			if (unlikely(error <= 0)) {
> -				if (!error) {
> -					d_invalidate(dentry);
> -					dput(dentry);
> -					goto again;
> -				}
> +		int error = d_revalidate(dentry, flags);
> +		if (unlikely(error <= 0)) {
> +			if (!error) {
> +				d_invalidate(dentry);
>  				dput(dentry);
> -				dentry = ERR_PTR(error);
> +				goto again;
>  			}
> +			dput(dentry);
> +			dentry = ERR_PTR(error);
>  		}
>  	} else {
>  		old = inode->i_op->lookup(inode, dentry, flags);
> @@ -2618,72 +2616,6 @@ int user_path_at_empty(int dfd, const char
> __user *name, unsigned flags,
>  EXPORT_SYMBOL(user_path_at_empty);
>  
>  /**
> - * mountpoint_last - look up last component for umount
> - * @nd:   pathwalk nameidata - currently pointing at parent
> directory of "last"
> - *
> - * This is a special lookup_last function just for umount. In this
> case, we
> - * need to resolve the path without doing any revalidation.
> - *
> - * The nameidata should be the result of doing a LOOKUP_PARENT
> pathwalk. Since
> - * mountpoints are always pinned in the dcache, their ancestors are
> too. Thus,
> - * in almost all cases, this lookup will be served out of the
> dcache. The only
> - * cases where it won't are if nd->last refers to a symlink or the
> path is
> - * bogus and it doesn't exist.
> - *
> - * Returns:
> - * -error: if there was an error during lookup. This includes
> -ENOENT if the
> - *         lookup found a negative dentry.
> - *
> - * 0:      if we successfully resolved nd->last and found it to not
> to be a
> - *         symlink that needs to be followed.
> - *
> - * 1:      if we successfully resolved nd->last and found it to be a
> symlink
> - *         that needs to be followed.
> - */
> -static int
> -mountpoint_last(struct nameidata *nd)
> -{
> -	int error = 0;
> -	struct dentry *dir = nd->path.dentry;
> -	struct path path;
> -
> -	/* If we're in rcuwalk, drop out of it to handle last component
> */
> -	if (nd->flags & LOOKUP_RCU) {
> -		if (unlazy_walk(nd))
> -			return -ECHILD;
> -	}
> -
> -	nd->flags &= ~LOOKUP_PARENT;
> -
> -	if (unlikely(nd->last_type != LAST_NORM)) {
> -		error = handle_dots(nd, nd->last_type);
> -		if (error)
> -			return error;
> -		path.dentry = dget(nd->path.dentry);
> -	} else {
> -		path.dentry = d_lookup(dir, &nd->last);
> -		if (!path.dentry) {
> -			/*
> -			 * No cached dentry. Mounted dentries are
> pinned in the
> -			 * cache, so that means that this dentry is
> probably
> -			 * a symlink or the path doesn't actually point
> -			 * to a mounted dentry.
> -			 */
> -			path.dentry = lookup_slow(&nd->last, dir,
> -					     nd->flags |
> LOOKUP_NO_REVAL);
> -			if (IS_ERR(path.dentry))
> -				return PTR_ERR(path.dentry);
> -		}
> -	}
> -	if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags)))
> {
> -		dput(path.dentry);
> -		return -ENOENT;
> -	}
> -	path.mnt = nd->path.mnt;
> -	return step_into(nd, &path, 0, d_backing_inode(path.dentry),
> 0);
> -}
> -
> -/**
>   * path_mountpoint - look up a path to be umounted
>   * @nd:		lookup context
>   * @flags:	lookup flags
> @@ -2699,14 +2631,17 @@ path_mountpoint(struct nameidata *nd,
> unsigned flags, struct path *path)
>  	int err;
>  
>  	while (!(err = link_path_walk(s, nd)) &&
> -		(err = mountpoint_last(nd)) > 0) {
> +		(err = lookup_last(nd)) > 0) {
>  		s = trailing_symlink(nd);
>  	}
> +	if (!err)
> +		err = unlazy_walk(nd);
> +	if (!err)
> +		err = handle_lookup_down(nd);
>  	if (!err) {
>  		*path = nd->path;
>  		nd->path.mnt = NULL;
>  		nd->path.dentry = NULL;
> -		follow_mount(path);
>  	}
>  	terminate_walk(nd);
>  	return err;
> diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
> index f64a33d2a1d1..2a82dcce5fc1 100644
> --- a/fs/nfs/nfstrace.h
> +++ b/fs/nfs/nfstrace.h
> @@ -206,7 +206,6 @@ TRACE_DEFINE_ENUM(LOOKUP_AUTOMOUNT);
>  TRACE_DEFINE_ENUM(LOOKUP_PARENT);
>  TRACE_DEFINE_ENUM(LOOKUP_REVAL);
>  TRACE_DEFINE_ENUM(LOOKUP_RCU);
> -TRACE_DEFINE_ENUM(LOOKUP_NO_REVAL);
>  TRACE_DEFINE_ENUM(LOOKUP_OPEN);
>  TRACE_DEFINE_ENUM(LOOKUP_CREATE);
>  TRACE_DEFINE_ENUM(LOOKUP_EXCL);
> @@ -224,7 +223,6 @@ TRACE_DEFINE_ENUM(LOOKUP_DOWN);
>  			{ LOOKUP_PARENT, "PARENT" }, \
>  			{ LOOKUP_REVAL, "REVAL" }, \
>  			{ LOOKUP_RCU, "RCU" }, \
> -			{ LOOKUP_NO_REVAL, "NO_REVAL" }, \
>  			{ LOOKUP_OPEN, "OPEN" }, \
>  			{ LOOKUP_CREATE, "CREATE" }, \
>  			{ LOOKUP_EXCL, "EXCL" }, \
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 7fe7b87a3ded..07bfb0874033 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -34,7 +34,6 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT,
> LAST_BIND};
>  
>  /* internal use only */
>  #define LOOKUP_PARENT		0x0010
> -#define LOOKUP_NO_REVAL		0x0080
>  #define LOOKUP_JUMPED		0x1000
>  #define LOOKUP_ROOT		0x2000
>  #define LOOKUP_ROOT_GRABBED	0x0008


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-12 21:33                                 ` Al Viro
@ 2020-01-13  2:59                                   ` Ian Kent
  2020-01-14  0:25                                     ` Ian Kent
  0 siblings, 1 reply; 53+ messages in thread
From: Ian Kent @ 2020-01-13  2:59 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman,
	stable, Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

On Sun, 2020-01-12 at 21:33 +0000, Al Viro wrote:
> On Fri, Jan 10, 2020 at 02:20:55PM +0800, Ian Kent wrote:
> 
> > Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time
> > we get there it's not relevant any more, so that check looks
> > redundant. I'm not aware of any other fs automount implementation
> > that needs that EISDIR pass-thru function.
> > 
> > I didn't notice it at the time of the merge, sorry about that.
> > 
> > While we're at it that:
> >    if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
> >        return -EREMOTE;
> > 
> > at the top of follow_automount() isn't going to be be relevant
> > for autofs because ->d_automount() really must always be defined
> > for it.
> > 
> > But, at the time of the merge, I didn't object to it because
> > there were (are) other file systems that use the VFS automount
> > function which may accidentally not define the method.
> 
> OK...
> 
> > > Unfortunately, there are other interesting questions related to
> > > autofs-specific bits (->d_manage()) and the timezone-related fun
> > > is, of course, still there.  I hope to sort that out today or
> > > tomorrow, at least enough to do a reasonable set of backportable
> > > fixes to put in front of follow_managed()/step_into() queue.
> > > Oh, well...
> > 
> > Yeah, I know it slows you down but I kink-off like having a chance
> 
> Nice typo, that ;-)
> 
> > to look at what's going and think about your questions before
> > trying
> > to answer them, rather than replying prematurely, as I usually do
> > ...
> > 
> > It's been a bit of a busy day so far but I'm getting to look into
> > the questions you've asked.
> 
> Here's a bit more of those (I might've missed some of your replies on
> IRC; my apologies if that's the case):
> 
> 1) AFAICS, -EISDIR from ->d_manage() actually means "don't even try
> ->d_automount() here".  If its effect can be delayed until the
> decision
> to call ->d_automount(), the things seem to get simpler.  Is it ever
> returned in situation when the sucker _is_ overmounted?

In theory it shouldn't need to be returned when there is an
actual mount there.

If there is a real mount at this point that should be enough to
prevent walks into that mount until it's mount is complete.

The whole idea of -EISDIR is to prevent processes from walking
into a directory tree that "doesn't have a real mount at its
base" (the so called multi-mount map construct).

> 
> 2) can autofs_d_automount() ever be called for a daemon?  Looks like
> it
> shouldn't be...

Can't do that, it will lead to deadlock very quickly.

> 
> 3) is _anything_ besides root directory ever created in direct autofs
> superblocks by anyone?  If not, why does autofs_lookup() even bother
> to
> do anything there?  IOW, why not have it return ERR_PTR(-ENOENT)
> immediately
> for direct ones?  Or am I missing something and it is, in fact,
> possible
> to have the daemon create something in those?

Short answer is no, longer answer is directories "shouldn't" ever
be created inside direct mount points.

The thing is that the multi-mount map construct can be used with
direct mounts too, but they must always have a real mount at the
base because they are direct mounts. So processes should not be
able to walk into them while they are being mounted (constructed).

But I'm pretty sure it's rare (maybe not done at all) that this
map construct is used with direct mounts.

> 
> 4) Symlinks look like they should qualify for parent being non-empty;
> at least autofs_d_manage() seems to think so (simple_empty() use).
> So shouldn't we remove the trap from its parent on symlink/restore on
> unlink if parent gets empty?  For version 4 or earlier, that is.  Or
> is
> it simply that daemon only creates symlinks in root directory?

Yes, they have to be empty.

If a symlink is to be used (based on autofs config or map option)
and the "browse" option is used for the indirect mount (browse
only makes sense for indirect autofs managed mounts) then the
mount point directory has to be removed and a symlink created
so it must be empty to for this to make sense.

If it's a "nobrowse" autofs mount then nothing should already
exist, it just gets created.

The catch is that a map entry for which a symlink is to be used
instead of a mount can't be a multi-mount. I'm pretty sure I don't
have sufficient error checking for that in the daemon but I also
haven't had reports of problems with it either.

For a very long time the use of symlinks was not common but when
the amd format map parser was added it made sense to use symlinks
in some cases for those. That was partly to reduce the number of
mounts needed and because I deliberately don't support amd map
entries that provide the multi-mount construct. The way amd did
this looked ugly to me, very much a hack to add a Sun format
mount feature.

As far as keeping the trap flags up to date, I don't.

It seemed so much simpler to just leave the flags in place but,
at that time, symlinks were not used (although it was possible to
do so), now that's changed fiddling with the flags might now make
sense.

As I said on IRC:
"DCACHE_NEED_AUTOMOUNT is set on symlink dentries because, when
->lookup() is called the dentry may trigger a callback to the
daemon that will either create a directory (since, in this case,
one does not already exist) and attempt to mount on it or create
a symlink if the autofs config/map requires it.

I didn't think there would be potential simplification by setting
and clearing the DCACHE_NEED_AUTOMOUNT flag based on it being a
directory (mountpoint) or a symlink so the flag is always left set.
Although, as you point out, symlinks won't actually trigger mounts
so the flag being left set when the dentry is a symlink is due to
lazyness, since there's nothing to gain. If you can see potential
simplification in the VFS code by managing this flag better then
that would be worth while."

> 
> 
> Anyway, intermediate state of the series is in #work.namei right now,
> and some _very_ interesting possibilities open up.  It definitely
> needs more massage around __follow_mount_rcu() (as it is, the
> fastpath in there is still too twisted).  Said that
> 	* call graph is less convoluted
> 	* follow_managed() calls are folded into
> step_into().  Interface:
> int step_into(nd, flags, dentry, inode, seq), with inode/seq used
> only
> if we are in RCU mode.
> 	* ".." still doesn't use that; it probably ought to.
> 	* lookup_fast() doesn't take path - nd, &inode, &seq and
> returns dentry
> 	* lookup_open() and fs/namei.c:atomic_open() get similar
> treatment
> - don't take path, return dentry.
> 	* calls of follow_managed()/step_into() combination returning 1
> are always followed by get_link(), and very shortly, at that.  So
> much
> that we can realistically merge pick_link() (in the end of
> step_into()) with get_link().  That merge is NOT done in this branch
> yet.
> 
> The last one promises to get rid of a rather unpleasant group of
> calling
> conventions.  Right now we have several functions (step_into()/
> walk_component()/lookup_last()/do_last()) with the following calling
> conventions:
> 	-E...	=> error
> 	0	=> non-symlink or symlink not followed; nd->path points to it
> 	1	=> picked a symlink to follow; its mount/dentry/seq has been
> pushed on nd->stack[]; its inode is stashed into nd->link_inode for
> subsequent get_link() to pick.  nd->path is left unchanged.
> 
> That way all of those become
> 	ERR_PTR(-E...)	=> error
> 	NULL		=> non-symlink, symlink not followed or a
> pure
> jump (bare "/" or procfs ones); nd->path points to where we end up
>         string		=> symlink being followed; the sucker's
> pushed
> to stack, initial jump (if any) has been handled and the string
> returned
> is what we need to traverse.
> 
> IMO it's less arbitrary that way.  More importantly, the separation
> between
> step_into() committing to symlink traversal and (inevitably
> following)
> get_link() is gone - it's one operation after that change.  No nd-
> >link_inode
> either - it's only needed to carry the information from pick_link()
> to the
> next get_link().
> 
> Loops turn into
> 	while (!(err = link_path_walk(nd, s)) &&
> 	       (s = lookup_last(nd)) != NULL)
> 		;
> and
> 	while (!(err = link_path_walk(nd, s)) &&
> 	       (s = do_last(nd, file, op)) != NULL)
> 		;
> 
> trailing_symlink() goes away (folded into pick_link()/get_link()
> combo,
> conditional upon nd->depth at the entry).  And in link_path_walk()
> we'll
> have
>                 if (unlikely(!*name)) {
>                         /* pathname body, done */
>                         if (!nd->depth)
>                                 return 0;
>                         name = nd->stack[nd->depth - 1].name;
>                         /* trailing symlink, done */
>                         if (!name)
>                                 return 0;
>                         /* last component of nested symlink */
>                         s = walk_component(nd, WALK_FOLLOW);
>                 } else {
>                         /* not the last component */
>                         s = walk_component(nd, WALK_FOLLOW |
> WALK_MORE);
>                 }
>                 if (s) {
>                         if (IS_ERR(s))
>                                 return PTR_ERR(s);
> 			/* a symlink to follow */
> 			nd->stack[nd->depth - 1].name = name;
>                         name = s;
>                         continue;
>                 }
> 
> Anyway, before I try that one I'm going to fold path_openat2() into
> that series - that step is definitely going to require some massage
> there; it's too close to get_link() changes done in Aleksa's series.
> 
> If we do that, we get a single primitive for "here's the result of
> lookup; traverse mounts and either move into the result or, if
> it's a symlink that needs to be traversed, start the symlink
> traversal - jump into the base position for it (if needed) and
> return the pathname that needs to be handled".  As it is, mainline
> has that logics spread over about a dozen locations...
> 
> Diffstat at the moment:
>  fs/autofs/dev-ioctl.c |   6 +-
>  fs/internal.h         |   1 -
>  fs/namei.c            | 460 ++++++++++++++------------------------
> ------------
>  fs/namespace.c        |  97 +++++++----
>  fs/nfs/nfstrace.h     |   2 -
>  fs/open.c             |   4 +-
>  include/linux/namei.h |   3 +-
>  7 files changed, 197 insertions(+), 376 deletions(-)
> 
> In the current form the sucker appears to work (so far - about 30%
> into the usual xfstests run) without visible slowdowns...

Ok, I'll have a look at that branch, ;)

Ian


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-13  1:48                       ` Ian Kent
@ 2020-01-13  3:54                         ` Al Viro
  2020-01-13  6:00                           ` Ian Kent
  0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-13  3:54 UTC (permalink / raw)
  To: Ian Kent
  Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds,
	stable, Christian Brauner, Serge Hallyn, dev, containers,
	linux-api, linux-fsdevel, linux-kernel

On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote:

> I did try this patch and I was trying to work out why it didn't
> work. But thought I'd let you know what I saw.
> 
> Applying it to current Linus tree systemd stops at switch root.
> 
> Not sure what causes that, I couldn't see any reason for it.

Wait a minute...  So you are seeing problems early in the boot,
before any autofs ioctls might come into play?

Sigh...  Guess I'll have to dig that Fedora KVM image out and
try to see what it's about... ;-/  Here comes a couple of hours
of build...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-13  3:54                         ` Al Viro
@ 2020-01-13  6:00                           ` Ian Kent
  2020-01-13  6:03                             ` Ian Kent
  0 siblings, 1 reply; 53+ messages in thread
From: Ian Kent @ 2020-01-13  6:00 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds,
	stable, Christian Brauner, Serge Hallyn, dev, containers,
	linux-api, linux-fsdevel, linux-kernel

On Mon, 2020-01-13 at 03:54 +0000, Al Viro wrote:
> On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote:
> 
> > I did try this patch and I was trying to work out why it didn't
> > work. But thought I'd let you know what I saw.
> > 
> > Applying it to current Linus tree systemd stops at switch root.
> > 
> > Not sure what causes that, I couldn't see any reason for it.
> 
> Wait a minute...  So you are seeing problems early in the boot,
> before any autofs ioctls might come into play?

I did, then I checked it booted without the patch, then tried
building from scratch with the patch twice and same thing
happened each time.

Looked like this, such as it is:
[ OK ] Reached target Switch Root.
[ OK ] Started Plymouth switch root service.
       Starting Switch Root...

I don't have any evidence but thought it might be this:
https://github.com/karelzak/util-linux/blob/master/sys-utils/switch_root.c

Mind you, that's not the actual systemd repo. either I probably
need to look a lot deeper (and at the actual systemd repo) to
work out what's actually being called.

> 
> Sigh...  Guess I'll have to dig that Fedora KVM image out and
> try to see what it's about... ;-/  Here comes a couple of hours
> of build...


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-13  6:00                           ` Ian Kent
@ 2020-01-13  6:03                             ` Ian Kent
  2020-01-13 13:30                               ` Al Viro
  0 siblings, 1 reply; 53+ messages in thread
From: Ian Kent @ 2020-01-13  6:03 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds,
	stable, Christian Brauner, Serge Hallyn, dev, containers,
	linux-api, linux-fsdevel, linux-kernel

On Mon, 2020-01-13 at 14:00 +0800, Ian Kent wrote:
> On Mon, 2020-01-13 at 03:54 +0000, Al Viro wrote:
> > On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote:
> > 
> > > I did try this patch and I was trying to work out why it didn't
> > > work. But thought I'd let you know what I saw.
> > > 
> > > Applying it to current Linus tree systemd stops at switch root.
> > > 
> > > Not sure what causes that, I couldn't see any reason for it.
> > 
> > Wait a minute...  So you are seeing problems early in the boot,
> > before any autofs ioctls might come into play?
> 
> I did, then I checked it booted without the patch, then tried
> building from scratch with the patch twice and same thing
> happened each time.
> 
> Looked like this, such as it is:
> [ OK ] Reached target Switch Root.
> [ OK ] Started Plymouth switch root service.
>        Starting Switch Root...
> 
> I don't have any evidence but thought it might be this:
> https://github.com/karelzak/util-linux/blob/master/sys-utils/switch_root.c

Oh wait, for systemd I was actually looking at:
https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c

> 
> Mind you, that's not the actual systemd repo. either I probably
> need to look a lot deeper (and at the actual systemd repo) to
> work out what's actually being called.
> 
> > Sigh...  Guess I'll have to dig that Fedora KVM image out and
> > try to see what it's about... ;-/  Here comes a couple of hours
> > of build...


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-13  6:03                             ` Ian Kent
@ 2020-01-13 13:30                               ` Al Viro
  2020-01-14  7:25                                 ` Ian Kent
  0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-13 13:30 UTC (permalink / raw)
  To: Ian Kent
  Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds,
	stable, Christian Brauner, Serge Hallyn, dev, containers,
	linux-api, linux-fsdevel, linux-kernel

On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote:

> Oh wait, for systemd I was actually looking at:
> https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c
> 
> > 
> > Mind you, that's not the actual systemd repo. either I probably
> > need to look a lot deeper (and at the actual systemd repo) to
> > work out what's actually being called.
> > 
> > > Sigh...  Guess I'll have to dig that Fedora KVM image out and
> > > try to see what it's about... ;-/  Here comes a couple of hours
> > > of build...

D'oh...  And yes, that would've been a bisect hazard - switch to
path_lookupat() later in the series gets rid of that.  Incremental
(to be foldede, of course):

diff --git a/fs/namei.c b/fs/namei.c
index 1793661c3342..204677c37751 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path)
 		(err = lookup_last(nd)) > 0) {
 		s = trailing_symlink(nd);
 	}
-	if (!err)
+	if (!err && (nd->flags & LOOKUP_RCU))
 		err = unlazy_walk(nd);
 	if (!err)
 		err = handle_lookup_down(nd);

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-13  2:59                                   ` Ian Kent
@ 2020-01-14  0:25                                     ` Ian Kent
  2020-01-14  4:39                                       ` Al Viro
  0 siblings, 1 reply; 53+ messages in thread
From: Ian Kent @ 2020-01-14  0:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman,
	stable, Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

On Mon, 2020-01-13 at 10:59 +0800, Ian Kent wrote:
> 
> > 3) is _anything_ besides root directory ever created in direct
> > autofs
> > superblocks by anyone?  If not, why does autofs_lookup() even
> > bother
> > to
> > do anything there?  IOW, why not have it return ERR_PTR(-ENOENT)
> > immediately
> > for direct ones?  Or am I missing something and it is, in fact,
> > possible
> > to have the daemon create something in those?
> 
> Short answer is no, longer answer is directories "shouldn't" ever
> be created inside direct mount points.
> 
> The thing is that the multi-mount map construct can be used with
> direct mounts too, but they must always have a real mount at the
> base because they are direct mounts. So processes should not be
> able to walk into them while they are being mounted (constructed).
> 
> But I'm pretty sure it's rare (maybe not done at all) that this
> map construct is used with direct mounts.

This isn't right.

There's actually nothing stopping a user from using a direct map
entry that's a multi-mount without an actual mount at its root.
So there could be directories created under these, it's just not
usually done.

I'm pretty sure I don't check and disallow this.

Ian


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-14  0:25                                     ` Ian Kent
@ 2020-01-14  4:39                                       ` Al Viro
  2020-01-14  5:01                                         ` Ian Kent
  0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-14  4:39 UTC (permalink / raw)
  To: Ian Kent
  Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman,
	stable, Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:

> This isn't right.
> 
> There's actually nothing stopping a user from using a direct map
> entry that's a multi-mount without an actual mount at its root.
> So there could be directories created under these, it's just not
> usually done.
> 
> I'm pretty sure I don't check and disallow this.

IDGI...  How the hell will that work in v5?  Who will set _any_
traps outside the one in root in that scenario?  autofs_lookup()
won't (there it's conditional upon indirect mount).  Neither
will autofs_dir_mkdir() (conditional upon version being less
than 5).  Who will, then?

Confused...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-10 21:07                         ` Aleksa Sarai
@ 2020-01-14  4:57                           ` Al Viro
  2020-01-14  5:12                             ` Al Viro
                                               ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Al Viro @ 2020-01-14  4:57 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:

> If I'm understanding this proposal correctly, this would be a problem
> for the libpathrs use-case -- if this is done then there's no way to
> avoid a TOCTOU with someone mounting and the userspace program checking
> whether something is a mountpoint (unless you have Linux >5.6 and
> RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
> 
>   1. Open the candidate directory.
>   2. umount2(MNT_EXPIRE) the fd.
>     * -EINVAL means it wasn't a mountpoint when we got the fd, and the
> 	  fd is a stable handle to the underlying directory.
> 	* -EAGAIN or -EBUSY means that it was a mountpoint or became a
> 	  mountpoint after the fd was opened (we don't care about that, but
> 	  fail-safe is better here).
>   3. Use the fd from (1) for all operations.

... except that foo/../bar *WILL* cross into the covering mount, on any
kernel that supports ...at(2) at all, so I would be very cautious about
any kind "hardening" claims in that case.

I'm not sure about Linus' proposal - it looks rather convoluted and we
get a hard to describe twist of semantics in an area (procfs symlinks
vs. mount traversal) on top of everything else in there...

Anyway, a couple of questions:

1) do you see any problems on your testcases with the current #fixes?
That's commit 7a955b7363b8 as branch tip.

2) do you have any updates you would like to fold into stuff in
#work.openat2?  Right now I have a local variant of #work.namei (with
fairly cosmetical change compared to vfs.git one) that merges clean
with #work.openat2; I would like to do any updates/fold-ins/etc.
of #work.openat2 *before* doing a merge and continuing to work on
top of the merge results...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-14  4:39                                       ` Al Viro
@ 2020-01-14  5:01                                         ` Ian Kent
  2020-01-14  5:59                                           ` Ian Kent
  0 siblings, 1 reply; 53+ messages in thread
From: Ian Kent @ 2020-01-14  5:01 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman,
	stable, Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

On Tue, 2020-01-14 at 04:39 +0000, Al Viro wrote:
> On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:
> 
> > This isn't right.
> > 
> > There's actually nothing stopping a user from using a direct map
> > entry that's a multi-mount without an actual mount at its root.
> > So there could be directories created under these, it's just not
> > usually done.
> > 
> > I'm pretty sure I don't check and disallow this.
> 
> IDGI...  How the hell will that work in v5?  Who will set _any_
> traps outside the one in root in that scenario?  autofs_lookup()
> won't (there it's conditional upon indirect mount).  Neither
> will autofs_dir_mkdir() (conditional upon version being less
> than 5).  Who will, then?
> 
> Confused...

It's easy to miss.

For autofs type direct and offset mounts the flags are set at fill
super time.

They have to be set then because they are direct mounts and offset
mounts behave the same as direct mounts so they need to be set then
too. So, like direct mounts, offset mounts are each distinct autofs
(trigger) mounts.

I could check for this construct and refuse it if that's really
needed. I'm pretty sure this map construct isn't much used by
people using direct mounts.

Ian


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-14  4:57                           ` Al Viro
@ 2020-01-14  5:12                             ` Al Viro
  2020-01-14 20:01                             ` Aleksa Sarai
  2020-01-15 13:57                             ` Aleksa Sarai
  2 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2020-01-14  5:12 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On Tue, Jan 14, 2020 at 04:57:33AM +0000, Al Viro wrote:
> On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:
> 
> > If I'm understanding this proposal correctly, this would be a problem
> > for the libpathrs use-case -- if this is done then there's no way to
> > avoid a TOCTOU with someone mounting and the userspace program checking
> > whether something is a mountpoint (unless you have Linux >5.6 and
> > RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
> > 
> >   1. Open the candidate directory.
> >   2. umount2(MNT_EXPIRE) the fd.
> >     * -EINVAL means it wasn't a mountpoint when we got the fd, and the
> > 	  fd is a stable handle to the underlying directory.
> > 	* -EAGAIN or -EBUSY means that it was a mountpoint or became a
> > 	  mountpoint after the fd was opened (we don't care about that, but
> > 	  fail-safe is better here).
> >   3. Use the fd from (1) for all operations.
> 
> ... except that foo/../bar *WILL* cross into the covering mount, on any
> kernel that supports ...at(2) at all, so I would be very cautious about
> any kind "hardening" claims in that case.
> 
> I'm not sure about Linus' proposal - it looks rather convoluted and we
> get a hard to describe twist of semantics in an area (procfs symlinks
> vs. mount traversal) on top of everything else in there...

PS: one thing that might be interesting is exposing LOOKUP_DOWN via
AT_... flag - it would allow to request mount traversals at the starting
point explicitly.  Pretty much all code needed for that is already there;
all it would take is checking the flag in path_openat() and path_parentat()
and having handle_lookup_down() called there, same as in path_lookupat().

A tricky question is whether such flag should affect absolute symlinks -
i.e.

chdir /foo
ln -s /bar barf
overmount /
do lookup with that flag for /bar/splat
do lookup with that flag for barf/splat

Do we want the same results in both calls?  The first one would
traverse mounts on / and walk into /bar/splat in overmounting;
the second - see no mounts whatsoever on current directory (/foo
in old root), see the symlink to "/bar", jump to process' root
and proceed from there, first for "bar", then "splat" in it...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-14  5:01                                         ` Ian Kent
@ 2020-01-14  5:59                                           ` Ian Kent
  0 siblings, 0 replies; 53+ messages in thread
From: Ian Kent @ 2020-01-14  5:59 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Aleksa Sarai, David Howells, Eric Biederman,
	stable, Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

On Tue, 2020-01-14 at 13:01 +0800, Ian Kent wrote:
> On Tue, 2020-01-14 at 04:39 +0000, Al Viro wrote:
> > On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:
> > 
> > > This isn't right.
> > > 
> > > There's actually nothing stopping a user from using a direct map
> > > entry that's a multi-mount without an actual mount at its root.
> > > So there could be directories created under these, it's just not
> > > usually done.
> > > 
> > > I'm pretty sure I don't check and disallow this.
> > 
> > IDGI...  How the hell will that work in v5?  Who will set _any_
> > traps outside the one in root in that scenario?  autofs_lookup()
> > won't (there it's conditional upon indirect mount).  Neither
> > will autofs_dir_mkdir() (conditional upon version being less
> > than 5).  Who will, then?
> > 
> > Confused...
> 
> It's easy to miss.
> 
> For autofs type direct and offset mounts the flags are set at fill
> super time.
> 
> They have to be set then because they are direct mounts and offset
> mounts behave the same as direct mounts so they need to be set then
> too. So, like direct mounts, offset mounts are each distinct autofs
> (trigger) mounts.
> 
> I could check for this construct and refuse it if that's really
> needed. I'm pretty sure this map construct isn't much used by
> people using direct mounts.

Ok, once again I'm not exactly accurate is some of what I said.

It turns out that the autofs connectathon tests, one of the tests
that I use, does test direct mounts with offsets both with and
without a real mount at the base of the mount.

Based on that, I have to say this map construct is meant to be
supported with Sun format maps of autofs (even though I think it's
probably not used much).

So not allowing it is probably the wrong thing to do.

OTOH initial testing with the #work.namei branch shows these are
functioning as required.

Ian


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-13 13:30                               ` Al Viro
@ 2020-01-14  7:25                                 ` Ian Kent
  2020-01-14 12:17                                   ` Ian Kent
  0 siblings, 1 reply; 53+ messages in thread
From: Ian Kent @ 2020-01-14  7:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds,
	stable, Christian Brauner, Serge Hallyn, dev, containers,
	linux-api, linux-fsdevel, linux-kernel

On Mon, 2020-01-13 at 13:30 +0000, Al Viro wrote:
> On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote:
> 
> > Oh wait, for systemd I was actually looking at:
> > https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c
> > 
> > > Mind you, that's not the actual systemd repo. either I probably
> > > need to look a lot deeper (and at the actual systemd repo) to
> > > work out what's actually being called.
> > > 
> > > > Sigh...  Guess I'll have to dig that Fedora KVM image out and
> > > > try to see what it's about... ;-/  Here comes a couple of hours
> > > > of build...
> 
> D'oh...  And yes, that would've been a bisect hazard - switch to
> path_lookupat() later in the series gets rid of that.  Incremental
> (to be foldede, of course):
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 1793661c3342..204677c37751 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd, unsigned
> flags, struct path *path)
>  		(err = lookup_last(nd)) > 0) {
>  		s = trailing_symlink(nd);
>  	}
> -	if (!err)
> +	if (!err && (nd->flags & LOOKUP_RCU))
>  		err = unlazy_walk(nd);
>  	if (!err)
>  		err = handle_lookup_down(nd);

Ok, so I've tested with the updated patch.

The autofs connectathon tests I use function fine.

I also tested sending a SIGKILL to the daemon with about 180 active
mounts and restarted the daemon to test the function of the ioctls
that Al was concerned about.

While the connectathon test expired everything I had 3 mounts left
after allowing sufficient expire time with the SIGKILL test.

Those mounts correspond to one map entry that has a mix of NFS
vers=3 and vers=2 mount options and NFSv2 isn't supported by the
servers I use in testing.

I'm inclined to think this is a bug in the automount mount tree
re-connection code rather than a problem with this patch since
all the other mounts, some simple and others with not so simple
constructs, expired fine after automount re-connected to them.

There are two other map entries that have an NFS vers=2 option but
they are simple mounts that will fail on attempting the automount
because the server doesn't support v2 so they don't end up with
mounts to reconnect to.

This particular map entry, having a mix of NFS vers=3 and vers=2
in the offsets of the entry, will lead to a partial mount of the
map entry which is probably not being handled properly by automount
when re-connecting to the mounts in the tree.

So I think the patch here is fine from an autofs POV.

Ian


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-14  7:25                                 ` Ian Kent
@ 2020-01-14 12:17                                   ` Ian Kent
  0 siblings, 0 replies; 53+ messages in thread
From: Ian Kent @ 2020-01-14 12:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, David Howells, Eric Biederman, Linus Torvalds,
	stable, Christian Brauner, Serge Hallyn, dev, containers,
	linux-api, linux-fsdevel, linux-kernel

On Tue, 2020-01-14 at 15:25 +0800, Ian Kent wrote:
> On Mon, 2020-01-13 at 13:30 +0000, Al Viro wrote:
> > On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote:
> > 
> > > Oh wait, for systemd I was actually looking at:
> > > https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c
> > > 
> > > > Mind you, that's not the actual systemd repo. either I probably
> > > > need to look a lot deeper (and at the actual systemd repo) to
> > > > work out what's actually being called.
> > > > 
> > > > > Sigh...  Guess I'll have to dig that Fedora KVM image out and
> > > > > try to see what it's about... ;-/  Here comes a couple of
> > > > > hours
> > > > > of build...
> > 
> > D'oh...  And yes, that would've been a bisect hazard - switch to
> > path_lookupat() later in the series gets rid of that.  Incremental
> > (to be foldede, of course):
> > 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 1793661c3342..204677c37751 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd,
> > unsigned
> > flags, struct path *path)
> >  		(err = lookup_last(nd)) > 0) {
> >  		s = trailing_symlink(nd);
> >  	}
> > -	if (!err)
> > +	if (!err && (nd->flags & LOOKUP_RCU))
> >  		err = unlazy_walk(nd);
> >  	if (!err)
> >  		err = handle_lookup_down(nd);
> 
> Ok, so I've tested with the updated patch.
> 
> The autofs connectathon tests I use function fine.
> 
> I also tested sending a SIGKILL to the daemon with about 180 active
> mounts and restarted the daemon to test the function of the ioctls
> that Al was concerned about.
> 
> While the connectathon test expired everything I had 3 mounts left
> after allowing sufficient expire time with the SIGKILL test.
> 
> Those mounts correspond to one map entry that has a mix of NFS
> vers=3 and vers=2 mount options and NFSv2 isn't supported by the
> servers I use in testing.
> 
> I'm inclined to think this is a bug in the automount mount tree
> re-connection code rather than a problem with this patch since
> all the other mounts, some simple and others with not so simple
> constructs, expired fine after automount re-connected to them.
> 
> There are two other map entries that have an NFS vers=2 option but
> they are simple mounts that will fail on attempting the automount
> because the server doesn't support v2 so they don't end up with
> mounts to reconnect to.
> 
> This particular map entry, having a mix of NFS vers=3 and vers=2
> in the offsets of the entry, will lead to a partial mount of the
> map entry which is probably not being handled properly by automount
> when re-connecting to the mounts in the tree.
> 
> So I think the patch here is fine from an autofs POV.

Umm ... unfortunately further testing shows an autofs problem.

It appears to be present in the current kernel (so far I've only
been able to check the current git head and an earlier kernel
but can't remember the version and can't check) so I must have
missed it.

I'm attempting to bisect now but managed to trash the root
file system on my VM. I'll get this done as quickly as I can.

Ian


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-14  4:57                           ` Al Viro
  2020-01-14  5:12                             ` Al Viro
@ 2020-01-14 20:01                             ` Aleksa Sarai
  2020-01-15 14:25                               ` Al Viro
  2020-01-15 13:57                             ` Aleksa Sarai
  2 siblings, 1 reply; 53+ messages in thread
From: Aleksa Sarai @ 2020-01-14 20:01 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

[-- Attachment #1: Type: text/plain, Size: 2839 bytes --]

On 2020-01-14, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:
> 
> > If I'm understanding this proposal correctly, this would be a problem
> > for the libpathrs use-case -- if this is done then there's no way to
> > avoid a TOCTOU with someone mounting and the userspace program checking
> > whether something is a mountpoint (unless you have Linux >5.6 and
> > RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
> > 
> >   1. Open the candidate directory.
> >   2. umount2(MNT_EXPIRE) the fd.
> >     * -EINVAL means it wasn't a mountpoint when we got the fd, and the
> > 	  fd is a stable handle to the underlying directory.
> > 	* -EAGAIN or -EBUSY means that it was a mountpoint or became a
> > 	  mountpoint after the fd was opened (we don't care about that, but
> > 	  fail-safe is better here).
> >   3. Use the fd from (1) for all operations.
> 
> ... except that foo/../bar *WILL* cross into the covering mount, on any
> kernel that supports ...at(2) at all, so I would be very cautious about
> any kind "hardening" claims in that case.

In the use-case I have, we would have full control over what the path
being opened is (and thus you wouldn't open "foo/../bar"). But I agree
that generally the MNT_EXPIRE solution is really non-ideal anyway.

Not to mention that we're still screwed when it comes to using
magic-links (because if someone bind-mounts a magic-link over a
magic-link there's absolutely no race-free way to be sure that we're
traversing the right magic-link -- for that we'll need to have a
different solution).

> I'm not sure about Linus' proposal - it looks rather convoluted and we
> get a hard to describe twist of semantics in an area (procfs symlinks
> vs. mount traversal) on top of everything else in there...

Yeah, I agree.

> 1) do you see any problems on your testcases with the current #fixes?
> That's commit 7a955b7363b8 as branch tip.

I will take a quick look later today, but I'm currently at a conference.

> 2) do you have any updates you would like to fold into stuff in
> #work.openat2?  Right now I have a local variant of #work.namei (with
> fairly cosmetical change compared to vfs.git one) that merges clean
> with #work.openat2; I would like to do any updates/fold-ins/etc.
> of #work.openat2 *before* doing a merge and continuing to work on
> top of the merge results...

Yes, there were two patches I sent a while ago[1]. I can re-send them if
you like. The second patch switches open_how->mode to a u64, but I'm
still on the fence about whether that makes sense to do...

[1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-14  4:57                           ` Al Viro
  2020-01-14  5:12                             ` Al Viro
  2020-01-14 20:01                             ` Aleksa Sarai
@ 2020-01-15 13:57                             ` Aleksa Sarai
  2020-01-19  3:14                               ` [RFC][PATCHSET][CFT] pathwalk cleanups and fixes Al Viro
  2 siblings, 1 reply; 53+ messages in thread
From: Aleksa Sarai @ 2020-01-15 13:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

[-- Attachment #1: Type: text/plain, Size: 461 bytes --]

On 2020-01-14, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 1) do you see any problems on your testcases with the current #fixes?
> That's commit 7a955b7363b8 as branch tip.

I just finished testing the few cases I reported earlier and they both
appear to be fixed with the current #work.namei branch. And I don't have
any troubles booting whatsoever.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-14 20:01                             ` Aleksa Sarai
@ 2020-01-15 14:25                               ` Al Viro
  2020-01-15 14:29                                 ` Aleksa Sarai
  0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-15 14:25 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:

> Yes, there were two patches I sent a while ago[1]. I can re-send them if
> you like. The second patch switches open_how->mode to a u64, but I'm
> still on the fence about whether that makes sense to do...

IMO plain __u64 is better than games with __aligned_u64 - all sizes are
fixed, so...

> [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/

Do you want that series folded into "open: introduce openat2(2) syscall"
and "selftests: add openat2(2) selftests" or would you rather have them
appended at the end of the series.  Personally I'd go for "fold them in"
if it had been about my code, but it's really up to you.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-15 14:25                               ` Al Viro
@ 2020-01-15 14:29                                 ` Aleksa Sarai
  2020-01-15 14:34                                   ` Aleksa Sarai
  0 siblings, 1 reply; 53+ messages in thread
From: Aleksa Sarai @ 2020-01-15 14:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

[-- Attachment #1: Type: text/plain, Size: 1028 bytes --]

On 2020-01-15, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
> 
> > Yes, there were two patches I sent a while ago[1]. I can re-send them if
> > you like. The second patch switches open_how->mode to a u64, but I'm
> > still on the fence about whether that makes sense to do...
> 
> IMO plain __u64 is better than games with __aligned_u64 - all sizes are
> fixed, so...
> 
> > [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/
> 
> Do you want that series folded into "open: introduce openat2(2) syscall"
> and "selftests: add openat2(2) selftests" or would you rather have them
> appended at the end of the series.  Personally I'd go for "fold them in"
> if it had been about my code, but it's really up to you.

"fold them in" would probably be better to avoid making the mainline
history confusing afterwards. Thanks.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-15 14:29                                 ` Aleksa Sarai
@ 2020-01-15 14:34                                   ` Aleksa Sarai
  2020-01-15 14:48                                     ` Al Viro
  0 siblings, 1 reply; 53+ messages in thread
From: Aleksa Sarai @ 2020-01-15 14:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

[-- Attachment #1: Type: text/plain, Size: 1205 bytes --]

On 2020-01-16, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2020-01-15, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
> > 
> > > Yes, there were two patches I sent a while ago[1]. I can re-send them if
> > > you like. The second patch switches open_how->mode to a u64, but I'm
> > > still on the fence about whether that makes sense to do...
> > 
> > IMO plain __u64 is better than games with __aligned_u64 - all sizes are
> > fixed, so...
> > 
> > > [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/
> > 
> > Do you want that series folded into "open: introduce openat2(2) syscall"
> > and "selftests: add openat2(2) selftests" or would you rather have them
> > appended at the end of the series.  Personally I'd go for "fold them in"
> > if it had been about my code, but it's really up to you.
> 
> "fold them in" would probably be better to avoid making the mainline
> history confusing afterwards. Thanks.

Also (if you prefer) I can send a v3 which uses u64s rather than
aligned_u64s.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks
  2020-01-15 14:34                                   ` Aleksa Sarai
@ 2020-01-15 14:48                                     ` Al Viro
  0 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2020-01-15 14:48 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

On Thu, Jan 16, 2020 at 01:34:59AM +1100, Aleksa Sarai wrote:
> On 2020-01-16, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > On 2020-01-15, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
> > > 
> > > > Yes, there were two patches I sent a while ago[1]. I can re-send them if
> > > > you like. The second patch switches open_how->mode to a u64, but I'm
> > > > still on the fence about whether that makes sense to do...
> > > 
> > > IMO plain __u64 is better than games with __aligned_u64 - all sizes are
> > > fixed, so...
> > > 
> > > > [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/
> > > 
> > > Do you want that series folded into "open: introduce openat2(2) syscall"
> > > and "selftests: add openat2(2) selftests" or would you rather have them
> > > appended at the end of the series.  Personally I'd go for "fold them in"
> > > if it had been about my code, but it's really up to you.
> > 
> > "fold them in" would probably be better to avoid making the mainline
> > history confusing afterwards. Thanks.
> 
> Also (if you prefer) I can send a v3 which uses u64s rather than
> aligned_u64s.

<mode "lazy bastard">
Could you fold and resend the results of folding (i.e. replacements
for two commits in question)?
</mode>

The hard part is, of course, in updating commit messages ;-)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC][PATCHSET][CFT] pathwalk cleanups and fixes
  2020-01-15 13:57                             ` Aleksa Sarai
@ 2020-01-19  3:14                               ` Al Viro
  2020-01-19 14:33                                 ` Ian Kent
  0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2020-01-19  3:14 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Ian Kent

	OK, vfs.git #work.namei seems to survive xfstests.  I think
it cleans the things quite a bit, but it obviously needs more
review and testing.

	Review and testing would be _very_ welcome; it does a lot
of massage, so there had been a plenty of opportunities to fuck up
and fail to spot that.  The same goes for profiling - it doesn't
seem to slow the things down, but that needs to be verified.

	It does include #work.openat2.  Topology: 17 commits, followed
by clean merge with #work.openat2, followed by 9 followups.  The
part is #work.openat2 is as posted by Aleksa; I can repost it, but
I don't see much point.  Description of the rest follows; patches
themselves will be in followups.

part 1: follow_automount() cleanups and fixes.

	Quite a bit of that function had been about working around the
wrong calling conventions of finish_automount().  The problem is that
finish_automount() misuses the primitive intended for mount(2) and
friends, where we want to mount on top of the pile, even if something
has managed to add to that while we'd been trying to lock the namespace.
For automount that's not the right thing to do - there we want to discard
whatever it was going to attach and just cross into what got mounted
there in the meanwhile (most likely - the results of the same automount
triggered by somebody else).  Current mainline kinda-sorta manages to do
that, but it's unreliable and very convoluted.  Much simpler approach
is to stop using lock_mount() in finish_automount() and have it bail
out if something turns out to have been mounted on top where we wanted
to attach.  That allows to get rid of a lot of PITA in the caller.
Another simplification comes from not trying to cross into the results
of automount - simply ride through the next iteration of the loop and
let it move into overmount.

	Another thing in the same series is divorcing follow_automount()
from nameidata; that'll play later when we get to unifying follow_down()
with the guts of follow_managed().

	4 commits, the second one fixes a hard-to-hit race.  The first
is a prereq for it.

1/17	do_add_mount(): lift lock_mount/unlock_mount into callers
2/17	fix automount/automount race properly
3/17	follow_automount(): get rid of dead^Wstillborn code
4/17	follow_automount() doesn't need the entire nameidata

part 2: unifying mount traversals in pathwalk.

	Handling of mount traversal (follow_managed()) is currently called
in a bunch of places.  Each of them is shortly followed by a call of
step_into() or an open-coded equivalent thereof.  However, the locations
of those step_into() calls are far from preceding follow_managed();
moreover, that preceding call might happen on different paths that
converge to given step_into() call.  It's harder to analyse that it should
be (especially when it comes to liveness analysis) and it forces rather
ugly calling conventions on lookup_fast()/atomic_open()/lookup_open().
The series below massages the code to the point when the calls of
follow_managed() (and __follow_mount_rcu()) move into the beginning of
step_into().

5/17	make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
	gets EEXIST handling in do_last() past the step_into() call there.
6/17	handle_mounts(): start building a sane wrapper for follow_managed()
	rather than mangling follow_managed() itself (and creating conflicts
	with openat2 series), add a wrapper that will absorb the required
	interface changes.
7/17	atomic_open(): saner calling conventions (return dentry on success)
	struct path passed to it is pure out parameter; only dentry part
	ever varies, though - mnt is always nd->path.mnt.  Just return
	the dentry on success, and ERR_PTR(-E...) on failure.
8/17	lookup_open(): saner calling conventions (return dentry on success)
	propagate the same change one level up the call chain.
9/17	do_last(): collapse the call of path_to_nameidata()
	struct path filled in lookup_open() call is eventually given to
	handle_mounts(); the only use it has before that is path_to_nameidata()
	call in "->atomic_open() has actually opened it" case, and there
	path_to_nameidata() is an overkill - we are guaranteed to replace
	only nd->path.dentry.  So have the struct path filled only immediately
	prior to handle_mounts().
10/17	handle_mounts(): pass dentry in, turn path into a pure out argument
	now all callers of handle_mount() are directly preceded by filling
	struct path it gets.  path->mnt is nd->path.mnt in all cases, so we can
	pass just the dentry instead and fill path in handle_mount() itself.
	Some boilerplate gone, path is pure out argument of handle_mount()
	now.
11/17	lookup_fast(): consolidate the RCU success case
	massage to gather what will become an RCU case equivalent of
	handle_mounts(); basically, that's what we do if revalidate succeeds
	in RCU case of lookup_fast(), including unlazy and fallback to
	handle_mounts() if __follow_mount_rcu() says "it's too tricky".
12/17	teach handle_mounts() to handle RCU mode
	... and take that into handle_mount() itself.  The other caller of
	__follow_mount_rcu() is fine with the same fallback (it just didn't
	bother since it's in the very beginning of pathwalk), switched to
	handle_mount() as well.
13/17	lookup_fast(): take mount traversal into callers
	Now we are getting somewhere - both RCU and non-RCU success cases of
	lookup_fast() are ended with the same return handle_mounts(...);
	move that to the callers - there it will merge with the identical calls
	that had been on the paths where we had to do slow lookups.
	lookup_fast() returns dentry now.
14/17	new step_into() flag: WALK_NOFOLLOW
	use step_into() instead of open-coding it in handle_lookup_down().
	Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW" for
	that (and eventually, I hope, for .. handling).
	Now *all* calls of handle_mounts() and step_into() are right next to
	each other.
15/17	fold handle_mounts() into step_into()
	... and we can move the call of handle_mounts() into step_into(),
	getting a slightly saner calling conventions out of that.
16/17	LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()
	another payoff from 14/17 - we can teach path_lookupat() to do
	what path_mountpointat() used to.  And kill the latter, along with
	its wrappers.
17/17	expand the only remaining call of path_lookup_conditional()
	minor cleanup - RIP path_lookup_conditional().  Only one caller left.

At that point we run out of things that can be done without textual conflicts
with openat2 series.  Changes so far:
	* mount traversal is taken into step_into().
	* lookup_fast(), atomic_open() and lookup_open() calling conventions
are slightly changed.  All of them return dentry now, instead of returning
an int and filling struct path on success.  For lookup_fast() the old
"0 for cache miss, 1 for cache hit" is replaced with "NULL stands for cache
miss, dentry - for hit".
	* step_into() can be called in RCU mode as well.  Takes nameidata,
WALK_... flags, dentry and, in RCU case, corresponding inode and seq value.
Handles mount traversals, decides whether it's a symlink to be followed.
Error => returns -E...; symlink to follow => returns 1, puts symlink on stack;
non-symlink or symlink not to follow => returns 0, moves nd->path to new location.
	* LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends
became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in flags.

Next comes the merge with Aleksa's openat2 patchset; everything up to that point
had been non-conflicting with it.  That patchset has been posted earlier;
it's in #work.openat2.  The next series comes on top of the merge.

part 3: untangling the symlink handling.

	Right now when we decide to follow a symlink it happens this way:
	* step_into() decides that it has been given a symlink that needs to
be followed.
	* it calls pick_link(), which pushes the symlink on stack and
returns 1 on success / -E... on error.  Symlink's mount/dentry/seq is
stored on stack and the inode is stashed in nd->link_inode.
	* step_into() passes that 1 to its callers, which proceed to pass it
up the call chain for several layers.  In all cases we get to get_link()
call shortly afterwards.
	* get_link() is called, picks the inode stashed in nd->link_inode
by the pick_link(), does some checks, touches the atime, etc.
	* get_link() either picks the link body out of inode or calls
->get_link().  If it's an absolute symlink, we move to the root and return
the relative portion of the body; if it's a relative one - just return the
body.  If it's a procfs-style one, the call of nd_jump_link() has been
made and we'd moved to whatever location is desired.  And return NULL,
same as we do for symlink to "/".
	* the caller proceeds to deal with the string returned to it.

	The sequence is the same in all cases (nested symlink, trailing
symlink on lookup, trailing symlink on open), but its pieces are not close
to each other and the bit between the call of pick_link() and (inevitable)
call of get_link() afterwards is not easy to follow.  Moreover, a bunch
of functions (walk_component/lookup_last/do_last) ends up with the same
conventions for return values as step_into().  And those conventions
(see above) are not pretty - 0/1/-E... is asking for mistakes, especially
when returned 1 is used only to direct control flow on a rather twisted
way to matching get_link() call.  And that path can be seriously twisted.
E.g. when we are trying to open /dev/stdin, we get the following sequence:
	* path_init() has put us into root and returned "/dev/stdin"
	* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
	* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast().  Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
	* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing.  Into the stack it goes and we return 1.
	* do_last() sees 1 and returns it.
	* trailing_symlink() is called (in the top-level loop) and it
calls get_link().  OK, we get "/proc/self/fd/0" for body, move to
root again and return "proc/self/fd/0".
	* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
	* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
	* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now).  It's dropped (from step_into()) and we get
to look at what we'd been given.  A symlink to follow, so on the stack
it goes and we return 1.
	* again, do_last() passes 1 to caller
	* trailing_symlink() is called and calls get_link().
	* this time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin.  And returns NULL.  But the
fun doesn't stop yet.
	* trailing_symlink() returns "" to the caller
	* link_path_walk() is called on that and does nothing
whatsoever.
	* do_last() is called and sees LAST_BIND left by the get_link().
It calls handle_dots()
	* handle_dots() drops the symlink from stack and returns
	* do_last() *FINALLY* proceeds to the point after its call of
step_into() (finish_open:) and gets around to opening the damn thing.

	Making sense of the control flow through all of that is not fun,
to put it mildly; debugging anything in that area can be a massive PITA,
and this example has touched only one of 3 cases.  Arguably, the worst
one, but...  Anyway, it turns out that this code can be massaged to
considerably saner shape - both in terms of control flow and wrt calling
conventions.

1/9	merging pick_link() with get_link(), part 1
	prep work: move the "hardening" crap from trailing_symlink() into
get_link() (conditional on the absense of LOOKUP_PARENT in nd->flags).
We'll be moving the calls of get_link() around quite a bit through that
series, and the next step will be to eliminate trailing_symlink().
2/9	merging pick_link() with get_link(), part 2
	fold trailing_symlink() into lookup_last() and do_last().
Now these are returning strings; it's not the final calling conventions,
but it's almost there.  NULL => old 0, we are done.  ERR_PTR(-E...) =>
old -E..., we'd failed.  string => old 1, and the string is the symlink
body to follow.  Just as for trailing_symlink(), "/" and procfs ones
(where get_link() returns NULL) yield "", so the ugly song and dance
with no-op trip through link_path_walk()/handle_dots() still remains.
3/9	merging pick_link() with get_link(), part 3
	elimination of that round-trip.  In *all* cases having
get_link() return NULL on such symlinks means that we'll proceed to
drop the symlink from stack and get back to the point near that
get_link() call - basically, where we would be if it hadn't been
a symlink at all.  The path by which we are getting there depends
upon the call site; the end result is the same in all cases - such
symlinks (procfs ones and symlink to "/") are fully processed by
the time get_link() returns, so we could as well drop them from the
stack right in get_link().  Makes life simpler in terms of control
flow analysis...
	And now the calling conventions for do_last() and lookup_last()
have reached the final shape - ERR_PTR(-E...) for error, NULL for
"we are done", string for "traverse this".
4/9	merging pick_link() with get_link(), part 4
	now all calls of walk_component() are followed by the same
boilerplate - "if it has returned 1, call get_link() and if that
has returned NULL treat that as if walk_component() has returned 0".
Eliminate by folding that into walk_component() itself.  Now
walk_component() return value conventions have joined those of
do_last()/lookup_last().
5/9	merging pick_link() with get_link(), part 5
	same as for the previous, only this time the boilerplate
migrates one level down, into step_into().  Only one caller of
get_link() left, step_into() has joined the same return value
conventions.
6/9	merging pick_link() with get_link(), part 6
	move that thing into pick_link().  Now all traces of
"return 1 if we are following a symlink" are gone.
7/9	finally fold get_link() into pick_link()
	ta-da - expand get_link() into the only caller.  As a side
benefit, we get rid of stashing the inode in nd->link_inode - it
was done only to carry that piece of information from pick_link()
to eventual get_link().  That's not the main benefit, though - the
control flow became considerably easier to reason about.

For what it's worth, the example above (/dev/stdin) becomes
	* path_init() has put us into root and returned "/dev/stdin"
	* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
	* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast().  Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
	* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing.  On the stack it goes and we get its body.  Which is
"/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
	* do_last() sees non-NULL and returns it - whether it's an error
or a pathname to traverse, we hadn't reached something we'll be opening.
	* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
	* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
	* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now).  It's dropped (from step_into()) and we get
to look at what we'd been given.  A symlink to follow, so on the stack
it goes.   This time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin.  And returns NULL.  So we
drop symlink from stack and return that NULL to caller.
	* that NULL is returned by step_into(), same as if we had just
moved to a non-symlink.
	* do_last() proceeds to open the damn thing.

part 4.  some mount traversal cleanups.

8/9	massage __follow_mount_rcu() a bit
	make it more similar to non-RCU counterpart
9/9	new helper: traverse_mounts()
	the guts of follow_managed() are very similar to
follow_down().  The calling conventions are different (follow_managed()
works with nameidata, follow_down() - with standalone struct path),
but the core loop is pretty much the same in both.  Turned that loop
into a common helper (traverse_mounts()) and since follow_managed()
becomes a very thin wrapper around it, expand follow_managed() at its
only call site (in handle_mounts()),

That's where the series stands right now.  FWIW, at 5.5-rc1 fs/namei.c
had been 4867 lines, at the tip of #work.openat2 - 4998, at the
tip of #work.namei (containing #work.openat2) - 4730...  And IMO
the thing has become considerably easier to follow.

What's more, it might be possible to untangle the control flow in
do_last() now.  Probably a separate series, though - do_last() is
one hell of a tarpit, so I'm not stepping into it for the rest
of this cycle...


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCHSET][CFT] pathwalk cleanups and fixes
  2020-01-19  3:14                               ` [RFC][PATCHSET][CFT] pathwalk cleanups and fixes Al Viro
@ 2020-01-19 14:33                                 ` Ian Kent
  0 siblings, 0 replies; 53+ messages in thread
From: Ian Kent @ 2020-01-19 14:33 UTC (permalink / raw)
  To: Al Viro, Aleksa Sarai
  Cc: Linus Torvalds, David Howells, Eric Biederman, stable,
	Christian Brauner, Serge Hallyn, dev, Linux Containers,
	Linux API, linux-fsdevel, Linux Kernel Mailing List

On Sun, 2020-01-19 at 03:14 +0000, Al Viro wrote:
> 	OK, vfs.git #work.namei seems to survive xfstests.  I think
> it cleans the things quite a bit, but it obviously needs more
> review and testing.
> 
> 	Review and testing would be _very_ welcome; it does a lot
> of massage, so there had been a plenty of opportunities to fuck up
> and fail to spot that.  The same goes for profiling - it doesn't
> seem to slow the things down, but that needs to be verified.

I have run my usual tests (the second run of my submount-test is still
going) and they have run through fine.

I spend what time I can looking through the series tomorrow but will
probably need to complete that when I return from my trip to Albany
(Western Australia) some time on Friday.

> 
> 	It does include #work.openat2.  Topology: 17 commits, followed
> by clean merge with #work.openat2, followed by 9 followups.  The
> part is #work.openat2 is as posted by Aleksa; I can repost it, but
> I don't see much point.  Description of the rest follows; patches
> themselves will be in followups.
> 
> part 1: follow_automount() cleanups and fixes.
> 
> 	Quite a bit of that function had been about working around the
> wrong calling conventions of finish_automount().  The problem is that
> finish_automount() misuses the primitive intended for mount(2) and
> friends, where we want to mount on top of the pile, even if something
> has managed to add to that while we'd been trying to lock the
> namespace.
> For automount that's not the right thing to do - there we want to
> discard
> whatever it was going to attach and just cross into what got mounted
> there in the meanwhile (most likely - the results of the same
> automount
> triggered by somebody else).  Current mainline kinda-sorta manages to
> do
> that, but it's unreliable and very convoluted.  Much simpler approach
> is to stop using lock_mount() in finish_automount() and have it bail
> out if something turns out to have been mounted on top where we
> wanted
> to attach.  That allows to get rid of a lot of PITA in the caller.
> Another simplification comes from not trying to cross into the
> results
> of automount - simply ride through the next iteration of the loop and
> let it move into overmount.
> 
> 	Another thing in the same series is divorcing
> follow_automount()
> from nameidata; that'll play later when we get to unifying
> follow_down()
> with the guts of follow_managed().
> 
> 	4 commits, the second one fixes a hard-to-hit race.  The first
> is a prereq for it.
> 
> 1/17	do_add_mount(): lift lock_mount/unlock_mount into callers
> 2/17	fix automount/automount race properly
> 3/17	follow_automount(): get rid of dead^Wstillborn code
> 4/17	follow_automount() doesn't need the entire nameidata
> 
> part 2: unifying mount traversals in pathwalk.
> 
> 	Handling of mount traversal (follow_managed()) is currently
> called
> in a bunch of places.  Each of them is shortly followed by a call of
> step_into() or an open-coded equivalent thereof.  However, the
> locations
> of those step_into() calls are far from preceding follow_managed();
> moreover, that preceding call might happen on different paths that
> converge to given step_into() call.  It's harder to analyse that it
> should
> be (especially when it comes to liveness analysis) and it forces
> rather
> ugly calling conventions on
> lookup_fast()/atomic_open()/lookup_open().
> The series below massages the code to the point when the calls of
> follow_managed() (and __follow_mount_rcu()) move into the beginning
> of
> step_into().
> 
> 5/17	make build_open_flags() treat O_CREAT | O_EXCL as implying
> O_NOFOLLOW
> 	gets EEXIST handling in do_last() past the step_into() call
> there.
> 6/17	handle_mounts(): start building a sane wrapper for
> follow_managed()
> 	rather than mangling follow_managed() itself (and creating
> conflicts
> 	with openat2 series), add a wrapper that will absorb the
> required
> 	interface changes.
> 7/17	atomic_open(): saner calling conventions (return dentry on
> success)
> 	struct path passed to it is pure out parameter; only dentry
> part
> 	ever varies, though - mnt is always nd->path.mnt.  Just return
> 	the dentry on success, and ERR_PTR(-E...) on failure.
> 8/17	lookup_open(): saner calling conventions (return dentry on
> success)
> 	propagate the same change one level up the call chain.
> 9/17	do_last(): collapse the call of path_to_nameidata()
> 	struct path filled in lookup_open() call is eventually given to
> 	handle_mounts(); the only use it has before that is
> path_to_nameidata()
> 	call in "->atomic_open() has actually opened it" case, and
> there
> 	path_to_nameidata() is an overkill - we are guaranteed to
> replace
> 	only nd->path.dentry.  So have the struct path filled only
> immediately
> 	prior to handle_mounts().
> 10/17	handle_mounts(): pass dentry in, turn path into a pure out
> argument
> 	now all callers of handle_mount() are directly preceded by
> filling
> 	struct path it gets.  path->mnt is nd->path.mnt in all cases,
> so we can
> 	pass just the dentry instead and fill path in handle_mount()
> itself.
> 	Some boilerplate gone, path is pure out argument of
> handle_mount()
> 	now.
> 11/17	lookup_fast(): consolidate the RCU success case
> 	massage to gather what will become an RCU case equivalent of
> 	handle_mounts(); basically, that's what we do if revalidate
> succeeds
> 	in RCU case of lookup_fast(), including unlazy and fallback to
> 	handle_mounts() if __follow_mount_rcu() says "it's too tricky".
> 12/17	teach handle_mounts() to handle RCU mode
> 	... and take that into handle_mount() itself.  The other caller
> of
> 	__follow_mount_rcu() is fine with the same fallback (it just
> didn't
> 	bother since it's in the very beginning of pathwalk), switched
> to
> 	handle_mount() as well.
> 13/17	lookup_fast(): take mount traversal into callers
> 	Now we are getting somewhere - both RCU and non-RCU success
> cases of
> 	lookup_fast() are ended with the same return
> handle_mounts(...);
> 	move that to the callers - there it will merge with the
> identical calls
> 	that had been on the paths where we had to do slow lookups.
> 	lookup_fast() returns dentry now.
> 14/17	new step_into() flag: WALK_NOFOLLOW
> 	use step_into() instead of open-coding it in
> handle_lookup_down().
> 	Add a flag for "don't follow symlinks regardless of
> LOOKUP_FOLLOW" for
> 	that (and eventually, I hope, for .. handling).
> 	Now *all* calls of handle_mounts() and step_into() are right
> next to
> 	each other.
> 15/17	fold handle_mounts() into step_into()
> 	... and we can move the call of handle_mounts() into
> step_into(),
> 	getting a slightly saner calling conventions out of that.
> 16/17	LOOKUP_MOUNTPOINT: fold path_mountpointat() into
> path_lookupat()
> 	another payoff from 14/17 - we can teach path_lookupat() to do
> 	what path_mountpointat() used to.  And kill the latter, along
> with
> 	its wrappers.
> 17/17	expand the only remaining call of path_lookup_conditional()
> 	minor cleanup - RIP path_lookup_conditional().  Only one caller
> left.
> 
> At that point we run out of things that can be done without textual
> conflicts
> with openat2 series.  Changes so far:
> 	* mount traversal is taken into step_into().
> 	* lookup_fast(), atomic_open() and lookup_open() calling
> conventions
> are slightly changed.  All of them return dentry now, instead of
> returning
> an int and filling struct path on success.  For lookup_fast() the old
> "0 for cache miss, 1 for cache hit" is replaced with "NULL stands for
> cache
> miss, dentry - for hit".
> 	* step_into() can be called in RCU mode as well.  Takes
> nameidata,
> WALK_... flags, dentry and, in RCU case, corresponding inode and seq
> value.
> Handles mount traversals, decides whether it's a symlink to be
> followed.
> Error => returns -E...; symlink to follow => returns 1, puts symlink
> on stack;
> non-symlink or symlink not to follow => returns 0, moves nd->path to
> new location.
> 	* LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and
> friends
> became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in
> flags.
> 
> Next comes the merge with Aleksa's openat2 patchset; everything up to
> that point
> had been non-conflicting with it.  That patchset has been posted
> earlier;
> it's in #work.openat2.  The next series comes on top of the merge.
> 
> part 3: untangling the symlink handling.
> 
> 	Right now when we decide to follow a symlink it happens this
> way:
> 	* step_into() decides that it has been given a symlink that
> needs to
> be followed.
> 	* it calls pick_link(), which pushes the symlink on stack and
> returns 1 on success / -E... on error.  Symlink's mount/dentry/seq is
> stored on stack and the inode is stashed in nd->link_inode.
> 	* step_into() passes that 1 to its callers, which proceed to
> pass it
> up the call chain for several layers.  In all cases we get to
> get_link()
> call shortly afterwards.
> 	* get_link() is called, picks the inode stashed in nd-
> >link_inode
> by the pick_link(), does some checks, touches the atime, etc.
> 	* get_link() either picks the link body out of inode or calls
> ->get_link().  If it's an absolute symlink, we move to the root and
> return
> the relative portion of the body; if it's a relative one - just
> return the
> body.  If it's a procfs-style one, the call of nd_jump_link() has
> been
> made and we'd moved to whatever location is desired.  And return
> NULL,
> same as we do for symlink to "/".
> 	* the caller proceeds to deal with the string returned to it.
> 
> 	The sequence is the same in all cases (nested symlink, trailing
> symlink on lookup, trailing symlink on open), but its pieces are not
> close
> to each other and the bit between the call of pick_link() and
> (inevitable)
> call of get_link() afterwards is not easy to follow.  Moreover, a
> bunch
> of functions (walk_component/lookup_last/do_last) ends up with the
> same
> conventions for return values as step_into().  And those conventions
> (see above) are not pretty - 0/1/-E... is asking for mistakes,
> especially
> when returned 1 is used only to direct control flow on a rather
> twisted
> way to matching get_link() call.  And that path can be seriously
> twisted.
> E.g. when we are trying to open /dev/stdin, we get the following
> sequence:
> 	* path_init() has put us into root and returned "/dev/stdin"
> 	* link_path_walk() has eventually reached /dev and left
> <LAST_NORM, "stdin"> in nd->last_type/nd->last
> 	* we call do_last(), which sees that we have LAST_NORM and
> calls
> lookup_fast().  Let's assume that everything is in dcache; we get the
> dentry of /dev/stdin and proceed to finish_lookup:, where we call
> step_into()
> 	* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick
> the
> damn thing.  Into the stack it goes and we return 1.
> 	* do_last() sees 1 and returns it.
> 	* trailing_symlink() is called (in the top-level loop) and it
> calls get_link().  OK, we get "/proc/self/fd/0" for body, move to
> root again and return "proc/self/fd/0".
> 	* link_path_walk() is given that string, eventually leading us
> into
> /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
> 	* do_last() is called, and similar to the previous case we
> eventually reach the call of step_into() with dentry of
> /proc/self/fd/0.
> 	* _now_ we can discard /dev/stdin from the stack (we'd been
> using its body until now).  It's dropped (from step_into()) and we
> get
> to look at what we'd been given.  A symlink to follow, so on the
> stack
> it goes and we return 1.
> 	* again, do_last() passes 1 to caller
> 	* trailing_symlink() is called and calls get_link().
> 	* this time it's a procfs symlink and its ->get_link() method
> moves us to the mount/dentry of our stdin.  And returns NULL.  But
> the
> fun doesn't stop yet.
> 	* trailing_symlink() returns "" to the caller
> 	* link_path_walk() is called on that and does nothing
> whatsoever.
> 	* do_last() is called and sees LAST_BIND left by the
> get_link().
> It calls handle_dots()
> 	* handle_dots() drops the symlink from stack and returns
> 	* do_last() *FINALLY* proceeds to the point after its call of
> step_into() (finish_open:) and gets around to opening the damn thing.
> 
> 	Making sense of the control flow through all of that is not
> fun,
> to put it mildly; debugging anything in that area can be a massive
> PITA,
> and this example has touched only one of 3 cases.  Arguably, the
> worst
> one, but...  Anyway, it turns out that this code can be massaged to
> considerably saner shape - both in terms of control flow and wrt
> calling
> conventions.
> 
> 1/9	merging pick_link() with get_link(), part 1
> 	prep work: move the "hardening" crap from trailing_symlink()
> into
> get_link() (conditional on the absense of LOOKUP_PARENT in nd-
> >flags).
> We'll be moving the calls of get_link() around quite a bit through
> that
> series, and the next step will be to eliminate trailing_symlink().
> 2/9	merging pick_link() with get_link(), part 2
> 	fold trailing_symlink() into lookup_last() and do_last().
> Now these are returning strings; it's not the final calling
> conventions,
> but it's almost there.  NULL => old 0, we are done.  ERR_PTR(-E...)
> =>
> old -E..., we'd failed.  string => old 1, and the string is the
> symlink
> body to follow.  Just as for trailing_symlink(), "/" and procfs ones
> (where get_link() returns NULL) yield "", so the ugly song and dance
> with no-op trip through link_path_walk()/handle_dots() still remains.
> 3/9	merging pick_link() with get_link(), part 3
> 	elimination of that round-trip.  In *all* cases having
> get_link() return NULL on such symlinks means that we'll proceed to
> drop the symlink from stack and get back to the point near that
> get_link() call - basically, where we would be if it hadn't been
> a symlink at all.  The path by which we are getting there depends
> upon the call site; the end result is the same in all cases - such
> symlinks (procfs ones and symlink to "/") are fully processed by
> the time get_link() returns, so we could as well drop them from the
> stack right in get_link().  Makes life simpler in terms of control
> flow analysis...
> 	And now the calling conventions for do_last() and lookup_last()
> have reached the final shape - ERR_PTR(-E...) for error, NULL for
> "we are done", string for "traverse this".
> 4/9	merging pick_link() with get_link(), part 4
> 	now all calls of walk_component() are followed by the same
> boilerplate - "if it has returned 1, call get_link() and if that
> has returned NULL treat that as if walk_component() has returned 0".
> Eliminate by folding that into walk_component() itself.  Now
> walk_component() return value conventions have joined those of
> do_last()/lookup_last().
> 5/9	merging pick_link() with get_link(), part 5
> 	same as for the previous, only this time the boilerplate
> migrates one level down, into step_into().  Only one caller of
> get_link() left, step_into() has joined the same return value
> conventions.
> 6/9	merging pick_link() with get_link(), part 6
> 	move that thing into pick_link().  Now all traces of
> "return 1 if we are following a symlink" are gone.
> 7/9	finally fold get_link() into pick_link()
> 	ta-da - expand get_link() into the only caller.  As a side
> benefit, we get rid of stashing the inode in nd->link_inode - it
> was done only to carry that piece of information from pick_link()
> to eventual get_link().  That's not the main benefit, though - the
> control flow became considerably easier to reason about.
> 
> For what it's worth, the example above (/dev/stdin) becomes
> 	* path_init() has put us into root and returned "/dev/stdin"
> 	* link_path_walk() has eventually reached /dev and left
> <LAST_NORM, "stdin"> in nd->last_type/nd->last
> 	* we call do_last(), which sees that we have LAST_NORM and
> calls
> lookup_fast().  Let's assume that everything is in dcache; we get the
> dentry of /dev/stdin and proceed to finish_lookup:, where we call
> step_into()
> 	* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick
> the
> damn thing.  On the stack it goes and we get its body.  Which is
> "/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
> 	* do_last() sees non-NULL and returns it - whether it's an
> error
> or a pathname to traverse, we hadn't reached something we'll be
> opening.
> 	* link_path_walk() is given that string, eventually leading us
> into
> /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
> 	* do_last() is called, and similar to the previous case we
> eventually reach the call of step_into() with dentry of
> /proc/self/fd/0.
> 	* _now_ we can discard /dev/stdin from the stack (we'd been
> using its body until now).  It's dropped (from step_into()) and we
> get
> to look at what we'd been given.  A symlink to follow, so on the
> stack
> it goes.   This time it's a procfs symlink and its ->get_link()
> method
> moves us to the mount/dentry of our stdin.  And returns NULL.  So we
> drop symlink from stack and return that NULL to caller.
> 	* that NULL is returned by step_into(), same as if we had just
> moved to a non-symlink.
> 	* do_last() proceeds to open the damn thing.
> 
> part 4.  some mount traversal cleanups.
> 
> 8/9	massage __follow_mount_rcu() a bit
> 	make it more similar to non-RCU counterpart
> 9/9	new helper: traverse_mounts()
> 	the guts of follow_managed() are very similar to
> follow_down().  The calling conventions are different
> (follow_managed()
> works with nameidata, follow_down() - with standalone struct path),
> but the core loop is pretty much the same in both.  Turned that loop
> into a common helper (traverse_mounts()) and since follow_managed()
> becomes a very thin wrapper around it, expand follow_managed() at its
> only call site (in handle_mounts()),
> 
> That's where the series stands right now.  FWIW, at 5.5-rc1
> fs/namei.c
> had been 4867 lines, at the tip of #work.openat2 - 4998, at the
> tip of #work.namei (containing #work.openat2) - 4730...  And IMO
> the thing has become considerably easier to follow.
> 
> What's more, it might be possible to untangle the control flow in
> do_last() now.  Probably a separate series, though - do_last() is
> one hell of a tarpit, so I'm not stepping into it for the rest
> of this cycle...
> 


^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, back to index

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-30  5:20 [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Aleksa Sarai
2019-12-30  5:20 ` [PATCH RFC 1/1] " Aleksa Sarai
2019-12-30  7:34   ` Linus Torvalds
2019-12-30  8:28     ` Aleksa Sarai
2020-01-08  4:39       ` Andy Lutomirski
2019-12-30  5:44 ` [PATCH RFC 0/1] " Al Viro
2019-12-30  5:49   ` Aleksa Sarai
2019-12-30  7:29     ` Aleksa Sarai
2019-12-30  7:53       ` Linus Torvalds
2019-12-30  8:32         ` Aleksa Sarai
2020-01-02  8:58           ` David Laight
2020-01-02  9:09             ` Aleksa Sarai
2020-01-01  0:43       ` Al Viro
2020-01-01  0:54         ` Al Viro
2020-01-01  3:08           ` Al Viro
2020-01-01 14:44             ` Aleksa Sarai
2020-01-01 23:40               ` Al Viro
2020-01-02  3:59                 ` Aleksa Sarai
2020-01-03  1:49                   ` Al Viro
2020-01-04  4:46                     ` Ian Kent
2020-01-08  3:13                     ` Al Viro
2020-01-08  3:54                       ` Linus Torvalds
2020-01-08 21:34                         ` Al Viro
2020-01-10  0:08                           ` Linus Torvalds
2020-01-10  4:15                             ` Al Viro
2020-01-10  5:03                               ` Linus Torvalds
2020-01-10  6:20                               ` Ian Kent
2020-01-12 21:33                                 ` Al Viro
2020-01-13  2:59                                   ` Ian Kent
2020-01-14  0:25                                     ` Ian Kent
2020-01-14  4:39                                       ` Al Viro
2020-01-14  5:01                                         ` Ian Kent
2020-01-14  5:59                                           ` Ian Kent
2020-01-10 21:07                         ` Aleksa Sarai
2020-01-14  4:57                           ` Al Viro
2020-01-14  5:12                             ` Al Viro
2020-01-14 20:01                             ` Aleksa Sarai
2020-01-15 14:25                               ` Al Viro
2020-01-15 14:29                                 ` Aleksa Sarai
2020-01-15 14:34                                   ` Aleksa Sarai
2020-01-15 14:48                                     ` Al Viro
2020-01-15 13:57                             ` Aleksa Sarai
2020-01-19  3:14                               ` [RFC][PATCHSET][CFT] pathwalk cleanups and fixes Al Viro
2020-01-19 14:33                                 ` Ian Kent
2020-01-10 23:19                     ` [PATCH RFC 0/1] mount: universally disallow mounting over symlinks Al Viro
2020-01-13  1:48                       ` Ian Kent
2020-01-13  3:54                         ` Al Viro
2020-01-13  6:00                           ` Ian Kent
2020-01-13  6:03                             ` Ian Kent
2020-01-13 13:30                               ` Al Viro
2020-01-14  7:25                                 ` Ian Kent
2020-01-14 12:17                                   ` Ian Kent
2020-01-04  5:52               ` Andy Lutomirski

Stable Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/stable/0 stable/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 stable stable/ https://lore.kernel.org/stable \
		stable@vger.kernel.org
	public-inbox-index stable

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.stable


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git