From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wr1-f68.google.com ([209.85.221.68]:42950 "EHLO mail-wr1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726264AbeJGRzd (ORCPT ); Sun, 7 Oct 2018 13:55:33 -0400
Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]
From: Alan Jenkins
To: David Howells, viro@zeniv.linux.org.uk
Cc: torvalds@linux-foundation.org, ebiederm@xmission.com, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, mszeredi@redhat.com
References: <153754740781.17872.7869536526927736855.stgit@warthog.procyon.org.uk> <153754743491.17872.12115848333103740766.stgit@warthog.procyon.org.uk>
Message-ID: <862e36a2-2a6f-4e26-3228-8cab4b4cf230@gmail.com>
Date: Sun, 7 Oct 2018 11:48:37 +0100
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Content-Language: en-GB
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On 05/10/2018 19:24, Alan Jenkins wrote:
> On 21/09/2018 17:30, David Howells wrote:
>> From: Al Viro
>>
>> Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
>> attached by move_mount(2).
>>
>> If by the time of final fput() of OPEN_TREE_CLONE-opened file its
>> tree is
>> not detached anymore, it won't be dissolved.  move_mount(2) is adjusted
>> to handle detached source.
>>
>> That gives us equivalents of mount --bind and mount --rbind.
>>
>> Signed-off-by: Al Viro
>> Signed-off-by: David Howells
>> ---
>>
>>   fs/namespace.c |   26 ++++++++++++++++++++------
>>   1 file changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index dd38141b1723..caf5c55ef555 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -1785,8 +1785,10 @@ void dissolve_on_fput(struct vfsmount *mnt)
>>   {
>>       namespace_lock();
>>       lock_mount_hash();
>> -    mntget(mnt);
>> -    umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
>> +    if (!real_mount(mnt)->mnt_ns) {
>> +        mntget(mnt);
>> +        umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
>> +    }
>>       unlock_mount_hash();
>>       namespace_unlock();
>>   }
>> @@ -2393,6 +2395,7 @@ static int do_move_mount(struct path *old_path,
>> struct path *new_path)
>>       struct mount *old;
>>       struct mountpoint *mp;
>>       int err;
>> +    bool attached;
>>
>>       mp = lock_mount(new_path);
>>       err = PTR_ERR(mp);
>> @@ -2403,10 +2406,19 @@ static int do_move_mount(struct path
>> *old_path, struct path *new_path)
>>       p = real_mount(new_path->mnt);
>>
>>       err = -EINVAL;
>> -    if (!check_mnt(p) || !check_mnt(old))
>> +    /* The mountpoint must be in our namespace. */
>> +    if (!check_mnt(p))
>> +        goto out1;
>> +    /* The thing moved should be either ours or completely
>> unattached. */
>> +    if (old->mnt_ns && !check_mnt(old))
>>           goto out1;
>>
>> -    if (!mnt_has_parent(old))
>> +    attached = mnt_has_parent(old);
>> +    /*
>> +     * We need to allow open_tree(OPEN_TREE_CLONE) followed by
>> +     * move_mount(), but mustn't allow "/" to be moved.
>> +     */
>> +    if (old->mnt_ns && !attached)
>>           goto out1;
>>
>>       if (old->mnt.mnt_flags & MNT_LOCKED)
>
> Hi
>
> I replied last time to wonder about the MNT_UMOUNT mnt_flag. So I've
> tested it now :-), on David's current tree (commit 5581f4935add).
>
> The modified do_move_mount() allows re-attaching something that was
> lazy-unmounted. But the lazy unmount sets MNT_UMOUNT. And this flag is
> not cleared when the mount is re-attached.
>
> I wasn't sure what effect this would have. Luckily it showed up
> straight away, when I tried to unmount again. It causes a soft lockup.
>
> Debug printk:
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4dfe7e23b7ee..ac8de9191cfe 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2472,6 +2472,10 @@ static int do_move_mount(struct path *old_path,
> struct path *new_path)
>      if (old->mnt.mnt_flags & MNT_LOCKED)
>          goto out1;
>
> +    pr_info("mnt_flags=%x umount=%x\n",
> +            (unsigned) old->mnt.mnt_flags,
> +            (unsigned) !!(old->mnt.mnt_flags & MNT_UMOUNT));
> +
>      if (old_path->dentry != old_path->mnt->mnt_root)
>          goto out1;

The lockup seems to be a general problem with the cleanup code, even if I
use this as advertised, i.e. for a simple bind mount.

(I was suspicious that being able to pass around detached trees as an FD,
and re-attach them in any namespace, allows leaking memory by creating a
namespace loop.  I.e. maybe it gives you enough rope to skip the test in
mnt_ns_loop().  But I didn't get that far).
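
To make the MNT_UMOUNT observation concrete, here is a rough userspace
model of the source-side checks from the quoted patch. It is only a
sketch: struct model_mount, model_source_ok() and model_move() are
hypothetical names, the flag values are the ones I recall from
include/linux/mount.h, and the way the namespace pointer is updated on
attach is an assumption, since the quoted hunks do not show it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as recalled from include/linux/mount.h (assumption). */
#define MODEL_MNT_LOCKED 0x800000
#define MODEL_MNT_UMOUNT 0x8000000

/* Hypothetical stand-in for the few struct mount fields the checks read. */
struct model_mount {
    const void *mnt_ns;  /* NULL once detached (cloned or lazily unmounted) */
    bool has_parent;     /* models mnt_has_parent() */
    unsigned int flags;  /* models mnt.mnt_flags */
};

/* Mirrors the patched tests on "old"; caller_ns models check_mnt(). */
static bool model_source_ok(const struct model_mount *old, const void *caller_ns)
{
    if (old->mnt_ns && old->mnt_ns != caller_ns)
        return false;    /* attached, but in somebody else's namespace */
    if (old->mnt_ns && !old->has_parent)
        return false;    /* "/" of our namespace must not be moved */
    if (old->flags & MODEL_MNT_LOCKED)
        return false;
    return true;         /* MNT_UMOUNT is never looked at */
}

/* Models re-attachment (assumed): namespace is set, flags are untouched. */
static bool model_move(struct model_mount *old, const void *caller_ns)
{
    assert(old != NULL);
    if (!model_source_ok(old, caller_ns))
        return false;
    old->mnt_ns = caller_ns;
    old->has_parent = true;
    return true;
}
```

In this model a lazily unmounted mount (mnt_ns == NULL, MNT_UMOUNT set)
passes the checks and comes out attached with MNT_UMOUNT still set, which
is the state the debug printk is meant to catch.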

I converted test-fsmount.c for my own purposes:

diff --git a/samples/vfs/test-fsmount.c b/samples/vfs/test-fsmount.c
index 74124025ade0..da6e3fbf0513 100644
--- a/samples/vfs/test-fsmount.c
+++ b/samples/vfs/test-fsmount.c
@@ -83,6 +83,11 @@ static inline int move_mount(int from_dfd, const char *from_pathname,
                to_dfd, to_pathname, flags);
 }

+static inline int open_tree(int dfd, const char *pathname, unsigned flags)
+{
+    return syscall(__NR_open_tree, dfd, pathname, flags);
+}
+
 #define E_fsconfig(fd, cmd, key, val, aux) \
     do { \
         if (fsconfig(fd, cmd, key, val, aux) == -1) \
@@ -93,6 +98,7 @@ int main(int argc, char *argv[])
 {
     int fsfd, mfd;

+#if 0
     /* Mount a publically available AFS filesystem */
     fsfd = fsopen("afs", 0);
     if (fsfd == -1) {
@@ -115,4 +121,9 @@ int main(int argc, char *argv[])

     E(close(mfd));
     exit(0);
+#endif
+
+    E( mfd = open_tree(-1, "/mnt", OPEN_TREE_CLONE) );
+    E( fchdir(mfd) );
+    E( execl("/bin/bash", "/bin/bash", NULL) );
 }

If I close() the mount FD "mfd", and then do "mount --move . /mnt", my
printk() shows MNT_UMOUNT has been set. (I guess fchdir() works more like
openat(... , O_PATH) than dup().) Then unmounting /mnt hangs, as I would
expect from my previous test.

If I instead do the mount+unmount first, and close the FD as a second
step, I think there's a lockup in the close().  The lockup happens in the
same place as the unmount lockup from before. (Except there's a line
"Code: Bad RIP value", I don't know why that happens).

# unshare --mount
# test-fsmount
# mount --move . /mnt
[ 270.859542] umount=0 mnt_flags=20

Check the flags are still the same:

# mount --move /mnt /mnt
[ 305./mnt: mount(2) system call failed: Too many levels of symbolic links.
[ 313.737030] umount=0 mnt_flags=20

Clean up the bind mount, and then the inherited mount FD.

# cd
# umount /mnt
# exit
[ 351.898629] watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
[bash:1483]
[ 351.899841] Modules linked in: xt_CHECKSUM(E) ipt_MASQUERADE(E) tun(E) bridge(E) stp(E) llc(E) ip6t_rpfilter(E) ip6t_REJECT(E) nf_reject_ipv6(E) xt_conntrack(E) ip6table_nat(E) nf_nat_ipv6(E) devlink(E) ip6table_mangle(E) ip6table_raw(E) ip6table_security(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) libcrc32c(E) nf_defrag_ipv4(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ip6table_filter(E) ip6_tables(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hwdep(E) snd_hda_core(E) snd_seq(E) snd_seq_device(E) snd_pcm(E) joydev(E) crc32_pclmul(E) snd_timer(E) ghash_clmulni_intel(E) snd(E) crct10dif_pclmul(E) virtio_balloon(E) serio_raw(E) soundcore(E) crc32c_intel(E) qxl(E) drm_kms_helper(E) virtio_console(E) ttm(E) virtio_net(E) net_failover(E)
[ 351.912077] failover(E) drm(E) qemu_fw_cfg(E) pata_acpi(E) ata_generic(E)
[ 351.912888] CPU: 0 PID: 1483 Comm: bash Tainted: G E 4.19.0-rc3+ #7
[ 351.914221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[ 351.916582] RIP: 0010:pin_kill+0x128/0x140
[ 351.917369] Code: f2 5a 00 48 8b 44 24 20 48 39 c5 0f 84 6f ff ff ff 48 89 df e8 e9 4a 5b 00 8b 43 18 85 c0 7e b3 c6 03 00 fb 66 0f 1f 44 00 00 51 ff ff ff e8 be 11 dd ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00
[ 351.920729] RSP: 0018:ffffa1b381be3d88 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 351.921801] RAX: 0000000000000000 RBX: ffff909cf2ea68b0 RCX: dead000000000200
[ 351.922807] RDX: 0000000000000001 RSI: ffffa1b381be3d28 RDI: ffff909cf2ea68b0
[ 351.923811] RBP: ffffa1b381be3da8 R08: ffff909d59621760 R09: 0000000000000000
[ 351.924813] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010000000
[ 351.925818] R13: ffff909cf5db9a38 R14: ffff909cf2ea67a0 R15: ffff909cedc07300
[ 351.926824] FS: 00007f1eb90ac740(0000) GS:ffff909d59600000(0000) knlGS:0000000000000000
[ 351.927957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 351.928772] CR2: 00007f1eabedb180 CR3: 000000000f20a003 CR4: 00000000003606f0
[ 351.929779] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 351.930785] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 351.931791] Call Trace:
[ 351.932160] ? finish_wait+0x80/0x80
[ 351.932684] group_pin_kill+0x1a/0x30
[ 351.933207] namespace_unlock+0x6f/0x80
[ 351.933766] __fput+0x239/0x240
[ 351.934217] task_work_run+0x84/0xa0
[ 351.934743] do_exit+0x2d3/0xae0
[ 351.935206] ? __do_page_fault+0x263/0x4e0
[ 351.935799] do_group_exit+0x3a/0xa0
[ 351.936307] __x64_sys_exit_group+0x14/0x20
[ 351.936911] do_syscall_64+0x5b/0x160
[ 351.937436] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 351.938164] RIP: 0033:0x7f1eb877adb6
[ 351.938688] Code: Bad RIP value.
[ 351.939149] RSP: 002b:00007ffd56e019d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 351.940216] RAX: ffffffffffffffda RBX: 00007f1eb8a69740 RCX: 00007f1eb877adb6
[ 351.941222] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[ 351.942229] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff80
[ 351.943236] R10: 00007ffd56e0188a R11: 0000000000000246 R12: 00007f1eb8a69740
[ 351.944242] R13: 0000000000000001 R14: 00007f1eb8a72708 R15: 0000000000000000
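
For reading the "mnt_flags=" values printed by the debug printk in the
session above, a small userspace decoder can help. This is only a sketch:
decode_mnt_flags() and the table are hypothetical helpers, and the bit
values are the ones I recall from include/linux/mount.h, so please check
them against the tree being tested.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bit values as recalled from include/linux/mount.h (assumption). */
struct model_flag_name {
    unsigned int bit;
    const char *name;
};

static const struct model_flag_name model_flag_names[] = {
    { 0x20,      "MNT_RELATIME" },
    { 0x800000,  "MNT_LOCKED"   },
    { 0x8000000, "MNT_UMOUNT"   },
};

/* Writes the known flag names set in "flags" into buf, joined by '|'. */
static void decode_mnt_flags(unsigned int flags, char *buf, size_t len)
{
    size_t n = sizeof(model_flag_names) / sizeof(model_flag_names[0]);

    if (len == 0)
        return;
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (!(flags & model_flag_names[i].bit))
            continue;
        /* strncat() appends at most the remaining space and terminates. */
        if (buf[0] != '\0')
            strncat(buf, "|", len - strlen(buf) - 1);
        strncat(buf, model_flag_names[i].name, len - strlen(buf) - 1);
    }
}
```

With these values, 0x20 decodes to MNT_RELATIME only, which matches the
"umount=0" printed next to it; a value with bit 0x8000000 set would be a
mount that still carries MNT_UMOUNT.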