linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/34] VFS: Introduce filesystem context [ver #12]
@ 2018-09-21 16:30 David Howells
  2018-09-21 16:30 ` [PATCH 01/34] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: David Howells @ 2018-09-21 16:30 UTC (permalink / raw)
  To: viro
  Cc: John Johansen, Tejun Heo, Eric W. Biederman, selinux, Paul Moore,
	Li Zefan, linux-api, apparmor, Casey Schaufler, fenghua.yu,
	Greg Kroah-Hartman, Eric Biggers, linux-security-module,
	Tetsuo Handa, Johannes Weiner, Stephen Smalley, tomoyo-dev-en,
	cgroups, torvalds, dhowells, ebiederm, linux-fsdevel,
	linux-kernel


Hi Al,

Here are a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount.  This is also used for remount
since much of the parsing stuff is common in many filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.

This also allows Miklós Szeredi's idea of doing:

	fd = fsopen("nfs");
	fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0);
	fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	mfd = fsmount(fd, MS_NODEV);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).

I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.

I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.

Unconverted filesystems are handled by a legacy filesystem wrapper.


====================
WHY DO WE WANT THIS?
====================

Firstly, there's a bunch of problems with the mount(2) syscall:

 (1) It's actually six or seven different interfaces rolled into one and
     weird combinations of flags make it do different things beyond the
     original specification of the syscall.

 (2) It produces a particularly large and diverse set of errors, which have
     to be mapped back to a small error code.  Yes, there's dmesg - if you
     have it configured - but you can't necessarily see that if you're
     doing a mount inside of a container.

 (3) It copies a PAGE_SIZE block of data for each of the type, device name
     and options.

 (4) The size of the buffers is PAGE_SIZE - and this is arch dependent.

 (5) You can't mount into another mount namespace.  I could, for example,
     build a container without having to be in that container's namespace
     if I can do it from outside.

 (6) It's not really geared for the specification of multiple sources, but
     some filesystems really want that - overlayfs, for example.

and some problems in the internal kernel api:

 (1) There's no defined way to supply namespace configuration for the
     superblock - so, for instance, I can't say that I want to create a
     superblock in a particular network namespace (on automount, say).

     NFS hacks around this by creating multiple shadow file_system_types
     with different ->mount() ops.

 (2) When calling mount internally, unless you have NFS-like hacks, you
     have to generate or otherwise provide text config data which then gets
     parsed, when some of the time you could bypass the parsing stage
     entirely.

 (3) The amount of data in the data buffer is not known, but the data
     buffer might be on a kernel stack somewhere, leading to the
     possibility of tripping the stack underrun guard.

and other issues too:

 (1) Superblock remount in some filesystems applies options on an as-parsed
     basis, so if there's a parse failure, a partial alteration with no
     rollback is effected.

 (2) Under some circumstances, the mount data may get copied multiple times
     so that it can have multiple parsers applied to it or because it has
     to be parsed multiple times - for instance, once to get the
     preliminary info required to access the on-disk superblock and then
     again to update the superblock record in the kernel.

I want to be able to add support for a bunch of things:

 (1) UID, GID and Project ID mapping/translation.  I want to be able to
     install a translation table of some sort on the superblock to
     translate source identifiers (which may be foreign numeric UIDs/GIDs,
     text names, GUIDs) into system identifiers.  This needs to be done
     before the superblock is published[*].

     Note that this may, for example, involve using the context and the
     superblock held therein to issue an RPC to a server to look up
     translations.

     [*] By "published" I mean made available through mount so that other
     	 userspace processes can access it by path.

     Maybe specifying a translation range element with something like:

	fsconfig(fd, fsconfig_translate_uid, "<srcuid> <nsuid> <count>", 0, 0);

     The translation information also needs to propagate over an automount
     in some circumstances.

 (2) Namespace configuration.  I want to be able to tell the superblock
     creation process what namespaces should be applied when it created (in
     particular the userns and netns) for containerisation purposes, e.g.:

	fsconfig(fd, FSCONFIG_SET_NAMESPACE, "user", 0, userns_fd);
	fsconfig(fd, FSCONFIG_SET_NAMESPACE, "net", 0, netns_fd);

 (3) Namespace propagation.  I want to have a properly defined mechanism
     for propagating namespace configuration over automounts within the
     kernel.  This will be particularly useful for network filesystems.

 (4) Pre-mount attribute query.  A chunk of the changes is actually the
     fsinfo() syscall to query attributes of the filesystem beyond what's
     available in statx() and statfs().  This will allow a created
     superblock to be queried before it is published.

 (5) Upcall for configuration.  I would like to be able to query
     configuration that's stored in userspace when an automount is made.
     For instance, to look up network parameters for NFS or to find a cache
     selector for fscache.

     The internal fs_context could be passed to the upcall process or the
     kernel could read a config file directly if named appropriately for the
     superblock, perhaps:

	[/etc/fscontext.d/afs/example.com/cell.cfg]
	realm = EXAMPLE.COM
	translation = uid,3000,4000,100
	fscache = tag=fred

 (6) Event notifications.  I want to be able to install a watch on a
     superblock before it is published to catch things like quota events
     and EIO.

 (7) Large and binary parameters.  There might be at some point a need to
     pass large/binary objects like Microsoft PACs around.  If I understand
     PACs correctly, you can obtain these from the Kerberos server and then
     pass them to the file server when you connect.

     Having it possible to pass large or binary objects as individual
     fsconfig calls make parsing these trivial.  OTOH, some or all of this
     can potentially be handled with the use of the keyrings interface - as
     the afs filesystem does for passing kerberos tokens around; it's just
     that that seems overkill for a parameter you may only need once.


===================
SIGNIFICANT CHANGES
===================

 ver #12:

 (*) Rebased on v4.19-rc3.

 (*) Added three new context purposes: mount for hidden root, reconfigure
     for unmount, reconfigure for emergency remount.

 (*) Added a parameter for the new purpose into vfs_dup_fs_context().

 (*) Moved the reconfiguration hook from struct super_operations to struct
     fs_context_operations so they can be handled through the legacy
     wrapper.  mount -o remount now goes through that.

 (*) Changed the parameter description in the following ways:

     - Nominated one master name for each parameter, held in a simple
       string pointer array.  This makes it easy to simply look up a name
       for that parameter for logging.

     - Added a table of additional names for parameters.  The name chosen
       can be used to influence the action of the parameter.

     - Noted which parameter is the source specifier, if there is one.

 (*) Use correct user_ns for a new pidns superblock.

 (*) Fix mqueue to not crash on mounting.

 (*) Make VFS sample programs dependent on X86 to avoid errors in
     autobuilders due to unset syscall IDs in other arches.

 (*) [Miklós] Fixed subtype handling.

 ver #11:

 (*) Fixed AppArmor.

 (*) Capitalised all the UAPI constants.

 (*) Explicitly numbered the FSCONFIG_* UAPI constants.

 (*) Removed all the places ANON_INODES is selected.

 (*) Fixed a bug whereby the context gets freed twice (which broke mounts of
     procfs).

 (*) Split fsinfo() off into its own patch series.

 ver #10:

 (*) Renamed "option" to "parameter" in a number of places.

 (*) Replaced the use of write() to drive the configuration with an fsconfig()
     syscall.  This also allows at-style paths and fds to be presented as typed
     object.

 (*) Routed the key=value parameter concept all the way through from the
     fsconfig() system call to the LSM and filesystem.

 (*) Added a parameter-description concept and helper functions to help
     interpret a parameter and possibly convert the value.

 (*) Made it possible to query the parameter description using the fsinfo()
     syscall.  Added a test-fs-query sample to dump the parameters used by a
     filesystem.

 ver #9:

 (*) Dropped the fd cookie stuff and the FMODE_*/O_* split stuff.

 (*) Al added an open_tree() system call to allow a mount tree to be picked
     referenced or cloned into an O_PATH-style fd.  This can then be used
     with sys_move_mount().  Dropped the O_CLONE_MOUNT and O_NON_RECURSIVE
     open() flags.

 (*) Brought error logging back in, though only in the fs_context and not
     in the task_struct.

 (*) Separated MS_REMOUNT|MS_BIND handling from MS_REMOUNT handling.

 (*) Used anon_inodes for the fd returned by fsopen() and fspick().  This
     requires making it unconditional.

 (*) Fixed lots of bugs.  Especial thanks to Al and Eric Biggers for
     finding them and providing patches.

 (*) Wrote manual pages, which I'll post separately.

 ver #8:

 (*) Changed the way fsmount() mounts into the namespace according to some
     of Al's ideas.

 (*) Put better typing on the fd cookie obtained from __fdget() & co..

 (*) Stored the fd cookie in struct nameidata rather than the dfd number.

 (*) Changed sys_fsmount() to return an O_PATH-style fd rather than
     actually mounting into the mount namespace.

 (*) Separated internal FMODE_* handling from O_* handling to free up
     certain O_* flag numbers.

 (*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
     with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.

 (*) Added a new syscall, sys_move_mount(), to move a mount from an
     dfd+path source to a dfd+path destination.

 (*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
     vfsmount attached to file->f_path needs 'unmounting' if set.

 (*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.

	[!] This doesn't work quite right.

 (*) Added a new syscall, fsinfo(), to query information about a
     filesystem.  The idea being that this will, in future, work with the
     fd from fsopen() too and permit querying of the parameters and
     metadata before fsmount() is called.

 ver #7:

 (*) Undo an incorrect MS_* -> SB_* conversion.

 (*) Pass the mount data buffer size to all the mount-related functions that
     take the data pointer.  This fixes a problem where someone (say SELinux)
     tries to copy the mount data, assuming it to be a page in size, and
     overruns the buffer - thereby incurring an oops by hitting a guard page.

 (*) Made the AFS filesystem use them as an example.  This is a much easier to
     deal with than with NFS or Ext4 as there are very few mount options.

 ver #6:

 (*) Dropped the supplementary error string facility for the moment.

 (*) Dropped the NFS patches for the moment.

 (*) Dropped the reserved file descriptor argument from fsopen() and
     replaced it with three reserved pointers that must be NULL.

 ver #5:

 (*) Renamed sb_config -> fs_context and adjusted variable names.

 (*) Differentiated the flags in sb->s_flags (now named SB_*) from those
     passed to mount(2) (named MS_*).

 (*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
     caller always provide a struct file_system_type pointer and the
     parameters required.

 (*) Got rid of vfs_submount_fc() in favour of passing
     FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context().  The purpose is now
     used more.

 (*) Call ->validate() on the remount path.

 (*) Got rid of the inode locking in sys_fsmount().

 (*) Call security_sb_mountpoint() in the mount(2) path.

 ver #4:

 (*) Split the sb_config patch up somewhat.

 (*) Made the supplementary error string facility something attached to the
     task_struct rather than the sb_config so that error messages can be
     obtained from NFS doing a mount-root-and-pathwalk inside the
     nfs_get_tree() operation.

     Further, made this managed and read by prctl rather than through the
     mount fd so that it's more generally available.

 ver #3:

 (*) Rebased on 4.12-rc1.

 (*) Split the NFS patch up somewhat.

 ver #2:

 (*) Removed the ->fill_super() from sb_config_operations and passed it in
     directly to functions that want to call it.  NFS now calls
     nfs_fill_super() directly rather than jumping through a pointer to it
     since there's only the one option at the moment.

 (*) Removed ->mnt_ns and ->sb from sb_config and moved ->pid_ns into
     proc_sb_config.

 (*) Renamed create_super -> get_tree.

 (*) Renamed struct mount_context to struct sb_config and amended various
     variable names.

 (*) sys_fsmount() acquired AT_* flags and MS_* flags (for MNT_* flags)
     arguments.

 ver #1:

 (*) Split the sb_config stuff out into its own header.

 (*) Support non-context aware filesystems through a special set of
     sb_config operations.

 (*) Stored the created superblock and root dentry into the sb_config after
     creation rather than directly into a vfsmount.  This allows some
     arguments to be removed to various NFS functions.

 (*) Added an explicit superblock-creation step.  This allows a created
     superblock to then be mounted multiple times.

 (*) Added a flag to say that the sb_config is degraded and cannot have
     another go at having a superblock creation whilst getting rid of the
     one that says it's already mounted.

Possible further developments:

 (*) Implement sb reconfiguration (for now it returns ENOANO).

 (*) Implement mount context support in more filesystems, ext4 being next
     on my list.

 (*) Move the walk-from-root stuff that nfs has to generic code so that you
     can do something akin to:

	mount /dev/sda1:/foo/bar /mnt

     See nfs_follow_remote_path() and mount_subtree().  This is slightly
     tricky in NFS as we have to prevent referral loops.

 (*) Work out how to get at the error message incurred by submounts
     encountered during nfs_follow_remote_path().

     Should the error message be moved to task_struct and made more
     general, perhaps retrieved with a prctl() function?

 (*) Clean up/consolidate the security functions.  Possibly add a
     validation hook to be called at the same time as the mount context
     validate op.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

on branch:

	mount-api

David
---
Al Viro (2):
      vfs: syscall: Add open_tree(2) to reference or clone a mount
      teach move_mount(2) to work with OPEN_TREE_CLONE

David Howells (32):
      vfs: syscall: Add move_mount(2) to move mounts around
      vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled
      vfs: Introduce the basic header for the new mount API's filesystem context
      vfs: Introduce logging functions
      vfs: Add configuration parser helpers
      vfs: Add LSM hooks for the new mount API
      vfs: Put security flags into the fs_context struct
      selinux: Implement the new mount API LSM hooks
      smack: Implement filesystem context security hooks
      apparmor: Implement security hooks for the new mount API
      tomoyo: Implement security hooks for the new mount API
      vfs: Separate changing mount flags full remount
      vfs: Implement a filesystem superblock creation/configuration context
      vfs: Remove unused code after filesystem context changes
      procfs: Move proc_fill_super() to fs/proc/root.c
      proc: Add fs_context support to procfs
      ipc: Convert mqueue fs to fs_context
      cpuset: Use fs_context
      kernfs, sysfs, cgroup, intel_rdt: Support fs_context
      hugetlbfs: Convert to fs_context
      vfs: Remove kern_mount_data()
      vfs: Provide documentation for new mount API
      Make anon_inodes unconditional
      vfs: syscall: Add fsopen() to prepare for superblock creation
      vfs: Implement logging through fs_context
      vfs: Add some logging to the core users of the fs_context log
      vfs: syscall: Add fsconfig() for configuring and managing a context
      vfs: syscall: Add fsmount() to create a mount for a superblock
      vfs: syscall: Add fspick() to select a superblock for reconfiguration
      afs: Add fs_context support
      afs: Use fs_context to pass parameters over automount
      vfs: Add a sample program for the new mount API


 Documentation/filesystems/mount_api.txt  |  741 +++++++++++++++++++++++++
 arch/arc/kernel/setup.c                  |    1 
 arch/arm/kernel/atags_parse.c            |    1 
 arch/arm/kvm/Kconfig                     |    1 
 arch/arm64/kvm/Kconfig                   |    1 
 arch/mips/kvm/Kconfig                    |    1 
 arch/powerpc/kvm/Kconfig                 |    1 
 arch/s390/kvm/Kconfig                    |    1 
 arch/sh/kernel/setup.c                   |    1 
 arch/sparc/kernel/setup_32.c             |    1 
 arch/sparc/kernel/setup_64.c             |    1 
 arch/x86/Kconfig                         |    1 
 arch/x86/entry/syscalls/syscall_32.tbl   |    6 
 arch/x86/entry/syscalls/syscall_64.tbl   |    6 
 arch/x86/kernel/cpu/intel_rdt.h          |   15 +
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |  183 ++++--
 arch/x86/kernel/setup.c                  |    1 
 arch/x86/kvm/Kconfig                     |    1 
 drivers/base/Kconfig                     |    1 
 drivers/base/devtmpfs.c                  |    1 
 drivers/char/tpm/Kconfig                 |    1 
 drivers/dma-buf/Kconfig                  |    1 
 drivers/gpio/Kconfig                     |    1 
 drivers/iio/Kconfig                      |    1 
 drivers/infiniband/Kconfig               |    1 
 drivers/vfio/Kconfig                     |    1 
 fs/Kconfig                               |    7 
 fs/Makefile                              |    5 
 fs/afs/internal.h                        |    9 
 fs/afs/mntpt.c                           |  148 +++--
 fs/afs/super.c                           |  470 +++++++++-------
 fs/afs/volume.c                          |    4 
 fs/f2fs/super.c                          |    2 
 fs/file_table.c                          |    9 
 fs/filesystems.c                         |    4 
 fs/fs_context.c                          |  769 ++++++++++++++++++++++++++
 fs/fs_parser.c                           |  555 +++++++++++++++++++
 fs/fsopen.c                              |  563 +++++++++++++++++++
 fs/hugetlbfs/inode.c                     |  391 ++++++++-----
 fs/internal.h                            |   19 +
 fs/kernfs/mount.c                        |   88 +--
 fs/libfs.c                               |   20 +
 fs/namei.c                               |    4 
 fs/namespace.c                           |  895 +++++++++++++++++++++++-------
 fs/notify/fanotify/Kconfig               |    1 
 fs/notify/inotify/Kconfig                |    1 
 fs/pnode.c                               |    1 
 fs/proc/inode.c                          |   50 --
 fs/proc/internal.h                       |    5 
 fs/proc/root.c                           |  256 ++++++---
 fs/super.c                               |  464 ++++++++++++----
 fs/sysfs/mount.c                         |   67 ++
 include/linux/cgroup.h                   |    3 
 include/linux/errno.h                    |    1 
 include/linux/fs.h                       |   23 +
 include/linux/fs_context.h               |  215 +++++++
 include/linux/fs_parser.h                |  119 ++++
 include/linux/kernfs.h                   |   41 +
 include/linux/lsm_hooks.h                |   79 ++-
 include/linux/module.h                   |    6 
 include/linux/mount.h                    |    5 
 include/linux/security.h                 |   65 ++
 include/linux/syscalls.h                 |    9 
 include/uapi/linux/fcntl.h               |    2 
 include/uapi/linux/fs.h                  |   82 +--
 include/uapi/linux/mount.h               |   75 +++
 init/Kconfig                             |   10 
 init/do_mounts.c                         |    1 
 init/do_mounts_initrd.c                  |    1 
 ipc/mqueue.c                             |  107 +++-
 ipc/namespace.c                          |    2 
 kernel/cgroup/cgroup-internal.h          |   50 +-
 kernel/cgroup/cgroup-v1.c                |  345 ++++++------
 kernel/cgroup/cgroup.c                   |  264 ++++++---
 kernel/cgroup/cpuset.c                   |   69 ++
 samples/Kconfig                          |   10 
 samples/Makefile                         |    2 
 samples/statx/Makefile                   |    7 
 samples/statx/test-statx.c               |  258 ---------
 samples/vfs/Makefile                     |   10 
 samples/vfs/test-fsmount.c               |  118 ++++
 samples/vfs/test-statx.c                 |  258 +++++++++
 security/apparmor/include/mount.h        |   11 
 security/apparmor/lsm.c                  |  108 ++++
 security/apparmor/mount.c                |   47 ++
 security/security.c                      |   56 ++
 security/selinux/hooks.c                 |  387 ++++++++++---
 security/selinux/include/security.h      |   16 -
 security/smack/smack.h                   |   21 -
 security/smack/smack_lsm.c               |  365 +++++++++++-
 security/tomoyo/common.h                 |    3 
 security/tomoyo/mount.c                  |   46 ++
 security/tomoyo/tomoyo.c                 |   15 +
 93 files changed, 7191 insertions(+), 1900 deletions(-)
 create mode 100644 Documentation/filesystems/mount_api.txt
 create mode 100644 fs/fs_context.c
 create mode 100644 fs/fs_parser.c
 create mode 100644 fs/fsopen.c
 create mode 100644 include/linux/fs_context.h
 create mode 100644 include/linux/fs_parser.h
 create mode 100644 include/uapi/linux/mount.h
 delete mode 100644 samples/statx/Makefile
 delete mode 100644 samples/statx/test-statx.c
 create mode 100644 samples/vfs/Makefile
 create mode 100644 samples/vfs/test-fsmount.c
 create mode 100644 samples/vfs/test-statx.c

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 01/34] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #12]
  2018-09-21 16:30 [PATCH 00/34] VFS: Introduce filesystem context [ver #12] David Howells
@ 2018-09-21 16:30 ` David Howells
  2018-10-21 16:41   ` Eric W. Biederman
  2018-09-21 16:30 ` [PATCH 02/34] vfs: syscall: Add move_mount(2) to move mounts around " David Howells
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: David Howells @ 2018-09-21 16:30 UTC (permalink / raw)
  To: viro
  Cc: linux-api, torvalds, dhowells, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

From: Al Viro <viro@zeniv.linux.org.uk>

open_tree(dfd, pathname, flags)

Returns an O_PATH-opened file descriptor or an error.
dfd and pathname specify the location to open, in usual
fashion (see e.g. fstatat(2)).  flags should be an OR of
some of the following:
	* AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
same meanings as usual
	* OPEN_TREE_CLOEXEC - make the resulting descriptor
close-on-exec
	* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
instead of opening the location in question, create a detached
mount tree matching the subtree rooted at location specified by
dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
without it - only the part within in the mount containing the
location in question.  In other words, the same as mount --rbind
or mount --bind would've taken.  The detached tree will be
dissolved on the final close of obtained file.  Creation of such
detached trees requires the same capabilities as doing mount --bind.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/file_table.c                        |    9 +-
 fs/internal.h                          |    1 
 fs/namespace.c                         |  132 +++++++++++++++++++++++++++-----
 include/linux/fs.h                     |    7 +-
 include/linux/syscalls.h               |    1 
 include/uapi/linux/fcntl.h             |    2 
 include/uapi/linux/mount.h             |   10 ++
 9 files changed, 137 insertions(+), 27 deletions(-)
 create mode 100644 include/uapi/linux/mount.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..ea1b413afd47 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,4 @@
 384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
+387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..0545bed581dc 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+335	common	open_tree		__x64_sys_open_tree
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/file_table.c b/fs/file_table.c
index e49af4caf15d..e03c8d121c6c 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -255,6 +255,7 @@ static void __fput(struct file *file)
 	struct dentry *dentry = file->f_path.dentry;
 	struct vfsmount *mnt = file->f_path.mnt;
 	struct inode *inode = file->f_inode;
+	fmode_t mode = file->f_mode;
 
 	if (unlikely(!(file->f_mode & FMODE_OPENED)))
 		goto out;
@@ -277,18 +278,20 @@ static void __fput(struct file *file)
 	if (file->f_op->release)
 		file->f_op->release(inode, file);
 	if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
-		     !(file->f_mode & FMODE_PATH))) {
+		     !(mode & FMODE_PATH))) {
 		cdev_put(inode->i_cdev);
 	}
 	fops_put(file->f_op);
 	put_pid(file->f_owner.pid);
-	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
+	if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
 		i_readcount_dec(inode);
-	if (file->f_mode & FMODE_WRITER) {
+	if (mode & FMODE_WRITER) {
 		put_write_access(inode);
 		__mnt_drop_write(mnt);
 	}
 	dput(dentry);
+	if (unlikely(mode & FMODE_NEED_UNMOUNT))
+		dissolve_on_fput(mnt);
 	mntput(mnt);
 out:
 	file_free(file);
diff --git a/fs/internal.h b/fs/internal.h
index 364c20b5ea2d..17029b30e196 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -85,6 +85,7 @@ extern int __mnt_want_write_file(struct file *);
 extern void __mnt_drop_write(struct vfsmount *);
 extern void __mnt_drop_write_file(struct file *);
 
+extern void dissolve_on_fput(struct vfsmount *);
 /*
  * fs_struct.c
  */
diff --git a/fs/namespace.c b/fs/namespace.c
index 8a7e1a7d1d06..ded1a970ec40 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -20,12 +20,14 @@
 #include <linux/init.h>		/* init_rootfs */
 #include <linux/fs_struct.h>	/* get_fs_root et.al. */
 #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
+#include <linux/file.h>
 #include <linux/uaccess.h>
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/bootmem.h>
 #include <linux/task_work.h>
 #include <linux/sched/task.h>
+#include <uapi/linux/mount.h>
 
 #include "pnode.h"
 #include "internal.h"
@@ -1779,6 +1781,16 @@ struct vfsmount *collect_mounts(const struct path *path)
 	return &tree->mnt;
 }
 
+void dissolve_on_fput(struct vfsmount *mnt)
+{
+	namespace_lock();
+	lock_mount_hash();
+	mntget(mnt);
+	umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
+	unlock_mount_hash();
+	namespace_unlock();
+}
+
 void drop_collected_mounts(struct vfsmount *mnt)
 {
 	namespace_lock();
@@ -2138,6 +2150,30 @@ static bool has_locked_children(struct mount *mnt, struct dentry *dentry)
 	return false;
 }
 
+static struct mount *__do_loopback(struct path *old_path, int recurse)
+{
+	struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);
+
+	if (IS_MNT_UNBINDABLE(old))
+		return mnt;
+
+	if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
+		return mnt;
+
+	if (!recurse && has_locked_children(old, old_path->dentry))
+		return mnt;
+
+	if (recurse)
+		mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
+	else
+		mnt = clone_mnt(old, old_path->dentry, 0);
+
+	if (!IS_ERR(mnt))
+		mnt->mnt.mnt_flags &= ~MNT_LOCKED;
+
+	return mnt;
+}
+
 /*
  * do loopback mount.
  */
@@ -2145,7 +2181,7 @@ static int do_loopback(struct path *path, const char *old_name,
 				int recurse)
 {
 	struct path old_path;
-	struct mount *mnt = NULL, *old, *parent;
+	struct mount *mnt = NULL, *parent;
 	struct mountpoint *mp;
 	int err;
 	if (!old_name || !*old_name)
@@ -2159,38 +2195,21 @@ static int do_loopback(struct path *path, const char *old_name,
 		goto out;
 
 	mp = lock_mount(path);
-	err = PTR_ERR(mp);
-	if (IS_ERR(mp))
+	if (IS_ERR(mp)) {
+		err = PTR_ERR(mp);
 		goto out;
+	}
 
-	old = real_mount(old_path.mnt);
 	parent = real_mount(path->mnt);
-
-	err = -EINVAL;
-	if (IS_MNT_UNBINDABLE(old))
-		goto out2;
-
 	if (!check_mnt(parent))
 		goto out2;
 
-	if (!check_mnt(old) && old_path.dentry->d_op != &ns_dentry_operations)
-		goto out2;
-
-	if (!recurse && has_locked_children(old, old_path.dentry))
-		goto out2;
-
-	if (recurse)
-		mnt = copy_tree(old, old_path.dentry, CL_COPY_MNT_NS_FILE);
-	else
-		mnt = clone_mnt(old, old_path.dentry, 0);
-
+	mnt = __do_loopback(&old_path, recurse);
 	if (IS_ERR(mnt)) {
 		err = PTR_ERR(mnt);
 		goto out2;
 	}
 
-	mnt->mnt.mnt_flags &= ~MNT_LOCKED;
-
 	err = graft_tree(mnt, parent, mp);
 	if (err) {
 		lock_mount_hash();
@@ -2204,6 +2223,75 @@ static int do_loopback(struct path *path, const char *old_name,
 	return err;
 }
 
+SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
+{
+	struct file *file;
+	struct path path;
+	int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
+	bool detached = flags & OPEN_TREE_CLONE;
+	int error;
+	int fd;
+
+	BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
+
+	if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
+		      AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
+		      OPEN_TREE_CLOEXEC))
+		return -EINVAL;
+
+	if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
+		return -EINVAL;
+
+	if (flags & AT_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (flags & AT_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (flags & AT_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+	if (detached && !may_mount())
+		return -EPERM;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	error = user_path_at(dfd, filename, lookup_flags, &path);
+	if (error)
+		goto out;
+
+	if (detached) {
+		struct mount *mnt = __do_loopback(&path, flags & AT_RECURSIVE);
+		if (IS_ERR(mnt)) {
+			error = PTR_ERR(mnt);
+			goto out2;
+		}
+		mntput(path.mnt);
+		path.mnt = &mnt->mnt;
+	}
+
+	file = dentry_open(&path, O_PATH, current_cred());
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto out3;
+	}
+
+	if (detached)
+		file->f_mode |= FMODE_NEED_UNMOUNT;
+	path_put(&path);
+	fd_install(fd, file);
+	return fd;
+
+out3:
+	if (detached)
+		dissolve_on_fput(path.mnt);
+out2:
+	path_put(&path);
+out:
+	put_unused_fd(fd);
+	return error;
+}
+
 static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
 {
 	int error = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4323b8fe353d..6dc32507762f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -157,10 +157,13 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define FMODE_NONOTIFY		((__force fmode_t)0x4000000)
 
 /* File is capable of returning -EAGAIN if I/O will block */
-#define FMODE_NOWAIT	((__force fmode_t)0x8000000)
+#define FMODE_NOWAIT		((__force fmode_t)0x8000000)
+
+/* File represents mount that needs unmounting */
+#define FMODE_NEED_UNMOUNT	((__force fmode_t)0x10000000)
 
 /* File does not contribute to nr_files count */
-#define FMODE_NOACCOUNT	((__force fmode_t)0x20000000)
+#define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
 
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2ff814c92f7f..6978f3c76d41 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -906,6 +906,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
+asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..594b85f7cb86 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -90,5 +90,7 @@
 #define AT_STATX_FORCE_SYNC	0x2000	/* - Force the attributes to be sync'd with the server */
 #define AT_STATX_DONT_SYNC	0x4000	/* - Don't sync attributes with the server */
 
+#define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
+
 
 #endif /* _UAPI_LINUX_FCNTL_H */
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
new file mode 100644
index 000000000000..e8db2911adca
--- /dev/null
+++ b/include/uapi/linux/mount.h
@@ -0,0 +1,10 @@
+#ifndef _UAPI_LINUX_MOUNT_H
+#define _UAPI_LINUX_MOUNT_H
+
+/*
+ * open_tree() flags.
+ */
+#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
+#define OPEN_TREE_CLOEXEC	O_CLOEXEC	/* Close the file on execve() */
+
+#endif /* _UAPI_LINUX_MOUNT_H */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 02/34] vfs: syscall: Add move_mount(2) to move mounts around [ver #12]
  2018-09-21 16:30 [PATCH 00/34] VFS: Introduce filesystem context [ver #12] David Howells
  2018-09-21 16:30 ` [PATCH 01/34] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
@ 2018-09-21 16:30 ` David Howells
  2018-09-21 16:33 ` [PATCH 26/34] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2018-09-21 16:30 UTC (permalink / raw)
  To: viro
  Cc: linux-api, torvalds, dhowells, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

Add a move_mount() system call that will move a mount from one place to
another and, in the next commit, allow to attach an unattached mount tree.

The new system call looks like the following:

	int move_mount(int from_dfd, const char *from_path,
		       int to_dfd, const char *to_path,
		       unsigned int flags);

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/namespace.c                         |  102 ++++++++++++++++++++++++++------
 include/linux/lsm_hooks.h              |    6 ++
 include/linux/security.h               |    7 ++
 include/linux/syscalls.h               |    3 +
 include/uapi/linux/mount.h             |   11 +++
 security/security.c                    |    5 ++
 8 files changed, 118 insertions(+), 18 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ea1b413afd47..76d092b7d1b0 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -399,3 +399,4 @@
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
+388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 0545bed581dc..37ba4e65eee6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -344,6 +344,7 @@
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
 335	common	open_tree		__x64_sys_open_tree
+336	common	move_mount		__x64_sys_move_mount
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index ded1a970ec40..dd38141b1723 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2386,43 +2386,37 @@ static inline int tree_contains_unbindable(struct mount *mnt)
 	return 0;
 }
 
-static int do_move_mount(struct path *path, const char *old_name)
+static int do_move_mount(struct path *old_path, struct path *new_path)
 {
-	struct path old_path, parent_path;
+	struct path parent_path = {.mnt = NULL, .dentry = NULL};
 	struct mount *p;
 	struct mount *old;
 	struct mountpoint *mp;
 	int err;
-	if (!old_name || !*old_name)
-		return -EINVAL;
-	err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
-	if (err)
-		return err;
 
-	mp = lock_mount(path);
+	mp = lock_mount(new_path);
 	err = PTR_ERR(mp);
 	if (IS_ERR(mp))
 		goto out;
 
-	old = real_mount(old_path.mnt);
-	p = real_mount(path->mnt);
+	old = real_mount(old_path->mnt);
+	p = real_mount(new_path->mnt);
 
 	err = -EINVAL;
 	if (!check_mnt(p) || !check_mnt(old))
 		goto out1;
 
-	if (old->mnt.mnt_flags & MNT_LOCKED)
+	if (!mnt_has_parent(old))
 		goto out1;
 
-	err = -EINVAL;
-	if (old_path.dentry != old_path.mnt->mnt_root)
+	if (old->mnt.mnt_flags & MNT_LOCKED)
 		goto out1;
 
-	if (!mnt_has_parent(old))
+	if (old_path->dentry != old_path->mnt->mnt_root)
 		goto out1;
 
-	if (d_is_dir(path->dentry) !=
-	      d_is_dir(old_path.dentry))
+	if (d_is_dir(new_path->dentry) !=
+	    d_is_dir(old_path->dentry))
 		goto out1;
 	/*
 	 * Don't move a mount residing in a shared parent.
@@ -2440,7 +2434,8 @@ static int do_move_mount(struct path *path, const char *old_name)
 		if (p == old)
 			goto out1;
 
-	err = attach_recursive_mnt(old, real_mount(path->mnt), mp, &parent_path);
+	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
+				   &parent_path);
 	if (err)
 		goto out1;
 
@@ -2452,6 +2447,22 @@ static int do_move_mount(struct path *path, const char *old_name)
 out:
 	if (!err)
 		path_put(&parent_path);
+	return err;
+}
+
+static int do_move_mount_old(struct path *path, const char *old_name)
+{
+	struct path old_path;
+	int err;
+
+	if (!old_name || !*old_name)
+		return -EINVAL;
+
+	err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
+	if (err)
+		return err;
+
+	err = do_move_mount(&old_path, path);
 	path_put(&old_path);
 	return err;
 }
@@ -2873,7 +2884,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 	else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
 		retval = do_change_type(&path, flags);
 	else if (flags & MS_MOVE)
-		retval = do_move_mount(&path, dev_name);
+		retval = do_move_mount_old(&path, dev_name);
 	else
 		retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
 				      dev_name, data_page, data_size);
@@ -3108,6 +3119,61 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
 	return ksys_mount(dev_name, dir_name, type, flags, data);
 }
 
+/*
+ * Move a mount from one place to another.
+ *
+ * Note the flags value is a combination of MOVE_MOUNT_* flags.
+ */
+SYSCALL_DEFINE5(move_mount,
+		int, from_dfd, const char *, from_pathname,
+		int, to_dfd, const char *, to_pathname,
+		unsigned int, flags)
+{
+	struct path from_path, to_path;
+	unsigned int lflags;
+	int ret = 0;
+
+	if (!may_mount())
+		return -EPERM;
+
+	if (flags & ~MOVE_MOUNT__MASK)
+		return -EINVAL;
+
+	/* If someone gives a pathname, they aren't permitted to move
+	 * from an fd that requires unmount as we can't get at the flag
+	 * to clear it afterwards.
+	 */
+	lflags = 0;
+	if (flags & MOVE_MOUNT_F_SYMLINKS)	lflags |= LOOKUP_FOLLOW;
+	if (flags & MOVE_MOUNT_F_AUTOMOUNTS)	lflags |= LOOKUP_AUTOMOUNT;
+	if (flags & MOVE_MOUNT_F_EMPTY_PATH)	lflags |= LOOKUP_EMPTY;
+
+	ret = user_path_at(from_dfd, from_pathname, lflags, &from_path);
+	if (ret < 0)
+		return ret;
+
+	lflags = 0;
+	if (flags & MOVE_MOUNT_T_SYMLINKS)	lflags |= LOOKUP_FOLLOW;
+	if (flags & MOVE_MOUNT_T_AUTOMOUNTS)	lflags |= LOOKUP_AUTOMOUNT;
+	if (flags & MOVE_MOUNT_T_EMPTY_PATH)	lflags |= LOOKUP_EMPTY;
+
+	ret = user_path_at(to_dfd, to_pathname, lflags, &to_path);
+	if (ret < 0)
+		goto out_from;
+
+	ret = security_move_mount(&from_path, &to_path);
+	if (ret < 0)
+		goto out_to;
+
+	ret = do_move_mount(&from_path, &to_path);
+
+out_to:
+	path_put(&to_path);
+out_from:
+	path_put(&from_path);
+	return ret;
+}
+
 /*
  * Return true if path is reachable from root
  *
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 8a44075347ac..d052db1a1565 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -147,6 +147,10 @@
  *	Parse a string of security data filling in the opts structure
  *	@options string containing all mount options known by the LSM
  *	@opts binary data structure usable by the LSM
+ * @move_mount:
+ *	Check permission before a mount is moved.
+ *	@from_path indicates the mount that is going to be moved.
+ *	@to_path indicates the mountpoint that will be mounted upon.
  * @dentry_init_security:
  *	Compute a context for a dentry as the inode is not yet available
  *	since NFSv4 has no label backed by an EA anyway.
@@ -1484,6 +1488,7 @@ union security_list_options {
 					unsigned long kern_flags,
 					unsigned long *set_kern_flags);
 	int (*sb_parse_opts_str)(char *options, struct security_mnt_opts *opts);
+	int (*move_mount)(const struct path *from_path, const struct path *to_path);
 	int (*dentry_init_security)(struct dentry *dentry, int mode,
 					const struct qstr *name, void **ctx,
 					u32 *ctxlen);
@@ -1816,6 +1821,7 @@ struct security_hook_heads {
 	struct hlist_head sb_set_mnt_opts;
 	struct hlist_head sb_clone_mnt_opts;
 	struct hlist_head sb_parse_opts_str;
+	struct hlist_head move_mount;
 	struct hlist_head dentry_init_security;
 	struct hlist_head dentry_create_files_as;
 #ifdef CONFIG_SECURITY_PATH
diff --git a/include/linux/security.h b/include/linux/security.h
index 30a3db9f284b..a306061d2197 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -266,6 +266,7 @@ int security_sb_clone_mnt_opts(const struct super_block *oldsb,
 				unsigned long kern_flags,
 				unsigned long *set_kern_flags);
 int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts);
+int security_move_mount(const struct path *from_path, const struct path *to_path);
 int security_dentry_init_security(struct dentry *dentry, int mode,
 					const struct qstr *name, void **ctx,
 					u32 *ctxlen);
@@ -621,6 +622,12 @@ static inline int security_sb_parse_opts_str(char *options, struct security_mnt_
 	return 0;
 }
 
+static inline int security_move_mount(const struct path *from_path,
+				      const struct path *to_path)
+{
+	return 0;
+}
+
 static inline int security_inode_alloc(struct inode *inode)
 {
 	return 0;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 6978f3c76d41..79042396f7e5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -907,6 +907,9 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
 asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
+asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
+			       int to_dfd, const char __user *to_path,
+			       unsigned int ms_flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index e8db2911adca..89adf0d731ab 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -7,4 +7,15 @@
 #define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
 #define OPEN_TREE_CLOEXEC	O_CLOEXEC	/* Close the file on execve() */
 
+/*
+ * move_mount() flags.
+ */
+#define MOVE_MOUNT_F_SYMLINKS		0x00000001 /* Follow symlinks on from path */
+#define MOVE_MOUNT_F_AUTOMOUNTS		0x00000002 /* Follow automounts on from path */
+#define MOVE_MOUNT_F_EMPTY_PATH		0x00000004 /* Empty from path permitted */
+#define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
+#define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
+#define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
+#define MOVE_MOUNT__MASK		0x00000077
+
 #endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/security/security.c b/security/security.c
index 3d99ed8d9ddd..96a061cecb39 100644
--- a/security/security.c
+++ b/security/security.c
@@ -444,6 +444,11 @@ int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts)
 }
 EXPORT_SYMBOL(security_sb_parse_opts_str);
 
+int security_move_mount(const struct path *from_path, const struct path *to_path)
+{
+	return call_int_hook(move_mount, 0, from_path, to_path);
+}
+
 int security_inode_alloc(struct inode *inode)
 {
 	inode->i_security = NULL;

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 26/34] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #12]
  2018-09-21 16:30 [PATCH 00/34] VFS: Introduce filesystem context [ver #12] David Howells
  2018-09-21 16:30 ` [PATCH 01/34] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
  2018-09-21 16:30 ` [PATCH 02/34] vfs: syscall: Add move_mount(2) to move mounts around " David Howells
@ 2018-09-21 16:33 ` David Howells
  2018-09-21 16:34 ` [PATCH 29/34] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2018-09-21 16:33 UTC (permalink / raw)
  To: viro
  Cc: linux-api, torvalds, dhowells, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

Provide an fsopen() system call that starts the process of preparing to
create a superblock that will then be mountable, using an fd as a context
handle.  fsopen() is given the name of the filesystem that will be used:

	int mfd = fsopen(const char *fsname, unsigned int flags);

where flags can be 0 or FSOPEN_CLOEXEC.

For example:

	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
	fsconfig(sfd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
	fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
	fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
	fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
	fsconfig(sfd, FSCONFIG_SET_STRING, "sb", "1", 0);
	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	fsinfo(sfd, NULL, ...); // query new superblock attributes
	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
	move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

	sfd = fsopen("afs", -1);
	fsconfig(fd, FSCONFIG_SET_STRING, "source",
		 "#grand.central.org:root.cell", 0);
	fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	mfd = fsmount(sfd, 0, MS_NODEV);
	move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

If an error is reported at any step, an error message may be available to be
read() back (ENODATA will be reported if there isn't an error available) in
the form:

	"e <subsys>:<problem>"
	"e SELinux:Mount on mountpoint not permitted"

Once fsmount() has been called, further fsconfig() calls will incur EBUSY,
even if the fsmount() fails.  read() is still possible to retrieve error
information.

The fsopen() syscall creates a mount context and hangs it of the fd that it
returns.

Netlink is not used because it is optional and would make the core VFS
dependent on the networking layer and also potentially add network
namespace issues.

Note that, for the moment, the caller must have SYS_CAP_ADMIN to use
fsopen().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/Makefile                            |    2 -
 fs/fs_context.c                        |    4 +
 fs/fsopen.c                            |   87 ++++++++++++++++++++++++++++++++
 include/linux/fs_context.h             |   18 +++++++
 include/linux/syscalls.h               |    1 
 include/uapi/linux/fs.h                |    5 ++
 8 files changed, 118 insertions(+), 1 deletion(-)
 create mode 100644 fs/fsopen.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 76d092b7d1b0..1647fefd2969 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
 388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
+389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 37ba4e65eee6..235d33dbccb2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 334	common	rseq			__x64_sys_rseq
 335	common	open_tree		__x64_sys_open_tree
 336	common	move_mount		__x64_sys_move_mount
+337	common	fsopen			__x64_sys_fsopen
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index ae681523b4b1..e3ea8093b178 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
-		fs_context.o fs_parser.o
+		fs_context.o fs_parser.o fsopen.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fs_context.c b/fs/fs_context.c
index 328fcb764667..5d93c86c649d 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -267,6 +267,8 @@ struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
 	fc->fs_type	= get_filesystem(fs_type);
 	fc->cred	= get_current_cred();
 
+	mutex_init(&fc->uapi_mutex);
+
 	switch (purpose) {
 	case FS_CONTEXT_FOR_KERNEL_MOUNT:
 		fc->sb_flags |= SB_KERNMOUNT;
@@ -334,6 +336,8 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc,
 	if (!fc)
 		return ERR_PTR(-ENOMEM);
 
+	mutex_init(&fc->uapi_mutex);
+
 	fc->fs_private	= NULL;
 	fc->s_fs_info	= NULL;
 	fc->source	= NULL;
diff --git a/fs/fsopen.c b/fs/fsopen.c
new file mode 100644
index 000000000000..a9a36f61d76c
--- /dev/null
+++ b/fs/fsopen.c
@@ -0,0 +1,87 @@
+/* Filesystem access-by-fd.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fs_context.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/security.h>
+#include <linux/anon_inodes.h>
+#include <linux/namei.h>
+#include <linux/file.h>
+#include "mount.h"
+
+static int fscontext_release(struct inode *inode, struct file *file)
+{
+	struct fs_context *fc = file->private_data;
+
+	if (fc) {
+		file->private_data = NULL;
+		put_fs_context(fc);
+	}
+	return 0;
+}
+
+const struct file_operations fscontext_fops = {
+	.release	= fscontext_release,
+	.llseek		= no_llseek,
+};
+
+/*
+ * Attach a filesystem context to a file and an fd.
+ */
+static int fscontext_create_fd(struct fs_context *fc, unsigned int o_flags)
+{
+	int fd;
+
+	fd = anon_inode_getfd("fscontext", &fscontext_fops, fc,
+			      O_RDWR | o_flags);
+	if (fd < 0)
+		put_fs_context(fc);
+	return fd;
+}
+
+/*
+ * Open a filesystem by name so that it can be configured for mounting.
+ *
+ * We are allowed to specify a container in which the filesystem will be
+ * opened, thereby indicating which namespaces will be used (notably, which
+ * network namespace will be used for network filesystems).
+ */
+SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
+{
+	struct file_system_type *fs_type;
+	struct fs_context *fc;
+	const char *fs_name;
+
+	if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (flags & ~FSOPEN_CLOEXEC)
+		return -EINVAL;
+
+	fs_name = strndup_user(_fs_name, PAGE_SIZE);
+	if (IS_ERR(fs_name))
+		return PTR_ERR(fs_name);
+
+	fs_type = get_fs_type(fs_name);
+	kfree(fs_name);
+	if (!fs_type)
+		return -ENODEV;
+
+	fc = vfs_new_fs_context(fs_type, NULL, 0, 0, FS_CONTEXT_FOR_USER_MOUNT);
+	put_filesystem(fs_type);
+	if (IS_ERR(fc))
+		return PTR_ERR(fc);
+
+	fc->phase = FS_CONTEXT_CREATE_PARAMS;
+	return fscontext_create_fd(fc, flags & FSOPEN_CLOEXEC ? O_CLOEXEC : 0);
+}
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 0415510f64ed..37f8bafaaec3 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -14,6 +14,7 @@
 
 #include <linux/kernel.h>
 #include <linux/errno.h>
+#include <linux/mutex.h>
 
 struct cred;
 struct dentry;
@@ -37,6 +38,19 @@ enum fs_context_purpose {
 	FS_CONTEXT_FOR_EMERGENCY_RO,	/* Emergency reconfiguration to R/O */
 };
 
+/*
+ * Userspace usage phase for fsopen/fspick.
+ */
+enum fs_context_phase {
+	FS_CONTEXT_CREATE_PARAMS,	/* Loading params for sb creation */
+	FS_CONTEXT_CREATING,		/* A superblock is being created */
+	FS_CONTEXT_AWAITING_MOUNT,	/* Superblock created, awaiting fsmount() */
+	FS_CONTEXT_AWAITING_RECONF,	/* Awaiting initialisation for reconfiguration */
+	FS_CONTEXT_RECONF_PARAMS,	/* Loading params for reconfiguration */
+	FS_CONTEXT_RECONFIGURING,	/* Reconfiguring the superblock */
+	FS_CONTEXT_FAILED,		/* Failed to correctly transition a context */
+};
+
 /*
  * Type of parameter value.
  */
@@ -77,6 +91,7 @@ struct fs_parameter {
  */
 struct fs_context {
 	const struct fs_context_operations *ops;
+	struct mutex		uapi_mutex;	/* Userspace access mutex */
 	struct file_system_type	*fs_type;
 	void			*fs_private;	/* The filesystem's context */
 	struct dentry		*root;		/* The root and superblock */
@@ -91,6 +106,7 @@ struct fs_context {
 	unsigned int		sb_flags_mask;	/* Superblock flags that were changed */
 	unsigned int		lsm_flags;	/* Information flags from the fs to the LSM */
 	enum fs_context_purpose	purpose:8;
+	enum fs_context_phase	phase:8;	/* The phase the context is in */
 	bool			sloppy:1;	/* T if unrecognised options are okay */
 	bool			silent:1;	/* T if "o silent" specified */
 	bool			need_free:1;	/* Need to call ops->free() */
@@ -136,6 +152,8 @@ extern int vfs_get_super(struct fs_context *fc,
 			 int (*fill_super)(struct super_block *sb,
 					   struct fs_context *fc));
 
+extern const struct file_operations fscontext_fops;
+
 #define logfc(FC, FMT, ...) pr_notice(FMT, ## __VA_ARGS__)
 
 /**
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 79042396f7e5..650d99c91987 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -910,6 +910,7 @@ asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 			       int to_dfd, const char __user *to_path,
 			       unsigned int ms_flags);
+asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 1c982eb44ff4..f8818e6cddd6 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -344,4 +344,9 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/*
+ * Flags for fsopen() and co.
+ */
+#define FSOPEN_CLOEXEC		0x00000001
+
 #endif /* _UAPI_LINUX_FS_H */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 29/34] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #12]
  2018-09-21 16:30 [PATCH 00/34] VFS: Introduce filesystem context [ver #12] David Howells
                   ` (2 preceding siblings ...)
  2018-09-21 16:33 ` [PATCH 26/34] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
@ 2018-09-21 16:34 ` David Howells
  2018-09-21 16:34 ` [PATCH 30/34] vfs: syscall: Add fsmount() to create a mount for a superblock " David Howells
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2018-09-21 16:34 UTC (permalink / raw)
  To: viro
  Cc: linux-api, torvalds, dhowells, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

Add a syscall for configuring a filesystem creation context and triggering
actions upon it, to be used in conjunction with fsopen, fspick and fsmount.

    long fsconfig(int fs_fd, unsigned int cmd, const char *key,
		  const void *value, int aux);

Where fs_fd indicates the context, cmd indicates the action to take, key
indicates the parameter name for parameter-setting actions and, if needed,
value points to a buffer containing the value and aux can give more
information for the value.

The following command IDs are proposed:

 (*) FSCONFIG_SET_FLAG: No value is specified.  The parameter must be
     boolean in nature.  The key may be prefixed with "no" to invert the
     setting. value must be NULL and aux must be 0.

 (*) FSCONFIG_SET_STRING: A string value is specified.  The parameter can
     be expecting boolean, integer, string or take a path.  A conversion to
     an appropriate type will be attempted (which may include looking up as
     a path).  value points to a NUL-terminated string and aux must be 0.

 (*) FSCONFIG_SET_BINARY: A binary blob is specified.  value points to
     the blob and aux indicates its size.  The parameter must be expecting
     a blob.

 (*) FSCONFIG_SET_PATH: A non-empty path is specified.  The parameter must
     be expecting a path object.  value points to a NUL-terminated string
     that is the path and aux is a file descriptor at which to start a
     relative lookup or AT_FDCWD.

 (*) FSCONFIG_SET_PATH_EMPTY: As fsconfig_set_path, but with AT_EMPTY_PATH
     implied.

 (*) FSCONFIG_SET_FD: An open file descriptor is specified.  value must
     be NULL and aux indicates the file descriptor.

 (*) FSCONFIG_CMD_CREATE: Trigger superblock creation.

 (*) FSCONFIG_CMD_RECONFIGURE: Trigger superblock reconfiguration.

For the "set" command IDs, the idea is that the file_system_type will point
to a list of parameters and the types of value that those parameters expect
to take.  The core code can then do the parse and argument conversion and
then give the LSM and FS a cooked option or array of options to use.

Source specification is also done the same way same way, using special keys
"source", "source1", "source2", etc..

[!] Note that, for the moment, the key and value are just glued back
together and handed to the filesystem.  Every filesystem that uses options
uses match_token() and co. to do this, and this will need to be changed -
but not all at once.

Example usage:

    fd = fsopen("ext4", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);
    fsconfig(fd, fsconfig_set_path_empty, "journal_path", "", journal_fd);
    fsconfig(fd, fsconfig_set_fd, "journal_fd", "", journal_fd);
    fsconfig(fd, fsconfig_set_flag, "user_xattr", NULL, 0);
    fsconfig(fd, fsconfig_set_flag, "noacl", NULL, 0);
    fsconfig(fd, fsconfig_set_string, "sb", "1", 0);
    fsconfig(fd, fsconfig_set_string, "errors", "continue", 0);
    fsconfig(fd, fsconfig_set_string, "data", "journal", 0);
    fsconfig(fd, fsconfig_set_string, "context", "unconfined_u:...", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

    fd = fsopen("ext4", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_string, "source", "/dev/sda1", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

    fd = fsopen("afs", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_string, "source", "#grand.central.org:root.cell", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

    fd = fsopen("jffs2", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_string, "source", "mtd0", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/fsopen.c                            |  355 ++++++++++++++++++++++++++++++++
 include/linux/syscalls.h               |    2 
 include/uapi/linux/fs.h                |   14 +
 5 files changed, 373 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 1647fefd2969..f9970310c126 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -401,3 +401,4 @@
 387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
 388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
 389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
+390	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 235d33dbccb2..4185d36e03bb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -346,6 +346,7 @@
 335	common	open_tree		__x64_sys_open_tree
 336	common	move_mount		__x64_sys_move_mount
 337	common	fsopen			__x64_sys_fsopen
+338	common	fsconfig		__x64_sys_fsconfig
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 1b966b0c15f7..5955a6b65596 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -10,6 +10,7 @@
  */
 
 #include <linux/fs_context.h>
+#include <linux/fs_parser.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/syscalls.h>
@@ -17,6 +18,7 @@
 #include <linux/anon_inodes.h>
 #include <linux/namei.h>
 #include <linux/file.h>
+#include "internal.h"
 #include "mount.h"
 
 /*
@@ -152,3 +154,356 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
 	put_fs_context(fc);
 	return ret;
 }
+
+/*
+ * Check the state and apply the configuration.  Note that this function is
+ * allowed to 'steal' the value by setting param->xxx to NULL before returning.
+ */
+static int vfs_fsconfig(struct fs_context *fc, struct fs_parameter *param)
+{
+	int ret;
+
+	/* We need to reinitialise the context if we have reconfiguration
+	 * pending after creation or a previous reconfiguration.
+	 */
+	if (fc->phase == FS_CONTEXT_AWAITING_RECONF) {
+		if (fc->fs_type->init_fs_context) {
+			ret = fc->fs_type->init_fs_context(fc, fc->root);
+			if (ret < 0) {
+				fc->phase = FS_CONTEXT_FAILED;
+				return ret;
+			}
+			fc->need_free = true;
+		} else {
+			/* Leave legacy context ops in place */
+		}
+
+		/* Do the security check last because ->init_fs_context may
+		 * change the namespace subscriptions.
+		 */
+		ret = security_fs_context_alloc(fc, fc->root);
+		if (ret < 0) {
+			fc->phase = FS_CONTEXT_FAILED;
+			return ret;
+		}
+
+		fc->phase = FS_CONTEXT_RECONF_PARAMS;
+	}
+
+	if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+	    fc->phase != FS_CONTEXT_RECONF_PARAMS)
+		return -EBUSY;
+
+	return vfs_parse_fs_param(fc, param);
+}
+
+/*
+ * Reconfigure a superblock.
+ */
+int vfs_reconfigure_sb(struct fs_context *fc)
+{
+	struct super_block *sb = fc->root->d_sb;
+	int ret;
+
+	if (fc->ops->validate) {
+		ret = fc->ops->validate(fc);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = security_fs_context_validate(fc);
+	if (ret)
+		return ret;
+
+	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	down_write(&sb->s_umount);
+	ret = reconfigure_super(fc);
+	up_write(&sb->s_umount);
+	return ret;
+}
+
+/*
+ * Clean up a context after performing an action on it and put it into a state
+ * from where it can be used to reconfigure a superblock.
+ */
+void vfs_clean_context(struct fs_context *fc)
+{
+	if (fc->ops && fc->ops->free)
+		fc->ops->free(fc);
+	fc->need_free = false;
+	fc->fs_private = NULL;
+	fc->s_fs_info = NULL;
+	fc->sb_flags = 0;
+	fc->sloppy = false;
+	fc->silent = false;
+	security_fs_context_free(fc);
+	fc->security = NULL;
+	kfree(fc->subtype);
+	fc->subtype = NULL;
+	kfree(fc->source);
+	fc->source = NULL;
+
+	fc->purpose = FS_CONTEXT_FOR_RECONFIGURE;
+	fc->phase = FS_CONTEXT_AWAITING_RECONF;
+}
+
+/*
+ * Perform an action on a context.
+ */
+static int vfs_fsconfig_action(struct fs_context *fc, enum fsconfig_command cmd)
+{
+	int ret = -EINVAL;
+
+	switch (cmd) {
+	case FSCONFIG_CMD_CREATE:
+		if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
+			return -EBUSY;
+		fc->phase = FS_CONTEXT_CREATING;
+		ret = vfs_get_tree(fc);
+		if (ret == 0)
+			fc->phase = FS_CONTEXT_AWAITING_MOUNT;
+		else
+			fc->phase = FS_CONTEXT_FAILED;
+		return ret;
+
+	case FSCONFIG_CMD_RECONFIGURE:
+		if (fc->phase == FS_CONTEXT_AWAITING_RECONF) {
+			/* This is probably pointless, since no changes have
+			 * been proposed.
+			 */
+			if (fc->fs_type->init_fs_context) {
+				ret = fc->fs_type->init_fs_context(fc, fc->root);
+				if (ret < 0) {
+					fc->phase = FS_CONTEXT_FAILED;
+					return ret;
+				}
+				fc->need_free = true;
+			}
+			fc->phase = FS_CONTEXT_RECONF_PARAMS;
+		}
+
+		fc->phase = FS_CONTEXT_RECONFIGURING;
+		ret = vfs_reconfigure_sb(fc);
+		if (ret == 0)
+			vfs_clean_context(fc);
+		else
+			fc->phase = FS_CONTEXT_FAILED;
+		return ret;
+
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+/**
+ * sys_fsconfig - Set parameters and trigger actions on a context
+ * @fd: The filesystem context to act upon
+ * @cmd: The action to take
+ * @_key: Where appropriate, the parameter key to set
+ * @_value: Where appropriate, the parameter value to set
+ * @aux: Additional information for the value
+ *
+ * This system call is used to set parameters on a context, including
+ * superblock settings, data source and security labelling.
+ *
+ * Actions include triggering the creation of a superblock and the
+ * reconfiguration of the superblock attached to the specified context.
+ *
+ * When setting a parameter, @cmd indicates the type of value being proposed
+ * and @_key indicates the parameter to be altered.
+ *
+ * @_value and @aux are used to specify the value, should a value be required:
+ *
+ * (*) fsconfig_set_flag: No value is specified.  The parameter must be boolean
+ *     in nature.  The key may be prefixed with "no" to invert the
+ *     setting. @_value must be NULL and @aux must be 0.
+ *
+ * (*) fsconfig_set_string: A string value is specified.  The parameter can be
+ *     expecting boolean, integer, string or take a path.  A conversion to an
+ *     appropriate type will be attempted (which may include looking up as a
+ *     path).  @_value points to a NUL-terminated string and @aux must be 0.
+ *
+ * (*) fsconfig_set_binary: A binary blob is specified.  @_value points to the
+ *     blob and @aux indicates its size.  The parameter must be expecting a
+ *     blob.
+ *
+ * (*) fsconfig_set_path: A non-empty path is specified.  The parameter must be
+ *     expecting a path object.  @_value points to a NUL-terminated string that
+ *     is the path and @aux is a file descriptor at which to start a relative
+ *     lookup or AT_FDCWD.
+ *
+ * (*) fsconfig_set_path_empty: As fsconfig_set_path, but with AT_EMPTY_PATH
+ *     implied.
+ *
+ * (*) fsconfig_set_fd: An open file descriptor is specified.  @_value must be
+ *     NULL and @aux indicates the file descriptor.
+ */
+SYSCALL_DEFINE5(fsconfig,
+		int, fd,
+		unsigned int, cmd,
+		const char __user *, _key,
+		const void __user *, _value,
+		int, aux)
+{
+	struct fs_context *fc;
+	struct fd f;
+	int ret;
+
+	struct fs_parameter param = {
+		.type	= fs_value_is_undefined,
+	};
+
+	if (fd < 0)
+		return -EINVAL;
+
+	switch (cmd) {
+	case FSCONFIG_SET_FLAG:
+		if (!_key || _value || aux)
+			return -EINVAL;
+		break;
+	case FSCONFIG_SET_STRING:
+		if (!_key || !_value || aux)
+			return -EINVAL;
+		break;
+	case FSCONFIG_SET_BINARY:
+		if (!_key || !_value || aux <= 0 || aux > 1024 * 1024)
+			return -EINVAL;
+		break;
+	case FSCONFIG_SET_PATH:
+	case FSCONFIG_SET_PATH_EMPTY:
+		if (!_key || !_value || (aux != AT_FDCWD && aux < 0))
+			return -EINVAL;
+		break;
+	case FSCONFIG_SET_FD:
+		if (!_key || _value || aux < 0)
+			return -EINVAL;
+		break;
+	case FSCONFIG_CMD_CREATE:
+	case FSCONFIG_CMD_RECONFIGURE:
+		if (_key || _value || aux)
+			return -EINVAL;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+	ret = -EINVAL;
+	if (f.file->f_op != &fscontext_fops)
+		goto out_f;
+
+	fc = f.file->private_data;
+	if (fc->ops == &legacy_fs_context_ops) {
+		switch (cmd) {
+		case FSCONFIG_SET_BINARY:
+		case FSCONFIG_SET_PATH:
+		case FSCONFIG_SET_PATH_EMPTY:
+		case FSCONFIG_SET_FD:
+			ret = -EOPNOTSUPP;
+			goto out_f;
+		}
+	}
+
+	if (_key) {
+		param.key = strndup_user(_key, 256);
+		if (IS_ERR(param.key)) {
+			ret = PTR_ERR(param.key);
+			goto out_f;
+		}
+	}
+
+	switch (cmd) {
+	case FSCONFIG_SET_STRING:
+		param.type = fs_value_is_string;
+		param.string = strndup_user(_value, 256);
+		if (IS_ERR(param.string)) {
+			ret = PTR_ERR(param.string);
+			goto out_key;
+		}
+		param.size = strlen(param.string);
+		break;
+	case FSCONFIG_SET_BINARY:
+		param.type = fs_value_is_blob;
+		param.size = aux;
+		param.blob = memdup_user_nul(_value, aux);
+		if (IS_ERR(param.blob)) {
+			ret = PTR_ERR(param.blob);
+			goto out_key;
+		}
+		break;
+	case FSCONFIG_SET_PATH:
+		param.type = fs_value_is_filename;
+		param.name = getname_flags(_value, 0, NULL);
+		if (IS_ERR(param.name)) {
+			ret = PTR_ERR(param.name);
+			goto out_key;
+		}
+		param.dirfd = aux;
+		param.size = strlen(param.name->name);
+		break;
+	case FSCONFIG_SET_PATH_EMPTY:
+		param.type = fs_value_is_filename_empty;
+		param.name = getname_flags(_value, LOOKUP_EMPTY, NULL);
+		if (IS_ERR(param.name)) {
+			ret = PTR_ERR(param.name);
+			goto out_key;
+		}
+		param.dirfd = aux;
+		param.size = strlen(param.name->name);
+		break;
+	case FSCONFIG_SET_FD:
+		param.type = fs_value_is_file;
+		ret = -EBADF;
+		param.file = fget(aux);
+		if (!param.file)
+			goto out_key;
+		break;
+	default:
+		break;
+	}
+
+	ret = mutex_lock_interruptible(&fc->uapi_mutex);
+	if (ret == 0) {
+		switch (cmd) {
+		case FSCONFIG_CMD_CREATE:
+		case FSCONFIG_CMD_RECONFIGURE:
+			ret = vfs_fsconfig_action(fc, cmd);
+			break;
+		default:
+			ret = vfs_fsconfig(fc, &param);
+			break;
+		}
+		mutex_unlock(&fc->uapi_mutex);
+	}
+
+	/* Clean up the our record of any value that we obtained from
+	 * userspace.  Note that the value may have been stolen by the LSM or
+	 * filesystem, in which case the value pointer will have been cleared.
+	 */
+	switch (cmd) {
+	case FSCONFIG_SET_STRING:
+	case FSCONFIG_SET_BINARY:
+		kfree(param.string);
+		break;
+	case FSCONFIG_SET_PATH:
+	case FSCONFIG_SET_PATH_EMPTY:
+		if (param.name)
+			putname(param.name);
+		break;
+	case FSCONFIG_SET_FD:
+		if (param.file)
+			fput(param.file);
+		break;
+	default:
+		break;
+	}
+out_key:
+	kfree(param.key);
+out_f:
+	fdput(f);
+	return ret;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 650d99c91987..4ab15fdf8aea 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -911,6 +911,8 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 			       int to_dfd, const char __user *to_path,
 			       unsigned int ms_flags);
 asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
+asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
+			     const void __user *value, int aux);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f8818e6cddd6..fecbae30a30d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -349,4 +349,18 @@ typedef int __bitwise __kernel_rwf_t;
  */
 #define FSOPEN_CLOEXEC		0x00000001
 
+/*
+ * The type of fsconfig() call made.
+ */
+enum fsconfig_command {
+	FSCONFIG_SET_FLAG	= 0,	/* Set parameter, supplying no value */
+	FSCONFIG_SET_STRING	= 1,	/* Set parameter, supplying a string value */
+	FSCONFIG_SET_BINARY	= 2,	/* Set parameter, supplying a binary blob value */
+	FSCONFIG_SET_PATH	= 3,	/* Set parameter, supplying an object by path */
+	FSCONFIG_SET_PATH_EMPTY	= 4,	/* Set parameter, supplying an object by (empty) path */
+	FSCONFIG_SET_FD		= 5,	/* Set parameter, supplying an object by fd */
+	FSCONFIG_CMD_CREATE	= 6,	/* Invoke superblock creation */
+	FSCONFIG_CMD_RECONFIGURE = 7,	/* Invoke superblock reconfiguration */
+};
+
 #endif /* _UAPI_LINUX_FS_H */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 30/34] vfs: syscall: Add fsmount() to create a mount for a superblock [ver #12]
  2018-09-21 16:30 [PATCH 00/34] VFS: Introduce filesystem context [ver #12] David Howells
                   ` (3 preceding siblings ...)
  2018-09-21 16:34 ` [PATCH 29/34] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
@ 2018-09-21 16:34 ` David Howells
  2018-09-21 16:34 ` [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration " David Howells
  2018-10-04 18:37 ` [PATCH 00/34] VFS: Introduce filesystem context " Eric W. Biederman
  6 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2018-09-21 16:34 UTC (permalink / raw)
  To: viro
  Cc: linux-api, torvalds, dhowells, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

Provide a system call by which a filesystem opened with fsopen() and
configured by a series of fsconfig() calls can have a detached mount object
created for it.  This mount object can then be attached to the VFS mount
hierarchy using move_mount() by passing the returned file descriptor as the
from directory fd.

The system call looks like:

	int mfd = fsmount(int fsfd, unsigned int flags,
			  unsigned int ms_flags);

where fsfd is the file descriptor returned by fsopen().  flags can be 0 or
FSMOUNT_CLOEXEC.  ms_flags is a bitwise-OR of the following flags:

	MS_RDONLY
	MS_NOSUID
	MS_NODEV
	MS_NOEXEC
	MS_NOATIME
	MS_NODIRATIME
	MS_RELATIME
	MS_STRICTATIME

	MS_UNBINDABLE
	MS_PRIVATE
	MS_SLAVE
	MS_SHARED

In the event that fsmount() fails, it may be possible to get an error
message by calling read() on fsfd.  If no message is available, ENODATA
will be reported.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/namespace.c                         |  125 +++++++++++++++++++++++++++++++-
 include/linux/syscalls.h               |    1 
 include/uapi/linux/fs.h                |    2 +
 5 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f9970310c126..c78b68256f8a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -402,3 +402,4 @@
 388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
 389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
 390	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
+391	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 4185d36e03bb..d44ead5d4368 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -347,6 +347,7 @@
 336	common	move_mount		__x64_sys_move_mount
 337	common	fsopen			__x64_sys_fsopen
 338	common	fsconfig		__x64_sys_fsconfig
+339	common	fsmount			__x64_sys_fsmount
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index 156261d03c12..4dfe7e23b7ee 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2463,7 +2463,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 
 	attached = mnt_has_parent(old);
 	/*
-	 * We need to allow open_tree(OPEN_TREE_CLONE) followed by
+	 * We need to allow open_tree(OPEN_TREE_CLONE) or fsmount() followed by
 	 * move_mount(), but mustn't allow "/" to be moved.
 	 */
 	if (old->mnt_ns && !attached)
@@ -3329,9 +3329,126 @@ struct vfsmount *kern_mount(struct file_system_type *type)
 EXPORT_SYMBOL_GPL(kern_mount);
 
 /*
- * Move a mount from one place to another.
- * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
- * used to copy a mount subtree.
+ * Create a kernel mount representation for a new, prepared superblock
+ * (specified by fs_fd) and attach to an open_tree-like file descriptor.
+ */
+SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags)
+{
+	struct fs_context *fc;
+	struct file *file;
+	struct path newmount;
+	struct fd f;
+	unsigned int mnt_flags = 0;
+	long ret;
+
+	if (!may_mount())
+		return -EPERM;
+
+	if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
+		return -EINVAL;
+
+	if (ms_flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
+			 MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
+			 MS_STRICTATIME))
+		return -EINVAL;
+
+	if (ms_flags & MS_RDONLY)
+		mnt_flags |= MNT_READONLY;
+	if (ms_flags & MS_NOSUID)
+		mnt_flags |= MNT_NOSUID;
+	if (ms_flags & MS_NODEV)
+		mnt_flags |= MNT_NODEV;
+	if (ms_flags & MS_NOEXEC)
+		mnt_flags |= MNT_NOEXEC;
+	if (ms_flags & MS_NODIRATIME)
+		mnt_flags |= MNT_NODIRATIME;
+
+	if (ms_flags & MS_STRICTATIME) {
+		if (ms_flags & MS_NOATIME)
+			return -EINVAL;
+	} else if (ms_flags & MS_NOATIME) {
+		mnt_flags |= MNT_NOATIME;
+	} else {
+		mnt_flags |= MNT_RELATIME;
+	}
+
+	f = fdget(fs_fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EINVAL;
+	if (f.file->f_op != &fscontext_fops)
+		goto err_fsfd;
+
+	fc = f.file->private_data;
+
+	/* There must be a valid superblock or we can't mount it */
+	ret = -EINVAL;
+	if (!fc->root)
+		goto err_fsfd;
+
+	ret = -EPERM;
+	if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
+		pr_warn("VFS: Mount too revealing\n");
+		goto err_fsfd;
+	}
+
+	ret = mutex_lock_interruptible(&fc->uapi_mutex);
+	if (ret < 0)
+		goto err_fsfd;
+
+	ret = -EBUSY;
+	if (fc->phase != FS_CONTEXT_AWAITING_MOUNT)
+		goto err_unlock;
+
+	ret = -EPERM;
+	if ((fc->sb_flags & SB_MANDLOCK) && !may_mandlock())
+		goto err_unlock;
+
+	newmount.mnt = vfs_create_mount(fc, mnt_flags);
+	if (IS_ERR(newmount.mnt)) {
+		ret = PTR_ERR(newmount.mnt);
+		goto err_unlock;
+	}
+	newmount.dentry = dget(fc->root);
+
+	/* We've done the mount bit - now move the file context into more or
+	 * less the same state as if we'd done an fspick().  We don't want to
+	 * do any memory allocation or anything like that at this point as we
+	 * don't want to have to handle any errors incurred.
+	 */
+	vfs_clean_context(fc);
+
+	/* Attach to an apparent O_PATH fd with a note that we need to unmount
+	 * it, not just simply put it.
+	 */
+	file = dentry_open(&newmount, O_PATH, fc->cred);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto err_path;
+	}
+	file->f_mode |= FMODE_NEED_UNMOUNT;
+
+	ret = get_unused_fd_flags((flags & FSMOUNT_CLOEXEC) ? O_CLOEXEC : 0);
+	if (ret >= 0)
+		fd_install(ret, file);
+	else
+		fput(file);
+
+err_path:
+	path_put(&newmount);
+err_unlock:
+	mutex_unlock(&fc->uapi_mutex);
+err_fsfd:
+	fdput(f);
+	return ret;
+}
+
+/*
+ * Move a mount from one place to another.  In combination with
+ * fsopen()/fsmount() this is used to install a new mount and in combination
+ * with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be used to copy
+ * a mount subtree.
  *
  * Note the flags value is a combination of MOVE_MOUNT_* flags.
  */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4ab15fdf8aea..4697fad47789 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -913,6 +913,7 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
 asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
 			     const void __user *value, int aux);
+asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index fecbae30a30d..10281d582e28 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -349,6 +349,8 @@ typedef int __bitwise __kernel_rwf_t;
  */
 #define FSOPEN_CLOEXEC		0x00000001
 
+#define FSMOUNT_CLOEXEC		0x00000001
+
 /*
  * The type of fsconfig() call made.
  */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-09-21 16:30 [PATCH 00/34] VFS: Introduce filesystem context [ver #12] David Howells
                   ` (4 preceding siblings ...)
  2018-09-21 16:34 ` [PATCH 30/34] vfs: syscall: Add fsmount() to create a mount for a superblock " David Howells
@ 2018-09-21 16:34 ` David Howells
  2018-10-12 14:49   ` Alan Jenkins
  2018-10-04 18:37 ` [PATCH 00/34] VFS: Introduce filesystem context " Eric W. Biederman
  6 siblings, 1 reply; 24+ messages in thread
From: David Howells @ 2018-09-21 16:34 UTC (permalink / raw)
  To: viro
  Cc: linux-api, torvalds, dhowells, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

Provide an fspick() system call that can be used to pick an existing
mountpoint into an fs_context which can thereafter be used to reconfigure a
superblock (equivalent of the superblock side of -o remount).

This looks like:

	int fd = fspick(AT_FDCWD, "/mnt",
			FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
	fsconfig(fd, FSCONFIG_SET_FLAG, "intr", NULL, 0);
	fsconfig(fd, FSCONFIG_SET_FLAG, "noac", NULL, 0);
	fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);

At the point of fspick being called, the file descriptor referring to the
filesystem context is in exactly the same state as the one that was created
by fsopen() after fsmount() has been successfully called.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 fs/fsopen.c                            |   54 ++++++++++++++++++++++++++++++++
 include/linux/syscalls.h               |    1 +
 include/uapi/linux/fs.h                |    5 +++
 5 files changed, 62 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c78b68256f8a..d1eb6c815790 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -403,3 +403,4 @@
 389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
 390	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 391	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
+392	i386	fspick			sys_fspick			__ia32_sys_fspick
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index d44ead5d4368..d3ab703c02bb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -348,6 +348,7 @@
 337	common	fsopen			__x64_sys_fsopen
 338	common	fsconfig		__x64_sys_fsconfig
 339	common	fsmount			__x64_sys_fsmount
+340	common	fspick			__x64_sys_fspick
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 5955a6b65596..9ead9220e2cb 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -155,6 +155,60 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
 	return ret;
 }
 
+/*
+ * Pick a superblock into a context for reconfiguration.
+ */
+SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
+{
+	struct fs_context *fc;
+	struct path target;
+	unsigned int lookup_flags;
+	int ret;
+
+	if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if ((flags & ~(FSPICK_CLOEXEC |
+		       FSPICK_SYMLINK_NOFOLLOW |
+		       FSPICK_NO_AUTOMOUNT |
+		       FSPICK_EMPTY_PATH)) != 0)
+		return -EINVAL;
+
+	lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+	if (flags & FSPICK_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (flags & FSPICK_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (flags & FSPICK_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+	ret = user_path_at(dfd, path, lookup_flags, &target);
+	if (ret < 0)
+		goto err;
+
+	fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
+				0, 0, FS_CONTEXT_FOR_RECONFIGURE);
+	if (IS_ERR(fc)) {
+		ret = PTR_ERR(fc);
+		goto err_path;
+	}
+
+	fc->phase = FS_CONTEXT_RECONF_PARAMS;
+
+	ret = fscontext_alloc_log(fc);
+	if (ret < 0)
+		goto err_fc;
+
+	path_put(&target);
+	return fscontext_create_fd(fc, flags & FSPICK_CLOEXEC ? O_CLOEXEC : 0);
+
+err_fc:
+	put_fs_context(fc);
+err_path:
+	path_put(&target);
+err:
+	return ret;
+}
+
 /*
  * Check the state and apply the configuration.  Note that this function is
  * allowed to 'steal' the value by setting param->xxx to NULL before returning.
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4697fad47789..eb8d62f4ee24 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -914,6 +914,7 @@ asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
 asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
 			     const void __user *value, int aux);
 asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
+asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 10281d582e28..7f01503a9e9b 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -351,6 +351,11 @@ typedef int __bitwise __kernel_rwf_t;
 
 #define FSMOUNT_CLOEXEC		0x00000001
 
+#define FSPICK_CLOEXEC		0x00000001
+#define FSPICK_SYMLINK_NOFOLLOW	0x00000002
+#define FSPICK_NO_AUTOMOUNT	0x00000004
+#define FSPICK_EMPTY_PATH	0x00000008
+
 /*
  * The type of fsconfig() call made.
  */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/34] VFS: Introduce filesystem context [ver #12]
  2018-09-21 16:30 [PATCH 00/34] VFS: Introduce filesystem context [ver #12] David Howells
                   ` (5 preceding siblings ...)
  2018-09-21 16:34 ` [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration " David Howells
@ 2018-10-04 18:37 ` Eric W. Biederman
  6 siblings, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2018-10-04 18:37 UTC (permalink / raw)
  To: David Howells
  Cc: viro, John Johansen, Tejun Heo, Eric W. Biederman, selinux,
	Paul Moore, Li Zefan, linux-api, apparmor, Casey Schaufler,
	fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
	linux-security-module, Tetsuo Handa, Johannes Weiner,
	Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
	linux-kernel, mszeredi


David,

I have been going through these and it is a wonderful proof of concept
patchset.  There are a couple significant problems with it however.

- Many patches do more than one thing that could benefit from being
  broken up into more patches so that there is only one logical change
  per patch.  I have attempted a little of that and have found several
  significant bugs.

- There are many unnecessary changes in this patchset that just add
  noise and make it difficult to review.

- There are many typos and thinkos in this patchset that while not hard
  to correct keep this from being anywhere close to being ready for
  prime time.

- Some of the bugs I have encountered.
  * proc that isn't pid_ns_prepare_proc does not set fc->user_ns to
    match the pid namespace.
  * mqueue does not set fc->user_ns to match the ipc namespace.
  * The cpuset filesystem always fails to mount
  * Non-converted filesystems don't have the old security hooks
    and only have a bit blob so don't call into the new security
    hooks either.
  * The changes to implement the new security hooks at least for
    selinux are riddled with typos, and thinkos.

I was hoping to get into the semantic questions but I can't get
there until I get a good solid baseline patch to work with.

I have been able to hoist the permission check out of sget_fc for
converted filesystems.  So progress is being made.  That absolutely
requires fc->user_ns to be set properly before vfs_get_tree.  Something
that still needs to be fixed.

I have also observed that by not allowing unconverted filesystems
to mount using the new api.  The compatbitility code can be
significantly simplified, and the who data_size problem goes away.

I am going to be travelling for the next couple of days so I
don't expect I will be able to answer questions in a timely manner.
In the hopes that it might help below is my work in progress git
tree where I have cleaned up some of these issues.

https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git new-mount-api-testing

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-09-21 16:34 ` [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration " David Howells
@ 2018-10-12 14:49   ` Alan Jenkins
  2018-10-13  6:11     ` Al Viro
  0 siblings, 1 reply; 24+ messages in thread
From: Alan Jenkins @ 2018-10-12 14:49 UTC (permalink / raw)
  To: David Howells, viro
  Cc: linux-api, torvalds, ebiederm, linux-fsdevel, linux-kernel, mszeredi

On 21/09/2018 17:34, David Howells wrote:
> Provide an fspick() system call that can be used to pick an existing
> mountpoint into an fs_context which can thereafter be used to reconfigure a
> superblock (equivalent of the superblock side of -o remount).
>
> This looks like:
>
> 	int fd = fspick(AT_FDCWD, "/mnt",
> 			FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
> 	fsconfig(fd, FSCONFIG_SET_FLAG, "intr", NULL, 0);
> 	fsconfig(fd, FSCONFIG_SET_FLAG, "noac", NULL, 0);
> 	fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
>
> At the point of fspick being called, the file descriptor referring to the
> filesystem context is in exactly the same state as the one that was created
> by fsopen() after fsmount() has been successfully called.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: linux-api@vger.kernel.org
> ---
>
>   arch/x86/entry/syscalls/syscall_32.tbl |    1 +
>   arch/x86/entry/syscalls/syscall_64.tbl |    1 +
>   fs/fsopen.c                            |   54 ++++++++++++++++++++++++++++++++
>   include/linux/syscalls.h               |    1 +
>   include/uapi/linux/fs.h                |    5 +++
>   5 files changed, 62 insertions(+)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index c78b68256f8a..d1eb6c815790 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -403,3 +403,4 @@
>   389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
>   390	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
>   391	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
> +392	i386	fspick			sys_fspick			__ia32_sys_fspick
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index d44ead5d4368..d3ab703c02bb 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -348,6 +348,7 @@
>   337	common	fsopen			__x64_sys_fsopen
>   338	common	fsconfig		__x64_sys_fsconfig
>   339	common	fsmount			__x64_sys_fsmount
> +340	common	fspick			__x64_sys_fspick
>   
>   #
>   # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/fsopen.c b/fs/fsopen.c
> index 5955a6b65596..9ead9220e2cb 100644
> --- a/fs/fsopen.c
> +++ b/fs/fsopen.c
> @@ -155,6 +155,60 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
>   	return ret;
>   }
>   
> +/*
> + * Pick a superblock into a context for reconfiguration.
> + */
> +SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
> +{
> +	struct fs_context *fc;
> +	struct path target;
> +	unsigned int lookup_flags;
> +	int ret;
> +
> +	if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
> +		return -EPERM;


This seems to accept basically any mount.  Specifically: are you sure 
it's OK to return a handle to a SB_NO_USER superblock?

# strace -f -v -e trace=154 \
     ./fspick 3</proc/self/ns/mnt 3 \
     stat -f /dev/fd/3

syscall_0x154(0x3, 0x4009a1, 0x8, ...) = 0x4
   File: "/dev/fd/3"
     ID: 0        Namelen: 255     Type: anon-inode FS
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 0          Free: 0          Available: 0
Inodes: Total: 0          Free: 0
+++ exited with 0 +++

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-12 14:49   ` Alan Jenkins
@ 2018-10-13  6:11     ` Al Viro
  2018-10-13  9:45       ` Alan Jenkins
                         ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Al Viro @ 2018-10-13  6:11 UTC (permalink / raw)
  To: Alan Jenkins
  Cc: David Howells, linux-api, torvalds, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

On Fri, Oct 12, 2018 at 03:49:50PM +0100, Alan Jenkins wrote:
> > +SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
> > +{
> > +	struct fs_context *fc;
> > +	struct path target;
> > +	unsigned int lookup_flags;
> > +	int ret;
> > +
> > +	if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
> > +		return -EPERM;
> 
> 
> This seems to accept basically any mount.  Specifically: are you sure it's
> OK to return a handle to a SB_NO_USER superblock?

Umm...  As long as we don't try to do pathname resolution from its ->s_root,
shouldn't be a problem and I don't see anything that would do that.  I might've
missed something, but...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-13  6:11     ` Al Viro
@ 2018-10-13  9:45       ` Alan Jenkins
  2018-10-13 23:04         ` Andy Lutomirski
  2018-10-17 13:15       ` David Howells
  2018-10-17 13:20       ` David Howells
  2 siblings, 1 reply; 24+ messages in thread
From: Alan Jenkins @ 2018-10-13  9:45 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, linux-api, torvalds, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

On 13/10/2018 07:11, Al Viro wrote:
> On Fri, Oct 12, 2018 at 03:49:50PM +0100, Alan Jenkins wrote:
>>> +SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
>>> +{
>>> +	struct fs_context *fc;
>>> +	struct path target;
>>> +	unsigned int lookup_flags;
>>> +	int ret;
>>> +
>>> +	if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
>>> +		return -EPERM;
>>
>> This seems to accept basically any mount.  Specifically: are you sure it's
>> OK to return a handle to a SB_NO_USER superblock?
> Umm...  As long as we don't try to do pathname resolution from its ->s_root,
> shouldn't be a problem and I don't see anything that would do that.  I might've
> missed something, but...

Sorry, I guess SB_NOUSER was the wrong word.  I was trying find if 
anything stopped things like

int memfd = memfd_create("foo", 0);
int fsfd = fspick(memfd, "", FSPICK_EMPTY_PATH);

fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
fsconfig(fsfd, FSCONFIG_SET_STRING, "size", "100M", 0);
fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);

So far I'm getting -EBUSY if I try to apply the "ro", -EINVAL if I try 
to apply the "size=100M".  But if I don't apply either, then 
FSCONFIG_CMD_RECONFIGURE succeeds.

It seems worrying that it might let me set options on shm_mnt. Or at 
least letting me get as far as the -EBUSY check for the "ro" superblock 
flag.

I'm not sure why I'm getting the -EINVAL setting the "size" option.  But 
it would be much more reassuring if I was getting -EPERM :-).

Alan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-13  9:45       ` Alan Jenkins
@ 2018-10-13 23:04         ` Andy Lutomirski
  0 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2018-10-13 23:04 UTC (permalink / raw)
  To: alan.christopher.jenkins
  Cc: Al Viro, David Howells, Linux API, Linus Torvalds,
	Eric W. Biederman, Linux FS Devel, LKML, Miklos Szeredi

On Sat, Oct 13, 2018 at 2:45 AM Alan Jenkins
<alan.christopher.jenkins@gmail.com> wrote:
>
> On 13/10/2018 07:11, Al Viro wrote:
> > On Fri, Oct 12, 2018 at 03:49:50PM +0100, Alan Jenkins wrote:
> >>> +SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
> >>> +{
> >>> +   struct fs_context *fc;
> >>> +   struct path target;
> >>> +   unsigned int lookup_flags;
> >>> +   int ret;
> >>> +
> >>> +   if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
> >>> +           return -EPERM;
> >>
> >> This seems to accept basically any mount.  Specifically: are you sure it's
> >> OK to return a handle to a SB_NO_USER superblock?
> > Umm...  As long as we don't try to do pathname resolution from its ->s_root,
> > shouldn't be a problem and I don't see anything that would do that.  I might've
> > missed something, but...
>
> Sorry, I guess SB_NOUSER was the wrong word.  I was trying find if
> anything stopped things like
>
> int memfd = memfd_create("foo", 0);
> int fsfd = fspick(memfd, "", FSPICK_EMPTY_PATH);
>
> fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> fsconfig(fsfd, FSCONFIG_SET_STRING, "size", "100M", 0);
> fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
>
> So far I'm getting -EBUSY if I try to apply the "ro", -EINVAL if I try
> to apply the "size=100M".  But if I don't apply either, then
> FSCONFIG_CMD_RECONFIGURE succeeds.
>
> It seems worrying that it might let me set options on shm_mnt. Or at
> least letting me get as far as the -EBUSY check for the "ro" superblock
> flag.
>
> I'm not sure why I'm getting the -EINVAL setting the "size" option.  But
> it would be much more reassuring if I was getting -EPERM :-).
>

I would argue that the filesystem associated with a memfd, and even
the fact that there *is* a filesystem, is none of user code's
business.  So that fspick() call should return -EINVAL or similar.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-13  6:11     ` Al Viro
  2018-10-13  9:45       ` Alan Jenkins
@ 2018-10-17 13:15       ` David Howells
  2018-10-17 13:20       ` David Howells
  2 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2018-10-17 13:15 UTC (permalink / raw)
  To: Alan Jenkins
  Cc: dhowells, Al Viro, linux-api, torvalds, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:

> Sorry, I guess SB_NOUSER was the wrong word.  I was trying find if anything
> stopped things like
> 
> int memfd = memfd_create("foo", 0);
> int fsfd = fspick(memfd, "", FSPICK_EMPTY_PATH);
> 
> fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> fsconfig(fsfd, FSCONFIG_SET_STRING, "size", "100M", 0);
> fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
> 
> So far I'm getting -EBUSY if I try to apply the "ro", -EINVAL if I try to
> apply the "size=100M".  But if I don't apply either, then
> FSCONFIG_CMD_RECONFIGURE succeeds.

I should probably check that the picked point is actually a mountpoint.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-13  6:11     ` Al Viro
  2018-10-13  9:45       ` Alan Jenkins
  2018-10-17 13:15       ` David Howells
@ 2018-10-17 13:20       ` David Howells
  2018-10-17 14:31         ` Alan Jenkins
  2018-10-17 15:45         ` David Howells
  2 siblings, 2 replies; 24+ messages in thread
From: David Howells @ 2018-10-17 13:20 UTC (permalink / raw)
  Cc: dhowells, Alan Jenkins, Al Viro, linux-api, torvalds, ebiederm,
	linux-fsdevel, linux-kernel, mszeredi

David Howells <dhowells@redhat.com> wrote:

> I should probably check that the picked point is actually a mountpoint.

The root of the mount object at the path specified, that is, perhaps with
something like the attached.

David
---
diff --git a/fs/fsopen.c b/fs/fsopen.c
index f673e93ac456..aaaaa17a233c 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
 	if (ret < 0)
 		goto err;
 
+	ret = -EINVAL;
+	if (target.mnt->mnt_root != target.dentry)
+		goto err_path;
+
 	fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
 				0, 0, FS_CONTEXT_FOR_RECONFIGURE);
 	if (IS_ERR(fc)) {

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-17 13:20       ` David Howells
@ 2018-10-17 14:31         ` Alan Jenkins
  2018-10-17 14:35           ` Eric W. Biederman
  2018-10-17 15:24           ` David Howells
  2018-10-17 15:45         ` David Howells
  1 sibling, 2 replies; 24+ messages in thread
From: Alan Jenkins @ 2018-10-17 14:31 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, linux-api, torvalds, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

On 17/10/2018 14:20, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
>
>> I should probably check that the picked point is actually a mountpoint.
> The root of the mount object at the path specified, that is, perhaps with
> something like the attached.
>
> David


I agree.  I'm happy to see this is using the same check as do_remount().


* change filesystem flags. dir should be a physical root of filesystem.
* If you've mounted a non-root directory somewhere and want to do remount
* on it - tough luck.
*/


Thanks

Alan


> ---
> diff --git a/fs/fsopen.c b/fs/fsopen.c
> index f673e93ac456..aaaaa17a233c 100644
> --- a/fs/fsopen.c
> +++ b/fs/fsopen.c
> @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
>   	if (ret < 0)
>   		goto err;
>   
> +	ret = -EINVAL;
> +	if (target.mnt->mnt_root != target.dentry)
> +		goto err_path;
> +
>   	fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
>   				0, 0, FS_CONTEXT_FOR_RECONFIGURE);
>   	if (IS_ERR(fc)) {
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-17 14:31         ` Alan Jenkins
@ 2018-10-17 14:35           ` Eric W. Biederman
  2018-10-17 14:55             ` Alan Jenkins
  2018-10-17 15:24           ` David Howells
  1 sibling, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2018-10-17 14:35 UTC (permalink / raw)
  To: Alan Jenkins
  Cc: David Howells, Al Viro, linux-api, torvalds, linux-fsdevel,
	linux-kernel, mszeredi

Alan Jenkins <alan.christopher.jenkins@gmail.com> writes:

> On 17/10/2018 14:20, David Howells wrote:
>> David Howells <dhowells@redhat.com> wrote:
>>
>>> I should probably check that the picked point is actually a mountpoint.
>> The root of the mount object at the path specified, that is, perhaps with
>> something like the attached.
>>
>> David
>
>
> I agree.  I'm happy to see this is using the same check as do_remount().
>
>
> * change filesystem flags. dir should be a physical root of filesystem.
> * If you've mounted a non-root directory somewhere and want to do remount
> * on it - tough luck.
> */

Davids check will work for bind mounts as well.  It just won't work it
just won't work for files or subdirectories of some mountpoint.

Eric

>> ---
>> diff --git a/fs/fsopen.c b/fs/fsopen.c
>> index f673e93ac456..aaaaa17a233c 100644
>> --- a/fs/fsopen.c
>> +++ b/fs/fsopen.c
>> @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
>>   	if (ret < 0)
>>   		goto err;
>>   +	ret = -EINVAL;
>> +	if (target.mnt->mnt_root != target.dentry)
>> +		goto err_path;
>> +
>>   	fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
>>   				0, 0, FS_CONTEXT_FOR_RECONFIGURE);
>>   	if (IS_ERR(fc)) {
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-17 14:35           ` Eric W. Biederman
@ 2018-10-17 14:55             ` Alan Jenkins
  0 siblings, 0 replies; 24+ messages in thread
From: Alan Jenkins @ 2018-10-17 14:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Howells, Al Viro, linux-api, torvalds, linux-fsdevel,
	linux-kernel, mszeredi

On 17/10/2018 15:35, Eric W. Biederman wrote:
> Alan Jenkins <alan.christopher.jenkins@gmail.com> writes:
>
>> On 17/10/2018 14:20, David Howells wrote:
>>> David Howells <dhowells@redhat.com> wrote:
>>>
>>>> I should probably check that the picked point is actually a mountpoint.
>>> The root of the mount object at the path specified, that is, perhaps with
>>> something like the attached.
>>>
>>> David
>>
>> I agree.  I'm happy to see this is using the same check as do_remount().
>>
>>
>> * change filesystem flags. dir should be a physical root of filesystem.
>> * If you've mounted a non-root directory somewhere and want to do remount
>> * on it - tough luck.
>> */
> Davids check will work for bind mounts as well.  It just won't work it
> just won't work for files or subdirectories of some mountpoint.
>
> Eric


I see.  Then I am still happy to see the fspick() check match a check in 
do_remount() (and it still solves the problem I was worried about).

I cannot blame David for the do_remount() comment being out of date :-).

# uname -r
4.18.10-200.fc.28.x86_64
# mount --bind /mnt /mnt
# mount -oremount,debug /mnt
# findmnt /mnt; findmnt /
[findmnt shows / has been remounted, adding the ext4 "debug" mount option]


>
>>> ---
>>> diff --git a/fs/fsopen.c b/fs/fsopen.c
>>> index f673e93ac456..aaaaa17a233c 100644
>>> --- a/fs/fsopen.c
>>> +++ b/fs/fsopen.c
>>> @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
>>>    	if (ret < 0)
>>>    		goto err;
>>>    +	ret = -EINVAL;
>>> +	if (target.mnt->mnt_root != target.dentry)
>>> +		goto err_path;
>>> +
>>>    	fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
>>>    				0, 0, FS_CONTEXT_FOR_RECONFIGURE);
>>>    	if (IS_ERR(fc)) {
>>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-17 14:31         ` Alan Jenkins
  2018-10-17 14:35           ` Eric W. Biederman
@ 2018-10-17 15:24           ` David Howells
  2018-10-17 15:38             ` Eric W. Biederman
  1 sibling, 1 reply; 24+ messages in thread
From: David Howells @ 2018-10-17 15:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: dhowells, Alan Jenkins, Al Viro, linux-api, torvalds,
	linux-fsdevel, linux-kernel, mszeredi

Eric W. Biederman <ebiederm@xmission.com> wrote:

> Davids check will work for bind mounts as well.  It just won't work it
> just won't work for files or subdirectories of some mountpoint.

Bind mounts have to be done with open_tree() and move_mount().  You can't now
do fsmount() on something fspick()'d.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-17 15:24           ` David Howells
@ 2018-10-17 15:38             ` Eric W. Biederman
  0 siblings, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2018-10-17 15:38 UTC (permalink / raw)
  To: David Howells
  Cc: Alan Jenkins, Al Viro, linux-api, torvalds, linux-fsdevel,
	linux-kernel, mszeredi

David Howells <dhowells@redhat.com> writes:

> Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> Davids check will work for bind mounts as well.  It just won't work it
>> just won't work for files or subdirectories of some mountpoint.
>
> Bind mounts have to be done with open_tree() and move_mount().  You can't now
> do fsmount() on something fspick()'d.

But a bind mount will have mnt_root set to the the dentry that was
bound.

Therefore fspick as you are proposing modifying will work for the root
of bind mounts, as well as the root of regular mounts.  My apologies for
not being clear.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-17 13:20       ` David Howells
  2018-10-17 14:31         ` Alan Jenkins
@ 2018-10-17 15:45         ` David Howells
  2018-10-17 17:41           ` Alan Jenkins
                             ` (2 more replies)
  1 sibling, 3 replies; 24+ messages in thread
From: David Howells @ 2018-10-17 15:45 UTC (permalink / raw)
  To: Alan Jenkins
  Cc: dhowells, Al Viro, linux-api, torvalds, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:

> I agree.  I'm happy to see this is using the same check as do_remount().
> 
> 
> * change filesystem flags. dir should be a physical root of filesystem.
> * If you've mounted a non-root directory somewhere and want to do remount
> * on it - tough luck.
> */

Are you suggesting that it should only work at the ultimate root of a
superblock and not a bind mount somewhere within?

That's tricky to make work for NFS because s_root is a dummy dentry.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-17 15:45         ` David Howells
@ 2018-10-17 17:41           ` Alan Jenkins
  2018-10-17 21:20           ` David Howells
  2018-10-17 22:13           ` Alan Jenkins
  2 siblings, 0 replies; 24+ messages in thread
From: Alan Jenkins @ 2018-10-17 17:41 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, linux-api, torvalds, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

On 17/10/2018 16:45, David Howells wrote:
> Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:
>
>> I agree.  I'm happy to see this is using the same check as do_remount().
>>
>>
>> * change filesystem flags. dir should be a physical root of filesystem.
>> * If you've mounted a non-root directory somewhere and want to do remount
>> * on it - tough luck.
>> */
> Are you suggesting that it should only work at the ultimate root of a
> superblock and not a bind mount somewhere within?
>
> That's tricky to make work for NFS because s_root is a dummy dentry.
>
> David


Retro-actively: I do not suggest that.

I tried to answer this question in my reply to Eric correcting me.  Eric 
was right to correct me.  I now understand the comment above 
do_remount() is incorrect.  I re-reviewed your diff in light of that.  I 
re-endorse your diff as a way to solve the problem I raised.

(I think it would be useful to remove the misleading comment above 
do_remount(), to avoid future confusion.)

 > @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char 
__user *, path, unsigned int, flags

>  	if (ret < 0)
>  		goto err;
>  
> +	ret = -EINVAL;
> +	if (target.mnt->mnt_root != target.dentry)
> +		goto err_path;
> +

( the "if" statement it adds to fspick() is equivalent to the second 
"if" statement in do_remount(): )

static  int  do_remount <https://elixir.bootlin.com/linux/v4.18/ident/do_remount>(struct  path <https://elixir.bootlin.com/linux/v4.18/ident/path>  *path <https://elixir.bootlin.com/linux/v4.18/ident/path>,  int  ms_flags,  int  sb_flags,
		int  mnt_flags,  void  *data)
{
	int  err;
	struct  super_block <https://elixir.bootlin.com/linux/v4.18/ident/super_block>  *sb  =  path <https://elixir.bootlin.com/linux/v4.18/ident/path>->mnt 
<https://elixir.bootlin.com/linux/v4.18/ident/mnt>->mnt_sb;
	struct  mount <https://elixir.bootlin.com/linux/v4.18/ident/mount>  *mnt <https://elixir.bootlin.com/linux/v4.18/ident/mnt>  =  real_mount 
<https://elixir.bootlin.com/linux/v4.18/ident/real_mount>(path 
<https://elixir.bootlin.com/linux/v4.18/ident/path>->mnt 
<https://elixir.bootlin.com/linux/v4.18/ident/mnt>);

	if  (!check_mnt <https://elixir.bootlin.com/linux/v4.18/ident/check_mnt>(mnt 
<https://elixir.bootlin.com/linux/v4.18/ident/mnt>))
		return  -EINVAL <https://elixir.bootlin.com/linux/v4.18/ident/EINVAL>;

	if  (path <https://elixir.bootlin.com/linux/v4.18/ident/path>->dentry  !=  path <https://elixir.bootlin.com/linux/v4.18/ident/path>->mnt 
<https://elixir.bootlin.com/linux/v4.18/ident/mnt>->mnt_root)
		return  -EINVAL <https://elixir.bootlin.com/linux/v4.18/ident/EINVAL>;

Thanks

Alan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-17 15:45         ` David Howells
  2018-10-17 17:41           ` Alan Jenkins
@ 2018-10-17 21:20           ` David Howells
  2018-10-17 22:13           ` Alan Jenkins
  2 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2018-10-17 21:20 UTC (permalink / raw)
  To: Alan Jenkins
  Cc: dhowells, Al Viro, linux-api, torvalds, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:

> static  int  do_remount <https://elixir.bootlin.com/linux/v4.18/ident/do_remount>(struct  path <https://elixir.bootlin.com/linux/v4.18/ident/path>  *path <https://elixir.bootlin.com/linux/v4.18/ident/path>,  int  ms_flags,  int  sb_flags,
> 		int  mnt_flags,  void  *data)

What happened there?  You seem to have had a load of URLs substituted in.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]
  2018-10-17 15:45         ` David Howells
  2018-10-17 17:41           ` Alan Jenkins
  2018-10-17 21:20           ` David Howells
@ 2018-10-17 22:13           ` Alan Jenkins
  2 siblings, 0 replies; 24+ messages in thread
From: Alan Jenkins @ 2018-10-17 22:13 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, linux-api, torvalds, ebiederm, linux-fsdevel,
	linux-kernel, mszeredi

[resent, hopefully with slightly less formatting damage]

On 17/10/2018 16:45, David Howells wrote:

> Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:
>
>> I agree. I'm happy to see this is using the same check as do_remount().
>>
>>
>> * change filesystem flags. dir should be a physical root of filesystem.
>> * If you've mounted a non-root directory somewhere and want to do remount
>> * on it - tough luck.
>> */
> Are you suggesting that it should only work at the ultimate root of a
> superblock and not a bind mount somewhere within?
>
> That's tricky to make work for NFS because s_root is a dummy dentry.
>
> David


Retro-actively: I do not suggest that.

I tried to answer this question in my reply to Eric correcting me. Eric 
was right to correct me.  I now understand the comment above 
do_remount() is incorrect.  I re-reviewed your diff in light of that.  I 
re-endorse your diff as a way to solve the problem I raised.

(I think it would be useful to remove the misleading comment above 
do_remount(), to avoid future confusion.)


> @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
>   	if (ret < 0)
>   		goto err;
>   
> +	ret = -EINVAL;
> +	if (target.mnt->mnt_root != target.dentry)
> +		goto err_path;
> +
>   	fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
>   				0, 0, FS_CONTEXT_FOR_RECONFIGURE);
>   	if (IS_ERR(fc)) {


( the "if" statement it adds to fspick() is equivalent to the second 
"if" statement in do_remount(): )

static int do_remount(struct path *path, int ms_flags, int sb_flags,
		      int mnt_flags, void *data)
{
	int err;
	struct super_block *sb = path->mnt->mnt_sb;
	struct mount *mnt = real_mount(path->mnt);

	if (!check_mnt(mnt))
		return -EINVAL;

	if (path->dentry != path->mnt->mnt_root)
		return -EINVAL;

Thanks

Alan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01/34] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #12]
  2018-09-21 16:30 ` [PATCH 01/34] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
@ 2018-10-21 16:41   ` Eric W. Biederman
  0 siblings, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2018-10-21 16:41 UTC (permalink / raw)
  To: David Howells
  Cc: viro, linux-api, torvalds, linux-fsdevel, linux-kernel, mszeredi

David Howells <dhowells@redhat.com> writes:

> From: Al Viro <viro@zeniv.linux.org.uk>
>
> open_tree(dfd, pathname, flags)
>
> Returns an O_PATH-opened file descriptor or an error.
> dfd and pathname specify the location to open, in usual
> fashion (see e.g. fstatat(2)).  flags should be an OR of
> some of the following:
> 	* AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
> same meanings as usual
> 	* OPEN_TREE_CLOEXEC - make the resulting descriptor
> close-on-exec
> 	* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
> instead of opening the location in question, create a detached
> mount tree matching the subtree rooted at location specified by
> dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
> without it - only the part within in the mount containing the
> location in question.  In other words, the same as mount --rbind
> or mount --bind would've taken.  The detached tree will be
> dissolved on the final close of obtained file.  Creation of such
> detached trees requires the same capabilities as doing mount --bind.


What happens when mounts propgate to such a detached mount tree?

It looks to me like the test in propagate_one for setting
CL_UNPRIVILEGED will trigger a NULL pointer dereference:

AKA:
	/* Notice when we are propagating across user namespaces */
	if (m->mnt_ns->user_ns != user_ns)
		type |= CL_UNPRIVILEGED;

Since we don't know which mount namespace if any this subtree is going
into the test should become:

	if (!m->mnt_ns || (m->mnt_ns->user_ns != user_ns))
		type |= CL_UNPRIVILEGED;

If the tree is never attached anywhere it won't hurt as we don't allow
mounts or umounts on the detached subtree.  We don't have enough
information to know about the namespace we copied from if it would cause
CL_UNPRIVILEGED to be set on any given propagation.  Similarly we don't
have any information at all about the mount namespace this tree may
possibily be copied to.

Eric


> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: linux-api@vger.kernel.org
> ---
>
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 
>  fs/file_table.c                        |    9 +-
>  fs/internal.h                          |    1 
>  fs/namespace.c                         |  132 +++++++++++++++++++++++++++-----
>  include/linux/fs.h                     |    7 +-
>  include/linux/syscalls.h               |    1 
>  include/uapi/linux/fcntl.h             |    2 
>  include/uapi/linux/mount.h             |   10 ++
>  9 files changed, 137 insertions(+), 27 deletions(-)
>  create mode 100644 include/uapi/linux/mount.h
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..ea1b413afd47 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
>  384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
>  385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
>  386	i386	rseq			sys_rseq			__ia32_sys_rseq
> +387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..0545bed581dc 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
>  332	common	statx			__x64_sys_statx
>  333	common	io_pgetevents		__x64_sys_io_pgetevents
>  334	common	rseq			__x64_sys_rseq
> +335	common	open_tree		__x64_sys_open_tree
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/file_table.c b/fs/file_table.c
> index e49af4caf15d..e03c8d121c6c 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -255,6 +255,7 @@ static void __fput(struct file *file)
>  	struct dentry *dentry = file->f_path.dentry;
>  	struct vfsmount *mnt = file->f_path.mnt;
>  	struct inode *inode = file->f_inode;
> +	fmode_t mode = file->f_mode;
>  
>  	if (unlikely(!(file->f_mode & FMODE_OPENED)))
>  		goto out;
> @@ -277,18 +278,20 @@ static void __fput(struct file *file)
>  	if (file->f_op->release)
>  		file->f_op->release(inode, file);
>  	if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
> -		     !(file->f_mode & FMODE_PATH))) {
> +		     !(mode & FMODE_PATH))) {
>  		cdev_put(inode->i_cdev);
>  	}
>  	fops_put(file->f_op);
>  	put_pid(file->f_owner.pid);
> -	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
> +	if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
>  		i_readcount_dec(inode);
> -	if (file->f_mode & FMODE_WRITER) {
> +	if (mode & FMODE_WRITER) {
>  		put_write_access(inode);
>  		__mnt_drop_write(mnt);
>  	}
>  	dput(dentry);
> +	if (unlikely(mode & FMODE_NEED_UNMOUNT))
> +		dissolve_on_fput(mnt);
>  	mntput(mnt);
>  out:
>  	file_free(file);
> diff --git a/fs/internal.h b/fs/internal.h
> index 364c20b5ea2d..17029b30e196 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -85,6 +85,7 @@ extern int __mnt_want_write_file(struct file *);
>  extern void __mnt_drop_write(struct vfsmount *);
>  extern void __mnt_drop_write_file(struct file *);
>  
> +extern void dissolve_on_fput(struct vfsmount *);
>  /*
>   * fs_struct.c
>   */
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 8a7e1a7d1d06..ded1a970ec40 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -20,12 +20,14 @@
>  #include <linux/init.h>		/* init_rootfs */
>  #include <linux/fs_struct.h>	/* get_fs_root et.al. */
>  #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
> +#include <linux/file.h>
>  #include <linux/uaccess.h>
>  #include <linux/proc_ns.h>
>  #include <linux/magic.h>
>  #include <linux/bootmem.h>
>  #include <linux/task_work.h>
>  #include <linux/sched/task.h>
> +#include <uapi/linux/mount.h>
>  
>  #include "pnode.h"
>  #include "internal.h"
> @@ -1779,6 +1781,16 @@ struct vfsmount *collect_mounts(const struct path *path)
>  	return &tree->mnt;
>  }
>  
> +void dissolve_on_fput(struct vfsmount *mnt)
> +{
> +	namespace_lock();
> +	lock_mount_hash();
> +	mntget(mnt);
> +	umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> +	unlock_mount_hash();
> +	namespace_unlock();
> +}
> +
>  void drop_collected_mounts(struct vfsmount *mnt)
>  {
>  	namespace_lock();
> @@ -2138,6 +2150,30 @@ static bool has_locked_children(struct mount *mnt, struct dentry *dentry)
>  	return false;
>  }
>  
> +static struct mount *__do_loopback(struct path *old_path, int recurse)
> +{
> +	struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);
> +
> +	if (IS_MNT_UNBINDABLE(old))
> +		return mnt;
> +
> +	if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
> +		return mnt;
> +
> +	if (!recurse && has_locked_children(old, old_path->dentry))
> +		return mnt;
> +
> +	if (recurse)
> +		mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
> +	else
> +		mnt = clone_mnt(old, old_path->dentry, 0);
> +
> +	if (!IS_ERR(mnt))
> +		mnt->mnt.mnt_flags &= ~MNT_LOCKED;
> +
> +	return mnt;
> +}
> +
>  /*
>   * do loopback mount.
>   */
> @@ -2145,7 +2181,7 @@ static int do_loopback(struct path *path, const char *old_name,
>  				int recurse)
>  {
>  	struct path old_path;
> -	struct mount *mnt = NULL, *old, *parent;
> +	struct mount *mnt = NULL, *parent;
>  	struct mountpoint *mp;
>  	int err;
>  	if (!old_name || !*old_name)
> @@ -2159,38 +2195,21 @@ static int do_loopback(struct path *path, const char *old_name,
>  		goto out;
>  
>  	mp = lock_mount(path);
> -	err = PTR_ERR(mp);
> -	if (IS_ERR(mp))
> +	if (IS_ERR(mp)) {
> +		err = PTR_ERR(mp);
>  		goto out;
> +	}
>  
> -	old = real_mount(old_path.mnt);
>  	parent = real_mount(path->mnt);
> -
> -	err = -EINVAL;
> -	if (IS_MNT_UNBINDABLE(old))
> -		goto out2;
> -
>  	if (!check_mnt(parent))
>  		goto out2;
>  
> -	if (!check_mnt(old) && old_path.dentry->d_op != &ns_dentry_operations)
> -		goto out2;
> -
> -	if (!recurse && has_locked_children(old, old_path.dentry))
> -		goto out2;
> -
> -	if (recurse)
> -		mnt = copy_tree(old, old_path.dentry, CL_COPY_MNT_NS_FILE);
> -	else
> -		mnt = clone_mnt(old, old_path.dentry, 0);
> -
> +	mnt = __do_loopback(&old_path, recurse);
>  	if (IS_ERR(mnt)) {
>  		err = PTR_ERR(mnt);
>  		goto out2;
>  	}
>  
> -	mnt->mnt.mnt_flags &= ~MNT_LOCKED;
> -
>  	err = graft_tree(mnt, parent, mp);
>  	if (err) {
>  		lock_mount_hash();
> @@ -2204,6 +2223,75 @@ static int do_loopback(struct path *path, const char *old_name,
>  	return err;
>  }
>  
> +SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
> +{
> +	struct file *file;
> +	struct path path;
> +	int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
> +	bool detached = flags & OPEN_TREE_CLONE;
> +	int error;
> +	int fd;
> +
> +	BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
> +
> +	if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
> +		      AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
> +		      OPEN_TREE_CLOEXEC))
> +		return -EINVAL;
> +
> +	if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
> +		return -EINVAL;
> +
> +	if (flags & AT_NO_AUTOMOUNT)
> +		lookup_flags &= ~LOOKUP_AUTOMOUNT;
> +	if (flags & AT_SYMLINK_NOFOLLOW)
> +		lookup_flags &= ~LOOKUP_FOLLOW;
> +	if (flags & AT_EMPTY_PATH)
> +		lookup_flags |= LOOKUP_EMPTY;
> +
> +	if (detached && !may_mount())
> +		return -EPERM;
> +
> +	fd = get_unused_fd_flags(flags & O_CLOEXEC);
> +	if (fd < 0)
> +		return fd;
> +
> +	error = user_path_at(dfd, filename, lookup_flags, &path);
> +	if (error)
> +		goto out;
> +
> +	if (detached) {
> +		struct mount *mnt = __do_loopback(&path, flags & AT_RECURSIVE);
> +		if (IS_ERR(mnt)) {
> +			error = PTR_ERR(mnt);
> +			goto out2;
> +		}
> +		mntput(path.mnt);
> +		path.mnt = &mnt->mnt;
> +	}
> +
> +	file = dentry_open(&path, O_PATH, current_cred());
> +	if (IS_ERR(file)) {
> +		error = PTR_ERR(file);
> +		goto out3;
> +	}
> +
> +	if (detached)
> +		file->f_mode |= FMODE_NEED_UNMOUNT;
> +	path_put(&path);
> +	fd_install(fd, file);
> +	return fd;
> +
> +out3:
> +	if (detached)
> +		dissolve_on_fput(path.mnt);
> +out2:
> +	path_put(&path);
> +out:
> +	put_unused_fd(fd);
> +	return error;
> +}
> +
>  static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
>  {
>  	int error = 0;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4323b8fe353d..6dc32507762f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -157,10 +157,13 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  #define FMODE_NONOTIFY		((__force fmode_t)0x4000000)
>  
>  /* File is capable of returning -EAGAIN if I/O will block */
> -#define FMODE_NOWAIT	((__force fmode_t)0x8000000)
> +#define FMODE_NOWAIT		((__force fmode_t)0x8000000)
> +
> +/* File represents mount that needs unmounting */
> +#define FMODE_NEED_UNMOUNT	((__force fmode_t)0x10000000)
>  
>  /* File does not contribute to nr_files count */
> -#define FMODE_NOACCOUNT	((__force fmode_t)0x20000000)
> +#define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
>  
>  /*
>   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 2ff814c92f7f..6978f3c76d41 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -906,6 +906,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
>  			  unsigned mask, struct statx __user *buffer);
>  asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
>  			 int flags, uint32_t sig);
> +asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 6448cdd9a350..594b85f7cb86 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -90,5 +90,7 @@
>  #define AT_STATX_FORCE_SYNC	0x2000	/* - Force the attributes to be sync'd with the server */
>  #define AT_STATX_DONT_SYNC	0x4000	/* - Don't sync attributes with the server */
>  
> +#define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
> +
>  
>  #endif /* _UAPI_LINUX_FCNTL_H */
> diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> new file mode 100644
> index 000000000000..e8db2911adca
> --- /dev/null
> +++ b/include/uapi/linux/mount.h
> @@ -0,0 +1,10 @@
> +#ifndef _UAPI_LINUX_MOUNT_H
> +#define _UAPI_LINUX_MOUNT_H
> +
> +/*
> + * open_tree() flags.
> + */
> +#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
> +#define OPEN_TREE_CLOEXEC	O_CLOEXEC	/* Close the file on execve() */
> +
> +#endif /* _UAPI_LINUX_MOUNT_H */

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2018-10-21 16:41 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-21 16:30 [PATCH 00/34] VFS: Introduce filesystem context [ver #12] David Howells
2018-09-21 16:30 ` [PATCH 01/34] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
2018-10-21 16:41   ` Eric W. Biederman
2018-09-21 16:30 ` [PATCH 02/34] vfs: syscall: Add move_mount(2) to move mounts around " David Howells
2018-09-21 16:33 ` [PATCH 26/34] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
2018-09-21 16:34 ` [PATCH 29/34] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
2018-09-21 16:34 ` [PATCH 30/34] vfs: syscall: Add fsmount() to create a mount for a superblock " David Howells
2018-09-21 16:34 ` [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration " David Howells
2018-10-12 14:49   ` Alan Jenkins
2018-10-13  6:11     ` Al Viro
2018-10-13  9:45       ` Alan Jenkins
2018-10-13 23:04         ` Andy Lutomirski
2018-10-17 13:15       ` David Howells
2018-10-17 13:20       ` David Howells
2018-10-17 14:31         ` Alan Jenkins
2018-10-17 14:35           ` Eric W. Biederman
2018-10-17 14:55             ` Alan Jenkins
2018-10-17 15:24           ` David Howells
2018-10-17 15:38             ` Eric W. Biederman
2018-10-17 15:45         ` David Howells
2018-10-17 17:41           ` Alan Jenkins
2018-10-17 21:20           ` David Howells
2018-10-17 22:13           ` Alan Jenkins
2018-10-04 18:37 ` [PATCH 00/34] VFS: Introduce filesystem context " Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).