linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/32] VFS: Introduce filesystem context [ver #9]
@ 2018-07-10 22:41 David Howells
  2018-07-10 22:41 ` [PATCH 01/32] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
                   ` (37 more replies)
  0 siblings, 38 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:41 UTC
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel


Hi Al,

Can you update your tree with this?

Here is a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount.  This is also used for remount
since much of the parsing stuff is common in many filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.

This also allows Miklós Szeredi's idea of doing:

	fd = fsopen("nfs");
	write(fd, "option=val", ...);
	mfd = fsmount(fd, MS_NODEV);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).

I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.

I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.

Unconverted filesystems are handled by a legacy filesystem wrapper.

Significant changes:

 ver #9:

 (*) Dropped the fd cookie stuff and the FMODE_*/O_* split stuff.

 (*) Al added an open_tree() system call to allow a mount tree to be
     referenced or cloned into an O_PATH-style fd.  This can then be used
     with sys_move_mount().  Dropped the O_CLONE_MOUNT and O_NON_RECURSIVE
     open() flags.

 (*) Brought error logging back in, though only in the fs_context and not
     in the task_struct.

 (*) Separated MS_REMOUNT|MS_BIND handling from MS_REMOUNT handling.

 (*) Used anon_inodes for the fd returned by fsopen() and fspick().  This
     requires making it unconditional.

 (*) Fixed lots of bugs.  Special thanks to Al and Eric Biggers for
     finding them and providing patches.

 (*) Wrote manual pages, which I'll post separately.

 ver #8:

 (*) Changed the way fsmount() mounts into the namespace according to some
     of Al's ideas.

 (*) Put better typing on the fd cookie obtained from __fdget() & co.

 (*) Stored the fd cookie in struct nameidata rather than the dfd number.

 (*) Changed sys_fsmount() to return an O_PATH-style fd rather than
     actually mounting into the mount namespace.

 (*) Separated internal FMODE_* handling from O_* handling to free up
     certain O_* flag numbers.

 (*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
     with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.

 (*) Added a new syscall, sys_move_mount(), to move a mount from a
     dfd+path source to a dfd+path destination.

 (*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
     vfsmount attached to file->f_path needs 'unmounting' if set.

 (*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.

	[!] This doesn't work quite right.

 (*) Added a new syscall, fsinfo(), to query information about a
     filesystem.  The idea being that this will, in future, work with the
     fd from fsopen() too and permit querying of the parameters and
     metadata before fsmount() is called.

 ver #7:

 (*) Undo an incorrect MS_* -> SB_* conversion.

 (*) Pass the mount data buffer size to all the mount-related functions that
     take the data pointer.  This fixes a problem where someone (say SELinux)
     tries to copy the mount data, assuming it to be a page in size, and
     overruns the buffer - thereby incurring an oops by hitting a guard page.

 (*) Made the AFS filesystem use them as an example.  This is much easier to
     deal with than NFS or Ext4 as there are very few mount options.

 ver #6:

 (*) Dropped the supplementary error string facility for the moment.

 (*) Dropped the NFS patches for the moment.

 (*) Dropped the reserved file descriptor argument from fsopen() and
     replaced it with three reserved pointers that must be NULL.

 ver #5:

 (*) Renamed sb_config -> fs_context and adjusted variable names.

 (*) Differentiated the flags in sb->s_flags (now named SB_*) from those
     passed to mount(2) (named MS_*).

 (*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
     caller always provide a struct file_system_type pointer and the
     parameters required.

 (*) Got rid of vfs_submount_fc() in favour of passing
     FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context().  The purpose value
     is now used more widely.

 (*) Call ->validate() on the remount path.

 (*) Got rid of the inode locking in sys_fsmount().

 (*) Call security_sb_mountpoint() in the mount(2) path.

 ver #4:

 (*) Split the sb_config patch up somewhat.

 (*) Made the supplementary error string facility something attached to the
     task_struct rather than the sb_config so that error messages can be
     obtained from NFS doing a mount-root-and-pathwalk inside the
     nfs_get_tree() operation.

     Further, made this managed and read by prctl rather than through the
     mount fd so that it's more generally available.

 ver #3:

 (*) Rebased on 4.12-rc1.

 (*) Split the NFS patch up somewhat.

 ver #2:

 (*) Removed the ->fill_super() from sb_config_operations and passed it in
     directly to functions that want to call it.  NFS now calls
     nfs_fill_super() directly rather than jumping through a pointer to it
     since there's only the one option at the moment.

 (*) Removed ->mnt_ns and ->sb from sb_config and moved ->pid_ns into
     proc_sb_config.

 (*) Renamed create_super -> get_tree.

 (*) Renamed struct mount_context to struct sb_config and amended various
     variable names.

 (*) sys_fsmount() acquired AT_* flags and MS_* flags (for MNT_* flags)
     arguments.

 ver #1:

 (*) Split the sb_config stuff out into its own header.

 (*) Support non-context aware filesystems through a special set of
     sb_config operations.

 (*) Stored the created superblock and root dentry into the sb_config after
     creation rather than directly into a vfsmount.  This allows some
     arguments to be removed to various NFS functions.

 (*) Added an explicit superblock-creation step.  This allows a created
     superblock to then be mounted multiple times.

 (*) Added a flag to say that the sb_config is degraded and cannot be used
     for another attempt at superblock creation, and got rid of the flag
     that says it's already mounted.

Possible further developments:

 (*) Implement sb reconfiguration (for now it returns ENOANO).

 (*) Implement mount context support in more filesystems, ext4 being next
     on my list.

 (*) Move the walk-from-root stuff that nfs has to generic code so that you
     can do something akin to:

	mount /dev/sda1:/foo/bar /mnt

     See nfs_follow_remote_path() and mount_subtree().  This is slightly
     tricky in NFS as we have to prevent referral loops.

 (*) Work out how to get at the error message incurred by submounts
     encountered during nfs_follow_remote_path().

     Should the error message be moved to task_struct and made more
     general, perhaps retrieved with a prctl() function?

 (*) Clean up/consolidate the security functions.  Possibly add a
     validation hook to be called at the same time as the mount context
     validate op.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/tag/?h=mount-api-20180710-2

on branch:

	mount-context

David
---
Al Viro (2):
      vfs: syscall: Add open_tree(2) to reference or clone a mount
      teach move_mount(2) to work with OPEN_TREE_CLONE

David Howells (30):
      vfs: syscall: Add move_mount(2) to move mounts around
      vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled
      vfs: Introduce the basic header for the new mount API's filesystem context
      vfs: Add LSM hooks for the new mount API
      selinux: Implement the new mount API LSM hooks
      smack: Implement filesystem context security hooks
      apparmor: Implement security hooks for the new mount API
      tomoyo: Implement security hooks for the new mount API
      vfs: Require specification of size of mount data for internal mounts
      vfs: Separate changing mount flags full remount
      vfs: Implement a filesystem superblock creation/configuration context
      vfs: Remove unused code after filesystem context changes
      procfs: Move proc_fill_super() to fs/proc/root.c
      proc: Add fs_context support to procfs
      ipc: Convert mqueue fs to fs_context
      cpuset: Use fs_context
      kernfs, sysfs, cgroup, intel_rdt: Support fs_context
      hugetlbfs: Convert to fs_context
      vfs: Remove kern_mount_data()
      vfs: Provide documentation for new mount API
      Make anon_inodes unconditional
      vfs: syscall: Add fsopen() to prepare for superblock creation
      vfs: syscall: Add fsmount() to create a mount for a superblock
      vfs: syscall: Add fspick() to select a superblock for reconfiguration
      vfs: Implement logging through fs_context
      vfs: Add some logging to the core users of the fs_context log
      afs: Add fs_context support
      afs: Use fs_context to pass parameters over automount
      vfs: syscall: Add fsinfo() to query filesystem information
      afs: Add fsinfo support


 Documentation/filesystems/mount_api.txt   |  439 +++++++++++++++
 arch/arc/kernel/setup.c                   |    1 
 arch/arm/kernel/atags_parse.c             |    1 
 arch/ia64/kernel/perfmon.c                |    3 
 arch/powerpc/platforms/cell/spufs/inode.c |    6 
 arch/s390/hypfs/inode.c                   |    7 
 arch/sh/kernel/setup.c                    |    1 
 arch/sparc/kernel/setup_32.c              |    1 
 arch/sparc/kernel/setup_64.c              |    1 
 arch/x86/entry/syscalls/syscall_32.tbl    |    6 
 arch/x86/entry/syscalls/syscall_64.tbl    |    6 
 arch/x86/kernel/cpu/intel_rdt.h           |   15 
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c  |  149 +++--
 arch/x86/kernel/setup.c                   |    1 
 drivers/base/devtmpfs.c                   |    7 
 drivers/dax/super.c                       |    2 
 drivers/gpu/drm/drm_drv.c                 |    3 
 drivers/gpu/drm/i915/i915_gemfs.c         |    2 
 drivers/infiniband/hw/qib/qib_fs.c        |    7 
 drivers/misc/cxl/api.c                    |    3 
 drivers/misc/ibmasm/ibmasmfs.c            |   11 
 drivers/mtd/mtdsuper.c                    |   26 -
 drivers/oprofile/oprofilefs.c             |    8 
 drivers/scsi/cxlflash/ocxl_hw.c           |    2 
 drivers/usb/gadget/function/f_fs.c        |    7 
 drivers/usb/gadget/legacy/inode.c         |    7 
 drivers/virtio/virtio_balloon.c           |    2 
 drivers/xen/xenfs/super.c                 |    7 
 fs/9p/vfs_super.c                         |    2 
 fs/Makefile                               |    5 
 fs/adfs/super.c                           |    9 
 fs/affs/super.c                           |   13 
 fs/afs/internal.h                         |    9 
 fs/afs/mntpt.c                            |  147 ++---
 fs/afs/super.c                            |  536 +++++++++++-------
 fs/afs/volume.c                           |    4 
 fs/aio.c                                  |    3 
 fs/anon_inodes.c                          |    3 
 fs/autofs/autofs_i.h                      |    2 
 fs/autofs/init.c                          |    4 
 fs/autofs/inode.c                         |    3 
 fs/befs/linuxvfs.c                        |   11 
 fs/bfs/inode.c                            |    8 
 fs/binfmt_misc.c                          |    7 
 fs/block_dev.c                            |    2 
 fs/btrfs/super.c                          |   30 +
 fs/btrfs/tests/btrfs-tests.c              |    2 
 fs/ceph/super.c                           |    3 
 fs/cifs/cifs_dfs_ref.c                    |    3 
 fs/cifs/cifsfs.c                          |   18 -
 fs/coda/inode.c                           |   11 
 fs/configfs/mount.c                       |    7 
 fs/cramfs/inode.c                         |   17 -
 fs/debugfs/inode.c                        |   14 
 fs/devpts/inode.c                         |   10 
 fs/ecryptfs/main.c                        |    2 
 fs/efivarfs/super.c                       |    9 
 fs/efs/super.c                            |   14 
 fs/exofs/super.c                          |    7 
 fs/ext2/super.c                           |   14 
 fs/ext4/super.c                           |   16 -
 fs/f2fs/super.c                           |   13 
 fs/fat/inode.c                            |    3 
 fs/fat/namei_msdos.c                      |    8 
 fs/fat/namei_vfat.c                       |    8 
 fs/file_table.c                           |    9 
 fs/freevxfs/vxfs_super.c                  |   12 
 fs/fs_context.c                           |  721 ++++++++++++++++++++++++
 fs/fsopen.c                               |  335 +++++++++++
 fs/fuse/control.c                         |    9 
 fs/fuse/inode.c                           |   16 -
 fs/gfs2/ops_fstype.c                      |    6 
 fs/gfs2/super.c                           |    4 
 fs/hfs/super.c                            |   12 
 fs/hfsplus/super.c                        |   12 
 fs/hostfs/hostfs_kern.c                   |    7 
 fs/hpfs/super.c                           |   11 
 fs/hugetlbfs/inode.c                      |  339 +++++++----
 fs/internal.h                             |    6 
 fs/isofs/inode.c                          |   11 
 fs/jffs2/super.c                          |   10 
 fs/jfs/super.c                            |   11 
 fs/kernfs/mount.c                         |   88 +--
 fs/libfs.c                                |   19 +
 fs/minix/inode.c                          |   14 
 fs/namespace.c                            |  877 ++++++++++++++++++++++-------
 fs/nfs/internal.h                         |    4 
 fs/nfs/namespace.c                        |    3 
 fs/nfs/nfs4namespace.c                    |    3 
 fs/nfs/nfs4super.c                        |   27 -
 fs/nfs/super.c                            |   22 -
 fs/nfsd/nfsctl.c                          |    8 
 fs/nilfs2/super.c                         |   10 
 fs/nsfs.c                                 |    3 
 fs/ntfs/super.c                           |   13 
 fs/ocfs2/dlmfs/dlmfs.c                    |    5 
 fs/ocfs2/super.c                          |   14 
 fs/omfs/inode.c                           |    9 
 fs/openpromfs/inode.c                     |   11 
 fs/orangefs/orangefs-kernel.h             |    2 
 fs/orangefs/super.c                       |    5 
 fs/overlayfs/super.c                      |   11 
 fs/pipe.c                                 |    3 
 fs/pnode.c                                |    1 
 fs/proc/inode.c                           |   50 --
 fs/proc/internal.h                        |    6 
 fs/proc/root.c                            |  212 +++++--
 fs/pstore/inode.c                         |   10 
 fs/qnx4/inode.c                           |   14 
 fs/qnx6/inode.c                           |   14 
 fs/ramfs/inode.c                          |    6 
 fs/reiserfs/super.c                       |   14 
 fs/romfs/super.c                          |   13 
 fs/squashfs/super.c                       |   12 
 fs/statfs.c                               |  470 ++++++++++++++++
 fs/super.c                                |  394 ++++++++++---
 fs/sysfs/mount.c                          |   67 ++
 fs/sysv/inode.c                           |    3 
 fs/sysv/super.c                           |   16 -
 fs/tracefs/inode.c                        |   10 
 fs/ubifs/super.c                          |    5 
 fs/udf/super.c                            |   16 -
 fs/ufs/super.c                            |   11 
 fs/xfs/xfs_super.c                        |   10 
 include/linux/cgroup.h                    |    3 
 include/linux/debugfs.h                   |    8 
 include/linux/fs.h                        |   47 +-
 include/linux/fs_context.h                |  178 ++++++
 include/linux/fsinfo.h                    |   40 +
 include/linux/kernfs.h                    |   39 +
 include/linux/lsm_hooks.h                 |   88 +++
 include/linux/module.h                    |    6 
 include/linux/mount.h                     |   10 
 include/linux/mtd/super.h                 |    4 
 include/linux/ramfs.h                     |    4 
 include/linux/security.h                  |   74 ++
 include/linux/shmem_fs.h                  |    3 
 include/linux/syscalls.h                  |   11 
 include/uapi/linux/fcntl.h                |    2 
 include/uapi/linux/fs.h                   |   68 +-
 include/uapi/linux/fsinfo.h               |  237 ++++++++
 include/uapi/linux/mount.h                |   75 ++
 init/Kconfig                              |   10 
 init/do_mounts.c                          |    5 
 init/do_mounts_initrd.c                   |    1 
 ipc/mqueue.c                              |  120 +++-
 kernel/bpf/inode.c                        |    7 
 kernel/cgroup/cgroup-internal.h           |   49 +-
 kernel/cgroup/cgroup-v1.c                 |  302 +++++-----
 kernel/cgroup/cgroup.c                    |  226 ++++---
 kernel/cgroup/cpuset.c                    |   67 ++
 kernel/trace/trace.c                      |    7 
 mm/shmem.c                                |   10 
 mm/zsmalloc.c                             |    3 
 net/socket.c                              |    3 
 net/sunrpc/rpc_pipe.c                     |    7 
 samples/statx/Makefile                    |    5 
 samples/statx/test-fsinfo.c               |  539 ++++++++++++++++++
 security/apparmor/apparmorfs.c            |    8 
 security/apparmor/include/mount.h         |   11 
 security/apparmor/lsm.c                   |   84 +++
 security/apparmor/mount.c                 |   47 ++
 security/inode.c                          |    7 
 security/security.c                       |   70 ++
 security/selinux/hooks.c                  |  294 +++++++++-
 security/selinux/selinuxfs.c              |    8 
 security/smack/smack_lsm.c                |  344 ++++++++++-
 security/smack/smackfs.c                  |    9 
 security/tomoyo/common.h                  |    3 
 security/tomoyo/mount.c                   |   46 ++
 security/tomoyo/tomoyo.c                  |   19 +
 171 files changed, 7147 insertions(+), 1805 deletions(-)
 create mode 100644 Documentation/filesystems/mount_api.txt
 create mode 100644 fs/fs_context.c
 create mode 100644 fs/fsopen.c
 create mode 100644 include/linux/fs_context.h
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 include/uapi/linux/mount.h
 create mode 100644 samples/statx/test-fsinfo.c



* [PATCH 01/32] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
@ 2018-07-10 22:41 ` David Howells
  2018-07-10 22:41 ` [PATCH 02/32] vfs: syscall: Add move_mount(2) to move mounts around " David Howells
                   ` (36 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:41 UTC
  To: viro; +Cc: dhowells, linux-api, linux-fsdevel, torvalds, linux-kernel

From: Al Viro <viro@zeniv.linux.org.uk>

open_tree(dfd, pathname, flags)

Returns an O_PATH-opened file descriptor or an error.
dfd and pathname specify the location to open, in the usual
fashion (see e.g. fstatat(2)).  flags should be an OR of
some of the following:
	* AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
same meanings as usual
	* OPEN_TREE_CLOEXEC - make the resulting descriptor
close-on-exec
	* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
instead of opening the location in question, create a detached
mount tree matching the subtree rooted at location specified by
dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
without it - only the part within the mount containing the
location in question.  In other words, the same as mount --rbind
or mount --bind would've taken.  The detached tree will be
dissolved on the final close of the obtained file.  Creation of such
detached trees requires the same capabilities as doing mount --bind.
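
The flag rules above can be mirrored in userspace as a minimal sketch.  The
helper names check_open_tree_flags() and check_open_tree_flags_demo() are
hypothetical and not part of this patch; the fallback flag values are the
ones this patch and the Linux uapi headers define, and are only used when
the libc headers do not already provide them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* Fallbacks for headers that lack these definitions (Linux uapi values). */
#ifndef AT_SYMLINK_NOFOLLOW
#define AT_SYMLINK_NOFOLLOW	0x100
#endif
#ifndef AT_NO_AUTOMOUNT
#define AT_NO_AUTOMOUNT		0x800
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH		0x1000
#endif
#ifndef AT_RECURSIVE
#define AT_RECURSIVE		0x8000	/* added by this patch */
#endif
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE		1	/* added by this patch */
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC	O_CLOEXEC
#endif

/*
 * Hypothetical userspace mirror of the two flag checks performed at the
 * top of sys_open_tree() in this patch.  Returns 0 if the flag word would
 * pass those checks, -EINVAL otherwise.
 */
static int check_open_tree_flags(unsigned int flags)
{
	const unsigned int known = AT_EMPTY_PATH | AT_NO_AUTOMOUNT |
				   AT_RECURSIVE | AT_SYMLINK_NOFOLLOW |
				   OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC;

	/* Unknown bits are rejected. */
	if (flags & ~known)
		return -EINVAL;

	/* AT_RECURSIVE only makes sense together with OPEN_TREE_CLONE. */
	if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
		return -EINVAL;

	return 0;
}

/* Small self-check exercising the rules described above. */
static void check_open_tree_flags_demo(void)
{
	assert(check_open_tree_flags(OPEN_TREE_CLONE | AT_RECURSIVE) == 0);
	assert(check_open_tree_flags(OPEN_TREE_CLOEXEC) == 0);
	assert(check_open_tree_flags(AT_RECURSIVE) == -EINVAL);
}
```

This only covers flag validation; the capability check for detached
trees and the path lookup still happen in the kernel.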

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/file_table.c                        |    9 +-
 fs/internal.h                          |    1 
 fs/namespace.c                         |  132 +++++++++++++++++++++++++++-----
 include/linux/fs.h                     |    3 +
 include/linux/syscalls.h               |    1 
 include/uapi/linux/fcntl.h             |    2 
 include/uapi/linux/mount.h             |   10 ++
 9 files changed, 135 insertions(+), 25 deletions(-)
 create mode 100644 include/uapi/linux/mount.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..ea1b413afd47 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,4 @@
 384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
+387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..0545bed581dc 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+335	common	open_tree		__x64_sys_open_tree
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/file_table.c b/fs/file_table.c
index 7ec0b3e5f05d..7480271a0d21 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -189,6 +189,7 @@ static void __fput(struct file *file)
 	struct dentry *dentry = file->f_path.dentry;
 	struct vfsmount *mnt = file->f_path.mnt;
 	struct inode *inode = file->f_inode;
+	fmode_t mode = file->f_mode;
 
 	might_sleep();
 
@@ -209,14 +210,14 @@ static void __fput(struct file *file)
 		file->f_op->release(inode, file);
 	security_file_free(file);
 	if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
-		     !(file->f_mode & FMODE_PATH))) {
+		     !(mode & FMODE_PATH))) {
 		cdev_put(inode->i_cdev);
 	}
 	fops_put(file->f_op);
 	put_pid(file->f_owner.pid);
-	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
+	if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
 		i_readcount_dec(inode);
-	if (file->f_mode & FMODE_WRITER) {
+	if (mode & FMODE_WRITER) {
 		put_write_access(inode);
 		__mnt_drop_write(mnt);
 	}
@@ -224,6 +225,8 @@ static void __fput(struct file *file)
 	file->f_path.mnt = NULL;
 	file->f_inode = NULL;
 	file_free(file);
+	if (unlikely(mode & FMODE_NEED_UNMOUNT))
+		dissolve_on_fput(mnt);
 	dput(dentry);
 	mntput(mnt);
 }
diff --git a/fs/internal.h b/fs/internal.h
index 980d005b21b4..b55575b9b55c 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -85,6 +85,7 @@ extern void __mnt_drop_write(struct vfsmount *);
 extern void __mnt_drop_write_file(struct file *);
 extern void mnt_drop_write_file_path(struct file *);
 
+extern void dissolve_on_fput(struct vfsmount *);
 /*
  * fs_struct.c
  */
diff --git a/fs/namespace.c b/fs/namespace.c
index 8ddd14806799..b355a555b4db 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -20,12 +20,14 @@
 #include <linux/init.h>		/* init_rootfs */
 #include <linux/fs_struct.h>	/* get_fs_root et.al. */
 #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
+#include <linux/file.h>
 #include <linux/uaccess.h>
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/bootmem.h>
 #include <linux/task_work.h>
 #include <linux/sched/task.h>
+#include <uapi/linux/mount.h>
 
 #include "pnode.h"
 #include "internal.h"
@@ -1839,6 +1841,16 @@ struct vfsmount *collect_mounts(const struct path *path)
 	return &tree->mnt;
 }
 
+void dissolve_on_fput(struct vfsmount *mnt)
+{
+	namespace_lock();
+	lock_mount_hash();
+	mntget(mnt);
+	umount_tree(real_mount(mnt), UMOUNT_SYNC);
+	unlock_mount_hash();
+	namespace_unlock();
+}
+
 void drop_collected_mounts(struct vfsmount *mnt)
 {
 	namespace_lock();
@@ -2198,6 +2210,30 @@ static bool has_locked_children(struct mount *mnt, struct dentry *dentry)
 	return false;
 }
 
+static struct mount *__do_loopback(struct path *old_path, int recurse)
+{
+	struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);
+
+	if (IS_MNT_UNBINDABLE(old))
+		return mnt;
+
+	if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
+		return mnt;
+
+	if (!recurse && has_locked_children(old, old_path->dentry))
+		return mnt;
+
+	if (recurse)
+		mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
+	else
+		mnt = clone_mnt(old, old_path->dentry, 0);
+
+	if (!IS_ERR(mnt))
+		mnt->mnt.mnt_flags &= ~MNT_LOCKED;
+
+	return mnt;
+}
+
 /*
  * do loopback mount.
  */
@@ -2205,7 +2241,7 @@ static int do_loopback(struct path *path, const char *old_name,
 				int recurse)
 {
 	struct path old_path;
-	struct mount *mnt = NULL, *old, *parent;
+	struct mount *mnt = NULL, *parent;
 	struct mountpoint *mp;
 	int err;
 	if (!old_name || !*old_name)
@@ -2219,38 +2255,21 @@ static int do_loopback(struct path *path, const char *old_name,
 		goto out;
 
 	mp = lock_mount(path);
-	err = PTR_ERR(mp);
-	if (IS_ERR(mp))
+	if (IS_ERR(mp)) {
+		err = PTR_ERR(mp);
 		goto out;
+	}
 
-	old = real_mount(old_path.mnt);
 	parent = real_mount(path->mnt);
-
-	err = -EINVAL;
-	if (IS_MNT_UNBINDABLE(old))
-		goto out2;
-
 	if (!check_mnt(parent))
 		goto out2;
 
-	if (!check_mnt(old) && old_path.dentry->d_op != &ns_dentry_operations)
-		goto out2;
-
-	if (!recurse && has_locked_children(old, old_path.dentry))
-		goto out2;
-
-	if (recurse)
-		mnt = copy_tree(old, old_path.dentry, CL_COPY_MNT_NS_FILE);
-	else
-		mnt = clone_mnt(old, old_path.dentry, 0);
-
+	mnt = __do_loopback(&old_path, recurse);
 	if (IS_ERR(mnt)) {
 		err = PTR_ERR(mnt);
 		goto out2;
 	}
 
-	mnt->mnt.mnt_flags &= ~MNT_LOCKED;
-
 	err = graft_tree(mnt, parent, mp);
 	if (err) {
 		lock_mount_hash();
@@ -2264,6 +2283,75 @@ static int do_loopback(struct path *path, const char *old_name,
 	return err;
 }
 
+SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
+{
+	struct file *file;
+	struct path path;
+	int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
+	bool detached = flags & OPEN_TREE_CLONE;
+	int error;
+	int fd;
+
+	BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
+
+	if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
+		      AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
+		      OPEN_TREE_CLOEXEC))
+		return -EINVAL;
+
+	if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
+		return -EINVAL;
+
+	if (flags & AT_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (flags & AT_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (flags & AT_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+	if (detached && !may_mount())
+		return -EPERM;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	error = user_path_at(dfd, filename, lookup_flags, &path);
+	if (error)
+		goto out;
+
+	if (detached) {
+		struct mount *mnt = __do_loopback(&path, flags & AT_RECURSIVE);
+		if (IS_ERR(mnt)) {
+			error = PTR_ERR(mnt);
+			goto out2;
+		}
+		mntput(path.mnt);
+		path.mnt = &mnt->mnt;
+	}
+
+	file = dentry_open(&path, O_PATH, current_cred());
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto out3;
+	}
+
+	if (detached)
+		file->f_mode |= FMODE_NEED_UNMOUNT;
+	path_put(&path);
+	fd_install(fd, file);
+	return fd;
+
+out3:
+	if (detached)
+		dissolve_on_fput(path.mnt);
+out2:
+	path_put(&path);
+out:
+	put_unused_fd(fd);
+	return error;
+}
+
 static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
 {
 	int error = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5c91108846db..00e255c195f2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -154,6 +154,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File is capable of returning -EAGAIN if I/O will block */
 #define FMODE_NOWAIT	((__force fmode_t)0x8000000)
 
+/* File represents mount that needs unmounting */
+#define FMODE_NEED_UNMOUNT     ((__force fmode_t)0x10000000)
+
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
  * that indicates that they should check the contents of the iovec are
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 73810808cdf2..3cc6b8f8bd2f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -900,6 +900,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
+asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..594b85f7cb86 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -90,5 +90,7 @@
 #define AT_STATX_FORCE_SYNC	0x2000	/* - Force the attributes to be sync'd with the server */
 #define AT_STATX_DONT_SYNC	0x4000	/* - Don't sync attributes with the server */
 
+#define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
+
 
 #endif /* _UAPI_LINUX_FCNTL_H */
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
new file mode 100644
index 000000000000..e8db2911adca
--- /dev/null
+++ b/include/uapi/linux/mount.h
@@ -0,0 +1,10 @@
+#ifndef _UAPI_LINUX_MOUNT_H
+#define _UAPI_LINUX_MOUNT_H
+
+/*
+ * open_tree() flags.
+ */
+#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
+#define OPEN_TREE_CLOEXEC	O_CLOEXEC	/* Close the file on execve() */
+
+#endif /* _UAPI_LINUX_MOUNT_H */



* [PATCH 02/32] vfs: syscall: Add move_mount(2) to move mounts around [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
  2018-07-10 22:41 ` [PATCH 01/32] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
@ 2018-07-10 22:41 ` David Howells
  2018-07-10 22:41 ` [PATCH 03/32] teach move_mount(2) to work with OPEN_TREE_CLONE " David Howells
                   ` (35 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:41 UTC
  To: viro; +Cc: dhowells, linux-api, linux-fsdevel, torvalds, linux-kernel

Add a move_mount() system call that will move a mount from one place to
another and, in the next commit, allow an unattached mount tree to be attached.

The new system call looks like the following:

	int move_mount(int from_dfd, const char *from_path,
		       int to_dfd, const char *to_path,
		       unsigned int flags);
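
Until libc grows a wrapper, the call could be reached through syscall(2).
The following is a minimal sketch under stated assumptions: the helper
names move_mount_raw() and move_mount_raw_demo() are hypothetical, and the
syscall number is taken from <sys/syscall.h> only when the installed
headers define SYS_move_mount.  The numbers in the syscall tables of this
patch are specific to this series, so none is hard-coded here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): invoke move_mount() via
 * syscall(2) using the prototype given above.  If the headers do not know
 * the syscall, fail with ENOSYS rather than guess a number.
 */
static long move_mount_raw(int from_dfd, const char *from_path,
			   int to_dfd, const char *to_path,
			   unsigned int flags)
{
#ifdef SYS_move_mount
	return syscall(SYS_move_mount, from_dfd, from_path,
		       to_dfd, to_path, flags);
#else
	(void)from_dfd;
	(void)from_path;
	(void)to_dfd;
	(void)to_path;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * An empty source path with no flags cannot name a mount, so this call is
 * expected to fail and set errno whichever kernel is running.
 */
static void move_mount_raw_demo(void)
{
	errno = 0;
	assert(move_mount_raw(-1, "", -1, "", 0) == -1);
	assert(errno != 0);
}
```

A caller would normally pass AT_FDCWD or an O_PATH fd for the two dfd
arguments, in the same way as for the other *at() calls.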

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/namespace.c                         |  102 ++++++++++++++++++++++++++------
 include/linux/lsm_hooks.h              |    6 ++
 include/linux/security.h               |    7 ++
 include/linux/syscalls.h               |    3 +
 include/uapi/linux/mount.h             |   11 +++
 security/security.c                    |    5 ++
 8 files changed, 118 insertions(+), 18 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ea1b413afd47..76d092b7d1b0 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -399,3 +399,4 @@
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
+388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 0545bed581dc..37ba4e65eee6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -344,6 +344,7 @@
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
 335	common	open_tree		__x64_sys_open_tree
+336	common	move_mount		__x64_sys_move_mount
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index b355a555b4db..e95b2bc8fcfe 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2446,43 +2446,37 @@ static inline int tree_contains_unbindable(struct mount *mnt)
 	return 0;
 }
 
-static int do_move_mount(struct path *path, const char *old_name)
+static int do_move_mount(struct path *old_path, struct path *new_path)
 {
-	struct path old_path, parent_path;
+	struct path parent_path = {.mnt = NULL, .dentry = NULL};
 	struct mount *p;
 	struct mount *old;
 	struct mountpoint *mp;
 	int err;
-	if (!old_name || !*old_name)
-		return -EINVAL;
-	err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
-	if (err)
-		return err;
 
-	mp = lock_mount(path);
+	mp = lock_mount(new_path);
 	err = PTR_ERR(mp);
 	if (IS_ERR(mp))
 		goto out;
 
-	old = real_mount(old_path.mnt);
-	p = real_mount(path->mnt);
+	old = real_mount(old_path->mnt);
+	p = real_mount(new_path->mnt);
 
 	err = -EINVAL;
 	if (!check_mnt(p) || !check_mnt(old))
 		goto out1;
 
-	if (old->mnt.mnt_flags & MNT_LOCKED)
+	if (!mnt_has_parent(old))
 		goto out1;
 
-	err = -EINVAL;
-	if (old_path.dentry != old_path.mnt->mnt_root)
+	if (old->mnt.mnt_flags & MNT_LOCKED)
 		goto out1;
 
-	if (!mnt_has_parent(old))
+	if (old_path->dentry != old_path->mnt->mnt_root)
 		goto out1;
 
-	if (d_is_dir(path->dentry) !=
-	      d_is_dir(old_path.dentry))
+	if (d_is_dir(new_path->dentry) !=
+	    d_is_dir(old_path->dentry))
 		goto out1;
 	/*
 	 * Don't move a mount residing in a shared parent.
@@ -2500,7 +2494,8 @@ static int do_move_mount(struct path *path, const char *old_name)
 		if (p == old)
 			goto out1;
 
-	err = attach_recursive_mnt(old, real_mount(path->mnt), mp, &parent_path);
+	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
+				   &parent_path);
 	if (err)
 		goto out1;
 
@@ -2512,6 +2507,22 @@ static int do_move_mount(struct path *path, const char *old_name)
 out:
 	if (!err)
 		path_put(&parent_path);
+	return err;
+}
+
+static int do_move_mount_old(struct path *path, const char *old_name)
+{
+	struct path old_path;
+	int err;
+
+	if (!old_name || !*old_name)
+		return -EINVAL;
+
+	err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
+	if (err)
+		return err;
+
+	err = do_move_mount(&old_path, path);
 	path_put(&old_path);
 	return err;
 }
@@ -2931,7 +2942,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 	else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
 		retval = do_change_type(&path, flags);
 	else if (flags & MS_MOVE)
-		retval = do_move_mount(&path, dev_name);
+		retval = do_move_mount_old(&path, dev_name);
 	else
 		retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
 				      dev_name, data_page);
@@ -3166,6 +3177,61 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
 	return ksys_mount(dev_name, dir_name, type, flags, data);
 }
 
+/*
+ * Move a mount from one place to another.
+ *
+ * Note the flags value is a combination of MOVE_MOUNT_* flags.
+ */
+SYSCALL_DEFINE5(move_mount,
+		int, from_dfd, const char *, from_pathname,
+		int, to_dfd, const char *, to_pathname,
+		unsigned int, flags)
+{
+	struct path from_path, to_path;
+	unsigned int lflags;
+	int ret = 0;
+
+	if (!may_mount())
+		return -EPERM;
+
+	if (flags & ~MOVE_MOUNT__MASK)
+		return -EINVAL;
+
+	/* If someone gives a pathname, they aren't permitted to move
+	 * from an fd that requires unmount as we can't get at the flag
+	 * to clear it afterwards.
+	 */
+	lflags = 0;
+	if (flags & MOVE_MOUNT_F_SYMLINKS)	lflags |= LOOKUP_FOLLOW;
+	if (flags & MOVE_MOUNT_F_AUTOMOUNTS)	lflags |= LOOKUP_AUTOMOUNT;
+	if (flags & MOVE_MOUNT_F_EMPTY_PATH)	lflags |= LOOKUP_EMPTY;
+
+	ret = user_path_at(from_dfd, from_pathname, lflags, &from_path);
+	if (ret < 0)
+		return ret;
+
+	lflags = 0;
+	if (flags & MOVE_MOUNT_T_SYMLINKS)	lflags |= LOOKUP_FOLLOW;
+	if (flags & MOVE_MOUNT_T_AUTOMOUNTS)	lflags |= LOOKUP_AUTOMOUNT;
+	if (flags & MOVE_MOUNT_T_EMPTY_PATH)	lflags |= LOOKUP_EMPTY;
+
+	ret = user_path_at(to_dfd, to_pathname, lflags, &to_path);
+	if (ret < 0)
+		goto out_from;
+
+	ret = security_move_mount(&from_path, &to_path);
+	if (ret < 0)
+		goto out_to;
+
+	ret = do_move_mount(&from_path, &to_path);
+
+out_to:
+	path_put(&to_path);
+out_from:
+	path_put(&from_path);
+	return ret;
+}
+
 /*
  * Return true if path is reachable from root
  *
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 8f1131c8dd54..926607defd83 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -144,6 +144,10 @@
  *	Parse a string of security data filling in the opts structure
  *	@options string containing all mount options known by the LSM
  *	@opts binary data structure usable by the LSM
+ * @move_mount:
+ *	Check permission before a mount is moved.
+ *	@from_path indicates the mount that is going to be moved.
+ *	@to_path indicates the mountpoint that will be mounted upon.
  * @dentry_init_security:
  *	Compute a context for a dentry as the inode is not yet available
  *	since NFSv4 has no label backed by an EA anyway.
@@ -1475,6 +1479,7 @@ union security_list_options {
 					unsigned long kern_flags,
 					unsigned long *set_kern_flags);
 	int (*sb_parse_opts_str)(char *options, struct security_mnt_opts *opts);
+	int (*move_mount)(const struct path *from_path, const struct path *to_path);
 	int (*dentry_init_security)(struct dentry *dentry, int mode,
 					const struct qstr *name, void **ctx,
 					u32 *ctxlen);
@@ -1806,6 +1811,7 @@ struct security_hook_heads {
 	struct hlist_head sb_set_mnt_opts;
 	struct hlist_head sb_clone_mnt_opts;
 	struct hlist_head sb_parse_opts_str;
+	struct hlist_head move_mount;
 	struct hlist_head dentry_init_security;
 	struct hlist_head dentry_create_files_as;
 #ifdef CONFIG_SECURITY_PATH
diff --git a/include/linux/security.h b/include/linux/security.h
index 63030c85ee19..15d121f156b3 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -245,6 +245,7 @@ int security_sb_clone_mnt_opts(const struct super_block *oldsb,
 				unsigned long kern_flags,
 				unsigned long *set_kern_flags);
 int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts);
+int security_move_mount(const struct path *from_path, const struct path *to_path);
 int security_dentry_init_security(struct dentry *dentry, int mode,
 					const struct qstr *name, void **ctx,
 					u32 *ctxlen);
@@ -598,6 +599,12 @@ static inline int security_sb_parse_opts_str(char *options, struct security_mnt_
 	return 0;
 }
 
+static inline int security_move_mount(const struct path *from_path,
+				      const struct path *to_path)
+{
+	return 0;
+}
+
 static inline int security_inode_alloc(struct inode *inode)
 {
 	return 0;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3cc6b8f8bd2f..3c0855d9b105 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -901,6 +901,9 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
 asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
+asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
+			       int to_dfd, const char __user *to_path,
+			       unsigned int ms_flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index e8db2911adca..89adf0d731ab 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -7,4 +7,15 @@
 #define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
 #define OPEN_TREE_CLOEXEC	O_CLOEXEC	/* Close the file on execve() */
 
+/*
+ * move_mount() flags.
+ */
+#define MOVE_MOUNT_F_SYMLINKS		0x00000001 /* Follow symlinks on from path */
+#define MOVE_MOUNT_F_AUTOMOUNTS		0x00000002 /* Follow automounts on from path */
+#define MOVE_MOUNT_F_EMPTY_PATH		0x00000004 /* Empty from path permitted */
+#define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
+#define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
+#define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
+#define MOVE_MOUNT__MASK		0x00000077
+
 #endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/security/security.c b/security/security.c
index 68f46d849abe..c4cbdb7d3a5f 100644
--- a/security/security.c
+++ b/security/security.c
@@ -437,6 +437,11 @@ int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts)
 }
 EXPORT_SYMBOL(security_sb_parse_opts_str);
 
+int security_move_mount(const struct path *from_path, const struct path *to_path)
+{
+	return call_int_hook(move_mount, 0, from_path, to_path);
+}
+
 int security_inode_alloc(struct inode *inode)
 {
 	inode->i_security = NULL;



* [PATCH 03/32] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
  2018-07-10 22:41 ` [PATCH 01/32] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
  2018-07-10 22:41 ` [PATCH 02/32] vfs: syscall: Add move_mount(2) to move mounts around " David Howells
@ 2018-07-10 22:41 ` David Howells
  2018-07-10 22:41 ` [PATCH 04/32] vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled " David Howells
                   ` (34 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:41 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

From: Al Viro <viro@zeniv.linux.org.uk>

Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
attached by move_mount(2).

If, by the time of the final fput() of an OPEN_TREE_CLONE-opened file, its
tree is no longer detached, it won't be dissolved.  move_mount(2) is
adjusted to handle a detached source.

That gives us equivalents of mount --bind and mount --rbind.
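
The adjusted source checks can be modelled roughly as below.  This is a
userspace sketch, not the kernel code: struct ex_mount and
ex_source_may_move() are hypothetical names, and only the namespace and
attachment tests from the hunk are modelled (MNT_LOCKED, shared-parent and
unbindable checks are left out).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state do_move_mount() inspects on the source. */
struct ex_mount {
	bool has_ns;		/* models old->mnt_ns != NULL */
	bool in_caller_ns;	/* models check_mnt(old) */
	bool has_parent;	/* models mnt_has_parent(old) */
};

/* True if the source passes the namespace/attachment checks. */
static bool ex_source_may_move(const struct ex_mount *old)
{
	/* A mount in no namespace cannot be in the caller's namespace. */
	assert(old->has_ns || !old->in_caller_ns);

	/* The thing moved should be either ours or completely unattached. */
	if (old->has_ns && !old->in_caller_ns)
		return false;

	/* A mount that is in a namespace but has no parent is "/". */
	if (old->has_ns && !old->has_parent)
		return false;

	return true;
}
```

In this model a detached clone (no namespace, no parent) is accepted,
whereas the root of the caller's namespace and mounts belonging to another
namespace are still refused.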

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

 fs/namespace.c |   26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e95b2bc8fcfe..bd2526b24afb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1845,8 +1845,10 @@ void dissolve_on_fput(struct vfsmount *mnt)
 {
 	namespace_lock();
 	lock_mount_hash();
-	mntget(mnt);
-	umount_tree(real_mount(mnt), UMOUNT_SYNC);
+	if (!real_mount(mnt)->mnt_ns) {
+		mntget(mnt);
+		umount_tree(real_mount(mnt), UMOUNT_SYNC);
+	}
 	unlock_mount_hash();
 	namespace_unlock();
 }
@@ -2453,6 +2455,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	struct mount *old;
 	struct mountpoint *mp;
 	int err;
+	bool attached;
 
 	mp = lock_mount(new_path);
 	err = PTR_ERR(mp);
@@ -2463,10 +2466,19 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	p = real_mount(new_path->mnt);
 
 	err = -EINVAL;
-	if (!check_mnt(p) || !check_mnt(old))
+	/* The mountpoint must be in our namespace. */
+	if (!check_mnt(p))
+		goto out1;
+	/* The thing moved should be either ours or completely unattached. */
+	if (old->mnt_ns && !check_mnt(old))
 		goto out1;
 
-	if (!mnt_has_parent(old))
+	attached = mnt_has_parent(old);
+	/*
+	 * We need to allow open_tree(OPEN_TREE_CLONE) followed by
+	 * move_mount(), but mustn't allow "/" to be moved.
+	 */
+	if (old->mnt_ns && !attached)
 		goto out1;
 
 	if (old->mnt.mnt_flags & MNT_LOCKED)
@@ -2481,7 +2493,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	/*
 	 * Don't move a mount residing in a shared parent.
 	 */
-	if (IS_MNT_SHARED(old->mnt_parent))
+	if (attached && IS_MNT_SHARED(old->mnt_parent))
 		goto out1;
 	/*
 	 * Don't move a mount tree containing unbindable mounts to a destination
@@ -2495,7 +2507,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 			goto out1;
 
 	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
-				   &parent_path);
+				   attached ? &parent_path : NULL);
 	if (err)
 		goto out1;
 
@@ -3179,6 +3191,8 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
 
 /*
  * Move a mount from one place to another.
+ * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
+ * used to copy a mount subtree.
  *
  * Note the flags value is a combination of MOVE_MOUNT_* flags.
  */



* [PATCH 04/32] vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (2 preceding siblings ...)
  2018-07-10 22:41 ` [PATCH 03/32] teach move_mount(2) to work with OPEN_TREE_CLONE " David Howells
@ 2018-07-10 22:41 ` David Howells
  2018-07-10 22:42 ` [PATCH 05/32] vfs: Introduce the basic header for the new mount API's filesystem context " David Howells
                   ` (33 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:41 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Only the mount namespace code that implements mount(2) should be using the
MS_* flags.  Suppress them inside the kernel unless uapi/linux/mount.h is
included.
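
For filesystem code this means testing the SB_* superblock flags instead,
as the f2fs hunk in this patch does.  A minimal userspace sketch of that
pattern follows; it assumes SB_RDONLY has the value 1 as in
include/linux/fs.h, and ex_remount_goes_rw() is a hypothetical helper
modelled on the f2fs condition rather than an existing function.

```c
#include <assert.h>
#include <stdbool.h>

#define MS_RDONLY	1	/* Mount read-only (value from this patch) */
#define SB_RDONLY	1	/* assumed to mirror include/linux/fs.h */

/*
 * The two flags share a value, which is why code using the wrong name
 * still behaved; the patch only stops the MS_* name from being visible.
 */
static_assert(SB_RDONLY == MS_RDONLY, "SB_RDONLY and MS_RDONLY differ");

/* True when a read-only superblock is being remounted read-write. */
static bool ex_remount_goes_rw(unsigned long cur_s_flags, int new_flags)
{
	bool currently_ro = cur_s_flags & SB_RDONLY;

	return currently_ro && !(new_flags & SB_RDONLY);
}
```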

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/arc/kernel/setup.c       |    1 +
 arch/arm/kernel/atags_parse.c |    1 +
 arch/sh/kernel/setup.c        |    1 +
 arch/sparc/kernel/setup_32.c  |    1 +
 arch/sparc/kernel/setup_64.c  |    1 +
 arch/x86/kernel/setup.c       |    1 +
 drivers/base/devtmpfs.c       |    1 +
 fs/f2fs/super.c               |    2 +
 fs/pnode.c                    |    1 +
 fs/super.c                    |    1 +
 include/uapi/linux/fs.h       |   56 ++++-------------------------------------
 include/uapi/linux/mount.h    |   54 ++++++++++++++++++++++++++++++++++++++++
 init/do_mounts.c              |    1 +
 init/do_mounts_initrd.c       |    1 +
 security/apparmor/lsm.c       |    1 +
 security/apparmor/mount.c     |    1 +
 security/selinux/hooks.c      |    1 +
 security/tomoyo/mount.c       |    1 +
 18 files changed, 75 insertions(+), 52 deletions(-)

diff --git a/arch/arc/kernel/setup.c b/arch/arc/kernel/setup.c
index b2cae79a25d7..714dc5c2baf1 100644
--- a/arch/arc/kernel/setup.c
+++ b/arch/arc/kernel/setup.c
@@ -19,6 +19,7 @@
 #include <linux/of_fdt.h>
 #include <linux/of.h>
 #include <linux/cache.h>
+#include <uapi/linux/mount.h>
 #include <asm/sections.h>
 #include <asm/arcregs.h>
 #include <asm/tlb.h>
diff --git a/arch/arm/kernel/atags_parse.c b/arch/arm/kernel/atags_parse.c
index c10a3e8ee998..a8a4333929f5 100644
--- a/arch/arm/kernel/atags_parse.c
+++ b/arch/arm/kernel/atags_parse.c
@@ -24,6 +24,7 @@
 #include <linux/root_dev.h>
 #include <linux/screen_info.h>
 #include <linux/memblock.h>
+#include <uapi/linux/mount.h>
 
 #include <asm/setup.h>
 #include <asm/system_info.h>
diff --git a/arch/sh/kernel/setup.c b/arch/sh/kernel/setup.c
index c286cf5da6e7..2c0e0f37a318 100644
--- a/arch/sh/kernel/setup.c
+++ b/arch/sh/kernel/setup.c
@@ -32,6 +32,7 @@
 #include <linux/of.h>
 #include <linux/of_fdt.h>
 #include <linux/uaccess.h>
+#include <uapi/linux/mount.h>
 #include <asm/io.h>
 #include <asm/page.h>
 #include <asm/elf.h>
diff --git a/arch/sparc/kernel/setup_32.c b/arch/sparc/kernel/setup_32.c
index 13664c377196..7df3d704284c 100644
--- a/arch/sparc/kernel/setup_32.c
+++ b/arch/sparc/kernel/setup_32.c
@@ -34,6 +34,7 @@
 #include <linux/kdebug.h>
 #include <linux/export.h>
 #include <linux/start_kernel.h>
+#include <uapi/linux/mount.h>
 
 #include <asm/io.h>
 #include <asm/processor.h>
diff --git a/arch/sparc/kernel/setup_64.c b/arch/sparc/kernel/setup_64.c
index 7944b3ca216a..206bf81eedaf 100644
--- a/arch/sparc/kernel/setup_64.c
+++ b/arch/sparc/kernel/setup_64.c
@@ -33,6 +33,7 @@
 #include <linux/module.h>
 #include <linux/start_kernel.h>
 #include <linux/bootmem.h>
+#include <uapi/linux/mount.h>
 
 #include <asm/io.h>
 #include <asm/processor.h>
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..3413f53e0a35 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -51,6 +51,7 @@
 #include <linux/kvm_para.h>
 #include <linux/dma-contiguous.h>
 #include <xen/xen.h>
+#include <uapi/linux/mount.h>
 
 #include <linux/errno.h>
 #include <linux/kernel.h>
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index f7768077e817..79a235184fb5 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -25,6 +25,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
+#include <uapi/linux/mount.h>
 #include "base.h"
 
 static struct task_struct *thread;
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 3995e926ba3a..54bf50295d1e 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1453,7 +1453,7 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
 		err = dquot_suspend(sb, -1);
 		if (err < 0)
 			goto restore_opts;
-	} else if (f2fs_readonly(sb) && !(*flags & MS_RDONLY)) {
+	} else if (f2fs_readonly(sb) && !(*flags & SB_RDONLY)) {
 		/* dquot_resume needs RW */
 		sb->s_flags &= ~SB_RDONLY;
 		if (sb_any_quota_suspended(sb)) {
diff --git a/fs/pnode.c b/fs/pnode.c
index 53d411a371ce..1100e810d855 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -10,6 +10,7 @@
 #include <linux/mount.h>
 #include <linux/fs.h>
 #include <linux/nsproxy.h>
+#include <uapi/linux/mount.h>
 #include "internal.h"
 #include "pnode.h"
 
diff --git a/fs/super.c b/fs/super.c
index 50728d9c1a05..5132a32e5ebc 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -35,6 +35,7 @@
 #include <linux/fsnotify.h>
 #include <linux/lockdep.h>
 #include <linux/user_namespace.h>
+#include <uapi/linux/mount.h>
 #include "internal.h"
 
 static int thaw_super_locked(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 73e01918f996..1c982eb44ff4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -14,6 +14,11 @@
 #include <linux/ioctl.h>
 #include <linux/types.h>
 
+/* Use of MS_* flags within the kernel is restricted to core mount(2) code. */
+#if !defined(__KERNEL__)
+#include <linux/mount.h>
+#endif
+
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
  * the file limit at runtime and only root can increase the per-process
@@ -101,57 +106,6 @@ struct inodes_stat_t {
 
 #define NR_FILE  8192	/* this can well be larger on a larger system */
 
-
-/*
- * These are the fs-independent mount-flags: up to 32 flags are supported
- */
-#define MS_RDONLY	 1	/* Mount read-only */
-#define MS_NOSUID	 2	/* Ignore suid and sgid bits */
-#define MS_NODEV	 4	/* Disallow access to device special files */
-#define MS_NOEXEC	 8	/* Disallow program execution */
-#define MS_SYNCHRONOUS	16	/* Writes are synced at once */
-#define MS_REMOUNT	32	/* Alter flags of a mounted FS */
-#define MS_MANDLOCK	64	/* Allow mandatory locks on an FS */
-#define MS_DIRSYNC	128	/* Directory modifications are synchronous */
-#define MS_NOATIME	1024	/* Do not update access times. */
-#define MS_NODIRATIME	2048	/* Do not update directory access times */
-#define MS_BIND		4096
-#define MS_MOVE		8192
-#define MS_REC		16384
-#define MS_VERBOSE	32768	/* War is peace. Verbosity is silence.
-				   MS_VERBOSE is deprecated. */
-#define MS_SILENT	32768
-#define MS_POSIXACL	(1<<16)	/* VFS does not apply the umask */
-#define MS_UNBINDABLE	(1<<17)	/* change to unbindable */
-#define MS_PRIVATE	(1<<18)	/* change to private */
-#define MS_SLAVE	(1<<19)	/* change to slave */
-#define MS_SHARED	(1<<20)	/* change to shared */
-#define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
-#define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
-#define MS_I_VERSION	(1<<23) /* Update inode I_version field */
-#define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
-#define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
-
-/* These sb flags are internal to the kernel */
-#define MS_SUBMOUNT     (1<<26)
-#define MS_NOREMOTELOCK	(1<<27)
-#define MS_NOSEC	(1<<28)
-#define MS_BORN		(1<<29)
-#define MS_ACTIVE	(1<<30)
-#define MS_NOUSER	(1<<31)
-
-/*
- * Superblock flags that can be altered by MS_REMOUNT
- */
-#define MS_RMT_MASK	(MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
-			 MS_LAZYTIME)
-
-/*
- * Old magic mount flag and mask
- */
-#define MS_MGC_VAL 0xC0ED0000
-#define MS_MGC_MSK 0xffff0000
-
 /*
  * Structure for FS_IOC_FSGETXATTR[A] and FS_IOC_FSSETXATTR.
  */
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 89adf0d731ab..3634e065836c 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -1,6 +1,60 @@
 #ifndef _UAPI_LINUX_MOUNT_H
 #define _UAPI_LINUX_MOUNT_H
 
+/*
+ * These are the fs-independent mount-flags: up to 32 flags are supported
+ *
+ * Usage of these is restricted within the kernel to core mount(2) code and
+ * callers of sys_mount() only.  Filesystems should be using the SB_*
+ * equivalent instead.
+ */
+#define MS_RDONLY	 1	/* Mount read-only */
+#define MS_NOSUID	 2	/* Ignore suid and sgid bits */
+#define MS_NODEV	 4	/* Disallow access to device special files */
+#define MS_NOEXEC	 8	/* Disallow program execution */
+#define MS_SYNCHRONOUS	16	/* Writes are synced at once */
+#define MS_REMOUNT	32	/* Alter flags of a mounted FS */
+#define MS_MANDLOCK	64	/* Allow mandatory locks on an FS */
+#define MS_DIRSYNC	128	/* Directory modifications are synchronous */
+#define MS_NOATIME	1024	/* Do not update access times. */
+#define MS_NODIRATIME	2048	/* Do not update directory access times */
+#define MS_BIND		4096
+#define MS_MOVE		8192
+#define MS_REC		16384
+#define MS_VERBOSE	32768	/* War is peace. Verbosity is silence.
+				   MS_VERBOSE is deprecated. */
+#define MS_SILENT	32768
+#define MS_POSIXACL	(1<<16)	/* VFS does not apply the umask */
+#define MS_UNBINDABLE	(1<<17)	/* change to unbindable */
+#define MS_PRIVATE	(1<<18)	/* change to private */
+#define MS_SLAVE	(1<<19)	/* change to slave */
+#define MS_SHARED	(1<<20)	/* change to shared */
+#define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
+#define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
+#define MS_I_VERSION	(1<<23) /* Update inode I_version field */
+#define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
+#define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
+
+/* These sb flags are internal to the kernel */
+#define MS_SUBMOUNT     (1<<26)
+#define MS_NOREMOTELOCK	(1<<27)
+#define MS_NOSEC	(1<<28)
+#define MS_BORN		(1<<29)
+#define MS_ACTIVE	(1<<30)
+#define MS_NOUSER	(1<<31)
+
+/*
+ * Superblock flags that can be altered by MS_REMOUNT
+ */
+#define MS_RMT_MASK	(MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
+			 MS_LAZYTIME)
+
+/*
+ * Old magic mount flag and mask
+ */
+#define MS_MGC_VAL 0xC0ED0000
+#define MS_MGC_MSK 0xffff0000
+
 /*
  * open_tree() flags.
  */
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 2c71dabe5626..ea6f21bb9440 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -32,6 +32,7 @@
 #include <linux/nfs_fs.h>
 #include <linux/nfs_fs_sb.h>
 #include <linux/nfs_mount.h>
+#include <uapi/linux/mount.h>
 
 #include "do_mounts.h"
 
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index 5a91aefa7305..65de0412f80f 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -18,6 +18,7 @@
 #include <linux/sched.h>
 #include <linux/freezer.h>
 #include <linux/kmod.h>
+#include <uapi/linux/mount.h>
 
 #include "do_mounts.h"
 
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 74f17376202b..c65307dcd652 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -24,6 +24,7 @@
 #include <linux/audit.h>
 #include <linux/user_namespace.h>
 #include <net/sock.h>
+#include <uapi/linux/mount.h>
 
 #include "include/apparmor.h"
 #include "include/apparmorfs.h"
diff --git a/security/apparmor/mount.c b/security/apparmor/mount.c
index c1da22482bfb..8c3787399356 100644
--- a/security/apparmor/mount.c
+++ b/security/apparmor/mount.c
@@ -15,6 +15,7 @@
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/namei.h>
+#include <uapi/linux/mount.h>
 
 #include "include/apparmor.h"
 #include "include/audit.h"
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 2b5ee5fbd652..5bb53edd74cc 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -88,6 +88,7 @@
 #include <linux/msg.h>
 #include <linux/shm.h>
 #include <linux/bpf.h>
+#include <uapi/linux/mount.h>
 
 #include "avc.h"
 #include "objsec.h"
diff --git a/security/tomoyo/mount.c b/security/tomoyo/mount.c
index 807fd91dbb54..7dc7f59b7dde 100644
--- a/security/tomoyo/mount.c
+++ b/security/tomoyo/mount.c
@@ -6,6 +6,7 @@
  */
 
 #include <linux/slab.h>
+#include <uapi/linux/mount.h>
 #include "common.h"
 
 /* String table for special mount operations. */



* [PATCH 05/32] vfs: Introduce the basic header for the new mount API's filesystem context [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (3 preceding siblings ...)
  2018-07-10 22:41 ` [PATCH 04/32] vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled " David Howells
@ 2018-07-10 22:42 ` David Howells
  2018-07-10 22:42 ` [PATCH 06/32] vfs: Add LSM hooks for the new mount API " David Howells
                   ` (32 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:42 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Introduce a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount.  The context
is allocated at the beginning of the mount procedure and holds:

 (1) Filesystem type.

 (2) Namespaces.

 (3) Source/Device names (there may be multiple).

 (4) Superblock flags (SB_*).

 (5) Security details.

 (6) Filesystem-specific data, as set by the mount options.
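
To make items (4) and (6) concrete, here is a userspace sketch of what a
->parse_option() implementation might look like.  The callback signature
follows the one declared in this patch, but everything prefixed ex_ is
hypothetical: the cut-down context structure, the private data, and the
option names.  Which layer ends up handling "ro"/"rw" is an assumption made
for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SB_RDONLY	1	/* assumed to mirror include/linux/fs.h */

/* Cut-down stand-in for struct fs_context; only the fields used here. */
struct ex_fs_context {
	void		*fs_private;	/* The filesystem's context */
	unsigned int	sb_flags;	/* Proposed superblock flags (SB_*) */
	bool		sloppy;		/* T if unrecognised options are okay */
};

/* Hypothetical filesystem-specific data hung off fs_private. */
struct ex_fs_private {
	bool verbose;
};

/* Returns 0 if the option was accepted, -EINVAL otherwise. */
static int ex_parse_option(struct ex_fs_context *fc, char *opt, size_t len)
{
	struct ex_fs_private *priv = fc->fs_private;

	assert(priv != NULL);

	if (len == 2 && memcmp(opt, "ro", 2) == 0) {
		fc->sb_flags |= SB_RDONLY;
		return 0;
	}
	if (len == 2 && memcmp(opt, "rw", 2) == 0) {
		fc->sb_flags &= ~SB_RDONLY;
		return 0;
	}
	if (len == 7 && memcmp(opt, "verbose", 7) == 0) {
		priv->verbose = true;
		return 0;
	}

	/* Unrecognised options are tolerated only in sloppy mode. */
	return fc->sloppy ? 0 : -EINVAL;
}
```

Nothing touches a superblock at this stage; the parsed state accumulates in
the context until the superblock is created or reconfigured.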

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/fs_context.h |   73 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 include/linux/fs_context.h

diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
new file mode 100644
index 000000000000..0bde0a2a782e
--- /dev/null
+++ b/include/linux/fs_context.h
@@ -0,0 +1,73 @@
+/* Filesystem superblock creation and reconfiguration context.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FS_CONTEXT_H
+#define _LINUX_FS_CONTEXT_H
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+
+struct cred;
+struct dentry;
+struct file_operations;
+struct file_system_type;
+struct mnt_namespace;
+struct net;
+struct pid_namespace;
+struct super_block;
+struct user_namespace;
+struct vfsmount;
+
+enum fs_context_purpose {
+	FS_CONTEXT_FOR_USER_MOUNT,	/* New superblock for user-specified mount */
+	FS_CONTEXT_FOR_KERNEL_MOUNT,	/* New superblock for kernel-internal mount */
+	FS_CONTEXT_FOR_SUBMOUNT,	/* New superblock for automatic submount */
+	FS_CONTEXT_FOR_RECONFIGURE,	/* Superblock reconfiguration (remount) */
+};
+
+/*
+ * Filesystem context for holding the parameters used in the creation or
+ * reconfiguration of a superblock.
+ *
+ * Superblock creation fills in ->root whereas reconfiguration begins with this
+ * already set.
+ *
+ * See Documentation/filesystems/mounting.txt
+ */
+struct fs_context {
+	const struct fs_context_operations *ops;
+	struct file_system_type	*fs_type;
+	void			*fs_private;	/* The filesystem's context */
+	struct dentry		*root;		/* The root and superblock */
+	struct user_namespace	*user_ns;	/* The user namespace for this mount */
+	struct net		*net_ns;	/* The network namespace for this mount */
+	const struct cred	*cred;		/* The mounter's credentials */
+	char			*source;	/* The source name (eg. dev path) */
+	char			*subtype;	/* The subtype to set on the superblock */
+	void			*security;	/* The LSM context */
+	void			*s_fs_info;	/* Proposed s_fs_info */
+	unsigned int		sb_flags;	/* Proposed superblock flags (SB_*) */
+	enum fs_context_purpose	purpose:8;
+	bool			sloppy:1;	/* T if unrecognised options are okay */
+	bool			silent:1;	/* T if "o silent" specified */
+};
+
+struct fs_context_operations {
+	void (*free)(struct fs_context *fc);
+	int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+	int (*parse_source)(struct fs_context *fc, char *source);
+	int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
+	int (*parse_monolithic)(struct fs_context *fc, void *data);
+	int (*validate)(struct fs_context *fc);
+	int (*get_tree)(struct fs_context *fc);
+};
+
+#endif /* _LINUX_FS_CONTEXT_H */



* [PATCH 06/32] vfs: Add LSM hooks for the new mount API [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (4 preceding siblings ...)
  2018-07-10 22:42 ` [PATCH 05/32] vfs: Introduce the basic header for the new mount API's filesystem context " David Howells
@ 2018-07-10 22:42 ` David Howells
  2018-07-10 22:42 ` [PATCH 07/32] selinux: Implement the new mount API LSM hooks " David Howells
                   ` (31 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:42 UTC (permalink / raw)
  To: viro
  Cc: dhowells, linux-fsdevel, linux-security-module, torvalds, linux-kernel

Add LSM hooks for use by the new mount API and filesystem context code.
This includes:

 (1) Hooks to handle allocation, duplication and freeing of the security
     record attached to a filesystem context.

 (2) A hook to snoop source specifications.  There may be multiple of these
     if the filesystem supports it.  They will be local files/devices if
     fs_context::source_is_dev is true and will be something else, possibly
     remote server specifications, if false.

 (3) A hook to snoop superblock configuration options in key[=val] form.
     If the LSM decides it wants to handle it, it can suppress the option
     being passed to the filesystem.  Note that 'val' may include commas
     and binary data with the fsopen patch.

 (4) A hook to perform validation and allocation after the configuration
     has been done but before the superblock is allocated and set up.

 (5) A hook to transfer the security from the context to a newly created
     superblock.

 (6) A hook to rule on whether a path can be used as a mountpoint.
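
To illustrate item (3), here is a minimal userspace sketch of how a caller
might act on the option hook's return value, following the convention
documented in the patch below: a negative errno rejects the option, 1 means
the LSM consumed it, and 0 passes it on to the filesystem.  The structure
mock_fc and the functions mock_lsm_parse_option(), mock_fs_parse_option()
and parse_one_option() are hypothetical stand-ins for illustration, not
kernel API, and the "context=" handling is only an assumed example:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct fs_context; not the kernel structure. */
struct mock_fc {
	int lsm_opts_seen;	/* options the mock LSM kept for itself */
	int fs_opts_seen;	/* options that reached the mock filesystem */
};

/* Mock LSM hook: <0 rejects, 1 consumes the option, 0 passes it on. */
static int mock_lsm_parse_option(struct mock_fc *fc, const char *opt, size_t len)
{
	if (len >= 8 && strncmp(opt, "context=", 8) == 0) {
		if (len == 8)
			return -EINVAL;	/* empty value: reject */
		fc->lsm_opts_seen++;
		return 1;		/* handled here; hide it from the fs */
	}
	return 0;			/* not an option this LSM cares about */
}

/* Mock filesystem parser: accepts anything it is given. */
static int mock_fs_parse_option(struct mock_fc *fc, const char *opt, size_t len)
{
	(void)opt;
	(void)len;
	fc->fs_opts_seen++;
	return 0;
}

/* Offer the option to the LSM first, then to the fs if it was not consumed. */
static int parse_one_option(struct mock_fc *fc, const char *opt)
{
	size_t len;
	int ret;

	assert(fc != NULL && opt != NULL);
	len = strlen(opt);

	ret = mock_lsm_parse_option(fc, opt, len);
	if (ret < 0)
		return ret;	/* the LSM rejected the option */
	if (ret == 1)
		return 0;	/* the LSM consumed it; suppress it */
	return mock_fs_parse_option(fc, opt, len);
}
```

With this sketch, parse_one_option(&fc, "ro") reaches the mock filesystem,
whereas an option beginning with "context=" is counted by the mock LSM only.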

These are intended to replace:

	security_sb_copy_data
	security_sb_kern_mount
	security_sb_mount
	security_sb_set_mnt_opts
	security_sb_clone_mnt_opts
	security_sb_parse_opts_str

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-security-module@vger.kernel.org
---

 include/linux/lsm_hooks.h |   70 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/security.h  |   49 ++++++++++++++++++++++++++++++++
 security/security.c       |   46 ++++++++++++++++++++++++++++++
 3 files changed, 165 insertions(+)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 926607defd83..43ca087b6454 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -76,6 +76,56 @@
  *	changes on the process such as clearing out non-inheritable signal
  *	state.  This is called immediately after commit_creds().
  *
+ * Security hooks for mount using fs_context.
+ *	[See also Documentation/filesystems/mounting.txt]
+ *
+ * @fs_context_alloc:
+ *	Allocate and attach a security structure to fc->security.  This pointer
+ *	is initialised to NULL by the caller.
+ *	@fc indicates the new filesystem context.
+ *	@reference indicates the source dentry of a submount or start of reconfig.
+ * @fs_context_dup:
+ *	Allocate and attach a security structure to fc->security.  This pointer
+ *	is initialised to NULL by the caller.
+ *	@fc indicates the new filesystem context.
+ *	@src_fc indicates the original filesystem context.
+ * @fs_context_free:
+ *	Clean up a filesystem context.
+ *	@fc indicates the filesystem context.
+ * @fs_context_parse_source:
+ *	Check a source for the superblock (multiple sources may be provided).
+ *	The LSM may reject it with an error; otherwise it should return 0.
+ *	@fc indicates the filesystem context.
+ *	@src indicates the source name.  It is NUL-terminated.
+ * @fs_context_parse_option:
+ *	Userspace provided an option to configure a superblock.  The LSM may
+ *	reject it with an error or may use it for itself, in which case it
+ *	should return 1; otherwise it should return 0 to pass it on to the
+ *	filesystem.
+ *	@fc indicates the filesystem context.
+ *	@opt indicates the option in "key[=val]" form.  It is NUL-terminated,
+ *	but val may be binary data.
+ *	@len indicates the size of the option.
+ * @fs_context_validate:
+ *	Validate the filesystem context preparatory to applying it.  This is
+ *	done after all the options have been parsed.
+ *	@fc indicates the filesystem context.
+ * @sb_get_tree:
+ *	Assign the security to a newly created superblock.
+ *	@fc indicates the filesystem context.
+ *	@fc->root indicates the root that will be mounted.
+ *	@fc->root->d_sb points to the superblock.
+ * @sb_reconfigure:
+ *	Apply reconfiguration to the security on a superblock.
+ *	@fc indicates the filesystem context.
+ *	@fc->root indicates a dentry in the mount.
+ *	@fc->root->d_sb points to the superblock.
+ * @sb_mountpoint:
+ *	Equivalent of sb_mount, but with an fs_context.
+ *	@fc indicates the filesystem context.
+ *	@mountpoint indicates the path on which the mount will take place.
+ *	@mnt_flags indicates the MNT_* flags specified.
+ *
  * Security hooks for filesystem operations.
  *
  * @sb_alloc_security:
@@ -1459,6 +1509,17 @@ union security_list_options {
 	void (*bprm_committing_creds)(struct linux_binprm *bprm);
 	void (*bprm_committed_creds)(struct linux_binprm *bprm);
 
+	int (*fs_context_alloc)(struct fs_context *fc, struct dentry *reference);
+	int (*fs_context_dup)(struct fs_context *fc, struct fs_context *src_fc);
+	void (*fs_context_free)(struct fs_context *fc);
+	int (*fs_context_parse_source)(struct fs_context *fc, char *src);
+	int (*fs_context_parse_option)(struct fs_context *fc, char *opt, size_t len);
+	int (*fs_context_validate)(struct fs_context *fc);
+	int (*sb_get_tree)(struct fs_context *fc);
+	void (*sb_reconfigure)(struct fs_context *fc);
+	int (*sb_mountpoint)(struct fs_context *fc, struct path *mountpoint,
+			     unsigned int mnt_flags);
+
 	int (*sb_alloc_security)(struct super_block *sb);
 	void (*sb_free_security)(struct super_block *sb);
 	int (*sb_copy_data)(char *orig, char *copy);
@@ -1798,6 +1859,15 @@ struct security_hook_heads {
 	struct hlist_head bprm_check_security;
 	struct hlist_head bprm_committing_creds;
 	struct hlist_head bprm_committed_creds;
+	struct hlist_head fs_context_alloc;
+	struct hlist_head fs_context_dup;
+	struct hlist_head fs_context_free;
+	struct hlist_head fs_context_parse_source;
+	struct hlist_head fs_context_parse_option;
+	struct hlist_head fs_context_validate;
+	struct hlist_head sb_get_tree;
+	struct hlist_head sb_reconfigure;
+	struct hlist_head sb_mountpoint;
 	struct hlist_head sb_alloc_security;
 	struct hlist_head sb_free_security;
 	struct hlist_head sb_copy_data;
diff --git a/include/linux/security.h b/include/linux/security.h
index 15d121f156b3..7f093b27169d 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -53,6 +53,7 @@ struct msg_msg;
 struct xattr;
 struct xfrm_sec_ctx;
 struct mm_struct;
+struct fs_context;
 
 /* If capable should audit the security request */
 #define SECURITY_CAP_NOAUDIT 0
@@ -225,6 +226,16 @@ int security_bprm_set_creds(struct linux_binprm *bprm);
 int security_bprm_check(struct linux_binprm *bprm);
 void security_bprm_committing_creds(struct linux_binprm *bprm);
 void security_bprm_committed_creds(struct linux_binprm *bprm);
+int security_fs_context_alloc(struct fs_context *fc, struct dentry *reference);
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc);
+void security_fs_context_free(struct fs_context *fc);
+int security_fs_context_parse_source(struct fs_context *fc, char *src);
+int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len);
+int security_fs_context_validate(struct fs_context *fc);
+int security_sb_get_tree(struct fs_context *fc);
+void security_sb_reconfigure(struct fs_context *fc);
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+			   unsigned int mnt_flags);
 int security_sb_alloc(struct super_block *sb);
 void security_sb_free(struct super_block *sb);
 int security_sb_copy_data(char *orig, char *copy);
@@ -526,6 +537,44 @@ static inline void security_bprm_committed_creds(struct linux_binprm *bprm)
 {
 }
 
+static inline int security_fs_context_alloc(struct fs_context *fc,
+					    struct dentry *reference)
+{
+	return 0;
+}
+static inline int security_fs_context_dup(struct fs_context *fc,
+					  struct fs_context *src_fc)
+{
+	return 0;
+}
+static inline void security_fs_context_free(struct fs_context *fc)
+{
+}
+static inline int security_fs_context_parse_source(struct fs_context *fc, char *src)
+{
+	return 0;
+}
+static inline int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+	return 0;
+}
+static inline int security_fs_context_validate(struct fs_context *fc)
+{
+	return 0;
+}
+static inline int security_sb_get_tree(struct fs_context *fc)
+{
+	return 0;
+}
+static inline void security_sb_reconfigure(struct fs_context *fc)
+{
+}
+static inline int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+					 unsigned int mnt_flags)
+{
+	return 0;
+}
+
 static inline int security_sb_alloc(struct super_block *sb)
 {
 	return 0;
diff --git a/security/security.c b/security/security.c
index c4cbdb7d3a5f..597470fd3727 100644
--- a/security/security.c
+++ b/security/security.c
@@ -358,6 +358,52 @@ void security_bprm_committed_creds(struct linux_binprm *bprm)
 	call_void_hook(bprm_committed_creds, bprm);
 }
 
+int security_fs_context_alloc(struct fs_context *fc, struct dentry *reference)
+{
+	return call_int_hook(fs_context_alloc, 0, fc, reference);
+}
+
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+	return call_int_hook(fs_context_dup, 0, fc, src_fc);
+}
+
+void security_fs_context_free(struct fs_context *fc)
+{
+	call_void_hook(fs_context_free, fc);
+}
+
+int security_fs_context_parse_source(struct fs_context *fc, char *src)
+{
+	return call_int_hook(fs_context_parse_source, 0, fc, src);
+}
+
+int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+	return call_int_hook(fs_context_parse_option, 0, fc, opt, len);
+}
+
+int security_fs_context_validate(struct fs_context *fc)
+{
+	return call_int_hook(fs_context_validate, 0, fc);
+}
+
+int security_sb_get_tree(struct fs_context *fc)
+{
+	return call_int_hook(sb_get_tree, 0, fc);
+}
+
+void security_sb_reconfigure(struct fs_context *fc)
+{
+	call_void_hook(sb_reconfigure, fc);
+}
+
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+			   unsigned int mnt_flags)
+{
+	return call_int_hook(sb_mountpoint, 0, fc, mountpoint, mnt_flags);
+}
+
 int security_sb_alloc(struct super_block *sb)
 {
 	return call_int_hook(sb_alloc_security, 0, sb);



* [PATCH 07/32] selinux: Implement the new mount API LSM hooks [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (5 preceding siblings ...)
  2018-07-10 22:42 ` [PATCH 06/32] vfs: Add LSM hooks for the new mount API " David Howells
@ 2018-07-10 22:42 ` David Howells
  2018-07-11 14:08   ` Stephen Smalley
  2018-07-10 22:42 ` [PATCH 08/32] smack: Implement filesystem context security " David Howells
                   ` (30 subsequent siblings)
  37 siblings, 1 reply; 113+ messages in thread
From: David Howells @ 2018-07-10 22:42 UTC (permalink / raw)
  To: viro
  Cc: Paul Moore, Stephen Smalley, linux-kernel, dhowells,
	linux-security-module, selinux, linux-fsdevel, torvalds

Implement the new mount API LSM hooks for SELinux.  At some point the old
hooks will need to be removed.

Question: Should the ->fs_context_parse_source() hook be implemented to
check the labels on any source devices specified?
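
For orientation, the option parser in the diff below refuses a repeated
option and refuses context= combined with defcontext=.  A minimal userspace
sketch of that kind of check, tracking accepted options in a bitmask, might
look like the following.  The enum mock_opt values, struct mock_seen and
mock_note_option() are hypothetical and chosen for illustration; they are
not the kernel's Opt_* tokens or *_MNT flags, and this is not the patch's
implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical option identifiers; the kernel's own values differ. */
enum mock_opt {
	MOCK_OPT_CONTEXT,
	MOCK_OPT_FSCONTEXT,
	MOCK_OPT_ROOTCONTEXT,
	MOCK_OPT_DEFCONTEXT,
	MOCK_OPT_COUNT
};

/* Bitmask of the options accepted so far, one bit per enum mock_opt. */
struct mock_seen {
	unsigned int have;
};

/*
 * Record @opt.  Returns 0 on success, or -EINVAL if @opt is out of range,
 * repeats an earlier option, or combines context with defcontext.
 */
static int mock_note_option(struct mock_seen *seen, enum mock_opt opt)
{
	unsigned int bit;

	assert(seen != NULL);
	if ((unsigned int)opt >= MOCK_OPT_COUNT)
		return -EINVAL;
	bit = 1u << opt;

	if (seen->have & bit)
		return -EINVAL;		/* duplicate option */
	if (opt == MOCK_OPT_CONTEXT &&
	    (seen->have & (1u << MOCK_OPT_DEFCONTEXT)))
		return -EINVAL;		/* incompatible pair */
	if (opt == MOCK_OPT_DEFCONTEXT &&
	    (seen->have & (1u << MOCK_OPT_CONTEXT)))
		return -EINVAL;		/* incompatible pair */

	seen->have |= bit;
	return 0;
}
```

The sketch indexes the bitmask by a single set of identifiers throughout,
so the duplicate test and the incompatibility test look at the same bits.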

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paul Moore <paul@paul-moore.com>
cc: Stephen Smalley <sds@tycho.nsa.gov>
cc: selinux@tycho.nsa.gov
cc: linux-security-module@vger.kernel.org
---

 security/selinux/hooks.c |  264 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 264 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 5bb53edd74cc..bdecae4b7306 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -48,6 +48,7 @@
 #include <linux/fdtable.h>
 #include <linux/namei.h>
 #include <linux/mount.h>
+#include <linux/fs_context.h>
 #include <linux/netfilter_ipv4.h>
 #include <linux/netfilter_ipv6.h>
 #include <linux/tty.h>
@@ -2973,6 +2974,261 @@ static int selinux_umount(struct vfsmount *mnt, int flags)
 				   FILESYSTEM__UNMOUNT, NULL);
 }
 
+/* fsopen mount context operations */
+
+static int selinux_fs_context_alloc(struct fs_context *fc,
+				    struct dentry *reference)
+{
+	struct security_mnt_opts *opts;
+
+	opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+	if (!opts)
+		return -ENOMEM;
+
+	fc->security = opts;
+	return 0;
+}
+
+static int selinux_fs_context_dup(struct fs_context *fc,
+				  struct fs_context *src_fc)
+{
+	const struct security_mnt_opts *src = src_fc->security;
+	struct security_mnt_opts *opts;
+	int i, n;
+
+	opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+	if (!opts)
+		return -ENOMEM;
+	fc->security = opts;
+
+	if (!src || !src->num_mnt_opts)
+		return 0;
+	n = opts->num_mnt_opts = src->num_mnt_opts;
+
+	if (src->mnt_opts) {
+		opts->mnt_opts = kcalloc(n, sizeof(char *), GFP_KERNEL);
+		if (!opts->mnt_opts)
+			return -ENOMEM;
+
+		for (i = 0; i < n; i++) {
+			if (src->mnt_opts[i]) {
+				opts->mnt_opts[i] = kstrdup(src->mnt_opts[i],
+							    GFP_KERNEL);
+				if (!opts->mnt_opts[i])
+					return -ENOMEM;
+			}
+		}
+	}
+
+	if (src->mnt_opts_flags) {
+		opts->mnt_opts_flags = kmemdup(src->mnt_opts_flags,
+					       n * sizeof(int), GFP_KERNEL);
+		if (!opts->mnt_opts_flags)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void selinux_fs_context_free(struct fs_context *fc)
+{
+	struct security_mnt_opts *opts = fc->security;
+
+	if (opts) {
+		security_free_mnt_opts(opts);
+		fc->security = NULL;
+	}
+}
+
+static int selinux_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+	struct security_mnt_opts *opts = fc->security;
+	substring_t args[MAX_OPT_ARGS];
+	unsigned int have;
+	char *c, **oo;
+	int token, ctx, i, *of;
+
+	token = match_token(opt, tokens, args);
+	if (token == Opt_error)
+		return 0; /* Doesn't belong to us. */
+
+	have = 0;
+	for (i = 0; i < opts->num_mnt_opts; i++)
+		have |= 1 << opts->mnt_opts_flags[i];
+	if (have & (1 << token))
+		return -EINVAL;
+
+	switch (token) {
+	case Opt_context:
+		if (have & (1 << Opt_defcontext))
+			goto incompatible;
+		ctx = CONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_fscontext:
+		ctx = FSCONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_rootcontext:
+		ctx = ROOTCONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_defcontext:
+		if (have & (1 << Opt_context))
+			goto incompatible;
+		ctx = DEFCONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_labelsupport:
+		return 1;
+
+	default:
+		return -EINVAL;
+	}
+
+copy_context_string:
+	if (opts->num_mnt_opts > 3)
+		return -EINVAL;
+
+	of = krealloc(opts->mnt_opts_flags,
+		      (opts->num_mnt_opts + 1) * sizeof(int), GFP_KERNEL);
+	if (!of)
+		return -ENOMEM;
+	of[opts->num_mnt_opts] = 0;
+	opts->mnt_opts_flags = of;
+
+	oo = krealloc(opts->mnt_opts,
+		      (opts->num_mnt_opts + 1) * sizeof(char *), GFP_KERNEL);
+	if (!oo)
+		return -ENOMEM;
+	oo[opts->num_mnt_opts] = NULL;
+	opts->mnt_opts = oo;
+
+	c = match_strdup(&args[0]);
+	if (!c)
+		return -ENOMEM;
+	opts->mnt_opts[opts->num_mnt_opts] = c;
+	opts->mnt_opts_flags[opts->num_mnt_opts] = ctx;
+	opts->num_mnt_opts++;
+	return 1;
+
+incompatible:
+	return -EINVAL;
+}
+
+/*
+ * Validate the security parameters supplied for a reconfiguration/remount
+ * event.
+ */
+static int selinux_validate_for_sb_reconfigure(struct fs_context *fc)
+{
+	struct super_block *sb = fc->root->d_sb;
+	struct superblock_security_struct *sbsec = sb->s_security;
+	struct security_mnt_opts *opts = fc->security;
+	int rc, i, *flags;
+	char **mount_options;
+
+	if (!(sbsec->flags & SE_SBINITIALIZED))
+		return 0;
+
+	mount_options = opts->mnt_opts;
+	flags = opts->mnt_opts_flags;
+
+	for (i = 0; i < opts->num_mnt_opts; i++) {
+		u32 sid;
+
+		if (flags[i] == SBLABEL_MNT)
+			continue;
+
+		rc = security_context_str_to_sid(&selinux_state, mount_options[i],
+						 &sid, GFP_KERNEL);
+		if (rc) {
+			pr_warn("SELinux: security_context_str_to_sid"
+				"(%s) failed for (dev %s, type %s) errno=%d\n",
+				mount_options[i], sb->s_id, sb->s_type->name, rc);
+			goto inval;
+		}
+
+		switch (flags[i]) {
+		case FSCONTEXT_MNT:
+			if (bad_option(sbsec, FSCONTEXT_MNT, sbsec->sid, sid))
+				goto bad_option;
+			break;
+		case CONTEXT_MNT:
+			if (bad_option(sbsec, CONTEXT_MNT, sbsec->mntpoint_sid, sid))
+				goto bad_option;
+			break;
+		case ROOTCONTEXT_MNT: {
+			struct inode_security_struct *root_isec;
+			root_isec = backing_inode_security(sb->s_root);
+
+			if (bad_option(sbsec, ROOTCONTEXT_MNT, root_isec->sid, sid))
+				goto bad_option;
+			break;
+		}
+		case DEFCONTEXT_MNT:
+			if (bad_option(sbsec, DEFCONTEXT_MNT, sbsec->def_sid, sid))
+				goto bad_option;
+			break;
+		default:
+			goto inval;
+		}
+	}
+
+	rc = 0;
+out:
+	return rc;
+
+bad_option:
+	pr_warn("SELinux: unable to change security options "
+		"during remount (dev %s, type=%s)\n",
+		sb->s_id, sb->s_type->name);
+inval:
+	rc = -EINVAL;
+	goto out;
+}
+
+/*
+ * Validate the security context assembled from the option data supplied to
+ * mount.
+ */
+static int selinux_fs_context_validate(struct fs_context *fc)
+{
+	if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)
+		return selinux_validate_for_sb_reconfigure(fc);
+	return 0;
+}
+
+/*
+ * Set the security context on a superblock.
+ */
+static int selinux_sb_get_tree(struct fs_context *fc)
+{
+	const struct cred *cred = current_cred();
+	struct common_audit_data ad;
+	int rc;
+
+	rc = selinux_set_mnt_opts(fc->root->d_sb, fc->security, 0, NULL);
+	if (rc)
+		return rc;
+
+	/* Allow all mounts performed by the kernel */
+	if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
+		return 0;
+
+	ad.type = LSM_AUDIT_DATA_DENTRY;
+	ad.u.dentry = fc->root;
+	return superblock_has_perm(cred, fc->root->d_sb, FILESYSTEM__MOUNT, &ad);
+}
+
+static int selinux_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+				 unsigned int mnt_flags)
+{
+	const struct cred *cred = current_cred();
+
+	return path_has_perm(cred, mountpoint, FILE__MOUNTON);
+}
+
 /* inode security operations */
 
 static int selinux_inode_alloc_security(struct inode *inode)
@@ -6905,6 +7161,14 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(bprm_committing_creds, selinux_bprm_committing_creds),
 	LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),
 
+	LSM_HOOK_INIT(fs_context_alloc, selinux_fs_context_alloc),
+	LSM_HOOK_INIT(fs_context_dup, selinux_fs_context_dup),
+	LSM_HOOK_INIT(fs_context_free, selinux_fs_context_free),
+	LSM_HOOK_INIT(fs_context_parse_option, selinux_fs_context_parse_option),
+	LSM_HOOK_INIT(fs_context_validate, selinux_fs_context_validate),
+	LSM_HOOK_INIT(sb_get_tree, selinux_sb_get_tree),
+	LSM_HOOK_INIT(sb_mountpoint, selinux_sb_mountpoint),
+
 	LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
 	LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
 	LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),



* [PATCH 08/32] smack: Implement filesystem context security hooks [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (6 preceding siblings ...)
  2018-07-10 22:42 ` [PATCH 07/32] selinux: Implement the new mount API LSM hooks " David Howells
@ 2018-07-10 22:42 ` David Howells
  2018-07-10 23:13   ` Casey Schaufler
  2018-07-10 23:19   ` David Howells
  2018-07-10 22:42 ` [PATCH 09/32] apparmor: Implement security hooks for the new mount API " David Howells
                   ` (29 subsequent siblings)
  37 siblings, 2 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:42 UTC (permalink / raw)
  To: viro
  Cc: linux-kernel, dhowells, linux-fsdevel, linux-security-module,
	Casey Schaufler, torvalds

Implement filesystem context security hooks for the smack LSM.

Question: Should the ->fs_context_parse_source() hook be implemented to
check the labels on any source devices specified?

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Casey Schaufler <casey@schaufler-ca.com>
cc: linux-security-module@vger.kernel.org
---

 security/smack/smack_lsm.c |  309 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 309 insertions(+)

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 7ad226018f51..39780b06469b 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -42,6 +42,7 @@
 #include <linux/shm.h>
 #include <linux/binfmts.h>
 #include <linux/parser.h>
+#include <linux/fs_context.h>
 #include "smack.h"
 
 #define TRANS_TRUE	"TRUE"
@@ -521,6 +522,307 @@ static int smack_syslog(int typefrom_file)
 	return rc;
 }
 
+/*
+ * Mount context operations
+ */
+
+struct smack_fs_context {
+	union {
+		struct {
+			char		*fsdefault;
+			char		*fsfloor;
+			char		*fshat;
+			char		*fsroot;
+			char		*fstransmute;
+		};
+		char			*ptrs[5];
+
+	};
+	struct superblock_smack		*sbsp;
+	struct inode_smack		*isp;
+	bool				transmute;
+};
+
+/**
+ * smack_fs_context_free - Free the security data from a filesystem context
+ * @fc: The filesystem context to be cleaned up.
+ */
+static void smack_fs_context_free(struct fs_context *fc)
+{
+	struct smack_fs_context *ctx = fc->security;
+	int i;
+
+	if (ctx) {
+		for (i = 0; i < ARRAY_SIZE(ctx->ptrs); i++)
+			kfree(ctx->ptrs[i]);
+		kfree(ctx->isp);
+		kfree(ctx->sbsp);
+		kfree(ctx);
+		fc->security = NULL;
+	}
+}
+
+/**
+ * smack_fs_context_alloc - Allocate security data for a filesystem context
+ * @fc: The filesystem context.
+ * @reference: Reference dentry (automount/reconfigure) or NULL
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_alloc(struct fs_context *fc,
+				  struct dentry *reference)
+{
+	struct smack_fs_context *ctx;
+	struct superblock_smack *sbsp;
+	struct inode_smack *isp;
+	struct smack_known *skp;
+
+	ctx = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
+	if (!ctx)
+		goto nomem;
+	fc->security = ctx;
+
+	sbsp = kzalloc(sizeof(struct superblock_smack), GFP_KERNEL);
+	if (!sbsp)
+		goto nomem_free;
+	ctx->sbsp = sbsp;
+
+	isp = new_inode_smack(NULL);
+	if (!isp)
+		goto nomem_free;
+	ctx->isp = isp;
+
+	if (reference) {
+		if (reference->d_sb->s_security)
+			memcpy(sbsp, reference->d_sb->s_security, sizeof(*sbsp));
+	} else if (!smack_privileged(CAP_MAC_ADMIN)) {
+		/* Unprivileged mounts get root and default from the caller. */
+		skp = smk_of_current();
+		sbsp->smk_root = skp;
+		sbsp->smk_default = skp;
+	} else {
+		sbsp->smk_root = &smack_known_floor;
+		sbsp->smk_default = &smack_known_floor;
+		sbsp->smk_floor = &smack_known_floor;
+		sbsp->smk_hat = &smack_known_hat;
+		/* SMK_SB_INITIALIZED will be zero from kzalloc. */
+	}
+
+	return 0;
+
+nomem_free:
+	smack_fs_context_free(fc);
+nomem:
+	return -ENOMEM;
+}
+
+/**
+ * smack_fs_context_dup - Duplicate the security data on fs_context duplication
+ * @fc: The new filesystem context.
+ * @src_fc: The source filesystem context being duplicated.
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_dup(struct fs_context *fc,
+				struct fs_context *src_fc)
+{
+	struct smack_fs_context *dst, *src = src_fc->security;
+	int i;
+
+	dst = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
+	if (!dst)
+		goto nomem;
+	fc->security = dst;
+
+	dst->sbsp = kmemdup(src->sbsp, sizeof(struct superblock_smack),
+			    GFP_KERNEL);
+	if (!dst->sbsp)
+		goto nomem_free;
+
+	for (i = 0; i < ARRAY_SIZE(dst->ptrs); i++) {
+		if (src->ptrs[i]) {
+			dst->ptrs[i] = kstrdup(src->ptrs[i], GFP_KERNEL);
+			if (!dst->ptrs[i])
+				goto nomem_free;
+		}
+	}
+
+	return 0;
+
+nomem_free:
+	smack_fs_context_free(fc);
+nomem:
+	return -ENOMEM;
+}
+
+/**
+ * smack_fs_context_parse_option - Parse a single mount option
+ * @fc: The new filesystem context being constructed.
+ * @p: The option text buffer.
+ * @len: The length of the text.
+ *
+ * Returns 0 on success, or -EPERM, -EINVAL or -ENOMEM on error.
+ */
+static int smack_fs_context_parse_option(struct fs_context *fc, char *p, size_t len)
+{
+	struct smack_fs_context *ctx = fc->security;
+	substring_t args[MAX_OPT_ARGS];
+	int rc = -ENOMEM;
+	int token;
+
+	/* Unprivileged mounts don't get to specify Smack values. */
+	if (!smack_privileged(CAP_MAC_ADMIN))
+		return -EPERM;
+
+	token = match_token(p, smk_mount_tokens, args);
+	switch (token) {
+	case Opt_fsdefault:
+		if (ctx->fsdefault)
+			goto error_dup;
+		ctx->fsdefault = match_strdup(&args[0]);
+		if (!ctx->fsdefault)
+			goto error;
+		break;
+	case Opt_fsfloor:
+		if (ctx->fsfloor)
+			goto error_dup;
+		ctx->fsfloor = match_strdup(&args[0]);
+		if (!ctx->fsfloor)
+			goto error;
+		break;
+	case Opt_fshat:
+		if (ctx->fshat)
+			goto error_dup;
+		ctx->fshat = match_strdup(&args[0]);
+		if (!ctx->fshat)
+			goto error;
+		break;
+	case Opt_fsroot:
+		if (ctx->fsroot)
+			goto error_dup;
+		ctx->fsroot = match_strdup(&args[0]);
+		if (!ctx->fsroot)
+			goto error;
+		break;
+	case Opt_fstransmute:
+		if (ctx->fstransmute)
+			goto error_dup;
+		ctx->fstransmute = match_strdup(&args[0]);
+		if (!ctx->fstransmute)
+			goto error;
+		break;
+	default:
+		pr_warn("Smack: unknown mount option\n");
+		goto error_inval;
+	}
+
+	return 0;
+
+error_dup:
+	pr_warn("Smack: duplicate mount option\n");
+error_inval:
+	rc = -EINVAL;
+error:
+	return rc;
+}
+
+/**
+ * smack_fs_context_validate - Validate the filesystem context security data
+ * @fc: The filesystem context.
+ *
+ * Returns 0 on success or the error from smk_import_entry() on failure.
+ */
+static int smack_fs_context_validate(struct fs_context *fc)
+{
+	struct smack_fs_context *ctx = fc->security;
+	struct superblock_smack *sbsp = ctx->sbsp;
+	struct inode_smack *isp = ctx->isp;
+	struct smack_known *skp;
+
+	if (ctx->fsdefault) {
+		skp = smk_import_entry(ctx->fsdefault, 0);
+		if (IS_ERR(skp))
+			return PTR_ERR(skp);
+		sbsp->smk_default = skp;
+	}
+
+	if (ctx->fsfloor) {
+		skp = smk_import_entry(ctx->fsfloor, 0);
+		if (IS_ERR(skp))
+			return PTR_ERR(skp);
+		sbsp->smk_floor = skp;
+	}
+
+	if (ctx->fshat) {
+		skp = smk_import_entry(ctx->fshat, 0);
+		if (IS_ERR(skp))
+			return PTR_ERR(skp);
+		sbsp->smk_hat = skp;
+	}
+
+	if (ctx->fsroot || ctx->fstransmute) {
+		skp = smk_import_entry(ctx->fstransmute ?: ctx->fsroot, 0);
+		if (IS_ERR(skp))
+			return PTR_ERR(skp);
+		sbsp->smk_root = skp;
+		ctx->transmute = !!ctx->fstransmute;
+	}
+
+	isp->smk_inode = sbsp->smk_root;
+	return 0;
+}
+
+/**
+ * smack_sb_get_tree - Assign the context to a newly created superblock
+ * @fc: The new filesystem context.
+ *
+ * Always returns 0.
+ */
+static int smack_sb_get_tree(struct fs_context *fc)
+{
+	struct smack_fs_context *ctx = fc->security;
+	struct superblock_smack *sbsp = ctx->sbsp;
+	struct dentry *root = fc->root;
+	struct inode *inode = d_backing_inode(root);
+	struct super_block *sb = root->d_sb;
+	struct inode_smack *isp;
+	bool transmute = ctx->transmute;
+
+	if (sb->s_security)
+		return 0;
+
+	if (!smack_privileged(CAP_MAC_ADMIN)) {
+		/*
+		 * For a handful of fs types with no user-controlled
+		 * backing store it's okay to trust security labels
+		 * in the filesystem. The rest are untrusted.
+		 */
+		if (fc->user_ns != &init_user_ns &&
+		    sb->s_magic != SYSFS_MAGIC && sb->s_magic != TMPFS_MAGIC &&
+		    sb->s_magic != RAMFS_MAGIC) {
+			transmute = true;
+			sbsp->smk_flags |= SMK_SB_UNTRUSTED;
+		}
+	}
+
+	sbsp->smk_flags |= SMK_SB_INITIALIZED;
+	sb->s_security = sbsp;
+	ctx->sbsp = NULL;
+
+	/* Initialize the root inode. */
+	isp = inode->i_security;
+	if (isp == NULL) {
+		isp = ctx->isp;
+		ctx->isp = NULL;
+		inode->i_security = isp;
+	} else
+		isp->smk_inode = sbsp->smk_root;
+
+	if (transmute)
+		isp->smk_flags |= SMK_INODE_TRANSMUTE;
+
+	return 0;
+}
 
 /*
  * Superblock Hooks.
@@ -4647,6 +4949,13 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(ptrace_traceme, smack_ptrace_traceme),
 	LSM_HOOK_INIT(syslog, smack_syslog),
 
+	LSM_HOOK_INIT(fs_context_alloc, smack_fs_context_alloc),
+	LSM_HOOK_INIT(fs_context_dup, smack_fs_context_dup),
+	LSM_HOOK_INIT(fs_context_free, smack_fs_context_free),
+	LSM_HOOK_INIT(fs_context_parse_option, smack_fs_context_parse_option),
+	LSM_HOOK_INIT(fs_context_validate, smack_fs_context_validate),
+	LSM_HOOK_INIT(sb_get_tree, smack_sb_get_tree),
+
 	LSM_HOOK_INIT(sb_alloc_security, smack_sb_alloc_security),
 	LSM_HOOK_INIT(sb_free_security, smack_sb_free_security),
 	LSM_HOOK_INIT(sb_copy_data, smack_sb_copy_data),



* [PATCH 09/32] apparmor: Implement security hooks for the new mount API [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (7 preceding siblings ...)
  2018-07-10 22:42 ` [PATCH 08/32] smack: Implement filesystem context security " David Howells
@ 2018-07-10 22:42 ` David Howells
  2018-07-10 22:42 ` [PATCH 10/32] tomoyo: " David Howells
                   ` (28 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:42 UTC (permalink / raw)
  To: viro
  Cc: John Johansen, apparmor, linux-kernel, dhowells,
	linux-security-module, linux-fsdevel, torvalds

Implement hooks to check the creation of new mountpoints for AppArmor.

Unfortunately, the DFA evaluation takes the option data last, after the
details of the mountpoint, so the mount options have to be cached in the
fs_context by the option hook until the new mountpoint hook is reached.
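
As a rough userspace sketch of that caching step, assuming options are
simply appended to one growing, space-separated string, something like the
following could be used.  The structure mock_opt_cache and the functions
mock_cache_option() and mock_cache_free() are hypothetical names, and
realloc() stands in for the kernel allocator; the separator and the size
bookkeeping are this sketch's own choices:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-context option cache. */
struct mock_opt_cache {
	char	*saved_options;	/* NUL-terminated, space-separated options */
	size_t	saved_size;	/* length of saved_options, excluding the NUL */
};

/* Append @len bytes of @opt; returns 0 on success or -ENOMEM on failure. */
static int mock_cache_option(struct mock_opt_cache *c, const char *opt, size_t len)
{
	size_t space;
	char *p, *q;

	assert(c != NULL && opt != NULL);
	space = c->saved_size > 0 ? 1 : 0;	/* separator after the first */

	p = realloc(c->saved_options, c->saved_size + space + len + 1);
	if (!p)
		return -ENOMEM;	/* the old buffer still belongs to the cache */

	q = p + c->saved_size;	/* points at the old terminating NUL, if any */
	if (space)
		*q++ = ' ';
	memcpy(q, opt, len);
	q[len] = '\0';

	c->saved_options = p;
	c->saved_size += space + len;	/* count exactly what was appended */
	return 0;
}

/* Release the cached string and reset the cache to empty. */
static void mock_cache_free(struct mock_opt_cache *c)
{
	free(c->saved_options);
	c->saved_options = NULL;
	c->saved_size = 0;
}
```

Caching "ro" and then "noexec" leaves the single string "ro noexec" in
saved_options, ready to be handed over in one piece at the later check.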

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: John Johansen <john.johansen@canonical.com>
cc: apparmor@lists.ubuntu.com
cc: linux-security-module@vger.kernel.org
---

 security/apparmor/include/mount.h |   11 +++++
 security/apparmor/lsm.c           |   80 +++++++++++++++++++++++++++++++++++++
 security/apparmor/mount.c         |   46 +++++++++++++++++++++
 3 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/security/apparmor/include/mount.h b/security/apparmor/include/mount.h
index 25d6067fa6ef..0441bfae30fa 100644
--- a/security/apparmor/include/mount.h
+++ b/security/apparmor/include/mount.h
@@ -16,6 +16,7 @@
 
 #include <linux/fs.h>
 #include <linux/path.h>
+#include <linux/fs_context.h>
 
 #include "domain.h"
 #include "policy.h"
@@ -27,7 +28,13 @@
 #define AA_AUDIT_DATA		0x40
 #define AA_MNT_CONT_MATCH	0x40
 
-#define AA_MS_IGNORE_MASK (MS_KERNMOUNT | MS_NOSEC | MS_ACTIVE | MS_BORN)
+#define AA_SB_IGNORE_MASK (SB_KERNMOUNT | SB_NOSEC | SB_ACTIVE | SB_BORN)
+
+struct apparmor_fs_context {
+	struct fs_context	fc;
+	char			*saved_options;
+	size_t			saved_size;
+};
 
 int aa_remount(struct aa_label *label, const struct path *path,
 	       unsigned long flags, void *data);
@@ -45,6 +52,8 @@ int aa_move_mount(struct aa_label *label, const struct path *path,
 int aa_new_mount(struct aa_label *label, const char *dev_name,
 		 const struct path *path, const char *type, unsigned long flags,
 		 void *data);
+int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
+		    const struct path *mountpoint);
 
 int aa_umount(struct aa_label *label, struct vfsmount *mnt, int flags);
 
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index c65307dcd652..29803dc604f8 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -520,6 +520,78 @@ static int apparmor_file_mprotect(struct vm_area_struct *vma,
 			   !(vma->vm_flags & VM_SHARED) ? MAP_PRIVATE : 0);
 }
 
+static int apparmor_fs_context_alloc(struct fs_context *fc, struct dentry *reference)
+{
+	struct apparmor_fs_context *afc;
+
+	afc = kzalloc(sizeof(*afc), GFP_KERNEL);
+	if (!afc)
+		return -ENOMEM;
+
+	fc->security = afc;
+	return 0;
+}
+
+static int apparmor_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+	fc->security = NULL;
+	return 0;
+}
+
+static void apparmor_fs_context_free(struct fs_context *fc)
+{
+	struct apparmor_fs_context *afc = fc->security;
+
+	if (afc) {
+		kfree(afc->saved_options);
+		kfree(afc);
+	}
+}
+
+/*
+ * As a temporary hack, we buffer all the options.  The problem is that we need
+ * to pass them to the DFA evaluator *after* mount point parameters, which
+ * means deferring the entire check to the sb_mountpoint hook.
+ */
+static int apparmor_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+	struct apparmor_fs_context *afc = fc->security;
+	size_t space = 0;
+	char *p, *q;
+
+	if (afc->saved_size > 0)
+		space = 1;
+
+	p = krealloc(afc->saved_options, afc->saved_size + space + len + 1, GFP_KERNEL);
+	if (!p)
+		return -ENOMEM;
+
+	q = p + afc->saved_size;
+	if (q != p)
+		*q++ = ' ';
+	memcpy(q, opt, len);
+	q += len;
+	*q = 0;
+
+	afc->saved_options = p;
+	afc->saved_size += space + len;
+	return 0;
+}
+
+static int apparmor_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+				  unsigned int mnt_flags)
+{
+	struct aa_label *label;
+	int error = 0;
+
+	label = __begin_current_label_crit_section();
+	if (!unconfined(label))
+		error = aa_new_mount_fc(label, fc, mountpoint);
+	__end_current_label_crit_section(label);
+
+	return error;
+}
+
 static int apparmor_sb_mount(const char *dev_name, const struct path *path,
 			     const char *type, unsigned long flags, void *data)
 {
@@ -530,7 +602,7 @@ static int apparmor_sb_mount(const char *dev_name, const struct path *path,
 	if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
 		flags &= ~MS_MGC_MSK;
 
-	flags &= ~AA_MS_IGNORE_MASK;
+	flags &= ~AA_SB_IGNORE_MASK;
 
 	label = __begin_current_label_crit_section();
 	if (!unconfined(label)) {
@@ -1133,6 +1205,12 @@ static struct security_hook_list apparmor_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(capget, apparmor_capget),
 	LSM_HOOK_INIT(capable, apparmor_capable),
 
+	LSM_HOOK_INIT(fs_context_alloc, apparmor_fs_context_alloc),
+	LSM_HOOK_INIT(fs_context_dup, apparmor_fs_context_dup),
+	LSM_HOOK_INIT(fs_context_free, apparmor_fs_context_free),
+	LSM_HOOK_INIT(fs_context_parse_option, apparmor_fs_context_parse_option),
+	LSM_HOOK_INIT(sb_mountpoint, apparmor_sb_mountpoint),
+
 	LSM_HOOK_INIT(sb_mount, apparmor_sb_mount),
 	LSM_HOOK_INIT(sb_umount, apparmor_sb_umount),
 	LSM_HOOK_INIT(sb_pivotroot, apparmor_sb_pivotroot),
diff --git a/security/apparmor/mount.c b/security/apparmor/mount.c
index 8c3787399356..3c95fffb76ac 100644
--- a/security/apparmor/mount.c
+++ b/security/apparmor/mount.c
@@ -554,6 +554,52 @@ int aa_new_mount(struct aa_label *label, const char *dev_name,
 	return error;
 }
 
+int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
+		    const struct path *mountpoint)
+{
+	struct apparmor_fs_context *afc = fc->security;
+	struct aa_profile *profile;
+	char *buffer = NULL, *dev_buffer = NULL;
+	bool binary;
+	int error;
+	struct path tmp_path, *dev_path = NULL;
+
+	AA_BUG(!label);
+	AA_BUG(!mountpoint);
+
+	binary = fc->fs_type->fs_flags & FS_BINARY_MOUNTDATA;
+
+	if (fc->fs_type->fs_flags & FS_REQUIRES_DEV) {
+		if (!fc->source)
+			return -ENOENT;
+
+		error = kern_path(fc->source, LOOKUP_FOLLOW, &tmp_path);
+		if (error)
+			return error;
+		dev_path = &tmp_path;
+	}
+
+	get_buffers(buffer, dev_buffer);
+	if (dev_path) {
+		error = fn_for_each_confined(label, profile,
+			match_mnt(profile, mountpoint, buffer, dev_path, dev_buffer,
+				  fc->fs_type->name,
+				  fc->sb_flags & ~AA_SB_IGNORE_MASK,
+				  afc->saved_options, binary));
+	} else {
+		error = fn_for_each_confined(label, profile,
+			match_mnt_path_str(profile, mountpoint, buffer,
+					   fc->source, fc->fs_type->name,
+					   fc->sb_flags & ~AA_SB_IGNORE_MASK,
+					   afc->saved_options, binary, NULL));
+	}
+	put_buffers(buffer, dev_buffer);
+	if (dev_path)
+		path_put(dev_path);
+
+	return error;
+}
+
 static int profile_umount(struct aa_profile *profile, struct path *path,
 			  char *buffer)
 {

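For illustration only, here is a userspace sketch of the kind of option
buffering that the fs_context_parse_option hook above performs: each option
is appended to one space-separated, NUL-terminated string.  It is not the
patch's code.  struct opt_buf and opt_buf_append() are hypothetical names,
and realloc() stands in for krealloc().  The sketch assumes the stored size
is the string length excluding the trailing NUL, so the separator is only
counted when one was actually written.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for the per-context option buffer. */
struct opt_buf {
	char *saved;	/* NUL-terminated, space-separated options, or NULL */
	size_t size;	/* strlen(saved), excluding the trailing NUL */
};

/*
 * Append one option of @len bytes, separated from any previous option by a
 * single space.  Returns 0 on success or -1 if memory could not be obtained.
 */
static int opt_buf_append(struct opt_buf *b, const char *opt, size_t len)
{
	size_t space = b->size > 0 ? 1 : 0;
	char *p, *q;

	assert(b != NULL && opt != NULL);

	p = realloc(b->saved, b->size + space + len + 1);
	if (!p)
		return -1;

	q = p + b->size;	/* points at the old terminating NUL, if any */
	if (space)
		*q++ = ' ';
	memcpy(q, opt, len);
	q[len] = '\0';

	b->saved = p;
	b->size += space + len;	/* count the separator only if written */
	return 0;
}
```

Starting from an empty buffer, appending "ro" and then "noexec" leaves the
single string "ro noexec" in the buffer, which is the form a deferred check
could then consume in one go.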


* [PATCH 10/32] tomoyo: Implement security hooks for the new mount API [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (8 preceding siblings ...)
  2018-07-10 22:42 ` [PATCH 09/32] apparmor: Implement security hooks for the new mount API " David Howells
@ 2018-07-10 22:42 ` David Howells
  2018-07-10 23:34   ` Tetsuo Handa
  2018-07-10 22:42 ` [PATCH 11/32] vfs: Require specification of size of mount data for internal mounts " David Howells
                   ` (27 subsequent siblings)
  37 siblings, 1 reply; 113+ messages in thread
From: David Howells @ 2018-07-10 22:42 UTC (permalink / raw)
  To: viro
  Cc: Tetsuo Handa, linux-kernel, dhowells, linux-fsdevel,
	linux-security-module, tomoyo-dev-en, torvalds

Implement the security hook to check the creation of a new mountpoint for
Tomoyo.

As far as I can tell, Tomoyo doesn't make use of the mount data or parse
any mount options, so I haven't implemented any of the fs_context hooks for
it.
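
For illustration only: the sb_mountpoint hook is handed MNT_* flags, whereas
Tomoyo's existing mount check takes MS_* flags, so the flags have to be
translated before the check.  A minimal table-driven sketch of such a
translation in userspace C might look like the following.  The flag values
are copied into the sketch on the assumption that they match the kernel
headers, and struct flag_map and mnt_flags_to_ms() are hypothetical names
that are not part of this patch.

```c
#include <stddef.h>

/* Assumed to mirror include/linux/mount.h (per-mountpoint flags). */
#define MNT_NOSUID	0x01
#define MNT_NODEV	0x02
#define MNT_NOEXEC	0x04
#define MNT_NOATIME	0x08
#define MNT_NODIRATIME	0x10
#define MNT_RELATIME	0x20
#define MNT_READONLY	0x40

/* Assumed to mirror include/uapi/linux/mount.h (mount(2) flags). */
#define MS_RDONLY	1UL
#define MS_NOSUID	2UL
#define MS_NODEV	4UL
#define MS_NOEXEC	8UL
#define MS_NOATIME	1024UL
#define MS_NODIRATIME	2048UL
#define MS_RELATIME	(1UL << 21)

/* Hypothetical pairing of one MNT_* bit with its MS_* equivalent. */
struct flag_map {
	unsigned int mnt;
	unsigned long ms;
};

static const struct flag_map mnt_to_ms[] = {
	{ MNT_NOSUID,		MS_NOSUID },
	{ MNT_NODEV,		MS_NODEV },
	{ MNT_NOEXEC,		MS_NOEXEC },
	{ MNT_NOATIME,		MS_NOATIME },
	{ MNT_NODIRATIME,	MS_NODIRATIME },
	{ MNT_RELATIME,		MS_RELATIME },
	{ MNT_READONLY,		MS_RDONLY },
};

/* Translate a mask of MNT_* flags into the equivalent MS_* mask. */
static unsigned long mnt_flags_to_ms(unsigned int mnt_flags)
{
	unsigned long ms_flags = 0;
	size_t i;

	for (i = 0; i < sizeof(mnt_to_ms) / sizeof(mnt_to_ms[0]); i++)
		if (mnt_flags & mnt_to_ms[i].mnt)
			ms_flags |= mnt_to_ms[i].ms;
	return ms_flags;
}
```

With this, mnt_flags_to_ms(MNT_NOSUID | MNT_READONLY) yields
MS_NOSUID | MS_RDONLY.  A table keeps the pairs in one place; an open-coded
chain of if-statements is equivalent.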

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
cc: tomoyo-dev-en@lists.sourceforge.jp
cc: linux-security-module@vger.kernel.org
---

 security/tomoyo/common.h |    3 +++
 security/tomoyo/mount.c  |   45 +++++++++++++++++++++++++++++++++++++++++++++
 security/tomoyo/tomoyo.c |   15 +++++++++++++++
 3 files changed, 63 insertions(+)

diff --git a/security/tomoyo/common.h b/security/tomoyo/common.h
index 539bcdd30bb8..e637ce73f7f9 100644
--- a/security/tomoyo/common.h
+++ b/security/tomoyo/common.h
@@ -971,6 +971,9 @@ int tomoyo_init_request_info(struct tomoyo_request_info *r,
 			     const u8 index);
 int tomoyo_mkdev_perm(const u8 operation, const struct path *path,
 		      const unsigned int mode, unsigned int dev);
+int tomoyo_mount_permission_fc(struct fs_context *fc,
+			       const struct path *mountpoint,
+			       unsigned int mnt_flags);
 int tomoyo_mount_permission(const char *dev_name, const struct path *path,
 			    const char *type, unsigned long flags,
 			    void *data_page);
diff --git a/security/tomoyo/mount.c b/security/tomoyo/mount.c
index 7dc7f59b7dde..9ec84ab6f5e1 100644
--- a/security/tomoyo/mount.c
+++ b/security/tomoyo/mount.c
@@ -6,6 +6,7 @@
  */
 
 #include <linux/slab.h>
+#include <linux/fs_context.h>
 #include <uapi/linux/mount.h>
 #include "common.h"
 
@@ -236,3 +237,47 @@ int tomoyo_mount_permission(const char *dev_name, const struct path *path,
 	tomoyo_read_unlock(idx);
 	return error;
 }
+
+/**
+ * tomoyo_mount_permission_fc - Check permission to create a new mount.
+ * @fc:		Context describing the object to be mounted.
+ * @mountpoint:	The target object to mount on.
+ * @mnt_flags:	The MNT_* flags to be set on the mountpoint.
+ *
+ * Check the permission to create a mount of the object described in @fc.  Note
+ * that the source object may be a newly created superblock or may be an
+ * existing one picked from the filesystem (bind mount).
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
+int tomoyo_mount_permission_fc(struct fs_context *fc,
+			       const struct path *mountpoint,
+			       unsigned int mnt_flags)
+{
+	struct tomoyo_request_info r;
+	unsigned int ms_flags = 0;
+	int error;
+	int idx;
+
+	if (tomoyo_init_request_info(&r, NULL, TOMOYO_MAC_FILE_MOUNT) ==
+	    TOMOYO_CONFIG_DISABLED)
+		return 0;
+
+	/* Convert MNT_* flags to MS_* equivalents. */
+	if (mnt_flags & MNT_NOSUID)	ms_flags |= MS_NOSUID;
+	if (mnt_flags & MNT_NODEV)	ms_flags |= MS_NODEV;
+	if (mnt_flags & MNT_NOEXEC)	ms_flags |= MS_NOEXEC;
+	if (mnt_flags & MNT_NOATIME)	ms_flags |= MS_NOATIME;
+	if (mnt_flags & MNT_NODIRATIME)	ms_flags |= MS_NODIRATIME;
+	if (mnt_flags & MNT_RELATIME)	ms_flags |= MS_RELATIME;
+	if (mnt_flags & MNT_READONLY)	ms_flags |= MS_RDONLY;
+
+	idx = tomoyo_read_lock();
+	/* TODO: There may be multiple sources; for the moment, just pick the
+	 * first if there is one.
+	 */
+	error = tomoyo_mount_acl(&r, fc->source, mountpoint, fc->fs_type->name,
+				 ms_flags);
+	tomoyo_read_unlock(idx);
+	return error;
+}
diff --git a/security/tomoyo/tomoyo.c b/security/tomoyo/tomoyo.c
index 213b8c593668..31fd6bd4f657 100644
--- a/security/tomoyo/tomoyo.c
+++ b/security/tomoyo/tomoyo.c
@@ -391,6 +391,20 @@ static int tomoyo_path_chroot(const struct path *path)
 	return tomoyo_path_perm(TOMOYO_TYPE_CHROOT, path, NULL);
 }
 
+/**
+ * tomoyo_sb_mountpoint - Target for security_sb_mountpoint().
+ * @fc:		Context describing the object to be mounted.
+ * @mountpoint:	The target object to mount on.
+ * @mnt_flags:	Mountpoint specific options (as MNT_* flags).
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
+static int tomoyo_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+				unsigned int mnt_flags)
+{
+	return tomoyo_mount_permission_fc(fc, mountpoint, mnt_flags);
+}
+
 /**
  * tomoyo_sb_mount - Target for security_sb_mount().
  *
@@ -519,6 +533,7 @@ static struct security_hook_list tomoyo_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(path_chmod, tomoyo_path_chmod),
 	LSM_HOOK_INIT(path_chown, tomoyo_path_chown),
 	LSM_HOOK_INIT(path_chroot, tomoyo_path_chroot),
+	LSM_HOOK_INIT(sb_mountpoint, tomoyo_sb_mountpoint),
 	LSM_HOOK_INIT(sb_mount, tomoyo_sb_mount),
 	LSM_HOOK_INIT(sb_umount, tomoyo_sb_umount),
 	LSM_HOOK_INIT(sb_pivotroot, tomoyo_sb_pivotroot),



* [PATCH 11/32] vfs: Require specification of size of mount data for internal mounts [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (9 preceding siblings ...)
  2018-07-10 22:42 ` [PATCH 10/32] tomoyo: " David Howells
@ 2018-07-10 22:42 ` David Howells
  2018-07-10 22:51   ` Linus Torvalds
  2018-07-10 22:42 ` [PATCH 12/32] vfs: Separate changing mount flags full remount " David Howells
                   ` (26 subsequent siblings)
  37 siblings, 1 reply; 113+ messages in thread
From: David Howells @ 2018-07-10 22:42 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Require internal mounts to specify the size of the mount data that they
pass to the VFS mounting functions.  The problem is that the legacy
handling for the upcoming mount-context patches has to copy an entire page,
as that is the defined size of the buffer, but some of the internal callers
pass in a short piece of stack space, so the copy may run into the stack
guard page.
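
For illustration only, the hazard can be sketched in userspace C: once the
caller states how much data it really has, a copy into a page-sized buffer
reads just that many bytes instead of a whole page from a short source.
This is not code from the series.  copy_mount_data() and SKETCH_PAGE_SIZE
are hypothetical names, calloc() stands in for a kernel page allocation,
and 4096 is an assumed page size.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* assumption: stand-in for PAGE_SIZE */

/*
 * Duplicate caller-supplied mount data into a zeroed page-sized buffer.
 * Only @data_size bytes are read from @data.  Returns NULL if there is no
 * data, if it does not fit in a page, or if allocation fails.
 */
static void *copy_mount_data(const void *data, size_t data_size)
{
	char *page;

	if (!data || data_size == 0 || data_size > SKETCH_PAGE_SIZE)
		return NULL;

	page = calloc(1, SKETCH_PAGE_SIZE);
	if (!page)
		return NULL;

	memcpy(page, data, data_size);	/* never reads past the source */
	return page;
}
```

A caller holding a short option string on its stack, say
char opts[] = "huge=within_size", would pass sizeof(opts) as the size, so
only those bytes are read and the rest of the page stays zeroed.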

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/ia64/kernel/perfmon.c                |    3 +-
 arch/powerpc/platforms/cell/spufs/inode.c |    6 ++--
 arch/s390/hypfs/inode.c                   |    7 +++--
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c  |    2 +
 drivers/base/devtmpfs.c                   |    6 ++--
 drivers/dax/super.c                       |    2 +
 drivers/gpu/drm/drm_drv.c                 |    3 +-
 drivers/gpu/drm/i915/i915_gemfs.c         |    2 +
 drivers/infiniband/hw/qib/qib_fs.c        |    7 +++--
 drivers/misc/cxl/api.c                    |    3 +-
 drivers/misc/ibmasm/ibmasmfs.c            |   11 +++++--
 drivers/mtd/mtdsuper.c                    |   26 ++++++++++-------
 drivers/oprofile/oprofilefs.c             |    8 +++--
 drivers/scsi/cxlflash/ocxl_hw.c           |    2 +
 drivers/usb/gadget/function/f_fs.c        |    7 +++--
 drivers/usb/gadget/legacy/inode.c         |    7 +++--
 drivers/virtio/virtio_balloon.c           |    2 +
 drivers/xen/xenfs/super.c                 |    7 +++--
 fs/9p/vfs_super.c                         |    2 +
 fs/adfs/super.c                           |    9 +++---
 fs/affs/super.c                           |   13 +++++----
 fs/afs/mntpt.c                            |    3 +-
 fs/afs/super.c                            |    6 +++-
 fs/aio.c                                  |    3 +-
 fs/anon_inodes.c                          |    3 +-
 fs/autofs/autofs_i.h                      |    2 +
 fs/autofs/init.c                          |    4 +--
 fs/autofs/inode.c                         |    3 +-
 fs/befs/linuxvfs.c                        |   11 ++++---
 fs/bfs/inode.c                            |    8 +++--
 fs/binfmt_misc.c                          |    7 +++--
 fs/block_dev.c                            |    2 +
 fs/btrfs/super.c                          |   30 ++++++++++++--------
 fs/btrfs/tests/btrfs-tests.c              |    2 +
 fs/ceph/super.c                           |    3 +-
 fs/cifs/cifs_dfs_ref.c                    |    3 +-
 fs/cifs/cifsfs.c                          |   18 +++++++-----
 fs/coda/inode.c                           |   11 +++++--
 fs/configfs/mount.c                       |    7 +++--
 fs/cramfs/inode.c                         |   17 +++++++----
 fs/debugfs/inode.c                        |   14 +++++----
 fs/devpts/inode.c                         |   10 ++++---
 fs/ecryptfs/main.c                        |    2 +
 fs/efivarfs/super.c                       |    9 ++++--
 fs/efs/super.c                            |   14 ++++++---
 fs/exofs/super.c                          |    7 +++--
 fs/ext2/super.c                           |   14 ++++++---
 fs/ext4/super.c                           |   16 +++++++----
 fs/f2fs/super.c                           |   11 +++++--
 fs/fat/inode.c                            |    3 +-
 fs/fat/namei_msdos.c                      |    8 +++--
 fs/fat/namei_vfat.c                       |    8 +++--
 fs/freevxfs/vxfs_super.c                  |   12 +++++---
 fs/fuse/control.c                         |    9 ++++--
 fs/fuse/inode.c                           |   16 +++++++----
 fs/gfs2/ops_fstype.c                      |    6 +++-
 fs/gfs2/super.c                           |    4 ++-
 fs/hfs/super.c                            |   12 +++++---
 fs/hfsplus/super.c                        |   12 +++++---
 fs/hostfs/hostfs_kern.c                   |    7 +++--
 fs/hpfs/super.c                           |   11 +++++--
 fs/hugetlbfs/inode.c                      |   13 +++++----
 fs/internal.h                             |    4 +--
 fs/isofs/inode.c                          |   11 +++++--
 fs/jffs2/super.c                          |   10 ++++---
 fs/jfs/super.c                            |   11 +++++--
 fs/kernfs/mount.c                         |    3 +-
 fs/libfs.c                                |    2 +
 fs/minix/inode.c                          |   14 ++++++---
 fs/namespace.c                            |   38 ++++++++++++++-----------
 fs/nfs/internal.h                         |    4 +--
 fs/nfs/namespace.c                        |    3 +-
 fs/nfs/nfs4namespace.c                    |    3 +-
 fs/nfs/nfs4super.c                        |   27 ++++++++++--------
 fs/nfs/super.c                            |   22 ++++++++-------
 fs/nfsd/nfsctl.c                          |    8 +++--
 fs/nilfs2/super.c                         |   10 ++++---
 fs/nsfs.c                                 |    3 +-
 fs/ntfs/super.c                           |   13 ++++++---
 fs/ocfs2/dlmfs/dlmfs.c                    |    5 ++-
 fs/ocfs2/super.c                          |   14 ++++++---
 fs/omfs/inode.c                           |    9 ++++--
 fs/openpromfs/inode.c                     |   11 +++++--
 fs/orangefs/orangefs-kernel.h             |    2 +
 fs/orangefs/super.c                       |    5 ++-
 fs/overlayfs/super.c                      |   11 +++++--
 fs/pipe.c                                 |    3 +-
 fs/proc/inode.c                           |    3 +-
 fs/proc/internal.h                        |    4 +--
 fs/proc/root.c                            |   11 +++++--
 fs/pstore/inode.c                         |   10 ++++---
 fs/qnx4/inode.c                           |   14 ++++++---
 fs/qnx6/inode.c                           |   14 ++++++---
 fs/ramfs/inode.c                          |    6 ++--
 fs/reiserfs/super.c                       |   14 ++++++---
 fs/romfs/super.c                          |   13 +++++----
 fs/squashfs/super.c                       |   12 +++++---
 fs/super.c                                |   44 ++++++++++++++++-------------
 fs/sysfs/mount.c                          |    2 +
 fs/sysv/inode.c                           |    3 +-
 fs/sysv/super.c                           |   16 +++++++----
 fs/tracefs/inode.c                        |   10 ++++---
 fs/ubifs/super.c                          |    5 ++-
 fs/udf/super.c                            |   16 +++++++----
 fs/ufs/super.c                            |   11 +++++--
 fs/xfs/xfs_super.c                        |   10 +++++--
 include/linux/debugfs.h                   |    8 +++--
 include/linux/fs.h                        |   29 ++++++++++---------
 include/linux/lsm_hooks.h                 |   13 ++++++---
 include/linux/mount.h                     |    5 ++-
 include/linux/mtd/super.h                 |    4 +--
 include/linux/ramfs.h                     |    4 +--
 include/linux/security.h                  |   17 ++++++-----
 include/linux/shmem_fs.h                  |    3 +-
 init/do_mounts.c                          |    4 +--
 ipc/mqueue.c                              |    9 +++---
 kernel/bpf/inode.c                        |    7 +++--
 kernel/cgroup/cgroup.c                    |    2 +
 kernel/cgroup/cpuset.c                    |    7 +++--
 kernel/trace/trace.c                      |    7 +++--
 mm/shmem.c                                |   10 ++++---
 mm/zsmalloc.c                             |    3 +-
 net/socket.c                              |    3 +-
 net/sunrpc/rpc_pipe.c                     |    7 +++--
 security/apparmor/apparmorfs.c            |    8 +++--
 security/apparmor/lsm.c                   |    3 +-
 security/inode.c                          |    7 +++--
 security/security.c                       |   18 +++++++-----
 security/selinux/hooks.c                  |   11 ++++---
 security/selinux/selinuxfs.c              |    8 +++--
 security/smack/smack_lsm.c                |    6 +++-
 security/smack/smackfs.c                  |    9 ++++--
 security/tomoyo/tomoyo.c                  |    4 ++-
 133 files changed, 708 insertions(+), 468 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 3b38c717008a..ae9a3ae2ba45 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -611,7 +611,8 @@ pfm_unprotect_ctx_ctxsw(pfm_context_t *x, unsigned long f)
 static const struct dentry_operations pfmfs_dentry_operations;
 
 static struct dentry *
-pfmfs_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data)
+pfmfs_mount(struct file_system_type *fs_type, int flags, const char *dev_name,
+	    void *data, size_t data_size)
 {
 	return mount_pseudo(fs_type, "pfm:", NULL, &pfmfs_dentry_operations,
 			PFMFS_MAGIC);
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index db329d4bf1c3..90d55b47c471 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -734,7 +734,7 @@ spufs_create_root(struct super_block *sb, void *data)
 }
 
 static int
-spufs_fill_super(struct super_block *sb, void *data, int silent)
+spufs_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
 {
 	struct spufs_sb_info *info;
 	static const struct super_operations s_ops = {
@@ -761,9 +761,9 @@ spufs_fill_super(struct super_block *sb, void *data, int silent)
 
 static struct dentry *
 spufs_mount(struct file_system_type *fstype, int flags,
-		const char *name, void *data)
+		const char *name, void *data, size_t data_size)
 {
-	return mount_single(fstype, flags, data, spufs_fill_super);
+	return mount_single(fstype, flags, data, data_size, spufs_fill_super);
 }
 
 static struct file_system_type spufs_type = {
diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c
index 06b513d192b9..7aa4227d59d4 100644
--- a/arch/s390/hypfs/inode.c
+++ b/arch/s390/hypfs/inode.c
@@ -266,7 +266,8 @@ static int hypfs_show_options(struct seq_file *s, struct dentry *root)
 	return 0;
 }
 
-static int hypfs_fill_super(struct super_block *sb, void *data, int silent)
+static int hypfs_fill_super(struct super_block *sb,
+			    void *data, size_t data_size, int silent)
 {
 	struct inode *root_inode;
 	struct dentry *root_dentry;
@@ -309,9 +310,9 @@ static int hypfs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *hypfs_mount(struct file_system_type *fst, int flags,
-			const char *devname, void *data)
+			const char *devname, void *data, size_t data_size)
 {
-	return mount_single(fst, flags, data, hypfs_fill_super);
+	return mount_single(fst, flags, data, data_size, hypfs_fill_super);
 }
 
 static void hypfs_kill_super(struct super_block *sb)
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index fa668f967062..c74365b78253 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1238,7 +1238,7 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn,
 
 static struct dentry *rdt_mount(struct file_system_type *fs_type,
 				int flags, const char *unused_dev_name,
-				void *data)
+				void *data, size_t data_size)
 {
 	struct rdt_domain *dom;
 	struct rdt_resource *r;
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 79a235184fb5..1b87a1e03b45 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -57,12 +57,12 @@ static int __init mount_param(char *str)
 __setup("devtmpfs.mount=", mount_param);
 
 static struct dentry *dev_mount(struct file_system_type *fs_type, int flags,
-		      const char *dev_name, void *data)
+		      const char *dev_name, void *data, size_t data_size)
 {
 #ifdef CONFIG_TMPFS
-	return mount_single(fs_type, flags, data, shmem_fill_super);
+	return mount_single(fs_type, flags, data, data_size, shmem_fill_super);
 #else
-	return mount_single(fs_type, flags, data, ramfs_fill_super);
+	return mount_single(fs_type, flags, data, data_size, ramfs_fill_super);
 #endif
 }
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 903d9c473749..2f7cb1892576 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -411,7 +411,7 @@ static const struct super_operations dax_sops = {
 };
 
 static struct dentry *dax_mount(struct file_system_type *fs_type,
-		int flags, const char *dev_name, void *data)
+		int flags, const char *dev_name, void *data, size_t data_size)
 {
 	return mount_pseudo(fs_type, "dax:", &dax_sops, NULL, DAXFS_MAGIC);
 }
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index b553a6f2ff0e..6b2087731c55 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -417,7 +417,8 @@ static const struct super_operations drm_fs_sops = {
 };
 
 static struct dentry *drm_fs_mount(struct file_system_type *fs_type, int flags,
-				   const char *dev_name, void *data)
+				   const char *dev_name,
+				   void *data, size_t data_size)
 {
 	return mount_pseudo(fs_type,
 			    "drm:",
diff --git a/drivers/gpu/drm/i915/i915_gemfs.c b/drivers/gpu/drm/i915/i915_gemfs.c
index 888b7d3f04c3..bf0a355e8f46 100644
--- a/drivers/gpu/drm/i915/i915_gemfs.c
+++ b/drivers/gpu/drm/i915/i915_gemfs.c
@@ -57,7 +57,7 @@ int i915_gemfs_init(struct drm_i915_private *i915)
 		int flags = 0;
 		int err;
 
-		err = sb->s_op->remount_fs(sb, &flags, options);
+		err = sb->s_op->remount_fs(sb, &flags, options, sizeof(options));
 		if (err) {
 			kern_unmount(gemfs);
 			return err;
diff --git a/drivers/infiniband/hw/qib/qib_fs.c b/drivers/infiniband/hw/qib/qib_fs.c
index 1d940a2885c9..28648ef1f4cc 100644
--- a/drivers/infiniband/hw/qib/qib_fs.c
+++ b/drivers/infiniband/hw/qib/qib_fs.c
@@ -506,7 +506,8 @@ static int remove_device_files(struct super_block *sb,
  * after device init.  The direct add_cntr_files() call handles adding
  * them from the init code, when the fs is already mounted.
  */
-static int qibfs_fill_super(struct super_block *sb, void *data, int silent)
+static int qibfs_fill_super(struct super_block *sb,
+			    void *data, size_t data_size, int silent)
 {
 	struct qib_devdata *dd, *tmp;
 	unsigned long flags;
@@ -541,11 +542,11 @@ static int qibfs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *qibfs_mount(struct file_system_type *fs_type, int flags,
-			const char *dev_name, void *data)
+			const char *dev_name, void *data, size_t data_size)
 {
 	struct dentry *ret;
 
-	ret = mount_single(fs_type, flags, data, qibfs_fill_super);
+	ret = mount_single(fs_type, flags, data, data_size, qibfs_fill_super);
 	if (!IS_ERR(ret))
 		qib_super = ret->d_sb;
 	return ret;
diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 753b1a698fc4..aba85a59fde7 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -43,7 +43,8 @@ static const struct dentry_operations cxl_fs_dops = {
 };
 
 static struct dentry *cxl_fs_mount(struct file_system_type *fs_type, int flags,
-				const char *dev_name, void *data)
+				   const char *dev_name,
+				   void *data, size_t data_len)
 {
 	return mount_pseudo(fs_type, "cxl:", NULL, &cxl_fs_dops,
 			CXL_PSEUDO_FS_MAGIC);
diff --git a/drivers/misc/ibmasm/ibmasmfs.c b/drivers/misc/ibmasm/ibmasmfs.c
index e05c3245930a..d0378eec6bca 100644
--- a/drivers/misc/ibmasm/ibmasmfs.c
+++ b/drivers/misc/ibmasm/ibmasmfs.c
@@ -88,13 +88,15 @@ static LIST_HEAD(service_processors);
 
 static struct inode *ibmasmfs_make_inode(struct super_block *sb, int mode);
 static void ibmasmfs_create_files (struct super_block *sb);
-static int ibmasmfs_fill_super (struct super_block *sb, void *data, int silent);
+static int ibmasmfs_fill_super (struct super_block *sb, void *data, size_t data_size,
+				int silent);
 
 
 static struct dentry *ibmasmfs_mount(struct file_system_type *fst,
-			int flags, const char *name, void *data)
+				     int flags, const char *name,
+				     void *data, size_t data_size)
 {
-	return mount_single(fst, flags, data, ibmasmfs_fill_super);
+	return mount_single(fst, flags, data, data_size, ibmasmfs_fill_super);
 }
 
 static const struct super_operations ibmasmfs_s_ops = {
@@ -112,7 +114,8 @@ static struct file_system_type ibmasmfs_type = {
 };
 MODULE_ALIAS_FS("ibmasmfs");
 
-static int ibmasmfs_fill_super (struct super_block *sb, void *data, int silent)
+static int ibmasmfs_fill_super (struct super_block *sb,
+				void *data, size_t data_size, int silent)
 {
 	struct inode *root;
 
diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index d58a61c09304..13706ea5cf50 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -61,9 +61,9 @@ static int get_sb_mtd_set(struct super_block *sb, void *_mtd)
  * get a superblock on an MTD-backed filesystem
  */
 static struct dentry *mount_mtd_aux(struct file_system_type *fs_type, int flags,
-			  const char *dev_name, void *data,
+			  const char *dev_name, void *data, size_t data_size,
 			  struct mtd_info *mtd,
-			  int (*fill_super)(struct super_block *, void *, int))
+			  int (*fill_super)(struct super_block *, void *, size_t, int))
 {
 	struct super_block *sb;
 	int ret;
@@ -79,7 +79,7 @@ static struct dentry *mount_mtd_aux(struct file_system_type *fs_type, int flags,
 	pr_debug("MTDSB: New superblock for device %d (\"%s\")\n",
 	      mtd->index, mtd->name);
 
-	ret = fill_super(sb, data, flags & SB_SILENT ? 1 : 0);
+	ret = fill_super(sb, data, data_size, flags & SB_SILENT ? 1 : 0);
 	if (ret < 0) {
 		deactivate_locked_super(sb);
 		return ERR_PTR(ret);
@@ -105,8 +105,10 @@ static struct dentry *mount_mtd_aux(struct file_system_type *fs_type, int flags,
  * get a superblock on an MTD-backed filesystem by MTD device number
  */
 static struct dentry *mount_mtd_nr(struct file_system_type *fs_type, int flags,
-			 const char *dev_name, void *data, int mtdnr,
-			 int (*fill_super)(struct super_block *, void *, int))
+				   const char *dev_name,
+				   void *data, size_t data_size, int mtdnr,
+				   int (*fill_super)(struct super_block *, void *,
+						     size_t, int))
 {
 	struct mtd_info *mtd;
 
@@ -116,15 +118,16 @@ static struct dentry *mount_mtd_nr(struct file_system_type *fs_type, int flags,
 		return ERR_CAST(mtd);
 	}
 
-	return mount_mtd_aux(fs_type, flags, dev_name, data, mtd, fill_super);
+	return mount_mtd_aux(fs_type, flags, dev_name, data, data_size, mtd,
+			     fill_super);
 }
 
 /*
  * set up an MTD-based superblock
  */
 struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
-	       const char *dev_name, void *data,
-	       int (*fill_super)(struct super_block *, void *, int))
+			 const char *dev_name, void *data, size_t data_size,
+			 int (*fill_super)(struct super_block *, void *, size_t, int))
 {
 #ifdef CONFIG_BLOCK
 	struct block_device *bdev;
@@ -153,7 +156,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 			if (!IS_ERR(mtd))
 				return mount_mtd_aux(
 					fs_type, flags,
-					dev_name, data, mtd,
+					dev_name, data, data_size, mtd,
 					fill_super);
 
 			printk(KERN_NOTICE "MTD:"
@@ -170,7 +173,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 				pr_debug("MTDSB: mtd%%d, mtdnr %d\n",
 				      mtdnr);
 				return mount_mtd_nr(fs_type, flags,
-						     dev_name, data,
+						    dev_name, data, data_size,
 						     mtdnr, fill_super);
 			}
 		}
@@ -197,7 +200,8 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 	if (major != MTD_BLOCK_MAJOR)
 		goto not_an_MTD_device;
 
-	return mount_mtd_nr(fs_type, flags, dev_name, data, mtdnr, fill_super);
+	return mount_mtd_nr(fs_type, flags, dev_name, data, data_size, mtdnr,
+			    fill_super);
 
 not_an_MTD_device:
 #endif /* CONFIG_BLOCK */
diff --git a/drivers/oprofile/oprofilefs.c b/drivers/oprofile/oprofilefs.c
index 4ea08979312c..c721d7fd7c7e 100644
--- a/drivers/oprofile/oprofilefs.c
+++ b/drivers/oprofile/oprofilefs.c
@@ -238,7 +238,8 @@ struct dentry *oprofilefs_mkdir(struct dentry *parent, char const *name)
 }
 
 
-static int oprofilefs_fill_super(struct super_block *sb, void *data, int silent)
+static int oprofilefs_fill_super(struct super_block *sb,
+				 void *data, size_t data_size, int silent)
 {
 	struct inode *root_inode;
 
@@ -265,9 +266,10 @@ static int oprofilefs_fill_super(struct super_block *sb, void *data, int silent)
 
 
 static struct dentry *oprofilefs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, oprofilefs_fill_super);
+	return mount_single(fs_type, flags, data, data_size,
+			    oprofilefs_fill_super);
 }
 
 
diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
index 0a95b5f25380..52f486e8dff7 100644
--- a/drivers/scsi/cxlflash/ocxl_hw.c
+++ b/drivers/scsi/cxlflash/ocxl_hw.c
@@ -50,7 +50,7 @@ static const struct dentry_operations ocxlflash_fs_dops = {
  */
 static struct dentry *ocxlflash_fs_mount(struct file_system_type *fs_type,
 					 int flags, const char *dev_name,
-					 void *data)
+					 void *data, size_t data_size)
 {
 	return mount_pseudo(fs_type, "ocxlflash:", NULL, &ocxlflash_fs_dops,
 			    OCXLFLASH_FS_MAGIC);
diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index dce9d12c7981..c0b9f7ff516f 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -1366,7 +1366,8 @@ struct ffs_sb_fill_data {
 	struct ffs_data *ffs_data;
 };
 
-static int ffs_sb_fill(struct super_block *sb, void *_data, int silent)
+static int ffs_sb_fill(struct super_block *sb, void *_data, size_t data_size,
+		       int silent)
 {
 	struct ffs_sb_fill_data *data = _data;
 	struct inode	*inode;
@@ -1494,7 +1495,7 @@ static int ffs_fs_parse_opts(struct ffs_sb_fill_data *data, char *opts)
 
 static struct dentry *
 ffs_fs_mount(struct file_system_type *t, int flags,
-	      const char *dev_name, void *opts)
+	     const char *dev_name, void *opts, size_t data_size)
 {
 	struct ffs_sb_fill_data data = {
 		.perms = {
@@ -1536,7 +1537,7 @@ ffs_fs_mount(struct file_system_type *t, int flags,
 	ffs->private_data = ffs_dev;
 	data.ffs_data = ffs;
 
-	rv = mount_nodev(t, flags, &data, ffs_sb_fill);
+	rv = mount_nodev(t, flags, &data, sizeof(data), ffs_sb_fill);
 	if (IS_ERR(rv) && data.ffs_data) {
 		ffs_release_dev(data.ffs_data);
 		ffs_data_put(data.ffs_data);
diff --git a/drivers/usb/gadget/legacy/inode.c b/drivers/usb/gadget/legacy/inode.c
index 37ca0e669bd8..286a982b43a3 100644
--- a/drivers/usb/gadget/legacy/inode.c
+++ b/drivers/usb/gadget/legacy/inode.c
@@ -1990,7 +1990,8 @@ static const struct super_operations gadget_fs_operations = {
 };
 
 static int
-gadgetfs_fill_super (struct super_block *sb, void *opts, int silent)
+gadgetfs_fill_super (struct super_block *sb, void *opts, size_t data_size,
+		     int silent)
 {
 	struct inode	*inode;
 	struct dev_data	*dev;
@@ -2046,9 +2047,9 @@ gadgetfs_fill_super (struct super_block *sb, void *opts, int silent)
 /* "mount -t gadgetfs path /dev/gadget" ends up here */
 static struct dentry *
 gadgetfs_mount (struct file_system_type *t, int flags,
-		const char *path, void *opts)
+		const char *path, void *opts, size_t data_size)
 {
-	return mount_single (t, flags, opts, gadgetfs_fill_super);
+	return mount_single (t, flags, opts, data_size, gadgetfs_fill_super);
 }
 
 static void
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6b237e3f4983..49f4a03ec162 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -526,7 +526,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 }
 
 static struct dentry *balloon_mount(struct file_system_type *fs_type,
-		int flags, const char *dev_name, void *data)
+		int flags, const char *dev_name, void *data, size_t data_size)
 {
 	static const struct dentry_operations ops = {
 		.d_dname = simple_dname,
diff --git a/drivers/xen/xenfs/super.c b/drivers/xen/xenfs/super.c
index 71ddfb4cf61c..fc4e6e43b66f 100644
--- a/drivers/xen/xenfs/super.c
+++ b/drivers/xen/xenfs/super.c
@@ -42,7 +42,8 @@ static const struct file_operations capabilities_file_ops = {
 	.llseek = default_llseek,
 };
 
-static int xenfs_fill_super(struct super_block *sb, void *data, int silent)
+static int xenfs_fill_super(struct super_block *sb,
+			    void *data, size_t data_size, int silent)
 {
 	static const struct tree_descr xenfs_files[] = {
 		[2] = { "xenbus", &xen_xenbus_fops, S_IRUSR|S_IWUSR },
@@ -69,9 +70,9 @@ static int xenfs_fill_super(struct super_block *sb, void *data, int silent)
 
 static struct dentry *xenfs_mount(struct file_system_type *fs_type,
 				  int flags, const char *dev_name,
-				  void *data)
+				  void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, xenfs_fill_super);
+	return mount_single(fs_type, flags, data, data_size, xenfs_fill_super);
 }
 
 static struct file_system_type xenfs_type = {
diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index 48ce50484e80..7def28abd3a5 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -116,7 +116,7 @@ v9fs_fill_super(struct super_block *sb, struct v9fs_session_info *v9ses,
  */
 
 static struct dentry *v9fs_mount(struct file_system_type *fs_type, int flags,
-		       const char *dev_name, void *data)
+		       const char *dev_name, void *data, size_t data_size)
 {
 	struct super_block *sb = NULL;
 	struct inode *inode = NULL;
diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index 71fa525d63a0..bf6e2f11fcae 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -210,7 +210,7 @@ static int parse_options(struct super_block *sb, char *options)
 	return 0;
 }
 
-static int adfs_remount(struct super_block *sb, int *flags, char *data)
+static int adfs_remount(struct super_block *sb, int *flags, char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	*flags |= SB_NODIRATIME;
@@ -362,7 +362,8 @@ static inline unsigned long adfs_discsize(struct adfs_discrecord *dr, int block_
 	return discsize;
 }
 
-static int adfs_fill_super(struct super_block *sb, void *data, int silent)
+static int adfs_fill_super(struct super_block *sb, void *data, size_t data_size,
+			   int silent)
 {
 	struct adfs_discrecord *dr;
 	struct buffer_head *bh;
@@ -522,9 +523,9 @@ static int adfs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *adfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, adfs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size, adfs_fill_super);
 }
 
 static struct file_system_type adfs_fs_type = {
diff --git a/fs/affs/super.c b/fs/affs/super.c
index d1ad11a8a4a5..69dd5da6d88b 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -26,7 +26,8 @@
 
 static int affs_statfs(struct dentry *dentry, struct kstatfs *buf);
 static int affs_show_options(struct seq_file *m, struct dentry *root);
-static int affs_remount (struct super_block *sb, int *flags, char *data);
+static int affs_remount (struct super_block *sb, int *flags,
+			 char *data, size_t data_size);
 
 static void
 affs_commit_super(struct super_block *sb, int wait)
@@ -335,7 +336,8 @@ static int affs_show_options(struct seq_file *m, struct dentry *root)
  * hopefully have the guts to do so. Until then: sorry for the mess.
  */
 
-static int affs_fill_super(struct super_block *sb, void *data, int silent)
+static int affs_fill_super(struct super_block *sb, void *data, size_t data_size,
+			   int silent)
 {
 	struct affs_sb_info	*sbi;
 	struct buffer_head	*root_bh = NULL;
@@ -550,7 +552,7 @@ static int affs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static int
-affs_remount(struct super_block *sb, int *flags, char *data)
+affs_remount(struct super_block *sb, int *flags, char *data, size_t data_size)
 {
 	struct affs_sb_info	*sbi = AFFS_SB(sb);
 	int			 blocksize;
@@ -633,9 +635,10 @@ affs_statfs(struct dentry *dentry, struct kstatfs *buf)
 }
 
 static struct dentry *affs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, affs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  affs_fill_super);
 }
 
 static void affs_kill_sb(struct super_block *sb)
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index 99fd13500a97..c45aa1776591 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -152,7 +152,8 @@ static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
 
 	/* try and do the mount */
 	_debug("--- attempting mount %s -o %s ---", devname, options);
-	mnt = vfs_submount(mntpt, &afs_fs_type, devname, options);
+	mnt = vfs_submount(mntpt, &afs_fs_type, devname,
+			   options, strlen(options) + 1);
 	_debug("--- mount result %p ---", mnt);
 
 	free_page((unsigned long) devname);
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 4d3e274207fb..b85f5e993539 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -31,7 +31,8 @@
 
 static void afs_i_init_once(void *foo);
 static struct dentry *afs_mount(struct file_system_type *fs_type,
-		      int flags, const char *dev_name, void *data);
+				int flags, const char *dev_name,
+				void *data, size_t data_size);
 static void afs_kill_super(struct super_block *sb);
 static struct inode *afs_alloc_inode(struct super_block *sb);
 static void afs_destroy_inode(struct inode *inode);
@@ -490,7 +491,8 @@ static void afs_kill_super(struct super_block *sb)
  * get an AFS superblock
  */
 static struct dentry *afs_mount(struct file_system_type *fs_type,
-				int flags, const char *dev_name, void *options)
+				int flags, const char *dev_name,
+				void *options, size_t data_size)
 {
 	struct afs_mount_params params;
 	struct super_block *sb;
diff --git a/fs/aio.c b/fs/aio.c
index e1d20124ec0e..ac06e0b81bec 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -245,7 +245,8 @@ static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 }
 
 static struct dentry *aio_mount(struct file_system_type *fs_type,
-				int flags, const char *dev_name, void *data)
+				int flags, const char *dev_name,
+				void *data, size_t data_size)
 {
 	static const struct dentry_operations ops = {
 		.d_dname	= simple_dname,
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3168ee4e77f4..13c06a7e0b85 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -39,7 +39,8 @@ static const struct dentry_operations anon_inodefs_dentry_operations = {
 };
 
 static struct dentry *anon_inodefs_mount(struct file_system_type *fs_type,
-				int flags, const char *dev_name, void *data)
+					 int flags, const char *dev_name,
+					 void *data, size_t data_size)
 {
 	return mount_pseudo(fs_type, "anon_inode:", NULL,
 			&anon_inodefs_dentry_operations, ANON_INODE_FS_MAGIC);
diff --git a/fs/autofs/autofs_i.h b/fs/autofs/autofs_i.h
index 9400a9f6318a..31303d4b6af9 100644
--- a/fs/autofs/autofs_i.h
+++ b/fs/autofs/autofs_i.h
@@ -201,7 +201,7 @@ static inline void managed_dentry_clear_managed(struct dentry *dentry)
 
 /* Initializing function */
 
-int autofs_fill_super(struct super_block *, void *, int);
+int autofs_fill_super(struct super_block *, void *, size_t, int);
 struct autofs_info *autofs_new_ino(struct autofs_sb_info *);
 void autofs_clean_ino(struct autofs_info *);
 
diff --git a/fs/autofs/init.c b/fs/autofs/init.c
index cc9447e1903f..c2fec5734ed4 100644
--- a/fs/autofs/init.c
+++ b/fs/autofs/init.c
@@ -11,9 +11,9 @@
 #include "autofs_i.h"
 
 static struct dentry *autofs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_nodev(fs_type, flags, data, autofs_fill_super);
+	return mount_nodev(fs_type, flags, data, data_size, autofs_fill_super);
 }
 
 static struct file_system_type autofs_fs_type = {
diff --git a/fs/autofs/inode.c b/fs/autofs/inode.c
index b51980fc274e..810ae26305cd 100644
--- a/fs/autofs/inode.c
+++ b/fs/autofs/inode.c
@@ -202,7 +202,8 @@ static int parse_options(char *options, int *pipefd, kuid_t *uid, kgid_t *gid,
 	return (*pipefd < 0);
 }
 
-int autofs_fill_super(struct super_block *s, void *data, int silent)
+int autofs_fill_super(struct super_block *s, void *data, size_t data_size,
+		      int silent)
 {
 	struct inode *root_inode;
 	struct dentry *root;
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 4700b4534439..31f760ea2494 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -52,7 +52,7 @@ static int befs_utf2nls(struct super_block *sb, const char *in, int in_len,
 static int befs_nls2utf(struct super_block *sb, const char *in, int in_len,
 			char **out, int *out_len);
 static void befs_put_super(struct super_block *);
-static int befs_remount(struct super_block *, int *, char *);
+static int befs_remount(struct super_block *, int *, char *, size_t);
 static int befs_statfs(struct dentry *, struct kstatfs *);
 static int befs_show_options(struct seq_file *, struct dentry *);
 static int parse_options(char *, struct befs_mount_options *);
@@ -810,7 +810,7 @@ befs_put_super(struct super_block *sb)
  * Load a set of NLS translations if needed.
  */
 static int
-befs_fill_super(struct super_block *sb, void *data, int silent)
+befs_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
 {
 	struct buffer_head *bh;
 	struct befs_sb_info *befs_sb;
@@ -942,7 +942,7 @@ befs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static int
-befs_remount(struct super_block *sb, int *flags, char *data)
+befs_remount(struct super_block *sb, int *flags, char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	if (!(*flags & SB_RDONLY))
@@ -976,9 +976,10 @@ befs_statfs(struct dentry *dentry, struct kstatfs *buf)
 
 static struct dentry *
 befs_mount(struct file_system_type *fs_type, int flags, const char *dev_name,
-	    void *data)
+	   void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, befs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  befs_fill_super);
 }
 
 static struct file_system_type befs_fs_type = {
diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index 9a69392f1fb3..6e76e4e762e8 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -317,7 +317,8 @@ void bfs_dump_imap(const char *prefix, struct super_block *s)
 #endif
 }
 
-static int bfs_fill_super(struct super_block *s, void *data, int silent)
+static int bfs_fill_super(struct super_block *s, void *data, size_t data_size,
+			  int silent)
 {
 	struct buffer_head *bh, *sbh;
 	struct bfs_super_block *bfs_sb;
@@ -460,9 +461,10 @@ static int bfs_fill_super(struct super_block *s, void *data, int silent)
 }
 
 static struct dentry *bfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, bfs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  bfs_fill_super);
 }
 
 static struct file_system_type bfs_fs_type = {
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 4b5fff31ef27..2690e2bf634b 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -820,7 +820,8 @@ static const struct super_operations s_ops = {
 	.evict_inode	= bm_evict_inode,
 };
 
-static int bm_fill_super(struct super_block *sb, void *data, int silent)
+static int bm_fill_super(struct super_block *sb, void *data, size_t data_size,
+			 int silent)
 {
 	int err;
 	static const struct tree_descr bm_files[] = {
@@ -836,9 +837,9 @@ static int bm_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *bm_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, bm_fill_super);
+	return mount_single(fs_type, flags, data, data_size, bm_fill_super);
 }
 
 static struct linux_binfmt misc_format = {
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aaeb39a..1322adb69c8c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -786,7 +786,7 @@ static const struct super_operations bdev_sops = {
 };
 
 static struct dentry *bd_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
 	struct dentry *dent;
 	dent = mount_pseudo(fs_type, "bdev:", &bdev_sops, NULL, BDEVFS_MAGIC);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 81107ad49f3a..ca866463128d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -64,7 +64,8 @@ static const struct super_operations btrfs_super_ops;
 static struct file_system_type btrfs_fs_type;
 static struct file_system_type btrfs_root_fs_type;
 
-static int btrfs_remount(struct super_block *sb, int *flags, char *data);
+static int btrfs_remount(struct super_block *sb, int *flags,
+			 char *data, size_t data_size);
 
 const char *btrfs_decode_error(int errno)
 {
@@ -1456,7 +1457,7 @@ static struct dentry *mount_subvol(const char *subvol_name, u64 subvol_objectid,
 	return root;
 }
 
-static int parse_security_options(char *orig_opts,
+static int parse_security_options(char *orig_opts, size_t data_size,
 				  struct security_mnt_opts *sec_opts)
 {
 	char *secdata = NULL;
@@ -1465,7 +1466,7 @@ static int parse_security_options(char *orig_opts,
 	secdata = alloc_secdata();
 	if (!secdata)
 		return -ENOMEM;
-	ret = security_sb_copy_data(orig_opts, secdata);
+	ret = security_sb_copy_data(orig_opts, data_size, secdata);
 	if (ret) {
 		free_secdata(secdata);
 		return ret;
@@ -1513,7 +1514,8 @@ static int setup_security_options(struct btrfs_fs_info *fs_info,
  *       for multiple device setup.  Make sure to keep it in sync.
  */
 static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
-		int flags, const char *device_name, void *data)
+				       int flags, const char *device_name,
+				       void *data, size_t data_size)
 {
 	struct block_device *bdev = NULL;
 	struct super_block *s;
@@ -1534,7 +1536,7 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
 
 	security_init_mnt_opts(&new_sec_opts);
 	if (data) {
-		error = parse_security_options(data, &new_sec_opts);
+		error = parse_security_options(data, data_size, &new_sec_opts);
 		if (error)
 			return ERR_PTR(error);
 	}
@@ -1638,7 +1640,7 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
  *      "btrfs subvolume set-default", mount_subvol() is called always.
  */
 static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
-		const char *device_name, void *data)
+		const char *device_name, void *data, size_t data_size)
 {
 	struct vfsmount *mnt_root;
 	struct dentry *root;
@@ -1658,21 +1660,24 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
 	}
 
 	/* mount device's root (/) */
-	mnt_root = vfs_kern_mount(&btrfs_root_fs_type, flags, device_name, data);
+	mnt_root = vfs_kern_mount(&btrfs_root_fs_type, flags, device_name,
+				  data, data_size);
 	if (PTR_ERR_OR_ZERO(mnt_root) == -EBUSY) {
 		if (flags & SB_RDONLY) {
 			mnt_root = vfs_kern_mount(&btrfs_root_fs_type,
-				flags & ~SB_RDONLY, device_name, data);
+				flags & ~SB_RDONLY, device_name,
+				data, data_size);
 		} else {
 			mnt_root = vfs_kern_mount(&btrfs_root_fs_type,
-				flags | SB_RDONLY, device_name, data);
+				flags | SB_RDONLY, device_name,
+				data, data_size);
 			if (IS_ERR(mnt_root)) {
 				root = ERR_CAST(mnt_root);
 				goto out;
 			}
 
 			down_write(&mnt_root->mnt_sb->s_umount);
-			error = btrfs_remount(mnt_root->mnt_sb, &flags, NULL);
+			error = btrfs_remount(mnt_root->mnt_sb, &flags, NULL, 0);
 			up_write(&mnt_root->mnt_sb->s_umount);
 			if (error < 0) {
 				root = ERR_PTR(error);
@@ -1754,7 +1759,8 @@ static inline void btrfs_remount_cleanup(struct btrfs_fs_info *fs_info,
 	clear_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state);
 }
 
-static int btrfs_remount(struct super_block *sb, int *flags, char *data)
+static int btrfs_remount(struct super_block *sb, int *flags,
+			 char *data, size_t data_size)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
 	struct btrfs_root *root = fs_info->tree_root;
@@ -1773,7 +1779,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		struct security_mnt_opts new_sec_opts;
 
 		security_init_mnt_opts(&new_sec_opts);
-		ret = parse_security_options(data, &new_sec_opts);
+		ret = parse_security_options(data, data_size, &new_sec_opts);
 		if (ret)
 			goto restore;
 		ret = setup_security_options(fs_info, sb,
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index db72b3b6209e..6577914800f5 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -24,7 +24,7 @@ static const struct super_operations btrfs_test_super_ops = {
 
 static struct dentry *btrfs_test_mount(struct file_system_type *fs_type,
 				       int flags, const char *dev_name,
-				       void *data)
+				       void *data, size_t data_size)
 {
 	return mount_pseudo(fs_type, "btrfs_test:", &btrfs_test_super_ops,
 			    NULL, BTRFS_TEST_MAGIC);
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 95a3b3ac9b6e..b54bac215d04 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1017,7 +1017,8 @@ static int ceph_setup_bdi(struct super_block *sb, struct ceph_fs_client *fsc)
 }
 
 static struct dentry *ceph_mount(struct file_system_type *fs_type,
-		       int flags, const char *dev_name, void *data)
+				 int flags, const char *dev_name,
+				 void *data, size_t data_size)
 {
 	struct super_block *sb;
 	struct ceph_fs_client *fsc;
diff --git a/fs/cifs/cifs_dfs_ref.c b/fs/cifs/cifs_dfs_ref.c
index 6b61df117fd4..461d052a5d73 100644
--- a/fs/cifs/cifs_dfs_ref.c
+++ b/fs/cifs/cifs_dfs_ref.c
@@ -260,7 +260,8 @@ static struct vfsmount *cifs_dfs_do_refmount(struct dentry *mntpt,
 	if (IS_ERR(mountdata))
 		return (struct vfsmount *)mountdata;
 
-	mnt = vfs_submount(mntpt, &cifs_fs_type, devname, mountdata);
+	mnt = vfs_submount(mntpt, &cifs_fs_type, devname,
+			   mountdata, strlen(mountdata) + 1);
 	kfree(mountdata);
 	kfree(devname);
 	return mnt;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index d5aa7ae917bf..1db37e47f185 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -596,7 +596,8 @@ static int cifs_show_stats(struct seq_file *s, struct dentry *root)
 }
 #endif
 
-static int cifs_remount(struct super_block *sb, int *flags, char *data)
+static int cifs_remount(struct super_block *sb, int *flags,
+			char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	*flags |= SB_NODIRATIME;
@@ -699,7 +700,8 @@ static int cifs_set_super(struct super_block *sb, void *data)
 
 static struct dentry *
 cifs_smb3_do_mount(struct file_system_type *fs_type,
-	      int flags, const char *dev_name, void *data, bool is_smb3)
+		   int flags, const char *dev_name, void *data, size_t data_size,
+		   bool is_smb3)
 {
 	int rc;
 	struct super_block *sb;
@@ -720,7 +722,7 @@ cifs_smb3_do_mount(struct file_system_type *fs_type,
 		goto out_nls;
 	}
 
-	cifs_sb->mountdata = kstrndup(data, PAGE_SIZE, GFP_KERNEL);
+	cifs_sb->mountdata = kstrndup(data, data_size, GFP_KERNEL);
 	if (cifs_sb->mountdata == NULL) {
 		root = ERR_PTR(-ENOMEM);
 		goto out_free;
@@ -792,16 +794,18 @@ cifs_smb3_do_mount(struct file_system_type *fs_type,
 
 static struct dentry *
 smb3_do_mount(struct file_system_type *fs_type,
-	      int flags, const char *dev_name, void *data)
+	      int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return cifs_smb3_do_mount(fs_type, flags, dev_name, data, true);
+	return cifs_smb3_do_mount(fs_type, flags, dev_name, data, data_size,
+				  true);
 }
 
 static struct dentry *
 cifs_do_mount(struct file_system_type *fs_type,
-	      int flags, const char *dev_name, void *data)
+	      int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return cifs_smb3_do_mount(fs_type, flags, dev_name, data, false);
+	return cifs_smb3_do_mount(fs_type, flags, dev_name, data, data_size,
+				  false);
 }
 
 static ssize_t
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 97424cf206c0..dd819c150f70 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -93,7 +93,8 @@ void coda_destroy_inodecache(void)
 	kmem_cache_destroy(coda_inode_cachep);
 }
 
-static int coda_remount(struct super_block *sb, int *flags, char *data)
+static int coda_remount(struct super_block *sb, int *flags,
+			char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	*flags |= SB_NOATIME;
@@ -150,7 +151,8 @@ static int get_device_index(struct coda_mount_data *data)
 	return -1;
 }
 
-static int coda_fill_super(struct super_block *sb, void *data, int silent)
+static int coda_fill_super(struct super_block *sb, void *data, size_t data_size,
+			   int silent)
 {
 	struct inode *root = NULL;
 	struct venus_comm *vc;
@@ -316,9 +318,10 @@ static int coda_statfs(struct dentry *dentry, struct kstatfs *buf)
 /* init_coda: used by filesystems.c to register coda */
 
 static struct dentry *coda_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+				 int flags, const char *dev_name,
+				 void *data, size_t data_size)
 {
-	return mount_nodev(fs_type, flags, data, coda_fill_super);
+	return mount_nodev(fs_type, flags, data, data_size, coda_fill_super);
 }
 
 struct file_system_type coda_fs_type = {
diff --git a/fs/configfs/mount.c b/fs/configfs/mount.c
index cfd91320e869..c9c7c14eb9db 100644
--- a/fs/configfs/mount.c
+++ b/fs/configfs/mount.c
@@ -66,7 +66,8 @@ static struct configfs_dirent configfs_root = {
 	.s_iattr	= NULL,
 };
 
-static int configfs_fill_super(struct super_block *sb, void *data, int silent)
+static int configfs_fill_super(struct super_block *sb,
+			       void *data, size_t data_size, int silent)
 {
 	struct inode *inode;
 	struct dentry *root;
@@ -103,9 +104,9 @@ static int configfs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *configfs_do_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, configfs_fill_super);
+	return mount_single(fs_type, flags, data, data_size, configfs_fill_super);
 }
 
 static struct file_system_type configfs_fs_type = {
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index f408994fc632..77d5cb62e76a 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -502,7 +502,8 @@ static void cramfs_kill_sb(struct super_block *sb)
 	kfree(sbi);
 }
 
-static int cramfs_remount(struct super_block *sb, int *flags, char *data)
+static int cramfs_remount(struct super_block *sb, int *flags,
+			  char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	*flags |= SB_RDONLY;
@@ -603,7 +604,8 @@ static int cramfs_finalize_super(struct super_block *sb,
 	return 0;
 }
 
-static int cramfs_blkdev_fill_super(struct super_block *sb, void *data,
+static int cramfs_blkdev_fill_super(struct super_block *sb,
+				    void *data, size_t data_size,
 				    int silent)
 {
 	struct cramfs_sb_info *sbi;
@@ -625,8 +627,8 @@ static int cramfs_blkdev_fill_super(struct super_block *sb, void *data,
 	return cramfs_finalize_super(sb, &super.root);
 }
 
-static int cramfs_mtd_fill_super(struct super_block *sb, void *data,
-				 int silent)
+static int cramfs_mtd_fill_super(struct super_block *sb,
+				 void *data, size_t data_size, int silent)
 {
 	struct cramfs_sb_info *sbi;
 	struct cramfs_super super;
@@ -948,18 +950,19 @@ static const struct super_operations cramfs_ops = {
 };
 
 static struct dentry *cramfs_mount(struct file_system_type *fs_type, int flags,
-				   const char *dev_name, void *data)
+				   const char *dev_name,
+				   void *data, size_t data_size)
 {
 	struct dentry *ret = ERR_PTR(-ENOPROTOOPT);
 
 	if (IS_ENABLED(CONFIG_CRAMFS_MTD)) {
-		ret = mount_mtd(fs_type, flags, dev_name, data,
+		ret = mount_mtd(fs_type, flags, dev_name, data, data_size,
 				cramfs_mtd_fill_super);
 		if (!IS_ERR(ret))
 			return ret;
 	}
 	if (IS_ENABLED(CONFIG_CRAMFS_BLOCKDEV)) {
-		ret = mount_bdev(fs_type, flags, dev_name, data,
+		ret = mount_bdev(fs_type, flags, dev_name, data, data_size,
 				 cramfs_blkdev_fill_super);
 	}
 	return ret;
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 13b01351dd1c..57ba6d891c85 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -130,7 +130,8 @@ static int debugfs_apply_options(struct super_block *sb)
 	return 0;
 }
 
-static int debugfs_remount(struct super_block *sb, int *flags, char *data)
+static int debugfs_remount(struct super_block *sb, int *flags,
+			   char *data, size_t data_size)
 {
 	int err;
 	struct debugfs_fs_info *fsi = sb->s_fs_info;
@@ -190,7 +191,7 @@ static struct vfsmount *debugfs_automount(struct path *path)
 {
 	debugfs_automount_t f;
 	f = (debugfs_automount_t)path->dentry->d_fsdata;
-	return f(path->dentry, d_inode(path->dentry)->i_private);
+	return f(path->dentry, d_inode(path->dentry)->i_private, 0);
 }
 
 static const struct dentry_operations debugfs_dops = {
@@ -199,7 +200,8 @@ static const struct dentry_operations debugfs_dops = {
 	.d_automount = debugfs_automount,
 };
 
-static int debug_fill_super(struct super_block *sb, void *data, int silent)
+static int debug_fill_super(struct super_block *sb,
+			    void *data, size_t data_size, int silent)
 {
 	static const struct tree_descr debug_files[] = {{""}};
 	struct debugfs_fs_info *fsi;
@@ -235,9 +237,9 @@ static int debug_fill_super(struct super_block *sb, void *data, int silent)
 
 static struct dentry *debug_mount(struct file_system_type *fs_type,
 			int flags, const char *dev_name,
-			void *data)
+			void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, debug_fill_super);
+	return mount_single(fs_type, flags, data, data_size, debug_fill_super);
 }
 
 static struct file_system_type debug_fs_type = {
@@ -539,7 +541,7 @@ EXPORT_SYMBOL_GPL(debugfs_create_dir);
 struct dentry *debugfs_create_automount(const char *name,
 					struct dentry *parent,
 					debugfs_automount_t f,
-					void *data)
+					void *data, size_t data_size)
 {
 	struct dentry *dentry = start_creating(name, parent);
 	struct inode *inode;
diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index e072e955ce33..2dee3d0c8554 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -386,7 +386,8 @@ static void update_ptmx_mode(struct pts_fs_info *fsi)
 	}
 }
 
-static int devpts_remount(struct super_block *sb, int *flags, char *data)
+static int devpts_remount(struct super_block *sb, int *flags,
+			  char *data, size_t data_size)
 {
 	int err;
 	struct pts_fs_info *fsi = DEVPTS_SB(sb);
@@ -447,7 +448,8 @@ static void *new_pts_fs_info(struct super_block *sb)
 }
 
 static int
-devpts_fill_super(struct super_block *s, void *data, int silent)
+devpts_fill_super(struct super_block *s, void *data, size_t data_size,
+		  int silent)
 {
 	struct inode *inode;
 	int error;
@@ -504,9 +506,9 @@ devpts_fill_super(struct super_block *s, void *data, int silent)
  *     instance are independent of the PTYs in other devpts instances.
  */
 static struct dentry *devpts_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_nodev(fs_type, flags, data, devpts_fill_super);
+	return mount_nodev(fs_type, flags, data, data_size, devpts_fill_super);
 }
 
 static void devpts_kill_sb(struct super_block *sb)
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index 025d66a705db..5d029b7e069a 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -488,7 +488,7 @@ static struct file_system_type ecryptfs_fs_type;
  * @raw_data: The options passed into the kernel
  */
 static struct dentry *ecryptfs_mount(struct file_system_type *fs_type, int flags,
-			const char *dev_name, void *raw_data)
+			const char *dev_name, void *raw_data, size_t data_size)
 {
 	struct super_block *s;
 	struct ecryptfs_sb_info *sbi;
diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 5b68e4294faa..db0e417f1c7e 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -191,7 +191,8 @@ static int efivarfs_destroy(struct efivar_entry *entry, void *data)
 	return 0;
 }
 
-static int efivarfs_fill_super(struct super_block *sb, void *data, int silent)
+static int efivarfs_fill_super(struct super_block *sb,
+			       void *data, size_t data_size, int silent)
 {
 	struct inode *inode = NULL;
 	struct dentry *root;
@@ -227,9 +228,11 @@ static int efivarfs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *efivarfs_mount(struct file_system_type *fs_type,
-				    int flags, const char *dev_name, void *data)
+				     int flags, const char *dev_name,
+				     void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, efivarfs_fill_super);
+	return mount_single(fs_type, flags, data, data_size,
+			    efivarfs_fill_super);
 }
 
 static void efivarfs_kill_sb(struct super_block *sb)
diff --git a/fs/efs/super.c b/fs/efs/super.c
index 6ffb7ba1547a..ce85f22651f3 100644
--- a/fs/efs/super.c
+++ b/fs/efs/super.c
@@ -19,12 +19,14 @@
 #include <linux/efs_fs_sb.h>
 
 static int efs_statfs(struct dentry *dentry, struct kstatfs *buf);
-static int efs_fill_super(struct super_block *s, void *d, int silent);
+static int efs_fill_super(struct super_block *s, void *d, size_t data_size,
+			  int silent);
 
 static struct dentry *efs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, efs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  efs_fill_super);
 }
 
 static void efs_kill_sb(struct super_block *s)
@@ -113,7 +115,8 @@ static void destroy_inodecache(void)
 	kmem_cache_destroy(efs_inode_cachep);
 }
 
-static int efs_remount(struct super_block *sb, int *flags, char *data)
+static int efs_remount(struct super_block *sb, int *flags,
+		       char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	*flags |= SB_RDONLY;
@@ -253,7 +256,8 @@ static int efs_validate_super(struct efs_sb_info *sb, struct efs_super *super) {
 	return 0;    
 }
 
-static int efs_fill_super(struct super_block *s, void *d, int silent)
+static int efs_fill_super(struct super_block *s, void *d, size_t data_size,
+			  int silent)
 {
 	struct efs_sb_info *sb;
 	struct buffer_head *bh;
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 41cf2fbee50d..a5f94b7e7b5b 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -704,7 +704,8 @@ static int exofs_read_lookup_dev_table(struct exofs_sb_info *sbi,
 /*
  * Read the superblock from the OSD and fill in the fields
  */
-static int exofs_fill_super(struct super_block *sb, void *data, int silent)
+static int exofs_fill_super(struct super_block *sb, void *data, size_t data_size,
+			    int silent)
 {
 	struct inode *root;
 	struct exofs_mountopt *opts = data;
@@ -860,7 +861,7 @@ static int exofs_fill_super(struct super_block *sb, void *data, int silent)
  */
 static struct dentry *exofs_mount(struct file_system_type *type,
 			  int flags, const char *dev_name,
-			  void *data)
+			  void *data, size_t data_size)
 {
 	struct exofs_mountopt opts;
 	int ret;
@@ -871,7 +872,7 @@ static struct dentry *exofs_mount(struct file_system_type *type,
 
 	if (!opts.dev_name)
 		opts.dev_name = dev_name;
-	return mount_nodev(type, flags, &opts, exofs_fill_super);
+	return mount_nodev(type, flags, &opts, 0, exofs_fill_super);
 }
 
 /*
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 25ab1274090f..8f068563622c 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -39,7 +39,8 @@
 #include "acl.h"
 
 static void ext2_write_super(struct super_block *sb);
-static int ext2_remount (struct super_block * sb, int * flags, char * data);
+static int ext2_remount (struct super_block * sb, int * flags,
+			 char * data, size_t data_size);
 static int ext2_statfs (struct dentry * dentry, struct kstatfs * buf);
 static int ext2_sync_fs(struct super_block *sb, int wait);
 static int ext2_freeze(struct super_block *sb);
@@ -815,7 +816,8 @@ static unsigned long descriptor_loc(struct super_block *sb,
 	return ext2_group_first_block_no(sb, bg) + has_super;
 }
 
-static int ext2_fill_super(struct super_block *sb, void *data, int silent)
+static int ext2_fill_super(struct super_block *sb, void *data, size_t data_size,
+			   int silent)
 {
 	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
 	struct buffer_head * bh;
@@ -1320,7 +1322,8 @@ static void ext2_write_super(struct super_block *sb)
 		ext2_sync_fs(sb, 1);
 }
 
-static int ext2_remount (struct super_block * sb, int * flags, char * data)
+static int ext2_remount (struct super_block * sb, int * flags,
+			 char *data, size_t data_size)
 {
 	struct ext2_sb_info * sbi = EXT2_SB(sb);
 	struct ext2_super_block * es;
@@ -1474,9 +1477,10 @@ static int ext2_statfs (struct dentry * dentry, struct kstatfs * buf)
 }
 
 static struct dentry *ext2_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, ext2_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  ext2_fill_super);
 }
 
 #ifdef CONFIG_QUOTA
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0c4c2201b3aa..d3132c6c1a54 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -70,12 +70,13 @@ static void ext4_mark_recovery_complete(struct super_block *sb,
 static void ext4_clear_journal_err(struct super_block *sb,
 				   struct ext4_super_block *es);
 static int ext4_sync_fs(struct super_block *sb, int wait);
-static int ext4_remount(struct super_block *sb, int *flags, char *data);
+static int ext4_remount(struct super_block *sb, int *flags,
+			char *data, size_t data_size);
 static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf);
 static int ext4_unfreeze(struct super_block *sb);
 static int ext4_freeze(struct super_block *sb);
 static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags,
-		       const char *dev_name, void *data);
+		       const char *dev_name, void *data, size_t data_size);
 static inline int ext2_feature_set_ok(struct super_block *sb);
 static inline int ext3_feature_set_ok(struct super_block *sb);
 static int ext4_feature_set_ok(struct super_block *sb, int readonly);
@@ -3429,7 +3430,8 @@ static void ext4_set_resv_clusters(struct super_block *sb)
 	atomic64_set(&sbi->s_resv_clusters, resv_clusters);
 }
 
-static int ext4_fill_super(struct super_block *sb, void *data, int silent)
+static int ext4_fill_super(struct super_block *sb, void *data, size_t data_size,
+			   int silent)
 {
 	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
 	char *orig_data = kstrdup(data, GFP_KERNEL);
@@ -4999,7 +5001,8 @@ struct ext4_mount_options {
 #endif
 };
 
-static int ext4_remount(struct super_block *sb, int *flags, char *data)
+static int ext4_remount(struct super_block *sb, int *flags,
+			char *data, size_t data_size)
 {
 	struct ext4_super_block *es;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -5766,9 +5769,10 @@ static int ext4_get_next_id(struct super_block *sb, struct kqid *qid)
 #endif
 
 static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags,
-		       const char *dev_name, void *data)
+		       const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, ext4_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  ext4_fill_super);
 }
 
 #if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 54bf50295d1e..3dbf69209fe2 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1387,7 +1387,8 @@ static void default_options(struct f2fs_sb_info *sbi)
 #ifdef CONFIG_QUOTA
 static int f2fs_enable_quotas(struct super_block *sb);
 #endif
-static int f2fs_remount(struct super_block *sb, int *flags, char *data)
+static int f2fs_remount(struct super_block *sb, int *flags,
+			char *data, size_t data_size)
 {
 	struct f2fs_sb_info *sbi = F2FS_SB(sb);
 	struct f2fs_mount_info org_mount_opt;
@@ -2653,7 +2654,8 @@ static void f2fs_tuning_parameters(struct f2fs_sb_info *sbi)
 	}
 }
 
-static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
+static int f2fs_fill_super(struct super_block *sb, void *data, size_t data_size,
+			   int silent)
 {
 	struct f2fs_sb_info *sbi;
 	struct f2fs_super_block *raw_super;
@@ -3081,9 +3083,10 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *f2fs_mount(struct file_system_type *fs_type, int flags,
-			const char *dev_name, void *data)
+			const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, f2fs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  f2fs_fill_super);
 }
 
 static void kill_f2fs_super(struct super_block *sb)
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 065dc919a0ce..01da6de2b052 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -788,7 +788,8 @@ static void __exit fat_destroy_inodecache(void)
 	kmem_cache_destroy(fat_inode_cachep);
 }
 
-static int fat_remount(struct super_block *sb, int *flags, char *data)
+static int fat_remount(struct super_block *sb, int *flags,
+		       char *data, size_t data_size)
 {
 	bool new_rdonly;
 	struct msdos_sb_info *sbi = MSDOS_SB(sb);
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 16a832c37d66..06e1c9d81d65 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -651,16 +651,18 @@ static void setup(struct super_block *sb)
 	sb->s_flags |= SB_NOATIME;
 }
 
-static int msdos_fill_super(struct super_block *sb, void *data, int silent)
+static int msdos_fill_super(struct super_block *sb, void *data, size_t data_size,
+			    int silent)
 {
 	return fat_fill_super(sb, data, silent, 0, setup);
 }
 
 static struct dentry *msdos_mount(struct file_system_type *fs_type,
 			int flags, const char *dev_name,
-			void *data)
+			void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, msdos_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  msdos_fill_super);
 }
 
 static struct file_system_type msdos_fs_type = {
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 9a5469120caa..4ee41d511e05 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -1049,16 +1049,18 @@ static void setup(struct super_block *sb)
 		sb->s_d_op = &vfat_dentry_ops;
 }
 
-static int vfat_fill_super(struct super_block *sb, void *data, int silent)
+static int vfat_fill_super(struct super_block *sb, void *data, size_t data_size,
+			   int silent)
 {
 	return fat_fill_super(sb, data, silent, 1, setup);
 }
 
 static struct dentry *vfat_mount(struct file_system_type *fs_type,
 		       int flags, const char *dev_name,
-		       void *data)
+		       void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, vfat_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  vfat_fill_super);
 }
 
 static struct file_system_type vfat_fs_type = {
diff --git a/fs/freevxfs/vxfs_super.c b/fs/freevxfs/vxfs_super.c
index 48b24bb50d02..1c6cf91f6de9 100644
--- a/fs/freevxfs/vxfs_super.c
+++ b/fs/freevxfs/vxfs_super.c
@@ -113,7 +113,8 @@ vxfs_statfs(struct dentry *dentry, struct kstatfs *bufp)
 	return 0;
 }
 
-static int vxfs_remount(struct super_block *sb, int *flags, char *data)
+static int vxfs_remount(struct super_block *sb, int *flags,
+			char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	*flags |= SB_RDONLY;
@@ -199,6 +200,7 @@ static int vxfs_try_sb_magic(struct super_block *sbp, int silent,
  * vxfs_read_super - read superblock into memory and initialize filesystem
  * @sbp:		VFS superblock (to fill)
  * @dp:			fs private mount data
+ * @data_size:		size of mount data
  * @silent:		do not complain loudly when sth is wrong
  *
  * Description:
@@ -211,7 +213,8 @@ static int vxfs_try_sb_magic(struct super_block *sbp, int silent,
  * Locking:
  *   We are under @sbp->s_lock.
  */
-static int vxfs_fill_super(struct super_block *sbp, void *dp, int silent)
+static int vxfs_fill_super(struct super_block *sbp, void *dp, size_t data_size,
+			   int silent)
 {
 	struct vxfs_sb_info	*infp;
 	struct vxfs_sb		*rsbp;
@@ -312,9 +315,10 @@ static int vxfs_fill_super(struct super_block *sbp, void *dp, int silent)
  * The usual module blurb.
  */
 static struct dentry *vxfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, vxfs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  vxfs_fill_super);
 }
 
 static struct file_system_type vxfs_fs_type = {
diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index 0b694655d988..e09b9cd9c3fc 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -297,7 +297,8 @@ void fuse_ctl_remove_conn(struct fuse_conn *fc)
 	drop_nlink(d_inode(fuse_control_sb->s_root));
 }
 
-static int fuse_ctl_fill_super(struct super_block *sb, void *data, int silent)
+static int fuse_ctl_fill_super(struct super_block *sb,
+			       void *data, size_t data_size, int silent)
 {
 	static const struct tree_descr empty_descr = {""};
 	struct fuse_conn *fc;
@@ -324,9 +325,11 @@ static int fuse_ctl_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *fuse_ctl_mount(struct file_system_type *fs_type,
-			int flags, const char *dev_name, void *raw_data)
+				     int flags, const char *dev_name,
+				     void *raw_data, size_t data_size)
 {
-	return mount_single(fs_type, flags, raw_data, fuse_ctl_fill_super);
+	return mount_single(fs_type, flags, raw_data, data_size,
+			    fuse_ctl_fill_super);
 }
 
 static void fuse_ctl_kill_sb(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index a24df8861b40..85b3954945af 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -138,7 +138,8 @@ static void fuse_evict_inode(struct inode *inode)
 	}
 }
 
-static int fuse_remount_fs(struct super_block *sb, int *flags, char *data)
+static int fuse_remount_fs(struct super_block *sb, int *flags,
+			   char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	if (*flags & SB_MANDLOCK)
@@ -1049,7 +1050,8 @@ void fuse_dev_free(struct fuse_dev *fud)
 }
 EXPORT_SYMBOL_GPL(fuse_dev_free);
 
-static int fuse_fill_super(struct super_block *sb, void *data, int silent)
+static int fuse_fill_super(struct super_block *sb, void *data, size_t data_size,
+			   int silent)
 {
 	struct fuse_dev *fud;
 	struct fuse_conn *fc;
@@ -1205,9 +1207,10 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 static struct dentry *fuse_mount(struct file_system_type *fs_type,
 		       int flags, const char *dev_name,
-		       void *raw_data)
+		       void *raw_data, size_t data_size)
 {
-	return mount_nodev(fs_type, flags, raw_data, fuse_fill_super);
+	return mount_nodev(fs_type, flags, raw_data, data_size,
+			   fuse_fill_super);
 }
 
 static void fuse_kill_sb_anon(struct super_block *sb)
@@ -1235,9 +1238,10 @@ MODULE_ALIAS_FS("fuse");
 #ifdef CONFIG_BLOCK
 static struct dentry *fuse_mount_blk(struct file_system_type *fs_type,
 			   int flags, const char *dev_name,
-			   void *raw_data)
+			   void *raw_data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, raw_data, fuse_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, raw_data, data_size,
+			  fuse_fill_super);
 }
 
 static void fuse_kill_sb_blk(struct super_block *sb)
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index c2469833b4fb..f757b5dfc960 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1220,6 +1220,7 @@ static int test_gfs2_super(struct super_block *s, void *ptr)
  * @flags: Mount flags
  * @dev_name: The name of the device
  * @data: The mount arguments
+ * @data_size: The size of the mount arguments
  *
  * Q. Why not use get_sb_bdev() ?
  * A. We need to select one of two root directories to mount, independent
@@ -1229,7 +1230,7 @@ static int test_gfs2_super(struct super_block *s, void *ptr)
  */
 
 static struct dentry *gfs2_mount(struct file_system_type *fs_type, int flags,
-		       const char *dev_name, void *data)
+		       const char *dev_name, void *data, size_t data_size)
 {
 	struct block_device *bdev;
 	struct super_block *s;
@@ -1326,7 +1327,8 @@ static int set_meta_super(struct super_block *s, void *ptr)
 }
 
 static struct dentry *gfs2_mount_meta(struct file_system_type *fs_type,
-			int flags, const char *dev_name, void *data)
+				      int flags, const char *dev_name,
+				      void *data, size_t data_size)
 {
 	struct super_block *s;
 	struct gfs2_sbd *sdp;
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index af0d5b01cf0b..add7922414aa 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1228,11 +1228,13 @@ static int gfs2_statfs(struct dentry *dentry, struct kstatfs *buf)
  * @sb:  the filesystem
  * @flags:  the remount flags
  * @data:  extra data passed in (not used right now)
+ * @data_size: size of the extra data
  *
  * Returns: errno
  */
 
-static int gfs2_remount_fs(struct super_block *sb, int *flags, char *data)
+static int gfs2_remount_fs(struct super_block *sb, int *flags,
+			   char *data, size_t data_size)
 {
 	struct gfs2_sbd *sdp = sb->s_fs_info;
 	struct gfs2_args args = sdp->sd_args; /* Default to current settings */
diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index 173876782f73..e739b381b041 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -111,7 +111,8 @@ static int hfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return 0;
 }
 
-static int hfs_remount(struct super_block *sb, int *flags, char *data)
+static int hfs_remount(struct super_block *sb, int *flags,
+		       char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	*flags |= SB_NODIRATIME;
@@ -382,7 +383,8 @@ static int parse_options(char *options, struct hfs_sb_info *hsb)
  * hfs_btree_init() to get the necessary data about the extents and
  * catalog B-trees and, finally, reading the root inode into memory.
  */
-static int hfs_fill_super(struct super_block *sb, void *data, int silent)
+static int hfs_fill_super(struct super_block *sb, void *data, size_t data_size,
+			  int silent)
 {
 	struct hfs_sb_info *sbi;
 	struct hfs_find_data fd;
@@ -458,9 +460,11 @@ static int hfs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *hfs_mount(struct file_system_type *fs_type,
-		      int flags, const char *dev_name, void *data)
+				int flags, const char *dev_name,
+				void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, hfs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  hfs_fill_super);
 }
 
 static struct file_system_type hfs_fs_type = {
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index a6c0f54c48c3..9c5f19922e4a 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -326,7 +326,8 @@ static int hfsplus_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return 0;
 }
 
-static int hfsplus_remount(struct super_block *sb, int *flags, char *data)
+static int hfsplus_remount(struct super_block *sb, int *flags,
+			   char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	if ((bool)(*flags & SB_RDONLY) == sb_rdonly(sb))
@@ -371,7 +372,8 @@ static const struct super_operations hfsplus_sops = {
 	.show_options	= hfsplus_show_options,
 };
 
-static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
+static int hfsplus_fill_super(struct super_block *sb,
+			      void *data, size_t data_size, int silent)
 {
 	struct hfsplus_vh *vhdr;
 	struct hfsplus_sb_info *sbi;
@@ -641,9 +643,11 @@ static void hfsplus_destroy_inode(struct inode *inode)
 #define HFSPLUS_INODE_SIZE	sizeof(struct hfsplus_inode_info)
 
 static struct dentry *hfsplus_mount(struct file_system_type *fs_type,
-			  int flags, const char *dev_name, void *data)
+				    int flags, const char *dev_name,
+				    void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, hfsplus_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  hfsplus_fill_super);
 }
 
 static struct file_system_type hfsplus_fs_type = {
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 2597b290c2a5..42f1ec3cb9cf 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -923,7 +923,8 @@ static const struct inode_operations hostfs_link_iops = {
 	.get_link	= hostfs_get_link,
 };
 
-static int hostfs_fill_sb_common(struct super_block *sb, void *d, int silent)
+static int hostfs_fill_sb_common(struct super_block *sb,
+				 void *d, size_t data_size, int silent)
 {
 	struct inode *root_inode;
 	char *host_root_path, *req_root = d;
@@ -983,9 +984,9 @@ static int hostfs_fill_sb_common(struct super_block *sb, void *d, int silent)
 
 static struct dentry *hostfs_read_sb(struct file_system_type *type,
 			  int flags, const char *dev_name,
-			  void *data)
+			  void *data, size_t data_size)
 {
-	return mount_nodev(type, flags, data, hostfs_fill_sb_common);
+	return mount_nodev(type, flags, data, data_size, hostfs_fill_sb_common);
 }
 
 static void hostfs_kill_sb(struct super_block *s)
diff --git a/fs/hpfs/super.c b/fs/hpfs/super.c
index f2c3ebcd309c..53e585b27c05 100644
--- a/fs/hpfs/super.c
+++ b/fs/hpfs/super.c
@@ -445,7 +445,8 @@ HPFS filesystem options:\n\
 \n");
 }
 
-static int hpfs_remount_fs(struct super_block *s, int *flags, char *data)
+static int hpfs_remount_fs(struct super_block *s, int *flags,
+			   char *data, size_t data_size)
 {
 	kuid_t uid;
 	kgid_t gid;
@@ -540,7 +541,8 @@ static const struct super_operations hpfs_sops =
 	.show_options	= hpfs_show_options,
 };
 
-static int hpfs_fill_super(struct super_block *s, void *options, int silent)
+static int hpfs_fill_super(struct super_block *s,
+			   void *options, size_t data_size, int silent)
 {
 	struct buffer_head *bh0, *bh1, *bh2;
 	struct hpfs_boot_block *bootblock;
@@ -757,9 +759,10 @@ bail2:	brelse(bh0);
 }
 
 static struct dentry *hpfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, hpfs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  hpfs_fill_super);
 }
 
 static struct file_system_type hpfs_fs_type = {
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d508c7844681..76fb8eb2bea8 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1220,7 +1220,8 @@ hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig)
 }
 
 static int
-hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
+hugetlbfs_fill_super(struct super_block *sb, void *data, size_t data_size,
+		     int silent)
 {
 	int ret;
 	struct hugetlbfs_config config;
@@ -1279,9 +1280,10 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_nodev(fs_type, flags, data, hugetlbfs_fill_super);
+	return mount_nodev(fs_type, flags, data, data_size,
+			   hugetlbfs_fill_super);
 }
 
 static struct file_system_type hugetlbfs_fs_type = {
@@ -1420,10 +1422,11 @@ static int __init init_hugetlbfs_fs(void)
 	for_each_hstate(h) {
 		char buf[50];
 		unsigned ps_kb = 1U << (h->order + PAGE_SHIFT - 10);
+		int n;
 
-		snprintf(buf, sizeof(buf), "pagesize=%uK", ps_kb);
+		n = snprintf(buf, sizeof(buf), "pagesize=%uK", ps_kb);
 		hugetlbfs_vfsmount[i] = kern_mount_data(&hugetlbfs_fs_type,
-							buf);
+							buf, n + 1);
 
 		if (IS_ERR(hugetlbfs_vfsmount[i])) {
 			pr_err("Cannot mount internal hugetlbfs for "
diff --git a/fs/internal.h b/fs/internal.h
index b55575b9b55c..383ee4724f77 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -99,10 +99,10 @@ extern struct file *get_empty_filp(void);
 /*
  * super.c
  */
-extern int do_remount_sb(struct super_block *, int, void *, int);
+extern int do_remount_sb(struct super_block *, int, void *, size_t, int);
 extern bool trylock_super(struct super_block *sb);
 extern struct dentry *mount_fs(struct file_system_type *,
-			       int, const char *, void *);
+			       int, const char *, void *, size_t);
 extern struct super_block *user_get_super(dev_t);
 
 /*
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index ec3fba7d492f..71138cbed995 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -111,7 +111,8 @@ static void destroy_inodecache(void)
 	kmem_cache_destroy(isofs_inode_cachep);
 }
 
-static int isofs_remount(struct super_block *sb, int *flags, char *data)
+static int isofs_remount(struct super_block *sb, int *flags,
+			 char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	if (!(*flags & SB_RDONLY))
@@ -619,7 +620,8 @@ static bool rootdir_empty(struct super_block *sb, unsigned long block)
  * Note: a check_disk_change() has been done immediately prior
  * to this call, so we don't need to check again.
  */
-static int isofs_fill_super(struct super_block *s, void *data, int silent)
+static int isofs_fill_super(struct super_block *s, void *data, size_t data_size,
+			    int silent)
 {
 	struct buffer_head *bh = NULL, *pri_bh = NULL;
 	struct hs_primary_descriptor *h_pri = NULL;
@@ -1558,9 +1560,10 @@ struct inode *__isofs_iget(struct super_block *sb,
 }
 
 static struct dentry *isofs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, isofs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  isofs_fill_super);
 }
 
 static struct file_system_type iso9660_fs_type = {
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index 87bdf0f4cba1..c4f220f1a531 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -238,7 +238,8 @@ static int jffs2_parse_options(struct jffs2_sb_info *c, char *data)
 	return 0;
 }
 
-static int jffs2_remount_fs(struct super_block *sb, int *flags, char *data)
+static int jffs2_remount_fs(struct super_block *sb, int *flags,
+			    char *data, size_t data_size)
 {
 	struct jffs2_sb_info *c = JFFS2_SB_INFO(sb);
 	int err;
@@ -267,7 +268,8 @@ static const struct super_operations jffs2_super_operations =
 /*
  * fill in the superblock
  */
-static int jffs2_fill_super(struct super_block *sb, void *data, int silent)
+static int jffs2_fill_super(struct super_block *sb,
+			    void *data, size_t data_size, int silent)
 {
 	struct jffs2_sb_info *c;
 	int ret;
@@ -312,9 +314,9 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent)
 
 static struct dentry *jffs2_mount(struct file_system_type *fs_type,
 			int flags, const char *dev_name,
-			void *data)
+			void *data, size_t data_size)
 {
-	return mount_mtd(fs_type, flags, dev_name, data, jffs2_fill_super);
+	return mount_mtd(fs_type, flags, dev_name, data, data_size, jffs2_fill_super);
 }
 
 static void jffs2_put_super (struct super_block *sb)
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index 1b9264fd54b6..88f30ff12564 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -456,7 +456,8 @@ static int parse_options(char *options, struct super_block *sb, s64 *newLVSize,
 	return 0;
 }
 
-static int jfs_remount(struct super_block *sb, int *flags, char *data)
+static int jfs_remount(struct super_block *sb, int *flags,
+		       char *data, size_t data_size)
 {
 	s64 newLVSize = 0;
 	int rc = 0;
@@ -516,7 +517,8 @@ static int jfs_remount(struct super_block *sb, int *flags, char *data)
 	return 0;
 }
 
-static int jfs_fill_super(struct super_block *sb, void *data, int silent)
+static int jfs_fill_super(struct super_block *sb, void *data, size_t data_size,
+			  int silent)
 {
 	struct jfs_sb_info *sbi;
 	struct inode *inode;
@@ -698,9 +700,10 @@ static int jfs_unfreeze(struct super_block *sb)
 }
 
 static struct dentry *jfs_do_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, jfs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  jfs_fill_super);
 }
 
 static int jfs_sync_fs(struct super_block *sb, int wait)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index ff2716f9322e..f70e0b69e714 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -22,7 +22,8 @@
 
 struct kmem_cache *kernfs_node_cache;
 
-static int kernfs_sop_remount_fs(struct super_block *sb, int *flags, char *data)
+static int kernfs_sop_remount_fs(struct super_block *sb, int *flags,
+				 char *data, size_t data_size)
 {
 	struct kernfs_root *root = kernfs_info(sb)->root;
 	struct kernfs_syscall_ops *scops = root->syscall_ops;
diff --git a/fs/libfs.c b/fs/libfs.c
index 0fb590d79f30..9f1f4884b7cc 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -578,7 +578,7 @@ int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *c
 	spin_lock(&pin_fs_lock);
 	if (unlikely(!*mount)) {
 		spin_unlock(&pin_fs_lock);
-		mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL);
+		mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL, 0);
 		if (IS_ERR(mnt))
 			return PTR_ERR(mnt);
 		spin_lock(&pin_fs_lock);
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index 72e308c3e66b..3d91d9096b24 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -22,7 +22,8 @@
 static int minix_write_inode(struct inode *inode,
 		struct writeback_control *wbc);
 static int minix_statfs(struct dentry *dentry, struct kstatfs *buf);
-static int minix_remount (struct super_block * sb, int * flags, char * data);
+static int minix_remount (struct super_block * sb, int * flags,
+			  char * data, size_t data_size);
 
 static void minix_evict_inode(struct inode *inode)
 {
@@ -118,7 +119,8 @@ static const struct super_operations minix_sops = {
 	.remount_fs	= minix_remount,
 };
 
-static int minix_remount (struct super_block * sb, int * flags, char * data)
+static int minix_remount (struct super_block * sb, int * flags,
+			  char * data, size_t data_size)
 {
 	struct minix_sb_info * sbi = minix_sb(sb);
 	struct minix_super_block * ms;
@@ -155,7 +157,8 @@ static int minix_remount (struct super_block * sb, int * flags, char * data)
 	return 0;
 }
 
-static int minix_fill_super(struct super_block *s, void *data, int silent)
+static int minix_fill_super(struct super_block *s, void *data, size_t data_size,
+			    int silent)
 {
 	struct buffer_head *bh;
 	struct buffer_head **map;
@@ -651,9 +654,10 @@ void minix_truncate(struct inode * inode)
 }
 
 static struct dentry *minix_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, minix_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  minix_fill_super);
 }
 
 static struct file_system_type minix_fs_type = {
diff --git a/fs/namespace.c b/fs/namespace.c
index bd2526b24afb..3981fd7b13f5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1021,7 +1021,8 @@ static struct mount *skip_mnt_tree(struct mount *p)
 }
 
 struct vfsmount *
-vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
+vfs_kern_mount(struct file_system_type *type, int flags, const char *name,
+	       void *data, size_t data_size)
 {
 	struct mount *mnt;
 	struct dentry *root;
@@ -1036,7 +1037,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 	if (flags & SB_KERNMOUNT)
 		mnt->mnt.mnt_flags = MNT_INTERNAL;
 
-	root = mount_fs(type, flags, name, data);
+	root = mount_fs(type, flags, name, data, data_size);
 	if (IS_ERR(root)) {
 		mnt_free_id(mnt);
 		free_vfsmnt(mnt);
@@ -1056,7 +1057,7 @@ EXPORT_SYMBOL_GPL(vfs_kern_mount);
 
 struct vfsmount *
 vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
-	     const char *name, void *data)
+	     const char *name, void *data, size_t data_size)
 {
 	/* Until it is worked out how to pass the user namespace
 	 * through from the parent mount to the submount don't support
@@ -1065,7 +1066,7 @@ vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
 	if (mountpoint->d_sb->s_user_ns != &init_user_ns)
 		return ERR_PTR(-EPERM);
 
-	return vfs_kern_mount(type, SB_SUBMOUNT, name, data);
+	return vfs_kern_mount(type, SB_SUBMOUNT, name, data, data_size);
 }
 EXPORT_SYMBOL_GPL(vfs_submount);
 
@@ -1596,7 +1597,7 @@ static int do_umount(struct mount *mnt, int flags)
 			return -EPERM;
 		down_write(&sb->s_umount);
 		if (!sb_rdonly(sb))
-			retval = do_remount_sb(sb, SB_RDONLY, NULL, 0);
+			retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, 0);
 		up_write(&sb->s_umount);
 		return retval;
 	}
@@ -2377,7 +2378,7 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
  * on it - tough luck.
  */
 static int do_remount(struct path *path, int ms_flags, int sb_flags,
-		      int mnt_flags, void *data)
+		      int mnt_flags, void *data, size_t data_size)
 {
 	int err;
 	struct super_block *sb = path->mnt->mnt_sb;
@@ -2416,7 +2417,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 		return -EPERM;
 	}
 
-	err = security_sb_remount(sb, data);
+	err = security_sb_remount(sb, data, data_size);
 	if (err)
 		return err;
 
@@ -2426,7 +2427,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 	else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		err = -EPERM;
 	else
-		err = do_remount_sb(sb, sb_flags, data, 0);
+		err = do_remount_sb(sb, sb_flags, data, data_size, 0);
 	if (!err) {
 		lock_mount_hash();
 		mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
@@ -2613,7 +2614,8 @@ static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
  * namespace's tree
  */
 static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
-			int mnt_flags, const char *name, void *data)
+			int mnt_flags, const char *name,
+			void *data, size_t data_size)
 {
 	struct file_system_type *type;
 	struct vfsmount *mnt;
@@ -2626,7 +2628,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
 	if (!type)
 		return -ENODEV;
 
-	mnt = vfs_kern_mount(type, sb_flags, name, data);
+	mnt = vfs_kern_mount(type, sb_flags, name, data, data_size);
 	if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
 	    !mnt->mnt_sb->s_subtype)
 		mnt = fs_set_subtype(mnt, fstype);
@@ -2882,6 +2884,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 {
 	struct path path;
 	unsigned int mnt_flags = 0, sb_flags;
+	size_t data_size = data_page ? PAGE_SIZE : 0;
 	int retval = 0;
 
 	/* Discard magic */
@@ -2900,8 +2903,8 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 	if (retval)
 		return retval;
 
-	retval = security_sb_mount(dev_name, &path,
-				   type_page, flags, data_page);
+	retval = security_sb_mount(dev_name, &path, type_page, flags,
+				   data_page, data_size);
 	if (!retval && !may_mount())
 		retval = -EPERM;
 	if (!retval && (flags & SB_MANDLOCK) && !may_mandlock())
@@ -2948,7 +2951,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 
 	if (flags & MS_REMOUNT)
 		retval = do_remount(&path, flags, sb_flags, mnt_flags,
-				    data_page);
+				    data_page, data_size);
 	else if (flags & MS_BIND)
 		retval = do_loopback(&path, dev_name, flags & MS_REC);
 	else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
@@ -2957,7 +2960,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 		retval = do_move_mount_old(&path, dev_name);
 	else
 		retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
-				      dev_name, data_page);
+				      dev_name, data_page, data_size);
 dput_out:
 	path_put(&path);
 	return retval;
@@ -3404,7 +3407,7 @@ static void __init init_mount_tree(void)
 	type = get_fs_type("rootfs");
 	if (!type)
 		panic("Can't find rootfs type");
-	mnt = vfs_kern_mount(type, 0, "rootfs", NULL);
+	mnt = vfs_kern_mount(type, 0, "rootfs", NULL, 0);
 	put_filesystem(type);
 	if (IS_ERR(mnt))
 		panic("Can't create rootfs");
@@ -3466,10 +3469,11 @@ void put_mnt_ns(struct mnt_namespace *ns)
 	free_mnt_ns(ns);
 }
 
-struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
+struct vfsmount *kern_mount_data(struct file_system_type *type,
+				 void *data, size_t data_size)
 {
 	struct vfsmount *mnt;
-	mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, data);
+	mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
 	if (!IS_ERR(mnt)) {
 		/*
 		 * it is a longterm mount, don't release mnt until
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 8357ff69962f..db0f3ca3a35c 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -405,7 +405,7 @@ int nfs_set_sb_security(struct super_block *, struct dentry *, struct nfs_mount_
 int nfs_clone_sb_security(struct super_block *, struct dentry *, struct nfs_mount_info *);
 struct dentry *nfs_fs_mount_common(struct nfs_server *, int, const char *,
 				   struct nfs_mount_info *, struct nfs_subversion *);
-struct dentry *nfs_fs_mount(struct file_system_type *, int, const char *, void *);
+struct dentry *nfs_fs_mount(struct file_system_type *, int, const char *, void *, size_t);
 struct dentry * nfs_xdev_mount_common(struct file_system_type *, int,
 		const char *, struct nfs_mount_info *);
 void nfs_kill_super(struct super_block *);
@@ -466,7 +466,7 @@ int  nfs_show_options(struct seq_file *, struct dentry *);
 int  nfs_show_devname(struct seq_file *, struct dentry *);
 int  nfs_show_path(struct seq_file *, struct dentry *);
 int  nfs_show_stats(struct seq_file *, struct dentry *);
-int nfs_remount(struct super_block *sb, int *flags, char *raw_data);
+int nfs_remount(struct super_block *sb, int *flags, char *raw_data, size_t data_size);
 
 /* write.c */
 extern void nfs_pageio_init_write(struct nfs_pageio_descriptor *pgio,
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index e5686be67be8..df9e87331558 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -216,7 +216,8 @@ static struct vfsmount *nfs_do_clone_mount(struct nfs_server *server,
 					   const char *devname,
 					   struct nfs_clone_mount *mountdata)
 {
-	return vfs_submount(mountdata->dentry, &nfs_xdev_fs_type, devname, mountdata);
+	return vfs_submount(mountdata->dentry, &nfs_xdev_fs_type, devname,
+			    mountdata, 0);
 }
 
 /**
diff --git a/fs/nfs/nfs4namespace.c b/fs/nfs/nfs4namespace.c
index 24f06dcc2b08..191cb4202056 100644
--- a/fs/nfs/nfs4namespace.c
+++ b/fs/nfs/nfs4namespace.c
@@ -278,7 +278,8 @@ static struct vfsmount *try_location(struct nfs_clone_mount *mountdata,
 				mountdata->hostname,
 				mountdata->mnt_path);
 
-		mnt = vfs_submount(mountdata->dentry, &nfs4_referral_fs_type, page, mountdata);
+		mnt = vfs_submount(mountdata->dentry, &nfs4_referral_fs_type, page,
+				   mountdata, 0);
 		if (!IS_ERR(mnt))
 			break;
 	}
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 6fb7cb6b3f4b..e72e5dbdfcd0 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -18,11 +18,11 @@
 static int nfs4_write_inode(struct inode *inode, struct writeback_control *wbc);
 static void nfs4_evict_inode(struct inode *inode);
 static struct dentry *nfs4_remote_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *raw_data);
+	int flags, const char *dev_name, void *raw_data, size_t data_size);
 static struct dentry *nfs4_referral_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *raw_data);
+	int flags, const char *dev_name, void *raw_data, size_t data_size);
 static struct dentry *nfs4_remote_referral_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *raw_data);
+	int flags, const char *dev_name, void *raw_data, size_t data_size);
 
 static struct file_system_type nfs4_remote_fs_type = {
 	.owner		= THIS_MODULE,
@@ -105,7 +105,7 @@ static void nfs4_evict_inode(struct inode *inode)
  */
 static struct dentry *
 nfs4_remote_mount(struct file_system_type *fs_type, int flags,
-		  const char *dev_name, void *info)
+		  const char *dev_name, void *info, size_t data_size)
 {
 	struct nfs_mount_info *mount_info = info;
 	struct nfs_server *server;
@@ -127,7 +127,7 @@ nfs4_remote_mount(struct file_system_type *fs_type, int flags,
 }
 
 static struct vfsmount *nfs_do_root_mount(struct file_system_type *fs_type,
-		int flags, void *data, const char *hostname)
+		int flags, void *data, size_t data_size, const char *hostname)
 {
 	struct vfsmount *root_mnt;
 	char *root_devname;
@@ -142,7 +142,8 @@ static struct vfsmount *nfs_do_root_mount(struct file_system_type *fs_type,
 		snprintf(root_devname, len, "[%s]:/", hostname);
 	else
 		snprintf(root_devname, len, "%s:/", hostname);
-	root_mnt = vfs_kern_mount(fs_type, flags, root_devname, data);
+	root_mnt = vfs_kern_mount(fs_type, flags, root_devname,
+				  data, data_size);
 	kfree(root_devname);
 	return root_mnt;
 }
@@ -247,8 +248,8 @@ struct dentry *nfs4_try_mount(int flags, const char *dev_name,
 
 	export_path = data->nfs_server.export_path;
 	data->nfs_server.export_path = "/";
-	root_mnt = nfs_do_root_mount(&nfs4_remote_fs_type, flags, mount_info,
-			data->nfs_server.hostname);
+	root_mnt = nfs_do_root_mount(&nfs4_remote_fs_type, flags, mount_info, 0,
+				     data->nfs_server.hostname);
 	data->nfs_server.export_path = export_path;
 
 	res = nfs_follow_remote_path(root_mnt, export_path);
@@ -261,7 +262,8 @@ struct dentry *nfs4_try_mount(int flags, const char *dev_name,
 
 static struct dentry *
 nfs4_remote_referral_mount(struct file_system_type *fs_type, int flags,
-			   const char *dev_name, void *raw_data)
+			   const char *dev_name,
+			   void *raw_data, size_t data_size)
 {
 	struct nfs_mount_info mount_info = {
 		.fill_super = nfs_fill_super,
@@ -294,7 +296,8 @@ nfs4_remote_referral_mount(struct file_system_type *fs_type, int flags,
  * Create an NFS4 server record on referral traversal
  */
 static struct dentry *nfs4_referral_mount(struct file_system_type *fs_type,
-		int flags, const char *dev_name, void *raw_data)
+					  int flags, const char *dev_name,
+					  void *raw_data, size_t data_size)
 {
 	struct nfs_clone_mount *data = raw_data;
 	char *export_path;
@@ -306,8 +309,8 @@ static struct dentry *nfs4_referral_mount(struct file_system_type *fs_type,
 	export_path = data->mnt_path;
 	data->mnt_path = "/";
 
-	root_mnt = nfs_do_root_mount(&nfs4_remote_referral_fs_type,
-			flags, data, data->hostname);
+	root_mnt = nfs_do_root_mount(&nfs4_remote_referral_fs_type, flags,
+				     data, 0, data->hostname);
 	data->mnt_path = export_path;
 
 	res = nfs_follow_remote_path(root_mnt, export_path);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 5e470e233c83..b5f27d6999e5 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -287,7 +287,8 @@ static match_table_t nfs_vers_tokens = {
 };
 
 static struct dentry *nfs_xdev_mount(struct file_system_type *fs_type,
-		int flags, const char *dev_name, void *raw_data);
+				     int flags, const char *dev_name,
+				     void *raw_data, size_t data_size);
 
 struct file_system_type nfs_fs_type = {
 	.owner		= THIS_MODULE,
@@ -1203,7 +1204,7 @@ static int nfs_get_option_ul_bound(substring_t args[], unsigned long *option,
  * skipped as they are encountered.  If there were no errors, return 1;
  * otherwise return 0 (zero).
  */
-static int nfs_parse_mount_options(char *raw,
+static int nfs_parse_mount_options(char *raw, size_t raw_size,
 				   struct nfs_parsed_mount_data *mnt)
 {
 	char *p, *string, *secdata;
@@ -1221,7 +1222,7 @@ static int nfs_parse_mount_options(char *raw,
 	if (!secdata)
 		goto out_nomem;
 
-	rc = security_sb_copy_data(raw, secdata);
+	rc = security_sb_copy_data(raw, raw_size, secdata);
 	if (rc)
 		goto out_security_failure;
 
@@ -2151,7 +2152,7 @@ static int nfs_validate_mount_data(struct file_system_type *fs_type,
 }
 #endif
 
-static int nfs_validate_text_mount_data(void *options,
+static int nfs_validate_text_mount_data(void *options, size_t data_size,
 					struct nfs_parsed_mount_data *args,
 					const char *dev_name)
 {
@@ -2160,7 +2161,7 @@ static int nfs_validate_text_mount_data(void *options,
 	int max_pathlen = NFS_MAXPATHLEN;
 	struct sockaddr *sap = (struct sockaddr *)&args->nfs_server.address;
 
-	if (nfs_parse_mount_options((char *)options, args) == 0)
+	if (nfs_parse_mount_options((char *)options, data_size, args) == 0)
 		return -EINVAL;
 
 	if (!nfs_verify_server_address(sap))
@@ -2243,7 +2244,7 @@ nfs_compare_remount_data(struct nfs_server *nfss,
 }
 
 int
-nfs_remount(struct super_block *sb, int *flags, char *raw_data)
+nfs_remount(struct super_block *sb, int *flags, char *raw_data, size_t data_size)
 {
 	int error;
 	struct nfs_server *nfss = sb->s_fs_info;
@@ -2290,7 +2291,7 @@ nfs_remount(struct super_block *sb, int *flags, char *raw_data)
 
 	/* overwrite those values with any that were specified */
 	error = -EINVAL;
-	if (!nfs_parse_mount_options((char *)options, data))
+	if (!nfs_parse_mount_options((char *)options, data_size, data))
 		goto out;
 
 	/*
@@ -2662,7 +2663,7 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
 EXPORT_SYMBOL_GPL(nfs_fs_mount_common);
 
 struct dentry *nfs_fs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *raw_data)
+	int flags, const char *dev_name, void *raw_data, size_t data_size)
 {
 	struct nfs_mount_info mount_info = {
 		.fill_super = nfs_fill_super,
@@ -2680,7 +2681,8 @@ struct dentry *nfs_fs_mount(struct file_system_type *fs_type,
 	/* Validate the mount data */
 	error = nfs_validate_mount_data(fs_type, raw_data, mount_info.parsed, mount_info.mntfh, dev_name);
 	if (error == NFS_TEXT_DATA)
-		error = nfs_validate_text_mount_data(raw_data, mount_info.parsed, dev_name);
+		error = nfs_validate_text_mount_data(raw_data, data_size,
+						     mount_info.parsed, dev_name);
 	if (error < 0) {
 		mntroot = ERR_PTR(error);
 		goto out;
@@ -2724,7 +2726,7 @@ EXPORT_SYMBOL_GPL(nfs_kill_super);
  */
 static struct dentry *
 nfs_xdev_mount(struct file_system_type *fs_type, int flags,
-		const char *dev_name, void *raw_data)
+		const char *dev_name, void *raw_data, size_t data_size)
 {
 	struct nfs_clone_mount *data = raw_data;
 	struct nfs_mount_info mount_info = {
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index d107b4426f7e..661296305123 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1144,7 +1144,8 @@ static ssize_t write_v4_end_grace(struct file *file, char *buf, size_t size)
  *	populating the filesystem.
  */
 
-static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
+static int nfsd_fill_super(struct super_block * sb,
+			   void * data, size_t data_size, int silent)
 {
 	static const struct tree_descr nfsd_files[] = {
 		[NFSD_List] = {"exports", &exports_nfsd_operations, S_IRUGO},
@@ -1179,10 +1180,11 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
 }
 
 static struct dentry *nfsd_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
 	struct net *net = current->nsproxy->net_ns;
-	return mount_ns(fs_type, flags, data, net, net->user_ns, nfsd_fill_super);
+	return mount_ns(fs_type, flags, data, data_size,
+			net, net->user_ns, nfsd_fill_super);
 }
 
 static void nfsd_umount(struct super_block *sb)
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 6ffeca84d7c3..3a21a1ab141f 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -69,7 +69,8 @@ struct kmem_cache *nilfs_segbuf_cachep;
 struct kmem_cache *nilfs_btree_path_cache;
 
 static int nilfs_setup_super(struct super_block *sb, int is_mount);
-static int nilfs_remount(struct super_block *sb, int *flags, char *data);
+static int nilfs_remount(struct super_block *sb, int *flags,
+			 char *data, size_t data_size);
 
 void __nilfs_msg(struct super_block *sb, const char *level, const char *fmt,
 		 ...)
@@ -1118,7 +1119,8 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
 	return err;
 }
 
-static int nilfs_remount(struct super_block *sb, int *flags, char *data)
+static int nilfs_remount(struct super_block *sb, int *flags,
+			 char *data, size_t data_size)
 {
 	struct the_nilfs *nilfs = sb->s_fs_info;
 	unsigned long old_sb_flags;
@@ -1278,7 +1280,7 @@ static int nilfs_test_bdev_super(struct super_block *s, void *data)
 
 static struct dentry *
 nilfs_mount(struct file_system_type *fs_type, int flags,
-	     const char *dev_name, void *data)
+	    const char *dev_name, void *data, size_t data_size)
 {
 	struct nilfs_super_data sd;
 	struct super_block *s;
@@ -1346,7 +1348,7 @@ nilfs_mount(struct file_system_type *fs_type, int flags,
 			 * Try remount to setup mount states if the current
 			 * tree is not mounted and only snapshots use this sb.
 			 */
-			err = nilfs_remount(s, &flags, data);
+			err = nilfs_remount(s, &flags, data, data_size);
 			if (err)
 				goto failed_super;
 		}
diff --git a/fs/nsfs.c b/fs/nsfs.c
index 60702d677bd4..f069eb6495b0 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -263,7 +263,8 @@ static const struct super_operations nsfs_ops = {
 	.show_path = nsfs_show_path,
 };
 static struct dentry *nsfs_mount(struct file_system_type *fs_type,
-			int flags, const char *dev_name, void *data)
+				 int flags, const char *dev_name,
+				 void *data, size_t data_size)
 {
 	return mount_pseudo(fs_type, "nsfs:", &nsfs_ops,
 			&ns_dentry_operations, NSFS_MAGIC);
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index bb7159f697f2..8501bbcceb5a 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -456,6 +456,7 @@ static inline int ntfs_clear_volume_flags(ntfs_volume *vol, VOLUME_FLAGS flags)
  * @sb:		superblock of mounted ntfs filesystem
  * @flags:	remount flags
  * @opt:	remount options string
+ * @data_size:	size of the options string
  *
  * Change the mount options of an already mounted ntfs filesystem.
  *
@@ -463,7 +464,8 @@ static inline int ntfs_clear_volume_flags(ntfs_volume *vol, VOLUME_FLAGS flags)
  * ntfs_remount() returns successfully (i.e. returns 0).  Otherwise,
  * @sb->s_flags are not changed.
  */
-static int ntfs_remount(struct super_block *sb, int *flags, char *opt)
+static int ntfs_remount(struct super_block *sb, int *flags,
+			char *opt, size_t data_size)
 {
 	ntfs_volume *vol = NTFS_SB(sb);
 
@@ -2694,6 +2696,7 @@ static const struct super_operations ntfs_sops = {
  * ntfs_fill_super - mount an ntfs filesystem
  * @sb:		super block of ntfs filesystem to mount
  * @opt:	string containing the mount options
+ * @data_size:	size of the mount options string
  * @silent:	silence error output
  *
  * ntfs_fill_super() is called by the VFS to mount the device described by @sb
@@ -2708,7 +2711,8 @@ static const struct super_operations ntfs_sops = {
  *
  * NOTE: @sb->s_flags contains the mount options flags.
  */
-static int ntfs_fill_super(struct super_block *sb, void *opt, const int silent)
+static int ntfs_fill_super(struct super_block *sb, void *opt, size_t data_size,
+			   const int silent)
 {
 	ntfs_volume *vol;
 	struct buffer_head *bh;
@@ -3060,9 +3064,10 @@ struct kmem_cache *ntfs_index_ctx_cache;
 DEFINE_MUTEX(ntfs_lock);
 
 static struct dentry *ntfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, ntfs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  ntfs_fill_super);
 }
 
 static struct file_system_type ntfs_fs_type = {
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 602c71f32740..642e471a6472 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -568,6 +568,7 @@ static int dlmfs_unlink(struct inode *dir,
 
 static int dlmfs_fill_super(struct super_block * sb,
 			    void * data,
+			    size_t data_size,
 			    int silent)
 {
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
@@ -617,9 +618,9 @@ static const struct inode_operations dlmfs_file_inode_operations = {
 };
 
 static struct dentry *dlmfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_nodev(fs_type, flags, data, dlmfs_fill_super);
+	return mount_nodev(fs_type, flags, data, data_size, dlmfs_fill_super);
 }
 
 static struct file_system_type dlmfs_fs_type = {
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 3415e0b09398..62237837a098 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -107,7 +107,8 @@ static int ocfs2_check_set_options(struct super_block *sb,
 static int ocfs2_show_options(struct seq_file *s, struct dentry *root);
 static void ocfs2_put_super(struct super_block *sb);
 static int ocfs2_mount_volume(struct super_block *sb);
-static int ocfs2_remount(struct super_block *sb, int *flags, char *data);
+static int ocfs2_remount(struct super_block *sb, int *flags,
+			 char *data, size_t data_size);
 static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err);
 static int ocfs2_initialize_mem_caches(void);
 static void ocfs2_free_mem_caches(void);
@@ -633,7 +634,8 @@ static unsigned long long ocfs2_max_file_offset(unsigned int bbits,
 	return (((unsigned long long)bytes) << bitshift) - trim;
 }
 
-static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
+static int ocfs2_remount(struct super_block *sb, int *flags,
+			 char *data, size_t data_size)
 {
 	int incompat_features;
 	int ret = 0;
@@ -999,7 +1001,8 @@ static void ocfs2_disable_quotas(struct ocfs2_super *osb)
 	}
 }
 
-static int ocfs2_fill_super(struct super_block *sb, void *data, int silent)
+static int ocfs2_fill_super(struct super_block *sb, void *data, size_t data_size,
+			    int silent)
 {
 	struct dentry *root;
 	int status, sector_size;
@@ -1236,9 +1239,10 @@ static int ocfs2_fill_super(struct super_block *sb, void *data, int silent)
 static struct dentry *ocfs2_mount(struct file_system_type *fs_type,
 			int flags,
 			const char *dev_name,
-			void *data)
+			void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, ocfs2_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  ocfs2_fill_super);
 }
 
 static struct file_system_type ocfs2_fs_type = {
diff --git a/fs/omfs/inode.c b/fs/omfs/inode.c
index ee14af9e26f2..e5258fefcd2b 100644
--- a/fs/omfs/inode.c
+++ b/fs/omfs/inode.c
@@ -454,7 +454,8 @@ static int parse_options(char *options, struct omfs_sb_info *sbi)
 	return 1;
 }
 
-static int omfs_fill_super(struct super_block *sb, void *data, int silent)
+static int omfs_fill_super(struct super_block *sb, void *data, size_t data_size,
+			   int silent)
 {
 	struct buffer_head *bh, *bh2;
 	struct omfs_super_block *omfs_sb;
@@ -596,9 +597,11 @@ static int omfs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *omfs_mount(struct file_system_type *fs_type,
-			int flags, const char *dev_name, void *data)
+				 int flags, const char *dev_name,
+				 void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, omfs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  omfs_fill_super);
 }
 
 static struct file_system_type omfs_fs_type = {
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index 607092f367ad..e11c042b4766 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -365,7 +365,8 @@ static struct inode *openprom_iget(struct super_block *sb, ino_t ino)
 	return inode;
 }
 
-static int openprom_remount(struct super_block *sb, int *flags, char *data)
+static int openprom_remount(struct super_block *sb, int *flags,
+			    char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	*flags |= SB_NOATIME;
@@ -379,7 +380,8 @@ static const struct super_operations openprom_sops = {
 	.remount_fs	= openprom_remount,
 };
 
-static int openprom_fill_super(struct super_block *s, void *data, int silent)
+static int openprom_fill_super(struct super_block *s,
+			       void *data, size_t data_size, int silent)
 {
 	struct inode *root_inode;
 	struct op_inode_info *oi;
@@ -414,9 +416,10 @@ static int openprom_fill_super(struct super_block *s, void *data, int silent)
 }
 
 static struct dentry *openprom_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, openprom_fill_super);
+	return mount_single(fs_type, flags, data, data_size,
+			    openprom_fill_super);
 }
 
 static struct file_system_type openprom_fs_type = {
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index 17b24ad6b264..ed38b9a5e43a 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -318,7 +318,7 @@ extern uint64_t orangefs_features;
 struct dentry *orangefs_mount(struct file_system_type *fst,
 			   int flags,
 			   const char *devname,
-			   void *data);
+			   void *data, size_t data_size);
 
 void orangefs_kill_sb(struct super_block *sb);
 int orangefs_remount(struct orangefs_sb_info_s *);
diff --git a/fs/orangefs/super.c b/fs/orangefs/super.c
index dfaee90d30bd..784daf6667d1 100644
--- a/fs/orangefs/super.c
+++ b/fs/orangefs/super.c
@@ -207,7 +207,8 @@ static int orangefs_statfs(struct dentry *dentry, struct kstatfs *buf)
  * Remount as initiated by VFS layer.  We just need to reparse the mount
  * options, no need to signal pvfs2-client-core about it.
  */
-static int orangefs_remount_fs(struct super_block *sb, int *flags, char *data)
+static int orangefs_remount_fs(struct super_block *sb, int *flags,
+			       char *data, size_t data_size)
 {
 	gossip_debug(GOSSIP_SUPER_DEBUG, "orangefs_remount_fs: called\n");
 	return parse_mount_options(sb, data, 1);
@@ -457,7 +458,7 @@ static int orangefs_fill_sb(struct super_block *sb,
 struct dentry *orangefs_mount(struct file_system_type *fst,
 			   int flags,
 			   const char *devname,
-			   void *data)
+			   void *data, size_t data_size)
 {
 	int ret = -EINVAL;
 	struct super_block *sb = ERR_PTR(-EINVAL);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 704b37311467..ac34b1d5dec2 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -379,7 +379,8 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
 	return 0;
 }
 
-static int ovl_remount(struct super_block *sb, int *flags, char *data)
+static int ovl_remount(struct super_block *sb, int *flags,
+		       char *data, size_t data_size)
 {
 	struct ovl_fs *ofs = sb->s_fs_info;
 
@@ -1354,7 +1355,8 @@ static struct ovl_entry *ovl_get_lowerstack(struct super_block *sb,
 	goto out;
 }
 
-static int ovl_fill_super(struct super_block *sb, void *data, int silent)
+static int ovl_fill_super(struct super_block *sb, void *data, size_t data_size,
+			  int silent)
 {
 	struct path upperpath = { };
 	struct dentry *root_dentry;
@@ -1492,9 +1494,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *ovl_mount(struct file_system_type *fs_type, int flags,
-				const char *dev_name, void *raw_data)
+				const char *dev_name,
+				void *raw_data, size_t data_size)
 {
-	return mount_nodev(fs_type, flags, raw_data, ovl_fill_super);
+	return mount_nodev(fs_type, flags, raw_data, data_size, ovl_fill_super);
 }
 
 static struct file_system_type ovl_fs_type = {
diff --git a/fs/pipe.c b/fs/pipe.c
index bb0840e234f3..697cbb01b96f 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1183,7 +1183,8 @@ static const struct super_operations pipefs_ops = {
  * d_name - pipe: will go nicely and kill the special-casing in procfs.
  */
 static struct dentry *pipefs_mount(struct file_system_type *fs_type,
-			 int flags, const char *dev_name, void *data)
+				   int flags, const char *dev_name,
+				   void *data, size_t data_size)
 {
 	return mount_pseudo(fs_type, "pipe:", &pipefs_ops,
 			&pipefs_dentry_operations, PIPEFS_MAGIC);
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 85ffbd27f288..faf401935fa9 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -489,7 +489,8 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 	return inode;
 }
 
-int proc_fill_super(struct super_block *s, void *data, int silent)
+int proc_fill_super(struct super_block *s, void *data, size_t data_size,
+		    int silent)
 {
 	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
 	struct inode *root_inode;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index da3dbfa09e79..841b4391deb6 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -214,7 +214,7 @@ extern const struct inode_operations proc_pid_link_inode_operations;
 void proc_init_kmemcache(void);
 void set_proc_pid_nlink(void);
 extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
-extern int proc_fill_super(struct super_block *, void *data, int flags);
+extern int proc_fill_super(struct super_block *, void *, size_t, int);
 extern void proc_entry_rundown(struct proc_dir_entry *);
 
 /*
@@ -275,7 +275,7 @@ extern struct proc_dir_entry proc_root;
 extern int proc_parse_options(char *options, struct pid_namespace *pid);
 
 extern void proc_self_init(void);
-extern int proc_remount(struct super_block *, int *, char *);
+extern int proc_remount(struct super_block *, int *, char *, size_t);
 
 /*
  * task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index f4b1a9d2eca6..28fadb0c51ab 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -78,7 +78,8 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 	return 1;
 }
 
-int proc_remount(struct super_block *sb, int *flags, char *data)
+int proc_remount(struct super_block *sb, int *flags,
+		 char *data, size_t data_size)
 {
 	struct pid_namespace *pid = sb->s_fs_info;
 
@@ -87,7 +88,8 @@ int proc_remount(struct super_block *sb, int *flags, char *data)
 }
 
 static struct dentry *proc_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+				 int flags, const char *dev_name,
+				 void *data, size_t data_size)
 {
 	struct pid_namespace *ns;
 
@@ -98,7 +100,8 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 		ns = task_active_pid_ns(current);
 	}
 
-	return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+	return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
+			proc_fill_super);
 }
 
 static void proc_kill_sb(struct super_block *sb)
@@ -211,7 +214,7 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
 {
 	struct vfsmount *mnt;
 
-	mnt = kern_mount_data(&proc_fs_type, ns);
+	mnt = kern_mount_data(&proc_fs_type, ns, 0);
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 
diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
index 5fcb845b9fec..793258231096 100644
--- a/fs/pstore/inode.c
+++ b/fs/pstore/inode.c
@@ -271,7 +271,8 @@ static int pstore_show_options(struct seq_file *m, struct dentry *root)
 	return 0;
 }
 
-static int pstore_remount(struct super_block *sb, int *flags, char *data)
+static int pstore_remount(struct super_block *sb, int *flags,
+			  char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	parse_options(data);
@@ -432,7 +433,8 @@ void pstore_get_records(int quiet)
 	inode_unlock(d_inode(root));
 }
 
-static int pstore_fill_super(struct super_block *sb, void *data, int silent)
+static int pstore_fill_super(struct super_block *sb,
+			     void *data, size_t data_size, int silent)
 {
 	struct inode *inode;
 
@@ -464,9 +466,9 @@ static int pstore_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *pstore_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, pstore_fill_super);
+	return mount_single(fs_type, flags, data, data_size, pstore_fill_super);
 }
 
 static void pstore_kill_sb(struct super_block *sb)
diff --git a/fs/qnx4/inode.c b/fs/qnx4/inode.c
index 3d46fe302fcb..be35529c8052 100644
--- a/fs/qnx4/inode.c
+++ b/fs/qnx4/inode.c
@@ -29,7 +29,8 @@ static const struct super_operations qnx4_sops;
 
 static struct inode *qnx4_alloc_inode(struct super_block *sb);
 static void qnx4_destroy_inode(struct inode *inode);
-static int qnx4_remount(struct super_block *sb, int *flags, char *data);
+static int qnx4_remount(struct super_block *sb, int *flags,
+			char *data, size_t data_size);
 static int qnx4_statfs(struct dentry *, struct kstatfs *);
 
 static const struct super_operations qnx4_sops =
@@ -40,7 +41,8 @@ static const struct super_operations qnx4_sops =
 	.remount_fs	= qnx4_remount,
 };
 
-static int qnx4_remount(struct super_block *sb, int *flags, char *data)
+static int qnx4_remount(struct super_block *sb, int *flags,
+			char *data, size_t data_size)
 {
 	struct qnx4_sb_info *qs;
 
@@ -183,7 +185,8 @@ static const char *qnx4_checkroot(struct super_block *sb,
 	return "bitmap file not found.";
 }
 
-static int qnx4_fill_super(struct super_block *s, void *data, int silent)
+static int qnx4_fill_super(struct super_block *s, void *data, size_t data_size,
+			   int silent)
 {
 	struct buffer_head *bh;
 	struct inode *root;
@@ -383,9 +386,10 @@ static void destroy_inodecache(void)
 }
 
 static struct dentry *qnx4_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, qnx4_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  qnx4_fill_super);
 }
 
 static struct file_system_type qnx4_fs_type = {
diff --git a/fs/qnx6/inode.c b/fs/qnx6/inode.c
index 4aeb26bcb4d0..a415c1b5f936 100644
--- a/fs/qnx6/inode.c
+++ b/fs/qnx6/inode.c
@@ -30,7 +30,8 @@ static const struct super_operations qnx6_sops;
 static void qnx6_put_super(struct super_block *sb);
 static struct inode *qnx6_alloc_inode(struct super_block *sb);
 static void qnx6_destroy_inode(struct inode *inode);
-static int qnx6_remount(struct super_block *sb, int *flags, char *data);
+static int qnx6_remount(struct super_block *sb, int *flags,
+			char *data, size_t data_size);
 static int qnx6_statfs(struct dentry *dentry, struct kstatfs *buf);
 static int qnx6_show_options(struct seq_file *seq, struct dentry *root);
 
@@ -53,7 +54,8 @@ static int qnx6_show_options(struct seq_file *seq, struct dentry *root)
 	return 0;
 }
 
-static int qnx6_remount(struct super_block *sb, int *flags, char *data)
+static int qnx6_remount(struct super_block *sb, int *flags,
+			char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	*flags |= SB_RDONLY;
@@ -294,7 +296,8 @@ static struct buffer_head *qnx6_check_first_superblock(struct super_block *s,
 static struct inode *qnx6_private_inode(struct super_block *s,
 					struct qnx6_root_node *p);
 
-static int qnx6_fill_super(struct super_block *s, void *data, int silent)
+static int qnx6_fill_super(struct super_block *s, void *data, size_t data_size,
+			   int silent)
 {
 	struct buffer_head *bh1 = NULL, *bh2 = NULL;
 	struct qnx6_super_block *sb1 = NULL, *sb2 = NULL;
@@ -643,9 +646,10 @@ static void destroy_inodecache(void)
 }
 
 static struct dentry *qnx6_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, qnx6_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  qnx6_fill_super);
 }
 
 static struct file_system_type qnx6_fs_type = {
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 11201b2d06b9..2e9b23b4a98b 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -217,7 +217,7 @@ static int ramfs_parse_options(char *data, struct ramfs_mount_opts *opts)
 	return 0;
 }
 
-int ramfs_fill_super(struct super_block *sb, void *data, int silent)
+int ramfs_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
 {
 	struct ramfs_fs_info *fsi;
 	struct inode *inode;
@@ -248,9 +248,9 @@ int ramfs_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 struct dentry *ramfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_nodev(fs_type, flags, data, ramfs_fill_super);
+	return mount_nodev(fs_type, flags, data, data_size, ramfs_fill_super);
 }
 
 static void ramfs_kill_sb(struct super_block *sb)
diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index 1fc934d24459..d8631cb38485 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -61,7 +61,8 @@ static int is_any_reiserfs_magic_string(struct reiserfs_super_block *rs)
 		is_reiserfs_jr(rs));
 }
 
-static int reiserfs_remount(struct super_block *s, int *flags, char *data);
+static int reiserfs_remount(struct super_block *s, int *flags,
+			    char *data, size_t data_size);
 static int reiserfs_statfs(struct dentry *dentry, struct kstatfs *buf);
 
 static int reiserfs_sync_fs(struct super_block *s, int wait)
@@ -1433,7 +1434,8 @@ static void handle_quota_files(struct super_block *s, char **qf_names,
 }
 #endif
 
-static int reiserfs_remount(struct super_block *s, int *mount_flags, char *arg)
+static int reiserfs_remount(struct super_block *s, int *mount_flags,
+			    char *arg, size_t data_size)
 {
 	struct reiserfs_super_block *rs;
 	struct reiserfs_transaction_handle th;
@@ -1898,7 +1900,8 @@ static int function2code(hashf_t func)
 	if (!(silent))				\
 		reiserfs_warning(s, id, __VA_ARGS__)
 
-static int reiserfs_fill_super(struct super_block *s, void *data, int silent)
+static int reiserfs_fill_super(struct super_block *s, void *data, size_t data_size,
+			       int silent)
 {
 	struct inode *root_inode;
 	struct reiserfs_transaction_handle th;
@@ -2600,9 +2603,10 @@ static ssize_t reiserfs_quota_write(struct super_block *sb, int type,
 
 static struct dentry *get_super_block(struct file_system_type *fs_type,
 			   int flags, const char *dev_name,
-			   void *data)
+			   void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, reiserfs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  reiserfs_fill_super);
 }
 
 static int __init init_reiserfs_fs(void)
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 6ccb51993a76..a6a53403a035 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -430,7 +430,8 @@ static int romfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 /*
  * remounting must involve read-only
  */
-static int romfs_remount(struct super_block *sb, int *flags, char *data)
+static int romfs_remount(struct super_block *sb, int *flags,
+			 char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	*flags |= SB_RDONLY;
@@ -464,7 +465,8 @@ static __u32 romfs_checksum(const void *data, int size)
 /*
  * fill in the superblock
  */
-static int romfs_fill_super(struct super_block *sb, void *data, int silent)
+static int romfs_fill_super(struct super_block *sb, void *data, size_t data_size,
+			    int silent)
 {
 	struct romfs_super_block *rsb;
 	struct inode *root;
@@ -557,16 +559,17 @@ static int romfs_fill_super(struct super_block *sb, void *data, int silent)
  */
 static struct dentry *romfs_mount(struct file_system_type *fs_type,
 			int flags, const char *dev_name,
-			void *data)
+			void *data, size_t data_size)
 {
 	struct dentry *ret = ERR_PTR(-EINVAL);
 
 #ifdef CONFIG_ROMFS_ON_MTD
-	ret = mount_mtd(fs_type, flags, dev_name, data, romfs_fill_super);
+	ret = mount_mtd(fs_type, flags, dev_name, data, data_size,
+			romfs_fill_super);
 #endif
 #ifdef CONFIG_ROMFS_ON_BLOCK
 	if (ret == ERR_PTR(-EINVAL))
-		ret = mount_bdev(fs_type, flags, dev_name, data,
+		ret = mount_bdev(fs_type, flags, dev_name, data, data_size,
 				  romfs_fill_super);
 #endif
 	return ret;
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
index 8a73b97217c8..ed6881d97b3c 100644
--- a/fs/squashfs/super.c
+++ b/fs/squashfs/super.c
@@ -76,7 +76,8 @@ static const struct squashfs_decompressor *supported_squashfs_filesystem(short
 }
 
 
-static int squashfs_fill_super(struct super_block *sb, void *data, int silent)
+static int squashfs_fill_super(struct super_block *sb,
+			       void *data, size_t data_size, int silent)
 {
 	struct squashfs_sb_info *msblk;
 	struct squashfs_super_block *sblk = NULL;
@@ -370,7 +371,8 @@ static int squashfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 }
 
 
-static int squashfs_remount(struct super_block *sb, int *flags, char *data)
+static int squashfs_remount(struct super_block *sb, int *flags,
+			    char *data, size_t data_size)
 {
 	sync_filesystem(sb);
 	*flags |= SB_RDONLY;
@@ -398,9 +400,11 @@ static void squashfs_put_super(struct super_block *sb)
 
 
 static struct dentry *squashfs_mount(struct file_system_type *fs_type,
-				int flags, const char *dev_name, void *data)
+				     int flags, const char *dev_name,
+				     void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, squashfs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  squashfs_fill_super);
 }
 
 
diff --git a/fs/super.c b/fs/super.c
index 5132a32e5ebc..c9d208b7999e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -836,11 +836,13 @@ struct super_block *user_get_super(dev_t dev)
  *	@sb:	superblock in question
  *	@sb_flags: revised superblock flags
  *	@data:	the rest of options
+ *	@data_size: The size of the data
  *      @force: whether or not to force the change
  *
  *	Alters the mount options of a mounted file system.
  */
-int do_remount_sb(struct super_block *sb, int sb_flags, void *data, int force)
+int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
+		  size_t data_size, int force)
 {
 	int retval;
 	int remount_ro;
@@ -883,7 +885,7 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data, int force)
 	}
 
 	if (sb->s_op->remount_fs) {
-		retval = sb->s_op->remount_fs(sb, &sb_flags, data);
+		retval = sb->s_op->remount_fs(sb, &sb_flags, data, data_size);
 		if (retval) {
 			if (!force)
 				goto cancel_readonly;
@@ -922,7 +924,7 @@ static void do_emergency_remount_callback(struct super_block *sb)
 		/*
 		 * What lock protects sb->s_flags??
 		 */
-		do_remount_sb(sb, SB_RDONLY, NULL, 1);
+		do_remount_sb(sb, SB_RDONLY, NULL, 0, 1);
 	}
 	up_write(&sb->s_umount);
 }
@@ -1071,8 +1073,9 @@ static int ns_set_super(struct super_block *sb, void *data)
 }
 
 struct dentry *mount_ns(struct file_system_type *fs_type,
-	int flags, void *data, void *ns, struct user_namespace *user_ns,
-	int (*fill_super)(struct super_block *, void *, int))
+	int flags, void *data, size_t data_size,
+	void *ns, struct user_namespace *user_ns,
+	int (*fill_super)(struct super_block *, void *, size_t, int))
 {
 	struct super_block *sb;
 
@@ -1089,7 +1092,7 @@ struct dentry *mount_ns(struct file_system_type *fs_type,
 
 	if (!sb->s_root) {
 		int err;
-		err = fill_super(sb, data, flags & SB_SILENT ? 1 : 0);
+		err = fill_super(sb, data, data_size, flags & SB_SILENT ? 1 : 0);
 		if (err) {
 			deactivate_locked_super(sb);
 			return ERR_PTR(err);
@@ -1119,8 +1122,8 @@ static int test_bdev_super(struct super_block *s, void *data)
 }
 
 struct dentry *mount_bdev(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data,
-	int (*fill_super)(struct super_block *, void *, int))
+	int flags, const char *dev_name, void *data, size_t data_size,
+	int (*fill_super)(struct super_block *, void *, size_t, int))
 {
 	struct block_device *bdev;
 	struct super_block *s;
@@ -1172,7 +1175,7 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
 		s->s_mode = mode;
 		snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
 		sb_set_blocksize(s, block_size(bdev));
-		error = fill_super(s, data, flags & SB_SILENT ? 1 : 0);
+		error = fill_super(s, data, data_size, flags & SB_SILENT ? 1 : 0);
 		if (error) {
 			deactivate_locked_super(s);
 			goto error;
@@ -1209,8 +1212,8 @@ EXPORT_SYMBOL(kill_block_super);
 #endif
 
 struct dentry *mount_nodev(struct file_system_type *fs_type,
-	int flags, void *data,
-	int (*fill_super)(struct super_block *, void *, int))
+	int flags, void *data, size_t data_size,
+	int (*fill_super)(struct super_block *, void *, size_t, int))
 {
 	int error;
 	struct super_block *s = sget(fs_type, NULL, set_anon_super, flags, NULL);
@@ -1218,7 +1221,7 @@ struct dentry *mount_nodev(struct file_system_type *fs_type,
 	if (IS_ERR(s))
 		return ERR_CAST(s);
 
-	error = fill_super(s, data, flags & SB_SILENT ? 1 : 0);
+	error = fill_super(s, data, data_size, flags & SB_SILENT ? 1 : 0);
 	if (error) {
 		deactivate_locked_super(s);
 		return ERR_PTR(error);
@@ -1234,8 +1237,8 @@ static int compare_single(struct super_block *s, void *p)
 }
 
 struct dentry *mount_single(struct file_system_type *fs_type,
-	int flags, void *data,
-	int (*fill_super)(struct super_block *, void *, int))
+	int flags, void *data, size_t data_size,
+	int (*fill_super)(struct super_block *, void *, size_t, int))
 {
 	struct super_block *s;
 	int error;
@@ -1244,21 +1247,22 @@ struct dentry *mount_single(struct file_system_type *fs_type,
 	if (IS_ERR(s))
 		return ERR_CAST(s);
 	if (!s->s_root) {
-		error = fill_super(s, data, flags & SB_SILENT ? 1 : 0);
+		error = fill_super(s, data, data_size, flags & SB_SILENT ? 1 : 0);
 		if (error) {
 			deactivate_locked_super(s);
 			return ERR_PTR(error);
 		}
 		s->s_flags |= SB_ACTIVE;
 	} else {
-		do_remount_sb(s, flags, data, 0);
+		do_remount_sb(s, flags, data, data_size, 0);
 	}
 	return dget(s->s_root);
 }
 EXPORT_SYMBOL(mount_single);
 
 struct dentry *
-mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
+mount_fs(struct file_system_type *type, int flags, const char *name,
+	 void *data, size_t data_size)
 {
 	struct dentry *root;
 	struct super_block *sb;
@@ -1270,12 +1274,12 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
 		if (!secdata)
 			goto out;
 
-		error = security_sb_copy_data(data, secdata);
+		error = security_sb_copy_data(data, data_size, secdata);
 		if (error)
 			goto out_free_secdata;
 	}
 
-	root = type->mount(type, flags, name, data);
+	root = type->mount(type, flags, name, data, data_size);
 	if (IS_ERR(root)) {
 		error = PTR_ERR(root);
 		goto out_free_secdata;
@@ -1293,7 +1297,7 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
 	smp_wmb();
 	sb->s_flags |= SB_BORN;
 
-	error = security_sb_kern_mount(sb, flags, secdata);
+	error = security_sb_kern_mount(sb, flags, secdata, data_size);
 	if (error)
 		goto out_sb;
 
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 92682fcc41f6..77302c35b0ff 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -21,7 +21,7 @@ static struct kernfs_root *sysfs_root;
 struct kernfs_node *sysfs_root_kn;
 
 static struct dentry *sysfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
 	struct dentry *root;
 	void *ns;
diff --git a/fs/sysv/inode.c b/fs/sysv/inode.c
index bec9f79adb25..47f66bbc4578 100644
--- a/fs/sysv/inode.c
+++ b/fs/sysv/inode.c
@@ -57,7 +57,8 @@ static int sysv_sync_fs(struct super_block *sb, int wait)
 	return 0;
 }
 
-static int sysv_remount(struct super_block *sb, int *flags, char *data)
+static int sysv_remount(struct super_block *sb, int *flags,
+			char *data, size_t data_size)
 {
 	struct sysv_sb_info *sbi = SYSV_SB(sb);
 
diff --git a/fs/sysv/super.c b/fs/sysv/super.c
index 89765ddfb738..275c7038eecd 100644
--- a/fs/sysv/super.c
+++ b/fs/sysv/super.c
@@ -349,7 +349,8 @@ static int complete_read_super(struct super_block *sb, int silent, int size)
 	return 1;
 }
 
-static int sysv_fill_super(struct super_block *sb, void *data, int silent)
+static int sysv_fill_super(struct super_block *sb, void *data, size_t data_size,
+			   int silent)
 {
 	struct buffer_head *bh1, *bh = NULL;
 	struct sysv_sb_info *sbi;
@@ -470,7 +471,8 @@ static int v7_sanity_check(struct super_block *sb, struct buffer_head *bh)
 	return 1;
 }
 
-static int v7_fill_super(struct super_block *sb, void *data, int silent)
+static int v7_fill_super(struct super_block *sb, void *data, size_t data_size,
+			 int silent)
 {
 	struct sysv_sb_info *sbi;
 	struct buffer_head *bh;
@@ -528,15 +530,17 @@ static int v7_fill_super(struct super_block *sb, void *data, int silent)
 /* Every kernel module contains stuff like this. */
 
 static struct dentry *sysv_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, sysv_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  sysv_fill_super);
 }
 
 static struct dentry *v7_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, v7_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  v7_fill_super);
 }
 
 static struct file_system_type sysv_fs_type = {
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index bea8ad876bf9..85b3f230e202 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -225,7 +225,8 @@ static int tracefs_apply_options(struct super_block *sb)
 	return 0;
 }
 
-static int tracefs_remount(struct super_block *sb, int *flags, char *data)
+static int tracefs_remount(struct super_block *sb, int *flags,
+			   char *data, size_t data_size)
 {
 	int err;
 	struct tracefs_fs_info *fsi = sb->s_fs_info;
@@ -264,7 +265,8 @@ static const struct super_operations tracefs_super_operations = {
 	.show_options	= tracefs_show_options,
 };
 
-static int trace_fill_super(struct super_block *sb, void *data, int silent)
+static int trace_fill_super(struct super_block *sb,
+			    void *data, size_t data_size, int silent)
 {
 	static const struct tree_descr trace_files[] = {{""}};
 	struct tracefs_fs_info *fsi;
@@ -299,9 +301,9 @@ static int trace_fill_super(struct super_block *sb, void *data, int silent)
 
 static struct dentry *trace_mount(struct file_system_type *fs_type,
 			int flags, const char *dev_name,
-			void *data)
+			void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, trace_fill_super);
+	return mount_single(fs_type, flags, data, data_size, trace_fill_super);
 }
 
 static struct file_system_type trace_fs_type = {
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index c5466c70d620..6a74374d866e 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1843,7 +1843,8 @@ static void ubifs_put_super(struct super_block *sb)
 	mutex_unlock(&c->umount_mutex);
 }
 
-static int ubifs_remount_fs(struct super_block *sb, int *flags, char *data)
+static int ubifs_remount_fs(struct super_block *sb, int *flags,
+			    char *data, size_t data_size)
 {
 	int err;
 	struct ubifs_info *c = sb->s_fs_info;
@@ -2106,7 +2107,7 @@ static int sb_set(struct super_block *sb, void *data)
 }
 
 static struct dentry *ubifs_mount(struct file_system_type *fs_type, int flags,
-			const char *name, void *data)
+			const char *name, void *data, size_t data_size)
 {
 	struct ubi_volume_desc *ubi;
 	struct ubifs_info *c;
diff --git a/fs/udf/super.c b/fs/udf/super.c
index 0c504c8031d3..9082f45bc46c 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -87,10 +87,10 @@ enum {
 enum { UDF_MAX_LINKS = 0xffff };
 
 /* These are the "meat" - everything else is stuffing */
-static int udf_fill_super(struct super_block *, void *, int);
+static int udf_fill_super(struct super_block *, void *, size_t, int);
 static void udf_put_super(struct super_block *);
 static int udf_sync_fs(struct super_block *, int);
-static int udf_remount_fs(struct super_block *, int *, char *);
+static int udf_remount_fs(struct super_block *, int *, char *, size_t);
 static void udf_load_logicalvolint(struct super_block *, struct kernel_extent_ad);
 static int udf_find_fileset(struct super_block *, struct kernel_lb_addr *,
 			    struct kernel_lb_addr *);
@@ -126,9 +126,11 @@ struct logicalVolIntegrityDescImpUse *udf_sb_lvidiu(struct super_block *sb)
 
 /* UDF filesystem type */
 static struct dentry *udf_mount(struct file_system_type *fs_type,
-		      int flags, const char *dev_name, void *data)
+				int flags, const char *dev_name,
+				void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, udf_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  udf_fill_super);
 }
 
 static struct file_system_type udf_fstype = {
@@ -608,7 +610,8 @@ static int udf_parse_options(char *options, struct udf_options *uopt,
 	return 1;
 }
 
-static int udf_remount_fs(struct super_block *sb, int *flags, char *options)
+static int udf_remount_fs(struct super_block *sb, int *flags,
+			  char *options, size_t data_size)
 {
 	struct udf_options uopt;
 	struct udf_sb_info *sbi = UDF_SB(sb);
@@ -2085,7 +2088,8 @@ u64 lvid_get_unique_id(struct super_block *sb)
 	return ret;
 }
 
-static int udf_fill_super(struct super_block *sb, void *options, int silent)
+static int udf_fill_super(struct super_block *sb,
+			  void *options, size_t data_size, int silent)
 {
 	int ret = -EINVAL;
 	struct inode *inode = NULL;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 488088141451..96a20a76e3c4 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -774,7 +774,8 @@ static u64 ufs_max_bytes(struct super_block *sb)
 	return res << uspi->s_bshift;
 }
 
-static int ufs_fill_super(struct super_block *sb, void *data, int silent)
+static int ufs_fill_super(struct super_block *sb, void *data, size_t data_size,
+			  int silent)
 {
 	struct ufs_sb_info * sbi;
 	struct ufs_sb_private_info * uspi;
@@ -1297,7 +1298,8 @@ static int ufs_fill_super(struct super_block *sb, void *data, int silent)
 	return -ENOMEM;
 }
 
-static int ufs_remount (struct super_block *sb, int *mount_flags, char *data)
+static int ufs_remount (struct super_block *sb, int *mount_flags,
+			char *data, size_t data_size)
 {
 	struct ufs_sb_private_info * uspi;
 	struct ufs_super_block_first * usb1;
@@ -1505,9 +1507,10 @@ static const struct super_operations ufs_super_ops = {
 };
 
 static struct dentry *ufs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, ufs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  ufs_fill_super);
 }
 
 static struct file_system_type ufs_fs_type = {
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9d791f158dfe..b6776d8a644c 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1256,7 +1256,8 @@ STATIC int
 xfs_fs_remount(
 	struct super_block	*sb,
 	int			*flags,
-	char			*options)
+	char			*options,
+	size_t			data_size)
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 	xfs_sb_t		*sbp = &mp->m_sb;
@@ -1595,6 +1596,7 @@ STATIC int
 xfs_fs_fill_super(
 	struct super_block	*sb,
 	void			*data,
+	size_t			data_size,
 	int			silent)
 {
 	struct inode		*root;
@@ -1808,9 +1810,11 @@ xfs_fs_mount(
 	struct file_system_type	*fs_type,
 	int			flags,
 	const char		*dev_name,
-	void			*data)
+	void			*data,
+	size_t			data_size)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, xfs_fs_fill_super);
+	return mount_bdev(fs_type, flags, dev_name, data, data_size,
+			  xfs_fs_fill_super);
 }
 
 static long
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index 3b0ba54cc4d5..a02de1b397ca 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -75,11 +75,11 @@ struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
 struct dentry *debugfs_create_symlink(const char *name, struct dentry *parent,
 				      const char *dest);
 
-typedef struct vfsmount *(*debugfs_automount_t)(struct dentry *, void *);
+typedef struct vfsmount *(*debugfs_automount_t)(struct dentry *, void *, size_t);
 struct dentry *debugfs_create_automount(const char *name,
 					struct dentry *parent,
 					debugfs_automount_t f,
-					void *data);
+					void *data, size_t data_size);
 
 void debugfs_remove(struct dentry *dentry);
 void debugfs_remove_recursive(struct dentry *dentry);
@@ -204,8 +204,8 @@ static inline struct dentry *debugfs_create_symlink(const char *name,
 
 static inline struct dentry *debugfs_create_automount(const char *name,
 					struct dentry *parent,
-					struct vfsmount *(*f)(void *),
-					void *data)
+					struct vfsmount *(*f)(void *, size_t),
+					void *data, size_t data_size)
 {
 	return ERR_PTR(-ENODEV);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 00e255c195f2..067f0e31aec7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1841,7 +1841,7 @@ struct super_operations {
 	int (*thaw_super) (struct super_block *);
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
-	int (*remount_fs) (struct super_block *, int *, char *);
+	int (*remount_fs) (struct super_block *, int *, char *, size_t);
 	void (*umount_begin) (struct super_block *);
 
 	int (*show_options)(struct seq_file *, struct dentry *);
@@ -2099,7 +2099,7 @@ struct file_system_type {
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
-		       const char *, void *);
+				 const char *, void *, size_t);
 	void (*kill_sb) (struct super_block *);
 	struct module *owner;
 	struct file_system_type * next;
@@ -2118,26 +2118,27 @@ struct file_system_type {
 #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)
 
 extern struct dentry *mount_ns(struct file_system_type *fs_type,
-	int flags, void *data, void *ns, struct user_namespace *user_ns,
-	int (*fill_super)(struct super_block *, void *, int));
+	int flags, void *data, size_t data_size,
+	void *ns, struct user_namespace *user_ns,
+	int (*fill_super)(struct super_block *, void *, size_t, int));
 #ifdef CONFIG_BLOCK
 extern struct dentry *mount_bdev(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data,
-	int (*fill_super)(struct super_block *, void *, int));
+	int flags, const char *dev_name, void *data, size_t data_size,
+	int (*fill_super)(struct super_block *, void *, size_t, int));
 #else
 static inline struct dentry *mount_bdev(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data,
-	int (*fill_super)(struct super_block *, void *, int))
+	int flags, const char *dev_name, void *data, size_t data_size,
+	int (*fill_super)(struct super_block *, void *, size_t, int))
 {
 	return ERR_PTR(-ENODEV);
 }
 #endif
 extern struct dentry *mount_single(struct file_system_type *fs_type,
-	int flags, void *data,
-	int (*fill_super)(struct super_block *, void *, int));
+	int flags, void *data, size_t data_size,
+	int (*fill_super)(struct super_block *, void *, size_t, int));
 extern struct dentry *mount_nodev(struct file_system_type *fs_type,
-	int flags, void *data,
-	int (*fill_super)(struct super_block *, void *, int));
+	int flags, void *data, size_t data_size,
+	int (*fill_super)(struct super_block *, void *, size_t, int));
 extern struct dentry *mount_subtree(struct vfsmount *mnt, const char *path);
 void generic_shutdown_super(struct super_block *sb);
 #ifdef CONFIG_BLOCK
@@ -2197,8 +2198,8 @@ mount_pseudo(struct file_system_type *fs_type, char *name,
 
 extern int register_filesystem(struct file_system_type *);
 extern int unregister_filesystem(struct file_system_type *);
-extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
-#define kern_mount(type) kern_mount_data(type, NULL)
+extern struct vfsmount *kern_mount_data(struct file_system_type *, void *, size_t);
+#define kern_mount(type) kern_mount_data(type, NULL, 0)
 extern void kern_unmount(struct vfsmount *mnt);
 extern int may_umount_tree(struct vfsmount *);
 extern int may_umount(struct vfsmount *);
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 43ca087b6454..7ff5a980399a 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -154,6 +154,7 @@
  *	@type contains the filesystem type.
  *	@flags contains the mount flags.
  *	@data contains the filesystem-specific data.
+ *	@data_size contains the size of the data.
  *	Return 0 if permission is granted.
  * @sb_copy_data:
  *	Allow mount option data to be copied prior to parsing by the filesystem,
@@ -163,6 +164,7 @@
  *	specific options to avoid having to make filesystems aware of them.
  *	@type the type of filesystem being mounted.
  *	@orig the original mount data copied from userspace.
+ *	@orig_size is the size of the original data.
  *	@copy copied data which will be passed to the security module.
  *	Returns 0 if the copy was successful.
  * @sb_remount:
@@ -170,6 +172,7 @@
  *	are being made to those options.
  *	@sb superblock being remounted
  *	@data contains the filesystem-specific data.
+ *	@data_size contains the size of the data.
  *	Return 0 if permission is granted.
  * @sb_umount:
  *	Check permission before the @mnt file system is unmounted.
@@ -1522,13 +1525,15 @@ union security_list_options {
 
 	int (*sb_alloc_security)(struct super_block *sb);
 	void (*sb_free_security)(struct super_block *sb);
-	int (*sb_copy_data)(char *orig, char *copy);
-	int (*sb_remount)(struct super_block *sb, void *data);
-	int (*sb_kern_mount)(struct super_block *sb, int flags, void *data);
+	int (*sb_copy_data)(char *orig, size_t orig_size, char *copy);
+	int (*sb_remount)(struct super_block *sb, void *data, size_t data_size);
+	int (*sb_kern_mount)(struct super_block *sb, int flags,
+			     void *data, size_t data_size);
 	int (*sb_show_options)(struct seq_file *m, struct super_block *sb);
 	int (*sb_statfs)(struct dentry *dentry);
 	int (*sb_mount)(const char *dev_name, const struct path *path,
-			const char *type, unsigned long flags, void *data);
+			const char *type, unsigned long flags,
+			void *data, size_t data_size);
 	int (*sb_umount)(struct vfsmount *mnt, int flags);
 	int (*sb_pivotroot)(const struct path *old_path, const struct path *new_path);
 	int (*sb_set_mnt_opts)(struct super_block *sb,
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 45b1f56c6c2f..8a1031a511c9 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -90,10 +90,11 @@ extern struct vfsmount *clone_private_mount(const struct path *path);
 struct file_system_type;
 extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
 				      int flags, const char *name,
-				      void *data);
+				      void *data, size_t data_size);
 extern struct vfsmount *vfs_submount(const struct dentry *mountpoint,
 				     struct file_system_type *type,
-				     const char *name, void *data);
+				     const char *name,
+				     void *data, size_t data_size);
 
 extern void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list);
 extern void mark_mounts_for_expiry(struct list_head *mounts);
diff --git a/include/linux/mtd/super.h b/include/linux/mtd/super.h
index f456230f9330..3f37c7cd711c 100644
--- a/include/linux/mtd/super.h
+++ b/include/linux/mtd/super.h
@@ -19,8 +19,8 @@
 #include <linux/mount.h>
 
 extern struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
-		      const char *dev_name, void *data,
-		      int (*fill_super)(struct super_block *, void *, int));
+		      const char *dev_name, void *data, size_t data_size,
+		      int (*fill_super)(struct super_block *, void *, size_t, int));
 extern void kill_mtd_super(struct super_block *sb);
 
 
diff --git a/include/linux/ramfs.h b/include/linux/ramfs.h
index 5ef7d54caac2..6d64e6be9928 100644
--- a/include/linux/ramfs.h
+++ b/include/linux/ramfs.h
@@ -5,7 +5,7 @@
 struct inode *ramfs_get_inode(struct super_block *sb, const struct inode *dir,
 	 umode_t mode, dev_t dev);
 extern struct dentry *ramfs_mount(struct file_system_type *fs_type,
-	 int flags, const char *dev_name, void *data);
+	 int flags, const char *dev_name, void *data, size_t data_size);
 
 #ifdef CONFIG_MMU
 static inline int
@@ -21,6 +21,6 @@ extern const struct file_operations ramfs_file_operations;
 extern const struct vm_operations_struct generic_file_vm_ops;
 extern int __init init_ramfs_fs(void);
 
-int ramfs_fill_super(struct super_block *sb, void *data, int silent);
+int ramfs_fill_super(struct super_block *sb, void *data, size_t data_size, int silent);
 
 #endif
diff --git a/include/linux/security.h b/include/linux/security.h
index 7f093b27169d..93964808da59 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -238,13 +238,13 @@ int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
 			   unsigned int mnt_flags);
 int security_sb_alloc(struct super_block *sb);
 void security_sb_free(struct super_block *sb);
-int security_sb_copy_data(char *orig, char *copy);
-int security_sb_remount(struct super_block *sb, void *data);
-int security_sb_kern_mount(struct super_block *sb, int flags, void *data);
+int security_sb_copy_data(char *orig, size_t orig_size, char *copy);
+int security_sb_remount(struct super_block *sb, void *data, size_t data_size);
+int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size);
 int security_sb_show_options(struct seq_file *m, struct super_block *sb);
 int security_sb_statfs(struct dentry *dentry);
 int security_sb_mount(const char *dev_name, const struct path *path,
-		      const char *type, unsigned long flags, void *data);
+		      const char *type, unsigned long flags, void *data, size_t data_size);
 int security_sb_umount(struct vfsmount *mnt, int flags);
 int security_sb_pivotroot(const struct path *old_path, const struct path *new_path);
 int security_sb_set_mnt_opts(struct super_block *sb,
@@ -583,17 +583,18 @@ static inline int security_sb_alloc(struct super_block *sb)
 static inline void security_sb_free(struct super_block *sb)
 { }
 
-static inline int security_sb_copy_data(char *orig, char *copy)
+static inline int security_sb_copy_data(char *orig, size_t orig_size, char *copy)
 {
 	return 0;
 }
 
-static inline int security_sb_remount(struct super_block *sb, void *data)
+static inline int security_sb_remount(struct super_block *sb, void *data, size_t data_size)
 {
 	return 0;
 }
 
-static inline int security_sb_kern_mount(struct super_block *sb, int flags, void *data)
+static inline int security_sb_kern_mount(struct super_block *sb, int flags,
+					 void *data, size_t data_size)
 {
 	return 0;
 }
@@ -611,7 +612,7 @@ static inline int security_sb_statfs(struct dentry *dentry)
 
 static inline int security_sb_mount(const char *dev_name, const struct path *path,
 				    const char *type, unsigned long flags,
-				    void *data)
+				    void *data, size_t data_size)
 {
 	return 0;
 }
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index f155dc607112..66772728cb74 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -49,7 +49,8 @@ static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
  * Functions in mm/shmem.c called directly from elsewhere:
  */
 extern int shmem_init(void);
-extern int shmem_fill_super(struct super_block *sb, void *data, int silent);
+extern int shmem_fill_super(struct super_block *sb, void *data, size_t data_size,
+			    int silent);
 extern struct file *shmem_file_setup(const char *name,
 					loff_t size, unsigned long flags);
 extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
diff --git a/init/do_mounts.c b/init/do_mounts.c
index ea6f21bb9440..d4fc2a5afdb6 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -606,7 +606,7 @@ void __init prepare_namespace(void)
 
 static bool is_tmpfs;
 static struct dentry *rootfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
 	static unsigned long once;
 	void *fill = ramfs_fill_super;
@@ -617,7 +617,7 @@ static struct dentry *rootfs_mount(struct file_system_type *fs_type,
 	if (IS_ENABLED(CONFIG_TMPFS) && is_tmpfs)
 		fill = shmem_fill_super;
 
-	return mount_nodev(fs_type, flags, data, fill);
+	return mount_nodev(fs_type, flags, data, data_size, fill);
 }
 
 static struct file_system_type rootfs_fs_type = {
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c0d58f390c3b..4671d215cb84 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -322,7 +322,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
 	return ERR_PTR(ret);
 }
 
-static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
+static int mqueue_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
 {
 	struct inode *inode;
 	struct ipc_namespace *ns = sb->s_fs_info;
@@ -345,7 +345,7 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
 
 static struct dentry *mqueue_mount(struct file_system_type *fs_type,
 			 int flags, const char *dev_name,
-			 void *data)
+			 void *data, size_t data_size)
 {
 	struct ipc_namespace *ns;
 	if (flags & SB_KERNMOUNT) {
@@ -354,7 +354,8 @@ static struct dentry *mqueue_mount(struct file_system_type *fs_type,
 	} else {
 		ns = current->nsproxy->ipc_ns;
 	}
-	return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super);
+	return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
+			mqueue_fill_super);
 }
 
 static void init_once(void *foo)
@@ -1538,7 +1539,7 @@ int mq_init_ns(struct ipc_namespace *ns)
 	ns->mq_msg_default   = DFLT_MSG;
 	ns->mq_msgsize_default  = DFLT_MSGSIZE;
 
-	ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns);
+	ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns, 0);
 	if (IS_ERR(ns->mq_mnt)) {
 		int err = PTR_ERR(ns->mq_mnt);
 		ns->mq_mnt = NULL;
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 76efe9a183f5..e7df1c15ac96 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -626,7 +626,7 @@ static int bpf_parse_options(char *data, struct bpf_mount_opts *opts)
 	return 0;
 }
 
-static int bpf_fill_super(struct super_block *sb, void *data, int silent)
+static int bpf_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
 {
 	static const struct tree_descr bpf_rfiles[] = { { "" } };
 	struct bpf_mount_opts opts;
@@ -652,9 +652,10 @@ static int bpf_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *bpf_mount(struct file_system_type *type, int flags,
-				const char *dev_name, void *data)
+				const char *dev_name, void *data,
+				size_t data_size)
 {
-	return mount_nodev(type, flags, data, bpf_fill_super);
+	return mount_nodev(type, flags, data, data_size, bpf_fill_super);
 }
 
 static struct file_system_type bpf_fs_type = {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 077370bf8964..ddb1a60ae3c0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2028,7 +2028,7 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
 
 static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 			 int flags, const char *unused_dev_name,
-			 void *data)
+			 void *data, size_t data_size)
 {
 	struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
 	struct dentry *dentry;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 266f10cb7222..6d9f1a709af9 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -316,7 +316,8 @@ static inline bool is_in_v2_mode(void)
  * silently switch it to mount "cgroup" instead
  */
 static struct dentry *cpuset_mount(struct file_system_type *fs_type,
-			 int flags, const char *unused_dev_name, void *data)
+				   int flags, const char *unused_dev_name,
+				   void *data, size_t data_size)
 {
 	struct file_system_type *cgroup_fs = get_fs_type("cgroup");
 	struct dentry *ret = ERR_PTR(-ENODEV);
@@ -324,8 +325,8 @@ static struct dentry *cpuset_mount(struct file_system_type *fs_type,
 		char mountopts[] =
 			"cpuset,noprefix,"
 			"release_agent=/sbin/cpuset_release_agent";
-		ret = cgroup_fs->mount(cgroup_fs, flags,
-					   unused_dev_name, mountopts);
+		ret = cgroup_fs->mount(cgroup_fs, flags, unused_dev_name,
+				       mountopts, data_size);
 		put_filesystem(cgroup_fs);
 	}
 	return ret;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c9336e98ac59..77b429c7f584 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7975,7 +7975,8 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 	ftrace_init_tracefs(tr, d_tracer);
 }
 
-static struct vfsmount *trace_automount(struct dentry *mntpt, void *ingore)
+static struct vfsmount *trace_automount(struct dentry *mntpt,
+					void *data, size_t data_size)
 {
 	struct vfsmount *mnt;
 	struct file_system_type *type;
@@ -7988,7 +7989,7 @@ static struct vfsmount *trace_automount(struct dentry *mntpt, void *ingore)
 	type = get_fs_type("tracefs");
 	if (!type)
 		return NULL;
-	mnt = vfs_submount(mntpt, type, "tracefs", NULL);
+	mnt = vfs_submount(mntpt, type, "tracefs", NULL, 0);
 	put_filesystem(type);
 	if (IS_ERR(mnt))
 		return NULL;
@@ -8024,7 +8025,7 @@ struct dentry *tracing_init_dentry(void)
 	 * work with the newer kerenl.
 	 */
 	tr->dir = debugfs_create_automount("tracing", NULL,
-					   trace_automount, NULL);
+					   trace_automount, NULL, 0);
 	if (!tr->dir) {
 		pr_warn_once("Could not create debugfs directory 'tracing'\n");
 		return ERR_PTR(-ENOMEM);
diff --git a/mm/shmem.c b/mm/shmem.c
index 2cab84403055..bd68f452152d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3392,7 +3392,8 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
 
 }
 
-static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
+static int shmem_remount_fs(struct super_block *sb, int *flags,
+			    char *data, size_t data_size)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
 	struct shmem_sb_info config = *sbinfo;
@@ -3475,7 +3476,8 @@ static void shmem_put_super(struct super_block *sb)
 	sb->s_fs_info = NULL;
 }
 
-int shmem_fill_super(struct super_block *sb, void *data, int silent)
+int shmem_fill_super(struct super_block *sb, void *data, size_t data_size,
+		     int silent)
 {
 	struct inode *inode;
 	struct shmem_sb_info *sbinfo;
@@ -3689,9 +3691,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
 };
 
 static struct dentry *shmem_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+	int flags, const char *dev_name, void *data, size_t data_size)
 {
-	return mount_nodev(fs_type, flags, data, shmem_fill_super);
+	return mount_nodev(fs_type, flags, data, data_size, shmem_fill_super);
 }
 
 static struct file_system_type shmem_fs_type = {
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 8d87e973a4f5..a1a9debb6fc8 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1815,7 +1815,8 @@ static enum fullness_group putback_zspage(struct size_class *class,
 
 #ifdef CONFIG_COMPACTION
 static struct dentry *zs_mount(struct file_system_type *fs_type,
-				int flags, const char *dev_name, void *data)
+			       int flags, const char *dev_name,
+			       void *data, size_t data_size)
 {
 	static const struct dentry_operations ops = {
 		.d_dname = simple_dname,
diff --git a/net/socket.c b/net/socket.c
index 8a109012608a..930def0b428b 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -357,7 +357,8 @@ static const struct xattr_handler *sockfs_xattr_handlers[] = {
 };
 
 static struct dentry *sockfs_mount(struct file_system_type *fs_type,
-			 int flags, const char *dev_name, void *data)
+				   int flags, const char *dev_name,
+				   void *data, size_t data_size)
 {
 	return mount_pseudo_xattr(fs_type, "socket:", &sockfs_ops,
 				  sockfs_xattr_handlers,
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index 4fda18d47e2c..023c2a6389e7 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -1367,7 +1367,7 @@ rpc_gssd_dummy_depopulate(struct dentry *pipe_dentry)
 }
 
 static int
-rpc_fill_super(struct super_block *sb, void *data, int silent)
+rpc_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
 {
 	struct inode *inode;
 	struct dentry *root, *gssd_dentry;
@@ -1430,10 +1430,11 @@ EXPORT_SYMBOL_GPL(gssd_running);
 
 static struct dentry *
 rpc_mount(struct file_system_type *fs_type,
-		int flags, const char *dev_name, void *data)
+	  int flags, const char *dev_name, void *data, size_t data_size)
 {
 	struct net *net = current->nsproxy->net_ns;
-	return mount_ns(fs_type, flags, data, net, net->user_ns, rpc_fill_super);
+	return mount_ns(fs_type, flags, data, data_size,
+			net, net->user_ns, rpc_fill_super);
 }
 
 static void rpc_kill_sb(struct super_block *sb)
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 949dd8a48164..04548c8102f3 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -137,7 +137,8 @@ static const struct super_operations aafs_super_ops = {
 	.show_path = aafs_show_path,
 };
 
-static int fill_super(struct super_block *sb, void *data, int silent)
+static int fill_super(struct super_block *sb, void *data, size_t data_size,
+		      int silent)
 {
 	static struct tree_descr files[] = { {""} };
 	int error;
@@ -151,9 +152,10 @@ static int fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *aafs_mount(struct file_system_type *fs_type,
-				 int flags, const char *dev_name, void *data)
+				 int flags, const char *dev_name, void *data,
+				 size_t data_size)
 {
-	return mount_single(fs_type, flags, data, fill_super);
+	return mount_single(fs_type, flags, data, data_size, fill_super);
 }
 
 static struct file_system_type aafs_ops = {
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 29803dc604f8..9a5915dffbdc 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -593,7 +593,8 @@ static int apparmor_sb_mountpoint(struct fs_context *fc, struct path *mountpoint
 }
 
 static int apparmor_sb_mount(const char *dev_name, const struct path *path,
-			     const char *type, unsigned long flags, void *data)
+			     const char *type, unsigned long flags,
+			     void *data, size_t data_size)
 {
 	struct aa_label *label;
 	int error = 0;
diff --git a/security/inode.c b/security/inode.c
index 8dd9ca8848e4..a89a00714f33 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -39,7 +39,8 @@ static const struct super_operations securityfs_super_operations = {
 	.evict_inode	= securityfs_evict_inode,
 };
 
-static int fill_super(struct super_block *sb, void *data, int silent)
+static int fill_super(struct super_block *sb, void *data, size_t data_size,
+		      int silent)
 {
 	static const struct tree_descr files[] = {{""}};
 	int error;
@@ -55,9 +56,9 @@ static int fill_super(struct super_block *sb, void *data, int silent)
 
 static struct dentry *get_sb(struct file_system_type *fs_type,
 		  int flags, const char *dev_name,
-		  void *data)
+		  void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, fill_super);
+	return mount_single(fs_type, flags, data, data_size, fill_super);
 }
 
 static struct file_system_type fs_type = {
diff --git a/security/security.c b/security/security.c
index 597470fd3727..27a5fb308d20 100644
--- a/security/security.c
+++ b/security/security.c
@@ -414,20 +414,20 @@ void security_sb_free(struct super_block *sb)
 	call_void_hook(sb_free_security, sb);
 }
 
-int security_sb_copy_data(char *orig, char *copy)
+int security_sb_copy_data(char *orig, size_t data_size, char *copy)
 {
-	return call_int_hook(sb_copy_data, 0, orig, copy);
+	return call_int_hook(sb_copy_data, 0, orig, data_size, copy);
 }
 EXPORT_SYMBOL(security_sb_copy_data);
 
-int security_sb_remount(struct super_block *sb, void *data)
+int security_sb_remount(struct super_block *sb, void *data, size_t data_size)
 {
-	return call_int_hook(sb_remount, 0, sb, data);
+	return call_int_hook(sb_remount, 0, sb, data, data_size);
 }
 
-int security_sb_kern_mount(struct super_block *sb, int flags, void *data)
+int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
 {
-	return call_int_hook(sb_kern_mount, 0, sb, flags, data);
+	return call_int_hook(sb_kern_mount, 0, sb, flags, data, data_size);
 }
 
 int security_sb_show_options(struct seq_file *m, struct super_block *sb)
@@ -441,9 +441,11 @@ int security_sb_statfs(struct dentry *dentry)
 }
 
 int security_sb_mount(const char *dev_name, const struct path *path,
-                       const char *type, unsigned long flags, void *data)
+		      const char *type, unsigned long flags,
+		      void *data, size_t data_size)
 {
-	return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
+	return call_int_hook(sb_mount, 0, dev_name, path, type, flags,
+			     data, data_size);
 }
 
 int security_sb_umount(struct vfsmount *mnt, int flags)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index bdecae4b7306..189f5284fc3f 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2795,7 +2795,7 @@ static inline void take_selinux_option(char **to, char *from, int *first,
 	}
 }
 
-static int selinux_sb_copy_data(char *orig, char *copy)
+static int selinux_sb_copy_data(char *orig, size_t data_size, char *copy)
 {
 	int fnosec, fsec, rc = 0;
 	char *in_save, *in_curr, *in_end;
@@ -2837,7 +2837,7 @@ static int selinux_sb_copy_data(char *orig, char *copy)
 	return rc;
 }
 
-static int selinux_sb_remount(struct super_block *sb, void *data)
+static int selinux_sb_remount(struct super_block *sb, void *data, size_t data_size)
 {
 	int rc, i, *flags;
 	struct security_mnt_opts opts;
@@ -2857,7 +2857,7 @@ static int selinux_sb_remount(struct super_block *sb, void *data)
 	secdata = alloc_secdata();
 	if (!secdata)
 		return -ENOMEM;
-	rc = selinux_sb_copy_data(data, secdata);
+	rc = selinux_sb_copy_data(data, data_size, secdata);
 	if (rc)
 		goto out_free_secdata;
 
@@ -2922,7 +2922,7 @@ static int selinux_sb_remount(struct super_block *sb, void *data)
 	goto out_free_opts;
 }
 
-static int selinux_sb_kern_mount(struct super_block *sb, int flags, void *data)
+static int selinux_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
 {
 	const struct cred *cred = current_cred();
 	struct common_audit_data ad;
@@ -2955,7 +2955,8 @@ static int selinux_mount(const char *dev_name,
 			 const struct path *path,
 			 const char *type,
 			 unsigned long flags,
-			 void *data)
+			 void *data,
+			 size_t data_size)
 {
 	const struct cred *cred = current_cred();
 
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index f3d374d2ca04..71834dd5a70f 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1890,7 +1890,8 @@ static struct dentry *sel_make_dir(struct dentry *dir, const char *name,
 
 #define NULL_FILE_NAME "null"
 
-static int sel_fill_super(struct super_block *sb, void *data, int silent)
+static int sel_fill_super(struct super_block *sb, void *data, size_t data_size,
+			  int silent)
 {
 	struct selinux_fs_info *fsi;
 	int ret;
@@ -2005,9 +2006,10 @@ static int sel_fill_super(struct super_block *sb, void *data, int silent)
 }
 
 static struct dentry *sel_mount(struct file_system_type *fs_type,
-		      int flags, const char *dev_name, void *data)
+				int flags, const char *dev_name,
+				void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, sel_fill_super);
+	return mount_single(fs_type, flags, data, data_size, sel_fill_super);
 }
 
 static void sel_kill_sb(struct super_block *sb)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 39780b06469b..fb55a16a484c 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -869,6 +869,7 @@ static void smack_sb_free_security(struct super_block *sb)
 /**
  * smack_sb_copy_data - copy mount options data for processing
  * @orig: where to start
+ * @orig_size: Size of orig buffer
  * @smackopts: mount options string
  *
  * Returns 0 on success or -ENOMEM on error.
@@ -876,7 +877,7 @@ static void smack_sb_free_security(struct super_block *sb)
  * Copy the Smack specific mount options out of the mount
  * options list.
  */
-static int smack_sb_copy_data(char *orig, char *smackopts)
+static int smack_sb_copy_data(char *orig, size_t orig_size, char *smackopts)
 {
 	char *cp, *commap, *otheropts, *dp;
 
@@ -1157,7 +1158,8 @@ static int smack_set_mnt_opts(struct super_block *sb,
  *
  * Returns 0 on success, an error code on failure
  */
-static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
+static int smack_sb_kern_mount(struct super_block *sb, int flags,
+			       void *data, size_t data_size)
 {
 	int rc = 0;
 	char *options = data;
diff --git a/security/smack/smackfs.c b/security/smack/smackfs.c
index f6482e53d55a..f4e91c5d6c2c 100644
--- a/security/smack/smackfs.c
+++ b/security/smack/smackfs.c
@@ -2844,13 +2844,15 @@ static const struct file_operations smk_ptrace_ops = {
  * smk_fill_super - fill the smackfs superblock
  * @sb: the empty superblock
  * @data: unused
+ * @data_size: size of data buffer
  * @silent: unused
  *
  * Fill in the well known entries for the smack filesystem
  *
  * Returns 0 on success, an error code on failure
  */
-static int smk_fill_super(struct super_block *sb, void *data, int silent)
+static int smk_fill_super(struct super_block *sb, void *data, size_t data_size,
+			  int silent)
 {
 	int rc;
 	struct inode *root_inode;
@@ -2934,9 +2936,10 @@ static int smk_fill_super(struct super_block *sb, void *data, int silent)
  * Returns what the lower level code does.
  */
 static struct dentry *smk_mount(struct file_system_type *fs_type,
-		      int flags, const char *dev_name, void *data)
+				int flags, const char *dev_name,
+				void *data, size_t data_size)
 {
-	return mount_single(fs_type, flags, data, smk_fill_super);
+	return mount_single(fs_type, flags, data, data_size, smk_fill_super);
 }
 
 static struct file_system_type smk_fs_type = {
diff --git a/security/tomoyo/tomoyo.c b/security/tomoyo/tomoyo.c
index 31fd6bd4f657..c3a0ae4fa7ce 100644
--- a/security/tomoyo/tomoyo.c
+++ b/security/tomoyo/tomoyo.c
@@ -413,11 +413,13 @@ static int tomoyo_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
  * @type:     Name of filesystem type. Maybe NULL.
  * @flags:    Mount options.
  * @data:     Optional data. Maybe NULL.
+ * @data_size: Size of data.
  *
  * Returns 0 on success, negative value otherwise.
  */
 static int tomoyo_sb_mount(const char *dev_name, const struct path *path,
-			   const char *type, unsigned long flags, void *data)
+			   const char *type, unsigned long flags,
+			   void *data, size_t data_size)
 {
 	return tomoyo_mount_permission(dev_name, path, type, flags, data);
 }



* [PATCH 12/32] vfs: Separate changing mount flags full remount [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (10 preceding siblings ...)
  2018-07-10 22:42 ` [PATCH 11/32] vfs: Require specification of size of mount data for internal mounts " David Howells
@ 2018-07-10 22:42 ` David Howells
  2018-07-10 22:42 ` [PATCH 13/32] vfs: Implement a filesystem superblock creation/configuration context " David Howells
                   ` (25 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:42 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel, Eric Biggers

Separate just the changing of mount flags (MS_REMOUNT|MS_BIND) from a full
remount.  With the new fs_context code, the mount data is parsed before a
remount is done, and this causes the syscall to fail under some
circumstances.

To quote Eric's explanation:

  [...] mount(..., MS_REMOUNT|MS_BIND, ...) now validates the mount options
  string, which breaks systemd unit files with ProtectControlGroups=yes
  (e.g.  systemd-networkd.service) when systemd does the following to
  change a cgroup (v1) mount to read-only:

    mount(NULL, "/run/systemd/unit-root/sys/fs/cgroup/systemd", NULL,
	  MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL)

  ... when the kernel has CONFIG_CGROUPS=y but no cgroup subsystems
  enabled, since in that case the error "cgroup1: Need name or subsystem
  set" is hit when the mount options string is empty.

  Probably it doesn't make sense to validate the mount options string at
  all in the MS_REMOUNT|MS_BIND case, though maybe you had something else
  in mind.

This is also worthwhile doing because we will need to add a mount_setattr()
syscall to take over the remount-bind function.
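
As a rough illustration of the split, the following userspace sketch models
how the mount flags select a path.  It is not the kernel code: the SK_ names
and the enum are hypothetical, and only the numeric flag values are taken
from the mount(2) UAPI.

```c
#include <assert.h>

/* Numeric values match the MS_* flags in the mount(2) UAPI headers. */
#define SK_MS_RDONLY	1UL
#define SK_MS_NOSUID	2UL
#define SK_MS_NODEV	4UL
#define SK_MS_NOEXEC	8UL
#define SK_MS_REMOUNT	32UL
#define SK_MS_BIND	4096UL

/* Hypothetical labels for the paths; these are not kernel identifiers. */
enum sk_mount_path {
	SK_RECONFIGURE_MNT,	/* MS_REMOUNT|MS_BIND: per-mount flags only */
	SK_REMOUNT_SB,		/* MS_REMOUNT alone: superblock reconfiguration */
	SK_OTHER,		/* bind, move, new mount and so on */
};

/* The combined test must come first, or plain MS_REMOUNT would win. */
static enum sk_mount_path sk_classify(unsigned long flags)
{
	const unsigned long both = SK_MS_REMOUNT | SK_MS_BIND;

	if ((flags & both) == both)
		return SK_RECONFIGURE_MNT;
	if (flags & SK_MS_REMOUNT)
		return SK_REMOUNT_SB;
	return SK_OTHER;
}

/* The flag set from the systemd call quoted above takes the first path. */
static void sk_example(void)
{
	unsigned long flags = SK_MS_RDONLY | SK_MS_NOSUID | SK_MS_NODEV |
			      SK_MS_NOEXEC | SK_MS_REMOUNT | SK_MS_BIND;

	assert(sk_classify(flags) == SK_RECONFIGURE_MNT);
}
```

Because the first path never looks at the data argument, an empty or NULL
options string cannot cause a parse failure there.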

Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/namespace.c        |  146 +++++++++++++++++++++++++++++++------------------
 include/linux/mount.h |    2 -
 2 files changed, 93 insertions(+), 55 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 3981fd7b13f5..859dc473e2ad 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -273,13 +273,9 @@ static struct mount *alloc_vfsmnt(const char *name)
  * mnt_want/drop_write() will _keep_ the filesystem
  * r/w.
  */
-int __mnt_is_readonly(struct vfsmount *mnt)
+bool __mnt_is_readonly(struct vfsmount *mnt)
 {
-	if (mnt->mnt_flags & MNT_READONLY)
-		return 1;
-	if (sb_rdonly(mnt->mnt_sb))
-		return 1;
-	return 0;
+	return (mnt->mnt_flags & MNT_READONLY) || sb_rdonly(mnt->mnt_sb);
 }
 EXPORT_SYMBOL_GPL(__mnt_is_readonly);
 
@@ -594,11 +590,12 @@ static int mnt_make_readonly(struct mount *mnt)
 	return ret;
 }
 
-static void __mnt_unmake_readonly(struct mount *mnt)
+static int __mnt_unmake_readonly(struct mount *mnt)
 {
 	lock_mount_hash();
 	mnt->mnt.mnt_flags &= ~MNT_READONLY;
 	unlock_mount_hash();
+	return 0;
 }
 
 int sb_prepare_remount_readonly(struct super_block *sb)
@@ -2355,21 +2352,91 @@ SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
 	return error;
 }
 
-static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
+/*
+ * Don't allow locked mount flags to be cleared.
+ *
+ * No locks need to be held here while testing the various MNT_LOCK
+ * flags because those flags can never be cleared once they are set.
+ */
+static bool can_change_locked_flags(struct mount *mnt, unsigned int mnt_flags)
+{
+	unsigned int fl = mnt->mnt.mnt_flags;
+
+	if ((fl & MNT_LOCK_READONLY) &&
+	    !(mnt_flags & MNT_READONLY))
+		return false;
+
+	if ((fl & MNT_LOCK_NODEV) &&
+	    !(mnt_flags & MNT_NODEV))
+		return false;
+
+	if ((fl & MNT_LOCK_NOSUID) &&
+	    !(mnt_flags & MNT_NOSUID))
+		return false;
+
+	if ((fl & MNT_LOCK_NOEXEC) &&
+	    !(mnt_flags & MNT_NOEXEC))
+		return false;
+
+	if ((fl & MNT_LOCK_ATIME) &&
+	    ((fl & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK)))
+		return false;
+
+	return true;
+}
+
+static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
 {
-	int error = 0;
-	int readonly_request = 0;
+	bool readonly_request = (mnt_flags & MNT_READONLY);
 
-	if (ms_flags & MS_RDONLY)
-		readonly_request = 1;
-	if (readonly_request == __mnt_is_readonly(mnt))
+	if (readonly_request == __mnt_is_readonly(&mnt->mnt))
 		return 0;
 
 	if (readonly_request)
-		error = mnt_make_readonly(real_mount(mnt));
-	else
-		__mnt_unmake_readonly(real_mount(mnt));
-	return error;
+		return mnt_make_readonly(mnt);
+
+	return __mnt_unmake_readonly(mnt);
+}
+
+/*
+ * Update the user-settable attributes on a mount.  The caller must hold
+ * sb->s_umount for writing.
+ */
+static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
+{
+	lock_mount_hash();
+	mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
+	mnt->mnt.mnt_flags = mnt_flags;
+	touch_mnt_namespace(mnt->mnt_ns);
+	unlock_mount_hash();
+}
+
+/*
+ * Handle reconfiguration of the mountpoint only without alteration of the
+ * superblock it refers to.  This is triggered by specifying MS_REMOUNT|MS_BIND
+ * to mount(2).
+ */
+static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
+{
+	struct super_block *sb = path->mnt->mnt_sb;
+	struct mount *mnt = real_mount(path->mnt);
+	int ret;
+
+	if (!check_mnt(mnt))
+		return -EINVAL;
+
+	if (path->dentry != mnt->mnt.mnt_root)
+		return -EINVAL;
+
+	if (!can_change_locked_flags(mnt, mnt_flags))
+		return -EPERM;
+
+	down_write(&sb->s_umount);
+	ret = change_mount_ro_state(mnt, mnt_flags);
+	if (ret == 0)
+		set_mount_attributes(mnt, mnt_flags);
+	up_write(&sb->s_umount);
+	return ret;
 }
 
 /*
@@ -2390,50 +2457,19 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 	if (path->dentry != path->mnt->mnt_root)
 		return -EINVAL;
 
-	/* Don't allow changing of locked mnt flags.
-	 *
-	 * No locks need to be held here while testing the various
-	 * MNT_LOCK flags because those flags can never be cleared
-	 * once they are set.
-	 */
-	if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
-	    !(mnt_flags & MNT_READONLY)) {
-		return -EPERM;
-	}
-	if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
-	    !(mnt_flags & MNT_NODEV)) {
-		return -EPERM;
-	}
-	if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
-	    !(mnt_flags & MNT_NOSUID)) {
-		return -EPERM;
-	}
-	if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
-	    !(mnt_flags & MNT_NOEXEC)) {
+	if (!can_change_locked_flags(mnt, mnt_flags))
 		return -EPERM;
-	}
-	if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
-	    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK))) {
-		return -EPERM;
-	}
 
 	err = security_sb_remount(sb, data, data_size);
 	if (err)
 		return err;
 
 	down_write(&sb->s_umount);
-	if (ms_flags & MS_BIND)
-		err = change_mount_flags(path->mnt, ms_flags);
-	else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
-		err = -EPERM;
-	else
+	err = -EPERM;
+	if (ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) {
 		err = do_remount_sb(sb, sb_flags, data, data_size, 0);
-	if (!err) {
-		lock_mount_hash();
-		mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
-		mnt->mnt.mnt_flags = mnt_flags;
-		touch_mnt_namespace(mnt->mnt_ns);
-		unlock_mount_hash();
+		if (!err)
+			set_mount_attributes(mnt, mnt_flags);
 	}
 	up_write(&sb->s_umount);
 	return err;
@@ -2949,7 +2985,9 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 			    SB_LAZYTIME |
 			    SB_I_VERSION);
 
-	if (flags & MS_REMOUNT)
+	if ((flags & (MS_REMOUNT | MS_BIND)) == (MS_REMOUNT | MS_BIND))
+		retval = do_reconfigure_mnt(&path, mnt_flags);
+	else if (flags & MS_REMOUNT)
 		retval = do_remount(&path, flags, sb_flags, mnt_flags,
 				    data_page, data_size);
 	else if (flags & MS_BIND)
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 8a1031a511c9..c9edd284f0af 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -81,7 +81,7 @@ extern void mnt_drop_write_file(struct file *file);
 extern void mntput(struct vfsmount *mnt);
 extern struct vfsmount *mntget(struct vfsmount *mnt);
 extern struct vfsmount *mnt_clone_internal(const struct path *path);
-extern int __mnt_is_readonly(struct vfsmount *mnt);
+extern bool __mnt_is_readonly(struct vfsmount *mnt);
 extern bool mnt_may_suid(struct vfsmount *mnt);
 
 struct path;



* [PATCH 13/32] vfs: Implement a filesystem superblock creation/configuration context [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (11 preceding siblings ...)
  2018-07-10 22:42 ` [PATCH 12/32] vfs: Separate changing mount flags full remount " David Howells
@ 2018-07-10 22:42 ` David Howells
  2018-07-10 22:43 ` [PATCH 14/32] vfs: Remove unused code after filesystem context changes " David Howells
                   ` (24 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:42 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Implement a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount.

The mounting procedure then becomes:

 (1) Allocate a new fs_context.

 (2) Configure the context.

 (3) Create superblock.

 (4) Mount the superblock any number of times.

 (5) Destroy the context.

Rather than calling fs_type->mount(), an fs_context struct is created and
fs_type->init_fs_context() is called to set it up.  Pointers exist for the
filesystem and LSM to hang their private data off.

A set of operations has to be set by ->init_fs_context() to provide
freeing, duplication, option parsing, binary data parsing, validation,
mounting and superblock filling.
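
The five steps and the role of the operations table can be modelled in
userspace roughly as follows.  This is only a sketch under stated
assumptions: every sk_ and toy_ name, the member names of the ops table and
the "ro" option are invented for illustration and are not the interfaces
this patch defines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sk_fs_context;

/* Hypothetical operations table; member names are illustrative only. */
struct sk_fs_context_ops {
	void (*free_ctx)(struct sk_fs_context *fc);
	int (*parse_option)(struct sk_fs_context *fc, const char *opt);
	int (*validate)(struct sk_fs_context *fc);
	int (*get_tree)(struct sk_fs_context *fc);
};

struct sk_fs_context {
	const struct sk_fs_context_ops *ops;
	void *fs_private;	/* filesystem-private data hangs here */
	int have_root;		/* nonzero once the "superblock" exists */
};

/* Toy filesystem that understands a single option, "ro". */
struct toy_private {
	int read_only;
};

static void toy_free(struct sk_fs_context *fc)
{
	free(fc->fs_private);
	fc->fs_private = NULL;
}

static int toy_parse_option(struct sk_fs_context *fc, const char *opt)
{
	struct toy_private *p = fc->fs_private;

	if (strcmp(opt, "ro") != 0)
		return -1;	/* unknown option */
	p->read_only = 1;
	return 0;
}

static int toy_validate(struct sk_fs_context *fc)
{
	return fc->fs_private ? 0 : -1;
}

static int toy_get_tree(struct sk_fs_context *fc)
{
	fc->have_root = 1;
	return 0;
}

static const struct sk_fs_context_ops toy_ops = {
	.free_ctx	= toy_free,
	.parse_option	= toy_parse_option,
	.validate	= toy_validate,
	.get_tree	= toy_get_tree,
};

/* Stand-in for fs_type->init_fs_context(): attach ops and private data. */
static int toy_init_fs_context(struct sk_fs_context *fc)
{
	fc->fs_private = calloc(1, sizeof(struct toy_private));
	if (!fc->fs_private)
		return -1;
	fc->ops = &toy_ops;
	return 0;
}

/* Walk steps (1) to (5); returns the number of mounts made or -1. */
static int sk_mount_n(const char *opt, int n)
{
	struct sk_fs_context fc = { 0 };
	int mounts = 0, ret;

	assert(n >= 0);
	if (toy_init_fs_context(&fc))			/* (1) allocate */
		return -1;
	ret = fc.ops->parse_option(&fc, opt);		/* (2) configure */
	if (!ret)
		ret = fc.ops->validate(&fc);
	if (!ret)
		ret = fc.ops->get_tree(&fc);		/* (3) create superblock */
	if (!ret && fc.have_root)
		mounts = n;				/* (4) mount n times */
	fc.ops->free_ctx(&fc);				/* (5) destroy */
	return ret ? -1 : mounts;
}
```

In this model a bad option is rejected at step (2), before any superblock
exists, which is the ordering the context is meant to make possible.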

Legacy filesystems are supported by the provision of a set of legacy
fs_context operations that build up a list of mount options and then invoke
fs_type->mount() from within the fs_context ->get_tree() operation.  This
allows all filesystems to be accessed using fs_context.
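
The legacy wrapper idea can be sketched in userspace as below: options are
accumulated into one comma-separated buffer, which is then handed to the old
mount callback at get_tree time.  This is an assumption-laden model, not the
code in fs/fs_context.c; the sk_ names, the callback type and the 4096-byte
limit are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_DATA_MAX 4096	/* assumption: one page of option data */

/* Hypothetical stand-in for the legacy context's data page. */
struct sk_legacy_ctx {
	char data[SK_DATA_MAX];
	size_t data_size;	/* bytes used, excluding the trailing NUL */
};

/* Hypothetical stand-in for the old fs_type->mount() entry point. */
typedef int (*sk_legacy_mount_t)(const char *data, size_t data_size);

/* Append one "key" or "key=val" option, comma-separated. */
static int sk_legacy_add_option(struct sk_legacy_ctx *ctx, const char *opt)
{
	size_t len, sep;

	assert(ctx && opt);
	len = strlen(opt);
	sep = ctx->data_size ? 1 : 0;

	if (ctx->data_size + sep + len + 1 > sizeof(ctx->data))
		return -1;	/* would overflow the data page */
	if (sep)
		ctx->data[ctx->data_size++] = ',';
	memcpy(ctx->data + ctx->data_size, opt, len + 1);
	ctx->data_size += len;
	return 0;
}

/* The get_tree step simply forwards the accumulated string. */
static int sk_legacy_get_tree(struct sk_legacy_ctx *ctx,
			      sk_legacy_mount_t mount)
{
	return mount(ctx->data, ctx->data_size);
}

/* Toy "filesystem" that records what it was given. */
static char sk_last_data[SK_DATA_MAX];

static int sk_toy_mount(const char *data, size_t data_size)
{
	memcpy(sk_last_data, data, data_size + 1);
	return 0;
}
```

A context in this model must start zeroed so that an empty option list is
still a valid, NUL-terminated string.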

Note that, whilst this patch adds a lot of lines of code, much of it
duplicates existing code, and that duplication can be eliminated once all
filesystems have been converted over.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/Makefile                |    3 
 fs/fs_context.c            |  625 ++++++++++++++++++++++++++++++++++++++++++++
 fs/internal.h              |    3 
 fs/libfs.c                 |   19 +
 fs/namespace.c             |  350 ++++++++++++++++---------
 fs/super.c                 |  303 ++++++++++++++++++++-
 include/linux/fs.h         |   14 +
 include/linux/fs_context.h |   45 +++
 include/linux/mount.h      |    3 
 9 files changed, 1222 insertions(+), 143 deletions(-)
 create mode 100644 fs/fs_context.c

diff --git a/fs/Makefile b/fs/Makefile
index 293733f61594..5563cf34f7c2 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -12,7 +12,8 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
-		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o
+		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
+		fs_context.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fs_context.c b/fs/fs_context.c
new file mode 100644
index 000000000000..b7c84e0aa2f9
--- /dev/null
+++ b/fs/fs_context.c
@@ -0,0 +1,625 @@
+/* Provide a way to create a superblock configuration context within the kernel
+ * that allows a superblock to be set up prior to mounting.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/fs_context.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/nsproxy.h>
+#include <linux/slab.h>
+#include <linux/magic.h>
+#include <linux/security.h>
+#include <linux/parser.h>
+#include <linux/mnt_namespace.h>
+#include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
+#include <net/net_namespace.h>
+#include "mount.h"
+
+enum legacy_fs_param {
+	LEGACY_FS_UNSET_PARAMS,
+	LEGACY_FS_NO_PARAMS,
+	LEGACY_FS_MONOLITHIC_PARAMS,
+	LEGACY_FS_INDIVIDUAL_PARAMS,
+	LEGACY_FS_MAGIC_PARAMS,
+};
+
+struct legacy_fs_context {
+	char			*legacy_data;	/* Data page for legacy filesystems */
+	char			*secdata;
+	size_t			data_size;
+	enum legacy_fs_param	param_type;
+};
+
+static int legacy_init_fs_context(struct fs_context *fc, struct dentry *dentry);
+static const struct fs_context_operations legacy_fs_context_ops;
+
+static const match_table_t common_set_sb_flag = {
+	{ SB_DIRSYNC,		"dirsync" },
+	{ SB_LAZYTIME,		"lazytime" },
+	{ SB_MANDLOCK,		"mand" },
+	{ SB_POSIXACL,		"posixacl" },
+	{ SB_RDONLY,		"ro" },
+	{ SB_SYNCHRONOUS,	"sync" },
+	{ },
+};
+
+static const match_table_t common_clear_sb_flag = {
+	{ SB_LAZYTIME,		"nolazytime" },
+	{ SB_MANDLOCK,		"nomand" },
+	{ SB_RDONLY,		"rw" },
+	{ SB_SILENT,		"silent" },
+	{ SB_SYNCHRONOUS,	"async" },
+	{ },
+};
+
+static const match_table_t forbidden_sb_flag = {
+	{ 1,	"bind" },
+	{ 1,	"move" },
+	{ 1,	"private" },
+	{ 1,	"remount" },
+	{ 1,	"shared" },
+	{ 1,	"slave" },
+	{ 1,	"unbindable" },
+	{ 1,	"rec" },
+	{ 1,	"noatime" },
+	{ 1,	"relatime" },
+	{ 1,	"norelatime" },
+	{ 1,	"strictatime" },
+	{ 1,	"nostrictatime" },
+	{ 1,	"nodiratime" },
+	{ 1,	"dev" },
+	{ 1,	"nodev" },
+	{ 1,	"exec" },
+	{ 1,	"noexec" },
+	{ 1,	"suid" },
+	{ 1,	"nosuid" },
+	{ },
+};
+
+/*
+ * Check for a common mount option that manipulates s_flags.
+ */
+static int vfs_parse_sb_flag_option(struct fs_context *fc, char *data)
+{
+	substring_t args[MAX_OPT_ARGS];
+	unsigned int token;
+
+	token = match_token(data, common_set_sb_flag, args);
+	if (token) {
+		fc->sb_flags |= token;
+		return 1;
+	}
+
+	token = match_token(data, common_clear_sb_flag, args);
+	if (token) {
+		fc->sb_flags &= ~token;
+		return 1;
+	}
+
+	token = match_token(data, forbidden_sb_flag, args);
+	if (token)
+		return -EINVAL;
+
+	return 0;
+}
+
+/**
+ * vfs_parse_fs_option - Add a single mount option to a superblock config
+ * @fc: The filesystem context to modify
+ * @opt: The option to apply.
+ * @len: The length of the option.
+ *
+ * A single mount option in string form is applied to the filesystem context
+ * being set up.  Certain standard options (for example "ro") are translated
+ * into flag bits without going to the filesystem.  The active security module
+ * is allowed to observe and poach options.  Any other options are passed over
+ * to the filesystem to parse.
+ *
+ * This may be called multiple times for a context.
+ *
+ * Returns 0 on success and a negative error code on failure.  In the event of
+ * failure, supplementary error information may have been set.
+ */
+int vfs_parse_fs_option(struct fs_context *fc, char *opt, size_t len)
+{
+	int ret;
+
+	ret = vfs_parse_sb_flag_option(fc, opt);
+	if (ret < 0)
+		return ret;
+	if (ret == 1)
+		return 0;
+
+	ret = security_fs_context_parse_option(fc, opt, len);
+	if (ret < 0)
+		return ret;
+	if (ret == 1)
+		return 0;
+
+	if (fc->ops->parse_option)
+		return fc->ops->parse_option(fc, opt, len);
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(vfs_parse_fs_option);
+
+/**
+ * vfs_set_fs_source - Set the source/device name in a filesystem context
+ * @fc: The filesystem context to alter
+ * @source: The name of the source
+ * @slen: Length of @source string
+ */
+int vfs_set_fs_source(struct fs_context *fc, const char *source, size_t slen)
+{
+	char *src;
+	int ret;
+
+	if (fc->source)
+		return -EINVAL;
+	src = kmemdup_nul(source, slen, GFP_KERNEL);
+	if (!src)
+		return -ENOMEM;
+
+	ret = security_fs_context_parse_source(fc, src);
+	if (ret < 0)
+		goto error;
+
+	if (fc->ops->parse_source) {
+		ret = fc->ops->parse_source(fc, src);
+		if (ret < 0)
+			goto error;
+	}
+
+	fc->source = src;
+	return 0;
+
+error:
+	kfree(src);
+	return ret;
+}
+EXPORT_SYMBOL(vfs_set_fs_source);
+
+/**
+ * generic_parse_monolithic - Parse key[=val][,key[=val]]* mount data
+ * @fc: The filesystem context to fill in.
+ * @data: The data to parse
+ * @data_size: The amount of data
+ *
+ * Parse a blob of data that's in key[=val][,key[=val]]* form.  This can be
+ * called from the ->parse_monolithic() fs_context operation.
+ *
+ * Returns 0 on success or the error returned by the ->parse_option() fs_context
+ * operation on failure.
+ */
+int generic_parse_monolithic(struct fs_context *fc, void *data, size_t data_size)
+{
+	char *options = data, *opt;
+	int ret;
+
+	if (!options)
+		return 0;
+
+	while ((opt = strsep(&options, ",")) != NULL) {
+		if (*opt) {
+			ret = vfs_parse_fs_option(fc, opt, strlen(opt));
+			if (ret < 0)
+				return ret;
+		}
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(generic_parse_monolithic);
+
+/**
+ * vfs_new_fs_context - Create a filesystem context.
+ * @fs_type: The filesystem type.
+ * @reference: The dentry from which this one derives (or NULL)
+ * @sb_flags: Filesystem/superblock flags (SB_*)
+ * @purpose: The purpose that this configuration shall be used for.
+ *
+ * Open a filesystem and create a mount context.  The mount context is
+ * initialised with the supplied flags and, if a submount/automount from
+ * another superblock (referred to by @reference) is supplied, may have
+ * parameters such as namespaces copied across from that superblock.
+ */
+struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+				      struct dentry *reference,
+				      unsigned int sb_flags,
+				      enum fs_context_purpose purpose)
+{
+	int (*init_fs_context)(struct fs_context *, struct dentry *);
+	struct fs_context *fc;
+	int ret = -ENOMEM;
+
+	fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL);
+	if (!fc)
+		return ERR_PTR(-ENOMEM);
+
+	fc->purpose	= purpose;
+	fc->sb_flags	= sb_flags;
+	fc->fs_type	= get_filesystem(fs_type);
+	fc->cred	= get_current_cred();
+
+	switch (purpose) {
+	case FS_CONTEXT_FOR_KERNEL_MOUNT:
+		fc->sb_flags |= SB_KERNMOUNT;
+		/* Fallthrough */
+	case FS_CONTEXT_FOR_USER_MOUNT:
+		fc->user_ns = get_user_ns(fc->cred->user_ns);
+		fc->net_ns = get_net(current->nsproxy->net_ns);
+		break;
+	case FS_CONTEXT_FOR_SUBMOUNT:
+		fc->user_ns = get_user_ns(reference->d_sb->s_user_ns);
+		fc->net_ns = get_net(current->nsproxy->net_ns);
+		break;
+	case FS_CONTEXT_FOR_RECONFIGURE:
+		/* We don't pin any namespaces as the superblock's
+		 * subscriptions cannot be changed at this point.
+		 */
+		atomic_inc(&reference->d_sb->s_active);
+		fc->root = dget(reference);
+		break;
+	}
+
+
+	/* TODO: Make all filesystems support this unconditionally */
+	init_fs_context = fc->fs_type->init_fs_context;
+	if (!init_fs_context)
+		init_fs_context = legacy_init_fs_context;
+
+	ret = (*init_fs_context)(fc, reference);
+	if (ret < 0)
+		goto err_fc;
+
+	/* Do the security check last because ->init_fs_context may change the
+	 * namespace subscriptions.
+	 */
+	ret = security_fs_context_alloc(fc, reference);
+	if (ret < 0)
+		goto err_fc;
+
+	return fc;
+
+err_fc:
+	put_fs_context(fc);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(vfs_new_fs_context);
+
+/**
+ * vfs_sb_reconfig - Create a filesystem context for remount/reconfiguration
+ * @mountpoint: The mountpoint to open
+ * @sb_flags: Filesystem/superblock flags (SB_*)
+ *
+ * Open a mounted filesystem and create a filesystem context such that a
+ * remount can be effected.
+ */
+struct fs_context *vfs_sb_reconfig(struct path *mountpoint,
+				   unsigned int sb_flags)
+{
+	struct fs_context *fc;
+
+	fc = vfs_new_fs_context(mountpoint->dentry->d_sb->s_type,
+				mountpoint->dentry,
+				sb_flags, FS_CONTEXT_FOR_RECONFIGURE);
+	if (IS_ERR(fc))
+		return fc;
+
+	return fc;
+}
+
+/**
+ * vfs_dup_fs_context - Duplicate a filesystem context.
+ * @src_fc: The context to copy.
+ */
+struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
+{
+	struct fs_context *fc;
+	int ret;
+
+	if (!src_fc->ops->dup)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	fc = kmemdup(src_fc, sizeof(struct fs_context), GFP_KERNEL);
+	if (!fc)
+		return ERR_PTR(-ENOMEM);
+
+	fc->fs_private	= NULL;
+	fc->s_fs_info	= NULL;
+	fc->source	= NULL;
+	fc->security	= NULL;
+	get_filesystem(fc->fs_type);
+	get_net(fc->net_ns);
+	get_user_ns(fc->user_ns);
+	get_cred(fc->cred);
+
+	/* Can't call put until we've called ->dup */
+	ret = fc->ops->dup(fc, src_fc);
+	if (ret < 0)
+		goto err_fc;
+
+	ret = security_fs_context_dup(fc, src_fc);
+	if (ret < 0)
+		goto err_fc;
+	return fc;
+
+err_fc:
+	put_fs_context(fc);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(vfs_dup_fs_context);
+
+/**
+ * put_fs_context - Dispose of a superblock configuration context.
+ * @fc: The context to dispose of.
+ */
+void put_fs_context(struct fs_context *fc)
+{
+	struct super_block *sb;
+
+	if (fc->root) {
+		sb = fc->root->d_sb;
+		dput(fc->root);
+		fc->root = NULL;
+		deactivate_super(sb);
+	}
+
+	if (fc->ops && fc->ops->free)
+		fc->ops->free(fc);
+
+	security_fs_context_free(fc);
+	if (fc->net_ns)
+		put_net(fc->net_ns);
+	put_user_ns(fc->user_ns);
+	if (fc->cred)
+		put_cred(fc->cred);
+	kfree(fc->subtype);
+	put_filesystem(fc->fs_type);
+	kfree(fc->source);
+	kfree(fc);
+}
+EXPORT_SYMBOL(put_fs_context);
+
+/*
+ * Free the config for a filesystem that doesn't support fs_context.
+ */
+static void legacy_fs_context_free(struct fs_context *fc)
+{
+	struct legacy_fs_context *ctx = fc->fs_private;
+
+	if (ctx) {
+		free_secdata(ctx->secdata);
+		switch (ctx->param_type) {
+		case LEGACY_FS_UNSET_PARAMS:
+		case LEGACY_FS_NO_PARAMS:
+			break;
+		case LEGACY_FS_MAGIC_PARAMS:
+			break; /* ctx->legacy_data is a weird pointer */
+		default:
+			kfree(ctx->legacy_data);
+			break;
+		}
+
+		kfree(ctx);
+	}
+}
+
+/*
+ * Duplicate a legacy config.
+ */
+static int legacy_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+	struct legacy_fs_context *ctx;
+	struct legacy_fs_context *src_ctx = src_fc->fs_private;
+
+	ctx = kmemdup(src_ctx, sizeof(*src_ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	switch (ctx->param_type) {
+	case LEGACY_FS_MONOLITHIC_PARAMS:
+	case LEGACY_FS_INDIVIDUAL_PARAMS:
+		ctx->legacy_data = kmemdup(src_ctx->legacy_data,
+					   src_ctx->data_size, GFP_KERNEL);
+		if (!ctx->legacy_data) {
+			kfree(ctx);
+			return -ENOMEM;
+		}
+		/* Fall through */
+	default:
+		break;
+	}
+
+	fc->fs_private = ctx;
+	return 0;
+}
+
+/*
+ * Add an option to a legacy config.  We build up a comma-separated list of
+ * options.
+ */
+static int legacy_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+	struct legacy_fs_context *ctx = fc->fs_private;
+	unsigned int size = ctx->data_size;
+
+	if (ctx->param_type != LEGACY_FS_UNSET_PARAMS &&
+	    ctx->param_type != LEGACY_FS_INDIVIDUAL_PARAMS) {
+		pr_warn("VFS: Can't mix monolithic and individual options\n");
+		return -EINVAL;
+	}
+
+	if (size + len + 2 > PAGE_SIZE)
+		return -EINVAL;
+	if (memchr(opt, ',', len) != NULL)
+		return -EINVAL;
+	if (!ctx->legacy_data) {
+		ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
+		if (!ctx->legacy_data)
+			return -ENOMEM;
+	}
+
+	ctx->legacy_data[size++] = ',';
+	memcpy(ctx->legacy_data + size, opt, len);
+	size += len;
+	ctx->legacy_data[size] = '\0';
+	ctx->data_size = size;
+	ctx->param_type = LEGACY_FS_INDIVIDUAL_PARAMS;
+	return 0;
+}
+
+/*
+ * Add monolithic mount data.
+ */
+static int legacy_parse_monolithic(struct fs_context *fc, void *data, size_t data_size)
+{
+	struct legacy_fs_context *ctx = fc->fs_private;
+
+	if (ctx->param_type != LEGACY_FS_UNSET_PARAMS) {
+		pr_warn("VFS: Can't mix monolithic and individual options\n");
+		return -EINVAL;
+	}
+
+	if (!data) {
+		ctx->param_type = LEGACY_FS_NO_PARAMS;
+		return 0;
+	}
+
+	ctx->data_size = data_size;
+	if (data_size > 0) {
+		ctx->legacy_data = kmemdup(data, data_size, GFP_KERNEL);
+		if (!ctx->legacy_data)
+			return -ENOMEM;
+		ctx->param_type = LEGACY_FS_MONOLITHIC_PARAMS;
+	} else {
+		/* Some filesystems pass weird pointers through that we don't
+		 * want to copy.  They can indicate this by setting data_size
+		 * to 0.
+		 */
+		ctx->legacy_data = data;
+		ctx->param_type = LEGACY_FS_MAGIC_PARAMS;
+	}
+
+	return 0;
+}
+
+/*
+ * Use the legacy mount validation step to strip out and process security
+ * config options.
+ */
+static int legacy_validate(struct fs_context *fc)
+{
+	struct legacy_fs_context *ctx = fc->fs_private;
+
+	switch (ctx->param_type) {
+	case LEGACY_FS_UNSET_PARAMS:
+		ctx->param_type = LEGACY_FS_NO_PARAMS;
+		/* Fall through */
+	case LEGACY_FS_NO_PARAMS:
+	case LEGACY_FS_MAGIC_PARAMS:
+		return 0;
+	default:
+		break;
+	}
+
+	if (fc->fs_type->fs_flags & FS_BINARY_MOUNTDATA)
+		return 0;
+
+	ctx->secdata = alloc_secdata();
+	if (!ctx->secdata)
+		return -ENOMEM;
+
+	return security_sb_copy_data(ctx->legacy_data, ctx->data_size,
+				     ctx->secdata);
+}
+
+/*
+ * Determine the superblock subtype.
+ */
+static int legacy_set_subtype(struct fs_context *fc)
+{
+	const char *subtype = strchr(fc->fs_type->name, '.');
+
+	if (subtype) {
+		subtype++;
+		if (!subtype[0])
+			return -EINVAL;
+	} else {
+		subtype = "";
+	}
+
+	fc->subtype = kstrdup(subtype, GFP_KERNEL);
+	if (!fc->subtype)
+		return -ENOMEM;
+	return 0;
+}
+
+/*
+ * Get a mountable root with the legacy mount command.
+ */
+static int legacy_get_tree(struct fs_context *fc)
+{
+	struct legacy_fs_context *ctx = fc->fs_private;
+	struct super_block *sb;
+	struct dentry *root;
+	int ret;
+
+	root = fc->fs_type->mount(fc->fs_type, fc->sb_flags,
+				      fc->source, ctx->legacy_data,
+				      ctx->data_size);
+	if (IS_ERR(root))
+		return PTR_ERR(root);
+
+	sb = root->d_sb;
+	BUG_ON(!sb);
+
+	if ((fc->fs_type->fs_flags & FS_HAS_SUBTYPE) &&
+	    !fc->subtype) {
+		ret = legacy_set_subtype(fc);
+		if (ret < 0)
+			goto err_sb;
+	}
+
+	fc->root = root;
+	return 0;
+
+err_sb:
+	dput(root);
+	deactivate_locked_super(sb);
+	return ret;
+}
+
+static const struct fs_context_operations legacy_fs_context_ops = {
+	.free			= legacy_fs_context_free,
+	.dup			= legacy_fs_context_dup,
+	.parse_option		= legacy_parse_option,
+	.parse_monolithic	= legacy_parse_monolithic,
+	.validate		= legacy_validate,
+	.get_tree		= legacy_get_tree,
+};
+
+/*
+ * Initialise a legacy context for a filesystem that doesn't support
+ * fs_context.
+ */
+static int legacy_init_fs_context(struct fs_context *fc, struct dentry *dentry)
+{
+
+	fc->fs_private = kzalloc(sizeof(struct legacy_fs_context), GFP_KERNEL);
+	if (!fc->fs_private)
+		return -ENOMEM;
+
+	fc->ops = &legacy_fs_context_ops;
+	return 0;
+}
diff --git a/fs/internal.h b/fs/internal.h
index 383ee4724f77..13febddab0f8 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -99,7 +99,8 @@ extern struct file *get_empty_filp(void);
 /*
  * super.c
  */
-extern int do_remount_sb(struct super_block *, int, void *, size_t, int);
+extern int do_remount_sb(struct super_block *, int, void *, size_t, int,
+			 struct fs_context *);
 extern bool trylock_super(struct super_block *sb);
 extern struct dentry *mount_fs(struct file_system_type *,
 			       int, const char *, void *, size_t);
diff --git a/fs/libfs.c b/fs/libfs.c
index 9f1f4884b7cc..d9a5d883dc3f 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -9,6 +9,7 @@
 #include <linux/slab.h>
 #include <linux/cred.h>
 #include <linux/mount.h>
+#include <linux/fs_context.h>
 #include <linux/vfs.h>
 #include <linux/quotaops.h>
 #include <linux/mutex.h>
@@ -574,13 +575,29 @@ static DEFINE_SPINLOCK(pin_fs_lock);
 
 int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *count)
 {
+	struct fs_context *fc;
 	struct vfsmount *mnt = NULL;
+	int ret;
+
 	spin_lock(&pin_fs_lock);
 	if (unlikely(!*mount)) {
 		spin_unlock(&pin_fs_lock);
-		mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL, 0);
+
+		fc = vfs_new_fs_context(type, NULL, 0, FS_CONTEXT_FOR_KERNEL_MOUNT);
+		if (IS_ERR(fc))
+			return PTR_ERR(fc);
+
+		ret = vfs_get_tree(fc);
+		if (ret < 0) {
+			put_fs_context(fc);
+			return ret;
+		}
+
+		mnt = vfs_create_mount(fc, 0);
+		put_fs_context(fc);
 		if (IS_ERR(mnt))
 			return PTR_ERR(mnt);
+
 		spin_lock(&pin_fs_lock);
 		if (!*mount)
 			*mount = mnt;
diff --git a/fs/namespace.c b/fs/namespace.c
index 859dc473e2ad..3bae16db1b1d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -26,8 +26,10 @@
 #include <linux/magic.h>
 #include <linux/bootmem.h>
 #include <linux/task_work.h>
+#include <linux/file.h>
 #include <linux/sched/task.h>
 #include <uapi/linux/mount.h>
+#include <linux/fs_context.h>
 
 #include "pnode.h"
 #include "internal.h"
@@ -1017,56 +1019,6 @@ static struct mount *skip_mnt_tree(struct mount *p)
 	return p;
 }
 
-struct vfsmount *
-vfs_kern_mount(struct file_system_type *type, int flags, const char *name,
-	       void *data, size_t data_size)
-{
-	struct mount *mnt;
-	struct dentry *root;
-
-	if (!type)
-		return ERR_PTR(-ENODEV);
-
-	mnt = alloc_vfsmnt(name);
-	if (!mnt)
-		return ERR_PTR(-ENOMEM);
-
-	if (flags & SB_KERNMOUNT)
-		mnt->mnt.mnt_flags = MNT_INTERNAL;
-
-	root = mount_fs(type, flags, name, data, data_size);
-	if (IS_ERR(root)) {
-		mnt_free_id(mnt);
-		free_vfsmnt(mnt);
-		return ERR_CAST(root);
-	}
-
-	mnt->mnt.mnt_root = root;
-	mnt->mnt.mnt_sb = root->d_sb;
-	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
-	mnt->mnt_parent = mnt;
-	lock_mount_hash();
-	list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
-	unlock_mount_hash();
-	return &mnt->mnt;
-}
-EXPORT_SYMBOL_GPL(vfs_kern_mount);
-
-struct vfsmount *
-vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
-	     const char *name, void *data, size_t data_size)
-{
-	/* Until it is worked out how to pass the user namespace
-	 * through from the parent mount to the submount don't support
-	 * unprivileged mounts with submounts.
-	 */
-	if (mountpoint->d_sb->s_user_ns != &init_user_ns)
-		return ERR_PTR(-EPERM);
-
-	return vfs_kern_mount(type, SB_SUBMOUNT, name, data, data_size);
-}
-EXPORT_SYMBOL_GPL(vfs_submount);
-
 static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 					int flag)
 {
@@ -1594,7 +1546,7 @@ static int do_umount(struct mount *mnt, int flags)
 			return -EPERM;
 		down_write(&sb->s_umount);
 		if (!sb_rdonly(sb))
-			retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, 0);
+			retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, 0, NULL);
 		up_write(&sb->s_umount);
 		return retval;
 	}
@@ -2439,6 +2391,20 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
 	return ret;
 }
 
+/*
+ * Parse the monolithic page of mount data given to sys_mount().
+ */
+static int parse_monolithic_mount_data(struct fs_context *fc, void *data, size_t data_size)
+{
+	int (*monolithic_mount_data)(struct fs_context *, void *, size_t);
+
+	monolithic_mount_data = fc->ops->parse_monolithic;
+	if (!monolithic_mount_data)
+		monolithic_mount_data = generic_parse_monolithic;
+
+	return monolithic_mount_data(fc, data, data_size);
+}
+
 /*
  * change filesystem flags. dir should be a physical root of filesystem.
  * If you've mounted a non-root directory somewhere and want to do remount
@@ -2447,9 +2413,11 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
 static int do_remount(struct path *path, int ms_flags, int sb_flags,
 		      int mnt_flags, void *data, size_t data_size)
 {
+	struct fs_context *fc = NULL;
 	int err;
 	struct super_block *sb = path->mnt->mnt_sb;
 	struct mount *mnt = real_mount(path->mnt);
+	struct file_system_type *type = sb->s_type;
 
 	if (!check_mnt(mnt))
 		return -EINVAL;
@@ -2460,18 +2428,41 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 	if (!can_change_locked_flags(mnt, mnt_flags))
 		return -EPERM;
 
-	err = security_sb_remount(sb, data, data_size);
-	if (err)
-		return err;
+	if (type->init_fs_context) {
+		fc = vfs_sb_reconfig(path, sb_flags);
+		if (IS_ERR(fc))
+			return PTR_ERR(fc);
+
+		err = parse_monolithic_mount_data(fc, data, data_size);
+		if (err < 0)
+			goto err_fc;
+
+		if (fc->ops->validate) {
+			err = fc->ops->validate(fc);
+			if (err < 0)
+				goto err_fc;
+		}
+
+		err = security_fs_context_validate(fc);
+		if (err)
+			goto err_fc;
+	} else {
+		err = security_sb_remount(sb, data, data_size);
+		if (err)
+			return err;
+	}
 
 	down_write(&sb->s_umount);
 	err = -EPERM;
 	if (ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) {
-		err = do_remount_sb(sb, sb_flags, data, data_size, 0);
+		err = do_remount_sb(sb, sb_flags, data, data_size, 0, fc);
 		if (!err)
 			set_mount_attributes(mnt, mnt_flags);
 	}
 	up_write(&sb->s_umount);
+err_fc:
+	if (fc)
+		put_fs_context(fc);
 	return err;
 }
 
@@ -2576,29 +2567,6 @@ static int do_move_mount_old(struct path *path, const char *old_name)
 	return err;
 }
 
-static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
-{
-	int err;
-	const char *subtype = strchr(fstype, '.');
-	if (subtype) {
-		subtype++;
-		err = -EINVAL;
-		if (!subtype[0])
-			goto err;
-	} else
-		subtype = "";
-
-	mnt->mnt_sb->s_subtype = kstrdup(subtype, GFP_KERNEL);
-	err = -ENOMEM;
-	if (!mnt->mnt_sb->s_subtype)
-		goto err;
-	return mnt;
-
- err:
-	mntput(mnt);
-	return ERR_PTR(err);
-}
-
 /*
  * add a mount into a namespace's mount tree
  */
@@ -2643,44 +2611,88 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
 	return err;
 }
 
-static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
+static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags);
+
+/*
+ * Create a new mount using a superblock configuration and request it
+ * be added to the namespace tree.
+ */
+static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
+			   unsigned int mnt_flags)
+{
+	struct vfsmount *mnt;
+	int ret;
+
+	ret = security_sb_mountpoint(fc, mountpoint,
+				     mnt_flags & ~MNT_INTERNAL_FLAGS);
+	if (ret < 0)
+		return ret;
+
+	if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
+		pr_warn("VFS: Mount too revealing\n");
+		return -EPERM;
+	}
+
+	mnt = vfs_create_mount(fc, mnt_flags);
+	if (IS_ERR(mnt))
+		return PTR_ERR(mnt);
+
+	ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+	if (ret < 0)
+		goto err_mnt;
+	return ret;
+
+err_mnt:
+	mntput(mnt);
+	return ret;
+}
 
 /*
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
  */
-static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
-			int mnt_flags, const char *name,
+static int do_new_mount(struct path *mountpoint, const char *fstype,
+			int sb_flags, int mnt_flags, const char *name,
 			void *data, size_t data_size)
 {
-	struct file_system_type *type;
-	struct vfsmount *mnt;
+	struct file_system_type *fs_type;
+	struct fs_context *fc;
 	int err;
 
 	if (!fstype)
 		return -EINVAL;
 
-	type = get_fs_type(fstype);
-	if (!type)
-		return -ENODEV;
-
-	mnt = vfs_kern_mount(type, sb_flags, name, data, data_size);
-	if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
-	    !mnt->mnt_sb->s_subtype)
-		mnt = fs_set_subtype(mnt, fstype);
+	err = -ENODEV;
+	fs_type = get_fs_type(fstype);
+	if (!fs_type)
+		goto out;
 
-	put_filesystem(type);
-	if (IS_ERR(mnt))
-		return PTR_ERR(mnt);
+	fc = vfs_new_fs_context(fs_type, NULL, sb_flags,
+				FS_CONTEXT_FOR_USER_MOUNT);
+	put_filesystem(fs_type);
+	if (IS_ERR(fc)) {
+		err = PTR_ERR(fc);
+		goto out;
+	}
 
-	if (mount_too_revealing(mnt, &mnt_flags)) {
-		mntput(mnt);
-		return -EPERM;
+	if (name) {
+		err = vfs_set_fs_source(fc, name, strlen(name));
+		if (err < 0)
+			goto out_fc;
 	}
 
-	err = do_add_mount(real_mount(mnt), path, mnt_flags);
-	if (err)
-		mntput(mnt);
+	err = parse_monolithic_mount_data(fc, data, data_size);
+	if (err < 0)
+		goto out_fc;
+
+	err = vfs_get_tree(fc);
+	if (err < 0)
+		goto out_fc;
+
+	err = do_new_mount_fc(fc, mountpoint, mnt_flags);
+out_fc:
+	put_fs_context(fc);
+out:
 	return err;
 }
 
@@ -3230,6 +3242,117 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
 	return ksys_mount(dev_name, dir_name, type, flags, data);
 }
 
+/**
+ * vfs_create_mount - Create a mount for a configured superblock
+ * @fc: The configuration context with the superblock attached
+ * @mnt_flags: The mount flags to apply
+ *
+ * Create a mount to an already configured superblock.  If necessary, the
+ * caller should invoke vfs_get_tree() before calling this.
+ *
+ * Note that this does not attach the mount to anything.
+ */
+struct vfsmount *vfs_create_mount(struct fs_context *fc, unsigned int mnt_flags)
+{
+	struct mount *mnt;
+
+	if (!fc->root)
+		return ERR_PTR(-EINVAL);
+
+	mnt = alloc_vfsmnt(fc->source ?: "none");
+	if (!mnt)
+		return ERR_PTR(-ENOMEM);
+
+	if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
+		/* It's a longterm mount, don't release mnt until we unmount
+		 * before file sys is unregistered
+		 */
+		mnt_flags |= MNT_INTERNAL;
+
+	atomic_inc(&fc->root->d_sb->s_active);
+	mnt->mnt.mnt_flags	= mnt_flags;
+	mnt->mnt.mnt_sb		= fc->root->d_sb;
+	mnt->mnt.mnt_root	= dget(fc->root);
+	mnt->mnt_mountpoint	= mnt->mnt.mnt_root;
+	mnt->mnt_parent		= mnt;
+
+	lock_mount_hash();
+	list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
+	unlock_mount_hash();
+	return &mnt->mnt;
+}
+EXPORT_SYMBOL(vfs_create_mount);
+
+struct vfsmount *vfs_kern_mount(struct file_system_type *type,
+				int sb_flags, const char *devname,
+				void *data, size_t data_size)
+{
+	struct fs_context *fc;
+	struct vfsmount *mnt;
+	int ret;
+
+	if (!type)
+		return ERR_PTR(-EINVAL);
+
+	fc = vfs_new_fs_context(type, NULL, sb_flags,
+				sb_flags & SB_KERNMOUNT ?
+				FS_CONTEXT_FOR_KERNEL_MOUNT :
+				FS_CONTEXT_FOR_USER_MOUNT);
+	if (IS_ERR(fc))
+		return ERR_CAST(fc);
+
+	if (devname) {
+		ret = vfs_set_fs_source(fc, devname, strlen(devname));
+		if (ret < 0)
+			goto err_fc;
+	}
+
+	ret = parse_monolithic_mount_data(fc, data, data_size);
+	if (ret < 0)
+		goto err_fc;
+
+	ret = vfs_get_tree(fc);
+	if (ret < 0)
+		goto err_fc;
+
+	mnt = vfs_create_mount(fc, 0);
+out:
+	put_fs_context(fc);
+	return mnt;
+err_fc:
+	mnt = ERR_PTR(ret);
+	goto out;
+}
+EXPORT_SYMBOL_GPL(vfs_kern_mount);
+
+struct vfsmount *
+vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
+	     const char *name, void *data, size_t data_size)
+{
+	/* Until it is worked out how to pass the user namespace
+	 * through from the parent mount to the submount don't support
+	 * unprivileged mounts with submounts.
+	 */
+	if (mountpoint->d_sb->s_user_ns != &init_user_ns)
+		return ERR_PTR(-EPERM);
+
+	return vfs_kern_mount(type, SB_SUBMOUNT, name, data, data_size);
+}
+EXPORT_SYMBOL_GPL(vfs_submount);
+
+struct vfsmount *kern_mount(struct file_system_type *type)
+{
+	return vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL, 0);
+}
+EXPORT_SYMBOL_GPL(kern_mount);
+
+struct vfsmount *kern_mount_data(struct file_system_type *type,
+				 void *data, size_t data_size)
+{
+	return vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
+}
+EXPORT_SYMBOL_GPL(kern_mount_data);
+
 /*
  * Move a mount from one place to another.
  * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
@@ -3507,22 +3630,6 @@ void put_mnt_ns(struct mnt_namespace *ns)
 	free_mnt_ns(ns);
 }
 
-struct vfsmount *kern_mount_data(struct file_system_type *type,
-				 void *data, size_t data_size)
-{
-	struct vfsmount *mnt;
-	mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
-	if (!IS_ERR(mnt)) {
-		/*
-		 * it is a longterm mount, don't release mnt until
-		 * we unmount before file sys is unregistered
-		*/
-		real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
-	}
-	return mnt;
-}
-EXPORT_SYMBOL_GPL(kern_mount_data);
-
 void kern_unmount(struct vfsmount *mnt)
 {
 	/* release long term mount so mount point can be released */
@@ -3563,7 +3670,8 @@ bool current_chrooted(void)
 	return chrooted;
 }
 
-static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
+static bool mnt_already_visible(struct mnt_namespace *ns,
+				const struct super_block *sb,
 				int *new_mnt_flags)
 {
 	int new_flags = *new_mnt_flags;
@@ -3575,7 +3683,7 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
 		struct mount *child;
 		int mnt_flags;
 
-		if (mnt->mnt.mnt_sb->s_type != new->mnt_sb->s_type)
+		if (mnt->mnt.mnt_sb->s_type != sb->s_type)
 			continue;
 
 		/* This mount is not fully visible if it's root directory
@@ -3626,7 +3734,7 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
 	return visible;
 }
 
-static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
+static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags)
 {
 	const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
@@ -3636,7 +3744,7 @@ static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
 		return false;
 
 	/* Can this filesystem be too revealing? */
-	s_iflags = mnt->mnt_sb->s_iflags;
+	s_iflags = sb->s_iflags;
 	if (!(s_iflags & SB_I_USERNS_VISIBLE))
 		return false;
 
@@ -3646,7 +3754,7 @@ static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
 		return true;
 	}
 
-	return !mnt_already_visible(ns, mnt, new_mnt_flags);
+	return !mnt_already_visible(ns, sb, new_mnt_flags);
 }
 
 bool mnt_may_suid(struct vfsmount *mnt)
diff --git a/fs/super.c b/fs/super.c
index c9d208b7999e..7c5541453081 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -36,6 +36,7 @@
 #include <linux/lockdep.h>
 #include <linux/user_namespace.h>
 #include <uapi/linux/mount.h>
+#include <linux/fs_context.h>
 #include "internal.h"
 
 static int thaw_super_locked(struct super_block *sb);
@@ -184,16 +185,13 @@ static void destroy_unused_super(struct super_block *s)
 }
 
 /**
- *	alloc_super	-	create new superblock
- *	@type:	filesystem type superblock should belong to
- *	@flags: the mount flags
- *	@user_ns: User namespace for the super_block
+ *	alloc_super - Create new superblock
+ *	@fc: The filesystem configuration context
  *
  *	Allocates and initializes a new &struct super_block.  alloc_super()
  *	returns a pointer new superblock or %NULL if allocation had failed.
  */
-static struct super_block *alloc_super(struct file_system_type *type, int flags,
-				       struct user_namespace *user_ns)
+static struct super_block *alloc_super(struct fs_context *fc)
 {
 	struct super_block *s = kzalloc(sizeof(struct super_block),  GFP_USER);
 	static const struct super_operations default_op;
@@ -203,9 +201,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 		return NULL;
 
 	INIT_LIST_HEAD(&s->s_mounts);
-	s->s_user_ns = get_user_ns(user_ns);
+	s->s_user_ns = get_user_ns(fc->user_ns);
 	init_rwsem(&s->s_umount);
-	lockdep_set_class(&s->s_umount, &type->s_umount_key);
+	lockdep_set_class(&s->s_umount, &fc->fs_type->s_umount_key);
 	/*
 	 * sget() can have s_umount recursion.
 	 *
@@ -229,12 +227,12 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	for (i = 0; i < SB_FREEZE_LEVELS; i++) {
 		if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
 					sb_writers_name[i],
-					&type->s_writers_key[i]))
+					&fc->fs_type->s_writers_key[i]))
 			goto fail;
 	}
 	init_waitqueue_head(&s->s_writers.wait_unfrozen);
 	s->s_bdi = &noop_backing_dev_info;
-	s->s_flags = flags;
+	s->s_flags = fc->sb_flags;
 	if (s->s_user_ns != &init_user_ns)
 		s->s_iflags |= SB_I_NODEV;
 	INIT_HLIST_NODE(&s->s_instances);
@@ -252,7 +250,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	s->s_count = 1;
 	atomic_set(&s->s_active, 1);
 	mutex_init(&s->s_vfs_rename_mutex);
-	lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);
+	lockdep_set_class(&s->s_vfs_rename_mutex, &fc->fs_type->s_vfs_rename_key);
 	init_rwsem(&s->s_dquot.dqio_sem);
 	s->s_maxbytes = MAX_NON_LFS;
 	s->s_op = &default_op;
@@ -472,6 +470,89 @@ void generic_shutdown_super(struct super_block *sb)
 
 EXPORT_SYMBOL(generic_shutdown_super);
 
+/**
+ * sget_fc - Find or create a superblock
+ * @fc:	Filesystem context.
+ * @test: Comparison callback
+ * @set: Setup callback
+ *
+ * Find or create a superblock using the parameters stored in the filesystem
+ * context and the two callback functions.
+ *
+ * If an extant superblock is matched, then that will be returned with an
+ * elevated reference count that the caller must transfer or discard.
+ *
+ * If no match is made, a new superblock will be allocated and basic
+ * initialisation will be performed (s_type, s_fs_info and s_id will be set and
+ * the set() callback will be invoked), the superblock will be published and it
+ * will be returned in a partially constructed state with SB_BORN and SB_ACTIVE
+ * as yet unset.
+ */
+struct super_block *sget_fc(struct fs_context *fc,
+			    int (*test)(struct super_block *, struct fs_context *),
+			    int (*set)(struct super_block *, struct fs_context *))
+{
+	struct super_block *s = NULL;
+	struct super_block *old;
+	int err;
+
+	if (!(fc->sb_flags & SB_KERNMOUNT) &&
+	    fc->purpose != FS_CONTEXT_FOR_SUBMOUNT) {
+		/* Don't allow mounting unless the caller has CAP_SYS_ADMIN
+		 * over the namespace.
+		 */
+		if (!(fc->fs_type->fs_flags & FS_USERNS_MOUNT) &&
+		    !capable(CAP_SYS_ADMIN))
+			return ERR_PTR(-EPERM);
+		else if (!ns_capable(fc->user_ns, CAP_SYS_ADMIN))
+			return ERR_PTR(-EPERM);
+	}
+
+retry:
+	spin_lock(&sb_lock);
+	if (test) {
+		hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
+			if (!test(old, fc))
+				continue;
+			if (fc->user_ns != old->s_user_ns) {
+				spin_unlock(&sb_lock);
+				destroy_unused_super(s);
+				return ERR_PTR(-EBUSY);
+			}
+			if (!grab_super(old))
+				goto retry;
+			destroy_unused_super(s);
+			return old;
+		}
+	}
+	if (!s) {
+		spin_unlock(&sb_lock);
+		s = alloc_super(fc);
+		if (!s)
+			return ERR_PTR(-ENOMEM);
+		goto retry;
+	}
+
+	s->s_fs_info = fc->s_fs_info;
+	err = set(s, fc);
+	if (err) {
+		s->s_fs_info = NULL;
+		spin_unlock(&sb_lock);
+		destroy_unused_super(s);
+		return ERR_PTR(err);
+	}
+	fc->s_fs_info = NULL;
+	s->s_type = fc->fs_type;
+	strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
+	list_add_tail(&s->s_list, &super_blocks);
+	hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
+	spin_unlock(&sb_lock);
+	get_filesystem(s->s_type);
+	register_shrinker_prepared(&s->s_shrink);
+	return s;
+}
+EXPORT_SYMBOL(sget_fc);
+
 /**
  *	sget_userns -	find or create a superblock
  *	@type:	filesystem type superblock should belong to
@@ -514,7 +595,14 @@ struct super_block *sget_userns(struct file_system_type *type,
 	}
 	if (!s) {
 		spin_unlock(&sb_lock);
-		s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns);
+		{
+			struct fs_context fc = {
+				.fs_type	= type,
+				.sb_flags	= flags & ~SB_SUBMOUNT,
+				.user_ns	= user_ns,
+			};
+			s = alloc_super(&fc);
+		}
 		if (!s)
 			return ERR_PTR(-ENOMEM);
 		goto retry;
@@ -838,11 +926,13 @@ struct super_block *user_get_super(dev_t dev)
  *	@data:	the rest of options
  *	@data_size: The size of the data
  *      @force: whether or not to force the change
+ *	@fc:	the superblock config for filesystems that support it
+ *		(NULL if called from emergency remount or umount)
  *
  *	Alters the mount options of a mounted file system.
  */
 int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
-		  size_t data_size, int force)
+		  size_t data_size, int force, struct fs_context *fc)
 {
 	int retval;
 	int remount_ro;
@@ -884,8 +974,17 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
 		}
 	}
 
-	if (sb->s_op->remount_fs) {
-		retval = sb->s_op->remount_fs(sb, &sb_flags, data, data_size);
+	if (sb->s_op->reconfigure ||
+	    sb->s_op->remount_fs) {
+		if (sb->s_op->reconfigure) {
+			retval = sb->s_op->reconfigure(sb, fc);
+			sb_flags = fc->sb_flags;
+			if (retval == 0)
+				security_sb_reconfigure(fc);
+		} else {
+			retval = sb->s_op->remount_fs(sb, &sb_flags,
+						      data, data_size);
+		}
 		if (retval) {
 			if (!force)
 				goto cancel_readonly;
@@ -924,7 +1023,7 @@ static void do_emergency_remount_callback(struct super_block *sb)
 		/*
 		 * What lock protects sb->s_flags??
 		 */
-		do_remount_sb(sb, SB_RDONLY, NULL, 0, 1);
+		do_remount_sb(sb, SB_RDONLY, NULL, 0, 1, NULL);
 	}
 	up_write(&sb->s_umount);
 }
@@ -1106,6 +1205,89 @@ struct dentry *mount_ns(struct file_system_type *fs_type,
 
 EXPORT_SYMBOL(mount_ns);
 
+int set_anon_super_fc(struct super_block *sb, struct fs_context *fc)
+{
+	return set_anon_super(sb, NULL);
+}
+EXPORT_SYMBOL(set_anon_super_fc);
+
+static int test_keyed_super(struct super_block *sb, struct fs_context *fc)
+{
+	return sb->s_fs_info == fc->s_fs_info;
+}
+
+static int test_single_super(struct super_block *s, struct fs_context *fc)
+{
+	return 1;
+}
+
+/**
+ * vfs_get_super - Get a superblock with a search key set in s_fs_info.
+ * @fc: The filesystem context holding the parameters
+ * @keying: How to distinguish superblocks
+ * @fill_super: Helper to initialise a new superblock
+ *
+ * Search for a superblock and create a new one if not found.  The search
+ * criterion is controlled by @keying.  If the search fails, a new superblock
+ * is created and @fill_super() is called to initialise it.
+ *
+ * @keying can take one of a number of values:
+ *
+ * (1) vfs_get_single_super - Only one superblock of this type may exist on the
+ *     system.  This is typically used for special system filesystems.
+ *
+ * (2) vfs_get_keyed_super - Multiple superblocks may exist, but they must have
+ *     distinct keys (where the key is in s_fs_info).  Searching for the same
+ *     key again will turn up the superblock for that key.
+ *
+ * (3) vfs_get_independent_super - Multiple superblocks may exist and are
+ *     unkeyed.  Each call will get a new superblock.
+ *
+ * A permissions check is made by sget_fc() unless we're getting a superblock
+ * for a kernel-internal mount or a submount.
+ */
+int vfs_get_super(struct fs_context *fc,
+		  enum vfs_get_super_keying keying,
+		  int (*fill_super)(struct super_block *sb,
+				    struct fs_context *fc))
+{
+	int (*test)(struct super_block *, struct fs_context *);
+	struct super_block *sb;
+
+	switch (keying) {
+	case vfs_get_single_super:
+		test = test_single_super;
+		break;
+	case vfs_get_keyed_super:
+		test = test_keyed_super;
+		break;
+	case vfs_get_independent_super:
+		test = NULL;
+		break;
+	default:
+		BUG();
+	}
+
+	sb = sget_fc(fc, test, set_anon_super_fc);
+	if (IS_ERR(sb))
+		return PTR_ERR(sb);
+
+	if (!sb->s_root) {
+		int err = fill_super(sb, fc);
+		if (err) {
+			deactivate_locked_super(sb);
+			return err;
+		}
+
+		sb->s_flags |= SB_ACTIVE;
+	}
+
+	BUG_ON(fc->root);
+	fc->root = dget(sb->s_root);
+	return 0;
+}
+EXPORT_SYMBOL(vfs_get_super);
+
 #ifdef CONFIG_BLOCK
 static int set_bdev_super(struct super_block *s, void *data)
 {
@@ -1254,7 +1436,7 @@ struct dentry *mount_single(struct file_system_type *fs_type,
 		}
 		s->s_flags |= SB_ACTIVE;
 	} else {
-		do_remount_sb(s, flags, data, data_size, 0);
+		do_remount_sb(s, flags, data, data_size, 0, NULL);
 	}
 	return dget(s->s_root);
 }
@@ -1601,3 +1783,90 @@ int thaw_super(struct super_block *sb)
 	return thaw_super_locked(sb);
 }
 EXPORT_SYMBOL(thaw_super);
+
+/**
+ * vfs_get_tree - Get the mountable root
+ * @fc: The superblock configuration context.
+ *
+ * The filesystem is invoked to get or create a superblock which can then later
+ * be used for mounting.  The filesystem places a pointer to the root to be
+ * used for mounting in @fc->root.
+ */
+int vfs_get_tree(struct fs_context *fc)
+{
+	struct super_block *sb;
+	int ret;
+
+	if (fc->fs_type->fs_flags & FS_REQUIRES_DEV && !fc->source)
+		return -ENOENT;
+
+	if (fc->root)
+		return -EBUSY;
+
+	if (fc->ops->validate) {
+		ret = fc->ops->validate(fc);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = security_fs_context_validate(fc);
+	if (ret < 0)
+		return ret;
+
+	/* Get the mountable root in fc->root, with a ref on the root and a ref
+	 * on the superblock.
+	 */
+	ret = fc->ops->get_tree(fc);
+	if (ret < 0)
+		return ret;
+
+	if (!fc->root) {
+		pr_err("Filesystem %s get_tree() didn't set fc->root\n",
+		       fc->fs_type->name);
+		/* We don't know what the locking state of the superblock is -
+		 * if there is a superblock.
+		 */
+		BUG();
+	}
+
+	sb = fc->root->d_sb;
+	WARN_ON(!sb->s_bdi);
+
+	ret = security_sb_get_tree(fc);
+	if (ret < 0)
+		goto err_sb;
+
+	ret = -ENOMEM;
+	if (fc->subtype && !sb->s_subtype) {
+		sb->s_subtype = kstrdup(fc->subtype, GFP_KERNEL);
+		if (!sb->s_subtype)
+			goto err_sb;
+	}
+
+	/* Write barrier is for super_cache_count(). We place it before setting
+	 * SB_BORN as the data dependency between the two functions is the
+	 * superblock structure contents that we just set up, not the SB_BORN
+	 * flag.
+	 */
+	smp_wmb();
+	sb->s_flags |= SB_BORN;
+
+	/* Filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
+	 * but s_maxbytes was an unsigned long long for many releases.  Throw
+	 * this warning for a little while to try and catch filesystems that
+	 * violate this rule.
+	 */
+	WARN(sb->s_maxbytes < 0,
+	     "%s set sb->s_maxbytes to negative value (%lld)\n",
+	     fc->fs_type->name, sb->s_maxbytes);
+
+	up_write(&sb->s_umount);
+	return 0;
+
+err_sb:
+	dput(fc->root);
+	fc->root = NULL;
+	deactivate_locked_super(sb);
+	return ret;
+}
+EXPORT_SYMBOL(vfs_get_tree);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 067f0e31aec7..88de0f586b38 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -61,6 +61,7 @@ struct workqueue_struct;
 struct iov_iter;
 struct fscrypt_info;
 struct fscrypt_operations;
+struct fs_context;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -723,6 +724,11 @@ static inline void inode_unlock(struct inode *inode)
 	up_write(&inode->i_rwsem);
 }
 
+static inline int inode_lock_killable(struct inode *inode)
+{
+	return down_write_killable(&inode->i_rwsem);
+}
+
 static inline void inode_lock_shared(struct inode *inode)
 {
 	down_read(&inode->i_rwsem);
@@ -1842,6 +1848,7 @@ struct super_operations {
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
 	int (*remount_fs) (struct super_block *, int *, char *, size_t);
+	int (*reconfigure) (struct super_block *, struct fs_context *);
 	void (*umount_begin) (struct super_block *);
 
 	int (*show_options)(struct seq_file *, struct dentry *);
@@ -2098,6 +2105,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
+	int (*init_fs_context)(struct fs_context *, struct dentry *);
 	struct dentry *(*mount) (struct file_system_type *, int,
 				 const char *, void *, size_t);
 	void (*kill_sb) (struct super_block *);
@@ -2154,8 +2162,12 @@ void kill_litter_super(struct super_block *sb);
 void deactivate_super(struct super_block *sb);
 void deactivate_locked_super(struct super_block *sb);
 int set_anon_super(struct super_block *s, void *data);
+int set_anon_super_fc(struct super_block *s, struct fs_context *fc);
 int get_anon_bdev(dev_t *);
 void free_anon_bdev(dev_t);
+struct super_block *sget_fc(struct fs_context *fc,
+			    int (*test)(struct super_block *, struct fs_context *),
+			    int (*set)(struct super_block *, struct fs_context *));
 struct super_block *sget_userns(struct file_system_type *type,
 			int (*test)(struct super_block *,void *),
 			int (*set)(struct super_block *,void *),
@@ -2198,8 +2210,8 @@ mount_pseudo(struct file_system_type *fs_type, char *name,
 
 extern int register_filesystem(struct file_system_type *);
 extern int unregister_filesystem(struct file_system_type *);
+extern struct vfsmount *kern_mount(struct file_system_type *);
 extern struct vfsmount *kern_mount_data(struct file_system_type *, void *, size_t);
-#define kern_mount(type) kern_mount_data(type, NULL, 0)
 extern void kern_unmount(struct vfsmount *mnt);
 extern int may_umount_tree(struct vfsmount *);
 extern int may_umount(struct vfsmount *);
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 0bde0a2a782e..f157ff935a1e 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -25,6 +25,7 @@ struct pid_namespace;
 struct super_block;
 struct user_namespace;
 struct vfsmount;
+struct path;
 
 enum fs_context_purpose {
 	FS_CONTEXT_FOR_USER_MOUNT,	/* New superblock for user-specified mount */
@@ -33,6 +34,19 @@ enum fs_context_purpose {
 	FS_CONTEXT_FOR_RECONFIGURE,	/* Superblock reconfiguration (remount) */
 };
 
+/*
+ * Userspace usage phase for fsopen/fspick.
+ */
+enum fs_context_phase {
+	FS_CONTEXT_CREATE_PARAMS,	/* Loading params for sb creation */
+	FS_CONTEXT_CREATING,		/* A superblock is being created */
+	FS_CONTEXT_AWAITING_MOUNT,	/* Superblock created, awaiting fsmount() */
+	FS_CONTEXT_AWAITING_RECONF,	/* Awaiting initialisation for reconfiguration */
+	FS_CONTEXT_RECONF_PARAMS,	/* Loading params for reconfiguration */
+	FS_CONTEXT_RECONFIGURING,	/* Reconfiguring the superblock */
+	FS_CONTEXT_FAILED,		/* Failed to correctly transition a context */
+};
+
 /*
  * Filesystem context for holding the parameters used in the creation or
  * reconfiguration of a superblock.
@@ -56,6 +70,7 @@ struct fs_context {
 	void			*s_fs_info;	/* Proposed s_fs_info */
 	unsigned int		sb_flags;	/* Proposed superblock flags (SB_*) */
 	enum fs_context_purpose	purpose:8;
+	enum fs_context_phase	phase:8;	/* The phase the context is in */
 	bool			sloppy:1;	/* T if unrecognised options are okay */
 	bool			silent:1;	/* T if "o silent" specified */
 };
@@ -65,9 +80,37 @@ struct fs_context_operations {
 	int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
 	int (*parse_source)(struct fs_context *fc, char *source);
 	int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
-	int (*parse_monolithic)(struct fs_context *fc, void *data);
+	int (*parse_monolithic)(struct fs_context *fc, void *data, size_t data_size);
 	int (*validate)(struct fs_context *fc);
 	int (*get_tree)(struct fs_context *fc);
 };
 
+/*
+ * fs_context manipulation functions.
+ */
+extern struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+					     struct dentry *reference,
+					     unsigned int ms_flags,
+					     enum fs_context_purpose purpose);
+extern struct fs_context *vfs_sb_reconfig(struct path *path, unsigned int ms_flags);
+extern struct fs_context *vfs_dup_fs_context(struct fs_context *src);
+extern int vfs_set_fs_source(struct fs_context *fc, const char *source, size_t len);
+extern int vfs_parse_fs_option(struct fs_context *fc, char *opt, size_t len);
+extern int generic_parse_monolithic(struct fs_context *fc, void *data, size_t data_size);
+extern int vfs_get_tree(struct fs_context *fc);
+extern void put_fs_context(struct fs_context *fc);
+
+/*
+ * sget() wrapper to be called from the ->get_tree() op.
+ */
+enum vfs_get_super_keying {
+	vfs_get_single_super,	/* Only one such superblock may exist */
+	vfs_get_keyed_super,	/* Superblocks with different s_fs_info keys may exist */
+	vfs_get_independent_super, /* Multiple independent superblocks may exist */
+};
+extern int vfs_get_super(struct fs_context *fc,
+			 enum vfs_get_super_keying keying,
+			 int (*fill_super)(struct super_block *sb,
+					   struct fs_context *fc));
+
 #endif /* _LINUX_FS_CONTEXT_H */
diff --git a/include/linux/mount.h b/include/linux/mount.h
index c9edd284f0af..41b6b080ffd0 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -21,6 +21,7 @@ struct super_block;
 struct vfsmount;
 struct dentry;
 struct mnt_namespace;
+struct fs_context;
 
 #define MNT_NOSUID	0x01
 #define MNT_NODEV	0x02
@@ -88,6 +89,8 @@ struct path;
 extern struct vfsmount *clone_private_mount(const struct path *path);
 
 struct file_system_type;
+extern struct vfsmount *vfs_create_mount(struct fs_context *fc,
+					 unsigned int mnt_flags);
 extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
 				      int flags, const char *name,
 				      void *data, size_t data_size);

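As a reading aid for the vfs_get_super() hunk above, the three keying modes
can be modelled in plain userspace C.  This is only a sketch under stated
assumptions: model_sb, model_fs_type, model_get_super() and the GET_*
constants are hypothetical stand-ins invented for illustration, not kernel
interfaces, and the model ignores locking, reference counting, the retry loop
and the permission check that sget_fc() performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
enum keying { GET_SINGLE, GET_KEYED, GET_INDEPENDENT };

struct model_sb {
	const void *key;		/* plays the role of s_fs_info */
	struct model_sb *next;
};

struct model_fs_type {
	struct model_sb *supers;	/* plays the role of fs_supers */
};

static int test_single(const struct model_sb *sb, const void *key)
{
	(void)sb;
	(void)key;
	return 1;			/* any existing superblock matches */
}

static int test_keyed(const struct model_sb *sb, const void *key)
{
	return sb->key == key;		/* match on the key pointer only */
}

/* Find a matching superblock or allocate a new one; NULL if out of memory. */
static struct model_sb *model_get_super(struct model_fs_type *type,
					enum keying keying, const void *key)
{
	int (*test)(const struct model_sb *, const void *) = NULL;
	struct model_sb *sb;

	assert(type != NULL);

	switch (keying) {
	case GET_SINGLE:
		test = test_single;
		break;
	case GET_KEYED:
		test = test_keyed;
		break;
	case GET_INDEPENDENT:
		break;			/* no test: always create */
	}

	if (test) {
		for (sb = type->supers; sb; sb = sb->next)
			if (test(sb, key))
				return sb;
	}

	sb = calloc(1, sizeof(*sb));
	if (!sb)
		return NULL;
	sb->key = key;
	sb->next = type->supers;
	type->supers = sb;
	return sb;
}
```

In this model, two GET_KEYED calls with the same key pointer return the same
model_sb, which mirrors test_keyed_super() comparing s_fs_info.  With
GET_INDEPENDENT the test is NULL, so every call allocates a fresh entry.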

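The do_remount_sb() hunk above prefers ->reconfigure() when the filesystem
provides it and otherwise falls back to ->remount_fs().  A minimal userspace
sketch of that dispatch follows.  All names here (model_fc, model_sb,
model_sops, model_remount() and the demo_* ops) are hypothetical and exist
only for illustration.  The NULL check on the context is an assumption of the
sketch and is not something the patch does, and the final flag update is
simplified.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel structures. */
struct model_fc {
	unsigned int sb_flags;		/* proposed superblock flags */
};

struct model_sb;

struct model_sops {
	int (*reconfigure)(struct model_sb *sb, struct model_fc *fc);
	int (*remount_fs)(struct model_sb *sb, int *flags, void *data);
};

struct model_sb {
	const struct model_sops *s_op;
	unsigned int s_flags;
};

/*
 * Prefer ->reconfigure() when present, else fall back to ->remount_fs().
 * The NULL check on fc is an assumption of this sketch, not part of the patch.
 */
static int model_remount(struct model_sb *sb, int sb_flags, void *data,
			 struct model_fc *fc)
{
	int ret = 0;

	assert(sb && sb->s_op);

	if (sb->s_op->reconfigure) {
		if (!fc)
			return -EINVAL;
		ret = sb->s_op->reconfigure(sb, fc);
		sb_flags = fc->sb_flags;	/* flags now come from the context */
	} else if (sb->s_op->remount_fs) {
		ret = sb->s_op->remount_fs(sb, &sb_flags, data);
	}
	if (ret)
		return ret;

	sb->s_flags = sb_flags;		/* simplified flag update */
	return 0;
}

/* Example ops used to exercise both paths. */
static int demo_reconfigure(struct model_sb *sb, struct model_fc *fc)
{
	(void)sb;
	fc->sb_flags |= 0x1;		/* the op may adjust the proposed flags */
	return 0;
}

static int demo_remount_fs(struct model_sb *sb, int *flags, void *data)
{
	(void)sb;
	(void)data;
	*flags |= 0x2;			/* legacy path edits the flags in place */
	return 0;
}
```

The point of the model is that on the ->reconfigure() path the resulting flags
are read back from the context, whereas the legacy path modifies the caller's
flags word through a pointer.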

* [PATCH 14/32] vfs: Remove unused code after filesystem context changes [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (12 preceding siblings ...)
  2018-07-10 22:42 ` [PATCH 13/32] vfs: Implement a filesystem superblock creation/configuration context " David Howells
@ 2018-07-10 22:43 ` David Howells
  2018-07-10 22:43 ` [PATCH 15/32] procfs: Move proc_fill_super() to fs/proc/root.c " David Howells
                   ` (23 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:43 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Remove code that is now unused after the filesystem context changes:
mount_fs() and the sb_kern_mount LSM hook, along with its SELinux and Smack
implementations.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/internal.h              |    2 -
 fs/super.c                 |   62 --------------------------------------------
 include/linux/lsm_hooks.h  |    3 --
 include/linux/security.h   |    7 -----
 security/security.c        |    5 ----
 security/selinux/hooks.c   |   20 --------------
 security/smack/smack_lsm.c |   33 -----------------------
 7 files changed, 132 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 13febddab0f8..f51805b9226d 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -102,8 +102,6 @@ extern struct file *get_empty_filp(void);
 extern int do_remount_sb(struct super_block *, int, void *, size_t, int,
 			 struct fs_context *);
 extern bool trylock_super(struct super_block *sb);
-extern struct dentry *mount_fs(struct file_system_type *,
-			       int, const char *, void *, size_t);
 extern struct super_block *user_get_super(dev_t);
 
 /*
diff --git a/fs/super.c b/fs/super.c
index 7c5541453081..bbef5a5057c0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1442,68 +1442,6 @@ struct dentry *mount_single(struct file_system_type *fs_type,
 }
 EXPORT_SYMBOL(mount_single);
 
-struct dentry *
-mount_fs(struct file_system_type *type, int flags, const char *name,
-	 void *data, size_t data_size)
-{
-	struct dentry *root;
-	struct super_block *sb;
-	char *secdata = NULL;
-	int error = -ENOMEM;
-
-	if (data && !(type->fs_flags & FS_BINARY_MOUNTDATA)) {
-		secdata = alloc_secdata();
-		if (!secdata)
-			goto out;
-
-		error = security_sb_copy_data(data, data_size, secdata);
-		if (error)
-			goto out_free_secdata;
-	}
-
-	root = type->mount(type, flags, name, data, data_size);
-	if (IS_ERR(root)) {
-		error = PTR_ERR(root);
-		goto out_free_secdata;
-	}
-	sb = root->d_sb;
-	BUG_ON(!sb);
-	WARN_ON(!sb->s_bdi);
-
-	/*
-	 * Write barrier is for super_cache_count(). We place it before setting
-	 * SB_BORN as the data dependency between the two functions is the
-	 * superblock structure contents that we just set up, not the SB_BORN
-	 * flag.
-	 */
-	smp_wmb();
-	sb->s_flags |= SB_BORN;
-
-	error = security_sb_kern_mount(sb, flags, secdata, data_size);
-	if (error)
-		goto out_sb;
-
-	/*
-	 * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
-	 * but s_maxbytes was an unsigned long long for many releases. Throw
-	 * this warning for a little while to try and catch filesystems that
-	 * violate this rule.
-	 */
-	WARN((sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
-		"negative value (%lld)\n", type->name, sb->s_maxbytes);
-
-	up_write(&sb->s_umount);
-	free_secdata(secdata);
-	return root;
-out_sb:
-	dput(root);
-	deactivate_locked_super(sb);
-out_free_secdata:
-	free_secdata(secdata);
-out:
-	return ERR_PTR(error);
-}
-
 /*
  * Setup private BDI for given superblock. It gets automatically cleaned up
  * in generic_shutdown_super().
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 7ff5a980399a..18cc2f8bd680 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1527,8 +1527,6 @@ union security_list_options {
 	void (*sb_free_security)(struct super_block *sb);
 	int (*sb_copy_data)(char *orig, size_t orig_size, char *copy);
 	int (*sb_remount)(struct super_block *sb, void *data, size_t data_size);
-	int (*sb_kern_mount)(struct super_block *sb, int flags,
-			     void *data, size_t data_size);
 	int (*sb_show_options)(struct seq_file *m, struct super_block *sb);
 	int (*sb_statfs)(struct dentry *dentry);
 	int (*sb_mount)(const char *dev_name, const struct path *path,
@@ -1877,7 +1875,6 @@ struct security_hook_heads {
 	struct hlist_head sb_free_security;
 	struct hlist_head sb_copy_data;
 	struct hlist_head sb_remount;
-	struct hlist_head sb_kern_mount;
 	struct hlist_head sb_show_options;
 	struct hlist_head sb_statfs;
 	struct hlist_head sb_mount;
diff --git a/include/linux/security.h b/include/linux/security.h
index 93964808da59..ca327639ee96 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -240,7 +240,6 @@ int security_sb_alloc(struct super_block *sb);
 void security_sb_free(struct super_block *sb);
 int security_sb_copy_data(char *orig, size_t orig_size, char *copy);
 int security_sb_remount(struct super_block *sb, void *data, size_t data_size);
-int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size);
 int security_sb_show_options(struct seq_file *m, struct super_block *sb);
 int security_sb_statfs(struct dentry *dentry);
 int security_sb_mount(const char *dev_name, const struct path *path,
@@ -593,12 +592,6 @@ static inline int security_sb_remount(struct super_block *sb, void *data, size_t
 	return 0;
 }
 
-static inline int security_sb_kern_mount(struct super_block *sb, int flags,
-					 void *data, size_t data_size)
-{
-	return 0;
-}
-
 static inline int security_sb_show_options(struct seq_file *m,
 					   struct super_block *sb)
 {
diff --git a/security/security.c b/security/security.c
index 27a5fb308d20..ab1c02268e98 100644
--- a/security/security.c
+++ b/security/security.c
@@ -425,11 +425,6 @@ int security_sb_remount(struct super_block *sb, void *data, size_t data_size)
 	return call_int_hook(sb_remount, 0, sb, data, data_size);
 }
 
-int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
-{
-	return call_int_hook(sb_kern_mount, 0, sb, flags, data, data_size);
-}
-
 int security_sb_show_options(struct seq_file *m, struct super_block *sb)
 {
 	return call_int_hook(sb_show_options, 0, m, sb);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 189f5284fc3f..9c2754c98ff2 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2922,25 +2922,6 @@ static int selinux_sb_remount(struct super_block *sb, void *data, size_t data_si
 	goto out_free_opts;
 }
 
-static int selinux_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
-{
-	const struct cred *cred = current_cred();
-	struct common_audit_data ad;
-	int rc;
-
-	rc = superblock_doinit(sb, data);
-	if (rc)
-		return rc;
-
-	/* Allow all mounts performed by the kernel */
-	if (flags & MS_KERNMOUNT)
-		return 0;
-
-	ad.type = LSM_AUDIT_DATA_DENTRY;
-	ad.u.dentry = sb->s_root;
-	return superblock_has_perm(cred, sb, FILESYSTEM__MOUNT, &ad);
-}
-
 static int selinux_sb_statfs(struct dentry *dentry)
 {
 	const struct cred *cred = current_cred();
@@ -7174,7 +7155,6 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
 	LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
 	LSM_HOOK_INIT(sb_remount, selinux_sb_remount),
-	LSM_HOOK_INIT(sb_kern_mount, selinux_sb_kern_mount),
 	LSM_HOOK_INIT(sb_show_options, selinux_sb_show_options),
 	LSM_HOOK_INIT(sb_statfs, selinux_sb_statfs),
 	LSM_HOOK_INIT(sb_mount, selinux_mount),
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index fb55a16a484c..7ebaf48dcb65 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -1150,38 +1150,6 @@ static int smack_set_mnt_opts(struct super_block *sb,
 	return 0;
 }
 
-/**
- * smack_sb_kern_mount - Smack specific mount processing
- * @sb: the file system superblock
- * @flags: the mount flags
- * @data: the smack mount options
- *
- * Returns 0 on success, an error code on failure
- */
-static int smack_sb_kern_mount(struct super_block *sb, int flags,
-			       void *data, size_t data_size)
-{
-	int rc = 0;
-	char *options = data;
-	struct security_mnt_opts opts;
-
-	security_init_mnt_opts(&opts);
-
-	if (!options)
-		goto out;
-
-	rc = smack_parse_opts_str(options, &opts);
-	if (rc)
-		goto out_err;
-
-out:
-	rc = smack_set_mnt_opts(sb, &opts, 0, NULL);
-
-out_err:
-	security_free_mnt_opts(&opts);
-	return rc;
-}
-
 /**
  * smack_sb_statfs - Smack check on statfs
  * @dentry: identifies the file system in question
@@ -4961,7 +4929,6 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(sb_alloc_security, smack_sb_alloc_security),
 	LSM_HOOK_INIT(sb_free_security, smack_sb_free_security),
 	LSM_HOOK_INIT(sb_copy_data, smack_sb_copy_data),
-	LSM_HOOK_INIT(sb_kern_mount, smack_sb_kern_mount),
 	LSM_HOOK_INIT(sb_statfs, smack_sb_statfs),
 	LSM_HOOK_INIT(sb_set_mnt_opts, smack_set_mnt_opts),
 	LSM_HOOK_INIT(sb_parse_opts_str, smack_parse_opts_str),



* [PATCH 15/32] procfs: Move proc_fill_super() to fs/proc/root.c [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (13 preceding siblings ...)
  2018-07-10 22:43 ` [PATCH 14/32] vfs: Remove unused code after filesystem context changes " David Howells
@ 2018-07-10 22:43 ` David Howells
  2018-07-10 22:43 ` [PATCH 16/32] proc: Add fs_context support to procfs " David Howells
                   ` (22 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:43 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

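Move proc_fill_super() to fs/proc/root.c as that's where the rest of the
procfs superblock setup code lives.  As a consequence, proc_fill_super() and
proc_parse_options() become static and proc_sops is made non-static so that
it can still be referenced from root.c.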

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/proc/inode.c    |   49 +------------------------------------------------
 fs/proc/internal.h |    4 +---
 fs/proc/root.c     |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 49 insertions(+), 52 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index faf401935fa9..c5e7bbf81e10 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -24,7 +24,6 @@
 #include <linux/seq_file.h>
 #include <linux/slab.h>
 #include <linux/mount.h>
-#include <linux/magic.h>
 
 #include <linux/uaccess.h>
 
@@ -122,7 +121,7 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
 	return 0;
 }
 
-static const struct super_operations proc_sops = {
+const struct super_operations proc_sops = {
 	.alloc_inode	= proc_alloc_inode,
 	.destroy_inode	= proc_destroy_inode,
 	.drop_inode	= generic_delete_inode,
@@ -488,49 +487,3 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 	       pde_put(de);
 	return inode;
 }
-
-int proc_fill_super(struct super_block *s, void *data, size_t data_size,
-		    int silent)
-{
-	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
-	struct inode *root_inode;
-	int ret;
-
-	if (!proc_parse_options(data, ns))
-		return -EINVAL;
-
-	/* User space would break if executables or devices appear on proc */
-	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
-	s->s_flags |= SB_NODIRATIME | SB_NOSUID | SB_NOEXEC;
-	s->s_blocksize = 1024;
-	s->s_blocksize_bits = 10;
-	s->s_magic = PROC_SUPER_MAGIC;
-	s->s_op = &proc_sops;
-	s->s_time_gran = 1;
-
-	/*
-	 * procfs isn't actually a stacking filesystem; however, there is
-	 * too much magic going on inside it to permit stacking things on
-	 * top of it
-	 */
-	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-	
-	pde_get(&proc_root);
-	root_inode = proc_get_inode(s, &proc_root);
-	if (!root_inode) {
-		pr_err("proc_fill_super: get root inode failed\n");
-		return -ENOMEM;
-	}
-
-	s->s_root = d_make_root(root_inode);
-	if (!s->s_root) {
-		pr_err("proc_fill_super: allocate dentry failed\n");
-		return -ENOMEM;
-	}
-
-	ret = proc_setup_self(s);
-	if (ret) {
-		return ret;
-	}
-	return proc_setup_thread_self(s);
-}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 841b4391deb6..bfe2bea2c71d 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -208,13 +208,12 @@ struct pde_opener {
 	struct completion *c;
 } __randomize_layout;
 extern const struct inode_operations proc_link_inode_operations;
-
 extern const struct inode_operations proc_pid_link_inode_operations;
+extern const struct super_operations proc_sops;
 
 void proc_init_kmemcache(void);
 void set_proc_pid_nlink(void);
 extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
-extern int proc_fill_super(struct super_block *, void *, size_t, int);
 extern void proc_entry_rundown(struct proc_dir_entry *);
 
 /*
@@ -272,7 +271,6 @@ static inline void proc_tty_init(void) {}
  * root.c
  */
 extern struct proc_dir_entry proc_root;
-extern int proc_parse_options(char *options, struct pid_namespace *pid);
 
 extern void proc_self_init(void);
 extern int proc_remount(struct super_block *, int *, char *, size_t);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 28fadb0c51ab..15da85cefd3f 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,6 +23,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/parser.h>
 #include <linux/cred.h>
+#include <linux/magic.h>
 
 #include "internal.h"
 
@@ -36,7 +37,7 @@ static const match_table_t tokens = {
 	{Opt_err, NULL},
 };
 
-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_parse_options(char *options, struct pid_namespace *pid)
 {
 	char *p;
 	substring_t args[MAX_OPT_ARGS];
@@ -78,6 +79,51 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 	return 1;
 }
 
+static int proc_fill_super(struct super_block *s, void *data, size_t data_size, int silent)
+{
+	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+	struct inode *root_inode;
+	int ret;
+
+	if (!proc_parse_options(data, ns))
+		return -EINVAL;
+
+	/* User space would break if executables or devices appear on proc */
+	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
+	s->s_flags |= SB_NODIRATIME | SB_NOSUID | SB_NOEXEC;
+	s->s_blocksize = 1024;
+	s->s_blocksize_bits = 10;
+	s->s_magic = PROC_SUPER_MAGIC;
+	s->s_op = &proc_sops;
+	s->s_time_gran = 1;
+
+	/*
+	 * procfs isn't actually a stacking filesystem; however, there is
+	 * too much magic going on inside it to permit stacking things on
+	 * top of it
+	 */
+	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
+	
+	pde_get(&proc_root);
+	root_inode = proc_get_inode(s, &proc_root);
+	if (!root_inode) {
+		pr_err("proc_fill_super: get root inode failed\n");
+		return -ENOMEM;
+	}
+
+	s->s_root = d_make_root(root_inode);
+	if (!s->s_root) {
+		pr_err("proc_fill_super: allocate dentry failed\n");
+		return -ENOMEM;
+	}
+
+	ret = proc_setup_self(s);
+	if (ret) {
+		return ret;
+	}
+	return proc_setup_thread_self(s);
+}
+
 int proc_remount(struct super_block *sb, int *flags,
 		 char *data, size_t data_size)
 {



* [PATCH 16/32] proc: Add fs_context support to procfs [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (14 preceding siblings ...)
  2018-07-10 22:43 ` [PATCH 15/32] procfs: Move proc_fill_super() to fs/proc/root.c " David Howells
@ 2018-07-10 22:43 ` David Howells
  2018-07-10 22:43 ` [PATCH 17/32] ipc: Convert mqueue fs to fs_context " David Howells
                   ` (21 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:43 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Add fs_context support to procfs.
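
The conversion splits option handling into two stages: each option is
validated and recorded in the context, and the recorded values are only
copied into the pid namespace once the superblock is being set up or
reconfigured.  A minimal userspace sketch of that pattern follows.  All
demo_* names are hypothetical stand-ins and none of this is kernel API; it
only models the mask-of-seen-options idea:

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-ins; names are illustrative only. */
enum { OPT_GID, OPT_HIDEPID };

struct demo_ctx {
	unsigned long	mask;		/* which options were supplied */
	int		hidepid;
	int		gid;
};

struct demo_ns {
	int		hide_pid;
	int		pid_gid;
};

/* Parse "key=<int>"; returns 0 and stores the value on success. */
static int demo_parse_int(const char *opt, const char *key, int *out)
{
	size_t klen = strlen(key);
	char *end;
	long v;

	if (strncmp(opt, key, klen) != 0 || opt[klen] != '=')
		return -1;
	if (opt[klen + 1] == '\0')
		return -1;
	v = strtol(opt + klen + 1, &end, 10);
	if (*end != '\0')
		return -1;
	*out = (int)v;
	return 0;
}

/* Stage 1: validate one option and record it in the context only. */
static int demo_parse_option(struct demo_ctx *ctx, const char *opt)
{
	int v;

	if (demo_parse_int(opt, "gid", &v) == 0) {
		ctx->gid = v;
		ctx->mask |= 1UL << OPT_GID;
		return 0;
	}
	if (demo_parse_int(opt, "hidepid", &v) == 0) {
		if (v < 0 || v > 2)
			return -EINVAL;
		ctx->hidepid = v;
		ctx->mask |= 1UL << OPT_HIDEPID;
		return 0;
	}
	return -EINVAL;
}

/* Stage 2: copy only the options that were actually supplied. */
static void demo_set_options(struct demo_ns *ns, const struct demo_ctx *ctx)
{
	if (ctx->mask & (1UL << OPT_GID))
		ns->pid_gid = ctx->gid;
	if (ctx->mask & (1UL << OPT_HIDEPID))
		ns->hide_pid = ctx->hidepid;
}
```

Because stage 2 consults the mask, an option that was never supplied
leaves the existing namespace setting untouched, and a parse failure in
stage 1 cannot leave the namespace half-updated.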

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/proc/inode.c    |    2 -
 fs/proc/internal.h |    2 -
 fs/proc/root.c     |  179 ++++++++++++++++++++++++++++++++++------------------
 3 files changed, 120 insertions(+), 63 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index c5e7bbf81e10..38155bec4a54 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -127,7 +127,7 @@ const struct super_operations proc_sops = {
 	.drop_inode	= generic_delete_inode,
 	.evict_inode	= proc_evict_inode,
 	.statfs		= simple_statfs,
-	.remount_fs	= proc_remount,
+	.reconfigure	= proc_reconfigure,
 	.show_options	= proc_show_options,
 };
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index bfe2bea2c71d..ea8c5468eafc 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -273,7 +273,7 @@ static inline void proc_tty_init(void) {}
 extern struct proc_dir_entry proc_root;
 
 extern void proc_self_init(void);
-extern int proc_remount(struct super_block *, int *, char *, size_t);
+extern int proc_reconfigure(struct super_block *, struct fs_context *);
 
 /*
  * task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 15da85cefd3f..efbdc08a3c86 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -19,14 +19,23 @@
 #include <linux/module.h>
 #include <linux/bitops.h>
 #include <linux/user_namespace.h>
+#include <linux/fs_context.h>
 #include <linux/mount.h>
 #include <linux/pid_namespace.h>
 #include <linux/parser.h>
 #include <linux/cred.h>
 #include <linux/magic.h>
+#include <linux/slab.h>
 
 #include "internal.h"
 
+struct proc_fs_context {
+	struct pid_namespace	*pid_ns;
+	unsigned long		mask;
+	int			hidepid;
+	int			gid;
+};
+
 enum {
 	Opt_gid, Opt_hidepid, Opt_err,
 };
@@ -37,56 +46,60 @@ static const match_table_t tokens = {
 	{Opt_err, NULL},
 };
 
-static int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_parse_option(struct fs_context *fc, char *opt, size_t len)
 {
-	char *p;
+	struct proc_fs_context *ctx = fc->fs_private;
 	substring_t args[MAX_OPT_ARGS];
-	int option;
-
-	if (!options)
-		return 1;
-
-	while ((p = strsep(&options, ",")) != NULL) {
-		int token;
-		if (!*p)
-			continue;
-
-		args[0].to = args[0].from = NULL;
-		token = match_token(p, tokens, args);
-		switch (token) {
-		case Opt_gid:
-			if (match_int(&args[0], &option))
-				return 0;
-			pid->pid_gid = make_kgid(current_user_ns(), option);
-			break;
-		case Opt_hidepid:
-			if (match_int(&args[0], &option))
-				return 0;
-			if (option < HIDEPID_OFF ||
-			    option > HIDEPID_INVISIBLE) {
-				pr_err("proc: hidepid value must be between 0 and 2.\n");
-				return 0;
-			}
-			pid->hide_pid = option;
-			break;
-		default:
-			pr_err("proc: unrecognized mount option \"%s\" "
-			       "or missing value\n", p);
-			return 0;
+	int token;
+
+	args[0].to = args[0].from = NULL;
+	token = match_token(opt, tokens, args);
+	switch (token) {
+	case Opt_gid:
+		if (match_int(&args[0], &ctx->gid))
+			return -EINVAL;
+		break;
+
+	case Opt_hidepid:
+		if (match_int(&args[0], &ctx->hidepid))
+			return -EINVAL;
+		if (ctx->hidepid < HIDEPID_OFF ||
+		    ctx->hidepid > HIDEPID_INVISIBLE) {
+			pr_err("proc: hidepid value must be between 0 and 2.\n");
+			return -EINVAL;
 		}
+		break;
+
+	default:
+		pr_err("proc: unrecognized mount option \"%s\" or missing value\n",
+		       opt);
+		return -EINVAL;
 	}
 
-	return 1;
+	ctx->mask |= 1 << token;
+	return 0;
+}
+
+static void proc_set_options(struct super_block *s,
+			     struct fs_context *fc,
+			     struct pid_namespace *pid_ns,
+			     struct user_namespace *user_ns)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+
+	if (ctx->mask & (1 << Opt_gid))
+		pid_ns->pid_gid = make_kgid(user_ns, ctx->gid);
+	if (ctx->mask & (1 << Opt_hidepid))
+		pid_ns->hide_pid = ctx->hidepid;
 }
 
-static int proc_fill_super(struct super_block *s, void *data, size_t data_size, int silent)
+static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 {
-	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+	struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
 	struct inode *root_inode;
 	int ret;
 
-	if (!proc_parse_options(data, ns))
-		return -EINVAL;
+	proc_set_options(s, fc, pid_ns, current_user_ns());
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -103,7 +116,7 @@ static int proc_fill_super(struct super_block *s, void *data, size_t data_size,
 	 * top of it
 	 */
 	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-	
+
 	pde_get(&proc_root);
 	root_inode = proc_get_inode(s, &proc_root);
 	if (!root_inode) {
@@ -124,30 +137,52 @@ static int proc_fill_super(struct super_block *s, void *data, size_t data_size,
 	return proc_setup_thread_self(s);
 }
 
-int proc_remount(struct super_block *sb, int *flags,
-		 char *data, size_t data_size)
+int proc_reconfigure(struct super_block *sb, struct fs_context *fc)
 {
 	struct pid_namespace *pid = sb->s_fs_info;
 
 	sync_filesystem(sb);
-	return !proc_parse_options(data, pid);
+
+	if (fc)
+		proc_set_options(sb, fc, pid, current_user_ns());
+	return 0;
 }
 
-static struct dentry *proc_mount(struct file_system_type *fs_type,
-				 int flags, const char *dev_name,
-				 void *data, size_t data_size)
+static int proc_get_tree(struct fs_context *fc)
 {
-	struct pid_namespace *ns;
+	struct proc_fs_context *ctx = fc->fs_private;
 
-	if (flags & SB_KERNMOUNT) {
-		ns = data;
-		data = NULL;
-	} else {
-		ns = task_active_pid_ns(current);
-	}
+	fc->s_fs_info = ctx->pid_ns;
+	return vfs_get_super(fc, vfs_get_keyed_super, proc_fill_super);
+}
+
+static void proc_fs_context_free(struct fs_context *fc)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+
+	if (ctx->pid_ns)
+		put_pid_ns(ctx->pid_ns);
+	kfree(ctx);
+}
+
+static const struct fs_context_operations proc_fs_context_ops = {
+	.free		= proc_fs_context_free,
+	.parse_option	= proc_parse_option,
+	.get_tree	= proc_get_tree,
+};
 
-	return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
-			proc_fill_super);
+static int proc_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+	struct proc_fs_context *ctx;
+
+	ctx = kzalloc(sizeof(struct proc_fs_context), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->fs_private = ctx;
+	fc->ops = &proc_fs_context_ops;
+	return 0;
 }
 
 static void proc_kill_sb(struct super_block *sb)
@@ -164,10 +199,10 @@ static void proc_kill_sb(struct super_block *sb)
 }
 
 static struct file_system_type proc_fs_type = {
-	.name		= "proc",
-	.mount		= proc_mount,
-	.kill_sb	= proc_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.name			= "proc",
+	.init_fs_context	= proc_init_fs_context,
+	.kill_sb		= proc_kill_sb,
+	.fs_flags		= FS_USERNS_MOUNT,
 };
 
 void __init proc_root_init(void)
@@ -205,7 +240,7 @@ static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentr
 {
 	if (!proc_pid_lookup(dir, dentry, flags))
 		return NULL;
-	
+
 	return proc_lookup(dir, dentry, flags);
 }
 
@@ -258,9 +293,31 @@ struct proc_dir_entry proc_root = {
 
 int pid_ns_prepare_proc(struct pid_namespace *ns)
 {
+	struct proc_fs_context *ctx;
+	struct fs_context *fc;
 	struct vfsmount *mnt;
+	int ret;
+
+	fc = vfs_new_fs_context(&proc_fs_type, NULL, 0,
+				FS_CONTEXT_FOR_KERNEL_MOUNT);
+	if (IS_ERR(fc))
+		return PTR_ERR(fc);
+
+	ctx = fc->fs_private;
+	if (ctx->pid_ns != ns) {
+		put_pid_ns(ctx->pid_ns);
+		get_pid_ns(ns);
+		ctx->pid_ns = ns;
+	}
+
+	ret = vfs_get_tree(fc);
+	if (ret < 0) {
+		put_fs_context(fc);
+		return ret;
+	}
 
-	mnt = kern_mount_data(&proc_fs_type, ns, 0);
+	mnt = vfs_create_mount(fc, 0);
+	put_fs_context(fc);
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 



* [PATCH 17/32] ipc: Convert mqueue fs to fs_context [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (15 preceding siblings ...)
  2018-07-10 22:43 ` [PATCH 16/32] proc: Add fs_context support to procfs " David Howells
@ 2018-07-10 22:43 ` David Howells
  2018-07-10 22:43 ` [PATCH 18/32] cpuset: Use " David Howells
                   ` (20 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:43 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Convert the mqueue filesystem to use the filesystem context stuff.

Notes:

 (1) The relevant ipc namespace is selected when the context is
     initialised (and it defaults to the current task's ipc namespace).
     The caller can override this before calling vfs_get_tree().

 (2) Rather than simply calling kern_mount_data(), mq_init_ns() and
     mq_internal_mount() create a context, adjust it and then do the rest
     of the mount procedure.

 (3) The lazy mqueue mounting on creation of a new namespace is retained
     from a previous patch, but the avoidance of sget() if no superblock
     yet exists is reverted and the superblock is again keyed on the
     namespace pointer.

     Yes, there was a performance gain in not searching the superblock
     hash, but it's only paid once per ipc namespace - and only if someone
     uses mqueue within that namespace, so I'm not sure it's worth it,
     especially as calling sget() allows avoidance of recursion.
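
As a rough illustration of note (1), the context takes a reference on a
default namespace when it is created, and an in-kernel caller may swap in
another namespace before the tree is obtained.  The sketch below is a
userspace model with hypothetical demo_* names and a plain integer in
place of the real refcount; it is not the mqueue code itself:

```c
#include <stdlib.h>

/* Hypothetical namespace with a plain (non-atomic) reference count. */
struct demo_ns {
	int refs;
};

static struct demo_ns *demo_get_ns(struct demo_ns *ns)
{
	ns->refs++;
	return ns;
}

static void demo_put_ns(struct demo_ns *ns)
{
	ns->refs--;
}

/* Hypothetical context that pins the namespace it will mount for. */
struct demo_fc {
	struct demo_ns *ns;
};

/* Context creation defaults to the caller's current namespace. */
static struct demo_fc *demo_new_context(struct demo_ns *current_ns)
{
	struct demo_fc *fc = calloc(1, sizeof(*fc));

	if (!fc)
		return NULL;
	fc->ns = demo_get_ns(current_ns);
	return fc;
}

/*
 * Override the default before the tree is obtained.  The new reference
 * is taken before the old one is dropped so that passing the namespace
 * the context already holds is harmless.
 */
static void demo_override_ns(struct demo_fc *fc, struct demo_ns *ns)
{
	struct demo_ns *old = fc->ns;

	fc->ns = demo_get_ns(ns);
	demo_put_ns(old);
}

/* The destructor drops whichever namespace the context ended up with. */
static void demo_put_context(struct demo_fc *fc)
{
	if (fc->ns)
		demo_put_ns(fc->ns);
	free(fc);
}
```

Keeping the reference inside the context means every exit path can just
release the context and let its destructor drop the namespace.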

Signed-off-by: David Howells <dhowells@redhat.com>
---

 ipc/mqueue.c |  121 +++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 99 insertions(+), 22 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 4671d215cb84..0f102210f89e 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -18,6 +18,7 @@
 #include <linux/pagemap.h>
 #include <linux/file.h>
 #include <linux/mount.h>
+#include <linux/fs_context.h>
 #include <linux/namei.h>
 #include <linux/sysctl.h>
 #include <linux/poll.h>
@@ -42,6 +43,10 @@
 #include <net/sock.h>
 #include "util.h"
 
+struct mqueue_fs_context {
+	struct ipc_namespace	*ipc_ns;
+};
+
 #define MQUEUE_MAGIC	0x19800202
 #define DIRENT_SIZE	20
 #define FILENT_SIZE	80
@@ -87,9 +92,11 @@ struct mqueue_inode_info {
 	unsigned long qsize; /* size of queue in memory (sum of all msgs) */
 };
 
+static struct file_system_type mqueue_fs_type;
 static const struct inode_operations mqueue_dir_inode_operations;
 static const struct file_operations mqueue_file_operations;
 static const struct super_operations mqueue_super_ops;
+static const struct fs_context_operations mqueue_fs_context_ops;
 static void remove_notification(struct mqueue_inode_info *info);
 
 static struct kmem_cache *mqueue_inode_cachep;
@@ -322,7 +329,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
 	return ERR_PTR(ret);
 }
 
-static int mqueue_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
+static int mqueue_fill_super(struct super_block *sb, struct fs_context *fc)
 {
 	struct inode *inode;
 	struct ipc_namespace *ns = sb->s_fs_info;
@@ -343,19 +350,84 @@ static int mqueue_fill_super(struct super_block *sb, void *data, size_t data_siz
 	return 0;
 }
 
-static struct dentry *mqueue_mount(struct file_system_type *fs_type,
-			 int flags, const char *dev_name,
-			 void *data, size_t data_size)
+static int mqueue_get_tree(struct fs_context *fc)
 {
-	struct ipc_namespace *ns;
-	if (flags & SB_KERNMOUNT) {
-		ns = data;
-		data = NULL;
-	} else {
-		ns = current->nsproxy->ipc_ns;
+	struct mqueue_fs_context *ctx = fc->fs_private;
+
+	/* As a shortcut, if the namespace already has a superblock created,
+	 * use the root from that directly rather than invoking sget() again.
+	 */
+	spin_lock(&mq_lock);
+	if (ctx->ipc_ns->mq_mnt) {
+		fc->root = dget(ctx->ipc_ns->mq_mnt->mnt_sb->s_root);
+		atomic_inc(&fc->root->d_sb->s_active);
 	}
-	return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
-			mqueue_fill_super);
+	spin_unlock(&mq_lock);
+	if (fc->root) {
+		down_write(&fc->root->d_sb->s_umount);
+		return 0;
+	}
+
+	fc->s_fs_info = ctx->ipc_ns;
+	return vfs_get_super(fc, vfs_get_keyed_super, mqueue_fill_super);
+}
+
+static void mqueue_fs_context_free(struct fs_context *fc)
+{
+	struct mqueue_fs_context *ctx = fc->fs_private;
+
+	if (ctx->ipc_ns)
+		put_ipc_ns(ctx->ipc_ns);
+	kfree(ctx);
+}
+
+static int mqueue_init_fs_context(struct fs_context *fc,
+				  struct dentry *reference)
+{
+	struct mqueue_fs_context *ctx;
+
+	ctx = kzalloc(sizeof(struct mqueue_fs_context), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->ipc_ns = get_ipc_ns(current->nsproxy->ipc_ns);
+	fc->fs_private = ctx;
+	fc->ops = &mqueue_fs_context_ops;
+	return 0;
+}
+
+static struct vfsmount *mq_create_mount(struct ipc_namespace *ns)
+{
+	struct mqueue_fs_context *ctx;
+	struct fs_context *fc;
+	struct vfsmount *mnt;
+	int ret;
+
+	fc = vfs_new_fs_context(&mqueue_fs_type, NULL, 0,
+				FS_CONTEXT_FOR_KERNEL_MOUNT);
+	if (IS_ERR(fc))
+		return ERR_CAST(fc);
+
+	ctx = fc->fs_private;
+	put_ipc_ns(ctx->ipc_ns);
+	ctx->ipc_ns = get_ipc_ns(ns);
+
+	ret = vfs_get_tree(fc);
+	if (ret < 0)
+		goto err_fc;
+
+	mnt = vfs_create_mount(fc, 0);
+	if (IS_ERR(mnt)) {
+		ret = PTR_ERR(mnt);
+		goto err_fc;
+	}
+
+	put_fs_context(fc);
+	return mnt;
+
+err_fc:
+	put_fs_context(fc);
+	return ERR_PTR(ret);
 }
 
 static void init_once(void *foo)
@@ -1523,15 +1595,22 @@ static const struct super_operations mqueue_super_ops = {
 	.statfs = simple_statfs,
 };
 
+static const struct fs_context_operations mqueue_fs_context_ops = {
+	.free		= mqueue_fs_context_free,
+	.get_tree	= mqueue_get_tree,
+};
+
 static struct file_system_type mqueue_fs_type = {
-	.name = "mqueue",
-	.mount = mqueue_mount,
-	.kill_sb = kill_litter_super,
-	.fs_flags = FS_USERNS_MOUNT,
+	.name			= "mqueue",
+	.init_fs_context	= mqueue_init_fs_context,
+	.kill_sb		= kill_litter_super,
+	.fs_flags		= FS_USERNS_MOUNT,
 };
 
 int mq_init_ns(struct ipc_namespace *ns)
 {
+	struct vfsmount *m;
+
 	ns->mq_queues_count  = 0;
 	ns->mq_queues_max    = DFLT_QUEUESMAX;
 	ns->mq_msg_max       = DFLT_MSGMAX;
@@ -1539,12 +1618,10 @@ int mq_init_ns(struct ipc_namespace *ns)
 	ns->mq_msg_default   = DFLT_MSG;
 	ns->mq_msgsize_default  = DFLT_MSGSIZE;
 
-	ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns, 0);
-	if (IS_ERR(ns->mq_mnt)) {
-		int err = PTR_ERR(ns->mq_mnt);
-		ns->mq_mnt = NULL;
-		return err;
-	}
+	m = mq_create_mount(ns);
+	if (IS_ERR(m))
+		return PTR_ERR(m);
+	ns->mq_mnt = m;
 	return 0;
 }
 



* [PATCH 18/32] cpuset: Use fs_context [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (16 preceding siblings ...)
  2018-07-10 22:43 ` [PATCH 17/32] ipc: Convert mqueue fs to fs_context " David Howells
@ 2018-07-10 22:43 ` David Howells
  2018-07-10 22:43 ` [PATCH 19/32] kernfs, sysfs, cgroup, intel_rdt: Support " David Howells
                   ` (19 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:43 UTC (permalink / raw)
  To: viro; +Cc: Tejun Heo, dhowells, linux-fsdevel, torvalds, linux-kernel

Make the cpuset filesystem use the filesystem context.  This is potentially
tricky as the cpuset fs is almost an alias for the cgroup filesystem, but
with some special parameters.

This can, however, be handled by setting up an appropriate cgroup
filesystem and returning the root directory of that as the root dir of this
one.
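
The delegation can be pictured as the outer filesystem feeding a fixed
option string to the inner (cgroup) context one option at a time.  The
sketch below is a userspace model under that assumption; the demo_* names
are hypothetical, and the comma splitting is only a simplified stand-in
for what a monolithic-option parser does:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical inner context that just records what it was given. */
struct demo_inner {
	int nr_opts;
	int saw_noprefix;
};

static int demo_inner_parse_option(struct demo_inner *in, const char *opt)
{
	in->nr_opts++;
	if (strcmp(opt, "noprefix") == 0)
		in->saw_noprefix = 1;
	return 0;
}

/*
 * Split a writable, comma-separated string in place and hand each
 * non-empty piece to the inner context's option parser.
 */
static int demo_parse_monolithic(struct demo_inner *in, char *opts)
{
	char *p = opts;

	while (p) {
		char *comma = strchr(p, ',');
		int ret;

		if (comma)
			*comma = '\0';
		if (*p) {
			ret = demo_inner_parse_option(in, p);
			if (ret < 0)
				return ret;
		}
		p = comma ? comma + 1 : NULL;
	}
	return 0;
}

/* The outer filesystem supplies canned options on the inner one's behalf. */
static int demo_outer_get_tree(struct demo_inner *in)
{
	static const char opts[] =
		"cpuset,noprefix,release_agent=/sbin/cpuset_release_agent";
	char *copy;
	int ret;

	/* The splitter writes NULs, so work on a heap copy of the constant. */
	copy = malloc(sizeof(opts));
	if (!copy)
		return -1;
	memcpy(copy, opts, sizeof(opts));

	ret = demo_parse_monolithic(in, copy);
	free(copy);
	return ret;
}
```

The copy matters because the option string is a constant while the
splitter modifies its input.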

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Tejun Heo <tj@kernel.org>
---

 kernel/cgroup/cpuset.c |   66 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 52 insertions(+), 14 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6d9f1a709af9..e6582b2f5144 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -38,7 +38,7 @@
 #include <linux/mm.h>
 #include <linux/memory.h>
 #include <linux/export.h>
-#include <linux/mount.h>
+#include <linux/fs_context.h>
 #include <linux/namei.h>
 #include <linux/pagemap.h>
 #include <linux/proc_fs.h>
@@ -315,26 +315,64 @@ static inline bool is_in_v2_mode(void)
  * users. If someone tries to mount the "cpuset" filesystem, we
  * silently switch it to mount "cgroup" instead
  */
-static struct dentry *cpuset_mount(struct file_system_type *fs_type,
-				   int flags, const char *unused_dev_name,
-				   void *data, size_t data_size)
+static int cpuset_get_tree(struct fs_context *fc)
 {
-	struct file_system_type *cgroup_fs = get_fs_type("cgroup");
-	struct dentry *ret = ERR_PTR(-ENODEV);
+	static const char opts[] = "cpuset,noprefix,release_agent=/sbin/cpuset_release_agent";
+	struct file_system_type *cgroup_fs;
+	struct fs_context *cg_fc;
+	char *p;
+	int ret = -ENODEV;
+
+	cgroup_fs = get_fs_type("cgroup");
-	if (cgroup_fs) {
-		char mountopts[] =
-			"cpuset,noprefix,"
-			"release_agent=/sbin/cpuset_release_agent";
-		ret = cgroup_fs->mount(cgroup_fs, flags, unused_dev_name,
-				       mountopts, data_size);
-		put_filesystem(cgroup_fs);
+	if (!cgroup_fs) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	cg_fc = vfs_new_fs_context(cgroup_fs, NULL, fc->sb_flags, fc->purpose);
+	put_filesystem(cgroup_fs);
+	if (IS_ERR(cg_fc)) {
+		ret = PTR_ERR(cg_fc);
+		goto out;
 	}
+
+	ret = -ENOMEM;
+	p = kstrdup(opts, GFP_KERNEL);
+	if (!p)
+		goto out_fc;
+
+	ret = generic_parse_monolithic(cg_fc, p, sizeof(opts) - 1);
+	kfree(p);
+	if (ret < 0)
+		goto out_fc;
+
+	ret = vfs_get_tree(cg_fc);
+	if (ret < 0)
+		goto out_fc;
+
+	fc->root = dget(cg_fc->root);
+	ret = 0;
+
+out_fc:
+	put_fs_context(cg_fc);
+out:
 	return ret;
 }
 
+static const struct fs_context_operations cpuset_fs_context_ops = {
+	.get_tree	= cpuset_get_tree,
+};
+
+static int cpuset_init_fs_context(struct fs_context *fc,
+				  struct dentry *reference)
+{
+	fc->ops = &cpuset_fs_context_ops;
+	return 0;
+}
+
 static struct file_system_type cpuset_fs_type = {
-	.name = "cpuset",
-	.mount = cpuset_mount,
+	.name			= "cpuset",
+	.init_fs_context	= cpuset_init_fs_context,
 };
 
 /*



* [PATCH 19/32] kernfs, sysfs, cgroup, intel_rdt: Support fs_context [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (17 preceding siblings ...)
  2018-07-10 22:43 ` [PATCH 18/32] cpuset: Use " David Howells
@ 2018-07-10 22:43 ` David Howells
  2018-07-10 22:43 ` [PATCH 20/32] hugetlbfs: Convert to " David Howells
                   ` (18 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:43 UTC (permalink / raw)
  To: viro
  Cc: fenghua.yu, Greg Kroah-Hartman, linux-kernel, dhowells,
	linux-fsdevel, Li Zefan, Johannes Weiner, Tejun Heo, cgroups,
	torvalds

Make kernfs support superblock creation/mount/remount with fs_context.

This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.

Notes:

 (1) A kernfs_fs_context struct is created to wrap fs_context and the
     kernfs mount parameters are moved in here (or are in fs_context).

 (2) kernfs_mount{,_ns}() are made into kernfs_get_tree().  The extra
     namespace tag parameter is passed in the context if desired.

 (3) kernfs_free_fs_context() is provided as a destructor for the
     kernfs_fs_context struct, but for the moment it does nothing except
     get called in the right places.

 (4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
     pass, but possibly this should be done anyway in case someone wants to
     add a parameter in future.

 (5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
     the cgroup v1 and v2 mount parameters are all moved there.

 (6) cgroup1 parameter parsing error messages are now handled by invalf(),
     which allows userspace to collect them directly.

 (7) cgroup1 parameter cleanup is now done in the context destructor rather
     than in the mount/get_tree and remount functions.
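
The wrapping in notes (1) and (5) relies on embedding the kernfs context
inside the user's own context and recovering the outer struct from the
pointer stored in fs_private.  A minimal userspace sketch of that lookup
follows; the demo_* names are hypothetical and the macro is a simplified
stand-in for the kernel's container_of():

```c
#include <stddef.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical inner context owned by the kernfs-like layer. */
struct demo_kernfs_ctx {
	unsigned long magic;
};

/* Hypothetical outer context that embeds the inner one. */
struct demo_rdt_ctx {
	struct demo_kernfs_ctx	kfc;
	bool			enable_cdpl2;
	bool			enable_cdpl3;
};

/* Hypothetical generic context: only sees a pointer to the inner struct. */
struct demo_fs_context {
	void *fs_private;
};

/* Recover the outer context from the generic one. */
static struct demo_rdt_ctx *demo_fc2rdt(struct demo_fs_context *fc)
{
	struct demo_kernfs_ctx *kfc = fc->fs_private;

	return demo_container_of(kfc, struct demo_rdt_ctx, kfc);
}
```

With this layout the generic layer can treat fs_private as the inner
context while the filesystem still reaches its own fields, and a single
allocation covers both.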

Weirdies:

 (*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
     but then uses the resulting pointer after dropping the locks.  I'm
     told this is okay and needs commenting.

 (*) The cgroup refcount web.  This really needs documenting.

 (*) cgroup2 only has one root?

Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
---

 arch/x86/kernel/cpu/intel_rdt.h          |   15 +
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |  149 ++++++++-------
 fs/kernfs/mount.c                        |   89 ++++-----
 fs/sysfs/mount.c                         |   67 +++++--
 include/linux/cgroup.h                   |    3 
 include/linux/kernfs.h                   |   39 ++--
 kernel/cgroup/cgroup-internal.h          |   49 +++--
 kernel/cgroup/cgroup-v1.c                |  302 +++++++++++++++---------------
 kernel/cgroup/cgroup.c                   |  226 +++++++++++++---------
 kernel/cgroup/cpuset.c                   |    4 
 10 files changed, 527 insertions(+), 416 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index be152b3b2543..82dda4daec7f 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -33,6 +33,21 @@
 #define RMID_VAL_ERROR			BIT_ULL(63)
 #define RMID_VAL_UNAVAIL		BIT_ULL(62)
 
+
+struct rdt_fs_context {
+	struct kernfs_fs_context	kfc;
+	bool				enable_cdpl2;
+	bool				enable_cdpl3;
+	bool				enable_mba_mbps;
+};
+
+static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
+{
+	struct kernfs_fs_context *kfc = fc->fs_private;
+
+	return container_of(kfc, struct rdt_fs_context, kfc);
+}
+
 DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
 
 /**
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index c74365b78253..005006f3d0f4 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1131,43 +1131,6 @@ static void cdp_disable_all(void)
 		cdpl2_disable();
 }
 
-static int parse_rdtgroupfs_options(char *data)
-{
-	char *token, *o = data;
-	int ret = 0;
-
-	while ((token = strsep(&o, ",")) != NULL) {
-		if (!*token) {
-			ret = -EINVAL;
-			goto out;
-		}
-
-		if (!strcmp(token, "cdp")) {
-			ret = cdpl3_enable();
-			if (ret)
-				goto out;
-		} else if (!strcmp(token, "cdpl2")) {
-			ret = cdpl2_enable();
-			if (ret)
-				goto out;
-		} else if (!strcmp(token, "mba_MBps")) {
-			ret = set_mba_sc(true);
-			if (ret)
-				goto out;
-		} else {
-			ret = -EINVAL;
-			goto out;
-		}
-	}
-
-	return 0;
-
-out:
-	pr_err("Invalid mount option \"%s\"\n", token);
-
-	return ret;
-}
-
 /*
  * We don't allow rdtgroup directories to be created anywhere
  * except the root directory. Thus when looking for the rdtgroup
@@ -1236,13 +1199,27 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn,
 			     struct rdtgroup *prgrp,
 			     struct kernfs_node **mon_data_kn);
 
-static struct dentry *rdt_mount(struct file_system_type *fs_type,
-				int flags, const char *unused_dev_name,
-				void *data, size_t data_size)
+static int rdt_enable_ctx(struct rdt_fs_context *ctx)
+{
+	int ret = 0;
+
+	if (ctx->enable_cdpl2)
+		ret = cdpl2_enable();
+
+	if (!ret && ctx->enable_cdpl3)
+		ret = cdpl3_enable();
+
+	if (!ret && ctx->enable_mba_mbps)
+		ret = set_mba_sc(true);
+
+	return ret;
+}
+
+static int rdt_get_tree(struct fs_context *fc)
 {
+	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	struct rdt_domain *dom;
 	struct rdt_resource *r;
-	struct dentry *dentry;
 	int ret;
 
 	cpus_read_lock();
@@ -1251,53 +1228,42 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
 	 * resctrl file system can only be mounted once.
 	 */
 	if (static_branch_unlikely(&rdt_enable_key)) {
-		dentry = ERR_PTR(-EBUSY);
+		ret = -EBUSY;
 		goto out;
 	}
 
-	ret = parse_rdtgroupfs_options(data);
-	if (ret) {
-		dentry = ERR_PTR(ret);
+	ret = rdt_enable_ctx(ctx);
+	if (ret < 0)
 		goto out_cdp;
-	}
 
 	closid_init();
 
 	ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
-	if (ret) {
-		dentry = ERR_PTR(ret);
-		goto out_cdp;
-	}
+	if (ret < 0)
+		goto out_mba;
 
 	if (rdt_mon_capable) {
 		ret = mongroup_create_dir(rdtgroup_default.kn,
 					  NULL, "mon_groups",
 					  &kn_mongrp);
-		if (ret) {
-			dentry = ERR_PTR(ret);
+		if (ret < 0)
 			goto out_info;
-		}
 		kernfs_get(kn_mongrp);
 
 		ret = mkdir_mondata_all(rdtgroup_default.kn,
 					&rdtgroup_default, &kn_mondata);
-		if (ret) {
-			dentry = ERR_PTR(ret);
+		if (ret < 0)
 			goto out_mongrp;
-		}
 		kernfs_get(kn_mondata);
 		rdtgroup_default.mon.mon_data_kn = kn_mondata;
 	}
 
 	ret = rdt_pseudo_lock_init();
-	if (ret) {
-		dentry = ERR_PTR(ret);
+	if (ret)
 		goto out_mondata;
-	}
 
-	dentry = kernfs_mount(fs_type, flags, rdt_root,
-			      RDTGROUP_SUPER_MAGIC, NULL);
-	if (IS_ERR(dentry))
+	ret = kernfs_get_tree(fc);
+	if (ret < 0)
 		goto out_psl;
 
 	if (rdt_alloc_capable)
@@ -1326,14 +1292,65 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
 		kernfs_remove(kn_mongrp);
 out_info:
 	kernfs_remove(kn_info);
+out_mba:
+	if (ctx->enable_mba_mbps)
+		set_mba_sc(false);
 out_cdp:
 	cdp_disable_all();
 out:
 	rdt_last_cmd_clear();
 	mutex_unlock(&rdtgroup_mutex);
 	cpus_read_unlock();
+	return ret;
+}
+
+static int rdt_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+	struct rdt_fs_context *ctx = rdt_fc2context(fc);
+
+	if (strcmp(opt, "cdp") == 0) {
+		ctx->enable_cdpl3 = true;
+		return 0;
+	}
+	if (strcmp(opt, "cdpl2") == 0) {
+		ctx->enable_cdpl2 = true;
+		return 0;
+	}
+	if (strcmp(opt, "mba_MBps") == 0) {
+		ctx->enable_mba_mbps = true;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static void rdt_fs_context_free(struct fs_context *fc)
+{
+	struct rdt_fs_context *ctx = rdt_fc2context(fc);
+
+	kernfs_free_fs_context(fc);
+	kfree(ctx);
+}
+
+static const struct fs_context_operations rdt_fs_context_ops = {
+	.free		= rdt_fs_context_free,
+	.parse_option	= rdt_parse_option,
+	.get_tree	= rdt_get_tree,
+};
+
+static int rdt_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+	struct rdt_fs_context *ctx;
+
+	ctx = kzalloc(sizeof(struct rdt_fs_context), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
 
-	return dentry;
+	ctx->kfc.root = rdt_root;
+	ctx->kfc.magic = RDTGROUP_SUPER_MAGIC;
+	fc->fs_private = &ctx->kfc;
+	fc->ops = &rdt_fs_context_ops;
+	return 0;
 }
 
 static int reset_all_ctrls(struct rdt_resource *r)
@@ -1500,9 +1517,9 @@ static void rdt_kill_sb(struct super_block *sb)
 }
 
 static struct file_system_type rdt_fs_type = {
-	.name    = "resctrl",
-	.mount   = rdt_mount,
-	.kill_sb = rdt_kill_sb,
+	.name			= "resctrl",
+	.init_fs_context	= rdt_init_fs_context,
+	.kill_sb		= rdt_kill_sb,
 };
 
 static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f70e0b69e714..8be71b8943c3 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -22,14 +22,13 @@
 
 struct kmem_cache *kernfs_node_cache;
 
-static int kernfs_sop_remount_fs(struct super_block *sb, int *flags,
-				 char *data, size_t data_size)
+static int kernfs_sop_reconfigure(struct super_block *sb, struct fs_context *fc)
 {
 	struct kernfs_root *root = kernfs_info(sb)->root;
 	struct kernfs_syscall_ops *scops = root->syscall_ops;
 
-	if (scops && scops->remount_fs)
-		return scops->remount_fs(root, flags, data);
+	if (scops && scops->reconfigure)
+		return scops->reconfigure(root, fc);
 	return 0;
 }
 
@@ -61,7 +60,7 @@ const struct super_operations kernfs_sops = {
 	.drop_inode	= generic_delete_inode,
 	.evict_inode	= kernfs_evict_inode,
 
-	.remount_fs	= kernfs_sop_remount_fs,
+	.reconfigure	= kernfs_sop_reconfigure,
 	.show_options	= kernfs_sop_show_options,
 	.show_path	= kernfs_sop_show_path,
 };
@@ -219,7 +218,7 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
 	} while (true);
 }
 
-static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
+static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
 	struct inode *inode;
@@ -230,7 +229,7 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
 	sb->s_blocksize = PAGE_SIZE;
 	sb->s_blocksize_bits = PAGE_SHIFT;
-	sb->s_magic = magic;
+	sb->s_magic = kfc->magic;
 	sb->s_op = &kernfs_sops;
 	sb->s_xattr = kernfs_xattr_handlers;
 	if (info->root->flags & KERNFS_ROOT_SUPPORT_EXPORTOP)
@@ -257,21 +256,20 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 	return 0;
 }
 
-static int kernfs_test_super(struct super_block *sb, void *data)
+static int kernfs_test_super(struct super_block *sb, struct fs_context *fc)
 {
 	struct kernfs_super_info *sb_info = kernfs_info(sb);
-	struct kernfs_super_info *info = data;
+	struct kernfs_super_info *info = fc->s_fs_info;
 
 	return sb_info->root == info->root && sb_info->ns == info->ns;
 }
 
-static int kernfs_set_super(struct super_block *sb, void *data)
+static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
 {
-	int error;
-	error = set_anon_super(sb, data);
-	if (!error)
-		sb->s_fs_info = data;
-	return error;
+	struct kernfs_fs_context *kfc = fc->fs_private;
+
+	kfc->ns_tag = NULL;
+	return set_anon_super_fc(sb, fc);
 }
 
 /**
@@ -288,63 +286,62 @@ const void *kernfs_super_ns(struct super_block *sb)
 }
 
 /**
- * kernfs_mount_ns - kernfs mount helper
- * @fs_type: file_system_type of the fs being mounted
- * @flags: mount flags specified for the mount
- * @root: kernfs_root of the hierarchy being mounted
- * @magic: file system specific magic number
- * @new_sb_created: tell the caller if we allocated a new superblock
- * @ns: optional namespace tag of the mount
- *
- * This is to be called from each kernfs user's file_system_type->mount()
- * implementation, which should pass through the specified @fs_type and
- * @flags, and specify the hierarchy and namespace tag to mount via @root
- * and @ns, respectively.
+ * kernfs_get_tree - kernfs filesystem access/retrieval helper
+ * @fc: The filesystem context.
  *
- * The return value can be passed to the vfs layer verbatim.
+ * This is to be called from each kernfs user's fs_context->ops->get_tree()
+ * implementation.  @fc->fs_private must point to a struct kernfs_fs_context
+ * whose ->root, ->ns_tag and ->magic give the hierarchy, namespace tag and
+ * magic number to mount.  Returns 0 with @fc->root set, or a negative errno.
  */
-struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
-				struct kernfs_root *root, unsigned long magic,
-				bool *new_sb_created, const void *ns)
+int kernfs_get_tree(struct fs_context *fc)
 {
+	struct kernfs_fs_context *kfc = fc->fs_private;
 	struct super_block *sb;
 	struct kernfs_super_info *info;
 	int error;
 
 	info = kzalloc(sizeof(*info), GFP_KERNEL);
 	if (!info)
-		return ERR_PTR(-ENOMEM);
+		return -ENOMEM;
 
-	info->root = root;
-	info->ns = ns;
+	info->root = kfc->root;
+	info->ns = kfc->ns_tag;
 	INIT_LIST_HEAD(&info->node);
 
-	sb = sget_userns(fs_type, kernfs_test_super, kernfs_set_super, flags,
-			 &init_user_ns, info);
-	if (IS_ERR(sb) || sb->s_fs_info != info)
-		kfree(info);
+	fc->s_fs_info = info;
+	sb = sget_fc(fc, kernfs_test_super, kernfs_set_super);
 	if (IS_ERR(sb))
-		return ERR_CAST(sb);
-
-	if (new_sb_created)
-		*new_sb_created = !sb->s_root;
+		return PTR_ERR(sb);
 
 	if (!sb->s_root) {
 		struct kernfs_super_info *info = kernfs_info(sb);
 
-		error = kernfs_fill_super(sb, magic);
+		kfc->new_sb_created = true;
+
+		error = kernfs_fill_super(sb, kfc);
 		if (error) {
 			deactivate_locked_super(sb);
-			return ERR_PTR(error);
+			return error;
 		}
 		sb->s_flags |= SB_ACTIVE;
 
 		mutex_lock(&kernfs_mutex);
-		list_add(&info->node, &root->supers);
+		list_add(&info->node, &info->root->supers);
 		mutex_unlock(&kernfs_mutex);
 	}
 
-	return dget(sb->s_root);
+	fc->root = dget(sb->s_root);
+	return 0;
+}
+
+void kernfs_free_fs_context(struct fs_context *fc)
+{
+	/* Note that we don't deal with kfc->ns_tag here. */
+	if (fc->s_fs_info) {
+		kfree(fc->s_fs_info);
+		fc->s_fs_info = NULL;
+	}
 }
 
 /**
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 77302c35b0ff..1e1c0ccc6a36 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -13,6 +13,7 @@
 #include <linux/magic.h>
 #include <linux/mount.h>
 #include <linux/init.h>
+#include <linux/slab.h>
 #include <linux/user_namespace.h>
 
 #include "sysfs.h"
@@ -20,27 +21,55 @@
 static struct kernfs_root *sysfs_root;
 struct kernfs_node *sysfs_root_kn;
 
-static struct dentry *sysfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data, size_t data_size)
+static int sysfs_get_tree(struct fs_context *fc)
 {
-	struct dentry *root;
-	void *ns;
-	bool new_sb = false;
+	struct kernfs_fs_context *kfc = fc->fs_private;
+	int ret;
 
-	if (!(flags & SB_KERNMOUNT)) {
+	ret = kernfs_get_tree(fc);
+	if (ret)
+		return ret;
+
+	if (kfc->new_sb_created)
+		fc->root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
+	return 0;
+}
+
+static void sysfs_fs_context_free(struct fs_context *fc)
+{
+	struct kernfs_fs_context *kfc = fc->fs_private;
+
+	if (kfc->ns_tag)
+		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag);
+	kernfs_free_fs_context(fc);
+	kfree(kfc);
+}
+
+static const struct fs_context_operations sysfs_fs_context_ops = {
+	.free		= sysfs_fs_context_free,
+	.get_tree	= sysfs_get_tree,
+};
+
+static int sysfs_init_fs_context(struct fs_context *fc,
+				 struct dentry *reference)
+{
+	struct kernfs_fs_context *kfc;
+
+	if (!(fc->sb_flags & SB_KERNMOUNT)) {
 		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
-			return ERR_PTR(-EPERM);
+			return -EPERM;
 	}
 
-	ns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
-	root = kernfs_mount_ns(fs_type, flags, sysfs_root,
-				SYSFS_MAGIC, &new_sb, ns);
-	if (!new_sb)
-		kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
-	else if (!IS_ERR(root))
-		root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
+	kfc = kzalloc(sizeof(struct kernfs_fs_context), GFP_KERNEL);
+	if (!kfc)
+		return -ENOMEM;
 
-	return root;
+	kfc->ns_tag = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+	kfc->root = sysfs_root;
+	kfc->magic = SYSFS_MAGIC;
+	fc->fs_private = kfc;
+	fc->ops = &sysfs_fs_context_ops;
+	return 0;
 }
 
 static void sysfs_kill_sb(struct super_block *sb)
@@ -52,10 +81,10 @@ static void sysfs_kill_sb(struct super_block *sb)
 }
 
 static struct file_system_type sysfs_fs_type = {
-	.name		= "sysfs",
-	.mount		= sysfs_mount,
-	.kill_sb	= sysfs_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.name			= "sysfs",
+	.init_fs_context	= sysfs_init_fs_context,
+	.kill_sb		= sysfs_kill_sb,
+	.fs_flags		= FS_USERNS_MOUNT,
 };
 
 int __init sysfs_init(void)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c9fdf6f57913..ac198f0c466f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -829,10 +829,11 @@ copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
 
 #endif /* !CONFIG_CGROUPS */
 
-static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+static inline struct cgroup_namespace *get_cgroup_ns(struct cgroup_namespace *ns)
 {
 	if (ns)
 		refcount_inc(&ns->count);
+	return ns;
 }
 
 static inline void put_cgroup_ns(struct cgroup_namespace *ns)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index ab25c8b6d9e3..627fa3956146 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -16,6 +16,7 @@
 #include <linux/rbtree.h>
 #include <linux/atomic.h>
 #include <linux/wait.h>
+#include <linux/fs_context.h>
 
 struct file;
 struct dentry;
@@ -25,6 +26,7 @@ struct vm_area_struct;
 struct super_block;
 struct file_system_type;
 
+struct kernfs_fs_context;
 struct kernfs_open_node;
 struct kernfs_iattrs;
 
@@ -166,7 +168,7 @@ struct kernfs_node {
  * kernfs_node parameter.
  */
 struct kernfs_syscall_ops {
-	int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
+	int (*reconfigure)(struct kernfs_root *root, struct fs_context *fc);
 	int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
 
 	int (*mkdir)(struct kernfs_node *parent, const char *name,
@@ -267,6 +269,18 @@ struct kernfs_ops {
 #endif
 };
 
+/*
+ * The kernfs superblock creation/mount parameter context.
+ */
+struct kernfs_fs_context {
+	struct kernfs_root	*root;		/* Root of the hierarchy being mounted */
+	void			*ns_tag;	/* Namespace tag of the mount (or NULL) */
+	unsigned long		magic;		/* File system specific magic number */
+
+	/* The following are set/used by kernfs_get_tree() */
+	bool			new_sb_created;	/* Set to T if we allocated a new sb */
+};
+
 #ifdef CONFIG_KERNFS
 
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
@@ -350,9 +364,8 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
 void kernfs_notify(struct kernfs_node *kn);
 
 const void *kernfs_super_ns(struct super_block *sb);
-struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
-			       struct kernfs_root *root, unsigned long magic,
-			       bool *new_sb_created, const void *ns);
+int kernfs_get_tree(struct fs_context *fc);
+void kernfs_free_fs_context(struct fs_context *fc);
 void kernfs_kill_sb(struct super_block *sb);
 struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
 
@@ -454,11 +467,10 @@ static inline void kernfs_notify(struct kernfs_node *kn) { }
 static inline const void *kernfs_super_ns(struct super_block *sb)
 { return NULL; }
 
-static inline struct dentry *
-kernfs_mount_ns(struct file_system_type *fs_type, int flags,
-		struct kernfs_root *root, unsigned long magic,
-		bool *new_sb_created, const void *ns)
-{ return ERR_PTR(-ENOSYS); }
+static inline int kernfs_get_tree(struct fs_context *fc)
+{ return -ENOSYS; }
+
+static inline void kernfs_free_fs_context(struct fs_context *fc) { }
 
 static inline void kernfs_kill_sb(struct super_block *sb) { }
 
@@ -535,13 +547,4 @@ static inline int kernfs_rename(struct kernfs_node *kn,
 	return kernfs_rename_ns(kn, new_parent, new_name, NULL);
 }
 
-static inline struct dentry *
-kernfs_mount(struct file_system_type *fs_type, int flags,
-		struct kernfs_root *root, unsigned long magic,
-		bool *new_sb_created)
-{
-	return kernfs_mount_ns(fs_type, flags, root,
-				magic, new_sb_created, NULL);
-}
-
 #endif	/* __LINUX_KERNFS_H */
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 77ff1cd6a252..d50527b06ac0 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -8,6 +8,33 @@
 #include <linux/list.h>
 #include <linux/refcount.h>
 
+/*
+ * The cgroup filesystem superblock creation/mount context.
+ */
+struct cgroup_fs_context {
+	struct kernfs_fs_context kfc;
+	struct cgroup_root	*root;
+	struct cgroup_namespace	*ns;
+	u8		version;		/* cgroups version */
+	unsigned int	flags;			/* CGRP_ROOT_* flags */
+
+	/* cgroup1 bits */
+	bool		cpuset_clone_children;
+	bool		none;			/* User explicitly requested empty subsystem */
+	bool		all_ss;			/* Seen 'all' option */
+	bool		one_ss;			/* Seen a subsystem name option */
+	u16		subsys_mask;		/* Selected subsystems */
+	char		*name;			/* Hierarchy name */
+	char		*release_agent;		/* Path for release notifications */
+};
+
+static inline struct cgroup_fs_context *cgroup_fc2context(struct fs_context *fc)
+{
+	struct kernfs_fs_context *kfc = fc->fs_private;
+
+	return container_of(kfc, struct cgroup_fs_context, kfc);
+}
+
 /*
  * A cgroup can be associated with multiple css_sets as different tasks may
  * belong to different cgroups on different hierarchies.  In the other
@@ -89,16 +116,6 @@ struct cgroup_mgctx {
 #define DEFINE_CGROUP_MGCTX(name)						\
 	struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
 
-struct cgroup_sb_opts {
-	u16 subsys_mask;
-	unsigned int flags;
-	char *release_agent;
-	bool cpuset_clone_children;
-	char *name;
-	/* User explicitly requested empty subsystem */
-	bool none;
-};
-
 extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 extern struct cgroup_subsys *cgroup_subsys[];
@@ -169,12 +186,10 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
 			  struct cgroup_namespace *ns);
 
 void cgroup_free_root(struct cgroup_root *root);
-void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts);
+void init_cgroup_root(struct cgroup_fs_context *ctx);
 int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags);
 int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
-struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
-			       struct cgroup_root *root, unsigned long magic,
-			       struct cgroup_namespace *ns);
+int cgroup_do_get_tree(struct fs_context *fc);
 
 int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp);
 void cgroup_migrate_finish(struct cgroup_mgctx *mgctx);
@@ -224,8 +239,8 @@ bool cgroup1_ssid_disabled(int ssid);
 void cgroup1_pidlist_destroy_all(struct cgroup *cgrp);
 void cgroup1_release_agent(struct work_struct *work);
 void cgroup1_check_for_release(struct cgroup *cgrp);
-struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
-			     void *data, unsigned long magic,
-			     struct cgroup_namespace *ns);
+int cgroup1_parse_option(struct fs_context *fc, char *p);
+int cgroup1_validate(struct fs_context *fc);
+int cgroup1_get_tree(struct fs_context *fc);
 
 #endif /* __CGROUP_INTERNAL_H */
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 8b4f0768efd6..749ccf5c0690 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -16,6 +16,8 @@
 
 #include <trace/events/cgroup.h>
 
+#define cg_invalf(fc, fmt, ...) ({ pr_err(fmt, ## __VA_ARGS__); -EINVAL; })
+
 /*
  * pidlists linger the following amount before being destroyed.  The goal
  * is avoiding frequent destruction in the middle of consecutive read calls
@@ -903,168 +905,168 @@ static int cgroup1_show_options(struct seq_file *seq, struct kernfs_root *kf_roo
 	return 0;
 }
 
-static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
+int cgroup1_parse_option(struct fs_context *fc, char *token)
 {
-	char *token, *o = data;
-	bool all_ss = false, one_ss = false;
-	u16 mask = U16_MAX;
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 	struct cgroup_subsys *ss;
-	int nr_opts = 0;
 	int i;
 
-#ifdef CONFIG_CPUSETS
-	mask = ~((u16)1 << cpuset_cgrp_id);
-#endif
-
-	memset(opts, 0, sizeof(*opts));
-
-	while ((token = strsep(&o, ",")) != NULL) {
-		nr_opts++;
+	if (!strcmp(token, "none")) {
+		/* Explicitly have no subsystems */
+		ctx->none = true;
+		return 0;
+	}
+	if (!strcmp(token, "all")) {
+		/* Mutually exclusive option 'all' + subsystem name */
+		if (ctx->one_ss)
+			return cg_invalf(fc, "cgroup1: all conflicts with subsys name");
+		ctx->all_ss = true;
+		return 0;
+	}
+	if (!strcmp(token, "noprefix")) {
+		ctx->flags |= CGRP_ROOT_NOPREFIX;
+		return 0;
+	}
+	if (!strcmp(token, "clone_children")) {
+		ctx->cpuset_clone_children = true;
+		return 0;
+	}
+	if (!strcmp(token, "xattr")) {
+		ctx->flags |= CGRP_ROOT_XATTR;
+		return 0;
+	}
+	if (!strncmp(token, "release_agent=", 14)) {
+		/* Specifying two release agents is forbidden */
+		if (ctx->release_agent)
+			return cg_invalf(fc, "cgroup1: release_agent respecified");
+		ctx->release_agent =
+			kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
+		if (!ctx->release_agent)
+			return -ENOMEM;
+		return 0;
+	}
 
-		if (!*token)
-			return -EINVAL;
-		if (!strcmp(token, "none")) {
-			/* Explicitly have no subsystems */
-			opts->none = true;
-			continue;
-		}
-		if (!strcmp(token, "all")) {
-			/* Mutually exclusive option 'all' + subsystem name */
-			if (one_ss)
-				return -EINVAL;
-			all_ss = true;
-			continue;
-		}
-		if (!strcmp(token, "noprefix")) {
-			opts->flags |= CGRP_ROOT_NOPREFIX;
-			continue;
+	if (!strncmp(token, "name=", 5)) {
+		const char *name = token + 5;
+		/* Can't specify an empty name */
+		if (!strlen(name))
+			return cg_invalf(fc, "cgroup1: Empty name");
+		/* Must match [\w.-]+ */
+		for (i = 0; i < strlen(name); i++) {
+			char c = name[i];
+			if (isalnum(c))
+				continue;
+			if ((c == '.') || (c == '-') || (c == '_'))
+				continue;
+			return cg_invalf(fc, "cgroup1: Invalid name");
 		}
-		if (!strcmp(token, "clone_children")) {
-			opts->cpuset_clone_children = true;
+		/* Specifying two names is forbidden */
+		if (ctx->name)
+			return cg_invalf(fc, "cgroup1: name respecified");
+		ctx->name = kstrndup(name,
+				     MAX_CGROUP_ROOT_NAMELEN - 1,
+				     GFP_KERNEL);
+		if (!ctx->name)
+			return -ENOMEM;
+
+		return 0;
+	}
+
+	for_each_subsys(ss, i) {
+		if (strcmp(token, ss->legacy_name))
 			continue;
-		}
 		if (!strcmp(token, "cpuset_v2_mode")) {
-			opts->flags |= CGRP_ROOT_CPUSET_V2_MODE;
+			ctx->flags |= CGRP_ROOT_CPUSET_V2_MODE;
 			continue;
 		}
 		if (!strcmp(token, "xattr")) {
-			opts->flags |= CGRP_ROOT_XATTR;
+			ctx->flags |= CGRP_ROOT_XATTR;
 			continue;
 		}
-		if (!strncmp(token, "release_agent=", 14)) {
-			/* Specifying two release agents is forbidden */
-			if (opts->release_agent)
-				return -EINVAL;
-			opts->release_agent =
-				kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
-			if (!opts->release_agent)
-				return -ENOMEM;
+		if (cgroup1_ssid_disabled(i))
 			continue;
-		}
-		if (!strncmp(token, "name=", 5)) {
-			const char *name = token + 5;
-			/* Can't specify an empty name */
-			if (!strlen(name))
-				return -EINVAL;
-			/* Must match [\w.-]+ */
-			for (i = 0; i < strlen(name); i++) {
-				char c = name[i];
-				if (isalnum(c))
-					continue;
-				if ((c == '.') || (c == '-') || (c == '_'))
-					continue;
-				return -EINVAL;
-			}
-			/* Specifying two names is forbidden */
-			if (opts->name)
-				return -EINVAL;
-			opts->name = kstrndup(name,
-					      MAX_CGROUP_ROOT_NAMELEN - 1,
-					      GFP_KERNEL);
-			if (!opts->name)
-				return -ENOMEM;
 
-			continue;
-		}
+		/* Mutually exclusive option 'all' + subsystem name */
+		if (ctx->all_ss)
+			return cg_invalf(fc, "cgroup1: subsys name conflicts with all");
+		ctx->subsys_mask |= (1 << i);
+		ctx->one_ss = true;
+		return 0;
+	}
 
-		for_each_subsys(ss, i) {
-			if (strcmp(token, ss->legacy_name))
-				continue;
-			if (!cgroup_ssid_enabled(i))
-				continue;
-			if (cgroup1_ssid_disabled(i))
-				continue;
+	if (i == CGROUP_SUBSYS_COUNT)
+		return -ENOENT;
+
+	return 0;
+}
 
-			/* Mutually exclusive option 'all' + subsystem name */
-			if (all_ss)
-				return -EINVAL;
-			opts->subsys_mask |= (1 << i);
-			one_ss = true;
+/*
+ * Validate the options that have been parsed.
+ */
+int cgroup1_validate(struct fs_context *fc)
+{
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+	struct cgroup_subsys *ss;
+	u16 mask = U16_MAX;
+	int i;
 
-			break;
-		}
-		if (i == CGROUP_SUBSYS_COUNT)
-			return -ENOENT;
-	}
+#ifdef CONFIG_CPUSETS
+	mask = ~((u16)1 << cpuset_cgrp_id);
+#endif
 
 	/*
 	 * If the 'all' option was specified select all the subsystems,
 	 * otherwise if 'none', 'name=' and a subsystem name options were
 	 * not specified, let's default to 'all'
 	 */
-	if (all_ss || (!one_ss && !opts->none && !opts->name))
+	if (ctx->all_ss || (!ctx->one_ss && !ctx->none && !ctx->name))
 		for_each_subsys(ss, i)
 			if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i))
-				opts->subsys_mask |= (1 << i);
+				ctx->subsys_mask |= (1 << i);
 
 	/*
 	 * We either have to specify by name or by subsystems. (So all
 	 * empty hierarchies must have a name).
 	 */
-	if (!opts->subsys_mask && !opts->name)
-		return -EINVAL;
+	if (!ctx->subsys_mask && !ctx->name)
+		return cg_invalf(fc, "cgroup1: Need name or subsystem set");
 
 	/*
 	 * Option noprefix was introduced just for backward compatibility
 	 * with the old cpuset, so we allow noprefix only if mounting just
 	 * the cpuset subsystem.
 	 */
-	if ((opts->flags & CGRP_ROOT_NOPREFIX) && (opts->subsys_mask & mask))
-		return -EINVAL;
+	if ((ctx->flags & CGRP_ROOT_NOPREFIX) && (ctx->subsys_mask & mask))
+		return cg_invalf(fc, "cgroup1: noprefix used incorrectly");
 
 	/* Can't specify "none" and some subsystems */
-	if (opts->subsys_mask && opts->none)
-		return -EINVAL;
+	if (ctx->subsys_mask && ctx->none)
+		return cg_invalf(fc, "cgroup1: none used incorrectly");
 
 	return 0;
 }
 
-static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
+static int cgroup1_reconfigure(struct kernfs_root *kf_root, struct fs_context *fc)
 {
-	int ret = 0;
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
-	struct cgroup_sb_opts opts;
 	u16 added_mask, removed_mask;
+	int ret = 0;
 
 	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
 
-	/* See what subsystems are wanted */
-	ret = parse_cgroupfs_options(data, &opts);
-	if (ret)
-		goto out_unlock;
-
-	if (opts.subsys_mask != root->subsys_mask || opts.release_agent)
+	if (ctx->subsys_mask != root->subsys_mask || ctx->release_agent)
 		pr_warn("option changes via remount are deprecated (pid=%d comm=%s)\n",
 			task_tgid_nr(current), current->comm);
 
-	added_mask = opts.subsys_mask & ~root->subsys_mask;
-	removed_mask = root->subsys_mask & ~opts.subsys_mask;
+	added_mask = ctx->subsys_mask & ~root->subsys_mask;
+	removed_mask = root->subsys_mask & ~ctx->subsys_mask;
 
 	/* Don't allow flags or name to change at remount */
-	if ((opts.flags ^ root->flags) ||
-	    (opts.name && strcmp(opts.name, root->name))) {
-		pr_err("option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"\n",
-		       opts.flags, opts.name ?: "", root->flags, root->name);
+	if ((ctx->flags ^ root->flags) ||
+	    (ctx->name && strcmp(ctx->name, root->name))) {
+		cg_invalf(fc, "option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"",
+		       ctx->flags, ctx->name ?: "", root->flags, root->name);
 		ret = -EINVAL;
 		goto out_unlock;
 	}
@@ -1081,17 +1083,15 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
 
 	WARN_ON(rebind_subsystems(&cgrp_dfl_root, removed_mask));
 
-	if (opts.release_agent) {
+	if (ctx->release_agent) {
 		spin_lock(&release_agent_path_lock);
-		strcpy(root->release_agent_path, opts.release_agent);
+		strcpy(root->release_agent_path, ctx->release_agent);
 		spin_unlock(&release_agent_path_lock);
 	}
 
 	trace_cgroup_remount(root);
 
  out_unlock:
-	kfree(opts.release_agent);
-	kfree(opts.name);
 	mutex_unlock(&cgroup_mutex);
 	return ret;
 }
@@ -1099,31 +1099,26 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
 struct kernfs_syscall_ops cgroup1_kf_syscall_ops = {
 	.rename			= cgroup1_rename,
 	.show_options		= cgroup1_show_options,
-	.remount_fs		= cgroup1_remount,
+	.reconfigure		= cgroup1_reconfigure,
 	.mkdir			= cgroup_mkdir,
 	.rmdir			= cgroup_rmdir,
 	.show_path		= cgroup_show_path,
 };
 
-struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
-			     void *data, unsigned long magic,
-			     struct cgroup_namespace *ns)
+/*
+ * Find or create a v1 cgroups superblock.
+ */
+int cgroup1_get_tree(struct fs_context *fc)
 {
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 	struct super_block *pinned_sb = NULL;
-	struct cgroup_sb_opts opts;
 	struct cgroup_root *root;
 	struct cgroup_subsys *ss;
-	struct dentry *dentry;
 	int i, ret;
 	bool new_root = false;
 
 	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
 
-	/* First find the desired set of subsystems */
-	ret = parse_cgroupfs_options(data, &opts);
-	if (ret)
-		goto out_unlock;
-
 	/*
 	 * Destruction of cgroup root is asynchronous, so subsystems may
 	 * still be dying after the previous unmount.  Let's drain the
@@ -1132,15 +1127,13 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
 	 * starting.  Testing ref liveliness is good enough.
 	 */
 	for_each_subsys(ss, i) {
-		if (!(opts.subsys_mask & (1 << i)) ||
+		if (!(ctx->subsys_mask & (1 << i)) ||
 		    ss->root == &cgrp_dfl_root)
 			continue;
 
 		if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
 			mutex_unlock(&cgroup_mutex);
-			msleep(10);
-			ret = restart_syscall();
-			goto out_free;
+			goto err_restart;
 		}
 		cgroup_put(&ss->root->cgrp);
 	}
@@ -1156,8 +1149,8 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
 		 * name matches but sybsys_mask doesn't, we should fail.
 		 * Remember whether name matched.
 		 */
-		if (opts.name) {
-			if (strcmp(opts.name, root->name))
+		if (ctx->name) {
+			if (strcmp(ctx->name, root->name))
 				continue;
 			name_match = true;
 		}
@@ -1166,15 +1159,15 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
 		 * If we asked for subsystems (or explicitly for no
 		 * subsystems) then they must match.
 		 */
-		if ((opts.subsys_mask || opts.none) &&
-		    (opts.subsys_mask != root->subsys_mask)) {
+		if ((ctx->subsys_mask || ctx->none) &&
+		    (ctx->subsys_mask != root->subsys_mask)) {
 			if (!name_match)
 				continue;
 			ret = -EBUSY;
-			goto out_unlock;
+			goto err_unlock;
 		}
 
-		if (root->flags ^ opts.flags)
+		if (root->flags ^ ctx->flags)
 			pr_warn("new mount options do not match the existing superblock, will be ignored\n");
 
 		/*
@@ -1195,11 +1188,10 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
 			mutex_unlock(&cgroup_mutex);
 			if (!IS_ERR_OR_NULL(pinned_sb))
 				deactivate_super(pinned_sb);
-			msleep(10);
-			ret = restart_syscall();
-			goto out_free;
+			goto err_restart;
 		}
 
+		ctx->root = root;
 		ret = 0;
 		goto out_unlock;
 	}
@@ -1209,41 +1201,35 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
 	 * specification is allowed for already existing hierarchies but we
 	 * can't create new one without subsys specification.
 	 */
-	if (!opts.subsys_mask && !opts.none) {
-		ret = -EINVAL;
-		goto out_unlock;
+	if (!ctx->subsys_mask && !ctx->none) {
+		ret = cg_invalf(fc, "cgroup1: No subsys list or none specified");
+		goto err_unlock;
 	}
 
 	/* Hierarchies may only be created in the initial cgroup namespace. */
-	if (ns != &init_cgroup_ns) {
+	if (ctx->ns != &init_cgroup_ns) {
 		ret = -EPERM;
-		goto out_unlock;
+		goto err_unlock;
 	}
 
 	root = kzalloc(sizeof(*root), GFP_KERNEL);
 	if (!root) {
 		ret = -ENOMEM;
-		goto out_unlock;
+		goto err_unlock;
 	}
 	new_root = true;
+	ctx->root = root;
 
-	init_cgroup_root(root, &opts);
+	init_cgroup_root(ctx);
 
-	ret = cgroup_setup_root(root, opts.subsys_mask, PERCPU_REF_INIT_DEAD);
+	ret = cgroup_setup_root(root, ctx->subsys_mask, PERCPU_REF_INIT_DEAD);
 	if (ret)
-		cgroup_free_root(root);
+		goto err_unlock;
 
 out_unlock:
 	mutex_unlock(&cgroup_mutex);
-out_free:
-	kfree(opts.release_agent);
-	kfree(opts.name);
-
-	if (ret)
-		return ERR_PTR(ret);
 
-	dentry = cgroup_do_mount(&cgroup_fs_type, flags, root,
-				 CGROUP_SUPER_MAGIC, ns);
+	ret = cgroup_do_get_tree(fc);
 
 	/*
 	 * There's a race window after we release cgroup_mutex and before
@@ -1256,6 +1242,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
 		percpu_ref_reinit(&root->cgrp.self.refcnt);
 		mutex_unlock(&cgroup_mutex);
 	}
+	cgroup_get(&root->cgrp);
 
 	/*
 	 * If @pinned_sb, we're reusing an existing root and holding an
@@ -1264,7 +1251,14 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
 	if (pinned_sb)
 		deactivate_super(pinned_sb);
 
-	return dentry;
+	return ret;
+
+err_restart:
+	msleep(10);
+	return restart_syscall();
+err_unlock:
+	mutex_unlock(&cgroup_mutex);
+	return ret;
 }
 
 static int __init cgroup1_wq_init(void)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index ddb1a60ae3c0..33a11d941d11 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1734,25 +1734,23 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 	return len;
 }
 
-static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
+static int cgroup2_parse_option(struct fs_context *fc, char *token)
 {
-	char *token;
-
-	*root_flags = 0;
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 
-	if (!data)
+	if (!strcmp(token, "nsdelegate")) {
+		ctx->flags |= CGRP_ROOT_NS_DELEGATE;
 		return 0;
-
-	while ((token = strsep(&data, ",")) != NULL) {
-		if (!strcmp(token, "nsdelegate")) {
-			*root_flags |= CGRP_ROOT_NS_DELEGATE;
-			continue;
-		}
-
-		pr_err("cgroup2: unknown option \"%s\"\n", token);
-		return -EINVAL;
 	}
 
+	return -EINVAL;
+}
+
+static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
+{
+	if (current->nsproxy->cgroup_ns == &init_cgroup_ns &&
+	    cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
+		seq_puts(seq, ",nsdelegate");
 	return 0;
 }
 
@@ -1766,23 +1764,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 	}
 }
 
-static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
+static int cgroup_reconfigure(struct kernfs_root *kf_root, struct fs_context *fc)
 {
-	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
-		seq_puts(seq, ",nsdelegate");
-	return 0;
-}
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 
-static int cgroup_remount(struct kernfs_root *kf_root, int *flags, char *data)
-{
-	unsigned int root_flags;
-	int ret;
-
-	ret = parse_cgroup_root_flags(data, &root_flags);
-	if (ret)
-		return ret;
-
-	apply_cgroup_root_flags(root_flags);
+	apply_cgroup_root_flags(ctx->flags);
 	return 0;
 }
 
@@ -1870,8 +1856,9 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
 }
 
-void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
+void init_cgroup_root(struct cgroup_fs_context *ctx)
 {
+	struct cgroup_root *root = ctx->root;
 	struct cgroup *cgrp = &root->cgrp;
 
 	INIT_LIST_HEAD(&root->root_list);
@@ -1880,12 +1867,12 @@ void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
 	init_cgroup_housekeeping(cgrp);
 	idr_init(&root->cgroup_idr);
 
-	root->flags = opts->flags;
-	if (opts->release_agent)
-		strscpy(root->release_agent_path, opts->release_agent, PATH_MAX);
-	if (opts->name)
-		strscpy(root->name, opts->name, MAX_CGROUP_ROOT_NAMELEN);
-	if (opts->cpuset_clone_children)
+	root->flags = ctx->flags;
+	if (ctx->release_agent)
+		strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
+	if (ctx->name)
+		strscpy(root->name, ctx->name, MAX_CGROUP_ROOT_NAMELEN);
+	if (ctx->cpuset_clone_children)
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
@@ -1990,57 +1977,53 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
 	return ret;
 }
 
-struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
-			       struct cgroup_root *root, unsigned long magic,
-			       struct cgroup_namespace *ns)
+int cgroup_do_get_tree(struct fs_context *fc)
 {
-	struct dentry *dentry;
-	bool new_sb;
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+	int ret;
 
-	dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb);
+	ctx->kfc.root = ctx->root->kf_root;
+
+	ret = kernfs_get_tree(fc);
+	if (ret < 0)
+		goto out_cgrp;
 
 	/*
 	 * In non-init cgroup namespace, instead of root cgroup's dentry,
 	 * we return the dentry corresponding to the cgroupns->root_cgrp.
 	 */
-	if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
+	if (ctx->ns != &init_cgroup_ns) {
 		struct dentry *nsdentry;
 		struct cgroup *cgrp;
 
 		mutex_lock(&cgroup_mutex);
 		spin_lock_irq(&css_set_lock);
 
-		cgrp = cset_cgroup_from_root(ns->root_cset, root);
+		cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root);
 
 		spin_unlock_irq(&css_set_lock);
 		mutex_unlock(&cgroup_mutex);
 
-		nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
-		dput(dentry);
-		dentry = nsdentry;
+		nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
+		if (IS_ERR(nsdentry))
+			return PTR_ERR(nsdentry);
+		dput(fc->root);
+		fc->root = nsdentry;
 	}
 
-	if (IS_ERR(dentry) || !new_sb)
-		cgroup_put(&root->cgrp);
+	ret = 0;
+	if (ctx->kfc.new_sb_created)
+		goto out_cgrp;
+	apply_cgroup_root_flags(ctx->flags);
+	return 0;
 
-	return dentry;
+out_cgrp:
+	return ret;
 }
 
-static struct dentry *cgroup_mount(struct file_system_type *fs_type,
-			 int flags, const char *unused_dev_name,
-			 void *data, size_t data_size)
+static int cgroup_get_tree(struct fs_context *fc)
 {
-	struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
-	struct dentry *dentry;
-	int ret;
-
-	get_cgroup_ns(ns);
-
-	/* Check if the caller has permission to mount. */
-	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
-		put_cgroup_ns(ns);
-		return ERR_PTR(-EPERM);
-	}
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
@@ -2049,29 +2032,87 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	if (!use_task_css_set_links)
 		cgroup_enable_task_cg_lists();
 
-	if (fs_type == &cgroup2_fs_type) {
-		unsigned int root_flags;
-
-		ret = parse_cgroup_root_flags(data, &root_flags);
-		if (ret) {
-			put_cgroup_ns(ns);
-			return ERR_PTR(ret);
-		}
+	switch (ctx->version) {
+	case 1:
+		return cgroup1_get_tree(fc);
 
+	case 2:
 		cgrp_dfl_visible = true;
 		cgroup_get_live(&cgrp_dfl_root.cgrp);
 
-		dentry = cgroup_do_mount(&cgroup2_fs_type, flags, &cgrp_dfl_root,
-					 CGROUP2_SUPER_MAGIC, ns);
-		if (!IS_ERR(dentry))
-			apply_cgroup_root_flags(root_flags);
-	} else {
-		dentry = cgroup1_mount(&cgroup_fs_type, flags, data,
-				       CGROUP_SUPER_MAGIC, ns);
+		ctx->root = &cgrp_dfl_root;
+		return cgroup_do_get_tree(fc);
+
+	default:
+		BUG();
 	}
+}
+
+static int cgroup_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+
+	if (ctx->version == 1)
+		return cgroup1_parse_option(fc, opt);
+
+	return cgroup2_parse_option(fc, opt);
+}
+
+static int cgroup_validate(struct fs_context *fc)
+{
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 
-	put_cgroup_ns(ns);
-	return dentry;
+	if (ctx->version == 1)
+		return cgroup1_validate(fc);
+	return 0;
+}
+
+/*
+ * Destroy a cgroup filesystem context.
+ */
+static void cgroup_fs_context_free(struct fs_context *fc)
+{
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+
+	kfree(ctx->name);
+	kfree(ctx->release_agent);
+	if (ctx->root)
+		cgroup_put(&ctx->root->cgrp);
+	put_cgroup_ns(ctx->ns);
+	kernfs_free_fs_context(fc);
+	kfree(ctx);
+}
+
+static const struct fs_context_operations cgroup_fs_context_ops = {
+	.free		= cgroup_fs_context_free,
+	.parse_option	= cgroup_parse_option,
+	.validate	= cgroup_validate,
+	.get_tree	= cgroup_get_tree,
+};
+
+/*
+ * Initialise the cgroup filesystem creation/reconfiguration context.  Notably,
+ * we select the namespace we're going to use.
+ */
+static int cgroup_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+	struct cgroup_fs_context *ctx;
+	struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+
+	/* Check if the caller has permission to mount. */
+	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	ctx = kzalloc(sizeof(struct cgroup_fs_context), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->ns = get_cgroup_ns(ns);
+	ctx->version = (fc->fs_type == &cgroup2_fs_type) ? 2 : 1;
+	ctx->kfc.magic = (ctx->version == 2) ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC;
+	fc->fs_private = &ctx->kfc;
+	fc->ops = &cgroup_fs_context_ops;
+	return 0;
 }
 
 static void cgroup_kill_sb(struct super_block *sb)
@@ -2096,17 +2137,17 @@ static void cgroup_kill_sb(struct super_block *sb)
 }
 
 struct file_system_type cgroup_fs_type = {
-	.name = "cgroup",
-	.mount = cgroup_mount,
-	.kill_sb = cgroup_kill_sb,
-	.fs_flags = FS_USERNS_MOUNT,
+	.name			= "cgroup",
+	.init_fs_context	= cgroup_init_fs_context,
+	.kill_sb		= cgroup_kill_sb,
+	.fs_flags		= FS_USERNS_MOUNT,
 };
 
 static struct file_system_type cgroup2_fs_type = {
-	.name = "cgroup2",
-	.mount = cgroup_mount,
-	.kill_sb = cgroup_kill_sb,
-	.fs_flags = FS_USERNS_MOUNT,
+	.name			= "cgroup2",
+	.init_fs_context	= cgroup_init_fs_context,
+	.kill_sb		= cgroup_kill_sb,
+	.fs_flags		= FS_USERNS_MOUNT,
 };
 
 int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
@@ -5175,7 +5216,7 @@ int cgroup_rmdir(struct kernfs_node *kn)
 
 static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
 	.show_options		= cgroup_show_options,
-	.remount_fs		= cgroup_remount,
+	.reconfigure		= cgroup_reconfigure,
 	.mkdir			= cgroup_mkdir,
 	.rmdir			= cgroup_rmdir,
 	.show_path		= cgroup_show_path,
@@ -5242,11 +5283,12 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
  */
 int __init cgroup_init_early(void)
 {
-	static struct cgroup_sb_opts __initdata opts;
+	static struct cgroup_fs_context __initdata ctx;
 	struct cgroup_subsys *ss;
 	int i;
 
-	init_cgroup_root(&cgrp_dfl_root, &opts);
+	ctx.root = &cgrp_dfl_root;
+	init_cgroup_root(&ctx);
 	cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;
 
 	RCU_INIT_POINTER(init_task.cgroups, &init_css_set);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e6582b2f5144..b02161a41d5a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -324,10 +324,8 @@ static int cpuset_get_tree(struct fs_context *fc)
 	int ret = -ENODEV;
 
 	cgroup_fs = get_fs_type("cgroup");
-	if (cgroup_fs) {
-		ret = PTR_ERR(cgroup_fs);
+	if (!cgroup_fs)
 		goto out;
-	}
 
 	cg_fc = vfs_new_fs_context(cgroup_fs, NULL, fc->sb_flags, fc->purpose);
 	put_filesystem(cgroup_fs);



* [PATCH 20/32] hugetlbfs: Convert to fs_context [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (18 preceding siblings ...)
  2018-07-10 22:43 ` [PATCH 19/32] kernfs, sysfs, cgroup, intel_rdt: Support " David Howells
@ 2018-07-10 22:43 ` David Howells
  2018-07-10 22:43 ` [PATCH 21/32] vfs: Remove kern_mount_data() " David Howells
                   ` (17 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:43 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

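Convert hugetlbfs to use the fs_context during mount.  Options are now
parsed one at a time by ->parse_option(), and the conversion of the size
options into huge page counts is deferred to ->validate().  The internal
per-hstate mounts set the hstate in the context directly instead of passing
a "pagesize=" string.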

Signed-off-by: David Howells <dhowells@redhat.com>
---
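
Illustration only, not part of the patch: a minimal userspace sketch of the
parse/validate split used in the conversion below.  Each option only records
the raw value and its type in the context; the conversion to a page count
waits until every option has been seen, because it depends on the page size,
which a later option may still change.  The names pool_model, ctx_model,
parse_size(), model_parse_option() and model_validate() are hypothetical
stand-ins for struct hstate, struct hugetlbfs_fs_context, memparse() and the
two context operations, and the percentage arithmetic is an assumption about
what hugetlbfs_size_to_hpages() does.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };

/* Hypothetical stand-in for struct hstate: one huge page pool. */
struct pool_model {
	unsigned int page_shift;	/* log2 of the huge page size */
	unsigned long max_pages;	/* number of huge pages in the pool */
};

/* Hypothetical stand-in for struct hugetlbfs_fs_context. */
struct ctx_model {
	const struct pool_model *pool;
	unsigned long long max_size_opt, min_size_opt;
	enum size_type max_val_type, min_val_type;
	long max_hpages, min_hpages;
};

/* Rough userspace substitute for memparse(): "<digits>[K|M|G][%]". */
static int parse_size(const char *val, unsigned long long *size,
		      enum size_type *type)
{
	unsigned int shift = 0;
	char *rest;

	if (!isdigit((unsigned char)*val))
		return -1;	/* reject a bare suffix such as "M" */
	*size = strtoull(val, &rest, 0);
	switch (*rest) {
	case 'G': case 'g': shift = 30; break;
	case 'M': case 'm': shift = 20; break;
	case 'K': case 'k': shift = 10; break;
	}
	if (shift) {
		*size <<= shift;
		rest++;
	}
	*type = (*rest == '%') ? SIZE_PERCENT : SIZE_STD;
	return 0;
}

/* Step 1: record one "key=val" option; no conversion happens here. */
static int model_parse_option(struct ctx_model *ctx, const char *opt)
{
	if (strncmp(opt, "size=", 5) == 0)
		return parse_size(opt + 5, &ctx->max_size_opt,
				  &ctx->max_val_type);
	if (strncmp(opt, "min_size=", 9) == 0)
		return parse_size(opt + 9, &ctx->min_size_opt,
				  &ctx->min_val_type);
	return -1;		/* unknown option */
}

/* Convert a recorded size into a page count; -1 means "not specified". */
static long size_to_hpages(const struct pool_model *p,
			   unsigned long long size, enum size_type type)
{
	assert(p != NULL);
	if (type == NO_SIZE)
		return -1;
	if (type == SIZE_PERCENT)
		size = ((size << p->page_shift) * p->max_pages) / 100;
	return (long)(size >> p->page_shift);
}

/* Step 2: once all options are in, convert and cross-check them. */
static int model_validate(struct ctx_model *ctx)
{
	ctx->max_hpages = size_to_hpages(ctx->pool, ctx->max_size_opt,
					 ctx->max_val_type);
	ctx->min_hpages = size_to_hpages(ctx->pool, ctx->min_size_opt,
					 ctx->min_val_type);
	if (ctx->max_val_type > NO_SIZE && ctx->min_hpages > ctx->max_hpages)
		return -1;	/* minimum exceeds maximum */
	return 0;
}
```

With a pool of one hundred 2MB pages, recording "size=50%" and "min_size=8M"
and then validating gives 50 and 4 pages respectively; recording "size=4M"
with the same minimum is rejected at validation time rather than at parse
time.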

 fs/hugetlbfs/inode.c |  342 +++++++++++++++++++++++++++++---------------------
 1 file changed, 197 insertions(+), 145 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 76fb8eb2bea8..91fadca3c8e6 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -45,11 +45,17 @@ const struct file_operations hugetlbfs_file_operations;
 static const struct inode_operations hugetlbfs_dir_inode_operations;
 static const struct inode_operations hugetlbfs_inode_operations;
 
-struct hugetlbfs_config {
+enum hugetlbfs_size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };
+
+struct hugetlbfs_fs_context {
 	struct hstate		*hstate;
+	unsigned long long	max_size_opt;
+	unsigned long long	min_size_opt;
 	long			max_hpages;
 	long			nr_inodes;
 	long			min_hpages;
+	enum hugetlbfs_size_type max_val_type;
+	enum hugetlbfs_size_type min_val_type;
 	kuid_t			uid;
 	kgid_t			gid;
 	umode_t			mode;
@@ -708,16 +714,16 @@ static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
 }
 
 static struct inode *hugetlbfs_get_root(struct super_block *sb,
-					struct hugetlbfs_config *config)
+					struct hugetlbfs_fs_context *ctx)
 {
 	struct inode *inode;
 
 	inode = new_inode(sb);
 	if (inode) {
 		inode->i_ino = get_next_ino();
-		inode->i_mode = S_IFDIR | config->mode;
-		inode->i_uid = config->uid;
-		inode->i_gid = config->gid;
+		inode->i_mode = S_IFDIR | ctx->mode;
+		inode->i_uid = ctx->uid;
+		inode->i_gid = ctx->gid;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
 		inode->i_op = &hugetlbfs_dir_inode_operations;
 		inode->i_fop = &simple_dir_operations;
@@ -1081,8 +1087,6 @@ static const struct super_operations hugetlbfs_ops = {
 	.show_options	= hugetlbfs_show_options,
 };
 
-enum hugetlbfs_size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };
-
 /*
  * Convert size option passed from command line to number of huge pages
  * in the pool specified by hstate.  Size option could be in bytes
@@ -1105,171 +1109,156 @@ hugetlbfs_size_to_hpages(struct hstate *h, unsigned long long size_opt,
 	return size_opt;
 }
 
-static int
-hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig)
+/*
+ * Parse one mount option.
+ */
+static int hugetlbfs_parse_option(struct fs_context *fc, char *opt, size_t len)
 {
-	char *p, *rest;
+	struct hugetlbfs_fs_context *ctx = fc->fs_private;
+	char *rest;
+	unsigned long ps;
 	substring_t args[MAX_OPT_ARGS];
-	int option;
-	unsigned long long max_size_opt = 0, min_size_opt = 0;
-	enum hugetlbfs_size_type max_val_type = NO_SIZE, min_val_type = NO_SIZE;
-
-	if (!options)
+	int token, option;
+
+	token = match_token(opt, tokens, args);
+	switch (token) {
+	case Opt_uid:
+		if (match_int(&args[0], &option))
+			goto bad_val;
+		ctx->uid = make_kuid(current_user_ns(), option);
+		if (!uid_valid(ctx->uid))
+			goto bad_val;
 		return 0;
 
-	while ((p = strsep(&options, ",")) != NULL) {
-		int token;
-		if (!*p)
-			continue;
+	case Opt_gid:
+		if (match_int(&args[0], &option))
+			goto bad_val;
+		ctx->gid = make_kgid(current_user_ns(), option);
+		if (!gid_valid(ctx->gid))
+			goto bad_val;
+		return 0;
 
-		token = match_token(p, tokens, args);
-		switch (token) {
-		case Opt_uid:
-			if (match_int(&args[0], &option))
- 				goto bad_val;
-			pconfig->uid = make_kuid(current_user_ns(), option);
-			if (!uid_valid(pconfig->uid))
-				goto bad_val;
-			break;
+	case Opt_mode:
+		if (match_octal(&args[0], &option))
+			goto bad_val;
+		ctx->mode = option & 01777U;
+		return 0;
 
-		case Opt_gid:
-			if (match_int(&args[0], &option))
- 				goto bad_val;
-			pconfig->gid = make_kgid(current_user_ns(), option);
-			if (!gid_valid(pconfig->gid))
-				goto bad_val;
-			break;
+	case Opt_size:
+		/* memparse() will accept a K/M/G without a digit */
+		if (!isdigit(*args[0].from))
+			goto bad_val;
+		ctx->max_size_opt = memparse(args[0].from, &rest);
+		ctx->max_val_type = SIZE_STD;
+		if (*rest == '%')
+			ctx->max_val_type = SIZE_PERCENT;
+		return 0;
 
-		case Opt_mode:
-			if (match_octal(&args[0], &option))
- 				goto bad_val;
-			pconfig->mode = option & 01777U;
-			break;
+	case Opt_nr_inodes:
+		/* memparse() will accept a K/M/G without a digit */
+		if (!isdigit(*args[0].from))
+			goto bad_val;
+		ctx->nr_inodes = memparse(args[0].from, &rest);
+		return 0;
 
-		case Opt_size: {
-			/* memparse() will accept a K/M/G without a digit */
-			if (!isdigit(*args[0].from))
-				goto bad_val;
-			max_size_opt = memparse(args[0].from, &rest);
-			max_val_type = SIZE_STD;
-			if (*rest == '%')
-				max_val_type = SIZE_PERCENT;
-			break;
+	case Opt_pagesize:
+		ps = memparse(args[0].from, &rest);
+		ctx->hstate = size_to_hstate(ps);
+		if (!ctx->hstate) {
+			pr_err("Unsupported page size %lu MB\n", ps >> 20);
+			return -EINVAL;
 		}
+		return 0;
 
-		case Opt_nr_inodes:
-			/* memparse() will accept a K/M/G without a digit */
-			if (!isdigit(*args[0].from))
-				goto bad_val;
-			pconfig->nr_inodes = memparse(args[0].from, &rest);
-			break;
+	case Opt_min_size:
+		/* memparse() will accept a K/M/G without a digit */
+		if (!isdigit(*args[0].from))
+			goto bad_val;
+		ctx->min_size_opt = memparse(args[0].from, &rest);
+		ctx->min_val_type = SIZE_STD;
+		if (*rest == '%')
+			ctx->min_val_type = SIZE_PERCENT;
+		return 0;
 
-		case Opt_pagesize: {
-			unsigned long ps;
-			ps = memparse(args[0].from, &rest);
-			pconfig->hstate = size_to_hstate(ps);
-			if (!pconfig->hstate) {
-				pr_err("Unsupported page size %lu MB\n",
-					ps >> 20);
-				return -EINVAL;
-			}
-			break;
-		}
+	default:
+		pr_err("Bad mount option: \"%s\"\n", opt);
+		return -EINVAL;
+	}
 
-		case Opt_min_size: {
-			/* memparse() will accept a K/M/G without a digit */
-			if (!isdigit(*args[0].from))
-				goto bad_val;
-			min_size_opt = memparse(args[0].from, &rest);
-			min_val_type = SIZE_STD;
-			if (*rest == '%')
-				min_val_type = SIZE_PERCENT;
-			break;
-		}
+bad_val:
+	pr_err("Bad value '%s' for mount option '%s'\n", args[0].from, opt);
+	return -EINVAL;
+}
 
-		default:
-			pr_err("Bad mount option: \"%s\"\n", p);
-			return -EINVAL;
-			break;
-		}
-	}
+/*
+ * Validate the parsed options.
+ */
+static int hugetlbfs_validate(struct fs_context *fc)
+{
+	struct hugetlbfs_fs_context *ctx = fc->fs_private;
 
 	/*
 	 * Use huge page pool size (in hstate) to convert the size
 	 * options to number of huge pages.  If NO_SIZE, -1 is returned.
 	 */
-	pconfig->max_hpages = hugetlbfs_size_to_hpages(pconfig->hstate,
-						max_size_opt, max_val_type);
-	pconfig->min_hpages = hugetlbfs_size_to_hpages(pconfig->hstate,
-						min_size_opt, min_val_type);
+	ctx->max_hpages = hugetlbfs_size_to_hpages(ctx->hstate,
+						   ctx->max_size_opt,
+						   ctx->max_val_type);
+	ctx->min_hpages = hugetlbfs_size_to_hpages(ctx->hstate,
+						   ctx->min_size_opt,
+						   ctx->min_val_type);
 
 	/*
 	 * If max_size was specified, then min_size must be smaller
 	 */
-	if (max_val_type > NO_SIZE &&
-	    pconfig->min_hpages > pconfig->max_hpages) {
-		pr_err("minimum size can not be greater than maximum size\n");
+	if (ctx->max_val_type > NO_SIZE &&
+	    ctx->min_hpages > ctx->max_hpages) {
+		pr_err("Minimum size can not be greater than maximum size\n");
 		return -EINVAL;
 	}
 
 	return 0;
-
-bad_val:
-	pr_err("Bad value '%s' for mount option '%s'\n", args[0].from, p);
- 	return -EINVAL;
 }
 
 static int
-hugetlbfs_fill_super(struct super_block *sb, void *data, size_t data_size,
-		     int silent)
+hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc)
 {
-	int ret;
-	struct hugetlbfs_config config;
+	struct hugetlbfs_fs_context *ctx =
+		fc->fs_private;
 	struct hugetlbfs_sb_info *sbinfo;
 
-	config.max_hpages = -1; /* No limit on size by default */
-	config.nr_inodes = -1; /* No limit on number of inodes by default */
-	config.uid = current_fsuid();
-	config.gid = current_fsgid();
-	config.mode = 0755;
-	config.hstate = &default_hstate;
-	config.min_hpages = -1; /* No default minimum size */
-	ret = hugetlbfs_parse_options(data, &config);
-	if (ret)
-		return ret;
-
 	sbinfo = kmalloc(sizeof(struct hugetlbfs_sb_info), GFP_KERNEL);
 	if (!sbinfo)
 		return -ENOMEM;
 	sb->s_fs_info = sbinfo;
-	sbinfo->hstate = config.hstate;
 	spin_lock_init(&sbinfo->stat_lock);
-	sbinfo->max_inodes = config.nr_inodes;
-	sbinfo->free_inodes = config.nr_inodes;
-	sbinfo->spool = NULL;
-	sbinfo->uid = config.uid;
-	sbinfo->gid = config.gid;
-	sbinfo->mode = config.mode;
+	sbinfo->hstate		= ctx->hstate;
+	sbinfo->max_inodes	= ctx->nr_inodes;
+	sbinfo->free_inodes	= ctx->nr_inodes;
+	sbinfo->spool		= NULL;
+	sbinfo->uid		= ctx->uid;
+	sbinfo->gid		= ctx->gid;
+	sbinfo->mode		= ctx->mode;
 
 	/*
 	 * Allocate and initialize subpool if maximum or minimum size is
 	 * specified.  Any needed reservations (for minimim size) are taken
 	 * taken when the subpool is created.
 	 */
-	if (config.max_hpages != -1 || config.min_hpages != -1) {
-		sbinfo->spool = hugepage_new_subpool(config.hstate,
-							config.max_hpages,
-							config.min_hpages);
+	if (ctx->max_hpages != -1 || ctx->min_hpages != -1) {
+		sbinfo->spool = hugepage_new_subpool(ctx->hstate,
+						     ctx->max_hpages,
+						     ctx->min_hpages);
 		if (!sbinfo->spool)
 			goto out_free;
 	}
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
-	sb->s_blocksize = huge_page_size(config.hstate);
-	sb->s_blocksize_bits = huge_page_shift(config.hstate);
+	sb->s_blocksize = huge_page_size(ctx->hstate);
+	sb->s_blocksize_bits = huge_page_shift(ctx->hstate);
 	sb->s_magic = HUGETLBFS_MAGIC;
 	sb->s_op = &hugetlbfs_ops;
 	sb->s_time_gran = 1;
-	sb->s_root = d_make_root(hugetlbfs_get_root(sb, &config));
+	sb->s_root = d_make_root(hugetlbfs_get_root(sb, ctx));
 	if (!sb->s_root)
 		goto out_free;
 	return 0;
@@ -1279,17 +1268,50 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, size_t data_size,
 	return -ENOMEM;
 }
 
-static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data, size_t data_size)
+static int hugetlbfs_get_tree(struct fs_context *fc)
+{
+	return vfs_get_super(fc, vfs_get_independent_super, hugetlbfs_fill_super);
+}
+
+static void hugetlbfs_fs_context_free(struct fs_context *fc)
 {
-	return mount_nodev(fs_type, flags, data, data_size,
-			   hugetlbfs_fill_super);
+	kfree(fc->fs_private);
+}
+
+static const struct fs_context_operations hugetlbfs_fs_context_ops = {
+	.free		= hugetlbfs_fs_context_free,
+	.parse_option	= hugetlbfs_parse_option,
+	.validate	= hugetlbfs_validate,
+	.get_tree	= hugetlbfs_get_tree,
+};
+
+static int hugetlbfs_init_fs_context(struct fs_context *fc,
+				     struct dentry *reference)
+{
+	struct hugetlbfs_fs_context *ctx;
+
+	ctx = kzalloc(sizeof(struct hugetlbfs_fs_context), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->max_hpages	= -1; /* No limit on size by default */
+	ctx->nr_inodes	= -1; /* No limit on number of inodes by default */
+	ctx->uid	= current_fsuid();
+	ctx->gid	= current_fsgid();
+	ctx->mode	= 0755;
+	ctx->hstate	= &default_hstate;
+	ctx->min_hpages	= -1; /* No default minimum size */
+	ctx->max_val_type = NO_SIZE;
+	ctx->min_val_type = NO_SIZE;
+	fc->fs_private = ctx;
+	fc->ops	= &hugetlbfs_fs_context_ops;
+	return 0;
 }
 
 static struct file_system_type hugetlbfs_fs_type = {
-	.name		= "hugetlbfs",
-	.mount		= hugetlbfs_mount,
-	.kill_sb	= kill_litter_super,
+	.name			= "hugetlbfs",
+	.init_fs_context	= hugetlbfs_init_fs_context,
+	.kill_sb		= kill_litter_super,
 };
 
 static struct vfsmount *hugetlbfs_vfsmount[HUGE_MAX_HSTATE];
@@ -1396,8 +1418,47 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
 	return file;
 }
 
+static struct vfsmount *__init mount_one_hugetlbfs(struct hstate *h)
+{
+	struct hugetlbfs_fs_context *ctx;
+	struct fs_context *fc;
+	struct vfsmount *mnt;
+	int ret;
+
+	fc = vfs_new_fs_context(&hugetlbfs_fs_type, NULL, 0,
+				FS_CONTEXT_FOR_KERNEL_MOUNT);
+	if (IS_ERR(fc)) {
+		ret = PTR_ERR(fc);
+		goto err;
+	}
+
+	ctx = fc->fs_private;
+	ctx->hstate = h;
+
+	ret = vfs_get_tree(fc);
+	if (ret < 0)
+		goto err_fc;
+
+	mnt = vfs_create_mount(fc, 0);
+	if (IS_ERR(mnt)) {
+		ret = PTR_ERR(mnt);
+		goto err_fc;
+	}
+
+	put_fs_context(fc);
+	return mnt;
+
+err_fc:
+	put_fs_context(fc);
+err:
+	pr_err("Cannot mount internal hugetlbfs for page size %uK",
+	       1U << (h->order + PAGE_SHIFT - 10));
+	return ERR_PTR(ret);
+}
+
 static int __init init_hugetlbfs_fs(void)
 {
+	struct vfsmount *mnt;
 	struct hstate *h;
 	int error;
 	int i;
@@ -1420,25 +1481,16 @@ static int __init init_hugetlbfs_fs(void)
 
 	i = 0;
 	for_each_hstate(h) {
-		char buf[50];
-		unsigned ps_kb = 1U << (h->order + PAGE_SHIFT - 10);
-		int n;
-
-		n = snprintf(buf, sizeof(buf), "pagesize=%uK", ps_kb);
-		hugetlbfs_vfsmount[i] = kern_mount_data(&hugetlbfs_fs_type,
-							buf, n + 1);
-
-		if (IS_ERR(hugetlbfs_vfsmount[i])) {
-			pr_err("Cannot mount internal hugetlbfs for "
-				"page size %uK", ps_kb);
-			error = PTR_ERR(hugetlbfs_vfsmount[i]);
-			hugetlbfs_vfsmount[i] = NULL;
+		mnt = mount_one_hugetlbfs(h);
+		if (IS_ERR(mnt) && i == 0) {
+			error = PTR_ERR(mnt);
+			goto out;
 		}
+		hugetlbfs_vfsmount[i] = mnt;
 		i++;
 	}
-	/* Non default hstates are optional */
-	if (!IS_ERR_OR_NULL(hugetlbfs_vfsmount[default_hstate_idx]))
-		return 0;
+
+	return 0;
 
  out:
 	kmem_cache_destroy(hugetlbfs_inode_cachep);



* [PATCH 21/32] vfs: Remove kern_mount_data() [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (19 preceding siblings ...)
  2018-07-10 22:43 ` [PATCH 20/32] hugetlbfs: Convert to " David Howells
@ 2018-07-10 22:43 ` David Howells
  2018-07-10 22:43 ` [PATCH 22/32] vfs: Provide documentation for new mount API " David Howells
                   ` (16 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:43 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

kern_mount_data() is no longer used, so remove it.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/namespace.c     |    7 -------
 include/linux/fs.h |    1 -
 2 files changed, 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 3bae16db1b1d..d5a4d9351a17 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3346,13 +3346,6 @@ struct vfsmount *kern_mount(struct file_system_type *type)
 }
 EXPORT_SYMBOL_GPL(kern_mount);
 
-struct vfsmount *kern_mount_data(struct file_system_type *type,
-				 void *data, size_t data_size)
-{
-	return vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
-}
-EXPORT_SYMBOL_GPL(kern_mount_data);
-
 /*
  * Move a mount from one place to another.
  * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 88de0f586b38..e6d963f2fdc2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2211,7 +2211,6 @@ mount_pseudo(struct file_system_type *fs_type, char *name,
 extern int register_filesystem(struct file_system_type *);
 extern int unregister_filesystem(struct file_system_type *);
 extern struct vfsmount *kern_mount(struct file_system_type *);
-extern struct vfsmount *kern_mount_data(struct file_system_type *, void *, size_t);
 extern void kern_unmount(struct vfsmount *mnt);
 extern int may_umount_tree(struct vfsmount *);
 extern int may_umount(struct vfsmount *);



* [PATCH 22/32] vfs: Provide documentation for new mount API [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (20 preceding siblings ...)
  2018-07-10 22:43 ` [PATCH 21/32] vfs: Remove kern_mount_data() " David Howells
@ 2018-07-10 22:43 ` David Howells
  2018-07-13  1:37   ` Randy Dunlap
  2018-07-13  9:45   ` David Howells
  2018-07-10 22:44 ` [PATCH 23/32] Make anon_inodes unconditional " David Howells
                   ` (15 subsequent siblings)
  37 siblings, 2 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:43 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Provide documentation for the new mount API.

Signed-off-by: David Howells <dhowells@redhat.com>
---
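
Illustration only, not part of the patch: a minimal userspace sketch of the
multistep flow that the document below describes, driven through an
operations table.  Everything here is a hypothetical model: fs_context_model
and fs_context_ops_model mirror only a few fields of the real structures,
the "toy" filesystem and model_mount() are invented for the example, a
string stands in for the root dentry, and step (6), error message
retrieval, is not modelled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct fs_context_model;

/* Mirrors a subset of the documented fs_context_operations. */
struct fs_context_ops_model {
	void (*free)(struct fs_context_model *fc);
	int (*parse_option)(struct fs_context_model *fc, char *opt, size_t len);
	int (*validate)(struct fs_context_model *fc);
	int (*get_tree)(struct fs_context_model *fc);
};

struct fs_context_model {
	const struct fs_context_ops_model *ops;
	void *fs_private;	/* filesystem's parsed options */
	const char *root;	/* stand-in for struct dentry *root */
};

/* Private data of the toy filesystem. */
struct toy_ctx {
	int mode;
	int have_mode;
};

static int toy_parse_option(struct fs_context_model *fc, char *opt, size_t len)
{
	struct toy_ctx *ctx = fc->fs_private;

	if (len > 5 && strncmp(opt, "mode=", 5) == 0) {
		ctx->mode = (int)strtol(opt + 5, NULL, 8);
		ctx->have_mode = 1;
		return 0;
	}
	return -1;		/* unknown option */
}

static int toy_validate(struct fs_context_model *fc)
{
	struct toy_ctx *ctx = fc->fs_private;

	/* The toy filesystem insists on an explicit mode. */
	return ctx->have_mode ? 0 : -1;
}

static int toy_get_tree(struct fs_context_model *fc)
{
	fc->root = "toy-root";	/* on success, set the mountable root */
	return 0;
}

static void toy_free(struct fs_context_model *fc)
{
	free(fc->fs_private);
	fc->fs_private = NULL;
}

static const struct fs_context_ops_model toy_ops = {
	.free		= toy_free,
	.parse_option	= toy_parse_option,
	.validate	= toy_validate,
	.get_tree	= toy_get_tree,
};

/* Plays the role of file_system_type::init_fs_context(). */
static int toy_init_fs_context(struct fs_context_model *fc)
{
	struct toy_ctx *ctx = calloc(1, sizeof(*ctx));

	if (!ctx)
		return -1;
	fc->fs_private = ctx;
	fc->ops = &toy_ops;
	return 0;
}

/* Drive the steps for a NULL-terminated list of option strings. */
static int model_mount(char **opts, const char **root_out)
{
	struct fs_context_model fc = { 0 };
	int ret;

	ret = toy_init_fs_context(&fc);			/* (1) create */
	if (ret)
		return ret;
	for (; *opts && !ret; opts++)			/* (2) parse */
		ret = fc.ops->parse_option(&fc, *opts, strlen(*opts));
	if (!ret)
		ret = fc.ops->validate(&fc);		/* (3) validate */
	if (!ret)
		ret = fc.ops->get_tree(&fc);		/* (4) get root */
	if (!ret)
		*root_out = fc.root;			/* (5) "mount" */
	fc.ops->free(&fc);				/* (7) destroy */
	return ret;
}
```

The point of the shape is that ->free() runs on every path once the context
has been initialised, so a failure in parsing, validation or get_tree needs
no separate cleanup in the caller.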

 Documentation/filesystems/mount_api.txt |  439 +++++++++++++++++++++++++++++++
 1 file changed, 439 insertions(+)
 create mode 100644 Documentation/filesystems/mount_api.txt

diff --git a/Documentation/filesystems/mount_api.txt b/Documentation/filesystems/mount_api.txt
new file mode 100644
index 000000000000..6269fb2476f8
--- /dev/null
+++ b/Documentation/filesystems/mount_api.txt
@@ -0,0 +1,439 @@
+			     ====================
+			     FILESYSTEM MOUNT API
+			     ====================
+
+CONTENTS
+
+ (1) Overview.
+
+ (2) The filesystem context.
+
+ (3) The filesystem context operations.
+
+ (4) Filesystem context security.
+
+ (5) VFS filesystem context operations.
+
+
+========
+OVERVIEW
+========
+
+The creation of new mounts is now to be done in a multistep process:
+
+ (1) Create a filesystem context.
+
+ (2) Parse the options and attach them to the context.  Options are expected to
+     be passed individually from userspace, though legacy binary options can be
+     handled.
+
+ (3) Validate and pre-process the context.
+
+ (4) Get or create a superblock and mountable root.
+
+ (5) Perform the mount.
+
+ (6) Return an error message attached to the context.
+
+ (7) Destroy the context.
+
+To support this, the file_system_type struct gains a new field:
+
+	int (*init_fs_context)(struct fs_context *fc, struct dentry *reference);
+
+which is invoked to set up the filesystem-specific parts of a filesystem
+context, including the additional space.  For a submount
+(FS_CONTEXT_FOR_SUBMOUNT) the reference parameter indicates the automount
+point and for reconfiguration (FS_CONTEXT_FOR_RECONFIGURE) it indicates the
+root of the superblock to be reconfigured; the filesystem may draw extra
+information (such as namespaces) from it.  Otherwise it will be NULL.
+
+Note that security initialisation is done *after* the filesystem is called so
+that the namespaces may be adjusted first.
+
+And the super_operations struct gains one field:
+
+	int (*reconfigure)(struct super_block *, struct fs_context *);
+
+This shadows the ->remount_fs() operation and takes a prepared filesystem
+context instead of the mount flags and data page.  It may modify the sb_flags
+in the context for the caller to pick up.
+
+[NOTE] reconfigure is intended as a replacement for remount_fs.
+
+
+======================
+THE FILESYSTEM CONTEXT
+======================
+
+The creation and reconfiguration of a superblock is governed by a filesystem
+context.  This is represented by the fs_context structure:
+
+	struct fs_context {
+		const struct fs_context_operations *ops;
+		struct file_system_type *fs_type;
+		void			*fs_private;
+		struct dentry		*root;
+		struct user_namespace	*user_ns;
+		struct net		*net_ns;
+		const struct cred	*cred;
+		char			*source;
+		char			*subtype;
+		void			*security;
+		void			*s_fs_info;
+		unsigned int		sb_flags;
+		enum fs_context_purpose	purpose:8;
+		bool			sloppy:1;
+		bool			silent:1;
+		...
+	};
+
+The fs_context fields are as follows:
+
+ (*) const struct fs_context_operations *ops
+
+     These are operations that can be done on a filesystem context (see
+     below).  This must be set by the ->init_fs_context() file_system_type
+     operation.
+
+ (*) struct file_system_type *fs_type
+
+     A pointer to the file_system_type of the filesystem that is being
+     constructed or reconfigured.  This retains a reference on the type owner.
+
+ (*) void *fs_private
+
+     A pointer to the file system's private data.  This is where the filesystem
+     will need to store any options it parses.
+
+ (*) struct dentry *root
+
+     A pointer to the root of the mountable tree (and indirectly, the
+     superblock thereof).  This is filled in by the ->get_tree() op.  If this
+     is set, an active reference on root->d_sb must also be held.
+
+ (*) struct user_namespace *user_ns
+ (*) struct net *net_ns
+
+     These are a subset of the namespaces in use by the invoking process.  They
+     retain references on each namespace.  The subscribed namespaces may be
+     replaced by the filesystem to reflect other sources, such as the parent
+     mount superblock on an automount.
+
+ (*) const struct cred *cred
+
+     The mounter's credentials.  This retains a reference on the credentials.
+
+ (*) char *source
+
+     This specifies the source.  It may be a block device (e.g. /dev/sda1) or
+     something more exotic, such as the "host:/path" that NFS desires.
+
+ (*) char *subtype
+
+     This is a string to be added to the type displayed in /proc/mounts to
+     qualify it (used by FUSE).  This is available for the filesystem to set if
+     desired.
+
+ (*) void *security
+
+     A place for the LSMs to hang their security data for the superblock.  The
+     relevant security operations are described below.
+
+ (*) void *s_fs_info
+
+     The proposed s_fs_info for a new superblock, set in the superblock by
+     sget_fc().  This can be used to distinguish superblocks.
+
+ (*) unsigned int sb_flags
+
+     This holds the SB_* flags to be set in super_block::s_flags.
+
+ (*) enum fs_context_purpose purpose
+
+     This indicates the purpose for which the context is intended.  The
+     available values are:
+
+	FS_CONTEXT_FOR_USER_MOUNT,	-- New superblock for user-specified mount
+	FS_CONTEXT_FOR_KERNEL_MOUNT,	-- New superblock for kernel-internal mount
+	FS_CONTEXT_FOR_SUBMOUNT,	-- New automatic submount of extant mount
+	FS_CONTEXT_FOR_RECONFIGURE	-- Change an existing mount
+
+ (*) bool sloppy
+ (*) bool silent
+
+     These are set if the sloppy or silent mount options are given.
+
+     [NOTE] sloppy is probably unnecessary when userspace passes over one
+     option at a time since the error can just be ignored if userspace deems it
+     to be unimportant.
+
+     [NOTE] silent is probably redundant with sb_flags & SB_SILENT.
+
+The mount context is created by calling vfs_new_fs_context(), vfs_sb_reconfig()
+or vfs_dup_fs_context() and is destroyed with put_fs_context().  Note that the
+structure is not refcounted.
+
+VFS, security and filesystem mount options are set individually with
+vfs_parse_mount_option().  Options provided by the old mount(2) system call as
+a page of data can be parsed with generic_parse_monolithic().
+
+When mounting, the filesystem is allowed to take data from any of the pointers
+and attach it to the superblock (or whatever), provided it clears the pointer
+in the mount context.
+
+The filesystem is also allowed to allocate resources and pin them with the
+mount context.  For instance, NFS might pin the appropriate protocol version
+module.
+
+
+=================================
+THE FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+The filesystem context points to a table of operations:
+
+	struct fs_context_operations {
+		void (*free)(struct fs_context *fc);
+		int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+		int (*parse_source)(struct fs_context *fc, char *source);
+		int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
+		int (*parse_monolithic)(struct fs_context *fc, void *data,
+					size_t data_size);
+		int (*validate)(struct fs_context *fc);
+		int (*get_tree)(struct fs_context *fc);
+	};
+
+These operations are invoked by the various stages of the mount procedure to
+manage the filesystem context.  They are as follows:
+
+ (*) void (*free)(struct fs_context *fc);
+
+     Called to clean up the filesystem-specific part of the filesystem context
+     when the context is destroyed.  It should be aware that parts of the
+     context may have been removed and NULL'd out by ->get_tree().
+
+ (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+
+     Called when a filesystem context has been duplicated to duplicate the
+     filesystem-private data.  An error may be returned to indicate failure to
+     do this.
+
+     [!] Note that even if this fails, put_fs_context() will be called
+	 immediately thereafter, so ->dup() *must* make the
+	 filesystem-private data safe for ->free().
+
+ (*) int (*parse_source)(struct fs_context *fc, char *source);
+
+     Called when a source or device is specified for a filesystem context.
+     This may be called multiple times if the filesystem supports it.  If
+     successful, 0 should be returned or a negative error code otherwise.
+
+ (*) int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
+
+     Called when an option is to be added to the filesystem context.  opt
+     points to the option string, likely in "key[=val]" format.  VFS-specific
+     options will have been weeded out and fc->sb_flags updated in the context.
+     Security options will also have been weeded out and fc->security updated.
+
+     If successful, 0 should be returned or a negative error code otherwise.
+
+ (*) int (*parse_monolithic)(struct fs_context *fc,
+			     void *data, size_t data_size);
+
+     Called when the mount(2) system call is invoked to pass the entire data
+     page in one go.  If this is expected to be just a list of "key[=val]"
+     items separated by commas, then this may be set to NULL.
+
+     The return value is as for ->parse_option().
+
+     If the filesystem (e.g. NFS) needs to examine the data first and then
+     finds it's the standard key-val list then it may pass it off to
+     generic_parse_monolithic().
+
+ (*) int (*validate)(struct fs_context *fc);
+
+     Called when all the options have been applied and the mount is about to
+     take place.  It should check for inconsistencies from mount options and
+     it is also allowed to do preliminary resource acquisition.  For instance,
+     the core NFS module could load the NFS protocol module here.
+
+     Note that if fc->purpose == FS_CONTEXT_FOR_RECONFIGURE, some of the
+     options necessary for a new mount may not be set.
+
+     The return value is as for ->parse_option().
+
+ (*) int (*get_tree)(struct fs_context *fc);
+
+     Called to get or create the mountable root and superblock, using the
+     information stored in the filesystem context (reconfiguration goes via a
+     different vector).  It may detach any resources it desires from the
+     filesystem context and transfer them to the superblock it creates.
+
+     On success it should set fc->root to the mountable root and return 0.  In
+     the case of an error, it should return a negative error code.
+
+     The phase on a userspace-driven context will be set to only allow this to
+     be called once on any particular context.
+
+
+===========================
+FILESYSTEM CONTEXT SECURITY
+===========================
+
+The filesystem context contains a security pointer that the LSMs can use for
+building up a security context for the superblock to be mounted.  There are a
+number of operations used by the new mount code for this purpose:
+
+ (*) int security_fs_context_alloc(struct fs_context *fc,
+				   struct dentry *reference);
+
+     Called to initialise fc->security (which is preset to NULL) and allocate
+     any resources needed.  It should return 0 on success or a negative error
+     code on failure.
+
+     reference will be non-NULL if the context is being created for superblock
+     reconfiguration (FS_CONTEXT_FOR_RECONFIGURE) in which case it indicates
+     the root dentry of the superblock to be reconfigured.  It will also be
+     non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case
+     it indicates the automount point.
+
+ (*) int security_fs_context_dup(struct fs_context *fc,
+				 struct fs_context *src_fc);
+
+     Called to initialise fc->security (which is preset to NULL) and allocate
+     any resources needed.  The original filesystem context is pointed to by
+     src_fc and may be used for reference.  It should return 0 on success or a
+     negative error code on failure.
+
+ (*) void security_fs_context_free(struct fs_context *fc);
+
+     Called to clean up anything attached to fc->security.  Note that the
+     contents may have been transferred to a superblock and the pointer cleared
+     during get_tree.
+
+ (*) int security_fs_context_parse_source(struct fs_context *fc, char *src);
+
+     Called for each source (there may be more than one if the filesystem
+     supports it).  The arguments are as for the ->parse_source() method.  It
+     should return 0 on success or a negative error code on failure.
+
+ (*) int security_fs_context_parse_option(struct fs_context *fc,
+					  char *opt, size_t len);
+
+     Called for each mount option.  The arguments are as for the
+     ->parse_option() method.  It should return 0 to indicate that the option
+     should be passed on to the filesystem, 1 to indicate that the option
+     should be discarded or an error to indicate that the option should be
+     rejected.
+
+     The buffer pointed to by opt may be modified.
+
+ (*) int security_fs_context_validate(struct fs_context *fc);
+
+     Called after all the options have been parsed to validate the collection
+     as a whole and to do any necessary allocation so that
+     security_sb_get_tree() is less likely to fail.  It should return 0 or a
+     negative error code.
+
+ (*) int security_sb_get_tree(struct fs_context *fc);
+
+     Called during the mount procedure to verify that the specified superblock
+     is allowed to be mounted and to transfer the security data there.  It
+     should return 0 or a negative error code.
+
+ (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+				unsigned int mnt_flags);
+
+     Called during the mount procedure to verify that the root dentry attached
+     to the context is permitted to be attached to the specified mountpoint.
+     It should return 0 on success or a negative error code on failure.
+
+
+=================================
+VFS FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+There are three operations for creating a filesystem context and
+one for destroying a context:
+
+ (*) struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+					   struct dentry *reference,
+					   unsigned int sb_flags,
+					   enum fs_context_purpose purpose);
+
+     Create a filesystem context for a given filesystem type and purpose.  This
+     allocates the filesystem context, sets the flags, initialises the security
+     and calls fs_type->init_fs_context() to initialise the filesystem private
+     data.
+
+     reference can be NULL or it may indicate the root dentry of a superblock
+     that is going to be reconfigured (FS_CONTEXT_FOR_RECONFIGURE) or the
+     automount point that triggered a submount (FS_CONTEXT_FOR_SUBMOUNT).  This
+     is provided as a source of namespace information.
+
+ (*) struct fs_context *vfs_sb_reconfig(struct vfsmount *mnt,
+					unsigned int sb_flags);
+
+     Create a filesystem context from the same filesystem as an extant mount
+     and initialise the mount parameters from the superblock underlying that
+     mount.  This is for use by superblock parameter reconfiguration.
+
+ (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
+
+     Duplicate a filesystem context, copying any options noted and duplicating
+     or additionally referencing any resources held therein.  This is available
+     for use where a filesystem has to get a mount within a mount, such as NFS4
+     does by internally mounting the root of the target server and then doing a
+     private pathwalk to the target directory.
+
+ (*) void put_fs_context(struct fs_context *fc);
+
+     Destroy a filesystem context, releasing any resources it holds.  This
+     calls the ->free() operation.  This is intended to be called by anyone who
+     created a filesystem context.
+
+     [!] filesystem contexts are not refcounted, so this causes unconditional
+	 destruction.
+
+In all the above operations, apart from the put op, the return is a filesystem
+context pointer or a negative error code.
+
+For the remaining operations, if an error occurs, a negative error code will be
+returned.
+
+ (*) int vfs_get_tree(struct fs_context *fc);
+
+     Get or create the mountable root and superblock, using the parameters in
+     the filesystem context to select/configure the superblock.  This invokes
+     the ->validate() op and then the ->get_tree() op.
+
+     [NOTE] ->validate() could perhaps be rolled into ->get_tree() and
+     ->reconfigure().
+
+ (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
+
+     Create a mount given the parameters in the specified filesystem context.
+     Note that this does not attach the mount to anything.
+
+ (*) int vfs_set_fs_source(struct fs_context *fc, char *source, size_t len);
+
+     Supply one or more source names or device names for the mount.  This may
+     cause the filesystem to access the source.  Multiple sources may be
+     specified if the filesystem supports it.
+
+ (*) int vfs_parse_fs_option(struct fs_context *fc, char *opt, size_t len);
+
+     Supply a single mount option to the filesystem context.  The mount option
+     should likely be in a "key[=val]" string form.  The option is first
+     checked to see if it corresponds to a standard mount flag (in which case
+     it is used to set an SB_xxx flag and consumed) or a security option (in
+     which case the LSM consumes it) before it is passed on to the filesystem.
+
+ (*) int generic_parse_monolithic(struct fs_context *fc,
+				  void *data, size_t data_len);
+
+     Parse a sys_mount() data page, assuming the form to be a text list
+     consisting of key[=val] options separated by commas.  Each item in the
+     list is passed to vfs_parse_fs_option().  This is the default when the
+     ->parse_monolithic() operation is NULL.



* [PATCH 23/32] Make anon_inodes unconditional [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (21 preceding siblings ...)
  2018-07-10 22:43 ` [PATCH 22/32] vfs: Provide documentation for new mount API " David Howells
@ 2018-07-10 22:44 ` David Howells
  2018-07-10 22:44 ` [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
                   ` (14 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:44 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Make the anon_inodes facility unconditional so that it can be used by core
VFS code.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/Makefile  |    2 +-
 init/Kconfig |   10 ----------
 2 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/fs/Makefile b/fs/Makefile
index 5563cf34f7c2..7e9ca59ac3a7 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -25,7 +25,7 @@ obj-$(CONFIG_PROC_FS) += proc_namespace.o
 
 obj-y				+= notify/
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
-obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
+obj-y				+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
diff --git a/init/Kconfig b/init/Kconfig
index 5a52f07259a2..d8303f4af5d2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1066,9 +1066,6 @@ config LD_DEAD_CODE_DATA_ELIMINATION
 config SYSCTL
 	bool
 
-config ANON_INODES
-	bool
-
 config HAVE_UID16
 	bool
 
@@ -1273,14 +1270,12 @@ config HAVE_FUTEX_CMPXCHG
 config EPOLL
 	bool "Enable eventpoll support" if EXPERT
 	default y
-	select ANON_INODES
 	help
 	  Disabling this option will cause the kernel to be built without
 	  support for epoll family of system calls.
 
 config SIGNALFD
 	bool "Enable signalfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the signalfd() system call that allows to receive signals
@@ -1290,7 +1285,6 @@ config SIGNALFD
 
 config TIMERFD
 	bool "Enable timerfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the timerfd() system call that allows to receive timer
@@ -1300,7 +1294,6 @@ config TIMERFD
 
 config EVENTFD
 	bool "Enable eventfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the eventfd() system call that allows to receive both
@@ -1414,7 +1407,6 @@ config KALLSYMS_BASE_RELATIVE
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call"
-	select ANON_INODES
 	select BPF
 	select IRQ_WORK
 	default n
@@ -1431,7 +1423,6 @@ config BPF_JIT_ALWAYS_ON
 
 config USERFAULTFD
 	bool "Enable userfaultfd() system call"
-	select ANON_INODES
 	depends on MMU
 	help
 	  Enable the userfaultfd() system call that allows to intercept and
@@ -1498,7 +1489,6 @@ config PERF_EVENTS
 	bool "Kernel performance events and counters"
 	default y if PROFILING
 	depends on HAVE_PERF_EVENTS
-	select ANON_INODES
 	select IRQ_WORK
 	select SRCU
 	help



* [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (22 preceding siblings ...)
  2018-07-10 22:44 ` [PATCH 23/32] Make anon_inodes unconditional " David Howells
@ 2018-07-10 22:44 ` David Howells
  2018-07-10 23:59   ` Andy Lutomirski
                     ` (4 more replies)
  2018-07-10 22:44 ` [PATCH 25/32] vfs: syscall: Add fsmount() to create a mount for a superblock " David Howells
                   ` (13 subsequent siblings)
  37 siblings, 5 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:44 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-api, linux-fsdevel, torvalds, linux-kernel

Provide an fsopen() system call that starts the process of preparing to
create a superblock that will then be mountable, using an fd as a context
handle.  fsopen() is given the name of the filesystem that will be used:

	int mfd = fsopen(const char *fsname, unsigned int flags);

where flags can be 0 or FSOPEN_CLOEXEC.

For example:

	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
	write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
	write(sfd, "o noatime");
	write(sfd, "o acl");
	write(sfd, "o user_attr");
	write(sfd, "o iversion");
	write(sfd, "o ");
	write(sfd, "r /my/container"); // root inside the fs
	write(sfd, "x create"); // create the superblock
	fsinfo(sfd, NULL, ...); // query new superblock attributes
	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

	sfd = fsopen("afs", 0);
	write(sfd, "s %grand.central.org:root.cell");
	write(sfd, "o cell=grand.central.org");
	write(sfd, "r /");
	write(sfd, "x create");
	mfd = fsmount(sfd, 0, MS_NODEV);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

If an error is reported at any step, an error message may be available to be
read() back (ENODATA will be reported if there isn't an error available) in
the form:

	"e <subsys>:<problem>"
	"e SELinux:Mount on mountpoint not permitted"

Once fsmount() has been called, further write() calls will incur EBUSY,
even if the fsmount() fails.  read() is still possible to retrieve error
information.

The fsopen() syscall creates a mount context and hangs it off the fd that it
returns.

Netlink is not used because it is optional and would make the core VFS
dependent on the networking layer and also potentially add network
namespace issues.

Note that, for the moment, the caller must have CAP_SYS_ADMIN to use
fsopen().
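
As a rough userspace illustration of the write() protocol above, the sketch
below builds one "<type><space><arg>" command per write() and applies the same
checks that fscontext_write() in this patch performs (command type 's', 'o' or
'x'; total length between 3 and 4095 bytes).  The helper names fsctx_format()
and fsctx_send() are hypothetical and not part of the patch.  The fd is assumed
to come from fsopen(), which has no libc wrapper here and would have to be
invoked through syscall(2) with the number this series assigns.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Limits taken from fscontext_write() in this patch: 3 <= len <= 4095. */
#define FSCTX_MIN_LEN 3
#define FSCTX_MAX_LEN 4095

/*
 * Hypothetical helper: build "<type><space><arg>" in buf.  Returns the number
 * of bytes to pass to write(), or a negative errno value if the command would
 * be rejected by the checks in fscontext_write() or does not fit in buf.
 */
long fsctx_format(char *buf, size_t size, char type, const char *arg)
{
	size_t arglen = strlen(arg);
	size_t len = 2 + arglen;

	/* Only these three command types are accepted in this patch. */
	if (type != 's' && type != 'o' && type != 'x')
		return -EINVAL;
	if (len < FSCTX_MIN_LEN || len > FSCTX_MAX_LEN)
		return -EINVAL;
	if (len > size)
		return -ENOSPC;

	buf[0] = type;
	buf[1] = ' ';
	memcpy(buf + 2, arg, arglen);	/* no NUL needed; the kernel adds one */
	return (long)len;
}

/*
 * Hypothetical helper: send one command to an fd obtained from fsopen().
 * Returns 0 on success or a negative errno value.
 */
int fsctx_send(int fd, char type, const char *arg)
{
	char buf[FSCTX_MAX_LEN];
	long len = fsctx_format(buf, sizeof(buf), type, arg);
	ssize_t n;

	if (len < 0)
		return (int)len;
	n = write(fd, buf, (size_t)len);
	if (n < 0)
		return -errno;
	return n == len ? 0 : -EIO;
}
```

With such helpers, the first example would read fsctx_send(sfd, 's',
"/dev/sdb1"), fsctx_send(sfd, 'o', "noatime") and so on, ending with
fsctx_send(sfd, 'x', "create").  Note that a bare "o " and the "r ..." lines
shown in the examples would be refused by the checks as implemented in this
patch.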

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/Makefile                            |    2 
 fs/fs_context.c                        |    4 +
 fs/fsopen.c                            |  209 ++++++++++++++++++++++++++++++++
 include/linux/fs_context.h             |    2 
 include/linux/syscalls.h               |    1 
 include/uapi/linux/fs.h                |    5 +
 8 files changed, 224 insertions(+), 1 deletion(-)
 create mode 100644 fs/fsopen.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 76d092b7d1b0..1647fefd2969 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
 388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
+389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 37ba4e65eee6..235d33dbccb2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 334	common	rseq			__x64_sys_rseq
 335	common	open_tree		__x64_sys_open_tree
 336	common	move_mount		__x64_sys_move_mount
+337	common	fsopen			__x64_sys_fsopen
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 7e9ca59ac3a7..d3b33798998e 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
-		fs_context.o
+		fs_context.o fsopen.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fs_context.c b/fs/fs_context.c
index b7c84e0aa2f9..a2d745e6d356 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -251,6 +251,8 @@ struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
 	fc->fs_type	= get_filesystem(fs_type);
 	fc->cred	= get_current_cred();
 
+	mutex_init(&fc->uapi_mutex);
+
 	switch (purpose) {
 	case FS_CONTEXT_FOR_KERNEL_MOUNT:
 		fc->sb_flags |= SB_KERNMOUNT;
@@ -335,6 +337,8 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
 	if (!fc)
 		return ERR_PTR(-ENOMEM);
 
+	mutex_init(&fc->uapi_mutex);
+
 	fc->fs_private	= NULL;
 	fc->s_fs_info	= NULL;
 	fc->source	= NULL;
diff --git a/fs/fsopen.c b/fs/fsopen.c
new file mode 100644
index 000000000000..28bb72bda163
--- /dev/null
+++ b/fs/fsopen.c
@@ -0,0 +1,209 @@
+/* Filesystem access-by-fd.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fs_context.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/security.h>
+#include <linux/anon_inodes.h>
+#include "mount.h"
+
+/*
+ * Userspace writes configuration data and commands to the fd and we parse it
+ * here.  For the moment, we assume a single option or command per write.  Each
+ * line written is of the form
+ *
+ *	<command_type><space><stuff...>
+ *
+ *	s /dev/sda1				-- Source device
+ *	o noatime				-- Option without value
+ *	o cell=grand.central.org		-- Option with value
+ *	x create				-- Create a superblock
+ *	x reconfigure				-- Reconfigure a superblock
+ */
+static ssize_t fscontext_write(struct file *file,
+			       const char __user *_buf, size_t len, loff_t *pos)
+{
+	struct fs_context *fc = file->private_data;
+	char opt[2], *data;
+	ssize_t ret;
+
+	if (len < 3 || len > 4095)
+		return -EINVAL;
+
+	if (copy_from_user(opt, _buf, 2) != 0)
+		return -EFAULT;
+	switch (opt[0]) {
+	case 's':
+	case 'o':
+	case 'x':
+		break;
+	default:
+		return -EINVAL;
+	}
+	if (opt[1] != ' ')
+		return -EINVAL;
+
+	data = memdup_user_nul(_buf + 2, len - 2);
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
+	/* From this point onwards we need to lock the fd against someone
+	 * trying to mount it.
+	 */
+	ret = mutex_lock_interruptible(&fc->uapi_mutex);
+	if (ret < 0)
+		goto err_free;
+
+	if (fc->phase == FS_CONTEXT_AWAITING_RECONF) {
+		if (fc->fs_type->init_fs_context) {
+			ret = fc->fs_type->init_fs_context(fc, fc->root);
+			if (ret < 0) {
+				fc->phase = FS_CONTEXT_FAILED;
+				goto err_unlock;
+			}
+		} else {
+			/* Leave legacy context ops in place */
+		}
+
+		/* Do the security check last because ->init_fs_context may
+		 * change the namespace subscriptions.
+		 */
+		ret = security_fs_context_alloc(fc, fc->root);
+		if (ret < 0) {
+			fc->phase = FS_CONTEXT_FAILED;
+			goto err_unlock;
+		}
+
+		fc->phase = FS_CONTEXT_RECONF_PARAMS;
+	}
+
+	ret = -EINVAL;
+	switch (opt[0]) {
+	case 's':
+		if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+		    fc->phase != FS_CONTEXT_RECONF_PARAMS)
+			goto wrong_phase;
+		ret = vfs_set_fs_source(fc, data, len - 2);
+		if (ret < 0)
+			goto err_unlock;
+		break;
+
+	case 'o':
+		if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+		    fc->phase != FS_CONTEXT_RECONF_PARAMS)
+			goto wrong_phase;
+		ret = vfs_parse_fs_option(fc, data, len - 2);
+		if (ret < 0)
+			goto err_unlock;
+		break;
+
+	case 'x':
+		if (strcmp(data, "create") == 0) {
+			if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
+				goto wrong_phase;
+			fc->phase = FS_CONTEXT_CREATING;
+			ret = vfs_get_tree(fc);
+			if (ret == 0)
+				fc->phase = FS_CONTEXT_AWAITING_MOUNT;
+			else
+				fc->phase = FS_CONTEXT_FAILED;
+		} else {
+			ret = -EOPNOTSUPP;
+		}
+		if (ret < 0)
+			goto err_unlock;
+		break;
+
+	default:
+		goto err_unlock;
+	}
+
+	ret = len;
+err_unlock:
+	mutex_unlock(&fc->uapi_mutex);
+err_free:
+	kfree(data);
+	return ret;
+
+wrong_phase:
+	ret = -EBUSY;
+	goto err_unlock;
+}
+
+static int fscontext_release(struct inode *inode, struct file *file)
+{
+	struct fs_context *fc = file->private_data;
+
+	if (fc) {
+		file->private_data = NULL;
+		put_fs_context(fc);
+	}
+	return 0;
+}
+
+const struct file_operations fscontext_fs_fops = {
+	.write		= fscontext_write,
+	.release	= fscontext_release,
+	.llseek		= no_llseek,
+};
+
+/*
+ * Attach a filesystem context to a file and an fd.
+ */
+static int fscontext_create_fd(struct fs_context *fc, unsigned int o_flags)
+{
+	int fd;
+
+	fd = anon_inode_getfd("fscontext", &fscontext_fs_fops, fc,
+			      O_RDWR | o_flags);
+	if (fd < 0)
+		put_fs_context(fc);
+	return fd;
+}
+
+/*
+ * Open a filesystem by name so that it can be configured for mounting.
+ *
+ * We are allowed to specify a container in which the filesystem will be
+ * opened, thereby indicating which namespaces will be used (notably, which
+ * network namespace will be used for network filesystems).
+ */
+SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
+{
+	struct file_system_type *fs_type;
+	struct fs_context *fc;
+	const char *fs_name;
+
+	if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (flags & ~FSOPEN_CLOEXEC)
+		return -EINVAL;
+
+	fs_name = strndup_user(_fs_name, PAGE_SIZE);
+	if (IS_ERR(fs_name))
+		return PTR_ERR(fs_name);
+
+	fs_type = get_fs_type(fs_name);
+	kfree(fs_name);
+	if (!fs_type)
+		return -ENODEV;
+
+	fc = vfs_new_fs_context(fs_type, NULL, 0, FS_CONTEXT_FOR_USER_MOUNT);
+	put_filesystem(fs_type);
+	if (IS_ERR(fc))
+		return PTR_ERR(fc);
+
+	fc->phase = FS_CONTEXT_CREATE_PARAMS;
+	return fscontext_create_fd(fc, flags & FSOPEN_CLOEXEC ? O_CLOEXEC : 0);
+}
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index f157ff935a1e..387f25d7acc4 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -14,6 +14,7 @@
 
 #include <linux/kernel.h>
 #include <linux/errno.h>
+#include <linux/mutex.h>
 
 struct cred;
 struct dentry;
@@ -58,6 +59,7 @@ enum fs_context_phase {
  */
 struct fs_context {
 	const struct fs_context_operations *ops;
+	struct mutex		uapi_mutex;	/* Userspace access mutex */
 	struct file_system_type	*fs_type;
 	void			*fs_private;	/* The filesystem's context */
 	struct dentry		*root;		/* The root and superblock */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3c0855d9b105..ad6c7ff33c01 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -904,6 +904,7 @@ asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 			       int to_dfd, const char __user *to_path,
 			       unsigned int ms_flags);
+asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 1c982eb44ff4..f8818e6cddd6 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -344,4 +344,9 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/*
+ * Flags for fsopen() and co.
+ */
+#define FSOPEN_CLOEXEC		0x00000001
+
 #endif /* _UAPI_LINUX_FS_H */



* [PATCH 25/32] vfs: syscall: Add fsmount() to create a mount for a superblock [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (23 preceding siblings ...)
  2018-07-10 22:44 ` [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
@ 2018-07-10 22:44 ` David Howells
  2018-07-10 22:44 ` [PATCH 26/32] vfs: syscall: Add fspick() to select a superblock for reconfiguration " David Howells
                   ` (12 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:44 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-api, linux-fsdevel, torvalds, linux-kernel

Provide a system call by which a filesystem opened with fsopen() and
configured by a series of writes can be mounted:

	int ret = fsmount(int fsfd, unsigned int flags,
			  unsigned int ms_flags);

where fsfd is the file descriptor returned by fsopen().  flags can be 0 or
FSMOUNT_CLOEXEC.  ms_flags is a bitwise-OR of the following flags:

	MS_RDONLY
	MS_NOSUID
	MS_NODEV
	MS_NOEXEC
	MS_NOATIME
	MS_NODIRATIME
	MS_RELATIME
	MS_STRICTATIME

	MS_UNBINDABLE
	MS_PRIVATE
	MS_SLAVE
	MS_SHARED

In the event that fsmount() fails, it may be possible to get an error
message by calling read() on fsfd.  If no message is available, ENODATA
will be reported.
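
A caller may want to pre-check ms_flags before invoking fsmount().  The sketch
below mirrors the validation done at the top of the syscall as implemented in
this patch: only the eight per-mount flags in the first group are accepted, and
MS_STRICTATIME combined with MS_NOATIME is refused.  As the code stands, the
propagation flags in the second group (MS_SHARED and friends) fall outside the
accepted mask and would yield EINVAL.  The function name
fsmount_check_ms_flags() is hypothetical; the MS_* constants are taken from
<sys/mount.h>.

```c
#include <errno.h>
#include <sys/mount.h>

/* ms_flags accepted by the fsmount() implementation in this patch. */
#define FSMOUNT_MS_ACCEPTED \
	(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC | \
	 MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_STRICTATIME)

/*
 * Hypothetical userspace pre-check.  Returns 0 if the flag checks in
 * fsmount() would pass, -EINVAL otherwise.
 */
int fsmount_check_ms_flags(unsigned long ms_flags)
{
	/* Anything outside the accepted mask is rejected outright. */
	if (ms_flags & ~(unsigned long)FSMOUNT_MS_ACCEPTED)
		return -EINVAL;

	/* Strict atime and no atime are mutually exclusive. */
	if ((ms_flags & MS_STRICTATIME) && (ms_flags & MS_NOATIME))
		return -EINVAL;

	return 0;
}
```

If neither MS_STRICTATIME nor MS_NOATIME is given, the syscall defaults to
relatime behaviour, so passing MS_RELATIME explicitly is optional.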

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/namespace.c                         |  140 +++++++++++++++++++++++++++++++-
 include/linux/fs_context.h             |    2 
 include/linux/syscalls.h               |    1 
 include/uapi/linux/fs.h                |    2 
 6 files changed, 143 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 1647fefd2969..537572098032 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -401,3 +401,4 @@
 387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
 388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
 389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
+390	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 235d33dbccb2..47abbc2a2bbe 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -346,6 +346,7 @@
 335	common	open_tree		__x64_sys_open_tree
 336	common	move_mount		__x64_sys_move_mount
 337	common	fsopen			__x64_sys_fsopen
+338	common	fsmount			__x64_sys_fsmount
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index d5a4d9351a17..a6fbfba8e448 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2503,7 +2503,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 
 	attached = mnt_has_parent(old);
 	/*
-	 * We need to allow open_tree(OPEN_TREE_CLONE) followed by
+	 * We need to allow open_tree(OPEN_TREE_CLONE) or fsmount() followed by
 	 * move_mount(), but mustn't allow "/" to be moved.
 	 */
 	if (old->mnt_ns && !attached)
@@ -3347,9 +3347,141 @@ struct vfsmount *kern_mount(struct file_system_type *type)
 EXPORT_SYMBOL_GPL(kern_mount);
 
 /*
- * Move a mount from one place to another.
- * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
- * used to copy a mount subtree.
+ * Create a kernel mount representation for a new, prepared superblock
+ * (specified by fs_fd) and attach to an open_tree-like file descriptor.
+ */
+SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags)
+{
+	struct fs_context *fc;
+	struct file *file;
+	struct path newmount;
+	struct fd f;
+	unsigned int mnt_flags = 0;
+	long ret;
+
+	if (!may_mount())
+		return -EPERM;
+
+	if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
+		return -EINVAL;
+
+	if (ms_flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
+			 MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
+			 MS_STRICTATIME))
+		return -EINVAL;
+
+	if (ms_flags & MS_RDONLY)
+		mnt_flags |= MNT_READONLY;
+	if (ms_flags & MS_NOSUID)
+		mnt_flags |= MNT_NOSUID;
+	if (ms_flags & MS_NODEV)
+		mnt_flags |= MNT_NODEV;
+	if (ms_flags & MS_NOEXEC)
+		mnt_flags |= MNT_NOEXEC;
+	if (ms_flags & MS_NODIRATIME)
+		mnt_flags |= MNT_NODIRATIME;
+
+	if (ms_flags & MS_STRICTATIME) {
+		if (ms_flags & MS_NOATIME)
+			return -EINVAL;
+	} else if (ms_flags & MS_NOATIME) {
+		mnt_flags |= MNT_NOATIME;
+	} else {
+		mnt_flags |= MNT_RELATIME;
+	}
+
+	f = fdget(fs_fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EINVAL;
+	if (f.file->f_op != &fscontext_fs_fops)
+		goto err_fsfd;
+
+	fc = f.file->private_data;
+
+	/* There must be a valid superblock or we can't mount it */
+	ret = -EINVAL;
+	if (!fc->root)
+		goto err_fsfd;
+
+	ret = -EPERM;
+	if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
+		pr_warn("VFS: Mount too revealing\n");
+		goto err_fsfd;
+	}
+
+	ret = mutex_lock_interruptible(&fc->uapi_mutex);
+	if (ret < 0)
+		goto err_fsfd;
+
+	ret = -EBUSY;
+	if (fc->phase != FS_CONTEXT_AWAITING_MOUNT)
+		goto err_unlock;
+
+	ret = -EPERM;
+	if ((fc->sb_flags & SB_MANDLOCK) && !may_mandlock())
+		goto err_unlock;
+
+	newmount.mnt = vfs_create_mount(fc, mnt_flags);
+	if (IS_ERR(newmount.mnt)) {
+		ret = PTR_ERR(newmount.mnt);
+		goto err_unlock;
+	}
+	newmount.dentry = dget(fc->root);
+
+	/* We've done the mount bit - now move the file context into more or
+	 * less the same state as if we'd done an fspick().  We don't want to
+	 * do any memory allocation or anything like that at this point as we
+	 * don't want to have to handle any errors incurred.
+	 */
+	if (fc->ops && fc->ops->free)
+		fc->ops->free(fc);
+	fc->fs_private = NULL;
+	fc->s_fs_info = NULL;
+	fc->sb_flags = 0;
+	fc->sloppy = false;
+	fc->silent = false;
+	security_fs_context_free(fc);
+	fc->security = NULL;
+	kfree(fc->subtype);
+	fc->subtype = NULL;
+	kfree(fc->source);
+	fc->source = NULL;
+
+	fc->purpose = FS_CONTEXT_FOR_RECONFIGURE;
+	fc->phase = FS_CONTEXT_AWAITING_RECONF;
+
+	/* Attach to an apparent O_PATH fd with a note that we need to unmount
+	 * it, not just simply put it.
+	 */
+	file = dentry_open(&newmount, O_PATH, fc->cred);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto err_path;
+	}
+	file->f_mode |= FMODE_NEED_UNMOUNT;
+
+	ret = get_unused_fd_flags((flags & FSMOUNT_CLOEXEC) ? O_CLOEXEC : 0);
+	if (ret >= 0)
+		fd_install(ret, file);
+	else
+		fput(file);
+
+err_path:
+	path_put(&newmount);
+err_unlock:
+	mutex_unlock(&fc->uapi_mutex);
+err_fsfd:
+	fdput(f);
+	return ret;
+}
+
+/*
+ * Move a mount from one place to another.  In combination with
+ * fsopen()/fsmount() this is used to install a new mount and in combination
+ * with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be used to copy
+ * a mount subtree.
  *
  * Note the flags value is a combination of MOVE_MOUNT_* flags.
  */
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 387f25d7acc4..2cde97490c6f 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -115,4 +115,6 @@ extern int vfs_get_super(struct fs_context *fc,
 			 int (*fill_super)(struct super_block *sb,
 					   struct fs_context *fc));
 
+extern const struct file_operations fscontext_fs_fops;
+
 #endif /* _LINUX_FS_CONTEXT_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index ad6c7ff33c01..917fe10e1030 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -905,6 +905,7 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 			       int to_dfd, const char __user *to_path,
 			       unsigned int ms_flags);
 asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
+asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f8818e6cddd6..30a2fb85c4b7 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -349,4 +349,6 @@ typedef int __bitwise __kernel_rwf_t;
  */
 #define FSOPEN_CLOEXEC		0x00000001
 
+#define FSMOUNT_CLOEXEC		0x00000001
+
 #endif /* _UAPI_LINUX_FS_H */



* [PATCH 26/32] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (24 preceding siblings ...)
  2018-07-10 22:44 ` [PATCH 25/32] vfs: syscall: Add fsmount() to create a mount for a superblock " David Howells
@ 2018-07-10 22:44 ` David Howells
  2018-07-10 22:44 ` [PATCH 27/32] vfs: Implement logging through fs_context " David Howells
                   ` (11 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:44 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-api, linux-fsdevel, torvalds, linux-kernel

Provide an fspick() system call that can be used to pick an existing
mountpoint into an fs_context which can thereafter be used to reconfigure a
superblock (equivalent of the superblock side of -o remount).

This looks like:

	int fd = fspick(AT_FDCWD, "/mnt",
			FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
	write(fd, "o intr");
	write(fd, "o noac");
	write(fd, "x reconfigure");

At the point of fspick being called, the file descriptor referring to the
filesystem context is in exactly the same state as the one that was created
by fsopen() after fsmount() has been successfully called.
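
The phase handling that makes the two paths converge can be modelled in a few
lines of userspace C.  This is only a sketch of the transitions visible in
fscontext_write() and fsmount() earlier in the series: the enumerator names
follow enum fs_context_phase with the FS_CONTEXT_ prefix dropped, the function
names are hypothetical, and vfs_get_tree() and the mount itself are assumed to
succeed.  "x reconfigure" is treated as unsupported because the
fscontext_write() shown in patch 24 does not handle it.

```c
#include <errno.h>
#include <string.h>

/* Names follow enum fs_context_phase, minus the FS_CONTEXT_ prefix. */
enum phase {
	CREATE_PARAMS, CREATING, AWAITING_MOUNT,
	AWAITING_RECONF, RECONF_PARAMS, FAILED
};

/* Model of one write() on the context fd: returns 0 or -errno. */
int model_write(enum phase *p, char type, const char *arg)
{
	/* The first write after fsmount()/fspick() starts reconfiguration. */
	if (*p == AWAITING_RECONF)
		*p = RECONF_PARAMS;

	switch (type) {
	case 's':
	case 'o':
		if (*p != CREATE_PARAMS && *p != RECONF_PARAMS)
			return -EBUSY;
		return 0;
	case 'x':
		if (strcmp(arg, "create") != 0)
			return -EOPNOTSUPP;
		if (*p != CREATE_PARAMS)
			return -EBUSY;
		*p = AWAITING_MOUNT;	/* assumes vfs_get_tree() succeeded */
		return 0;
	default:
		return -EINVAL;
	}
}

/*
 * Model of the phase check in fsmount().  The real syscall returns -EINVAL
 * earlier if no superblock has been created; that case is not modelled.
 */
int model_fsmount(enum phase *p)
{
	if (*p != AWAITING_MOUNT)
		return -EBUSY;
	*p = AWAITING_RECONF;	/* same state a context from fspick() starts in */
	return 0;
}
```

In this model a context from fspick() would simply start in AWAITING_RECONF,
which is where a context from fsopen() ends up once fsmount() has succeeded.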

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 fs/fsopen.c                            |   53 ++++++++++++++++++++++++++++++++
 include/linux/syscalls.h               |    1 +
 include/uapi/linux/fs.h                |    5 +++
 5 files changed, 61 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 537572098032..5587bcede253 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -402,3 +402,4 @@
 388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
 389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
 390	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
+391	i386	fspick			sys_fspick			__ia32_sys_fspick
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 47abbc2a2bbe..460a464024bf 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -347,6 +347,7 @@
 336	common	move_mount		__x64_sys_move_mount
 337	common	fsopen			__x64_sys_fsopen
 338	common	fsmount			__x64_sys_fsmount
+339	common	fspick			__x64_sys_fspick
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 28bb72bda163..35c2a94d0c68 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -15,6 +15,7 @@
 #include <linux/syscalls.h>
 #include <linux/security.h>
 #include <linux/anon_inodes.h>
+#include <linux/namei.h>
 #include "mount.h"
 
 /*
@@ -207,3 +208,55 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
 	fc->phase = FS_CONTEXT_CREATE_PARAMS;
 	return fscontext_create_fd(fc, flags & FSOPEN_CLOEXEC ? O_CLOEXEC : 0);
 }
+
+/*
+ * Pick a superblock into a context for reconfiguration.
+ */
+SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
+{
+	struct fs_context *fc;
+	struct path target;
+	unsigned int lookup_flags;
+	int ret;
+
+	if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if ((flags & ~(FSPICK_CLOEXEC |
+		       FSPICK_SYMLINK_NOFOLLOW |
+		       FSPICK_NO_AUTOMOUNT |
+		       FSPICK_EMPTY_PATH)) != 0)
+		return -EINVAL;
+
+	lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+	if (flags & FSPICK_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (flags & FSPICK_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (flags & FSPICK_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+	ret = user_path_at(dfd, path, lookup_flags, &target);
+	if (ret < 0)
+		goto err;
+
+	ret = -EOPNOTSUPP;
+	if (!target.dentry->d_sb->s_op->reconfigure)
+		goto err_path;
+
+	fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
+				0, FS_CONTEXT_FOR_RECONFIGURE);
+	if (IS_ERR(fc)) {
+		ret = PTR_ERR(fc);
+		goto err_path;
+	}
+
+	fc->phase = FS_CONTEXT_RECONF_PARAMS;
+
+	path_put(&target);
+	return fscontext_create_fd(fc, flags & FSPICK_CLOEXEC ? O_CLOEXEC : 0);
+
+err_path:
+	path_put(&target);
+err:
+	return ret;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 917fe10e1030..ac803f5c0822 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -906,6 +906,7 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 			       unsigned int ms_flags);
 asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
 asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
+asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 30a2fb85c4b7..c27576d471c2 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -351,4 +351,9 @@ typedef int __bitwise __kernel_rwf_t;
 
 #define FSMOUNT_CLOEXEC		0x00000001
 
+#define FSPICK_CLOEXEC		0x00000001
+#define FSPICK_SYMLINK_NOFOLLOW	0x00000002
+#define FSPICK_NO_AUTOMOUNT	0x00000004
+#define FSPICK_EMPTY_PATH	0x00000008
+
 #endif /* _UAPI_LINUX_FS_H */



* [PATCH 27/32] vfs: Implement logging through fs_context [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (25 preceding siblings ...)
  2018-07-10 22:44 ` [PATCH 26/32] vfs: syscall: Add fspick() to select a superblock for reconfiguration " David Howells
@ 2018-07-10 22:44 ` David Howells
  2018-07-10 22:44 ` [PATCH 28/32] vfs: Add some logging to the core users of the fs_context log " David Howells
                   ` (10 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:44 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Implement the ability for filesystems to log error, warning and
informational messages through the fs_context.  Userspace can retrieve
them, one message per read(), from an fd created by fsopen() or fspick().

Error messages are prefixed with "e ", warnings with "w " and informational
messages with "i ".
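
On the userspace side, the prefix is enough to sort messages by severity.
The following is a small sketch of how a reader might do that; the enum
and model_classify_message() are hypothetical names used for illustration
and are not part of this patch.

```c
#include <stddef.h>

enum model_msg_level {
	MODEL_MSG_UNKNOWN,
	MODEL_MSG_ERROR,	/* "e " prefix */
	MODEL_MSG_WARNING,	/* "w " prefix */
	MODEL_MSG_INFO,		/* "i " prefix */
};

/*
 * Classify one NUL-terminated message by its two-character prefix.
 * If text is non-NULL, *text is set to the message body after the prefix
 * (or to msg itself if no known prefix was found).
 */
static enum model_msg_level model_classify_message(const char *msg,
						   const char **text)
{
	enum model_msg_level level = MODEL_MSG_UNKNOWN;

	if (msg && msg[0] != '\0' && msg[1] == ' ') {
		switch (msg[0]) {
		case 'e':
			level = MODEL_MSG_ERROR;
			break;
		case 'w':
			level = MODEL_MSG_WARNING;
			break;
		case 'i':
			level = MODEL_MSG_INFO;
			break;
		}
	}

	if (text)
		*text = (level == MODEL_MSG_UNKNOWN) ? msg : msg + 2;
	return level;
}
```

fscontext_read() in this patch copies strlen() bytes to the user buffer, so
a caller has to NUL-terminate what it read before passing it to a function
like this one.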

Inside the kernel, formatted messages are malloc'd but unformatted messages
are not copied if they're either in the core .rodata section or in the
.rodata section of the filesystem module pinned by fs_context::fs_type.
The messages are only good till the fs_type is released.

Note that the logging object is shared between duplicated fs_context
structures.  This is so that filesystems such as NFS, which do a mount
within a mount, can get at least some of the errors from the inner mount.

Five logging functions are provided for this:

 (1) void logfc(struct fs_context *fc, const char *fmt, ...);

     This logs a message into the context.  If the buffer is full, the
     earliest message is discarded.

 (2) void errorf(fc, fmt, ...);

     This wraps logfc() to log an error.

 (3) int invalf(fc, fmt, ...);

     This wraps errorf() and returns -EINVAL for convenience.

 (4) void warnf(fc, fmt, ...);

     This wraps logfc() to log a warning.

 (5) void infof(fc, fmt, ...);

     This wraps logfc() to log an informational message.
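
The discard-oldest behaviour of logfc() comes from an eight-slot ring
indexed by free-running u8 counters.  The following userspace model is a
sketch of just that indexing; it deliberately ignores the need_free
ownership tracking and locking, and struct model_log and the model_log_*()
helpers are hypothetical names, not kernel interfaces.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_LOG_SLOTS 8	/* same size as buffer[8] in struct fc_log */

struct model_log {
	uint8_t		head;	/* free-running insertion counter */
	uint8_t		tail;	/* free-running removal counter */
	const char	*buffer[MODEL_LOG_SLOTS];
};

/* Store a message; if all slots are in use, the oldest one is dropped. */
static void model_log_push(struct model_log *log, const char *msg)
{
	unsigned int index = log->head & (MODEL_LOG_SLOTS - 1);

	/* head - tail is the number of queued messages, modulo 256 */
	if ((uint8_t)(log->head - log->tail) == MODEL_LOG_SLOTS)
		log->tail++;	/* slot at index holds the oldest message */

	log->buffer[index] = msg;
	log->head++;
}

/* Remove and return the oldest message, or NULL if the log is empty. */
static const char *model_log_pop(struct model_log *log)
{
	unsigned int index;
	const char *msg;

	if (log->head == log->tail)
		return NULL;

	index = log->tail & (MODEL_LOG_SLOTS - 1);
	msg = log->buffer[index];
	log->buffer[index] = NULL;
	log->tail++;
	return msg;
}
```

Because the slot count is a power of two that divides 256, the counters can
be left to wrap naturally and only masked when used as an index.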

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fs_context.c            |   92 ++++++++++++++++++++++++++++++++++++++++++++
 fs/fsopen.c                |   73 +++++++++++++++++++++++++++++++++++
 include/linux/fs_context.h |   58 ++++++++++++++++++++++++++++
 include/linux/module.h     |    6 +++
 4 files changed, 229 insertions(+)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index a2d745e6d356..f388ab29d37d 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -11,6 +11,7 @@
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
 #include <linux/fs_context.h>
 #include <linux/fs.h>
 #include <linux/mount.h>
@@ -23,6 +24,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <net/net_namespace.h>
+#include <asm/sections.h>
 #include "mount.h"
 
 enum legacy_fs_param {
@@ -347,6 +349,8 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
 	get_net(fc->net_ns);
 	get_user_ns(fc->user_ns);
 	get_cred(fc->cred);
+	if (fc->log)
+		refcount_inc(&fc->log->usage);
 
 	/* Can't call put until we've called ->dup */
 	ret = fc->ops->dup(fc, src_fc);
@@ -364,6 +368,93 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
 }
 EXPORT_SYMBOL(vfs_dup_fs_context);
 
+/**
+ * logfc - Log a message to a filesystem context
+ * @fc: The filesystem context to log to.
+ * @fmt: The format of the buffer.
+ */
+void logfc(struct fs_context *fc, const char *fmt, ...)
+{
+	static const char store_failure[] = "OOM: Can't store error string";
+	struct fc_log *log = fc->log;
+	unsigned int logsize = ARRAY_SIZE(log->buffer);
+	const char *p;
+	va_list va;
+	char *q;
+	u8 freeable, index;
+
+	if (!log)
+		return;
+
+	va_start(va, fmt);
+	if (!strchr(fmt, '%')) {
+		p = fmt;
+		goto unformatted_string;
+	}
+	if (strcmp(fmt, "%s") == 0) {
+		p = va_arg(va, const char *);
+		goto unformatted_string;
+	}
+
+	q = kvasprintf(GFP_KERNEL, fmt, va);
+copied_string:
+	if (!q)
+		goto store_failure;
+	freeable = 1;
+	goto store_string;
+
+unformatted_string:
+	if ((unsigned long)p >= (unsigned long)__start_rodata &&
+	    (unsigned long)p <  (unsigned long)__end_rodata)
+		goto const_string;
+	if (within_module_core((unsigned long)p, log->owner))
+		goto const_string;
+	q = kstrdup(p, GFP_KERNEL);
+	goto copied_string;
+
+store_failure:
+	p = store_failure;
+const_string:
+	q = (char *)p;
+	freeable = 0;
+store_string:
+	index = log->head & (logsize - 1);
+	BUILD_BUG_ON(sizeof(log->head) != sizeof(u8) ||
+		     sizeof(log->tail) != sizeof(u8));
+	if ((u8)(log->head - log->tail) == logsize) {
+		/* The buffer is full, discard the oldest message */
+		if (log->need_free & (1 << index))
+			kfree(log->buffer[index]);
+		log->tail++;
+	}
+
+	log->buffer[index] = q;
+	log->need_free &= ~(1 << index);
+	log->need_free |= freeable << index;
+	log->head++;
+	va_end(va);
+}
+EXPORT_SYMBOL(logfc);
+
+/*
+ * Free a logging structure.
+ */
+static void put_fc_log(struct fs_context *fc)
+{
+	struct fc_log *log = fc->log;
+	int i;
+
+	if (log) {
+		if (refcount_dec_and_test(&log->usage)) {
+			fc->log = NULL;
+			for (i = 0; i <= 7; i++)
+				if (log->need_free & (1 << i))
+					kfree(log->buffer[i]);
+			kfree(log);
+		}
+	}
+}
+
 /**
  * put_fs_context - Dispose of a superblock configuration context.
  * @fc: The context to dispose of.
@@ -389,6 +480,7 @@ void put_fs_context(struct fs_context *fc)
 	if (fc->cred)
 		put_cred(fc->cred);
 	kfree(fc->subtype);
+	put_fc_log(fc);
 	put_filesystem(fc->fs_type);
 	kfree(fc->source);
 	kfree(fc);
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 35c2a94d0c68..6947fed9df3b 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -141,6 +141,52 @@ static ssize_t fscontext_write(struct file *file,
 	goto err_unlock;
 }
 
+/*
+ * Allow the user to read back any error, warning or informational messages.
+ */
+static ssize_t fscontext_read(struct file *file,
+			      char __user *_buf, size_t len, loff_t *pos)
+{
+	struct fs_context *fc = file->private_data;
+	struct fc_log *log = fc->log;
+	unsigned int logsize = ARRAY_SIZE(log->buffer);
+	ssize_t ret;
+	char *p;
+	bool need_free;
+	int index, n;
+
+	ret = mutex_lock_interruptible(&fc->uapi_mutex);
+	if (ret < 0)
+		return ret;
+
+	if (log->head == log->tail) {
+		mutex_unlock(&fc->uapi_mutex);
+		return -ENODATA;
+	}
+
+	index = log->tail & (logsize - 1);
+	p = log->buffer[index];
+	need_free = log->need_free & (1 << index);
+	log->buffer[index] = NULL;
+	log->need_free &= ~(1 << index);
+	log->tail++;
+	mutex_unlock(&fc->uapi_mutex);
+
+	ret = -EMSGSIZE;
+	n = strlen(p);
+	if (n > len)
+		goto err_free;
+	ret = -EFAULT;
+	if (copy_to_user(_buf, p, n) != 0)
+		goto err_free;
+	ret = n;
+
+err_free:
+	if (need_free)
+		kfree(p);
+	return ret;
+}
+
 static int fscontext_release(struct inode *inode, struct file *file)
 {
 	struct fs_context *fc = file->private_data;
@@ -153,6 +199,7 @@ static int fscontext_release(struct inode *inode, struct file *file)
 }
 
 const struct file_operations fscontext_fs_fops = {
+	.read		= fscontext_read,
 	.write		= fscontext_write,
 	.release	= fscontext_release,
 	.llseek		= no_llseek,
@@ -172,6 +219,16 @@ static int fscontext_create_fd(struct fs_context *fc, unsigned int o_flags)
 	return fd;
 }
 
+static int fscontext_alloc_log(struct fs_context *fc)
+{
+	fc->log = kzalloc(sizeof(*fc->log), GFP_KERNEL);
+	if (!fc->log)
+		return -ENOMEM;
+	refcount_set(&fc->log->usage, 1);
+	fc->log->owner = fc->fs_type->owner;
+	return 0;
+}
+
 /*
  * Open a filesystem by name so that it can be configured for mounting.
  *
@@ -184,6 +241,7 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
 	struct file_system_type *fs_type;
 	struct fs_context *fc;
 	const char *fs_name;
+	int ret;
 
 	if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
@@ -206,7 +264,16 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
 		return PTR_ERR(fc);
 
 	fc->phase = FS_CONTEXT_CREATE_PARAMS;
+
+	ret = fscontext_alloc_log(fc);
+	if (ret < 0)
+		goto err_fc;
+
 	return fscontext_create_fd(fc, flags & FSOPEN_CLOEXEC ? O_CLOEXEC : 0);
+
+err_fc:
+	put_fs_context(fc);
+	return ret;
 }
 
 /*
@@ -252,9 +319,15 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
 
 	fc->phase = FS_CONTEXT_RECONF_PARAMS;
 
+	ret = fscontext_alloc_log(fc);
+	if (ret < 0)
+		goto err_fc;
+
 	path_put(&target);
 	return fscontext_create_fd(fc, flags & FSPICK_CLOEXEC ? O_CLOEXEC : 0);
 
+err_fc:
+	put_fs_context(fc);
 err_path:
 	path_put(&target);
 err:
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 2cde97490c6f..04ea338ff490 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -13,6 +13,7 @@
 #define _LINUX_FS_CONTEXT_H
 
 #include <linux/kernel.h>
+#include <linux/refcount.h>
 #include <linux/errno.h>
 #include <linux/mutex.h>
 
@@ -66,6 +67,7 @@ struct fs_context {
 	struct user_namespace	*user_ns;	/* The user namespace for this mount */
 	struct net		*net_ns;	/* The network namespace for this mount */
 	const struct cred	*cred;		/* The mounter's credentials */
+	struct fc_log		*log;		/* Logging buffer */
 	char			*source;	/* The source name (eg. dev path) */
 	char			*subtype;	/* The subtype to set on the superblock */
 	void			*security;	/* The LSM context */
@@ -117,4 +119,60 @@ extern int vfs_get_super(struct fs_context *fc,
 
 extern const struct file_operations fscontext_fs_fops;
 
+/*
+ * Mount error, warning and informational message logging.  This structure is
+ * shareable between a mount and a subordinate mount.
+ */
+struct fc_log {
+	refcount_t	usage;
+	u8		head;		/* Insertion index in buffer[] */
+	u8		tail;		/* Removal index in buffer[] */
+	u8		need_free;	/* Mask of kfree'able items in buffer[] */
+	struct module	*owner;		/* Owner module for strings that don't then need freeing */
+	char		*buffer[8];
+};
+
+extern __attribute__((format(printf, 2, 3)))
+void logfc(struct fs_context *fc, const char *fmt, ...);
+
+/**
+ * infof - Store supplementary informational message
+ * @fc: The context in which to log the informational message
+ * @fmt: The format string
+ *
+ * Store the supplementary informational message for the process if the process
+ * has enabled the facility.
+ */
+#define infof(fc, fmt, ...) ({ logfc(fc, "i "fmt, ## __VA_ARGS__); })
+
+/**
+ * warnf - Store supplementary warning message
+ * @fc: The context in which to log the warning message
+ * @fmt: The format string
+ *
+ * Store the supplementary warning message for the process if the process has
+ * enabled the facility.
+ */
+#define warnf(fc, fmt, ...) ({ logfc(fc, "w "fmt, ## __VA_ARGS__); })
+
+/**
+ * errorf - Store supplementary error message
+ * @fc: The context in which to log the error message
+ * @fmt: The format string
+ *
+ * Store the supplementary error message for the process if the process has
+ * enabled the facility.
+ */
+#define errorf(fc, fmt, ...) ({ logfc(fc, "e "fmt, ## __VA_ARGS__); })
+
+/**
+ * invalf - Store supplementary invalid argument error message
+ * @fc: The context in which to log the error message
+ * @fmt: The format string
+ *
+ * Store the supplementary error message for the process if the process has
+ * enabled the facility and return -EINVAL.
+ */
+#define invalf(fc, fmt, ...) ({	errorf(fc, fmt, ## __VA_ARGS__); -EINVAL; })
+
 #endif /* _LINUX_FS_CONTEXT_H */
diff --git a/include/linux/module.h b/include/linux/module.h
index d44df9b2c131..a5892fd68f5a 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -682,6 +682,12 @@ static inline bool is_module_text_address(unsigned long addr)
 	return false;
 }
 
+static inline bool within_module_core(unsigned long addr,
+				      const struct module *mod)
+{
+	return false;
+}
+
 /* Get/put a kernel symbol (calls should be symmetric) */
 #define symbol_get(x) ({ extern typeof(x) x __attribute__((weak)); &(x); })
 #define symbol_put(x) do { } while (0)



* [PATCH 28/32] vfs: Add some logging to the core users of the fs_context log [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (26 preceding siblings ...)
  2018-07-10 22:44 ` [PATCH 27/32] vfs: Implement logging through fs_context " David Howells
@ 2018-07-10 22:44 ` David Howells
  2018-07-10 22:44 ` [PATCH 29/32] afs: Add fs_context support " David Howells
                   ` (9 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:44 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Add some logging to the core users of the fs_context log so that
information can be extracted from them as to the reason for failure.
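
The cg_invalf() change relies on invalf() both recording a message and
evaluating to -EINVAL, so that a parser can log and fail in one statement.
The following userspace sketch models that idiom using the same GNU C
statement-expression form; struct model_fc, model_logfc() and
model_parse_mode() are hypothetical, and the model keeps only the last
message rather than a ring of them.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for an fs_context that remembers one message. */
struct model_fc {
	char last[128];
};

static void model_logfc(struct model_fc *fc, const char *fmt, ...)
	__attribute__((format(printf, 2, 3)));

static void model_logfc(struct model_fc *fc, const char *fmt, ...)
{
	va_list va;

	va_start(va, fmt);
	vsnprintf(fc->last, sizeof(fc->last), fmt, va);
	va_end(va);
}

/* Same shape as errorf()/invalf(): prefix with "e ", then yield -EINVAL. */
#define model_errorf(fc, fmt, ...) \
	({ model_logfc(fc, "e " fmt, ## __VA_ARGS__); })
#define model_invalf(fc, fmt, ...) \
	({ model_errorf(fc, fmt, ## __VA_ARGS__); -EINVAL; })

/* Example caller: reject out-of-range values with a logged reason. */
static int model_parse_mode(struct model_fc *fc, int mode)
{
	if (mode < 0 || mode > 2)
		return model_invalf(fc, "Unknown mode %d", mode);
	return 0;
}
```

This needs a compiler that accepts statement expressions and the
`## __VA_ARGS__` extension, as gcc and clang do.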

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/super.c                |    4 +++-
 kernel/cgroup/cgroup-v1.c |    2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index bbef5a5057c0..3fe5d12b7697 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1735,8 +1735,10 @@ int vfs_get_tree(struct fs_context *fc)
 	struct super_block *sb;
 	int ret;
 
-	if (fc->fs_type->fs_flags & FS_REQUIRES_DEV && !fc->source)
+	if (fc->fs_type->fs_flags & FS_REQUIRES_DEV && !fc->source) {
+		errorf(fc, "Filesystem requires source device");
 		return -ENOENT;
+	}
 
 	if (fc->root)
 		return -EBUSY;
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 749ccf5c0690..b3d0f37dc80a 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -16,7 +16,7 @@
 
 #include <trace/events/cgroup.h>
 
-#define cg_invalf(fc, fmt, ...) ({ pr_err(fmt, ## __VA_ARGS__); -EINVAL; })
+#define cg_invalf(fc, fmt, ...) invalf(fc, fmt, ## __VA_ARGS__)
 
 /*
  * pidlists linger the following amount before being destroyed.  The goal



* [PATCH 29/32] afs: Add fs_context support [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (27 preceding siblings ...)
  2018-07-10 22:44 ` [PATCH 28/32] vfs: Add some logging to the core users of the fs_context log " David Howells
@ 2018-07-10 22:44 ` David Howells
  2018-07-10 22:44 ` [PATCH 30/32] afs: Use fs_context to pass parameters over automount " David Howells
                   ` (8 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:44 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Add fs_context support to the AFS filesystem, converting the parameter
parsing to store options there.

This will form the basis for namespace propagation over mountpoints within
the AFS model, thereby allowing AFS to be used in containers more easily.
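
For reference, the source-name rules that afs_parse_source() applies can be
exercised in isolation.  The following is a userspace sketch of only the
string handling: it does no cell lookup, and the model_* names, the struct
layout and the NULL-name check are assumptions made for illustration rather
than code from this patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum model_voltype { MODEL_RWVOL, MODEL_ROVOL, MODEL_BACKVOL };

struct model_afs_source {
	bool			no_cell;	/* source was "none" */
	bool			force;		/* volume type is forced */
	enum model_voltype	type;
	const char		*cellname;	/* not NUL-terminated, or NULL */
	size_t			cellnamesz;
	const char		*volname;	/* points into the input */
	size_t			volnamesz;	/* excludes a recognised suffix */
};

static int model_parse_source(const char *name, bool rwpath,
			      struct model_afs_source *out)
{
	const char *suffix;

	memset(out, 0, sizeof(*out));
	if (!name)
		return -EINVAL;	/* assumption: no source given at all */

	if ((name[0] != '%' && name[0] != '#') || !name[1]) {
		if (strcmp(name, "none") == 0) {
			out->no_cell = true;	/* only valid with dynroot */
			return 0;
		}
		return -EINVAL;
	}

	/* '%' or rwpath selects R/W, '#' alone defaults to R/O */
	out->type = MODEL_ROVOL;
	if (rwpath || name[0] == '%') {
		out->type = MODEL_RWVOL;
		out->force = true;
	}
	name++;

	/* split "cell:volume" if a cell name is present */
	out->volname = strchr(name, ':');
	if (out->volname) {
		out->cellname = name;
		out->cellnamesz = (size_t)(out->volname - name);
		out->volname++;
	} else {
		out->volname = name;
	}

	/* a trailing ".readonly", ".backup" or "." is not part of the name */
	suffix = strrchr(out->volname, '.');
	if (suffix) {
		if (strcmp(suffix, ".readonly") == 0) {
			out->type = MODEL_ROVOL;
			out->force = true;
		} else if (strcmp(suffix, ".backup") == 0) {
			out->type = MODEL_BACKVOL;
			out->force = true;
		} else if (suffix[1] != '\0') {
			suffix = NULL;	/* just a dot inside the volume name */
		}
	}

	out->volnamesz = suffix ? (size_t)(suffix - out->volname)
				: strlen(out->volname);
	return 0;
}
```

For example, "%example.org:root.cell" yields an R/W volume "root.cell" in
cell "example.org", whereas "#root.afs.readonly" yields a forced R/O volume
"root.afs" with no cell name given.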

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/internal.h |    8 +
 fs/afs/super.c    |  424 ++++++++++++++++++++++++++++++-----------------------
 fs/afs/volume.c   |    4 -
 3 files changed, 247 insertions(+), 189 deletions(-)

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 9778df135717..d54aab35a1ca 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -34,15 +34,15 @@
 struct pagevec;
 struct afs_call;
 
-struct afs_mount_params {
+struct afs_fs_context {
 	bool			rwpath;		/* T if the parent should be considered R/W */
 	bool			force;		/* T to force cell type */
 	bool			autocell;	/* T if set auto mount operation */
 	bool			dyn_root;	/* T if dynamic root */
+	bool			no_cell;	/* T if the source is "none" (for dynroot) */
 	afs_voltype_t		type;		/* type of volume requested */
-	int			volnamesz;	/* size of volume name */
+	unsigned int		volnamesz;	/* size of volume name */
 	const char		*volname;	/* name of volume to mount */
-	struct net		*net_ns;	/* Network namespace in effect */
 	struct afs_net		*net;		/* the AFS net namespace stuff */
 	struct afs_cell		*cell;		/* cell in which to find volume */
 	struct afs_volume	*volume;	/* volume record */
@@ -1055,7 +1055,7 @@ static inline struct afs_volume *__afs_get_volume(struct afs_volume *volume)
 	return volume;
 }
 
-extern struct afs_volume *afs_create_volume(struct afs_mount_params *);
+extern struct afs_volume *afs_create_volume(struct afs_fs_context *);
 extern void afs_activate_volume(struct afs_volume *);
 extern void afs_deactivate_volume(struct afs_volume *);
 extern void afs_put_volume(struct afs_cell *, struct afs_volume *);
diff --git a/fs/afs/super.c b/fs/afs/super.c
index b85f5e993539..a2237bc411e1 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -1,6 +1,6 @@
 /* AFS superblock handling
  *
- * Copyright (c) 2002, 2007 Red Hat, Inc. All rights reserved.
+ * Copyright (c) 2002, 2007, 2018 Red Hat, Inc. All rights reserved.
  *
  * This software may be freely redistributed under the terms of the
  * GNU General Public License.
@@ -30,22 +30,20 @@
 #include "internal.h"
 
 static void afs_i_init_once(void *foo);
-static struct dentry *afs_mount(struct file_system_type *fs_type,
-				int flags, const char *dev_name,
-				void *data, size_t data_size);
 static void afs_kill_super(struct super_block *sb);
 static struct inode *afs_alloc_inode(struct super_block *sb);
 static void afs_destroy_inode(struct inode *inode);
 static int afs_statfs(struct dentry *dentry, struct kstatfs *buf);
 static int afs_show_devname(struct seq_file *m, struct dentry *root);
 static int afs_show_options(struct seq_file *m, struct dentry *root);
+static int afs_init_fs_context(struct fs_context *fc, struct dentry *reference);
 
 struct file_system_type afs_fs_type = {
-	.owner		= THIS_MODULE,
-	.name		= "afs",
-	.mount		= afs_mount,
-	.kill_sb	= afs_kill_super,
-	.fs_flags	= 0,
+	.owner			= THIS_MODULE,
+	.name			= "afs",
+	.init_fs_context	= afs_init_fs_context,
+	.kill_sb		= afs_kill_super,
+	.fs_flags		= 0,
 };
 MODULE_ALIAS_FS("afs");
 
@@ -191,61 +189,53 @@ static int afs_show_options(struct seq_file *m, struct dentry *root)
 }
 
 /*
- * parse the mount options
- * - this function has been shamelessly adapted from the ext3 fs which
- *   shamelessly adapted it from the msdos fs
+ * Parse a single mount option.
  */
-static int afs_parse_options(struct afs_mount_params *params,
-			     char *options, const char **devname)
+static int afs_parse_option(struct fs_context *fc, char *opt, size_t len)
 {
+	struct afs_fs_context *ctx = fc->fs_private;
 	struct afs_cell *cell;
 	substring_t args[MAX_OPT_ARGS];
-	char *p;
-	int token;
-
-	_enter("%s", options);
-
-	options[PAGE_SIZE - 1] = 0;
-
-	while ((p = strsep(&options, ","))) {
-		if (!*p)
-			continue;
-
-		token = match_token(p, afs_options_list, args);
-		switch (token) {
-		case afs_opt_cell:
-			rcu_read_lock();
-			cell = afs_lookup_cell_rcu(params->net,
-						   args[0].from,
-						   args[0].to - args[0].from);
-			rcu_read_unlock();
-			if (IS_ERR(cell))
-				return PTR_ERR(cell);
-			afs_put_cell(params->net, params->cell);
-			params->cell = cell;
-			break;
-
-		case afs_opt_rwpath:
-			params->rwpath = true;
-			break;
-
-		case afs_opt_vol:
-			*devname = args[0].from;
-			break;
-
-		case afs_opt_autocell:
-			params->autocell = true;
-			break;
-
-		case afs_opt_dyn:
-			params->dyn_root = true;
-			break;
-
-		default:
-			printk(KERN_ERR "kAFS:"
-			       " Unknown or invalid mount option: '%s'\n", p);
+	int token, size;
+
+	_enter("%s", opt);
+
+	token = match_token(opt, afs_options_list, args);
+	switch (token) {
+	case afs_opt_cell:
+		size = args[0].to - args[0].from;
+		if (size <= 0)
 			return -EINVAL;
-		}
+		if (size > AFS_MAXCELLNAME)
+			return -ENAMETOOLONG;
+
+		rcu_read_lock();
+		cell = afs_lookup_cell_rcu(ctx->net, args[0].from, size);
+		rcu_read_unlock();
+		if (IS_ERR(cell))
+			return PTR_ERR(cell);
+		afs_put_cell(ctx->net, ctx->cell);
+		ctx->cell = cell;
+		break;
+
+	case afs_opt_rwpath:
+		ctx->rwpath = true;
+		break;
+
+	case afs_opt_vol:
+		return -EINVAL; /* Not required for automount */
+
+	case afs_opt_autocell:
+		ctx->autocell = true;
+		break;
+
+	case afs_opt_dyn:
+		ctx->dyn_root = true;
+		break;
+
+	default:
+		printk(KERN_ERR "kAFS: Unknown or invalid mount option: '%s'\n", opt);
+		return -EINVAL;
 	}
 
 	_leave(" = 0");
@@ -253,9 +243,10 @@ static int afs_parse_options(struct afs_mount_params *params,
 }
 
 /*
- * parse a device name to get cell name, volume name, volume type and R/W
- * selector
- * - this can be one of the following:
+ * Parse the source name to get cell name, volume name, volume type and R/W
+ * selector.
+ *
+ * This can be one of the following:
  *	"%[cell:]volume[.]"		R/W volume
  *	"#[cell:]volume[.]"		R/O or R/W volume (rwpath=0),
  *					 or R/W (rwpath=1) volume
@@ -264,9 +255,9 @@ static int afs_parse_options(struct afs_mount_params *params,
  *	"%[cell:]volume.backup"		Backup volume
  *	"#[cell:]volume.backup"		Backup volume
  */
-static int afs_parse_device_name(struct afs_mount_params *params,
-				 const char *name)
+static int afs_parse_source(struct fs_context *fc, char *name)
 {
+	struct afs_fs_context *ctx = fc->fs_private;
 	struct afs_cell *cell;
 	const char *cellname, *suffix;
 	int cellnamesz;
@@ -279,69 +270,116 @@ static int afs_parse_device_name(struct afs_mount_params *params,
 	}
 
 	if ((name[0] != '%' && name[0] != '#') || !name[1]) {
+		/* To use dynroot, we don't want to have to provide a source */
+		if (strcmp(name, "none") == 0) {
+			ctx->no_cell = true;
+			return 0;
+		}
 		printk(KERN_ERR "kAFS: unparsable volume name\n");
 		return -EINVAL;
 	}
 
 	/* determine the type of volume we're looking for */
-	params->type = AFSVL_ROVOL;
-	params->force = false;
-	if (params->rwpath || name[0] == '%') {
-		params->type = AFSVL_RWVOL;
-		params->force = true;
+	ctx->type = AFSVL_ROVOL;
+	ctx->force = false;
+	if (ctx->rwpath || name[0] == '%') {
+		ctx->type = AFSVL_RWVOL;
+		ctx->force = true;
 	}
 	name++;
 
 	/* split the cell name out if there is one */
-	params->volname = strchr(name, ':');
-	if (params->volname) {
+	ctx->volname = strchr(name, ':');
+	if (ctx->volname) {
 		cellname = name;
-		cellnamesz = params->volname - name;
-		params->volname++;
+		cellnamesz = ctx->volname - name;
+		ctx->volname++;
 	} else {
-		params->volname = name;
+		ctx->volname = name;
 		cellname = NULL;
 		cellnamesz = 0;
 	}
 
 	/* the volume type is further affected by a possible suffix */
-	suffix = strrchr(params->volname, '.');
+	suffix = strrchr(ctx->volname, '.');
 	if (suffix) {
 		if (strcmp(suffix, ".readonly") == 0) {
-			params->type = AFSVL_ROVOL;
-			params->force = true;
+			ctx->type = AFSVL_ROVOL;
+			ctx->force = true;
 		} else if (strcmp(suffix, ".backup") == 0) {
-			params->type = AFSVL_BACKVOL;
-			params->force = true;
+			ctx->type = AFSVL_BACKVOL;
+			ctx->force = true;
 		} else if (suffix[1] == 0) {
 		} else {
 			suffix = NULL;
 		}
 	}
 
-	params->volnamesz = suffix ?
-		suffix - params->volname : strlen(params->volname);
+	ctx->volnamesz = suffix ?
+		suffix - ctx->volname : strlen(ctx->volname);
 
 	_debug("cell %*.*s [%p]",
-	       cellnamesz, cellnamesz, cellname ?: "", params->cell);
+	       cellnamesz, cellnamesz, cellname ?: "", ctx->cell);
 
 	/* lookup the cell record */
-	if (cellname || !params->cell) {
-		cell = afs_lookup_cell(params->net, cellname, cellnamesz,
+	if (cellname) {
+		cell = afs_lookup_cell(ctx->net, cellname, cellnamesz,
 				       NULL, false);
 		if (IS_ERR(cell)) {
-			printk(KERN_ERR "kAFS: unable to lookup cell '%*.*s'\n",
+			pr_err("kAFS: unable to lookup cell '%*.*s'\n",
 			       cellnamesz, cellnamesz, cellname ?: "");
 			return PTR_ERR(cell);
 		}
-		afs_put_cell(params->net, params->cell);
-		params->cell = cell;
+		afs_put_cell(ctx->net, ctx->cell);
+		ctx->cell = cell;
 	}
 
 	_debug("CELL:%s [%p] VOLUME:%*.*s SUFFIX:%s TYPE:%d%s",
-	       params->cell->name, params->cell,
-	       params->volnamesz, params->volnamesz, params->volname,
-	       suffix ?: "-", params->type, params->force ? " FORCE" : "");
+	       ctx->cell->name, ctx->cell,
+	       ctx->volnamesz, ctx->volnamesz, ctx->volname,
+	       suffix ?: "-", ctx->type, ctx->force ? " FORCE" : "");
+
+	return 0;
+}
+
+/*
+ * Validate the options, get the cell key and look up the volume.
+ */
+static int afs_validate_fc(struct fs_context *fc)
+{
+	struct afs_fs_context *ctx = fc->fs_private;
+	struct afs_volume *volume;
+	struct key *key;
+
+	if (!ctx->dyn_root) {
+		if (ctx->no_cell) {
+			pr_warn("kAFS: Can only specify source 'none' with -o dyn\n");
+			return -EINVAL;
+		}
+
+		if (!ctx->cell) {
+			pr_warn("kAFS: No cell specified\n");
+			return -EDESTADDRREQ;
+		}
+
+		/* We try to do the mount securely. */
+		key = afs_request_key(ctx->cell);
+		if (IS_ERR(key))
+			return PTR_ERR(key);
+
+		ctx->key = key;
+
+		if (ctx->volume) {
+			afs_put_volume(ctx->cell, ctx->volume);
+			ctx->volume = NULL;
+		}
+
+		volume = afs_create_volume(ctx);
+		if (IS_ERR(volume))
+			return PTR_ERR(volume);
+
+		ctx->volume = volume;
+	}
 
 	return 0;
 }
@@ -349,39 +387,34 @@ static int afs_parse_device_name(struct afs_mount_params *params,
 /*
  * check a superblock to see if it's the one we're looking for
  */
-static int afs_test_super(struct super_block *sb, void *data)
+static int afs_test_super(struct super_block *sb, struct fs_context *fc)
 {
-	struct afs_super_info *as1 = data;
+	struct afs_fs_context *ctx = fc->fs_private;
 	struct afs_super_info *as = AFS_FS_S(sb);
 
-	return (as->net_ns == as1->net_ns &&
+	return (as->net_ns == fc->net_ns &&
 		as->volume &&
-		as->volume->vid == as1->volume->vid &&
+		as->volume->vid == ctx->volume->vid &&
 		!as->dyn_root);
 }
 
-static int afs_dynroot_test_super(struct super_block *sb, void *data)
+static int afs_dynroot_test_super(struct super_block *sb, struct fs_context *fc)
 {
-	struct afs_super_info *as1 = data;
 	struct afs_super_info *as = AFS_FS_S(sb);
 
-	return (as->net_ns == as1->net_ns &&
+	return (as->net_ns == fc->net_ns &&
 		as->dyn_root);
 }
 
-static int afs_set_super(struct super_block *sb, void *data)
+static int afs_set_super(struct super_block *sb, struct fs_context *fc)
 {
-	struct afs_super_info *as = data;
-
-	sb->s_fs_info = as;
 	return set_anon_super(sb, NULL);
 }
 
 /*
  * fill in the superblock
  */
-static int afs_fill_super(struct super_block *sb,
-			  struct afs_mount_params *params)
+static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx)
 {
 	struct afs_super_info *as = AFS_FS_S(sb);
 	struct afs_fid fid;
@@ -412,13 +445,13 @@ static int afs_fill_super(struct super_block *sb,
 		fid.vid		= as->volume->vid;
 		fid.vnode	= 1;
 		fid.unique	= 1;
-		inode = afs_iget(sb, params->key, &fid, NULL, NULL, NULL);
+		inode = afs_iget(sb, ctx->key, &fid, NULL, NULL, NULL);
 	}
 
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
-	if (params->autocell || params->dyn_root)
+	if (ctx->autocell || as->dyn_root)
 		set_bit(AFS_VNODE_AUTOCELL, &AFS_FS_I(inode)->flags);
 
 	ret = -ENOMEM;
@@ -443,17 +476,20 @@ static int afs_fill_super(struct super_block *sb,
 	return ret;
 }
 
-static struct afs_super_info *afs_alloc_sbi(struct afs_mount_params *params)
+static struct afs_super_info *afs_alloc_sbi(struct fs_context *fc)
 {
+	struct afs_fs_context *ctx = fc->fs_private;
 	struct afs_super_info *as;
 
 	as = kzalloc(sizeof(struct afs_super_info), GFP_KERNEL);
 	if (as) {
-		as->net_ns = get_net(params->net_ns);
-		if (params->dyn_root)
+		as->net_ns = get_net(fc->net_ns);
+		if (ctx->dyn_root) {
 			as->dyn_root = true;
-		else
-			as->cell = afs_get_cell(params->cell);
+		} else {
+			as->cell = afs_get_cell(ctx->cell);
+			as->volume = __afs_get_volume(ctx->volume);
+		}
 	}
 	return as;
 }
@@ -488,112 +524,134 @@ static void afs_kill_super(struct super_block *sb)
 }
 
 /*
- * get an AFS superblock
+ * Get an AFS superblock and root directory.
  */
-static struct dentry *afs_mount(struct file_system_type *fs_type,
-				int flags, const char *dev_name,
-				void *options, size_t data_size)
+static int afs_get_tree(struct fs_context *fc)
 {
-	struct afs_mount_params params;
+	struct afs_fs_context *ctx = fc->fs_private;
 	struct super_block *sb;
-	struct afs_volume *candidate;
-	struct key *key;
 	struct afs_super_info *as;
 	int ret;
 
-	_enter(",,%s,%p", dev_name, options);
-
-	memset(&params, 0, sizeof(params));
-
-	ret = -EINVAL;
-	if (current->nsproxy->net_ns != &init_net)
-		goto error;
-	params.net_ns = current->nsproxy->net_ns;
-	params.net = afs_net(params.net_ns);
-	
-	/* parse the options and device name */
-	if (options) {
-		ret = afs_parse_options(&params, options, &dev_name);
-		if (ret < 0)
-			goto error;
-	}
-
-	if (!params.dyn_root) {
-		ret = afs_parse_device_name(&params, dev_name);
-		if (ret < 0)
-			goto error;
-
-		/* try and do the mount securely */
-		key = afs_request_key(params.cell);
-		if (IS_ERR(key)) {
-			_leave(" = %ld [key]", PTR_ERR(key));
-			ret = PTR_ERR(key);
-			goto error;
-		}
-		params.key = key;
-	}
+	_enter("%s", fc->source);
 
 	/* allocate a superblock info record */
 	ret = -ENOMEM;
-	as = afs_alloc_sbi(&params);
+	as = afs_alloc_sbi(fc);
 	if (!as)
-		goto error_key;
-
-	if (!params.dyn_root) {
-		/* Assume we're going to need a volume record; at the very
-		 * least we can use it to update the volume record if we have
-		 * one already.  This checks that the volume exists within the
-		 * cell.
-		 */
-		candidate = afs_create_volume(&params);
-		if (IS_ERR(candidate)) {
-			ret = PTR_ERR(candidate);
-			goto error_as;
-		}
-
-		as->volume = candidate;
-	}
+		goto error;
+	fc->s_fs_info = as;
 
 	/* allocate a deviceless superblock */
-	sb = sget(fs_type,
-		  as->dyn_root ? afs_dynroot_test_super : afs_test_super,
-		  afs_set_super, flags, as);
+	sb = sget_fc(fc,
+		     as->dyn_root ? afs_dynroot_test_super : afs_test_super,
+		     afs_set_super);
 	if (IS_ERR(sb)) {
 		ret = PTR_ERR(sb);
-		goto error_as;
+		goto error;
 	}
 
 	if (!sb->s_root) {
 		/* initial superblock/root creation */
 		_debug("create");
-		ret = afs_fill_super(sb, &params);
+		ret = afs_fill_super(sb, ctx);
 		if (ret < 0)
 			goto error_sb;
-		as = NULL;
 		sb->s_flags |= SB_ACTIVE;
 	} else {
 		_debug("reuse");
 		ASSERTCMP(sb->s_flags, &, SB_ACTIVE);
-		afs_destroy_sbi(as);
-		as = NULL;
 	}
 
-	afs_put_cell(params.net, params.cell);
-	key_put(params.key);
+	fc->root = dget(sb->s_root);
 	_leave(" = 0 [%p]", sb);
-	return dget(sb->s_root);
+	return 0;
 
 error_sb:
 	deactivate_locked_super(sb);
-	goto error_key;
-error_as:
-	afs_destroy_sbi(as);
-error_key:
-	key_put(params.key);
 error:
-	afs_put_cell(params.net, params.cell);
 	_leave(" = %d", ret);
-	return ERR_PTR(ret);
+	return ret;
+}
+
+static void afs_free_fc(struct fs_context *fc)
+{
+	struct afs_fs_context *ctx = fc->fs_private;
+
+	if (ctx) {
+		afs_destroy_sbi(fc->s_fs_info);
+		afs_put_volume(ctx->cell, ctx->volume);
+		afs_put_cell(ctx->net, ctx->cell);
+		key_put(ctx->key);
+		kfree(ctx);
+	}
+}
+
+static const struct fs_context_operations afs_context_ops = {
+	.free		= afs_free_fc,
+	.parse_source	= afs_parse_source,
+	.parse_option	= afs_parse_option,
+	.validate	= afs_validate_fc,
+	.get_tree	= afs_get_tree,
+};
+
+/*
+ * Set up the filesystem mount context.
+ */
+static int afs_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+	struct afs_fs_context *ctx;
+	struct afs_super_info *src_as;
+	struct afs_cell *cell;
+
+	if (current->nsproxy->net_ns != &init_net)
+		return -EINVAL;
+
+	ctx = kzalloc(sizeof(struct afs_fs_context), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->type = AFSVL_ROVOL;
+
+	switch (fc->purpose) {
+	case FS_CONTEXT_FOR_USER_MOUNT:
+	case FS_CONTEXT_FOR_KERNEL_MOUNT:
+		ctx->net = afs_net(fc->net_ns);
+
+		/* Default to the workstation cell. */
+		rcu_read_lock();
+		cell = afs_lookup_cell_rcu(ctx->net, NULL, 0);
+		rcu_read_unlock();
+		if (IS_ERR(cell))
+			cell = NULL;
+		ctx->cell = cell;
+		break;
+
+	case FS_CONTEXT_FOR_SUBMOUNT:
+		if (!reference) {
+			kfree(ctx);
+			return -EINVAL;
+		}
+
+		src_as = AFS_FS_S(reference->d_sb);
+		ASSERT(src_as);
+
+		ctx->net = afs_net(fc->net_ns);
+		if (src_as->cell)
+			ctx->cell = afs_get_cell(src_as->cell);
+		if (src_as->volume && src_as->volume->type == AFSVL_RWVOL) {
+			ctx->type = AFSVL_RWVOL;
+			ctx->force = true;
+		}
+		break;
+
+	case FS_CONTEXT_FOR_RECONFIGURE:
+		break;
+	}
+
+	fc->fs_private = ctx;
+	fc->ops = &afs_context_ops;
+	return 0;
 }
 
 /*
diff --git a/fs/afs/volume.c b/fs/afs/volume.c
index 3037bd01f617..7adcddf02e66 100644
--- a/fs/afs/volume.c
+++ b/fs/afs/volume.c
@@ -21,7 +21,7 @@ static const char *const afs_voltypes[] = { "R/W", "R/O", "BAK" };
 /*
  * Allocate a volume record and load it up from a vldb record.
  */
-static struct afs_volume *afs_alloc_volume(struct afs_mount_params *params,
+static struct afs_volume *afs_alloc_volume(struct afs_fs_context *params,
 					   struct afs_vldb_entry *vldb,
 					   unsigned long type_mask)
 {
@@ -149,7 +149,7 @@ static struct afs_vldb_entry *afs_vl_lookup_vldb(struct afs_cell *cell,
  * - Rule 3: If parent volume is R/W, then only mount R/W volume unless
  *           explicitly told otherwise
  */
-struct afs_volume *afs_create_volume(struct afs_mount_params *params)
+struct afs_volume *afs_create_volume(struct afs_fs_context *params)
 {
 	struct afs_vldb_entry *vldb;
 	struct afs_volume *volume;


* [PATCH 30/32] afs: Use fs_context to pass parameters over automount [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (28 preceding siblings ...)
  2018-07-10 22:44 ` [PATCH 29/32] afs: Add fs_context support " David Howells
@ 2018-07-10 22:44 ` David Howells
  2018-07-10 22:44 ` [PATCH 31/32] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
                   ` (7 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:44 UTC (permalink / raw)
  To: viro; +Cc: dhowells, Eric W. Biederman, linux-fsdevel, torvalds, linux-kernel

Alter the AFS automounting code to create and modify an fs_context struct
when parameterising a new mount triggered by an AFS mountpoint rather than
constructing device name and option strings.

Also remove the cell=, vol= and rwpath options as they are then redundant.
They existed because the 'device name' may be derived
literally from a mountpoint object in the filesystem, so default cell and
parent-type information needed to be passed in by some other method from
the automount routines.  The vol= option didn't end up being used.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric W. Biederman <ebiederm@redhat.com>
---

 fs/afs/internal.h |    1 
 fs/afs/mntpt.c    |  148 +++++++++++++++++++++++++++--------------------------
 fs/afs/super.c    |   43 +--------------
 3 files changed, 79 insertions(+), 113 deletions(-)

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index d54aab35a1ca..e35d59761d47 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -35,7 +35,6 @@ struct pagevec;
 struct afs_call;
 
 struct afs_fs_context {
-	bool			rwpath;		/* T if the parent should be considered R/W */
 	bool			force;		/* T to force cell type */
 	bool			autocell;	/* T if set auto mount operation */
 	bool			dyn_root;	/* T if dynamic root */
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index c45aa1776591..fc383d727552 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -47,6 +47,8 @@ static DECLARE_DELAYED_WORK(afs_mntpt_expiry_timer, afs_mntpt_expiry_timed_out);
 
 static unsigned long afs_mntpt_expiry_timeout = 10 * 60;
 
+static const char afs_root_volume[] = "root.cell";
+
 /*
  * no valid lookup procedure on this sort of dir
  */
@@ -68,107 +70,107 @@ static int afs_mntpt_open(struct inode *inode, struct file *file)
 }
 
 /*
- * create a vfsmount to be automounted
+ * Set the parameters for the proposed superblock.
  */
-static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
+static int afs_mntpt_set_params(struct fs_context *fc, struct dentry *mntpt)
 {
-	struct afs_super_info *as;
-	struct vfsmount *mnt;
-	struct afs_vnode *vnode;
-	struct page *page;
-	char *devname, *options;
-	bool rwpath = false;
+	struct afs_fs_context *ctx = fc->fs_private;
+	struct afs_vnode *vnode = AFS_FS_I(d_inode(mntpt));
+	struct afs_cell *cell;
+	const char *p;
 	int ret;
 
-	_enter("{%pd}", mntpt);
-
-	BUG_ON(!d_inode(mntpt));
-
-	ret = -ENOMEM;
-	devname = (char *) get_zeroed_page(GFP_KERNEL);
-	if (!devname)
-		goto error_no_devname;
-
-	options = (char *) get_zeroed_page(GFP_KERNEL);
-	if (!options)
-		goto error_no_options;
-
-	vnode = AFS_FS_I(d_inode(mntpt));
 	if (test_bit(AFS_VNODE_PSEUDODIR, &vnode->flags)) {
 		/* if the directory is a pseudo directory, use the d_name */
-		static const char afs_root_cell[] = ":root.cell.";
 		unsigned size = mntpt->d_name.len;
 
-		ret = -ENOENT;
-		if (size < 2 || size > AFS_MAXCELLNAME)
-			goto error_no_page;
+		if (size < 2)
+			return -ENOENT;
 
+		p = mntpt->d_name.name;
 		if (mntpt->d_name.name[0] == '.') {
-			devname[0] = '%';
-			memcpy(devname + 1, mntpt->d_name.name + 1, size - 1);
-			memcpy(devname + size, afs_root_cell,
-			       sizeof(afs_root_cell));
-			rwpath = true;
-		} else {
-			devname[0] = '#';
-			memcpy(devname + 1, mntpt->d_name.name, size);
-			memcpy(devname + size + 1, afs_root_cell,
-			       sizeof(afs_root_cell));
+			size--;
+			p++;
+			ctx->type = AFSVL_RWVOL;
+			ctx->force = true;
+		}
+		if (size > AFS_MAXCELLNAME)
+			return -ENAMETOOLONG;
+
+		cell = afs_lookup_cell(ctx->net, p, size, NULL, false);
+		if (IS_ERR(cell)) {
+			pr_err("kAFS: unable to lookup cell '%pd'\n", mntpt);
+			return PTR_ERR(cell);
 		}
+		afs_put_cell(ctx->net, ctx->cell);
+		ctx->cell = cell;
+
+		ctx->volname = afs_root_volume;
+		ctx->volnamesz = sizeof(afs_root_volume) - 1;
 	} else {
 		/* read the contents of the AFS special symlink */
+		struct page *page;
 		loff_t size = i_size_read(d_inode(mntpt));
 		char *buf;
 
-		ret = -EINVAL;
 		if (size > PAGE_SIZE - 1)
-			goto error_no_page;
+			return -EINVAL;
 
 		page = read_mapping_page(d_inode(mntpt)->i_mapping, 0, NULL);
-		if (IS_ERR(page)) {
-			ret = PTR_ERR(page);
-			goto error_no_page;
-		}
+		if (IS_ERR(page))
+			return PTR_ERR(page);
 
-		ret = -EIO;
-		if (PageError(page))
-			goto error;
+		if (PageError(page)) {
+			put_page(page);
+			return -EIO;
+		}
 
-		buf = kmap_atomic(page);
-		memcpy(devname, buf, size);
-		kunmap_atomic(buf);
+		buf = kmap(page);
+		ret = vfs_set_fs_source(fc, buf, size);
+		kunmap(page);
 		put_page(page);
-		page = NULL;
+		if (ret < 0)
+			return ret;
 	}
 
-	/* work out what options we want */
-	as = AFS_FS_S(mntpt->d_sb);
-	if (as->cell) {
-		memcpy(options, "cell=", 5);
-		strcpy(options + 5, as->cell->name);
-		if ((as->volume && as->volume->type == AFSVL_RWVOL) || rwpath)
-			strcat(options, ",rwpath");
-	}
+	return 0;
+}
 
-	/* try and do the mount */
-	_debug("--- attempting mount %s -o %s ---", devname, options);
-	mnt = vfs_submount(mntpt, &afs_fs_type, devname,
-			   options, strlen(options) + 1);
-	_debug("--- mount result %p ---", mnt);
+/*
+ * create a vfsmount to be automounted
+ */
+static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
+{
+	struct fs_context *fc;
+	struct vfsmount *mnt;
+	int ret;
+
+	BUG_ON(!d_inode(mntpt));
+
+	fc = vfs_new_fs_context(&afs_fs_type, mntpt, 0,
+				FS_CONTEXT_FOR_SUBMOUNT);
+	if (IS_ERR(fc))
+		return ERR_CAST(fc);
+
+	ret = afs_mntpt_set_params(fc, mntpt);
+	if (ret < 0)
+		goto error_fc;
+
+	ret = vfs_get_tree(fc);
+	if (ret < 0)
+		goto error_fc;
+
+	mnt = vfs_create_mount(fc, 0);
+	if (IS_ERR(mnt)) {
+		ret = PTR_ERR(mnt);
+		goto error_fc;
+	}
 
-	free_page((unsigned long) devname);
-	free_page((unsigned long) options);
-	_leave(" = %p", mnt);
+	put_fs_context(fc);
 	return mnt;
 
-error:
-	put_page(page);
-error_no_page:
-	free_page((unsigned long) options);
-error_no_options:
-	free_page((unsigned long) devname);
-error_no_devname:
-	_leave(" = %d", ret);
+error_fc:
+	put_fs_context(fc);
 	return ERR_PTR(ret);
 }
 
diff --git a/fs/afs/super.c b/fs/afs/super.c
index a2237bc411e1..ab64edff11af 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -64,18 +64,12 @@ static atomic_t afs_count_active_inodes;
 
 enum {
 	afs_no_opt,
-	afs_opt_cell,
 	afs_opt_dyn,
-	afs_opt_rwpath,
-	afs_opt_vol,
 	afs_opt_autocell,
 };
 
 static const match_table_t afs_options_list = {
-	{ afs_opt_cell,		"cell=%s"	},
 	{ afs_opt_dyn,		"dyn"		},
-	{ afs_opt_rwpath,	"rwpath"	},
-	{ afs_opt_vol,		"vol=%s"	},
 	{ afs_opt_autocell,	"autocell"	},
 	{ afs_no_opt,		NULL		},
 };
@@ -194,37 +188,13 @@ static int afs_show_options(struct seq_file *m, struct dentry *root)
 static int afs_parse_option(struct fs_context *fc, char *opt, size_t len)
 {
 	struct afs_fs_context *ctx = fc->fs_private;
-	struct afs_cell *cell;
 	substring_t args[MAX_OPT_ARGS];
-	int token, size;
+	int token;
 
 	_enter("%s", opt);
 
 	token = match_token(opt, afs_options_list, args);
 	switch (token) {
-	case afs_opt_cell:
-		size = args[0].to - args[0].from;
-		if (size <= 0)
-			return -EINVAL;
-		if (size > AFS_MAXCELLNAME)
-			return -ENAMETOOLONG;
-
-		rcu_read_lock();
-		cell = afs_lookup_cell_rcu(ctx->net, args[0].from, size);
-		rcu_read_unlock();
-		if (IS_ERR(cell))
-			return PTR_ERR(cell);
-		afs_put_cell(ctx->net, ctx->cell);
-		ctx->cell = cell;
-		break;
-
-	case afs_opt_rwpath:
-		ctx->rwpath = true;
-		break;
-
-	case afs_opt_vol:
-		return -EINVAL; /* Not required for automount */
-
 	case afs_opt_autocell:
 		ctx->autocell = true;
 		break;
@@ -248,8 +218,8 @@ static int afs_parse_option(struct fs_context *fc, char *opt, size_t len)
  *
  * This can be one of the following:
  *	"%[cell:]volume[.]"		R/W volume
- *	"#[cell:]volume[.]"		R/O or R/W volume (rwpath=0),
- *					 or R/W (rwpath=1) volume
+ *	"#[cell:]volume[.]"		R/O or R/W volume (R/O parent),
+ *					 or R/W (R/W parent) volume
  *	"%[cell:]volume.readonly"	R/O volume
  *	"#[cell:]volume.readonly"	R/O volume
  *	"%[cell:]volume.backup"		Backup volume
@@ -280,9 +250,7 @@ static int afs_parse_source(struct fs_context *fc, char *name)
 	}
 
 	/* determine the type of volume we're looking for */
-	ctx->type = AFSVL_ROVOL;
-	ctx->force = false;
-	if (ctx->rwpath || name[0] == '%') {
+	if (name[0] == '%') {
 		ctx->type = AFSVL_RWVOL;
 		ctx->force = true;
 	}
@@ -604,9 +572,6 @@ static int afs_init_fs_context(struct fs_context *fc, struct dentry *reference)
 	struct afs_super_info *src_as;
 	struct afs_cell *cell;
 
-	if (current->nsproxy->net_ns != &init_net)
-		return -EINVAL;
-
 	ctx = kzalloc(sizeof(struct afs_fs_context), GFP_KERNEL);
 	if (!ctx)
 		return -ENOMEM;


* [PATCH 31/32] vfs: syscall: Add fsinfo() to query filesystem information [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (29 preceding siblings ...)
  2018-07-10 22:44 ` [PATCH 30/32] afs: Use fs_context to pass parameters over automount " David Howells
@ 2018-07-10 22:44 ` David Howells
  2018-07-10 22:45 ` [PATCH 32/32] afs: Add fsinfo support " David Howells
                   ` (6 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:44 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-api, linux-fsdevel, torvalds, linux-kernel

Add a system call to allow filesystem information to be queried.  A request
value can be given to indicate the desired attribute.  Support is provided
for enumerating multi-value attributes.

===============
NEW SYSTEM CALL
===============

The new system call looks like:

	int ret = fsinfo(int dfd,
			 const char *filename,
			 const struct fsinfo_params *params,
			 void *buffer,
			 size_t buf_size);

The params parameter optionally points to a block of parameters:

	struct fsinfo_params {
		__u32	at_flags;
		__u32	request;
		__u32	Nth;
		__u32	Mth;
		__u32	__reserved[6];
	};

If params is NULL, the request defaults to fsinfo_attr_statfs, with Nth, Mth
and at_flags all taken to be 0.

If params is given, all of params->__reserved[] must be 0.
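
As a rough userspace illustration of the two rules above (defaults for a
NULL params pointer, and rejection of non-zero reserved words), something
like the following could be used.  It is only a sketch: demo_resolve_params()
and struct demo_fsinfo_params are hypothetical names, and the value 0 for
fsinfo_attr_statfs is an assumption, not taken from the uapi header.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ATTR_STATFS 0	/* assumed value of fsinfo_attr_statfs */

/* Mirrors the layout of struct fsinfo_params as described above. */
struct demo_fsinfo_params {
	uint32_t at_flags;
	uint32_t request;
	uint32_t Nth;
	uint32_t Mth;
	uint32_t reserved[6];
};

/*
 * Work out the effective parameters.  A NULL input selects the statfs
 * attribute with everything else zero; a non-zero reserved word is
 * rejected with -EINVAL.
 */
int demo_resolve_params(const struct demo_fsinfo_params *in,
			struct demo_fsinfo_params *out)
{
	size_t i;

	assert(out != NULL);
	memset(out, 0, sizeof(*out));

	if (!in) {
		out->request = DEMO_ATTR_STATFS;
		return 0;
	}

	for (i = 0; i < 6; i++)
		if (in->reserved[i] != 0)
			return -EINVAL;

	*out = *in;
	return 0;
}
```

Zero-initialising the whole struct before filling in fields is the easy way
for a caller to satisfy the reserved-words rule.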

dfd, filename and params->at_flags indicate the file to query.  There is no
equivalent of lstat() as that can be emulated with fsinfo() by setting
AT_SYMLINK_NOFOLLOW in params->at_flags.  There is also no equivalent of
fstat() as that can be emulated by passing a NULL filename to fsinfo() with
the fd of interest in dfd.  AT_NO_AUTOMOUNT can also be used to allow an
automount point to be queried without triggering it.
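
To make the stat()/lstat()/fstat() correspondence concrete, here is a small
sketch of how a caller might pick the dfd, filename and at_flags arguments
for each style.  The helper and its types (demo_make_target(), struct
demo_target, enum demo_style) are made up for illustration; only the AT_*
flags are real, and the fallback definitions are there in case the libc
headers do not expose them.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>

/* Fallbacks in case the libc headers hide these; values match Linux. */
#ifndef AT_SYMLINK_NOFOLLOW
#define AT_SYMLINK_NOFOLLOW 0x100
#endif
#ifndef AT_NO_AUTOMOUNT
#define AT_NO_AUTOMOUNT 0x800
#endif

/* Hypothetical helper types, not part of the patch. */
enum demo_style { DEMO_STAT_LIKE, DEMO_LSTAT_LIKE, DEMO_FSTAT_LIKE };

struct demo_target {
	int		dfd;		/* Directory fd or fd of interest */
	const char	*filename;	/* NULL for the fstat()-like form */
	uint32_t	at_flags;	/* Goes into params->at_flags */
};

struct demo_target demo_make_target(enum demo_style style, int dfd,
				    const char *path, int no_automount)
{
	struct demo_target t = { .dfd = dfd, .filename = path, .at_flags = 0 };

	/* The path-based forms need a path to walk. */
	assert(style == DEMO_FSTAT_LIKE || path != NULL);

	switch (style) {
	case DEMO_STAT_LIKE:
		break;
	case DEMO_LSTAT_LIKE:
		/* Don't follow a terminal symlink. */
		t.at_flags |= AT_SYMLINK_NOFOLLOW;
		break;
	case DEMO_FSTAT_LIKE:
		/* Query the object that dfd itself refers to. */
		t.filename = NULL;
		break;
	}

	if (no_automount)
		t.at_flags |= AT_NO_AUTOMOUNT;
	return t;
}
```

The resulting dfd and filename would be passed straight to fsinfo(), with
at_flags stored in the params block.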

params->request indicates the attribute/attributes to be queried.  This can
be one of:

	fsinfo_attr_statfs		- statfs-style info
	fsinfo_attr_fsinfo		- Information about fsinfo()
	fsinfo_attr_ids			- Filesystem IDs
	fsinfo_attr_limits		- Filesystem limits
	fsinfo_attr_supports		- What's supported in statx(), IOC flags
	fsinfo_attr_capabilities	- Filesystem capabilities
	fsinfo_attr_timestamp_info	- Inode timestamp info
	fsinfo_attr_volume_id		- Volume ID (string)
	fsinfo_attr_volume_uuid		- Volume UUID
	fsinfo_attr_volume_name		- Volume name (string)
	fsinfo_attr_cell_name		- Cell name (string)
	fsinfo_attr_domain_name		- Domain name (string)
	fsinfo_attr_realm_name		- Realm name (string)
	fsinfo_attr_server_name		- Name of the Nth server (string)
	fsinfo_attr_server_address	- Mth address of the Nth server
	fsinfo_attr_parameter		- Nth mount parameter (string)
	fsinfo_attr_source		- Nth mount source name (string)
	fsinfo_attr_name_encoding	- Filename encoding (string)
	fsinfo_attr_name_codepage	- Filename codepage (string)
	fsinfo_attr_io_size		- Optimal I/O sizes

Some attributes (such as the servers backing a network filesystem) can have
multiple values.  These can be enumerated by setting params->Nth and
params->Mth to 0, 1, ... until ENODATA is returned.
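
The enumeration pattern might look something like the sketch below, which
walks every (Nth, Mth) pair of a two-level attribute such as
fsinfo_attr_server_address.  The query function is a mock standing in for
the real system call, the server table is invented, and treating ENODATA at
Mth == 0 as "no such Nth server" is an assumption about how a caller would
detect the end of the outer list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NR_SERVERS	2
#define DEMO_MAX_ADDRS	3

/* Hypothetical stand-in for an fsinfo() query of one (Nth, Mth) value. */
typedef int (*demo_query_fn)(unsigned int nth, unsigned int mth,
			     char *buf, size_t buf_size);

/* Invented data: server 0 has two addresses, server 1 has one. */
static const char *const demo_servers[DEMO_NR_SERVERS][DEMO_MAX_ADDRS] = {
	{ "192.0.2.1", "192.0.2.2", NULL },
	{ "192.0.2.9", NULL, NULL },
};

/* Mock query: returns the value's length, or -ENODATA if out of range. */
int demo_query(unsigned int nth, unsigned int mth, char *buf, size_t buf_size)
{
	const char *addr;
	size_t len;

	if (nth >= DEMO_NR_SERVERS || mth >= DEMO_MAX_ADDRS ||
	    !demo_servers[nth][mth])
		return -ENODATA;

	addr = demo_servers[nth][mth];
	len = strlen(addr);
	if (buf && buf_size)
		memcpy(buf, addr, len < buf_size ? len : buf_size);
	return (int)len;
}

/* Count every address of every server by stepping Nth and Mth. */
int demo_count_addresses(demo_query_fn query)
{
	unsigned int nth, mth;
	char buf[64];
	int total = 0;

	assert(query != NULL);

	for (nth = 0; ; nth++) {
		for (mth = 0; ; mth++) {
			int ret = query(nth, mth, buf, sizeof(buf));

			if (ret == -ENODATA)
				break;		/* No more values for this Nth */
			if (ret < 0)
				return ret;	/* Some other error */
			total++;
		}
		if (mth == 0)
			break;			/* Nth itself doesn't exist */
	}
	return total;
}
```

For a single-level attribute such as fsinfo_attr_server_name, only the outer
loop would be needed, with Mth left at 0.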

buffer and buf_size point to the reply buffer.  The buffer is filled up to
the specified size, even if this means truncating the reply.  The full size
of the reply is returned.  In future versions, this will allow extra fields
to be tacked on to the end of the reply, but anyone not expecting them will
only get the subset they're expecting.  If either buffer or buf_size is 0,
no copy will take place and the data size will be returned.
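
The copy-out rule can be modelled in a few lines of userspace C.  This is
just a sketch of the semantics described above, with demo_copy_reply() as a
made-up name: it copies at most buf_size bytes and always reports the full
size of the reply.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the reply handling: fill the caller's buffer up to buf_size,
 * truncating if necessary, and return the full reply size regardless.
 * With no buffer or a zero size, nothing is copied.
 */
size_t demo_copy_reply(void *buffer, size_t buf_size,
		       const void *reply, size_t reply_size)
{
	assert(reply != NULL || reply_size == 0);

	if (buffer && buf_size) {
		size_t n = buf_size < reply_size ? buf_size : reply_size;

		memcpy(buffer, reply, n);
	}
	return reply_size;
}
```

A caller that wants the whole value can therefore probe with a NULL buffer
first, allocate the returned size and call again; a caller built against an
older, shorter struct simply gets the leading part it knows about.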

At the moment, this will only work on x86_64 and i386 as it requires the
system call to be wired up.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/statfs.c                            |  470 ++++++++++++++++++++++++++++
 include/linux/fs.h                     |    4 
 include/linux/fsinfo.h                 |   40 ++
 include/linux/syscalls.h               |    4 
 include/uapi/linux/fsinfo.h            |  237 ++++++++++++++
 samples/statx/Makefile                 |    5 
 samples/statx/test-fsinfo.c            |  539 ++++++++++++++++++++++++++++++++
 9 files changed, 1300 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 samples/statx/test-fsinfo.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 5587bcede253..1c9b56f80cdf 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -403,3 +403,4 @@
 389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
 390	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 391	i386	fspick			sys_fspick			__ia32_sys_fspick
+392	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 460a464024bf..d2a4d6db4df6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -348,6 +348,7 @@
 337	common	fsopen			__x64_sys_fsopen
 338	common	fsmount			__x64_sys_fsmount
 339	common	fspick			__x64_sys_fspick
+340	common	fsinfo			__x64_sys_fsinfo
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/statfs.c b/fs/statfs.c
index 5b2a24f0f263..fa6be965dce1 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -9,6 +9,7 @@
 #include <linux/security.h>
 #include <linux/uaccess.h>
 #include <linux/compat.h>
+#include <linux/fsinfo.h>
 #include "internal.h"
 
 static int flags_by_mnt(int mnt_flags)
@@ -384,3 +385,472 @@ COMPAT_SYSCALL_DEFINE2(ustat, unsigned, dev, struct compat_ustat __user *, u)
 	return 0;
 }
 #endif
+
+/*
+ * Get basic filesystem stats from statfs.
+ */
+static int fsinfo_generic_statfs(struct dentry *dentry,
+				 struct fsinfo_statfs *p)
+{
+	struct super_block *sb;
+	struct kstatfs buf;
+	int ret;
+
+	ret = statfs_by_dentry(dentry, &buf);
+	if (ret < 0)
+		return ret;
+
+	sb = dentry->d_sb;
+	p->f_blocks	= buf.f_blocks;
+	p->f_bfree	= buf.f_bfree;
+	p->f_bavail	= buf.f_bavail;
+	p->f_files	= buf.f_files;
+	p->f_ffree	= buf.f_ffree;
+	p->f_favail	= buf.f_ffree;
+	p->f_bsize	= buf.f_bsize;
+	p->f_frsize	= buf.f_frsize;
+	return sizeof(*p);
+}
+
+static int fsinfo_generic_ids(struct dentry *dentry,
+			      struct fsinfo_ids *p)
+{
+	struct super_block *sb;
+	struct kstatfs buf;
+	int ret;
+
+	ret = statfs_by_dentry(dentry, &buf);
+	if (ret < 0)
+		return ret;
+
+	sb = dentry->d_sb;
+	p->f_fstype	= sb->s_magic;
+	p->f_dev_major	= MAJOR(sb->s_dev);
+	p->f_dev_minor	= MINOR(sb->s_dev);
+	p->f_flags	= ST_VALID | flags_by_sb(sb->s_flags);
+
+	memcpy(&p->f_fsid, &buf.f_fsid, sizeof(p->f_fsid));
+	strcpy(p->f_fs_name, dentry->d_sb->s_type->name);
+	return sizeof(*p);
+}
+
+static int fsinfo_generic_limits(struct dentry *dentry,
+				 struct fsinfo_limits *lim)
+{
+	struct super_block *sb = dentry->d_sb;
+
+	lim->max_file_size = sb->s_maxbytes;
+	lim->max_hard_links = sb->s_max_links;
+	lim->max_uid = UINT_MAX;
+	lim->max_gid = UINT_MAX;
+	lim->max_projid = UINT_MAX;
+	lim->max_filename_len = NAME_MAX;
+	lim->max_symlink_len = PAGE_SIZE;
+	lim->max_xattr_name_len = XATTR_NAME_MAX;
+	lim->max_xattr_body_len = XATTR_SIZE_MAX;
+	lim->max_dev_major = 0xffffff;
+	lim->max_dev_minor = 0xff;
+	return sizeof(*lim);
+}
+
+static int fsinfo_generic_supports(struct dentry *dentry,
+				   struct fsinfo_supports *c)
+{
+	struct super_block *sb = dentry->d_sb;
+
+	c->stx_mask = STATX_BASIC_STATS;
+	if (sb->s_d_op && sb->s_d_op->d_automount)
+		c->stx_attributes |= STATX_ATTR_AUTOMOUNT;
+	return sizeof(*c);
+}
+
+static int fsinfo_generic_capabilities(struct dentry *dentry,
+				       struct fsinfo_capabilities *c)
+{
+	struct super_block *sb = dentry->d_sb;
+
+	if (sb->s_mtd)
+		fsinfo_set_cap(c, fsinfo_cap_is_flash_fs);
+	else if (sb->s_bdev)
+		fsinfo_set_cap(c, fsinfo_cap_is_block_fs);
+
+	if (sb->s_quota_types & QTYPE_MASK_USR)
+		fsinfo_set_cap(c, fsinfo_cap_user_quotas);
+	if (sb->s_quota_types & QTYPE_MASK_GRP)
+		fsinfo_set_cap(c, fsinfo_cap_group_quotas);
+	if (sb->s_quota_types & QTYPE_MASK_PRJ)
+		fsinfo_set_cap(c, fsinfo_cap_project_quotas);
+	if (sb->s_d_op && sb->s_d_op->d_automount)
+		fsinfo_set_cap(c, fsinfo_cap_automounts);
+	if (sb->s_id[0])
+		fsinfo_set_cap(c, fsinfo_cap_volume_id);
+
+	fsinfo_set_cap(c, fsinfo_cap_has_atime);
+	fsinfo_set_cap(c, fsinfo_cap_has_ctime);
+	fsinfo_set_cap(c, fsinfo_cap_has_mtime);
+	return sizeof(*c);
+}
+
+static int fsinfo_generic_timestamp_info(struct dentry *dentry,
+					 struct fsinfo_timestamp_info *ts)
+{
+	struct super_block *sb = dentry->d_sb;
+
+	/* If unset, assume 1s granularity */
+	u16 mantissa = 1;
+	s8 exponent = 0;
+
+	ts->minimum_timestamp = S64_MIN;
+	ts->maximum_timestamp = S64_MAX;
+	if (sb->s_time_gran < 1000000000) {
+		if (sb->s_time_gran < 1000)
+			exponent = -9;
+		else if (sb->s_time_gran < 1000000)
+			exponent = -6;
+		else
+			exponent = -3;
+	}
+#define set_gran(x)				\
+	do {					\
+		ts->x##_mantissa = mantissa;	\
+		ts->x##_exponent = exponent;	\
+	} while (0)
+	set_gran(atime_gran);
+	set_gran(btime_gran);
+	set_gran(ctime_gran);
+	set_gran(mtime_gran);
+	return sizeof(*ts);
+}
+
+static int fsinfo_generic_volume_uuid(struct dentry *dentry,
+				      struct fsinfo_volume_uuid *vu)
+{
+	struct super_block *sb = dentry->d_sb;
+
+	memcpy(vu, &sb->s_uuid, sizeof(*vu));
+	return sizeof(*vu);
+}
+
+static int fsinfo_generic_volume_id(struct dentry *dentry, char *buf)
+{
+	struct super_block *sb = dentry->d_sb;
+	size_t len = strlen(sb->s_id);
+
+	if (buf)
+		memcpy(buf, sb->s_id, len + 1);
+	return len;
+}
+
+static int fsinfo_generic_name_encoding(struct dentry *dentry, char *buf)
+{
+	static const char encoding[] = "utf8";
+
+	if (buf)
+		memcpy(buf, encoding, sizeof(encoding) - 1);
+	return sizeof(encoding) - 1;
+}
+
+static int fsinfo_generic_io_size(struct dentry *dentry,
+				  struct fsinfo_io_size *c)
+{
+	struct super_block *sb = dentry->d_sb;
+	struct kstatfs buf;
+	int ret;
+
+	if (sb->s_op->statfs == simple_statfs) {
+		c->block_size = PAGE_SIZE;
+		c->max_single_read_size = 0;
+		c->max_single_write_size = 0;
+		c->best_read_size = PAGE_SIZE;
+		c->best_write_size = PAGE_SIZE;
+	} else {
+		ret = statfs_by_dentry(dentry, &buf);
+		if (ret < 0)
+			return ret;
+		c->block_size = buf.f_bsize;
+		c->max_single_read_size = buf.f_bsize;
+		c->max_single_write_size = buf.f_bsize;
+		c->best_read_size = PAGE_SIZE;
+		c->best_write_size = PAGE_SIZE;
+	}
+	return sizeof(*c);
+}
+
+/*
+ * Implement some queries generically from stuff in the superblock.
+ */
+int generic_fsinfo(struct dentry *dentry, struct fsinfo_kparams *params)
+{
+#define _gen(X) fsinfo_attr_##X: return fsinfo_generic_##X(dentry, params->buffer)
+
+	switch (params->request) {
+	case _gen(statfs);
+	case _gen(ids);
+	case _gen(limits);
+	case _gen(supports);
+	case _gen(capabilities);
+	case _gen(timestamp_info);
+	case _gen(volume_uuid);
+	case _gen(volume_id);
+	case _gen(name_encoding);
+	case _gen(io_size);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL(generic_fsinfo);
+
+/*
+ * Retrieve the filesystem info.  We make some stuff up if the operation is not
+ * supported.
+ */
+int vfs_fsinfo(const struct path *path, struct fsinfo_kparams *params)
+{
+	struct dentry *dentry = path->dentry;
+	int (*get_fsinfo)(struct dentry *, struct fsinfo_kparams *);
+	int ret;
+
+	if (params->request == fsinfo_attr_fsinfo) {
+		struct fsinfo_fsinfo *info = params->buffer;
+
+		info->max_attr	= fsinfo_attr__nr;
+		info->max_cap	= fsinfo_cap__nr;
+		return sizeof(*info);
+	}
+
+	get_fsinfo = dentry->d_sb->s_op->get_fsinfo;
+	if (!get_fsinfo) {
+		if (!dentry->d_sb->s_op->statfs)
+			return -EOPNOTSUPP;
+		get_fsinfo = generic_fsinfo;
+	}
+
+	ret = security_sb_statfs(dentry);
+	if (ret)
+		return ret;
+
+	ret = get_fsinfo(dentry, params);
+	if (ret < 0)
+		return ret;
+
+	if (params->request == fsinfo_attr_ids &&
+	    params->buffer) {
+		struct fsinfo_ids *p = params->buffer;
+
+		p->f_flags |= flags_by_mnt(path->mnt->mnt_flags);
+	}
+	return ret;
+}
+
+static int vfs_fsinfo_path(int dfd, const char __user *filename,
+			   struct fsinfo_kparams *params)
+{
+	struct path path;
+	unsigned lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+	int ret = -EINVAL;
+
+	if ((params->at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
+				 AT_EMPTY_PATH)) != 0)
+		return -EINVAL;
+
+	if (params->at_flags & AT_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (params->at_flags & AT_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (params->at_flags & AT_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+retry:
+	ret = user_path_at(dfd, filename, lookup_flags, &path);
+	if (ret)
+		goto out;
+
+	ret = vfs_fsinfo(&path, params);
+	path_put(&path);
+	if (retry_estale(ret, lookup_flags)) {
+		lookup_flags |= LOOKUP_REVAL;
+		goto retry;
+	}
+out:
+	return ret;
+}
+
+static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_kparams *params)
+{
+	struct fd f = fdget_raw(fd);
+	int ret = -EBADF;
+
+	if (f.file) {
+		ret = vfs_fsinfo(&f.file->f_path, params);
+		fdput(f);
+	}
+	return ret;
+}
+
+/*
+ * Return buffer information by requestable attribute.
+ *
+ * STRUCT indicates a fixed-size structure with only one instance.
+ * STRUCT_N indicates a fixed-size structure that may have multiple instances.
+ * STRING indicates a string with only one instance.
+ * STRING_N indicates a string that may have multiple instances.
+ * STRUCT_ARRAY indicates an array of fixed-size structs with only one instance.
+ * STRUCT_ARRAY_N as above that may have multiple instances.
+ *
+ * If an entry is marked STRUCT, STRUCT_N or STRUCT_NM then if no buffer is
+ * supplied to sys_fsinfo(), sys_fsinfo() will handle returning the buffer size
+ * without calling vfs_fsinfo() and the filesystem.
+ *
+ * No struct may have more than 252 bytes (ie. 0x3f * 4)
+ */
+#define FSINFO_STRING(N)	 [fsinfo_attr_##N] = 0x0000
+#define FSINFO_STRUCT(N)	 [fsinfo_attr_##N] = sizeof(struct fsinfo_##N)
+#define FSINFO_STRING_N(N)	 [fsinfo_attr_##N] = 0x4000
+#define FSINFO_STRUCT_N(N)	 [fsinfo_attr_##N] = 0x4000 | sizeof(struct fsinfo_##N)
+#define FSINFO_STRUCT_NM(N)	 [fsinfo_attr_##N] = 0x8000 | sizeof(struct fsinfo_##N)
+static const u16 fsinfo_buffer_sizes[fsinfo_attr__nr] = {
+	FSINFO_STRUCT		(statfs),
+	FSINFO_STRUCT		(fsinfo),
+	FSINFO_STRUCT		(ids),
+	FSINFO_STRUCT		(limits),
+	FSINFO_STRUCT		(capabilities),
+	FSINFO_STRUCT		(supports),
+	FSINFO_STRUCT		(timestamp_info),
+	FSINFO_STRING		(volume_id),
+	FSINFO_STRUCT		(volume_uuid),
+	FSINFO_STRING		(volume_name),
+	FSINFO_STRING		(cell_name),
+	FSINFO_STRING		(domain_name),
+	FSINFO_STRING		(realm_name),
+	FSINFO_STRING_N		(server_name),
+	FSINFO_STRUCT_NM	(server_address),
+	FSINFO_STRING_N		(parameter),
+	FSINFO_STRING_N		(source),
+	FSINFO_STRING		(name_encoding),
+	FSINFO_STRING		(name_codepage),
+	FSINFO_STRUCT		(io_size),
+};
+
+/**
+ * sys_fsinfo - System call to get filesystem information
+ * @dfd: Base directory to pathwalk from or fd referring to filesystem.
+ * @filename: Filesystem to query or NULL.
+ * @_params: Parameters to define request (or NULL for enhanced statfs).
+ * @_buffer: Result buffer.
+ * @buf_size: Size of result buffer.
+ *
+ * Get information on a filesystem.  The filesystem attribute to be queried is
+ * indicated by @_params->request, and some of the attributes can have multiple
+ * values, indexed by @_params->Nth and @_params->Mth.  If @_params is NULL,
+ * then the 0th fsinfo_attr_statfs attribute is queried.  If an attribute does
+ * not exist, EOPNOTSUPP is returned; if the Nth,Mth value does not exist,
+ * ENODATA is returned.
+ *
+ * On success, the size of the attribute's value is returned.  If @buf_size is
+ * 0 or @_buffer is NULL, only the size is returned.  If the size of the value
+ * is larger than @buf_size, it will be truncated by the copy.  If the size of
+ * the value is smaller than @buf_size then the excess buffer space will be
+ * cleared.  The full size of the value will be returned, irrespective of how
+ * much data is actually placed in the buffer.
+ */
+SYSCALL_DEFINE5(fsinfo,
+		int, dfd, const char __user *, filename,
+		struct fsinfo_params __user *, _params,
+		void __user *, _buffer, size_t, buf_size)
+{
+	struct fsinfo_params user_params;
+	struct fsinfo_kparams params;
+	size_t size;
+	int ret;
+
+	if (_params) {
+		if (copy_from_user(&user_params, _params, sizeof(user_params)))
+			return -EFAULT;
+		if (user_params.__reserved[0] ||
+		    user_params.__reserved[1] ||
+		    user_params.__reserved[2] ||
+		    user_params.__reserved[3] ||
+		    user_params.__reserved[4] ||
+		    user_params.__reserved[5])
+			return -EINVAL;
+		if (user_params.request >= fsinfo_attr__nr)
+			return -EOPNOTSUPP;
+		params.at_flags = user_params.at_flags;
+		params.request = user_params.request;
+		params.Nth = user_params.Nth;
+		params.Mth = user_params.Mth;
+	} else {
+		params.at_flags = 0;
+		params.request = fsinfo_attr_statfs;
+		params.Nth = 0;
+		params.Mth = 0;
+	}
+
+	if (!_buffer || !buf_size) {
+		buf_size = 0;
+		_buffer = NULL;
+	}
+
+	/* Allocate an appropriately-sized buffer.  We will truncate the
+	 * contents when we copy them back to userspace.
+	 */
+	size = fsinfo_buffer_sizes[params.request];
+	switch (size & 0xc000) {
+	case 0x0000:
+		if (params.Nth != 0)
+			return -ENODATA;
+		/* Fall through */
+	case 0x4000:
+		if (params.Mth != 0)
+			return -ENODATA;
+		/* Fall through */
+	case 0x8000:
+		break;
+	case 0xc000:
+		return -ENOBUFS;
+	}
+
+	size &= ~0xc000;
+	if (size == 0) {
+		size = 4096; /* String */
+	} else {
+		if (buf_size == 0)
+			return size; /* We know how big the buffer should be */
+
+		/* Pre-clear the user buffer if it is larger than the value. */
+		if (buf_size > size &&
+		    clear_user(_buffer, buf_size) != 0)
+			return -EFAULT;
+	}
+
+	if (buf_size > 0) {
+		params.buf_size = size;
+		params.buffer = kzalloc(size, GFP_KERNEL);
+		if (!params.buffer)
+			return -ENOMEM;
+	} else {
+		params.buf_size = 0;
+		params.buffer = NULL;
+	}
+
+	if (filename)
+		ret = vfs_fsinfo_path(dfd, filename, &params);
+	else
+		ret = vfs_fsinfo_fd(dfd, &params);
+	if (ret < 0)
+		goto error;
+
+	if (ret == 0) {
+		ret = -ENODATA;
+		goto error;
+	}
+
+	if (buf_size > ret)
+		buf_size = ret;
+
+	if (copy_to_user(_buffer, params.buffer, buf_size))
+		ret = -EFAULT;
+error:
+	kfree(params.buffer);
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e6d963f2fdc2..bcbe94c0dfe8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -62,6 +62,8 @@ struct iov_iter;
 struct fscrypt_info;
 struct fscrypt_operations;
 struct fs_context;
+struct fsinfo_kparams;
+enum fsinfo_attribute;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1847,6 +1849,7 @@ struct super_operations {
 	int (*thaw_super) (struct super_block *);
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
+	int (*get_fsinfo) (struct dentry *, struct fsinfo_kparams *);
 	int (*remount_fs) (struct super_block *, int *, char *, size_t);
 	int (*reconfigure) (struct super_block *, struct fs_context *);
 	void (*umount_begin) (struct super_block *);
@@ -2223,6 +2226,7 @@ extern int iterate_mounts(int (*)(struct vfsmount *, void *), void *,
 extern int vfs_statfs(const struct path *, struct kstatfs *);
 extern int user_statfs(const char __user *, struct kstatfs *);
 extern int fd_statfs(int, struct kstatfs *);
+extern int vfs_fsinfo(const struct path *, struct fsinfo_kparams *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
 extern bool our_mnt(struct vfsmount *mnt);
diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
new file mode 100644
index 000000000000..c356391b4b2a
--- /dev/null
+++ b/include/linux/fsinfo.h
@@ -0,0 +1,40 @@
+/* Filesystem information query
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FSINFO_H
+#define _LINUX_FSINFO_H
+
+#include <uapi/linux/fsinfo.h>
+
+struct fsinfo_kparams {
+	__u32			at_flags;	/* AT_SYMLINK_NOFOLLOW and similar */
+	enum fsinfo_attribute	request;	/* What is being asked for */
+	__u32			Nth;		/* Instance of it (some may have multiple) */
+	__u32			Mth;		/* Subinstance */
+	void			*buffer;	/* Where to place the reply */
+	size_t			buf_size;	/* Size of the buffer */
+};
+
+extern int generic_fsinfo(struct dentry *, struct fsinfo_kparams *);
+
+static inline void fsinfo_set_cap(struct fsinfo_capabilities *c,
+				  enum fsinfo_capability cap)
+{
+	c->capabilities[cap / 8] |= 1 << (cap % 8);
+}
+
+static inline void fsinfo_clear_cap(struct fsinfo_capabilities *c,
+				    enum fsinfo_capability cap)
+{
+	c->capabilities[cap / 8] &= ~(1 << (cap % 8));
+}
+
+#endif /* _LINUX_FSINFO_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index ac803f5c0822..da3575dded79 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -49,6 +49,7 @@ struct stat64;
 struct statfs;
 struct statfs64;
 struct statx;
+struct fsinfo_params;
 struct __sysctl_args;
 struct sysinfo;
 struct timespec;
@@ -907,6 +908,9 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
 asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
 asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags);
+asmlinkage long sys_fsinfo(int dfd, const char __user *path,
+			   struct fsinfo_params __user *params,
+			   void __user *buffer, size_t buf_size);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
new file mode 100644
index 000000000000..f2bc5130544d
--- /dev/null
+++ b/include/uapi/linux/fsinfo.h
@@ -0,0 +1,237 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* fsinfo() definitions.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#ifndef _UAPI_LINUX_FSINFO_H
+#define _UAPI_LINUX_FSINFO_H
+
+#include <linux/types.h>
+#include <linux/socket.h>
+
+/*
+ * The filesystem attributes that can be requested.  Note that some attributes
+ * may have multiple instances which can be selected in the parameter block.
+ */
+enum fsinfo_attribute {
+	fsinfo_attr_statfs		= 0,	/* statfs()-style state */
+	fsinfo_attr_fsinfo		= 1,	/* Information about fsinfo() */
+	fsinfo_attr_ids			= 2,	/* Filesystem IDs */
+	fsinfo_attr_limits		= 3,	/* Filesystem limits */
+	fsinfo_attr_supports		= 4,	/* What's supported in statx, iocflags, ... */
+	fsinfo_attr_capabilities	= 5,	/* Filesystem capabilities (bits) */
+	fsinfo_attr_timestamp_info	= 6,	/* Inode timestamp info */
+	fsinfo_attr_volume_id		= 7,	/* Volume ID (string) */
+	fsinfo_attr_volume_uuid		= 8,	/* Volume UUID (LE uuid) */
+	fsinfo_attr_volume_name		= 9,	/* Volume name (string) */
+	fsinfo_attr_cell_name		= 10,	/* Cell name (string) */
+	fsinfo_attr_domain_name		= 11,	/* Domain name (string) */
+	fsinfo_attr_realm_name		= 12,	/* Realm name (string) */
+	fsinfo_attr_server_name		= 13,	/* Name of the Nth server */
+	fsinfo_attr_server_address	= 14,	/* Mth address of the Nth server */
+	fsinfo_attr_parameter		= 15,	/* Nth mount parameter (string) */
+	fsinfo_attr_source		= 16,	/* Nth mount source name (string) */
+	fsinfo_attr_name_encoding	= 17,	/* Filename encoding (string) */
+	fsinfo_attr_name_codepage	= 18,	/* Filename codepage (string) */
+	fsinfo_attr_io_size		= 19,	/* Optimal I/O sizes */
+	fsinfo_attr__nr
+};
+
+/*
+ * Optional fsinfo() parameter structure.
+ *
+ * If this is not given, it is assumed that fsinfo_attr_statfs instance 0,0 is
+ * desired.
+ */
+struct fsinfo_params {
+	__u32	at_flags;	/* AT_SYMLINK_NOFOLLOW and similar flags */
+	__u32	request;	/* What is being asked for (enum fsinfo_attribute) */
+	__u32	Nth;		/* Instance of it (some may have multiple) */
+	__u32	Mth;		/* Subinstance of Nth instance */
+	__u32	__reserved[6];	/* Reserved params; all must be 0 */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_statfs).
+ * - This gives extended filesystem information.
+ */
+struct fsinfo_statfs {
+	__u64	f_blocks;	/* Total number of blocks in fs */
+	__u64	f_bfree;	/* Total number of free blocks */
+	__u64	f_bavail;	/* Number of free blocks available to ordinary user */
+	__u64	f_files;	/* Total number of file nodes in fs */
+	__u64	f_ffree;	/* Number of free file nodes */
+	__u64	f_favail;	/* Number of free file nodes available to ordinary user */
+	__u32	f_bsize;	/* Optimal block size */
+	__u32	f_frsize;	/* Fragment size */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_ids).
+ *
+ * List of basic identifiers as are normally found in statfs().
+ */
+struct fsinfo_ids {
+	char	f_fs_name[15 + 1];
+	__u64	f_flags;	/* Filesystem mount flags (MS_*) */
+	__u64	f_fsid;		/* Short 64-bit Filesystem ID (as statfs) */
+	__u64	f_sb_id;	/* Internal superblock ID for sbnotify()/mntnotify() */
+	__u32	f_fstype;	/* Filesystem type from linux/magic.h [uncond] */
+	__u32	f_dev_major;	/* As st_dev_* from struct statx [uncond] */
+	__u32	f_dev_minor;
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_limits).
+ *
+ * List of supported filesystem limits.
+ */
+struct fsinfo_limits {
+	__u64	max_file_size;			/* Maximum file size */
+	__u64	max_uid;			/* Maximum UID supported */
+	__u64	max_gid;			/* Maximum GID supported */
+	__u64	max_projid;			/* Maximum project ID supported */
+	__u32	max_dev_major;			/* Maximum device major representable */
+	__u32	max_dev_minor;			/* Maximum device minor representable */
+	__u32	max_hard_links;			/* Maximum number of hard links on a file */
+	__u32	max_xattr_body_len;		/* Maximum xattr content length */
+	__u32	max_xattr_name_len;		/* Maximum xattr name length */
+	__u32	max_filename_len;		/* Maximum filename length */
+	__u32	max_symlink_len;		/* Maximum symlink content length */
+	__u32	__reserved[1];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_supports).
+ *
+ * What's supported in various masks, such as statx() attribute and mask bits
+ * and IOC flags.
+ */
+struct fsinfo_supports {
+	__u64	stx_attributes;		/* What statx::stx_attributes are supported */
+	__u32	stx_mask;		/* What statx::stx_mask bits are supported */
+	__u32	ioc_flags;		/* What FS_IOC_* flags are supported */
+	__u32	win_file_attrs;		/* What DOS/Windows FILE_* attributes are supported */
+	__u32	__reserved[1];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_capabilities).
+ *
+ * Bitmask indicating filesystem capabilities where renderable as single bits.
+ */
+enum fsinfo_capability {
+	fsinfo_cap_is_kernel_fs		= 0,	/* fs is kernel-special filesystem */
+	fsinfo_cap_is_block_fs		= 1,	/* fs is block-based filesystem */
+	fsinfo_cap_is_flash_fs		= 2,	/* fs is flash filesystem */
+	fsinfo_cap_is_network_fs	= 3,	/* fs is network filesystem */
+	fsinfo_cap_is_automounter_fs	= 4,	/* fs is automounter special filesystem */
+	fsinfo_cap_automounts		= 5,	/* fs supports automounts */
+	fsinfo_cap_adv_locks		= 6,	/* fs supports advisory file locking */
+	fsinfo_cap_mand_locks		= 7,	/* fs supports mandatory file locking */
+	fsinfo_cap_leases		= 8,	/* fs supports file leases */
+	fsinfo_cap_uids			= 9,	/* fs supports numeric uids */
+	fsinfo_cap_gids			= 10,	/* fs supports numeric gids */
+	fsinfo_cap_projids		= 11,	/* fs supports numeric project ids */
+	fsinfo_cap_id_names		= 12,	/* fs supports user names */
+	fsinfo_cap_id_guids		= 13,	/* fs supports user guids */
+	fsinfo_cap_windows_attrs	= 14,	/* fs has windows attributes */
+	fsinfo_cap_user_quotas		= 15,	/* fs has per-user quotas */
+	fsinfo_cap_group_quotas		= 16,	/* fs has per-group quotas */
+	fsinfo_cap_project_quotas	= 17,	/* fs has per-project quotas */
+	fsinfo_cap_xattrs		= 18,	/* fs has xattrs */
+	fsinfo_cap_journal		= 19,	/* fs has a journal */
+	fsinfo_cap_data_is_journalled	= 20,	/* fs is using data journalling */
+	fsinfo_cap_o_sync		= 21,	/* fs supports O_SYNC */
+	fsinfo_cap_o_direct		= 22,	/* fs supports O_DIRECT */
+	fsinfo_cap_volume_id		= 23,	/* fs has a volume ID */
+	fsinfo_cap_volume_uuid		= 24,	/* fs has a volume UUID */
+	fsinfo_cap_volume_name		= 25,	/* fs has a volume name */
+	fsinfo_cap_volume_fsid		= 26,	/* fs has a volume FSID */
+	fsinfo_cap_cell_name		= 27,	/* fs has a cell name */
+	fsinfo_cap_domain_name		= 28,	/* fs has a domain name */
+	fsinfo_cap_realm_name		= 29,	/* fs has a realm name */
+	fsinfo_cap_iver_all_change	= 30,	/* i_version represents data + meta changes */
+	fsinfo_cap_iver_data_change	= 31,	/* i_version represents data changes only */
+	fsinfo_cap_iver_mono_incr	= 32,	/* i_version incremented monotonically */
+	fsinfo_cap_symlinks		= 33,	/* fs supports symlinks */
+	fsinfo_cap_hard_links		= 34,	/* fs supports hard links */
+	fsinfo_cap_hard_links_1dir	= 35,	/* fs supports hard links in same dir only */
+	fsinfo_cap_device_files		= 36,	/* fs supports bdev, cdev */
+	fsinfo_cap_unix_specials	= 37,	/* fs supports pipe, fifo, socket */
+	fsinfo_cap_resource_forks	= 38,	/* fs supports resource forks/streams */
+	fsinfo_cap_name_case_indep	= 39,	/* Filename case independence is mandatory */
+	fsinfo_cap_name_non_utf8	= 40,	/* fs has non-utf8 names */
+	fsinfo_cap_name_has_codepage	= 41,	/* fs has a filename codepage */
+	fsinfo_cap_sparse		= 42,	/* fs supports sparse files */
+	fsinfo_cap_not_persistent	= 43,	/* fs is not persistent */
+	fsinfo_cap_no_unix_mode		= 44,	/* fs does not support unix mode bits */
+	fsinfo_cap_has_atime		= 45,	/* fs supports access time */
+	fsinfo_cap_has_btime		= 46,	/* fs supports birth/creation time */
+	fsinfo_cap_has_ctime		= 47,	/* fs supports change time */
+	fsinfo_cap_has_mtime		= 48,	/* fs supports modification time */
+	fsinfo_cap__nr
+};
+
+struct fsinfo_capabilities {
+	__u8	capabilities[(fsinfo_cap__nr + 7) / 8];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_timestamp_info).
+ */
+struct fsinfo_timestamp_info {
+	__s64	minimum_timestamp;	/* Minimum timestamp value in seconds */
+	__s64	maximum_timestamp;	/* Maximum timestamp value in seconds */
+	__u16	atime_gran_mantissa;	/* Granularity(secs) = mant * 10^exp */
+	__u16	btime_gran_mantissa;
+	__u16	ctime_gran_mantissa;
+	__u16	mtime_gran_mantissa;
+	__s8	atime_gran_exponent;
+	__s8	btime_gran_exponent;
+	__s8	ctime_gran_exponent;
+	__s8	mtime_gran_exponent;
+	__u32	__reserved[1];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_volume_uuid).
+ */
+struct fsinfo_volume_uuid {
+	__u8	uuid[16];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_server_address).
+ *
+ * Find the Mth address of the Nth server for a network mount.
+ */
+struct fsinfo_server_address {
+	struct __kernel_sockaddr_storage address;
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_io_size).
+ *
+ * Retrieve the optimal I/O size for a filesystem.
+ */
+struct fsinfo_io_size {
+	__u32		block_size;		/* Minimum block granularity for O_DIRECT */
+	__u32		max_single_read_size;	/* Maximum size of a single unbuffered read */
+	__u32		max_single_write_size;	/* Maximum size of a single unbuffered write */
+	__u32		best_read_size;		/* Optimal read size */
+	__u32		best_write_size;	/* Optimal write size */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_fsinfo).
+ *
+ * This gives information about fsinfo() itself.
+ */
+struct fsinfo_fsinfo {
+	__u32	max_attr;	/* Number of supported attributes (fsinfo_attr__nr) */
+	__u32	max_cap;	/* Number of supported capabilities (fsinfo_cap__nr) */
+};
+
+#endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/samples/statx/Makefile b/samples/statx/Makefile
index 59df7c25a9d1..9cb9a88e3a10 100644
--- a/samples/statx/Makefile
+++ b/samples/statx/Makefile
@@ -1,7 +1,10 @@
 # List of programs to build
-hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx
+hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx test-fsinfo
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
 
 HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
+
+HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
+HOSTLOADLIBES_test-fsinfo += -lm
diff --git a/samples/statx/test-fsinfo.c b/samples/statx/test-fsinfo.c
new file mode 100644
index 000000000000..9e9fa62a3b9f
--- /dev/null
+++ b/samples/statx/test-fsinfo.c
@@ -0,0 +1,539 @@
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <linux/fsinfo.h>
+#include <linux/socket.h>
+#include <sys/stat.h>
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename, struct fsinfo_params *params,
+	       void *buffer, size_t buf_size)
+{
+	return syscall(__NR_fsinfo, dfd, filename, params, buffer, buf_size);
+}
+
+#define FSINFO_STRING(N)	 [fsinfo_attr_##N] = 0x00
+#define FSINFO_STRUCT(N)	 [fsinfo_attr_##N] = sizeof(struct fsinfo_##N)/sizeof(__u32)
+#define FSINFO_STRING_N(N)	 [fsinfo_attr_##N] = 0x40
+#define FSINFO_STRUCT_N(N)	 [fsinfo_attr_##N] = 0x40 | sizeof(struct fsinfo_##N)/sizeof(__u32)
+#define FSINFO_STRUCT_NM(N)	 [fsinfo_attr_##N] = 0x80 | sizeof(struct fsinfo_##N)/sizeof(__u32)
+static const __u8 fsinfo_buffer_sizes[fsinfo_attr__nr] = {
+	FSINFO_STRUCT		(statfs),
+	FSINFO_STRUCT		(fsinfo),
+	FSINFO_STRUCT		(ids),
+	FSINFO_STRUCT		(limits),
+	FSINFO_STRUCT		(supports),
+	FSINFO_STRUCT		(capabilities),
+	FSINFO_STRUCT		(timestamp_info),
+	FSINFO_STRING		(volume_id),
+	FSINFO_STRUCT		(volume_uuid),
+	FSINFO_STRING		(volume_name),
+	FSINFO_STRING		(cell_name),
+	FSINFO_STRING		(domain_name),
+	FSINFO_STRING		(realm_name),
+	FSINFO_STRING_N		(server_name),
+	FSINFO_STRUCT_NM	(server_address),
+	FSINFO_STRING_N		(parameter),
+	FSINFO_STRING_N		(source),
+	FSINFO_STRING		(name_encoding),
+	FSINFO_STRING		(name_codepage),
+	FSINFO_STRUCT		(io_size),
+};
+
+#define FSINFO_NAME(N) [fsinfo_attr_##N] = #N
+static const char *fsinfo_attr_names[fsinfo_attr__nr] = {
+	FSINFO_NAME(statfs),
+	FSINFO_NAME(fsinfo),
+	FSINFO_NAME(ids),
+	FSINFO_NAME(limits),
+	FSINFO_NAME(supports),
+	FSINFO_NAME(capabilities),
+	FSINFO_NAME(timestamp_info),
+	FSINFO_NAME(volume_id),
+	FSINFO_NAME(volume_uuid),
+	FSINFO_NAME(volume_name),
+	FSINFO_NAME(cell_name),
+	FSINFO_NAME(domain_name),
+	FSINFO_NAME(realm_name),
+	FSINFO_NAME(server_name),
+	FSINFO_NAME(server_address),
+	FSINFO_NAME(parameter),
+	FSINFO_NAME(source),
+	FSINFO_NAME(name_encoding),
+	FSINFO_NAME(name_codepage),
+	FSINFO_NAME(io_size),
+};
+
+union reply {
+	char buffer[4096];
+	struct fsinfo_statfs statfs;
+	struct fsinfo_fsinfo fsinfo;
+	struct fsinfo_ids ids;
+	struct fsinfo_limits limits;
+	struct fsinfo_supports supports;
+	struct fsinfo_capabilities caps;
+	struct fsinfo_timestamp_info timestamps;
+	struct fsinfo_volume_uuid uuid;
+	struct fsinfo_server_address srv_addr;
+	struct fsinfo_io_size io_size;
+};
+
+static void dump_hex(unsigned int *data, int from, int to)
+{
+	unsigned offset, print_offset = 1, col = 0;
+
+	from /= 4;
+	to = (to + 3) / 4;
+
+	for (offset = from; offset < to; offset++) {
+		if (print_offset) {
+			printf("%04x: ", offset * 4);
+			print_offset = 0;
+		}
+		printf("%08x", data[offset]);
+		col++;
+		if ((col & 3) == 0) {
+			printf("\n");
+			print_offset = 1;
+		} else {
+			printf(" ");
+		}
+	}
+
+	if (!print_offset)
+		printf("\n");
+}
+
+static void dump_attr_statfs(union reply *r, int size)
+{
+	struct fsinfo_statfs *f = &r->statfs;
+
+	printf("\n");
+	printf("\tblocks: n=%llu fr=%llu av=%llu\n",
+	       (unsigned long long)f->f_blocks,
+	       (unsigned long long)f->f_bfree,
+	       (unsigned long long)f->f_bavail);
+
+	printf("\tfiles : n=%llu fr=%llu av=%llu\n",
+	       (unsigned long long)f->f_files,
+	       (unsigned long long)f->f_ffree,
+	       (unsigned long long)f->f_favail);
+	printf("\tbsize : %u\n", f->f_bsize);
+	printf("\tfrsize: %u\n", f->f_frsize);
+}
+
+static void dump_attr_fsinfo(union reply *r, int size)
+{
+	struct fsinfo_fsinfo *f = &r->fsinfo;
+
+	printf("max_attr=%u max_cap=%u\n", f->max_attr, f->max_cap);
+}
+
+static void dump_attr_ids(union reply *r, int size)
+{
+	struct fsinfo_ids *f = &r->ids;
+
+	printf("\n");
+	printf("\tdev   : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
+	printf("\tfs    : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
+	printf("\tflags : %llx\n", (unsigned long long)f->f_flags);
+	printf("\tfsid  : %llx\n", (unsigned long long)f->f_fsid);
+}
+
+static void dump_attr_limits(union reply *r, int size)
+{
+	struct fsinfo_limits *f = &r->limits;
+
+	printf("\n");
+	printf("\tmax file size: %llx\n", f->max_file_size);
+	printf("\tmax ids      : u=%llx g=%llx p=%llx\n",
+	       f->max_uid, f->max_gid, f->max_projid);
+	printf("\tmax dev      : maj=%x min=%x\n",
+	       f->max_dev_major, f->max_dev_minor);
+	printf("\tmax links    : %x\n", f->max_hard_links);
+	printf("\tmax xattr    : n=%x b=%x\n",
+	       f->max_xattr_name_len, f->max_xattr_body_len);
+	printf("\tmax len      : file=%x sym=%x\n",
+	       f->max_filename_len, f->max_symlink_len);
+}
+
+static void dump_attr_supports(union reply *r, int size)
+{
+	struct fsinfo_supports *f = &r->supports;
+
+	printf("\n");
+	printf("\tstx_attr=%llx\n", f->stx_attributes);
+	printf("\tstx_mask=%x\n", f->stx_mask);
+	printf("\tioc_flags=%x\n", f->ioc_flags);
+	printf("\twin_fattrs=%x\n", f->win_file_attrs);
+}
+
+#define FSINFO_CAP_NAME(C) [fsinfo_cap_##C] = #C
+static const char *fsinfo_cap_names[fsinfo_cap__nr] = {
+	FSINFO_CAP_NAME(is_kernel_fs),
+	FSINFO_CAP_NAME(is_block_fs),
+	FSINFO_CAP_NAME(is_flash_fs),
+	FSINFO_CAP_NAME(is_network_fs),
+	FSINFO_CAP_NAME(is_automounter_fs),
+	FSINFO_CAP_NAME(automounts),
+	FSINFO_CAP_NAME(adv_locks),
+	FSINFO_CAP_NAME(mand_locks),
+	FSINFO_CAP_NAME(leases),
+	FSINFO_CAP_NAME(uids),
+	FSINFO_CAP_NAME(gids),
+	FSINFO_CAP_NAME(projids),
+	FSINFO_CAP_NAME(id_names),
+	FSINFO_CAP_NAME(id_guids),
+	FSINFO_CAP_NAME(windows_attrs),
+	FSINFO_CAP_NAME(user_quotas),
+	FSINFO_CAP_NAME(group_quotas),
+	FSINFO_CAP_NAME(project_quotas),
+	FSINFO_CAP_NAME(xattrs),
+	FSINFO_CAP_NAME(journal),
+	FSINFO_CAP_NAME(data_is_journalled),
+	FSINFO_CAP_NAME(o_sync),
+	FSINFO_CAP_NAME(o_direct),
+	FSINFO_CAP_NAME(volume_id),
+	FSINFO_CAP_NAME(volume_uuid),
+	FSINFO_CAP_NAME(volume_name),
+	FSINFO_CAP_NAME(volume_fsid),
+	FSINFO_CAP_NAME(cell_name),
+	FSINFO_CAP_NAME(domain_name),
+	FSINFO_CAP_NAME(realm_name),
+	FSINFO_CAP_NAME(iver_all_change),
+	FSINFO_CAP_NAME(iver_data_change),
+	FSINFO_CAP_NAME(iver_mono_incr),
+	FSINFO_CAP_NAME(symlinks),
+	FSINFO_CAP_NAME(hard_links),
+	FSINFO_CAP_NAME(hard_links_1dir),
+	FSINFO_CAP_NAME(device_files),
+	FSINFO_CAP_NAME(unix_specials),
+	FSINFO_CAP_NAME(resource_forks),
+	FSINFO_CAP_NAME(name_case_indep),
+	FSINFO_CAP_NAME(name_non_utf8),
+	FSINFO_CAP_NAME(name_has_codepage),
+	FSINFO_CAP_NAME(sparse),
+	FSINFO_CAP_NAME(not_persistent),
+	FSINFO_CAP_NAME(no_unix_mode),
+	FSINFO_CAP_NAME(has_atime),
+	FSINFO_CAP_NAME(has_btime),
+	FSINFO_CAP_NAME(has_ctime),
+	FSINFO_CAP_NAME(has_mtime),
+};
+
+static void dump_attr_capabilities(union reply *r, int size)
+{
+	struct fsinfo_capabilities *f = &r->caps;
+	int i;
+
+	for (i = 0; i < sizeof(f->capabilities); i++)
+		printf("%02x", f->capabilities[i]);
+	printf("\n");
+	for (i = 0; i < fsinfo_cap__nr; i++)
+		if (f->capabilities[i / 8] & (1 << (i % 8)))
+			printf("\t- %s\n", fsinfo_cap_names[i]);
+}
+
+static void dump_attr_timestamp_info(union reply *r, int size)
+{
+	struct fsinfo_timestamp_info *f = &r->timestamps;
+
+	printf("range=%llx-%llx\n",
+	       (unsigned long long)f->minimum_timestamp,
+	       (unsigned long long)f->maximum_timestamp);
+
+#define print_time(G) \
+	printf("\t"#G"time : gran=%gs\n",			\
+	       (f->G##time_gran_mantissa *		\
+		pow(10., f->G##time_gran_exponent)))
+	print_time(a);
+	print_time(b);
+	print_time(c);
+	print_time(m);
+}
+
+static void dump_attr_volume_uuid(union reply *r, int size)
+{
+	struct fsinfo_volume_uuid *f = &r->uuid;
+
+	printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
+	       "-%02x%02x%02x%02x%02x%02x\n",
+	       f->uuid[ 0], f->uuid[ 1],
+	       f->uuid[ 2], f->uuid[ 3],
+	       f->uuid[ 4], f->uuid[ 5],
+	       f->uuid[ 6], f->uuid[ 7],
+	       f->uuid[ 8], f->uuid[ 9],
+	       f->uuid[10], f->uuid[11],
+	       f->uuid[12], f->uuid[13],
+	       f->uuid[14], f->uuid[15]);
+}
+
+static void dump_attr_server_address(union reply *r, int size)
+{
+	struct fsinfo_server_address *f = &r->srv_addr;
+
+	printf("family=%u\n", f->address.ss_family);
+}
+
+static void dump_attr_io_size(union reply *r, int size)
+{
+	struct fsinfo_io_size *f = &r->io_size;
+
+	printf("bs=%u\n", f->block_size);
+}
+
+/*
+ * Per-attribute dump functions, indexed by attribute number.
+ */
+typedef void (*dumper_t)(union reply *r, int size);
+
+#define FSINFO_DUMPER(N) [fsinfo_attr_##N] = dump_attr_##N
+static const dumper_t fsinfo_attr_dumper[fsinfo_attr__nr] = {
+	FSINFO_DUMPER(statfs),
+	FSINFO_DUMPER(fsinfo),
+	FSINFO_DUMPER(ids),
+	FSINFO_DUMPER(limits),
+	FSINFO_DUMPER(supports),
+	FSINFO_DUMPER(capabilities),
+	FSINFO_DUMPER(timestamp_info),
+	FSINFO_DUMPER(volume_uuid),
+	FSINFO_DUMPER(server_address),
+	FSINFO_DUMPER(io_size),
+};
+
+static void dump_fsinfo(enum fsinfo_attribute attr, __u8 about,
+			union reply *r, int size)
+{
+	dumper_t dumper = fsinfo_attr_dumper[attr];
+	unsigned int len;
+
+	if (!dumper) {
+		printf("<no dumper>\n");
+		return;
+	}
+
+	len = (about & 0x3f) * sizeof(__u32);
+	if (size < len) {
+		printf("<short data %u/%u>\n", size, len);
+		return;
+	}
+
+	dumper(r, size);
+}
+
+/*
+ * Try one subinstance of an attribute.
+ */
+static int try_one(const char *file, struct fsinfo_params *params, bool raw)
+{
+	union reply r;
+	char *p;
+	int ret;
+	__u8 about;
+
+	memset(&r.buffer, 0xbd, sizeof(r.buffer));
+
+	errno = 0;
+	ret = fsinfo(AT_FDCWD, file, params, r.buffer, sizeof(r.buffer));
+	if (params->request >= fsinfo_attr__nr) {
+		if (ret == -1 && errno == EOPNOTSUPP)
+			exit(0);
+		fprintf(stderr, "Unexpected error for too-large command %u: %m\n",
+			params->request);
+		exit(1);
+	}
+
+	//printf("fsinfo(%s,%s,%u,%u) = %d: %m\n",
+	//       file, fsinfo_attr_names[params->request],
+	//       params->Nth, params->Mth, ret);
+
+	about = fsinfo_buffer_sizes[params->request];
+	if (ret == -1) {
+		if (errno == ENODATA) {
+			switch (about & 0xc0) {
+			case 0x00:
+				if (params->Nth == 0 && params->Mth == 0) {
+					fprintf(stderr,
+						"Unexpected ENODATA1 (%u[%u][%u])\n",
+						params->request, params->Nth, params->Mth);
+					exit(1);
+				}
+				break;
+			case 0x40:
+				if (params->Nth == 0 && params->Mth == 0) {
+					fprintf(stderr,
+						"Unexpected ENODATA2 (%u[%u][%u])\n",
+						params->request, params->Nth, params->Mth);
+					exit(1);
+				}
+				break;
+			}
+			return (params->Mth == 0) ? 2 : 1;
+		}
+		if (errno == EOPNOTSUPP) {
+			if (params->Nth > 0 || params->Mth > 0) {
+				fprintf(stderr,
+					"Should return -ENODATA (%u[%u][%u])\n",
+					params->request, params->Nth, params->Mth);
+				exit(1);
+			}
+			//printf("\e[33m%s\e[m: <not supported>\n",
+			//       fsinfo_attr_names[attr]);
+			return 2;
+		}
+		perror(file);
+		exit(1);
+	}
+
+	if (raw) {
+		if (ret > 4096)
+			ret = 4096;
+		dump_hex((unsigned int *)&r.buffer, 0, ret);
+		return 0;
+	}
+
+	switch (about & 0xc0) {
+	case 0x00:
+		printf("\e[33m%s\e[m: ",
+		       fsinfo_attr_names[params->request]);
+		break;
+	case 0x40:
+		printf("\e[33m%s[%u]\e[m: ",
+		       fsinfo_attr_names[params->request],
+		       params->Nth);
+		break;
+	case 0x80:
+		printf("\e[33m%s[%u][%u]\e[m: ",
+		       fsinfo_attr_names[params->request],
+		       params->Nth, params->Mth);
+		break;
+	}
+
+	switch (about) {
+		/* Struct */
+	case 0x01 ... 0x3f:
+	case 0x41 ... 0x7f:
+	case 0x81 ... 0xbf:
+		dump_fsinfo(params->request, about, &r, ret);
+		return 0;
+
+		/* String */
+	case 0x00:
+	case 0x40:
+	case 0x80:
+		if (ret >= 4096) {
+			ret = 4096;
+			r.buffer[4092] = '.';
+			r.buffer[4093] = '.';
+			r.buffer[4094] = '.';
+			r.buffer[4095] = 0;
+		} else {
+			r.buffer[ret] = 0;
+		}
+		for (p = r.buffer; *p; p++) {
+			if (!isprint((unsigned char)*p)) {
+				printf("<non-printable>\n");
+				return 0;
+			}
+		}
+		printf("%s\n", r.buffer);
+		return 0;
+
+	default:
+		fprintf(stderr, "Fishy about %u %02x\n", params->request, about);
+		exit(1);
+	}
+}
+
+/*
+ * Query every attribute, and every instance of it, on the named file.
+ */
+int main(int argc, char **argv)
+{
+	struct fsinfo_params params = {
+		.at_flags = AT_SYMLINK_NOFOLLOW,
+	};
+	unsigned int attr;
+	int raw = 0, opt, Nth, Mth;
+
+	while ((opt = getopt(argc, argv, "alr")) != -1) {
+		switch (opt) {
+		case 'a':
+			params.at_flags |= AT_NO_AUTOMOUNT;
+			continue;
+		case 'l':
+			params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
+			continue;
+		case 'r':
+			raw = 1;
+			continue;
+		}
+		break;
+	}
+
+	argc -= optind;
+	argv += optind;
+
+	if (argc != 1) {
+		printf("Format: test-fsinfo [-alr] <file>\n");
+		exit(2);
+	}
+
+	for (attr = 0; attr <= fsinfo_attr__nr; attr++) {
+		Nth = 0;
+		do {
+			Mth = 0;
+			do {
+				params.request = attr;
+				params.Nth = Nth;
+				params.Mth = Mth;
+
+				switch (try_one(argv[0], &params, raw)) {
+				case 0:
+					continue;
+				case 1:
+					goto done_M;
+				case 2:
+					goto done_N;
+				}
+			} while (++Mth < 100);
+
+		done_M:
+			if (Mth >= 100) {
+				fprintf(stderr, "Fishy: Mth == %u\n", Mth);
+				break;
+			}
+
+		} while (++Nth < 100);
+
+	done_N:
+		if (Nth >= 100) {
+			fprintf(stderr, "Fishy: Nth == %u\n", Nth);
+			break;
+		}
+	}
+
+	return 0;
+}
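
For readers following the sample program: test-fsinfo.c describes each
attribute with a single byte in fsinfo_buffer_sizes[].  Going by the FSINFO_*
macros there, the top two bits say whether the attribute is unindexed (0x00),
indexed by Nth (0x40) or indexed by Nth and Mth (0x80), and the low six bits
give the struct size in 32-bit words, with 0 meaning a string.  A minimal
sketch of that decoding follows.  decode_about(), struct attr_shape and the
INDEX_* names are made up for illustration and are not part of the patch.

```c
#include <assert.h>

/* Hypothetical names; only the bit layout is taken from test-fsinfo.c. */
enum index_kind {
	INDEX_NONE = 0x00,	/* FSINFO_STRING / FSINFO_STRUCT */
	INDEX_N    = 0x40,	/* FSINFO_STRING_N / FSINFO_STRUCT_N */
	INDEX_NM   = 0x80,	/* FSINFO_STRUCT_NM */
};

struct attr_shape {
	enum index_kind	kind;		/* How the attribute is indexed */
	unsigned int	struct_bytes;	/* Struct size in bytes; 0 means string */
};

/* Split one descriptor byte into its index kind and struct size. */
static inline struct attr_shape decode_about(unsigned char about)
{
	struct attr_shape s;

	/* None of the FSINFO_* macros in the sample produces 0xc0. */
	assert((about & 0xc0) != 0xc0);

	s.kind = (enum index_kind)(about & 0xc0);
	s.struct_bytes = (about & 0x3f) * 4;	/* Words to bytes */
	return s;
}
```

For instance, decode_about(0x80 | 32) gives the Nth,Mth-indexed kind and a
128-byte struct.  Six size bits limit a struct to 63 words (252 bytes) in the
sample's one-byte encoding.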



* [PATCH 32/32] afs: Add fsinfo support [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (30 preceding siblings ...)
  2018-07-10 22:44 ` [PATCH 31/32] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
@ 2018-07-10 22:45 ` David Howells
  2018-07-10 22:52 ` [MANPAGE PATCH] Add manpages for move_mount(2) and open_tree(2) David Howells
                   ` (5 subsequent siblings)
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:45 UTC (permalink / raw)
  To: viro; +Cc: dhowells, linux-fsdevel, torvalds, linux-kernel

Add fsinfo support to the AFS filesystem.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/super.c |  133 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
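
Userspace can test the capability bits that afs_get_fsinfo() sets in the same
way dump_attr_capabilities() in the sample does: byte cap / 8, bit cap % 8.
The sketch below is only an illustration under stated assumptions.  struct
sketch_caps, sketch_set_cap() and sketch_test_cap() are hypothetical stand-ins
for struct fsinfo_capabilities and fsinfo_set_cap(), and the numeric values
are copied from enum fsinfo_capability as posted in patch 31, so they may
differ in other versions of the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as posted in patch 31 of this series (assumed, may change). */
#define SKETCH_CAP_IS_NETWORK_FS	3
#define SKETCH_CAP_AUTOMOUNTS		5
#define SKETCH_CAP_HAS_MTIME		48
#define SKETCH_CAP_NR			49	/* fsinfo_cap__nr */

/* Stand-in for struct fsinfo_capabilities: one bit per capability. */
struct sketch_caps {
	unsigned char bits[(SKETCH_CAP_NR + 7) / 8];
};

/* Set a capability bit, mirroring what fsinfo_set_cap() does. */
static inline void sketch_set_cap(struct sketch_caps *c, unsigned int cap)
{
	assert(cap < SKETCH_CAP_NR);
	c->bits[cap / 8] |= 1u << (cap % 8);
}

/* Test a capability bit, as dump_attr_capabilities() does per bit. */
static inline bool sketch_test_cap(const struct sketch_caps *c,
				   unsigned int cap)
{
	assert(cap < SKETCH_CAP_NR);
	return (c->bits[cap / 8] & (1u << (cap % 8))) != 0;
}
```

With 49 capabilities the bitmap is 7 bytes long, and has_mtime (bit 48) lands
in bit 0 of the last byte.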

diff --git a/fs/afs/super.c b/fs/afs/super.c
index ab64edff11af..037f20f5ee90 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/magic.h>
+#include <linux/fsinfo.h>
 #include <net/net_namespace.h>
 #include "internal.h"
 
@@ -34,6 +35,7 @@ static void afs_kill_super(struct super_block *sb);
 static struct inode *afs_alloc_inode(struct super_block *sb);
 static void afs_destroy_inode(struct inode *inode);
 static int afs_statfs(struct dentry *dentry, struct kstatfs *buf);
+static int afs_get_fsinfo(struct dentry *dentry, struct fsinfo_kparams *params);
 static int afs_show_devname(struct seq_file *m, struct dentry *root);
 static int afs_show_options(struct seq_file *m, struct dentry *root);
 static int afs_init_fs_context(struct fs_context *fc, struct dentry *reference);
@@ -51,6 +53,7 @@ int afs_net_id;
 
 static const struct super_operations afs_super_ops = {
 	.statfs		= afs_statfs,
+	.get_fsinfo	= afs_get_fsinfo,
 	.alloc_inode	= afs_alloc_inode,
 	.drop_inode	= afs_drop_inode,
 	.destroy_inode	= afs_destroy_inode,
@@ -750,3 +753,133 @@ static int afs_statfs(struct dentry *dentry, struct kstatfs *buf)
 
 	return ret;
 }
+
+/*
+ * Get filesystem information.
+ */
+static int afs_get_fsinfo(struct dentry *dentry, struct fsinfo_kparams *params)
+{
+	struct fsinfo_timestamp_info *tsinfo;
+	struct fsinfo_server_address *addr;
+	struct fsinfo_capabilities *caps;
+	struct fsinfo_supports *sup;
+	struct afs_server_list *slist;
+	struct afs_super_info *as = AFS_FS_S(dentry->d_sb);
+	struct afs_addr_list *alist;
+	struct afs_server *server;
+	struct afs_volume *volume = as->volume;
+	struct afs_cell *cell = as->cell;
+	struct afs_net *net = afs_d2net(dentry);
+	bool dyn_root = as->dyn_root;
+	int ret;
+
+	switch (params->request) {
+	case fsinfo_attr_timestamp_info:
+		tsinfo = params->buffer;
+		tsinfo->minimum_timestamp = 0;
+		tsinfo->maximum_timestamp = UINT_MAX;
+		tsinfo->mtime_gran_mantissa = 1;
+		tsinfo->mtime_gran_exponent = 0;
+		return sizeof(*tsinfo);
+
+	case fsinfo_attr_supports:
+		sup = params->buffer;
+		sup->stx_mask = (STATX_TYPE | STATX_MODE |
+				 STATX_NLINK |
+				 STATX_UID | STATX_GID |
+				 STATX_MTIME | STATX_INO |
+				 STATX_SIZE);
+		sup->stx_attributes = STATX_ATTR_AUTOMOUNT;
+		return sizeof(*sup);
+
+	case fsinfo_attr_capabilities:
+		caps = params->buffer;
+		if (dyn_root) {
+			fsinfo_set_cap(caps, fsinfo_cap_is_automounter_fs);
+			fsinfo_set_cap(caps, fsinfo_cap_automounts);
+		} else {
+			fsinfo_set_cap(caps, fsinfo_cap_is_network_fs);
+			fsinfo_set_cap(caps, fsinfo_cap_automounts);
+			fsinfo_set_cap(caps, fsinfo_cap_adv_locks);
+			fsinfo_set_cap(caps, fsinfo_cap_uids);
+			fsinfo_set_cap(caps, fsinfo_cap_gids);
+			fsinfo_set_cap(caps, fsinfo_cap_volume_id);
+			fsinfo_set_cap(caps, fsinfo_cap_volume_name);
+			fsinfo_set_cap(caps, fsinfo_cap_cell_name);
+			fsinfo_set_cap(caps, fsinfo_cap_iver_mono_incr);
+			fsinfo_set_cap(caps, fsinfo_cap_symlinks);
+			fsinfo_set_cap(caps, fsinfo_cap_hard_links_1dir);
+			fsinfo_set_cap(caps, fsinfo_cap_has_mtime);
+		}
+		return sizeof(*caps);
+
+	case fsinfo_attr_volume_name:
+		if (dyn_root)
+			return -EOPNOTSUPP;
+		if (params->buffer)
+			memcpy(params->buffer, volume->name, volume->name_len);
+		return volume->name_len;
+
+	case fsinfo_attr_cell_name:
+		if (dyn_root)
+			return -EOPNOTSUPP;
+		if (params->buffer)
+			memcpy(params->buffer, cell->name, cell->name_len);
+		return cell->name_len;
+
+	case fsinfo_attr_server_name:
+		if (dyn_root)
+			return -EOPNOTSUPP;
+		read_lock(&volume->servers_lock);
+		slist = afs_get_serverlist(volume->servers);
+		read_unlock(&volume->servers_lock);
+
+		if (params->Nth < slist->nr_servers) {
+			server = slist->servers[params->Nth].server;
+			if (params->buffer)
+				ret = sprintf(params->buffer, "%pU", &server->uuid);
+			else
+				ret = 16 * 2 + 4;
+		} else {
+			ret = -ENODATA;
+		}
+
+		afs_put_serverlist(net, slist);
+		return ret;
+
+	case fsinfo_attr_server_address:
+		addr = params->buffer;
+		if (dyn_root)
+			return -EOPNOTSUPP;
+		read_lock(&volume->servers_lock);
+		slist = afs_get_serverlist(volume->servers);
+		read_unlock(&volume->servers_lock);
+
+		ret = -ENODATA;
+		if (params->Nth >= slist->nr_servers)
+			goto put_slist;
+		server = slist->servers[params->Nth].server;
+
+		read_lock(&server->fs_lock);
+		alist = afs_get_addrlist(rcu_access_pointer(server->addresses));
+		read_unlock(&server->fs_lock);
+		if (!alist)
+			goto put_slist;
+
+		if (params->Mth >= alist->nr_addrs)
+			goto put_alist;
+
+		memcpy(addr, &alist->addrs[params->Mth],
+		       sizeof(struct sockaddr_rxrpc));
+		ret = sizeof(*addr);
+
+	put_alist:
+		afs_put_addrlist(alist);
+	put_slist:
+		afs_put_serverlist(net, slist);
+		return ret;
+
+	default:
+		return generic_fsinfo(dentry, params);
+	}
+}
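
The variable-length cases above (fsinfo_attr_volume_name and
fsinfo_attr_cell_name) return the string length whether or not
params->buffer is set, and only copy when a buffer is supplied.  The
following is a minimal userspace sketch of that convention, not kernel code.
The names demo_get_name() and demo_fetch_name() are hypothetical, and the
"probe for the size, then allocate and fetch" caller is an assumption about
how such a handler might be driven, not something this patch shows.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the volume_name/cell_name cases: copy only if a
 * buffer was supplied, and return the length either way.
 */
static int demo_get_name(const char *name, size_t name_len, void *buffer)
{
	if (buffer)
		memcpy(buffer, name, name_len);
	return (int)name_len;
}

/*
 * Hypothetical caller: ask for the size with a NULL buffer, allocate, then
 * fetch.  Returns a malloc'd, NUL-terminated copy, or NULL on failure.
 */
static char *demo_fetch_name(const char *name, size_t name_len)
{
	int len = demo_get_name(name, name_len, NULL);
	char *buf;

	if (len < 0)
		return NULL;
	buf = malloc((size_t)len + 1);
	if (!buf)
		return NULL;
	demo_get_name(name, name_len, buf);
	buf[len] = '\0';	/* the handler itself copies no terminator */
	return buf;
}
```

Note that the copy carries no NUL terminator, so the caller in the sketch
adds one itself.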



* Re: [PATCH 11/32] vfs: Require specification of size of mount data for internal mounts [ver #9]
  2018-07-10 22:42 ` [PATCH 11/32] vfs: Require specification of size of mount data for internal mounts " David Howells
@ 2018-07-10 22:51   ` Linus Torvalds
  0 siblings, 0 replies; 113+ messages in thread
From: Linus Torvalds @ 2018-07-10 22:51 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List

On Tue, Jul 10, 2018 at 3:42 PM David Howells <dhowells@redhat.com> wrote:
>
> Require specification of the size of the mount data passed to the VFS
> mounting functions by internal mounts.

This should not be patch 11/32 in some big and complex series with new
user API's etc.

This should be a prerequisite patch that stands on its own.

          Linus


* [MANPAGE PATCH] Add manpages for move_mount(2) and open_tree(2)
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (31 preceding siblings ...)
  2018-07-10 22:45 ` [PATCH 32/32] afs: Add fsinfo support " David Howells
@ 2018-07-10 22:52 ` David Howells
  2019-10-09  9:51   ` Michael Kerrisk (man-pages)
  2018-07-10 22:54 ` [MANPAGE PATCH] Add manpage for fsopen(2), fspick(2) and fsmount(2) David Howells
                   ` (4 subsequent siblings)
  37 siblings, 1 reply; 113+ messages in thread
From: David Howells @ 2018-07-10 22:52 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: dhowells, viro, linux-api, linux-fsdevel, torvalds, linux-kernel,
	linux-man

Add manual pages to document the move_mount() and open_tree() system calls.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 man2/move_mount.2 |  274 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 man2/open_tree.2  |  260 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 534 insertions(+)
 create mode 100644 man2/move_mount.2
 create mode 100644 man2/open_tree.2

diff --git a/man2/move_mount.2 b/man2/move_mount.2
new file mode 100644
index 000000000..3a819fb84
--- /dev/null
+++ b/man2/move_mount.2
@@ -0,0 +1,274 @@
+'\" t
+.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH MOVE_MOUNT 2 2018-06-08 "Linux" "Linux Programmer's Manual"
+.SH NAME
+move_mount \- Move mount objects around the filesystem topology
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.br
+.B #include <sys/mount.h>
+.br
+.B #include <unistd.h>
+.br
+.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
+.PP
+.BI "int move_mount(int " from_dirfd ", const char *" from_pathname ","
+.BI "               int " to_dirfd ", const char *" to_pathname ","
+.BI "               unsigned int " flags );
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call.
+.SH DESCRIPTION
+The
+.BR move_mount ()
+call moves a mount from one place to another; it can also be used to attach an
+unattached mount created by
+.BR fsmount "() or " open_tree "() with " OPEN_TREE_CLONE .
+.PP
+If
+.BR move_mount ()
+is called repeatedly with a file descriptor that refers to a mount object,
+then the object will be attached/moved the first time and then moved again on
+each subsequent call, detaching it from the previous mountpoint each time.
+.PP
+To access the source mount object or the destination mountpoint, no
+permissions are required on the object itself, but if either pathname is
+supplied, execute (search) permission is required on all of the directories
+specified in
+.IR from_pathname " or " to_pathname .
+.PP
+The caller does, however, require the appropriate capabilities or permission
+to effect a mount.
+.PP
+.BR move_mount ()
+uses
+.IR from_pathname ", " from_dirfd " and some " flags
+to locate the mount object to be moved and
+.IR to_pathname ", " to_dirfd " and some other " flags
+to locate the destination mountpoint.  Each lookup can be done in one of a
+variety of ways:
+.TP
+[*] By absolute path.
+The pathname points to an absolute path and the dirfd is ignored.  The file is
+looked up by name, starting from the root of the filesystem as seen by the
+calling process.
+.TP
+[*] By cwd-relative path.
+The pathname points to a relative path and the dirfd is
+.IR AT_FDCWD .
+The file is looked up by name, starting from the current working directory.
+.TP
+[*] By dir-relative path.
+The pathname points to a relative path and the dirfd indicates a file descriptor
+pointing to a directory.  The file is looked up by name, starting from the
+directory specified by
+.IR dirfd .
+.TP
+[*] By file descriptor.
+The pathname points to "", the dirfd points directly to the mount object to
+move or the destination mount point and the appropriate
+.B *_EMPTY_PATH
+flag is set.
+.PP
+.I flags
+can be used to influence a path-based lookup.  A value for
+.I flags
+is constructed by OR'ing together zero or more of the following constants:
+.TP
+.BR MOVE_MOUNT_F_EMPTY_PATH
+.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
+If
+.I from_pathname
+is an empty string, operate on the file referred to by
+.IR from_dirfd
+(which may have been obtained using the
+.BR open (2)
+.B O_PATH
+flag or
+.BR open_tree ()).
+If
+.I from_dirfd
+is
+.BR AT_FDCWD ,
+the call operates on the current working directory.
+In this case,
+.I from_dirfd
+can refer to any type of file, not just a directory.
+This flag is Linux-specific; define
+.B _GNU_SOURCE
+.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
+to obtain its definition.
+.TP
+.B MOVE_MOUNT_T_EMPTY_PATH
+As above, but operating on
+.IR to_pathname " and " to_dirfd .
+.TP
+.B MOVE_MOUNT_F_NO_AUTOMOUNT
+Don't automount the terminal ("basename") component of
+.I from_pathname
+if it is a directory that is an automount point.  This allows a mount object
+that has an automount point at its root to be moved and prevents unintended
+triggering of an automount point.
+The
+.B MOVE_MOUNT_F_NO_AUTOMOUNT
+flag has no effect if the automount point has already been mounted over.  This
+flag is Linux-specific; define
+.B _GNU_SOURCE
+.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
+to obtain its definition.
+.TP
+.B MOVE_MOUNT_T_NO_AUTOMOUNT
+As above, but operating on
+.IR to_pathname " and " to_dirfd .
+This allows an automount point to be manually mounted over.
+.TP
+.B MOVE_MOUNT_F_SYMLINKS
+If
+.I from_pathname
+is a symbolic link, then dereference it.  The default for
+.BR move_mount ()
+is to not follow symlinks.
+.TP
+.B MOVE_MOUNT_T_SYMLINKS
+As above, but operating on
+.IR to_pathname " and " to_dirfd .
+
+.SH EXAMPLES
+The
+.BR move_mount ()
+function can be used like the following:
+.PP
+.RS
+.nf
+move_mount(AT_FDCWD, "/a", AT_FDCWD, "/b", 0);
+.fi
+.RE
+.PP
+This would move the object mounted on "/a" to "/b".  It can also be used in
+conjunction with
+.BR open_tree "(2) or " open "(2) with " O_PATH :
+.PP
+.RS
+.nf
+fd = open_tree(AT_FDCWD, "/mnt", 0);
+move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);
+move_mount(fd, "", AT_FDCWD, "/mnt3", MOVE_MOUNT_F_EMPTY_PATH);
+move_mount(fd, "", AT_FDCWD, "/mnt4", MOVE_MOUNT_F_EMPTY_PATH);
+.fi
+.RE
+.PP
+This would attach the path point for "/mnt" to fd, then it would move the
+mount to "/mnt2", then move it to "/mnt3" and finally to "/mnt4".
+.PP
+It can also be used to attach new mounts:
+.PP
+.RS
+.nf
+sfd = fsopen("ext4", FSOPEN_CLOEXEC);
+write(sfd, "s /dev/sda1");
+write(sfd, "o user_xattr");
+mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_NODEV);
+move_mount(mfd, "", AT_FDCWD, "/home", MOVE_MOUNT_F_EMPTY_PATH);
+.fi
+.RE
+.PP
+This would open the ext4 filesystem on "/dev/sda1", turn on user
+extended attribute support and create a mount object for it.  Finally, the new
+mount object would be attached with
+.BR move_mount ()
+to "/home".
+
+
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.SH RETURN VALUE
+On success, 0 is returned.  On error, \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EACCES
+Search permission is denied for one of the directories
+in the path prefix of
+.IR from_pathname " or " to_pathname .
+(See also
+.BR path_resolution (7).)
+.TP
+.B EBADF
+.IR from_dirfd " or " to_dirfd
+is not a valid open file descriptor.
+.TP
+.B EFAULT
+.IR from_pathname " or " to_pathname
+is NULL or points to a location outside the process's accessible
+address space.
+.TP
+.B EINVAL
+Reserved flag specified in
+.IR flags .
+.TP
+.B ELOOP
+Too many symbolic links encountered while traversing the pathname.
+.TP
+.B ENAMETOOLONG
+.IR from_pathname " or " to_pathname
+is too long.
+.TP
+.B ENOENT
+A component of
+.IR from_pathname " or " to_pathname
+does not exist, or one is an empty string and the appropriate
+.B *_EMPTY_PATH
+was not specified in
+.IR flags .
+.TP
+.B ENOMEM
+Out of memory (i.e., kernel memory).
+.TP
+.B ENOTDIR
+A component of the path prefix of
+.IR from_pathname " or " to_pathname
+is not a directory or one or the other is relative and the appropriate
+.I *_dirfd
+is a file descriptor referring to a file other than a directory.
+.SH VERSIONS
+.BR move_mount ()
+was added to Linux in kernel 4.18.
+.SH CONFORMING TO
+.BR move_mount ()
+is Linux-specific.
+.SH NOTES
+Glibc does not (yet) provide a wrapper for the
+.BR move_mount ()
+system call; call it using
+.BR syscall (2).
+.SH SEE ALSO
+.BR fsmount (2),
+.BR fsopen (2),
+.BR open_tree (2)
diff --git a/man2/open_tree.2 b/man2/open_tree.2
new file mode 100644
index 000000000..7e9c86fe3
--- /dev/null
+++ b/man2/open_tree.2
@@ -0,0 +1,260 @@
+'\" t
+.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH OPEN_TREE 2 2018-06-08 "Linux" "Linux Programmer's Manual"
+.SH NAME
+open_tree \- Pick or clone mount object and attach to fd
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.br
+.B #include <sys/mount.h>
+.br
+.B #include <unistd.h>
+.br
+.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
+.PP
+.BI "int open_tree(int " dirfd ", const char *" pathname ", unsigned int " flags );
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call.
+.SH DESCRIPTION
+.BR open_tree ()
+picks the mount object specified by the pathname and attaches it to a new file
+descriptor or clones it and attaches the clone to the file descriptor.  The
+resultant file descriptor is indistinguishable from one produced by
+.BR open "(2) with " O_PATH .
+.PP
+In the case that the mount object is cloned, the clone will be "unmounted" and
+destroyed when the file descriptor is closed if it is not otherwise mounted
+somewhere by calling
+.BR move_mount (2).
+.PP
+To select a mount object, no permissions are required on the object referred
+to by the path, but execute (search) permission is required on all of the
+directories in
+.I pathname
+that lead to the object.
+.PP
+To clone an object, however, the caller must have mount capabilities and
+permissions.
+.PP
+.BR open_tree ()
+uses
+.IR pathname ", " dirfd " and " flags
+to locate the target object in one of a variety of ways:
+.TP
+[*] By absolute path.
+.I pathname
+points to an absolute path and
+.I dirfd
+is ignored.  The object is looked up by name, starting from the root of the
+filesystem as seen by the calling process.
+.TP
+[*] By cwd-relative path.
+.I pathname
+points to a relative path and
+.IR dirfd " is " AT_FDCWD .
+The object is looked up by name, starting from the current working directory.
+.TP
+[*] By dir-relative path.
+.I pathname
+points to a relative path and
+.I dirfd
+indicates a file descriptor pointing to a directory.  The object is looked up
+by name, starting from the directory specified by
+.IR dirfd .
+.TP
+[*] By file descriptor.
+.I pathname
+is "",
+.I dirfd
+indicates a file descriptor and
+.B AT_EMPTY_PATH
+is set in
+.IR flags .
+The mount attached to the file descriptor is queried directly.  The file
+descriptor may point to any type of file, not just a directory.
+
+.\"______________________________________________________________
+.PP
+.I flags
+can be used to control the operation of the function and to influence a
+path-based lookup.  A value for
+.I flags
+is constructed by OR'ing together zero or more of the following constants:
+.TP
+.BR AT_EMPTY_PATH
+.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
+If
+.I pathname
+is an empty string, operate on the file referred to by
+.IR dirfd
+(which may have been obtained from
+.BR open "(2) with"
+.BR O_PATH ", from " fsmount (2)
+or from another
+.BR open_tree ()).
+If
+.I dirfd
+is
+.BR AT_FDCWD ,
+the call operates on the current working directory.
+In this case,
+.I dirfd
+can refer to any type of file, not just a directory.
+This flag is Linux-specific; define
+.B _GNU_SOURCE
+.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
+to obtain its definition.
+.TP
+.BR AT_NO_AUTOMOUNT
+Don't automount the terminal ("basename") component of
+.I pathname
+if it is a directory that is an automount point.  This flag allows the
+automount point itself to be picked up or a mount cloned that is rooted on the
+automount point.  The
+.B AT_NO_AUTOMOUNT
+flag has no effect if the mount point has already been mounted over.
+This flag is Linux-specific; define
+.B _GNU_SOURCE
+.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
+to obtain its definition.
+.TP
+.B AT_SYMLINK_NOFOLLOW
+If
+.I pathname
+is a symbolic link, do not dereference it: instead pick up or clone a mount
+rooted on the link itself.
+.TP
+.B OPEN_TREE_CLOEXEC
+Set the close-on-exec flag for the new file descriptor.  This will cause the
+file descriptor to be closed automatically when a process exec's.
+.TP
+.B OPEN_TREE_CLONE
+Rather than directly attaching the selected object to the file descriptor,
+clone the object, set the root of the new mount object to that point and
+attach the clone to the file descriptor.
+.TP
+.B AT_RECURSIVE
+This is only permitted in conjunction with OPEN_TREE_CLONE.  It causes the
+entire mount subtree rooted at the selected spot to be cloned rather than just
+that one mount object.
+
+
+.SH EXAMPLE
+The
+.BR open_tree ()
+function can be used like the following:
+.PP
+.RS
+.nf
+fd1 = open_tree(AT_FDCWD, "/mnt", 0);
+fd2 = open_tree(fd1, "",
+                AT_EMPTY_PATH | OPEN_TREE_CLONE | AT_RECURSIVE);
+move_mount(fd2, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);
+.fi
+.RE
+.PP
+This would attach the path point for "/mnt" to fd1, then it would copy the
+entire subtree at the point referred to by fd1 and attach that to fd2; lastly,
+it would attach the clone to "/mnt2".
+
+
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.SH RETURN VALUE
+On success, the new file descriptor is returned.  On error, \-1 is returned,
+and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EACCES
+Search permission is denied for one of the directories
+in the path prefix of
+.IR pathname .
+(See also
+.BR path_resolution (7).)
+.TP
+.B EBADF
+.I dirfd
+is not a valid open file descriptor.
+.TP
+.B EFAULT
+.I pathname
+is NULL or
+.IR pathname
+points to a location outside the process's accessible address space.
+.TP
+.B EINVAL
+Reserved flag specified in
+.IR flags .
+.TP
+.B ELOOP
+Too many symbolic links encountered while traversing the pathname.
+.TP
+.B ENAMETOOLONG
+.I pathname
+is too long.
+.TP
+.B ENOENT
+A component of
+.I pathname
+does not exist, or
+.I pathname
+is an empty string and
+.B AT_EMPTY_PATH
+was not specified in
+.IR flags .
+.TP
+.B ENOMEM
+Out of memory (i.e., kernel memory).
+.TP
+.B ENOTDIR
+A component of the path prefix of
+.I pathname
+is not a directory or
+.I pathname
+is relative and
+.I dirfd
+is a file descriptor referring to a file other than a directory.
+.SH VERSIONS
+.BR open_tree ()
+was added to Linux in kernel 4.18.
+.SH CONFORMING TO
+.BR open_tree ()
+is Linux-specific.
+.SH NOTES
+Glibc does not (yet) provide a wrapper for the
+.BR open_tree ()
+system call; call it using
+.BR syscall (2).
+.SH SEE ALSO
+.BR fsmount (2),
+.BR move_mount (2),
+.BR open (2)
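
Since both pages say to go through syscall(2), here is a minimal sketch of
what such wrappers might look like.  The demo_* names are hypothetical and
are not glibc functions.  The sketch assumes the installed headers define
SYS_open_tree and SYS_move_mount and otherwise fails with ENOSYS.  It
deliberately defines no OPEN_TREE_* or MOVE_MOUNT_* values; those have to
come from kernel headers that match the running kernel.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>		/* AT_FDCWD */
#include <sys/syscall.h>	/* SYS_* numbers, if the headers have them */
#include <unistd.h>		/* syscall() */

/* Hypothetical wrapper; argument order follows the open_tree.2 synopsis. */
static int demo_open_tree(int dirfd, const char *pathname, unsigned int flags)
{
#ifdef SYS_open_tree
	return (int)syscall(SYS_open_tree, dirfd, pathname, flags);
#else
	errno = ENOSYS;		/* headers predate the system call */
	return -1;
#endif
}

/* Hypothetical wrapper; argument order follows the move_mount.2 synopsis. */
static int demo_move_mount(int from_dirfd, const char *from_pathname,
			   int to_dirfd, const char *to_pathname,
			   unsigned int flags)
{
#ifdef SYS_move_mount
	return (int)syscall(SYS_move_mount, from_dirfd, from_pathname,
			    to_dirfd, to_pathname, flags);
#else
	errno = ENOSYS;		/* headers predate the system call */
	return -1;
#endif
}
```

With wrappers like these, the example in open_tree.2 keeps the same call
shape, e.g. demo_open_tree(AT_FDCWD, "/mnt", 0), with \-1 and errno checked
after each call.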


* [MANPAGE PATCH] Add manpage for fsopen(2), fspick(2) and fsmount(2)
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (32 preceding siblings ...)
  2018-07-10 22:52 ` [MANPAGE PATCH] Add manpages for move_mount(2) and open_tree(2) David Howells
@ 2018-07-10 22:54 ` David Howells
  2019-10-09  9:52   ` Michael Kerrisk (man-pages)
  2018-07-10 22:55 ` [MANPAGE PATCH] Add manpage for fsinfo(2) David Howells
                   ` (3 subsequent siblings)
  37 siblings, 1 reply; 113+ messages in thread
From: David Howells @ 2018-07-10 22:54 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: dhowells, viro, linux-api, linux-fsdevel, torvalds, linux-kernel,
	linux-man

Add a manual page to document the fsopen(), fspick() and fsmount() system
calls.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 man2/fsmount.2 |    1 
 man2/fsopen.2  |  357 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 man2/fspick.2  |    1 
 3 files changed, 359 insertions(+)
 create mode 100644 man2/fsmount.2
 create mode 100644 man2/fsopen.2
 create mode 100644 man2/fspick.2

diff --git a/man2/fsmount.2 b/man2/fsmount.2
new file mode 100644
index 000000000..2bf59fc3e
--- /dev/null
+++ b/man2/fsmount.2
@@ -0,0 +1 @@
+.so man2/fsopen.2
diff --git a/man2/fsopen.2 b/man2/fsopen.2
new file mode 100644
index 000000000..1bc761ab4
--- /dev/null
+++ b/man2/fsopen.2
@@ -0,0 +1,357 @@
+'\" t
+.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH FSOPEN 2 2018-06-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+fsopen, fsmount, fspick \- Handle filesystem (re-)configuration and mounting
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.br
+.B #include <sys/mount.h>
+.br
+.B #include <unistd.h>
+.br
+.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
+.PP
+.BI "int fsopen(const char *" fsname ", unsigned int " flags );
+.PP
+.BI "int fsmount(int " fd ", unsigned int " flags ", unsigned int " ms_flags );
+.PP
+.BI "int fspick(int " dirfd ", const char *" pathname ", unsigned int " flags );
+.fi
+.PP
+.IR Note :
+There are no glibc wrappers for these system calls.
+.SH DESCRIPTION
+.PP
+.BR fsopen ()
+creates a new filesystem configuration context within the kernel for the
+filesystem named in the
+.I fsname
+parameter and attaches it to a file descriptor, which it then returns.  The
+file descriptor can be marked close-on-exec by setting
+.B FSOPEN_CLOEXEC
+in flags.
+.PP
+The
+file descriptor can then be used to configure the desired filesystem parameters
+and security parameters by using
+.BR write (2)
+to pass parameters to it and then writing a command to actually create the
+filesystem representation.
+.PP
+The file descriptor also serves as a channel by which more comprehensive error,
+warning and information messages may be retrieved from the kernel using
+.BR read (2).
+.PP
+Once the kernel's filesystem representation has been created, it can be queried
+by calling
+.BR fsinfo (2)
+on the file descriptor.  fsinfo() will spot that the target is actually a
+creation context and look inside that.
+.PP
+.BR fsmount ()
+can then be called to create a mount object that refers to the newly created
+filesystem representation, with the propagation and mount restrictions to be
+applied specified in
+.IR ms_flags .
+The mount object is then attached to a new file descriptor that looks like one
+created by
+.BR open "(2) with " O_PATH " or " open_tree (2).
+This can be passed to
+.BR move_mount (2)
+to attach the mount object to a mountpoint, thereby completing the process.
+.PP
+The file descriptor returned by fsmount() is marked close-on-exec if
+FSMOUNT_CLOEXEC is specified in
+.IR flags .
+.PP
+After fsmount() has completed, the context created by fsopen() is reset and
+moved to reconfiguration state, allowing the new superblock to be reconfigured.
+.PP
+.BR fspick ()
+creates a new filesystem context within the kernel, attaches the superblock
+specified by
+.IR dirfd ", " pathname ", " flags
+to it, puts it into the reconfiguration state and attaches the context to a new
+file descriptor that can then be parameterised with
+.BR write (2)
+exactly the same as for the context created by fsopen() above.
+.PP
+.I flags
+is an OR'd together mask of
+.B FSPICK_CLOEXEC
+which indicates that the returned file descriptor should be marked
+close-on-exec and
+.BR FSPICK_SYMLINK_NOFOLLOW ", " FSPICK_NO_AUTOMOUNT " and " FSPICK_EMPTY_PATH
+which control the pathwalk to the target object (see below).
+
+.\"________________________________________________________
+.SS Writable Command Interface
+Superblock (re-)configuration is achieved by writing command strings to the
+context file descriptor using
+.BR write (2).
+Each string is prefixed with a specifier indicating the class of command
+being specified.  The available commands include:
+.TP
+\fB"o <option>"\fP
+Specify a filesystem or security parameter.
+.I <option>
+is typically a key or key=val format string.  Since the length of the option is
+given to write(), the option may include any sort of character, including
+spaces and commas or even binary data.
+.TP
+\fB"s <name>"\fP
+Specify a device file, network server or other source specification.
+This may be optional, depending on the filesystem, and it may be possible to
+provide multiple of them to a filesystem.
+.TP
+\fB"x create"\fP
+End the filesystem configuration phase and try to create a representation in
+the kernel with the parameters specified.  After this, the context is shifted
+to the mount-pending state waiting for an fsmount() call to occur.
+.TP
+\fB"x reconfigure"\fP
+End a filesystem reconfiguration phase and try to apply the parameters to the
+filesystem representation.  After this, the context gets reset and put back to
+the start of the reconfiguration phase again.
+.PP
+With this interface, option strings are not limited to 4096 bytes, either
+individually or in sum, and they are also not restricted to text-only options.
+Further, errors may be given individually for each option and not aggregated or
+dumped into the kernel log.
+
+.\"________________________________________________________
+.SS Message Retrieval Interface
+The context file descriptor may be queried for message strings at any time by
+calling
+.BR read (2)
+on the file descriptor.  This will return formatted messages that are prefixed
+to indicate their class:
+.TP
+\fB"e <message>"\fP
+An error message string was logged.
+.TP
+\fB"i <message>"\fP
+An informational message string was logged.
+.TP
+\fB"w <message>"\fP
+A warning message string was logged.
+.PP
+Messages are removed from the queue as they're read.
+
+.\"________________________________________________________
+.SH EXAMPLES
+To illustrate the process, here's an example whereby this can be used to mount
+an ext4 filesystem on /dev/sdb1 onto /mnt.  Note that the example ignores the
+fact that
+.BR write (2)
+has a length parameter and that errors might occur.
+.PP
+.in +4n
+.nf
+sfd = fsopen("ext4", FSOPEN_CLOEXEC);
+write(sfd, "s /dev/sdb1");
+write(sfd, "o noatime");
+write(sfd, "o acl");
+write(sfd, "o user_xattr");
+write(sfd, "o iversion");
+write(sfd, "x create");
+fsinfo(sfd, NULL, ...);
+mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
+move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
+.fi
+.in
+.PP
+Here, an ext4 context is created first and attached to sfd.  This is then told
+where its source will be, given a bunch of options and created.
+.BR fsinfo (2)
+can then be used to query the filesystem.  Then fsmount() is called to create a
+mount object and
+.BR move_mount (2)
+is called to attach it to its intended mountpoint.
+.PP
+And here's an example of mounting from an NFS server:
+.PP
+.in +4n
+.nf
+sfd = fsopen("nfs", 0);
+write(sfd, "s example.com/pub/linux");
+write(sfd, "o nfsvers=3");
+write(sfd, "o rsize=65536");
+write(sfd, "o wsize=65536");
+write(sfd, "o rdma");
+write(sfd, "x create");
+mfd = fsmount(sfd, 0, MS_NODEV);
+move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
+.fi
+.in
+.PP
+Reconfiguration can be achieved by:
+.PP
+.in +4n
+.nf
+sfd = fspick(AT_FDCWD, "/mnt", FSPICK_NO_AUTOMOUNT | FSPICK_CLOEXEC);
+write(sfd, "o ro");
+write(sfd, "x reconfigure");
+.fi
+.in
+.PP
+or:
+.PP
+.in +4n
+.nf
+sfd = fsopen(...);
+...
+mfd = fsmount(sfd, ...);
+...
+write(sfd, "o ro");
+write(sfd, "x reconfigure");
+.fi
+.in
+
+
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.SH RETURN VALUE
+On success, all three functions return a file descriptor.  On error, \-1 is
+returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+The error values given below result from filesystem type independent
+errors.
+Each filesystem type may have its own special errors and its
+own special behavior.
+See the Linux kernel source code for details.
+.TP
+.B EACCES
+A component of a path was not searchable.
+(See also
+.BR path_resolution (7).)
+.TP
+.B EACCES
+Mounting a read-only filesystem was attempted without giving the
+.B MS_RDONLY
+flag.
+.TP
+.B EACCES
+The block device
+.I source
+is located on a filesystem mounted with the
+.B MS_NODEV
+option.
+.\" mtk: Probably: write permission is required for MS_BIND, with
+.\" the error EPERM if not present; CAP_DAC_OVERRIDE is required.
+.TP
+.B EBUSY
+.I source
+cannot be reconfigured read-only, because it still holds files open for
+writing.
+.TP
+.B EFAULT
+One of the pointer arguments points outside the user address space.
+.TP
+.B EINVAL
+.I source
+had an invalid superblock.
+.TP
+.B EINVAL
+.I ms_flags
+includes more than one of
+.BR MS_SHARED ,
+.BR MS_PRIVATE ,
+.BR MS_SLAVE ,
+or
+.BR MS_UNBINDABLE .
+.TP
+.BR EINVAL
+An attempt was made to bind mount an unbindable mount.
+.TP
+.B ELOOP
+Too many links encountered during pathname resolution.
+.TP
+.B EMFILE
+The process has too many open files to create more.
+.TP
+.B ENFILE
+The system has too many open files to create more.
+.TP
+.B ENAMETOOLONG
+A pathname was longer than
+.BR MAXPATHLEN .
+.TP
+.B ENODEV
+Filesystem
+.I fsname
+not configured in the kernel.
+.TP
+.B ENOENT
+A pathname was empty or had a nonexistent component.
+.TP
+.B ENOMEM
+The kernel could not allocate sufficient memory to complete the call.
+.TP
+.B ENOTBLK
+.I source
+is not a block device (and a device was required).
+.TP
+.B ENOTDIR
+.IR pathname ,
+or a prefix of
+.IR source ,
+is not a directory.
+.TP
+.B ENXIO
+The major number of the block device
+.I source
+is out of range.
+.TP
+.B EPERM
+The caller does not have the required privileges.
+.SH CONFORMING TO
+These functions are Linux-specific and should not be used in programs intended
+to be portable.
+.SH VERSIONS
+.BR fsopen "(), " fsmount "() and " fspick ()
+were added to Linux in kernel 4.18.
+.SH NOTES
+Glibc does not (yet) provide a wrapper for the
+.BR fsopen "(), " fsmount "() or " fspick "()"
+system calls; call them using
+.BR syscall (2).
+.SH SEE ALSO
+.BR mountpoint (1),
+.BR move_mount (2),
+.BR open_tree (2),
+.BR umount (2),
+.BR mount_namespaces (7),
+.BR path_resolution (7),
+.BR findmnt (8),
+.BR lsblk (8),
+.BR mount (8),
+.BR umount (8)
diff --git a/man2/fspick.2 b/man2/fspick.2
new file mode 100644
index 000000000..2bf59fc3e
--- /dev/null
+++ b/man2/fspick.2
@@ -0,0 +1 @@
+.so man2/fsopen.2
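A note on the examples in the fsopen(2) page above: they leave out the length
argument to write(2) and do not check for errors.  A minimal sketch of a
helper that supplies both might look like the following.  The name fs_cmd()
is hypothetical and not part of this patch, and treating a short write as a
failure is an assumption (that each command has to arrive in a single write).

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): send one command string,
 * such as "o noatime" or "x create", to a filesystem context fd.
 * Returns 0 on success, or -1 with errno set.
 */
static int fs_cmd(int sfd, const char *cmd)
{
	size_t len = strlen(cmd);
	ssize_t n = write(sfd, cmd, len);

	if (n < 0)
		return -1;	/* errno was set by write() */
	if ((size_t)n != len) {
		errno = EIO;	/* assumption: a partial command is invalid */
		return -1;
	}
	return 0;
}
```

With this, the ext4 example would read fs_cmd(sfd, "s /dev/sdb1"),
fs_cmd(sfd, "o noatime") and so on, stopping at the first nonzero return.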


* [MANPAGE PATCH] Add manpage for fsinfo(2)
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (33 preceding siblings ...)
  2018-07-10 22:54 ` [MANPAGE PATCH] Add manpage for fsopen(2), fspick(2) and fsmount(2) David Howells
@ 2018-07-10 22:55 ` David Howells
  2019-10-09  9:52   ` Michael Kerrisk (man-pages)
  2019-10-09 12:02   ` David Howells
  2018-07-10 23:01 ` [PATCH 00/32] VFS: Introduce filesystem context [ver #9] Linus Torvalds
                   ` (2 subsequent siblings)
  37 siblings, 2 replies; 113+ messages in thread
From: David Howells @ 2018-07-10 22:55 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: dhowells, viro, linux-api, linux-fsdevel, torvalds, linux-kernel,
	linux-man

Add a manual page to document the fsinfo() system call.

Signed-off-by: David Howells <dhowells@redhat.com>
---
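For reviewers, a small sketch of how the capability bitmap documented in the
page is meant to be tested.  It applies the byte/bit expression from the
fsinfo_attr_capabilities entry and additionally refuses to look at bits the
running kernel does not report (max_cap from fsinfo_attr_fsinfo).  The helper
name and the caps_len parameter are hypothetical, and it takes a plain byte
array because <sys/fsinfo.h> is not assumed to be installed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: test capability bit `bit` in the byte array filled in
 * for fsinfo_attr_capabilities.  `max_cap` is the number of bits the running
 * kernel supports; bits at or beyond it, or beyond the bytes actually
 * returned, are reported as not set.
 */
static bool fsinfo_cap_is_set(const uint8_t *caps, size_t caps_len,
			      unsigned int max_cap, unsigned int bit)
{
	if (bit >= max_cap || bit / 8 >= caps_len)
		return false;
	return (caps[bit / 8] & (1u << (bit % 8))) != 0;
}
```

The bounds checks mean a program built against a newer header can ask about a
capability an older kernel has never heard of and simply get "not set".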

 man2/fsinfo.2       | 1017 +++++++++++++++++++++++++++++++++++++++++++++++++++
 man2/ioctl_iflags.2 |    6 
 man2/stat.2         |    7 
 man2/statx.2        |   13 +
 man2/utime.2        |    7 
 man2/utimensat.2    |    7 
 6 files changed, 1057 insertions(+)
 create mode 100644 man2/fsinfo.2
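To make the "Allowance for Future Attribute Expansion" rules in the page
concrete, here is a user-space model of the copy-out behaviour it promises
for fixed-size structures.  This is only an illustration under the page's
stated guarantees, not the kernel implementation; model_copy_out() and its
parameters are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/*
 * User-space model (an assumption, not kernel code) of the documented rules:
 * clear the caller's buffer, copy as much of the value as fits, and return
 * the full size of the value whether or not it was truncated.
 */
static size_t model_copy_out(void *buffer, size_t buf_size,
			     const void *value, size_t value_size)
{
	size_t n = value_size < buf_size ? value_size : buf_size;

	if (!buffer || buf_size == 0)
		return value_size;	/* size query only; nothing written */
	memset(buffer, 0, buf_size);	/* guarantee (1): excess space cleared */
	memcpy(buffer, value, n);	/* guarantee (3): truncate, don't fail */
	return value_size;		/* guarantee (2): actual size returned */
}
```

A caller can compare the returned size with the size of its own structure to
tell whether it saw an older (shorter) or a newer (truncated) version.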

diff --git a/man2/fsinfo.2 b/man2/fsinfo.2
new file mode 100644
index 000000000..5710232df
--- /dev/null
+++ b/man2/fsinfo.2
@@ -0,0 +1,1017 @@
+'\" t
+.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH FSINFO 2 2018-06-06 "Linux" "Linux Programmer's Manual"
+.SH NAME
+fsinfo \- Get filesystem information
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.br
+.B #include <sys/fsinfo.h>
+.br
+.B #include <unistd.h>
+.br
+.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
+.PP
+.BI "int fsinfo(int " dirfd ", const char *" pathname ","
+.BI "           struct fsinfo_params *" params ","
+.BI "           void *" buffer ", size_t " buf_size );
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for
+.BR fsinfo ();
+see NOTES.
+.SH DESCRIPTION
+.PP
+fsinfo() retrieves the desired filesystem attribute, as selected by the
+parameters pointed to by
+.IR params ,
+and stores its value in the buffer pointed to by
+.IR buffer .
+.PP
+The parameter structure is optional, defaulting to all the parameters being 0
+if the pointer is NULL.  The structure looks like the following:
+.PP
+.in +4n
+.nf
+struct fsinfo_params {
+    __u32 at_flags;     /* AT_SYMLINK_NOFOLLOW and similar flags */
+    __u32 request;      /* Requested attribute */
+    __u32 Nth;          /* Instance of attribute */
+    __u32 Mth;          /* Subinstance of Nth instance */
+    __u32 __reserved[6]; /* Reserved params; all must be 0 */
+};
+.fi
+.in
+.PP
+The filesystem to be queried is looked up using a combination of
+.IR dirfd ", " pathname " and " params->at_flags .
+This is discussed in more detail below.
+.PP
+The desired attribute is indicated by
+.IR params->request .
+If
+.I params
+is NULL, this will default to
+.BR fsinfo_attr_statfs ,
+which retrieves some of the information returned by
+.BR statfs ().
+The available attributes are described below in the "THE ATTRIBUTES" section.
+.PP
+Some attributes can have multiple values and some can even have multiple
+instances with multiple values.  For example, a network filesystem might use
+multiple servers.  The names of each of these servers can be retrieved by
+using
+.I params->Nth
+to iterate through all the instances until error
+.B ENODATA
+occurs, indicating the end of the list.  Further, each server might have
+multiple addresses available; these can be enumerated using
+.I params->Nth
+to iterate the servers and
+.I params->Mth
+to iterate the addresses of the Nth server.
+.PP
+The amount of data written into the buffer depends on the attribute selected.
+Some attributes return variable-length strings and some return fixed-size
+structures.  If either
+.IR buffer " is NULL or " buf_size " is 0"
+then the size of the attribute value will be returned and nothing will be
+written into the buffer.
+.PP
+The
+.I params->__reserved
+parameters must all be 0.
+.\"_______________________________________________________
+.SS
+Allowance for Future Attribute Expansion
+.PP
+To allow for the future expansion and addition of fields to any fixed-size
+structure attribute,
+.BR fsinfo ()
+makes the following guarantees:
+.RS 4m
+.IP (1) 4m
+It will always clear any excess space in the buffer.
+.IP (2) 4m
+It will always return the actual size of the data.
+.IP (3) 4m
+It will truncate the data to fit it into the buffer rather than giving an
+error.
+.IP (4) 4m
+Any new version of a structure will incorporate all the fields from the old
+version at the same offsets.
+.RE
+.PP
+So, for example, if the caller is running on an older version of the kernel
+with an older, smaller version of the structure than was asked for, the kernel
+will write the smaller version into the buffer and will clear the remainder of
+the buffer to make sure any additional fields are set to 0.  The function will
+return the actual size of the data.
+.PP
+On the other hand, if the caller is running on a newer version of the kernel
+with a newer version of the structure that is larger than the buffer, the write
+to the buffer will be truncated to fit as necessary and the actual size of the
+data will be returned.
+.PP
+Note that this doesn't apply to variable-length string attributes.
+
+.\"_______________________________________________________
+.SS
+Invoking \fBfsinfo\fR():
+.PP
+To query filesystem information, no permissions are required on the file itself, but
+in the case of
+.BR fsinfo ()
+with a path, execute (search) permission is required on all of the directories
+in
+.I pathname
+that lead to the file.
+.PP
+.BR fsinfo ()
+uses
+.IR pathname ", " dirfd " and " params->at_flags
+to locate the target file in one of a variety of ways:
+.TP
+[*] By absolute path.
+.I pathname
+points to an absolute path and
+.I dirfd
+is ignored.  The file is looked up by name, starting from the root of the
+filesystem as seen by the calling process.
+.TP
+[*] By cwd-relative path.
+.I pathname
+points to a relative path and
+.IR dirfd " is " AT_FDCWD .
+The file is looked up by name, starting from the current working directory.
+.TP
+[*] By dir-relative path.
+.I pathname
+points to a relative path and
+.I dirfd
+indicates a file descriptor pointing to a directory.  The file is looked up by
+name, starting from the directory specified by
+.IR dirfd .
+.TP
+[*] By file descriptor.
+.IR pathname " is " NULL " and " dirfd
+indicates a file descriptor.  The file attached to the file descriptor is
+queried directly.  The file descriptor may point to any type of file, not just
+a directory.
+.PP
+.I params->at_flags
+can be used to influence a path-based lookup.  A value for
+.I params->at_flags
+is constructed by OR'ing together zero or more of the following constants:
+.TP
+.BR AT_EMPTY_PATH
+.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
+If
+.I pathname
+is an empty string, operate on the file referred to by
+.IR dirfd
+(which may have been obtained using the
+.BR open (2)
+.B O_PATH
+flag).
+If
+.I dirfd
+is
+.BR AT_FDCWD ,
+the call operates on the current working directory.
+In this case,
+.I dirfd
+can refer to any type of file, not just a directory.
+This flag is Linux-specific; define
+.B _GNU_SOURCE
+.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
+to obtain its definition.
+.TP
+.BR AT_NO_AUTOMOUNT
+Don't automount the terminal ("basename") component of
+.I pathname
+if it is a directory that is an automount point.  This allows the caller to
+gather attributes of the filesystem holding an automount point (rather than
+the filesystem it would mount).  This flag can be used in tools that scan
+directories to prevent mass-automounting of a directory of automount points.
+The
+.B AT_NO_AUTOMOUNT
+flag has no effect if the mount point has already been mounted over.
+This flag is Linux-specific; define
+.B _GNU_SOURCE
+.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
+to obtain its definition.
+.TP
+.B AT_SYMLINK_NOFOLLOW
+If
+.I pathname
+is a symbolic link, do not dereference it:
+instead return information about the link itself, like
+.BR lstat ().
+.SH THE ATTRIBUTES
+.PP
+There is a range of attributes that can be selected from.  These are:
+
+.\" __________________ fsinfo_attr_statfs __________________
+.TP
+.B fsinfo_attr_statfs
+This retrieves the "dynamic"
+.B statfs
+information, such as block and file counts, that are expected to change whilst
+a filesystem is being used.  This fills in the following structure:
+.PP
+.RS
+.in +4n
+.nf
+struct fsinfo_statfs {
+    __u64 f_blocks;	/* Total number of blocks in fs */
+    __u64 f_bfree;	/* Total number of free blocks */
+    __u64 f_bavail;	/* Number of free blocks available to ordinary user */
+    __u64 f_files;	/* Total number of file nodes in fs */
+    __u64 f_ffree;	/* Number of free file nodes */
+    __u64 f_favail;	/* Number of free file nodes available to ordinary user */
+    __u32 f_bsize;	/* Optimal block size */
+    __u32 f_frsize;	/* Fragment size */
+};
+.fi
+.in
+.RE
+.IP
+The fields correspond to those of the same name returned by
+.BR statfs ().
+
+.\" __________________ fsinfo_attr_fsinfo __________________
+.TP
+.B fsinfo_attr_fsinfo
+This retrieves information about the
+.BR fsinfo ()
+system call itself.  This fills in the following structure:
+.PP
+.RS
+.in +4n
+.nf
+struct fsinfo_fsinfo {
+    __u32 max_attr;
+    __u32 max_cap;
+};
+.fi
+.in
+.RE
+.IP
+The
+.I max_attr
+value indicates the number of attributes supported by the
+.BR fsinfo ()
+system call, and
+.I max_cap
+indicates the number of capability bits supported by the
+.B fsinfo_attr_capabilities
+attribute.  The first corresponds to
+.I fsinfo_attr__nr
+and the second to
+.I fsinfo_cap__nr
+in the header file.
+
+.\" __________________ fsinfo_attr_ids __________________
+.TP
+.B fsinfo_attr_ids
+This retrieves a number of fixed IDs and other static information otherwise
+available through
+.BR statfs ().
+The following structure is filled in:
+.PP
+.RS
+.in +4n
+.nf
+struct fsinfo_ids {
+    char  f_fs_name[15 + 1]; /* Filesystem name */
+    __u64 f_flags;	/* Filesystem mount flags (MS_*) */
+    __u64 f_fsid;	/* Short 64-bit Filesystem ID */
+    __u64 f_sb_id;	/* Internal superblock ID */
+    __u32 f_fstype;	/* Filesystem type from linux/magic.h */
+    __u32 f_dev_major;	/* As st_dev_* from struct statx */
+    __u32 f_dev_minor;
+};
+.fi
+.in
+.RE
+.IP
+Most of these are filled in as for
+.BR statfs (),
+with the addition of the filesystem's symbolic name in
+.I f_fs_name
+and an identifier for use in notifications in
+.IR f_sb_id .
+
+.\" __________________ fsinfo_attr_limits __________________
+.TP
+.B fsinfo_attr_limits
+This retrieves information about the limits of what a filesystem can support.
+The following structure is filled in:
+.PP
+.RS
+.in +4n
+.nf
+struct fsinfo_limits {
+    __u64 max_file_size;
+    __u64 max_uid;
+    __u64 max_gid;
+    __u64 max_projid;
+    __u32 max_dev_major;
+    __u32 max_dev_minor;
+    __u32 max_hard_links;
+    __u32 max_xattr_body_len;
+    __u16 max_xattr_name_len;
+    __u16 max_filename_len;
+    __u16 max_symlink_len;
+    __u16 __reserved[1];
+};
+.fi
+.in
+.RE
+.IP
+These indicate the maximum supported sizes for a variety of filesystem objects,
+including the file size, the extended attribute name length and body length,
+the filename length and the symlink body length.
+.IP
+It also indicates the maximum representable values for a User ID, a Group ID,
+a Project ID, a device major number and a device minor number.
+.IP
+And finally, it indicates the maximum number of hard links that can be made to
+a file.
+.IP
+Note that some of these values may be zero if the underlying object or concept
+is not supported by the filesystem or the medium.
+
+.\" __________________ fsinfo_attr_supports __________________
+.TP
+.B fsinfo_attr_supports
+This retrieves information about what bits a filesystem supports in various
+masks.  The following structure is filled in:
+.PP
+.RS
+.in +4n
+.nf
+struct fsinfo_supports {
+    __u64 stx_attributes;
+    __u32 stx_mask;
+    __u32 ioc_flags;
+    __u32 win_file_attrs;
+    __u32 __reserved[1];
+};
+.fi
+.in
+.RE
+.IP
+The
+.IR stx_attributes " and " stx_mask
+fields indicate what bits in the struct statx fields of the matching names
+are supported by the filesystem.
+.IP
+The
+.I ioc_flags
+field indicates what FS_*_FL flag bits as used through the FS_IOC_GET/SETFLAGS
+ioctls are supported by the filesystem.
+.IP
+The
+.I win_file_attrs
+indicates what DOS/Windows file attributes a filesystem supports, if any.
+
+.\" __________________ fsinfo_attr_capabilities __________________
+.TP
+.B fsinfo_attr_capabilities
+This retrieves information about what features a filesystem supports as a
+series of single bit indicators.  The following structure is filled in:
+.PP
+.RS
+.in +4n
+.nf
+struct fsinfo_capabilities {
+    __u8 capabilities[(fsinfo_cap__nr + 7) / 8];
+};
+.fi
+.in
+.RE
+.IP
+where the bit of interest can be found by:
+.PP
+.RS
+.in +4n
+.nf
+	p->capabilities[bit / 8] & (1 << (bit % 8))
+.fi
+.in
+.RE
+.IP
+The bits are listed by
+.I enum fsinfo_capability
+and
+.B fsinfo_cap__nr
+is one more than the last capability bit listed in the header file.
+.IP
+Note that the number of capability bits actually supported by the kernel can be
+found using the
+.B fsinfo_attr_fsinfo
+attribute.
+.IP
+The capability bits and their meanings are listed below in the "THE
+CAPABILITIES" section.
+
+.\" __________________ fsinfo_attr_timestamp_info __________________
+.TP
+.B fsinfo_attr_timestamp_info
+This retrieves information about what timestamp resolution and scope is
+supported by a filesystem for each of the file timestamps.  The following
+structure is filled in:
+.PP
+.RS
+.in +4n
+.nf
+struct fsinfo_timestamp_info {
+	__s64 minimum_timestamp;
+	__s64 maximum_timestamp;
+	__u16 atime_gran_mantissa;
+	__u16 btime_gran_mantissa;
+	__u16 ctime_gran_mantissa;
+	__u16 mtime_gran_mantissa;
+	__s8  atime_gran_exponent;
+	__s8  btime_gran_exponent;
+	__s8  ctime_gran_exponent;
+	__s8  mtime_gran_exponent;
+	__u32 __reserved[1];
+};
+.fi
+.in
+.RE
+.IP
+where
+.IR minimum_timestamp " and " maximum_timestamp
+are the limits on the timestamps that the filesystem supports and
+.IR *time_gran_mantissa " and " *time_gran_exponent
+indicate the granularity of each timestamp in terms of seconds, using the
+formula:
+.PP
+.RS
+.in +4n
+.nf
+mantissa * pow(10, exponent) Seconds
+.fi
+.in
+.RE
+.IP
+where exponent may be negative and the result may be a fraction of a second.
+.IP
+Four timestamps are detailed: \fBA\fPccess time, \fBB\fPirth/creation time,
+\fBC\fPhange time and \fBM\fPodification time.  Capability bits are defined
+that specify whether each of these exist in the filesystem or not.
+.IP
+Note that the timestamp description may be approximated or inaccurate if the
+file is actually remote or is the union of multiple objects.
+
+.\" __________________ fsinfo_attr_volume_id __________________
+.TP
+.B fsinfo_attr_volume_id
+This retrieves the filesystem's superblock volume identifier as a variable-length
+string.  This does not necessarily represent a value stored in the medium but
+might be constructed on the fly.
+.IP
+For instance, for a block device this is the block device identifier
+(e.g., "sdb2"); for AFS this would be the numeric volume identifier.
+
+.\" __________________ fsinfo_attr_volume_uuid __________________
+.TP
+.B fsinfo_attr_volume_uuid
+This retrieves the volume UUID, if there is one, as a little-endian binary
+UUID.  This fills in the following structure:
+.PP
+.RS
+.in +4n
+.nf
+struct fsinfo_volume_uuid {
+    __u8 uuid[16];
+};
+.fi
+.in
+.RE
+.IP
+
+.\" __________________ fsinfo_attr_volume_name __________________
+.TP
+.B fsinfo_attr_volume_name
+This retrieves the filesystem's volume name as a variable-length string.  This
+is expected to represent a name stored in the medium.
+.IP
+For a block device, this might be a label stored in the superblock.  For a
+network filesystem, this might be a logical volume name of some sort.
+
+.\" __________________ fsinfo_attr_cell/domain __________________
+.PP
+.B fsinfo_attr_cell_name
+.br
+.B fsinfo_attr_domain_name
+.br
+.IP
+These two attributes are variable-length string attributes that may be used to
+obtain information about network filesystems.  An AFS volume, for instance,
+belongs to a named cell.  CIFS shares may belong to a domain.
+
+.\" __________________ fsinfo_attr_realm_name __________________
+.TP
+.B fsinfo_attr_realm_name
+This attribute is a variable-length string that indicates the Kerberos realm that
+a filesystem's authentication tokens should come from.
+
+.\" __________________ fsinfo_attr_server_name __________________
+.TP
+.B fsinfo_attr_server_name
+This attribute is a multiple-value attribute that lists the names of the
+servers that are backing a network filesystem.  Each value is a variable-length
+string.  The values are enumerated by calling
+.BR fsinfo ()
+multiple times, incrementing
+.I params->Nth
+each time until an ENODATA error occurs, thereby indicating the end of the
+list.
+
+.\" __________________ fsinfo_attr_server_address __________________
+.TP
+.B fsinfo_attr_server_address
+This attribute is a multiple-instance, multiple-value attribute that lists the
+addresses of the servers that are backing a network filesystem.  Each value is
+a structure of the following type:
+.PP
+.RS
+.in +4n
+.nf
+struct fsinfo_server_address {
+    struct __kernel_sockaddr_storage address;
+};
+.fi
+.in
+.RE
+.IP
+Where the address may be AF_INET, AF_INET6, AF_RXRPC or any other type as
+appropriate to the filesystem.
+.IP
+The values are enumerated by calling
+.BR fsinfo ()
+multiple times, incrementing
+.I params->Nth
+to step through the servers and
+.I params->Mth
+to step through the addresses of the Nth server each time until ENODATA errors
+occur, thereby indicating either the end of a server's address list or the end
+of the server list.
+.IP
+Barring the server list changing whilst being accessed, it is expected that the
+.I params->Nth
+will correspond to
+.I params->Nth
+for
+.BR fsinfo_attr_server_name .
+
+.\" __________________ fsinfo_attr_parameter __________________
+.TP
+.B fsinfo_attr_parameter
+This attribute is a multiple-value attribute that lists the values of the mount
+parameters for a filesystem as variable-length strings.
+.IP
+The parameters are enumerated by calling
+.BR fsinfo ()
+multiple times, incrementing
+.I params->Nth
+to step through them until error ENODATA is given.
+.IP
+Parameter strings are presented in a form akin to the way they're passed to the
+context created by the
+.BR fsopen ()
+system call.  For example, straight text parameters will be rendered as
+something like:
+.PP
+.RS
+.in +4n
+.nf
+"o data=journal"
+"o noquota"
+.fi
+.in
+.RE
+.IP
+Where the initial "word" indicates the option form.
+
+.\" __________________ fsinfo_attr_source __________________
+.TP
+.B fsinfo_attr_source
+This attribute is a multiple-value attribute that lists the mount sources for a
+filesystem as variable-length strings.  Normally only one source will be
+available, but the possibility of having more than one is allowed for.
+.IP
+The sources are enumerated by calling
+.BR fsinfo ()
+multiple times, incrementing
+.I params->Nth
+to step through them until error ENODATA is given.
+.IP
+Source strings are presented in a form akin to the way they're passed to the
+context created by the
+.BR fsopen ()
+system call.  For example, they will be rendered as something like:
+.PP
+.RS
+.in +4n
+.nf
+"s /dev/sda1"
+"s example.com/pub/linux/"
+.fi
+.in
+.RE
+.IP
+Where the initial "word" indicates the option form.
+
+.\" __________________ fsinfo_attr_name_encoding __________________
+.TP
+.B fsinfo_attr_name_encoding
+This attribute is a variable-length string that indicates the filename encoding
+used by the filesystem.  The default is "utf8".  Note that this may indicate a
+non-8-bit encoding if that's what the underlying filesystem actually supports.
+
+.\" __________________ fsinfo_attr_name_codepage __________________
+.TP
+.B fsinfo_attr_name_codepage
+This attribute is a variable-length string that indicates the codepage used to
+translate filenames from the filesystem to the system if this is applicable to
+the filesystem.
+
+.\" __________________ fsinfo_attr_io_size __________________
+.TP
+.B fsinfo_attr_io_size
+This retrieves information about the I/O sizes supported by the filesystem.
+The following structure is filled in:
+.PP
+.RS
+.in +4n
+.nf
+struct fsinfo_io_size {
+    __u32 block_size;
+    __u32 max_single_read_size;
+    __u32 max_single_write_size;
+    __u32 best_read_size;
+    __u32 best_write_size;
+};
+.fi
+.in
+.RE
+.IP
+Where
+.I block_size
+indicates the fundamental I/O block size of the filesystem as something
+O_DIRECT read/write sizes must be a multiple of;
+.IR max_single_read_size " and " max_single_write_size
+indicate the maximum sizes for individual unbuffered data transfer operations;
+and
+.IR best_read_size " and " best_write_size
+indicate the recommended I/O sizes.
+.IP
+Note that any of these may be zero if inapplicable or indeterminable.
+
+
+
+.SH THE CAPABILITIES
+.PP
+There are a number of capability bits in a bit array that can be retrieved using
+.BR fsinfo_attr_capabilities .
+These give information about features of the filesystem driver and the specific
+filesystem.
+
+.\" __________________ fsinfo_cap_is_*_fs __________________
+.PP
+.B fsinfo_cap_is_kernel_fs
+.br
+.B fsinfo_cap_is_block_fs
+.br
+.B fsinfo_cap_is_flash_fs
+.br
+.B fsinfo_cap_is_network_fs
+.br
+.B fsinfo_cap_is_automounter_fs
+.IP
+These indicate the primary type of the filesystem.
+.B kernel
+filesystems are special communication interfaces that substitute files for
+system calls; examples include procfs and sysfs.
+.B block
+filesystems require a block device on which to operate; examples include ext4
+and XFS.
+.B flash
+filesystems require an MTD device on which to operate; examples include JFFS2.
+.B network
+filesystems require access to the network and contact one or more servers;
+examples include NFS and AFS.
+.B automounter
+filesystems are kernel special filesystems that host automount points and
+triggers to dynamically create automount points.  Examples include autofs and
+AFS's dynamic root.
+
+.\" __________________ fsinfo_cap_automounts __________________
+.TP
+.B fsinfo_cap_automounts
+The filesystem may have automount points that can be triggered by pathwalk.
+
+.\" __________________ fsinfo_cap_adv_locks __________________
+.TP
+.B fsinfo_cap_adv_locks
+The filesystem supports advisory file locks.  For a network filesystem, this
+indicates that the advisory file locks are cross-client (and also between
+server and its local filesystem on something like NFS).
+
+.\" __________________ fsinfo_cap_mand_locks __________________
+.TP
+.B fsinfo_cap_mand_locks
+The filesystem supports mandatory file locks.  For a network filesystem, this
+indicates that the mandatory file locks are cross-client (and also between
+server and its local filesystem on something like NFS).
+
+.\" __________________ fsinfo_cap_leases __________________
+.TP
+.B fsinfo_cap_leases
+The filesystem supports leases.  For a network filesystem, this means that the
+server will tell the client to clean up its state on a file before passing the
+lease to another client.
+
+.\" __________________ fsinfo_cap_*ids __________________
+.PP
+.B fsinfo_cap_uids
+.br
+.B fsinfo_cap_gids
+.br
+.B fsinfo_cap_projids
+.IP
+These indicate that the filesystem supports numeric user IDs, group IDs and
+project IDs respectively.
+
+.\" __________________ fsinfo_cap_id_* __________________
+.PP
+.B fsinfo_cap_id_names
+.br
+.B fsinfo_cap_id_guids
+.IP
+These indicate that the filesystem employs textual names and/or GUIDs as
+identifiers.
+
+.\" __________________ fsinfo_cap_windows_attrs __________________
+.TP
+.B fsinfo_cap_windows_attrs
+Indicates that the filesystem supports some Windows FILE_* attributes.
+
+.\" __________________ fsinfo_cap_*_quotas __________________
+.PP
+.B fsinfo_cap_user_quotas
+.br
+.B fsinfo_cap_group_quotas
+.br
+.B fsinfo_cap_project_quotas
+.IP
+These indicate that the filesystem supports quotas for users, groups and
+projects respectively.
+
+.\" __________________ fsinfo_cap_xattrs/filetypes __________________
+.PP
+.B fsinfo_cap_xattrs
+.br
+.B fsinfo_cap_symlinks
+.br
+.B fsinfo_cap_hard_links
+.br
+.B fsinfo_cap_hard_links_1dir
+.br
+.B fsinfo_cap_device_files
+.br
+.B fsinfo_cap_unix_specials
+.IP
+These indicate that the filesystem supports respectively extended attributes;
+symbolic links; hard links spanning directories; hard links, but only within a
+directory; block and character device files; and UNIX special files, such as
+FIFO and socket.
+
+.\" __________________ fsinfo_cap_*journal* __________________
+.PP
+.B fsinfo_cap_journal
+.br
+.B fsinfo_cap_data_is_journalled
+.IP
+The first of these indicates that the filesystem has a journal and the second
+that the file data changes are being journalled.
+
+.\" __________________ fsinfo_cap_o_* __________________
+.PP
+.B fsinfo_cap_o_sync
+.br
+.B fsinfo_cap_o_direct
+.IP
+These indicate that O_SYNC and O_DIRECT are supported respectively.
+
+.\" __________________ fsinfo_cap_volume_*/*_name __________________
+.PP
+.B fsinfo_cap_volume_id
+.br
+.B fsinfo_cap_volume_uuid
+.br
+.B fsinfo_cap_volume_name
+.br
+.B fsinfo_cap_volume_fsid
+.br
+.B fsinfo_cap_cell_name
+.br
+.B fsinfo_cap_domain_name
+.br
+.B fsinfo_cap_realm_name
+.IP
+These indicate if various attributes are supported by the filesystem, where
+.B fsinfo_cap_X
+here corresponds to
+.BR fsinfo_attr_X .
+
+.\" __________________ fsinfo_cap_iver_* __________________
+.PP
+.B fsinfo_cap_iver_all_change
+.br
+.B fsinfo_cap_iver_data_change
+.br
+.B fsinfo_cap_iver_mono_incr
+.IP
+These indicate if
+.I i_version
+on an inode in the filesystem is supported and
+how it behaves.
+.B all_change
+indicates that i_version is incremented on metadata changes as well as data
+changes.
+.B data_change
+indicates that i_version is only incremented on data changes, including
+truncation.
+.B mono_incr
+indicates that i_version is incremented by exactly 1 for each change made.
+
+.\" __________________ fsinfo_cap_resource_forks __________________
+.TP
+.B fsinfo_cap_resource_forks
+This indicates that the filesystem supports some sort of resource fork or
+alternate data stream on a file.  This isn't the same as an extended attribute.
+
+.\" __________________ fsinfo_cap_name_* __________________
+.PP
+.B fsinfo_cap_name_case_indep
+.br
+.B fsinfo_cap_name_non_utf8
+.br
+.B fsinfo_cap_name_has_codepage
+.IP
+These indicate certain facts about the filenames in a filesystem: whether
+they're case-independent; if they're not UTF-8; and if there's a codepage
+employed to map the names.
+
+.\" __________________ fsinfo_cap_sparse __________________
+.TP
+.B fsinfo_cap_sparse
+This indicates that the filesystem supports sparse files.
+
+.\" __________________ fsinfo_cap_not_persistent __________________
+.TP
+.B fsinfo_cap_not_persistent
+This indicates that the filesystem is not persistent, and that any data stored
+here will not be saved in the event that the filesystem is unmounted, the
+machine is rebooted or the machine loses power.
+
+.\" __________________ fsinfo_cap_no_unix_mode __________________
+.TP
+.B fsinfo_cap_no_unix_mode
+This indicates that the filesystem doesn't support the UNIX mode permissions
+bits.
+
+.\" __________________ fsinfo_cap_has_*time __________________
+.PP
+.B fsinfo_cap_has_atime
+.br
+.B fsinfo_cap_has_btime
+.br
+.B fsinfo_cap_has_ctime
+.br
+.B fsinfo_cap_has_mtime
+.IP
+These indicate which timestamps a filesystem supports: Access
+time, Birth/creation time, Change time (metadata and data) and Modification
+time (data only).
+
+
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.SH RETURN VALUE
+On success, the size of the value that the kernel has available is returned,
+irrespective of whether the buffer is large enough to hold that.  The data
+written to the buffer will be truncated if it is not.  On error, \-1 is
+returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EACCES
+Search permission is denied for one of the directories
+in the path prefix of
+.IR pathname .
+(See also
+.BR path_resolution (7).)
+.TP
+.B EBADF
+.I dirfd
+is not a valid open file descriptor.
+.TP
+.B EFAULT
+.I pathname
+is NULL or
+.IR pathname ", " params " or " buffer
+point to a location outside the process's accessible address space.
+.TP
+.B EINVAL
+Reserved flag specified in
+.IR params->at_flags " or one of " params->__reserved[]
+is not 0.
+.TP
+.B EOPNOTSUPP
+Unsupported attribute requested in
+.IR params->request .
+This may be beyond the limit of the supported attribute set or may just not be
+one that's supported by the filesystem.
+.TP
+.B ENODATA
+Unavailable attribute value requested by
+.IR params->Nth " and/or " params->Mth .
+.TP
+.B ELOOP
+Too many symbolic links encountered while traversing the pathname.
+.TP
+.B ENAMETOOLONG
+.I pathname
+is too long.
+.TP
+.B ENOENT
+A component of
+.I pathname
+does not exist, or
+.I pathname
+is an empty string and
+.B AT_EMPTY_PATH
+was not specified in
+.IR params->at_flags .
+.TP
+.B ENOMEM
+Out of memory (i.e., kernel memory).
+.TP
+.B ENOTDIR
+A component of the path prefix of
+.I pathname
+is not a directory or
+.I pathname
+is relative and
+.I dirfd
+is a file descriptor referring to a file other than a directory.
+.SH VERSIONS
+.BR fsinfo ()
+was added to Linux in kernel 4.18.
+.SH CONFORMING TO
+.BR fsinfo ()
+is Linux-specific.
+.SH NOTES
+Glibc does not (yet) provide a wrapper for the
+.BR fsinfo ()
+system call; call it using
+.BR syscall (2).
+.SH SEE ALSO
+.BR ioctl_iflags (2),
+.BR statx (2),
+.BR statfs (2)
diff --git a/man2/ioctl_iflags.2 b/man2/ioctl_iflags.2
index 9c77b08b9..49ba4444e 100644
--- a/man2/ioctl_iflags.2
+++ b/man2/ioctl_iflags.2
@@ -200,9 +200,15 @@ the effective user ID of the caller must match the owner of the file,
 or the caller must have the
 .BR CAP_FOWNER
 capability.
+.PP
+The set of flags supported by a filesystem can be determined by calling
+.BR fsinfo (2)
+with attribute
+.IR fsinfo_attr_supports .
 .SH SEE ALSO
 .BR chattr (1),
 .BR lsattr (1),
+.BR fsinfo (2),
 .BR mount (2),
 .BR btrfs (5),
 .BR ext4 (5),
diff --git a/man2/stat.2 b/man2/stat.2
index dad9a01ac..ee4001f85 100644
--- a/man2/stat.2
+++ b/man2/stat.2
@@ -532,6 +532,12 @@ If none of the aforementioned macros are defined,
 then the nanosecond values are exposed with names of the form
 .IR st_atimensec .
 .\"
+.PP
+Which timestamps are supported by a filesystem and their ranges and
+granularities can be determined by calling
+.BR fsinfo (2)
+with attribute
+.IR fsinfo_attr_timestamp_info .
 .SS C library/kernel differences
 Over time, increases in the size of the
 .I stat
@@ -707,6 +713,7 @@ main(int argc, char *argv[])
 .BR access (2),
 .BR chmod (2),
 .BR chown (2),
+.BR fsinfo (2),
 .BR readlink (2),
 .BR utime (2),
 .BR capabilities (7),
diff --git a/man2/statx.2 b/man2/statx.2
index edac9f6f4..9a57c1b90 100644
--- a/man2/statx.2
+++ b/man2/statx.2
@@ -534,12 +534,25 @@ Glibc does not (yet) provide a wrapper for the
 .BR statx ()
 system call; call it using
 .BR syscall (2).
+.PP
+The sets of mask/stx_mask and stx_attributes bits supported by a filesystem
+can be determined by calling
+.BR fsinfo (2)
+with attribute
+.IR fsinfo_attr_supports .
+.PP
+Which timestamps are supported by a filesystem and their ranges and
+granularities can also be determined by calling
+.BR fsinfo (2)
+with attribute
+.IR fsinfo_attr_timestamp_info .
 .SH SEE ALSO
 .BR ls (1),
 .BR stat (1),
 .BR access (2),
 .BR chmod (2),
 .BR chown (2),
+.BR fsinfo (2),
 .BR readlink (2),
 .BR stat (2),
 .BR utime (2),
diff --git a/man2/utime.2 b/man2/utime.2
index 03a43a416..c6acdbac2 100644
--- a/man2/utime.2
+++ b/man2/utime.2
@@ -181,9 +181,16 @@ on an append-only file.
 .\" is just a wrapper for
 .\" .BR utime ()
 .\" and hence does not allow a subsecond resolution.
+.PP
+Which timestamps are supported by a filesystem and their ranges and
+granularities can be determined by calling
+.BR fsinfo (2)
+with attribute
+.IR fsinfo_attr_timestamp_info .
 .SH SEE ALSO
 .BR chattr (1),
 .BR touch (1),
+.BR fsinfo (2),
 .BR futimesat (2),
 .BR stat (2),
 .BR utimensat (2),
diff --git a/man2/utimensat.2 b/man2/utimensat.2
index d61b43e96..be8925548 100644
--- a/man2/utimensat.2
+++ b/man2/utimensat.2
@@ -633,9 +633,16 @@ instead checks whether the
 .\" conversely, a process with a read-only file descriptor won't
 .\" be able to update the timestamps of a file,
 .\" even if it has write permission on the file.
+.PP
+Which timestamps are supported by a filesystem and their ranges and
+granularities can be determined by calling
+.BR fsinfo (2)
+with attribute
+.IR fsinfo_attr_timestamp_info .
 .SH SEE ALSO
 .BR chattr (1),
 .BR touch (1),
+.BR fsinfo (2),
 .BR futimesat (2),
 .BR openat (2),
 .BR stat (2),
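
An aside on the RETURN VALUE text in the fsinfo(2) page above: because the
call is described as returning the full size of the value even when the
buffer is too small, a caller can size its buffer with a grow-and-retry
loop.  The following is only a sketch of that pattern.  fake_fsinfo() is a
hypothetical stand-in for the real system call (which has no glibc wrapper),
and its fixed value is made up for illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Made-up value that the hypothetical stand-in below "knows about". */
static const char fake_value[] = "example-volume-name";

/*
 * Hypothetical stand-in for fsinfo(): copies as much of the value as fits,
 * but always returns the full size available, as the page describes.
 */
ssize_t fake_fsinfo(void *buffer, size_t buf_size)
{
	size_t avail = sizeof(fake_value);	/* includes the trailing NUL */
	size_t n = avail < buf_size ? avail : buf_size;

	if (n)
		memcpy(buffer, fake_value, n);	/* truncated if buf_size < avail */
	return (ssize_t)avail;
}

/*
 * Grow-and-retry: returns a malloc'd buffer holding the whole value and
 * stores its size in *size_out, or returns NULL on failure.
 */
void *fetch_whole_value(size_t initial, size_t *size_out)
{
	size_t cap = initial ? initial : 1;
	void *buf = malloc(cap);

	while (buf) {
		ssize_t ret = fake_fsinfo(buf, cap);

		if (ret < 0) {
			free(buf);
			return NULL;
		}
		if ((size_t)ret <= cap) {
			*size_out = (size_t)ret;
			return buf;
		}

		/* The value was truncated: resize to the reported size, retry. */
		void *bigger = realloc(buf, (size_t)ret);
		if (!bigger) {
			free(buf);
			return NULL;
		}
		buf = bigger;
		cap = (size_t)ret;
	}
	return NULL;
}
```

With the real call, the retry matters because the value may change size
between the two calls, which is why the sketch loops rather than calling
exactly twice.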

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (34 preceding siblings ...)
  2018-07-10 22:55 ` [MANPAGE PATCH] Add manpage for fsinfo(2) David Howells
@ 2018-07-10 23:01 ` Linus Torvalds
  2018-07-12  0:46 ` David Howells
  2018-07-18 21:29 ` Getting rid of the usage of write() -- was " David Howells
  37 siblings, 0 replies; 113+ messages in thread
From: Linus Torvalds @ 2018-07-10 23:01 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List

On Tue, Jul 10, 2018 at 3:41 PM David Howells <dhowells@redhat.com> wrote:
>
> Here are a set of patches to create a filesystem context prior to setting
> up a new mount, populating it with the parsed options/binary data, creating
> the superblock and then effecting the mount.  This is also used for remount
> since much of the parsing stuff is common in many filesystems.
>
> This allows namespaces and other information to be conveyed through the
> mount procedure.
>
> This also allows Miklós Szeredi's idea of doing:
>
>         fd = fsopen("nfs");
>         write(fd, "option=val", ...);
>         mfd = fsmount(fd, MS_NODEV);
>         move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
>
> that he presented at LSF-2017 to be implemented (see the relevant patches
> in the series).

All your documentation (both commit logs, man-pages and in-kernel
actual docs you add) only talk about "what".

They don't talk about _why_.

I can imagine why's. But I think that the "why" is actually way more
important than the what. At no point did I see a "this is the current
interface, and it doesn't work for xyz, so here's the new interface
that allows us to do stuff".

When you have a diffstat like this:

 171 files changed, 7147 insertions(+), 1805 deletions(-)

I sure want to see an explanation for *WHY* it adds 5000+ lines of core code.

Also, I want to hear about sane security models. One of the things
people really want to do is have users do their own mounts. We've had
security issues in that area. Why does this improve on it, or make it
even worse?

And by "security models" I absolutely do not mean "here's how you can
do complex smack rules for it".

                 Linus

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 08/32] smack: Implement filesystem context security hooks [ver #9]
  2018-07-10 22:42 ` [PATCH 08/32] smack: Implement filesystem context security " David Howells
@ 2018-07-10 23:13   ` Casey Schaufler
  2018-07-10 23:19   ` David Howells
  1 sibling, 0 replies; 113+ messages in thread
From: Casey Schaufler @ 2018-07-10 23:13 UTC (permalink / raw)
  To: David Howells, viro
  Cc: linux-kernel, linux-fsdevel, linux-security-module, torvalds

On 7/10/2018 3:42 PM, David Howells wrote:
> Implement filesystem context security hooks for the smack LSM.
>
> Question: Should the ->fs_context_parse_source() hook be implemented to
> check the labels on any source devices specified?

Checking the label on a block device when doing a mount
is just going to end in tears. If you're remounting from
an already mounted filesystem it might make sense to check
that the new mount doesn't provide greater access than the
existing mount. If the original mount has smackfsdefault="_"
I could see prohibiting the additional mount having
smackfsdefault="*" on a filesystem that doesn't support
xattrs. But that requires that a (hopefully) privileged
process be involved, and we expect them to have a clue.
So no, I don't see it necessary.

>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Casey Schaufler <casey@schaufler-ca.com>
> cc: linux-security-module@vger.kernel.org
> ---
>
>  security/smack/smack_lsm.c |  309 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 309 insertions(+)
>
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index 7ad226018f51..39780b06469b 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -42,6 +42,7 @@
>  #include <linux/shm.h>
>  #include <linux/binfmts.h>
>  #include <linux/parser.h>
> +#include <linux/fs_context.h>
>  #include "smack.h"
>  
>  #define TRANS_TRUE	"TRUE"
> @@ -521,6 +522,307 @@ static int smack_syslog(int typefrom_file)
>  	return rc;
>  }
>  
> +/*
> + * Mount context operations
> + */
> +
> +struct smack_fs_context {
> +	union {
> +		struct {
> +			char		*fsdefault;
> +			char		*fsfloor;
> +			char		*fshat;
> +			char		*fsroot;
> +			char		*fstransmute;
> +		};
> +		char			*ptrs[5];
> +
> +	};
> +	struct superblock_smack		*sbsp;
> +	struct inode_smack		*isp;
> +	bool				transmute;
> +};
> +
> +/**
> + * smack_fs_context_free - Free the security data from a filesystem context
> + * @fc: The filesystem context to be cleaned up.
> + */
> +static void smack_fs_context_free(struct fs_context *fc)
> +{
> +	struct smack_fs_context *ctx = fc->security;
> +	int i;
> +
> +	if (ctx) {
> +		for (i = 0; i < ARRAY_SIZE(ctx->ptrs); i++)
> +			kfree(ctx->ptrs[i]);
> +		kfree(ctx->isp);
> +		kfree(ctx->sbsp);
> +		kfree(ctx);
> +		fc->security = NULL;
> +	}
> +}
> +
> +/**
> + * smack_fs_context_alloc - Allocate security data for a filesystem context
> + * @fc: The filesystem context.
> + * @reference: Reference dentry (automount/reconfigure) or NULL
> + *
> + * Returns 0 on success or -ENOMEM on error.
> + */
> +static int smack_fs_context_alloc(struct fs_context *fc,
> +				  struct dentry *reference)
> +{
> +	struct smack_fs_context *ctx;
> +	struct superblock_smack *sbsp;
> +	struct inode_smack *isp;
> +	struct smack_known *skp;
> +
> +	ctx = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
> +	if (!ctx)
> +		goto nomem;
> +	fc->security = ctx;
> +
> +	sbsp = kzalloc(sizeof(struct superblock_smack), GFP_KERNEL);
> +	if (!sbsp)
> +		goto nomem_free;
> +	ctx->sbsp = sbsp;
> +
> +	isp = new_inode_smack(NULL);
> +	if (!isp)
> +		goto nomem_free;
> +	ctx->isp = isp;
> +
> +	if (reference) {
> +		if (reference->d_sb->s_security)
> +			memcpy(sbsp, reference->d_sb->s_security, sizeof(*sbsp));
> +	} else if (!smack_privileged(CAP_MAC_ADMIN)) {
> +		/* Unprivileged mounts get root and default from the caller. */
> +		skp = smk_of_current();
> +		sbsp->smk_root = skp;
> +		sbsp->smk_default = skp;
> +	} else {
> +		sbsp->smk_root = &smack_known_floor;
> +		sbsp->smk_default = &smack_known_floor;
> +		sbsp->smk_floor = &smack_known_floor;
> +		sbsp->smk_hat = &smack_known_hat;
> +		/* SMK_SB_INITIALIZED will be zero from kzalloc. */
> +	}
> +
> +	return 0;
> +
> +nomem_free:
> +	smack_fs_context_free(fc);
> +nomem:
> +	return -ENOMEM;
> +}
> +
> +/**
> + * smack_fs_context_dup - Duplicate the security data on fs_context duplication
> + * @fc: The new filesystem context.
> + * @src_fc: The source filesystem context being duplicated.
> + *
> + * Returns 0 on success or -ENOMEM on error.
> + */
> +static int smack_fs_context_dup(struct fs_context *fc,
> +				struct fs_context *src_fc)
> +{
> +	struct smack_fs_context *dst, *src = src_fc->security;
> +	int i;
> +
> +	dst = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
> +	if (!dst)
> +		goto nomem;
> +	fc->security = dst;
> +
> +	dst->sbsp = kmemdup(src->sbsp, sizeof(struct superblock_smack),
> +			    GFP_KERNEL);
> +	if (!dst->sbsp)
> +		goto nomem_free;
> +
> +	for (i = 0; i < ARRAY_SIZE(dst->ptrs); i++) {
> +		if (src->ptrs[i]) {
> +			dst->ptrs[i] = kstrdup(src->ptrs[i], GFP_KERNEL);
> +			if (!dst->ptrs[i])
> +				goto nomem_free;
> +		}
> +	}
> +
> +	return 0;
> +
> +nomem_free:
> +	smack_fs_context_free(fc);
> +nomem:
> +	return -ENOMEM;
> +}
> +
> +/**
> + * smack_fs_context_parse_option - Parse a single mount option
> + * @fc: The new filesystem context being constructed.
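> + * @p: The option text buffer.
> + * @len: The length of the text.
> + *
> + * Returns 0 on success, -EPERM if the caller lacks CAP_MAC_ADMIN, -EINVAL for
> + * an unknown or duplicate option, or -ENOMEM on allocation failure.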
> + */
> +static int smack_fs_context_parse_option(struct fs_context *fc, char *p, size_t len)
> +{
> +	struct smack_fs_context *ctx = fc->security;
> +	substring_t args[MAX_OPT_ARGS];
> +	int rc = -ENOMEM;
> +	int token;
> +
> +	/* Unprivileged mounts don't get to specify Smack values. */
> +	if (!smack_privileged(CAP_MAC_ADMIN))
> +		return -EPERM;
> +
> +	token = match_token(p, smk_mount_tokens, args);
> +	switch (token) {
> +	case Opt_fsdefault:
> +		if (ctx->fsdefault)
> +			goto error_dup;
> +		ctx->fsdefault = match_strdup(&args[0]);
> +		if (!ctx->fsdefault)
> +			goto error;
> +		break;
> +	case Opt_fsfloor:
> +		if (ctx->fsfloor)
> +			goto error_dup;
> +		ctx->fsfloor = match_strdup(&args[0]);
> +		if (!ctx->fsfloor)
> +			goto error;
> +		break;
> +	case Opt_fshat:
> +		if (ctx->fshat)
> +			goto error_dup;
> +		ctx->fshat = match_strdup(&args[0]);
> +		if (!ctx->fshat)
> +			goto error;
> +		break;
> +	case Opt_fsroot:
> +		if (ctx->fsroot)
> +			goto error_dup;
> +		ctx->fsroot = match_strdup(&args[0]);
> +		if (!ctx->fsroot)
> +			goto error;
> +		break;
> +	case Opt_fstransmute:
> +		if (ctx->fstransmute)
> +			goto error_dup;
> +		ctx->fstransmute = match_strdup(&args[0]);
> +		if (!ctx->fstransmute)
> +			goto error;
> +		break;
> +	default:
> +		pr_warn("Smack:  unknown mount option\n");
> +		goto error_inval;
> +	}
> +
> +	return 0;
> +
> +error_dup:
> +	pr_warn("Smack: duplicate mount option\n");
> +error_inval:
> +	rc = -EINVAL;
> +error:
> +	return rc;
> +}
> +
> +/**
> + * smack_fs_context_validate - Validate the filesystem context security data
> + * @fc: The filesystem context.
> + *
> + * Returns 0 on success or -ENOMEM on error.
> + */
> +static int smack_fs_context_validate(struct fs_context *fc)
> +{
> +	struct smack_fs_context *ctx = fc->security;
> +	struct superblock_smack *sbsp = ctx->sbsp;
> +	struct inode_smack *isp = ctx->isp;
> +	struct smack_known *skp;
> +
> +	if (ctx->fsdefault) {
> +		skp = smk_import_entry(ctx->fsdefault, 0);
> +		if (IS_ERR(skp))
> +			return PTR_ERR(skp);
> +		sbsp->smk_default = skp;
> +	}
> +
> +	if (ctx->fsfloor) {
> +		skp = smk_import_entry(ctx->fsfloor, 0);
> +		if (IS_ERR(skp))
> +			return PTR_ERR(skp);
> +		sbsp->smk_floor = skp;
> +	}
> +
> +	if (ctx->fshat) {
> +		skp = smk_import_entry(ctx->fshat, 0);
> +		if (IS_ERR(skp))
> +			return PTR_ERR(skp);
> +		sbsp->smk_hat = skp;
> +	}
> +
> +	if (ctx->fsroot || ctx->fstransmute) {
> +		skp = smk_import_entry(ctx->fstransmute ?: ctx->fsroot, 0);
> +		if (IS_ERR(skp))
> +			return PTR_ERR(skp);
> +		sbsp->smk_root = skp;
> +		ctx->transmute = !!ctx->fstransmute;
> +	}
> +
> +	isp->smk_inode = sbsp->smk_root;
> +	return 0;
> +}
> +
> +/**
> + * smack_sb_get_tree - Assign the context to a newly created superblock
> + * @fc: The new filesystem context.
> + *
> + * Returns 0 on success or -ENOMEM on error.
> + */
> +static int smack_sb_get_tree(struct fs_context *fc)
> +{
> +	struct smack_fs_context *ctx = fc->security;
> +	struct superblock_smack *sbsp = ctx->sbsp;
> +	struct dentry *root = fc->root;
> +	struct inode *inode = d_backing_inode(root);
> +	struct super_block *sb = root->d_sb;
> +	struct inode_smack *isp;
> +	bool transmute = ctx->transmute;
> +
> +	if (sb->s_security)
> +		return 0;
> +
> +	if (!smack_privileged(CAP_MAC_ADMIN)) {
> +		/*
> +		 * For a handful of fs types with no user-controlled
> +		 * backing store it's okay to trust security labels
> +		 * in the filesystem. The rest are untrusted.
> +		 */
> +		if (fc->user_ns != &init_user_ns &&
> +		    sb->s_magic != SYSFS_MAGIC && sb->s_magic != TMPFS_MAGIC &&
> +		    sb->s_magic != RAMFS_MAGIC) {
> +			transmute = true;
> +			sbsp->smk_flags |= SMK_SB_UNTRUSTED;
> +		}
> +	}
> +
> +	sbsp->smk_flags |= SMK_SB_INITIALIZED;
> +	sb->s_security = sbsp;
> +	ctx->sbsp = NULL;
> +
> +	/* Initialize the root inode. */
> +	isp = inode->i_security;
> +	if (isp == NULL) {
> +		isp = ctx->isp;
> +		ctx->isp = NULL;
> +		inode->i_security = isp;
> +	} else
> +		isp->smk_inode = sbsp->smk_root;
> +
> +	if (transmute)
> +		isp->smk_flags |= SMK_INODE_TRANSMUTE;
> +
> +	return 0;
> +}
>  
>  /*
>   * Superblock Hooks.
> @@ -4647,6 +4949,13 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
>  	LSM_HOOK_INIT(ptrace_traceme, smack_ptrace_traceme),
>  	LSM_HOOK_INIT(syslog, smack_syslog),
>  
> +	LSM_HOOK_INIT(fs_context_alloc, smack_fs_context_alloc),
> +	LSM_HOOK_INIT(fs_context_dup, smack_fs_context_dup),
> +	LSM_HOOK_INIT(fs_context_free, smack_fs_context_free),
> +	LSM_HOOK_INIT(fs_context_parse_option, smack_fs_context_parse_option),
> +	LSM_HOOK_INIT(fs_context_validate, smack_fs_context_validate),
> +	LSM_HOOK_INIT(sb_get_tree, smack_sb_get_tree),
> +
>  	LSM_HOOK_INIT(sb_alloc_security, smack_sb_alloc_security),
>  	LSM_HOOK_INIT(sb_free_security, smack_sb_free_security),
>  	LSM_HOOK_INIT(sb_copy_data, smack_sb_copy_data),
>
>
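
A note on the union in the quoted struct smack_fs_context: the anonymous
struct gives each option string a name, while ptrs[] overlays the same
storage so that the free and dup paths can loop over every string.  Below is
a minimal user-space sketch of that idiom, not the Smack code itself.
struct opts, opts_free() and opts_dup() are hypothetical names, it uses
three fields rather than five, and malloc/free stand in for the kernel
allocators.

```c
#include <stdlib.h>
#include <string.h>

#define NR_OPTS 3

/* Hypothetical model: named option strings overlaid on an array. */
struct opts {
	union {
		struct {
			char *fsdefault;
			char *fsfloor;
			char *fshat;
		};
		char *ptrs[NR_OPTS];
	};
};

/* The overlay only works if both views cover exactly the same storage. */
_Static_assert(sizeof(struct opts) == NR_OPTS * sizeof(char *),
	       "named fields and ptrs[] must have the same size");

/* Free every string through the array view and clear the slots. */
void opts_free(struct opts *o)
{
	for (size_t i = 0; i < NR_OPTS; i++) {
		free(o->ptrs[i]);
		o->ptrs[i] = NULL;
	}
}

/* Deep-copy every set string; returns 0 on success, -1 on allocation failure. */
int opts_dup(struct opts *dst, const struct opts *src)
{
	memset(dst, 0, sizeof(*dst));
	for (size_t i = 0; i < NR_OPTS; i++) {
		if (!src->ptrs[i])
			continue;
		size_t len = strlen(src->ptrs[i]) + 1;

		dst->ptrs[i] = malloc(len);
		if (!dst->ptrs[i]) {
			opts_free(dst);	/* undo the partial copy */
			return -1;
		}
		memcpy(dst->ptrs[i], src->ptrs[i], len);
	}
	return 0;
}
```

The static assertion is the part worth keeping if a field is ever added: it
catches the case where the named struct grows but the array bound is not
updated.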


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 08/32] smack: Implement filesystem context security hooks [ver #9]
  2018-07-10 22:42 ` [PATCH 08/32] smack: Implement filesystem context security " David Howells
  2018-07-10 23:13   ` Casey Schaufler
@ 2018-07-10 23:19   ` David Howells
  2018-07-10 23:28     ` Casey Schaufler
  1 sibling, 1 reply; 113+ messages in thread
From: David Howells @ 2018-07-10 23:19 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: dhowells, viro, linux-kernel, linux-fsdevel,
	linux-security-module, torvalds

Casey Schaufler <casey@schaufler-ca.com> wrote:

> > Implement filesystem context security hooks for the smack LSM.
> >
> > Question: Should the ->fs_context_parse_source() hook be implemented to
> > check the labels on any source devices specified?
> 
> Checking the label on a block device when doing a mount
> is just going to end in tears. If you're remounting from
> an already mounted filesystem it might make sense to check
> that the new mount doesn't provide greater access than the
> existing mount. If the original mount has smackfsdefault="_"
> I could see prohibiting the additional mount having
> smackfsdefault="*" on a filesystem that doesn't support
> xattrs. But that requires that a (hopefully) privileged
> process be involved, and we expect them to have a clue.
> So no, I don't see it necessary.

I think I may have meant the device file rather than the actual device
content.

David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 08/32] smack: Implement filesystem context security hooks [ver #9]
  2018-07-10 23:19   ` David Howells
@ 2018-07-10 23:28     ` Casey Schaufler
  0 siblings, 0 replies; 113+ messages in thread
From: Casey Schaufler @ 2018-07-10 23:28 UTC (permalink / raw)
  To: David Howells
  Cc: viro, linux-kernel, linux-fsdevel, linux-security-module, torvalds

On 7/10/2018 4:19 PM, David Howells wrote:
> Casey Schaufler <casey@schaufler-ca.com> wrote:
>
>>> Implement filesystem context security hooks for the smack LSM.
>>>
>>> Question: Should the ->fs_context_parse_source() hook be implemented to
>>> check the labels on any source devices specified?
>> Checking the label on a block device when doing a mount
>> is just going to end in tears. If you're remounting from
>> an already mounted filesystem it might make sense to check
>> that the new mount doesn't provide greater access than the
>> existing mount. If the original mount has smackfsdefault="_"
>> I could see prohibiting the additional mount having
>> smackfsdefault="*" on a filesystem that doesn't support
>> xattrs. But that requires that a (hopefully) privileged
>> process be involved, and we expect them to have a clue.
>> So no, I don't see it necessary.
> I think I may have meant the device file rather than the actual device
> content.

You may have! I see no reason to look at the label on /dev/sdb1
when mounting it. There's already sufficient privilege required
to protect that in my mind.

>
> David


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 10/32] tomoyo: Implement security hooks for the new mount API [ver #9]
  2018-07-10 22:42 ` [PATCH 10/32] tomoyo: " David Howells
@ 2018-07-10 23:34   ` Tetsuo Handa
  0 siblings, 0 replies; 113+ messages in thread
From: Tetsuo Handa @ 2018-07-10 23:34 UTC (permalink / raw)
  To: David Howells
  Cc: viro, linux-kernel, dhowells, linux-fsdevel,
	linux-security-module, tomoyo-dev-en, torvalds

David Howells wrote:
> Implement the security hook to check the creation of a new mountpoint for
> Tomoyo.
> 
> As far as I can tell, Tomoyo doesn't make use of the mount data or parse
> any mount options, so I haven't implemented any of the fs_context hooks for
> it.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> cc: tomoyo-dev-en@lists.sourceforge.jp
> cc: linux-security-module@vger.kernel.org
> 

Would you provide examples of each possible combination as a C program?
For example, if one mount point from multiple sources with different
options are possible, please describe such pattern using syscall so that
LSM modules can run it to see whether they are working as expected.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-10 22:44 ` [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
@ 2018-07-10 23:59   ` Andy Lutomirski
  2018-07-11  1:05     ` Linus Torvalds
                       ` (2 more replies)
  2018-07-11  7:22   ` David Howells
                     ` (3 subsequent siblings)
  4 siblings, 3 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-10 23:59 UTC (permalink / raw)
  To: David Howells
  Cc: viro, linux-api, linux-fsdevel, torvalds, linux-kernel, jannh

[cc Jann - you love this stuff]

> On Jul 10, 2018, at 3:44 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Provide an fsopen() system call that starts the process of preparing to
> create a superblock that will then be mountable, using an fd as a context
> handle.  fsopen() is given the name of the filesystem that will be used:
> 
>    int mfd = fsopen(const char *fsname, unsigned int flags);

This is great in principle, but I think you’re seriously playing with fire with the API. 

> 
> where flags can be 0 or FSOPEN_CLOEXEC.
> 
> For example:
> 
>    sfd = fsopen("ext4", FSOPEN_CLOEXEC);
>    write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg

Imagine some malicious program passes sfd as stdout to a setuid program. That program gets persuaded to write “s /etc/shadow”.  What happens?  You’re okay as long as *every single fs* gets it right, but that’s asking a lot.

>    write(sfd, "o noatime");
>    write(sfd, "o acl");
>    write(sfd, "o user_attr");
>    write(sfd, "o iversion");
>    write(sfd, "o ");
>    write(sfd, "r /my/container"); // root inside the fs
>    write(sfd, "x create"); // create the superblock

From cursory inspection of a bunch of the code, I think the expectation is that the actual device access happens in the “x” action. This is not okay. You can’t do this kind of thing in a write() handler, unless you somehow make every single access using f_cred, which is a real pain.

I think the right solution is one of:

(a) Pass a netlink-formatted blob to fsopen() and do the whole thing in one syscall. I don’t mean using netlink sockets — just the nlattr format.  Or you could use a different format. The part that matters is using just one syscall to do the whole thing.

(b) Keep the current structure but use a new syscall instead of write().

(c) Keep using write() but literally just buffer the data. Then have a new syscall to commit it.  In other words, replace “x” with a syscall and call all the fs_context_operations helpers in that context instead of from write().
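
Option (c) can be sketched as a small user-space model: the write step only
queues text, and a separate commit step is the only place where a privilege
check and any side effect happen.  This is an illustration of the idea under
stated assumptions, not kernel code.  struct model_ctx, model_open(),
model_write() and model_commit() are hypothetical names, and the
caller_privileged flag stands in for whatever credential check the real
commit syscall would make against its caller.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CTX_MAX_CMDS 16
#define CTX_CMD_LEN  64

/* Hypothetical model of a mount context; not the kernel's struct fs_context. */
struct model_ctx {
	char cmds[CTX_MAX_CMDS][CTX_CMD_LEN];	/* queued "<letter> <arg>" lines */
	size_t ncmds;
	bool committed;
};

void model_open(struct model_ctx *ctx)
{
	memset(ctx, 0, sizeof(*ctx));
}

/*
 * Models write(): validates the shape of the command and queues it.
 * It performs no privilege check and has no side effect beyond the queue,
 * so it does not matter who was tricked into calling it.
 */
int model_write(struct model_ctx *ctx, const char *cmd)
{
	size_t len = strlen(cmd);

	if (ctx->committed)
		return -EBUSY;
	if (ctx->ncmds == CTX_MAX_CMDS)
		return -ENOSPC;
	if (len < 2 || cmd[1] != ' ' || len >= CTX_CMD_LEN)
		return -EINVAL;
	memcpy(ctx->cmds[ctx->ncmds++], cmd, len + 1);
	return 0;
}

/*
 * Models the separate commit syscall: the check is made against the task
 * that explicitly asks for the commit, and only then are the queued
 * commands acted upon.  Returns the number of commands applied.
 */
int model_commit(struct model_ctx *ctx, bool caller_privileged)
{
	if (ctx->committed)
		return -EBUSY;
	if (!caller_privileged)
		return -EPERM;
	/* A real implementation would parse and apply ctx->cmds[] here. */
	ctx->committed = true;
	return (int)ctx->ncmds;
}
```

In this model a setuid program that is fed the fd as stdout can at worst add
a line to the queue; nothing is opened or created until a task deliberately
issues the commit and passes the check itself.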

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-10 23:59   ` Andy Lutomirski
@ 2018-07-11  1:05     ` Linus Torvalds
  2018-07-11  1:15       ` Al Viro
  2018-07-11  1:14     ` Jann Horn
  2018-07-11  8:42     ` David Howells
  2 siblings, 1 reply; 113+ messages in thread
From: Linus Torvalds @ 2018-07-11  1:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Howells, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

Yeah, Andy is right that we should *not* make "write()" have side effects.

Use it to queue things by all means, but not "do" things. Not unless
there's a very sane security model.

On Tue, Jul 10, 2018 at 4:59 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> I think the right solution is one of:
>
> (a) Pass a netlink-formatted blob to fsopen() and do the whole thing in one syscall. I don’t mean using netlink sockets — just the nlattr format.  Or you could use a different format. The part that matters is using just one syscall to do the whole thing.

Please no. Not another nasty marshalling thing.

> (b) Keep the current structure but use a new syscall instead of write().
>
> (c) Keep using write() but literally just buffer the data. Then have a new syscall to commit it.  In other words, replace “x” with a syscall and call all the fs_context_operations helpers in that context instead of from write().

But yeah, b-or-c sounds fine.

               Linus

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-10 23:59   ` Andy Lutomirski
  2018-07-11  1:05     ` Linus Torvalds
@ 2018-07-11  1:14     ` Jann Horn
  2018-07-11  1:16       ` Al Viro
  2018-07-11  8:42     ` David Howells
  2 siblings, 1 reply; 113+ messages in thread
From: Jann Horn @ 2018-07-11  1:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Howells, Al Viro, Linux API, linux-fsdevel, Linus Torvalds,
	kernel list

On Tue, Jul 10, 2018 at 4:59 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> [cc Jann - you love this stuff]
>
> > On Jul 10, 2018, at 3:44 PM, David Howells <dhowells@redhat.com> wrote:
> >
> > Provide an fsopen() system call that starts the process of preparing to
> > create a superblock that will then be mountable, using an fd as a context
> > handle.  fsopen() is given the name of the filesystem that will be used:
> >
> >    int mfd = fsopen(const char *fsname, unsigned int flags);
>
> This is great in principle, but I think you’re seriously playing with fire with the API.
>
> >
> > where flags can be 0 or FSOPEN_CLOEXEC.
> >
> > For example:
> >
> >    sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> >    write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
>
> Imagine some malicious program passes sfd as stdout to a setuid program. That program gets persuaded to write “s /etc/shadow”.  What happens?  You’re okay as long as *every single fs* gets it right, but that’s asking a lot.
>
> >    write(sfd, "o noatime");
> >    write(sfd, "o acl");
> >    write(sfd, "o user_attr");
> >    write(sfd, "o iversion");
> >    write(sfd, "o ");
> >    write(sfd, "r /my/container"); // root inside the fs
> >    write(sfd, "x create"); // create the superblock
>
> From cursory inspection of a bunch of the code, I think the expectation is that the actual device access happens in the “x” action. This is not okay. You can’t do this kind of thing in a write() handler, unless you somehow make every single access using f_cred, which is a real pain.
>
> I think the right solution is one of:
>
> (a) Pass a netlink-formatted blob to fsopen() and do the whole thing in one syscall. I don’t mean using netlink sockets — just the nlattr format.  Or you could use a different format. The part that matters is using just one syscall to do the whole thing.
>
> (b) Keep the current structure but use a new syscall instead of write().
>
> (c) Keep using write() but literally just buffer the data. Then have a new syscall to commit it.  In other words, replace “x” with a syscall and call all the fs_context_operations helpers in that context instead of from write().

I also love ioctls, so I think you could also use an ioctl to do the
commit? You can do anything (well, almost anything) that you can do in
syscall context in ioctl context, too; and when you already have a
file descriptor of a specific type that you want to perform an
operation on, an ioctl works just fine.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-11  1:05     ` Linus Torvalds
@ 2018-07-11  1:15       ` Al Viro
  2018-07-11  1:33         ` Andy Lutomirski
                           ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Al Viro @ 2018-07-11  1:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, David Howells, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Tue, Jul 10, 2018 at 06:05:49PM -0700, Linus Torvalds wrote:
> Yeah, Andy is right that we should *not* make "write()" have side effects.
> 
> Use it to queue things by all means, but not "do" things. Not unless
> there's a very sane security model.
> 
> On Tue, Jul 10, 2018 at 4:59 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > I think the right solution is one of:
> >
> > (a) Pass a netlink-formatted blob to fsopen() and do the whole thing in one syscall. I don’t mean using netlink sockets — just the nlattr format.  Or you could use a different format. The part that matters is using just one syscall to do the whole thing.
> 
> Please no. Not another nasty marshalling thing.
> 
> > (b) Keep the current structure but use a new syscall instead of write().
> >
> > (c) Keep using write() but literally just buffer the data. Then have a new syscall to commit it.  In other words, replace “x” with a syscall and call all the fs_context_operations helpers in that context instead of from write().
> 
> But yeah, b-or-c sounds fine.

Umm...  How about "use credentials of opener for everything"?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-11  1:14     ` Jann Horn
@ 2018-07-11  1:16       ` Al Viro
  0 siblings, 0 replies; 113+ messages in thread
From: Al Viro @ 2018-07-11  1:16 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, David Howells, Linux API, linux-fsdevel,
	Linus Torvalds, kernel list

On Tue, Jul 10, 2018 at 06:14:10PM -0700, Jann Horn wrote:

> I also love ioctls, so I think you could also use an ioctl to do the
> commit? You can do anything (well, almost anything) that you can do in
> syscall context in ioctl context, too; and when you already have a
> file descriptor of a specific type that you want to perform an
> operation on, an ioctl works just fine.

Poe's Law in action...


* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-11  1:15       ` Al Viro
@ 2018-07-11  1:33         ` Andy Lutomirski
  2018-07-11  1:48         ` Linus Torvalds
  2018-07-11  8:43         ` David Howells
  2 siblings, 0 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-11  1:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, David Howells, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Tue, Jul 10, 2018 at 6:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Jul 10, 2018 at 06:05:49PM -0700, Linus Torvalds wrote:
>> Yeah, Andy is right that we should *not* make "write()" have side effects.
>>
>> Use it to queue things by all means, but not "do" things. Not unless
>> there's a very sane security model.
>>
>> On Tue, Jul 10, 2018 at 4:59 PM Andy Lutomirski <luto@amacapital.net> wrote:
>> >
>> > I think the right solution is one of:
>> >
>> > (a) Pass a netlink-formatted blob to fsopen() and do the whole thing in one syscall. I don’t mean using netlink sockets — just the nlattr format.  Or you could use a different format. The part that matters is using just one syscall to do the whole thing.
>>
>> Please no. Not another nasty marshalling thing.
>>
>> > (b) Keep the current structure but use a new syscall instead of write().
>> >
>> > (c) Keep using write() but literally just buffer the data. Then have a new syscall to commit it.  In other words, replace “x” with a syscall and call all the fs_context_operations helpers in that context instead of from write().
>>
>> But yeah, b-or-c sounds fine.
>
> Umm...  How about "use credentials of opener for everything"?

If you want to audit every single filesystem for any code that uses
credentials for anything and add all the right kernel APIs and make
sure the filesystem uses them and somehow keep screwups from getting
added down the line, then okay I guess.  As far as I know, we don't
even *have* an API for "open this device node using this struct cred
*".

I kind of want to add a hack to set some poison bit in current->cred
in sys_write() and clear it on the way out.  Sigh.

--Andy


* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-11  1:15       ` Al Viro
  2018-07-11  1:33         ` Andy Lutomirski
@ 2018-07-11  1:48         ` Linus Torvalds
  2018-07-11  8:43         ` David Howells
  2 siblings, 0 replies; 113+ messages in thread
From: Linus Torvalds @ 2018-07-11  1:48 UTC (permalink / raw)
  To: Al Viro
  Cc: Andy Lutomirski, David Howells, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Tue, Jul 10, 2018 at 6:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  How about "use credentials of opener for everything"?

yeah, we have that for writes in general.

Nobody ever actually follows that rule. They may *think* they do, and
then they call to some helper that does "capability(CAP_SYS_WHATEVAH)"
without even realizing it.

But I'm certainly ok with writes, if it's just filling a buffer.
Preferably a standard buffer we already have, like a seqfile or pipe
(hey, splice!) or whatever.

And then you have that final op to actually "commit" the state. Which
shouldn't be a write (and not the close).

           Linus


* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-10 22:44 ` [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
  2018-07-10 23:59   ` Andy Lutomirski
@ 2018-07-11  7:22   ` David Howells
  2018-07-11 16:38     ` Eric Biggers
                       ` (2 more replies)
  2018-07-11 15:51   ` Jonathan Corbet
                     ` (2 subsequent siblings)
  4 siblings, 3 replies; 113+ messages in thread
From: David Howells @ 2018-07-11  7:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, viro, linux-api, linux-fsdevel, torvalds, linux-kernel, jannh

Andy Lutomirski <luto@amacapital.net> wrote:

> >    sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> >    write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
> 
> Imagine some malicious program passes sfd as stdout to a setuid
> program. That program gets persuaded to write "s /etc/shadow".  What
> happens?  You’re okay as long as *every single fs* gets it right, but that’s
> asking a lot.

Do note that you must already have CAP_SYS_ADMIN to be able to call fsopen().

David


* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-10 23:59   ` Andy Lutomirski
  2018-07-11  1:05     ` Linus Torvalds
  2018-07-11  1:14     ` Jann Horn
@ 2018-07-11  8:42     ` David Howells
  2018-07-11 16:03       ` Linus Torvalds
  2 siblings, 1 reply; 113+ messages in thread
From: David Howells @ 2018-07-11  8:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Andy Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Yeah, Andy is right that we should *not* make "write()" have side effects.

Note that write() has side effects all over the place: procfs, sysfs, debugfs,
tracefs, ...  Though for the most part they're single-shot jobs and not
cumulative (I'm not sure this is always true for debugfs - there's a lot of
weird stuff in there).

> > (b) Keep the current structure but use a new syscall instead of write().
> >
> > (c) Keep using write() but literally just buffer the data. Then have a new
> > syscall to commit it.  In other words, replace “x” with a syscall and call
> > all the fs_context_operations helpers in that context instead of from
> > write().
> 
> But yeah, b-or-c sounds fine.

I would prefer to avoid the "let's buffer everything" approach and instead
parse the data as we go along.  What I currently do is store the parsed data
in the context and only actually *apply* it when someone sends the 'x' command.

There are two reasons for this:

 (1) mount()'s error handling is slight: it can only return an error code, but
     creating and mounting something has so many different and interesting
     ways of going wrong and I want to be able to give better error reporting.

     This gets more interesting if it happens inside a container where you
     can't see dmesg.

 (2) Parsing the data means you only need to store the result of the parse and
     can reject anything that's unknown or contradictory.

     Buffering till the end means you have to buffer *everything* - and,
     unless you limit your buffer, you risk running out of RAM.

Now, I can replace the 'x' command with an ioctl() so that just writing random
rubbish to the fd won't cause anything to actually happen.

	fd = fsopen("ext4");
	write(fd, "s /dev/sda1");
	write(fd, "o user_xattr");
	ioctl(fd, FSOPEN_IOC_CREATE_SB, 0);

or I could make a special syscall for it:

	fscommit(fd, FSCOMMIT_CREATE);

or:

	fscommit(fd, FSCOMMIT_RECONFIGURE);

and require that you have CAP_SYS_ADMIN to enact it.
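
As a rough illustration of the split described above (parse each record as it
is written, apply nothing until a separate commit step), here is a small
userspace model.  It is only a sketch: struct model_fs_context, model_write()
and model_commit() are hypothetical names invented for this example and are
not part of the patch set, and the fixed-size fields are arbitrary.

```c
#include <errno.h>
#include <string.h>

#define MODEL_MAX_OPTS 8	/* hypothetical cap on stored options */

/* Hypothetical stand-in for a filesystem context; not the kernel's struct. */
struct model_fs_context {
	char source[64];
	char opts[MODEL_MAX_OPTS][32];
	int nr_opts;
	int committed;
};

/* Parse one record at write() time and keep only the parsed result. */
static int model_write(struct model_fs_context *fc, const char *rec)
{
	if (fc->committed)
		return -EBUSY;

	if (strncmp(rec, "s ", 2) == 0) {
		if (rec[2] == '\0' || fc->source[0])
			return -EINVAL;	/* empty, or source given twice */
		if (strlen(rec + 2) >= sizeof(fc->source))
			return -ENAMETOOLONG;
		strcpy(fc->source, rec + 2);
		return 0;
	}

	if (strncmp(rec, "o ", 2) == 0) {
		if (fc->nr_opts == MODEL_MAX_OPTS)
			return -ENOSPC;
		if (strlen(rec + 2) >= sizeof(fc->opts[0]))
			return -EINVAL;
		strcpy(fc->opts[fc->nr_opts++], rec + 2);
		return 0;
	}

	/* Anything else, including a stray "x", is rejected and does nothing. */
	return -EINVAL;
}

/* The only step with side effects; models the ioctl()/fscommit() idea. */
static int model_commit(struct model_fs_context *fc)
{
	if (fc->committed)
		return -EBUSY;
	if (!fc->source[0])
		return -EINVAL;	/* nothing to create a superblock from */
	fc->committed = 1;
	return 0;
}
```

In this model a process that can only write() to the fd can fill in or spoil
the configuration, but cannot cause the commit to happen.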

David


* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-11  1:15       ` Al Viro
  2018-07-11  1:33         ` Andy Lutomirski
  2018-07-11  1:48         ` Linus Torvalds
@ 2018-07-11  8:43         ` David Howells
  2 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-11  8:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Al Viro, Linus Torvalds, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

Andy Lutomirski <luto@kernel.org> wrote:

> > Umm...  How about "use credentials of opener for everything"?
> 
> If you want to audit every single filesystem for any code that uses
> credentials for anything and add all the right kernel APIs and make
> sure the filesystem uses them and somehow keep screwups from getting
> added down the line, then okay I guess.  As far as I know, we don't
> even *have* an API for "open this device node using this struct cred
> *".

You can use override_creds() too.

David


* Re: [PATCH 07/32] selinux: Implement the new mount API LSM hooks [ver #9]
  2018-07-10 22:42 ` [PATCH 07/32] selinux: Implement the new mount API LSM hooks " David Howells
@ 2018-07-11 14:08   ` Stephen Smalley
  0 siblings, 0 replies; 113+ messages in thread
From: Stephen Smalley @ 2018-07-11 14:08 UTC (permalink / raw)
  To: David Howells, viro
  Cc: Paul Moore, linux-kernel, linux-security-module, selinux,
	linux-fsdevel, torvalds

On 07/10/2018 06:42 PM, David Howells wrote:
> Implement the new mount API LSM hooks for SELinux.  At some point the old
> hooks will need to be removed.
> 
> Question: Should the ->fs_context_parse_source() hook be implemented to
> check the labels on any source devices specified?

The hook interface doesn't appear to lend itself to such validation, since you are just passing a string, not an inode.
Looking up the inode within the security module could easily yield a different object than what is ultimately used for the actual mount.

> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Paul Moore <paul@paul-moore.com>
> cc: Stephen Smalley <sds@tycho.nsa.gov>
> cc: selinux@tycho.nsa.gov
> cc: linux-security-module@vger.kernel.org
> ---
> 
>  security/selinux/hooks.c |  264 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 264 insertions(+)
> 
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 5bb53edd74cc..bdecae4b7306 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -48,6 +48,7 @@
>  #include <linux/fdtable.h>
>  #include <linux/namei.h>
>  #include <linux/mount.h>
> +#include <linux/fs_context.h>
>  #include <linux/netfilter_ipv4.h>
>  #include <linux/netfilter_ipv6.h>
>  #include <linux/tty.h>
> @@ -2973,6 +2974,261 @@ static int selinux_umount(struct vfsmount *mnt, int flags)
>  				   FILESYSTEM__UNMOUNT, NULL);
>  }
>  
> +/* fsopen mount context operations */
> +
> +static int selinux_fs_context_alloc(struct fs_context *fc,
> +				    struct dentry *reference)
> +{
> +	struct security_mnt_opts *opts;
> +
> +	opts = kzalloc(sizeof(*opts), GFP_KERNEL);
> +	if (!opts)
> +		return -ENOMEM;
> +
> +	fc->security = opts;
> +	return 0;
> +}
> +
> +static int selinux_fs_context_dup(struct fs_context *fc,
> +				  struct fs_context *src_fc)
> +{
> +	const struct security_mnt_opts *src = src_fc->security;
> +	struct security_mnt_opts *opts;
> +	int i, n;
> +
> +	opts = kzalloc(sizeof(*opts), GFP_KERNEL);
> +	if (!opts)
> +		return -ENOMEM;
> +	fc->security = opts;
> +
> +	if (!src || !src->num_mnt_opts)
> +		return 0;
> +	n = opts->num_mnt_opts = src->num_mnt_opts;
> +
> +	if (src->mnt_opts) {
> +		opts->mnt_opts = kcalloc(n, sizeof(char *), GFP_KERNEL);
> +		if (!opts->mnt_opts)
> +			return -ENOMEM;
> +
> +		for (i = 0; i < n; i++) {
> +			if (src->mnt_opts[i]) {
> +				opts->mnt_opts[i] = kstrdup(src->mnt_opts[i],
> +							    GFP_KERNEL);
> +				if (!opts->mnt_opts[i])
> +					return -ENOMEM;
> +			}
> +		}
> +	}
> +
> +	if (src->mnt_opts_flags) {
> +		opts->mnt_opts_flags = kmemdup(src->mnt_opts_flags,
> +					       n * sizeof(int), GFP_KERNEL);
> +		if (!opts->mnt_opts_flags)
> +			return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
> +static void selinux_fs_context_free(struct fs_context *fc)
> +{
> +	struct security_mnt_opts *opts = fc->security;
> +
> +	if (opts) {
> +		security_free_mnt_opts(opts);
> +		fc->security = NULL;
> +	}
> +}
> +
> +static int selinux_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
> +{
> +	struct security_mnt_opts *opts = fc->security;
> +	substring_t args[MAX_OPT_ARGS];
> +	unsigned int have;
> +	char *c, **oo;
> +	int token, ctx, i, *of;
> +
> +	token = match_token(opt, tokens, args);
> +	if (token == Opt_error)
> +		return 0; /* Doesn't belong to us. */
> +
> +	have = 0;
> +	for (i = 0; i < opts->num_mnt_opts; i++)
> +		have |= 1 << opts->mnt_opts_flags[i];
> +	if (have & (1 << token))
> +		return -EINVAL;
> +
> +	switch (token) {
> +	case Opt_context:
> +		if (have & (1 << Opt_defcontext))
> +			goto incompatible;
> +		ctx = CONTEXT_MNT;
> +		goto copy_context_string;
> +
> +	case Opt_fscontext:
> +		ctx = FSCONTEXT_MNT;
> +		goto copy_context_string;
> +
> +	case Opt_rootcontext:
> +		ctx = ROOTCONTEXT_MNT;
> +		goto copy_context_string;
> +
> +	case Opt_defcontext:
> +		if (have & (1 << Opt_context))
> +			goto incompatible;
> +		ctx = DEFCONTEXT_MNT;
> +		goto copy_context_string;
> +
> +	case Opt_labelsupport:
> +		return 1;
> +
> +	default:
> +		return -EINVAL;
> +	}
> +
> +copy_context_string:
> +	if (opts->num_mnt_opts > 3)
> +		return -EINVAL;
> +
> +	of = krealloc(opts->mnt_opts_flags,
> +		      (opts->num_mnt_opts + 1) * sizeof(int), GFP_KERNEL);
> +	if (!of)
> +		return -ENOMEM;
> +	of[opts->num_mnt_opts] = 0;
> +	opts->mnt_opts_flags = of;
> +
> +	oo = krealloc(opts->mnt_opts,
> +		      (opts->num_mnt_opts + 1) * sizeof(char *), GFP_KERNEL);
> +	if (!oo)
> +		return -ENOMEM;
> +	oo[opts->num_mnt_opts] = NULL;
> +	opts->mnt_opts = oo;
> +
> +	c = match_strdup(&args[0]);
> +	if (!c)
> +		return -ENOMEM;
> +	opts->mnt_opts[opts->num_mnt_opts] = c;
> +	opts->mnt_opts_flags[opts->num_mnt_opts] = ctx;
> +	opts->num_mnt_opts++;
> +	return 1;
> +
> +incompatible:
> +	return -EINVAL;
> +}
> +
> +/*
> + * Validate the security parameters supplied for a reconfiguration/remount
> + * event.
> + */
> +static int selinux_validate_for_sb_reconfigure(struct fs_context *fc)
> +{
> +	struct super_block *sb = fc->root->d_sb;
> +	struct superblock_security_struct *sbsec = sb->s_security;
> +	struct security_mnt_opts *opts = fc->security;
> +	int rc, i, *flags;
> +	char **mount_options;
> +
> +	if (!(sbsec->flags & SE_SBINITIALIZED))
> +		return 0;
> +
> +	mount_options = opts->mnt_opts;
> +	flags = opts->mnt_opts_flags;
> +
> +	for (i = 0; i < opts->num_mnt_opts; i++) {
> +		u32 sid;
> +
> +		if (flags[i] == SBLABEL_MNT)
> +			continue;
> +
> +		rc = security_context_str_to_sid(&selinux_state, mount_options[i],
> +						 &sid, GFP_KERNEL);
> +		if (rc) {
> +			pr_warn("SELinux: security_context_str_to_sid"
> +				"(%s) failed for (dev %s, type %s) errno=%d\n",
> +				mount_options[i], sb->s_id, sb->s_type->name, rc);
> +			goto inval;
> +		}
> +
> +		switch (flags[i]) {
> +		case FSCONTEXT_MNT:
> +			if (bad_option(sbsec, FSCONTEXT_MNT, sbsec->sid, sid))
> +				goto bad_option;
> +			break;
> +		case CONTEXT_MNT:
> +			if (bad_option(sbsec, CONTEXT_MNT, sbsec->mntpoint_sid, sid))
> +				goto bad_option;
> +			break;
> +		case ROOTCONTEXT_MNT: {
> +			struct inode_security_struct *root_isec;
> +			root_isec = backing_inode_security(sb->s_root);
> +
> +			if (bad_option(sbsec, ROOTCONTEXT_MNT, root_isec->sid, sid))
> +				goto bad_option;
> +			break;
> +		}
> +		case DEFCONTEXT_MNT:
> +			if (bad_option(sbsec, DEFCONTEXT_MNT, sbsec->def_sid, sid))
> +				goto bad_option;
> +			break;
> +		default:
> +			goto inval;
> +		}
> +	}
> +
> +	rc = 0;
> +out:
> +	return rc;
> +
> +bad_option:
> +	pr_warn("SELinux: unable to change security options "
> +		"during remount (dev %s, type=%s)\n",
> +		sb->s_id, sb->s_type->name);
> +inval:
> +	rc = -EINVAL;
> +	goto out;
> +}
> +
> +/*
> + * Validate the security context assembled from the option data supplied to
> + * mount.
> + */
> +static int selinux_fs_context_validate(struct fs_context *fc)
> +{
> +	if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)
> +		return selinux_validate_for_sb_reconfigure(fc);
> +	return 0;
> +}
> +
> +/*
> + * Set the security context on a superblock.
> + */
> +static int selinux_sb_get_tree(struct fs_context *fc)
> +{
> +	const struct cred *cred = current_cred();
> +	struct common_audit_data ad;
> +	int rc;
> +
> +	rc = selinux_set_mnt_opts(fc->root->d_sb, fc->security, 0, NULL);
> +	if (rc)
> +		return rc;
> +
> +	/* Allow all mounts performed by the kernel */
> +	if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
> +		return 0;
> +
> +	ad.type = LSM_AUDIT_DATA_DENTRY;
> +	ad.u.dentry = fc->root;
> +	return superblock_has_perm(cred, fc->root->d_sb, FILESYSTEM__MOUNT, &ad);
> +}
> +
> +static int selinux_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
> +				 unsigned int mnt_flags)
> +{
> +	const struct cred *cred = current_cred();
> +
> +	return path_has_perm(cred, mountpoint, FILE__MOUNTON);
> +}
> +
>  /* inode security operations */
>  
>  static int selinux_inode_alloc_security(struct inode *inode)
> @@ -6905,6 +7161,14 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
>  	LSM_HOOK_INIT(bprm_committing_creds, selinux_bprm_committing_creds),
>  	LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),
>  
> +	LSM_HOOK_INIT(fs_context_alloc, selinux_fs_context_alloc),
> +	LSM_HOOK_INIT(fs_context_dup, selinux_fs_context_dup),
> +	LSM_HOOK_INIT(fs_context_free, selinux_fs_context_free),
> +	LSM_HOOK_INIT(fs_context_parse_option, selinux_fs_context_parse_option),
> +	LSM_HOOK_INIT(fs_context_validate, selinux_fs_context_validate),
> +	LSM_HOOK_INIT(sb_get_tree, selinux_sb_get_tree),
> +	LSM_HOOK_INIT(sb_mountpoint, selinux_sb_mountpoint),
> +
>  	LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
>  	LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
>  	LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
> 



* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-10 22:44 ` [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
  2018-07-10 23:59   ` Andy Lutomirski
  2018-07-11  7:22   ` David Howells
@ 2018-07-11 15:51   ` Jonathan Corbet
  2018-07-11 16:18   ` David Howells
  2018-07-12 17:15   ` Greg KH
  4 siblings, 0 replies; 113+ messages in thread
From: Jonathan Corbet @ 2018-07-11 15:51 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-api, linux-fsdevel, torvalds, linux-kernel

On Tue, 10 Jul 2018 23:44:09 +0100
David Howells <dhowells@redhat.com> wrote:

> 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> 	write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
> 	write(sfd, "o noatime");
> 	write(sfd, "o acl");
> 	write(sfd, "o user_attr");
> 	write(sfd, "o iversion");
> 	write(sfd, "o ");
> 	write(sfd, "r /my/container"); // root inside the fs
> 	write(sfd, "x create"); // create the superblock

A minor detail but ... the "r" operation mentioned above is not actually
implemented in this system call.

jon


* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-11  8:42     ` David Howells
@ 2018-07-11 16:03       ` Linus Torvalds
  0 siblings, 0 replies; 113+ messages in thread
From: Linus Torvalds @ 2018-07-11 16:03 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Wed, Jul 11, 2018 at 1:42 AM David Howells <dhowells@redhat.com> wrote:
>
>      Buffering till the end means you have to buffer *everything* - and,
>      unless you limit your buffer, you risk running out of RAM

Do we really care?

Can't we limit the buffer size to something small?

Right now, the mount options can't be bigger than a page anyway. Why
would we want to extend on that?
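
For comparison, a minimal sketch of the bounded "just queue it" variant might
look like the following.  This is a userspace model under stated assumptions:
the 4096-byte cap stands in for one page, and struct model_opt_buf and
model_buf_append() are made-up names, not anything in the kernel.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BUF_LIMIT 4096	/* assumed cap: one 4 KiB page */

/* Hypothetical queue of NUL-separated records awaiting a commit step. */
struct model_opt_buf {
	char data[MODEL_BUF_LIMIT];
	size_t used;
};

/* Queue one record; nothing is parsed or applied at this point. */
static int model_buf_append(struct model_opt_buf *b, const char *rec,
			    size_t len)
{
	if (len == 0)
		return -EINVAL;
	/* Need room for the record plus its terminating NUL. */
	if (len >= MODEL_BUF_LIMIT - b->used)
		return -E2BIG;

	memcpy(b->data + b->used, rec, len);
	b->used += len;
	b->data[b->used++] = '\0';
	return 0;
}
```

With a fixed cap the memory cost is bounded up front, at the price of
deferring all error reporting to whatever later step parses the queue.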

Btw, the magic word here is "why". I really really want to see a
fairly exhaustive explanation of why this all is such a big deal, and
exactly what limitations (including perhaps the mount option buffer
size) are such a pain right now and need changing.

> Now, I can replace the 'x' command with an ioctl() so that just writing random
> rubbish to the fd won't cause anything to actually happen.
>
>         fd = fsopen("ext4");
>         write(fd, "s /dev/sda1");
>         write(fd, "o user_xattr");
>         ioctl(fd, FSOPEN_IOC_CREATE_SB, 0);
>
> or I could make a special syscall for it:
>
>         fscommit(fd, FSCOMMIT_CREATE);
>
> or:
>
>         fscommit(fd, FSCOMMIT_RECONFIGURE);
>
> and require that you have CAP_SYS_ADMIN to enact it.

I think any of them sound fairly ok, with that whole "we need reasons" caveat.

               Linus


* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-10 22:44 ` [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
                     ` (2 preceding siblings ...)
  2018-07-11 15:51   ` Jonathan Corbet
@ 2018-07-11 16:18   ` David Howells
  2018-07-12 17:15   ` Greg KH
  4 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-11 16:18 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: dhowells, viro, linux-api, linux-fsdevel, torvalds, linux-kernel

Jonathan Corbet <corbet@lwn.net> wrote:

> A minor detail but ... the "r" operation mentioned above is not actually
> implemented in this system call.

Yeah, that's something I'd like to add.  NFS4 already does this inside its
->mount() method, so my thought is that we might be able to move this from
there to the core.

David


* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-11  7:22   ` David Howells
@ 2018-07-11 16:38     ` Eric Biggers
  2018-07-11 17:06     ` Andy Lutomirski
  2018-07-12 14:54     ` David Howells
  2 siblings, 0 replies; 113+ messages in thread
From: Eric Biggers @ 2018-07-11 16:38 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, viro, linux-api, linux-fsdevel, torvalds,
	linux-kernel, jannh

On Wed, Jul 11, 2018 at 08:22:41AM +0100, David Howells wrote:
> Andy Lutomirski <luto@amacapital.net> wrote:
> 
> > >    sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > >    write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
> > 
> > Imagine some malicious program passes sfd as stdout to a setuid
> > program. That program gets persuaded to write "s /etc/shadow".  What
> > happens?  You’re okay as long as *every single fs* gets it right, but that’s
> > asking a lot.
> 
> Do note that you must already have CAP_SYS_ADMIN to be able to call fsopen().
> 
> David

Not really, by default an unprivileged user can still do:

	unshare(CLONE_NEWUSER|CLONE_NEWNS);
	syscall(__NR_fsopen, "ext4", 0);

- Eric


* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-11  7:22   ` David Howells
  2018-07-11 16:38     ` Eric Biggers
@ 2018-07-11 17:06     ` Andy Lutomirski
  2018-07-12 14:54     ` David Howells
  2 siblings, 0 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-11 17:06 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Linux API, Linux FS Devel, Linus Torvalds, LKML, Jann Horn

> On Jul 11, 2018, at 12:22 AM, David Howells <dhowells@redhat.com> wrote:
>
> Andy Lutomirski <luto@amacapital.net> wrote:
>
>>>   sfd = fsopen("ext4", FSOPEN_CLOEXEC);
>>>   write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
>>
>> Imagine some malicious program passes sfd as stdout to a setuid
>> program. That program gets persuaded to write "s /etc/shadow".  What
>> happens?  You’re okay as long as *every single fs* gets it right, but that’s
>> asking a lot.
>
> Do note that you must already have CAP_SYS_ADMIN to be able to call fsopen().

If you’re not allowing it already, someone will want user namespace
root to be able to use this very, very soon.


* Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (35 preceding siblings ...)
  2018-07-10 23:01 ` [PATCH 00/32] VFS: Introduce filesystem context [ver #9] Linus Torvalds
@ 2018-07-12  0:46 ` David Howells
  2018-07-18 21:29 ` Getting rid of the usage of write() -- was " David Howells
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-12  0:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Al Viro, linux-fsdevel, Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> All your documentation (both commit logs, man-pages and in-kernel
> actual docs you add) only talk about "what".
> 
> They don't talk about _why_.
> 
> I can imagine why's. But I think that the "why" is actually way more
> important than the what. At no point did I see a "this is the current
> interface, and it doesn't work for xyz, so here's the new interface
> that allows us to do stuff".

Firstly, there are a bunch of problems with the current mount(2) syscall:

 (1) It's actually six or seven different interfaces rolled into one and weird
     combinations of flags make it do different things beyond the original
     specification of the syscall.

 (2) It produces a particularly large and diverse set of errors, which have to
     be mapped back to a small error code.  Yes, there's dmesg - if you have
     it configured - but you can't necessarily see that if you're doing a
     mount inside of a container.

 (3) It copies a PAGE_SIZE block of data for each of the type, device name and
     options.

 (4) The size of the buffers is PAGE_SIZE - and this is arch dependent.

 (5) You can't mount into another mount namespace.  If that were possible, I
     could, for example, build a container from the outside without having to
     be in that container's namespace.

 (6) It's not really geared for the specification of multiple sources, but
     some filesystems really want that - overlayfs, for example.

and some problems in the internal kernel api:

 (1) There's no defined way to supply namespace configuration for the
     superblock - so, for instance, I can't say that I want to create a
     superblock in a particular network namespace (on automount, say).

     NFS hacks around this by creating multiple shadow file_system_types with
     different ->mount() ops.

 (2) When calling mount internally, unless you have NFS-like hacks, you have
     to generate or otherwise provide text config data which then gets parsed,
     when some of the time you could bypass the parsing stage entirely.

 (3) The amount of data in the data buffer is not known, but the data buffer
     might be on a kernel stack somewhere, leading to the possibility of
     tripping the stack underrun guard.

and other issues too:

 (1) Superblock remount in some filesystems applies options on an as-parsed
     basis, so if there's a parse failure, a partial alteration with no
     rollback is effected.

 (2) Under some circumstances, the mount data may get copied multiple times so
     that it can have multiple parsers applied to it or because it has to be
     parsed multiple times - for instance, once to get the preliminary info
     required to access the on-disk superblock and then again to update the
     superblock record in the kernel.

I want to be able to add support for a bunch of things:

 (1) UID, GID and Project ID mapping/translation.  I want to be able to
     install a translation table of some sort on the superblock to translate
     source identifiers (which may be foreign numeric UIDs/GIDs, text names,
     GUIDs) into system identifiers.  This needs to be done before the
     superblock is published[*].

     Note that this may, for example, involve using the context and the
     superblock held therein to issue an RPC to a server to look up
     translations.

     [*] By "published" I mean made available through mount so that other
     	 userspace processes can access it by path.

     Maybe specifying a translation range element with something like:

	write(fd, "t uid <srcuid> <nsuid> <count>");

     The translation information also needs to propagate over an automount in
     some circumstances.

 (2) Namespace configuration.  I want to be able to tell the superblock
     creation process what namespaces should be applied when it is created (in
     particular the userns and netns) for containerisation purposes, e.g.:

	write(fd, "n user=<fd> net=<fd>");

 (3) Namespace propagation.  I want to have a properly defined mechanism for
     propagating namespace configuration over automounts within the kernel.
     This will be particularly useful for network filesystems.

 (4) Pre-mount attribute query.  A chunk of the changes is actually the
     fsinfo() syscall to query attributes of the filesystem beyond what's
     available in statx() and statfs().  This will allow a created superblock
     to be queried before it is published.

 (5) Upcall for configuration.  I would like to be able to query configuration
     that's stored in userspace when an automount is made.  For instance, to
     look up network parameters for NFS or to find a cache selector for
     fscache.

     The internal fs_context could be passed to the upcall process or the
     kernel could read a config file directly if named appropriately for the
     superblock, perhaps:

	[/etc/fscontext.d/afs/example.com/cell.cfg]
	realm = EXAMPLE.COM
	translation = uid,3000,4000,100
	fscache = tag=fred

 (6) Event notifications.  I want to be able to install a watch on a
     superblock before it is published to catch things like quota events and
     EIO.

 (7) Large and binary parameters.  There might be at some point a need to pass
     large/binary objects like Microsoft PACs around.  If I understand PACs
     correctly, you can obtain these from the Kerberos server and then pass
     them to the file server when you connect.

     Making it possible to pass large or binary objects as individual writes
     makes parsing these trivial.  OTOH, some or all of this can potentially
     be handled with the use of the keyrings interface - as the afs filesystem
     does for passing kerberos tokens around; it's just that that seems
     overkill for a parameter you may only need once.
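
To make the translation-range idea in (1) a little more concrete, here is a
userspace sketch of how a "t uid <srcuid> <nsuid> <count>" record could be
parsed and then used for lookups.  The record syntax is the one floated
above; everything else (struct model_id_range, model_parse_uid_range(),
model_translate_uid()) is hypothetical and only meant as an illustration.

```c
#include <limits.h>
#include <stdio.h>

/* One contiguous mapping: src_first..src_first+count-1 -> ns_first.. */
struct model_id_range {
	unsigned int src_first;
	unsigned int ns_first;
	unsigned int count;
};

/* Parse "t uid <srcuid> <nsuid> <count>"; 0 on success, -1 on bad input. */
static int model_parse_uid_range(const char *rec, struct model_id_range *r)
{
	char tail;

	/* The trailing %c catches junk after the third number. */
	if (sscanf(rec, "t uid %u %u %u%c", &r->src_first, &r->ns_first,
		   &r->count, &tail) != 3)
		return -1;
	if (r->count == 0)
		return -1;
	/* Reject ranges that would wrap around the ID space. */
	if (r->count - 1 > UINT_MAX - r->src_first ||
	    r->count - 1 > UINT_MAX - r->ns_first)
		return -1;
	return 0;
}

/* Translate a source ID; -1 if it falls outside the range. */
static int model_translate_uid(const struct model_id_range *r,
			       unsigned int src, unsigned int *out)
{
	if (src < r->src_first || src - r->src_first >= r->count)
		return -1;
	*out = r->ns_first + (src - r->src_first);
	return 0;
}
```

For instance, the range "3000 4000 100" would map source UID 3005 to 4005 and
leave 3100 unmapped.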

> When you have a diffstat like this:
> 
>  171 files changed, 7147 insertions(+), 1805 deletions(-)
> 
> I sure want to see an explanation for *WHY* it adds 5000+ lines of core code.

Note that there's a chunk more core code to be removed too, once all the
filesystems have been converted, including some of the added code.

> Also, I want to hear about sane security models. One of the things
> people really want to do is have users do their own mounts. We've had
> security issues in that area. Why does this improve on it, or make it
> even worse?

At the moment, I think it's fairly neutral in that regard.  Currently, you
have to have CAP_SYS_ADMIN to call fsopen() and again to call fsmount().

To supervise user-triggered mounting, I might need to add something to permit
upcalling for permission or configuration.  The handler could run in the
parent of a container, say, or be dispatched from systemd in the system root.
It should be able to restrict the sources and options that a non-privileged or
container-based mount request is given.

An upcall to an arbiter could be passed the fs-context fd as an argument and
could then use fsinfo() to query the context, including the option flags.

It also might be possible to handle this through LSM policy, particularly if I
formalise the specification of *all* sources in the context.  For example, I
could require things like:

	write(fd, "s store /dev/sda1");	// Specify the storage device
	write(fd, "s jnl /dev/sda2");	// Specify a separate journal
	write(fd, "s nfs example.com");	// Specify an NFS server
	write(fd, "s afs example.com");	// Specify an AFS cell

Then the LSMs could be asked to rule on whether the "store" and "jnl" block
devices could be used for those purposes by the caller and "nfs" or "afs"
names could be looked up in the DNS.
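
The formalised source lines above could be parsed along these lines.  This is
a minimal userspace sketch in C, assuming the line format is exactly
"s <role> <target>"; struct fs_source, parse_source_line() and
source_is_block_device() are hypothetical names used only for illustration
and are not part of the patch set.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record for one declared source; not a kernel structure. */
struct fs_source {
	char role[16];		/* e.g. "store", "jnl", "nfs", "afs" */
	char target[256];	/* device path, server name or cell name */
};

/*
 * Parse a line of the assumed form "s <role> <target>".
 * Returns 0 on success, -1 if the line is malformed or a field is too long.
 */
static int parse_source_line(const char *line, struct fs_source *src)
{
	const char *role, *sep;
	size_t role_len, target_len;

	assert(line != NULL && src != NULL);

	if (line[0] != 's' || line[1] != ' ')
		return -1;
	role = line + 2;
	sep = strchr(role, ' ');
	if (!sep || sep == role)
		return -1;
	role_len = (size_t)(sep - role);
	target_len = strlen(sep + 1);
	if (role_len >= sizeof(src->role) ||
	    target_len == 0 || target_len >= sizeof(src->target))
		return -1;

	memcpy(src->role, role, role_len);
	src->role[role_len] = '\0';
	memcpy(src->target, sep + 1, target_len + 1);	/* includes the NUL */
	return 0;
}

/* Sources in these roles are the ones an LSM might be asked to rule on. */
static int source_is_block_device(const struct fs_source *src)
{
	return strcmp(src->role, "store") == 0 ||
	       strcmp(src->role, "jnl") == 0;
}
```

With each source tagged by role, a policy hook would only need to look at the
role and target rather than re-parse a filesystem-specific option string.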

David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-11  7:22   ` David Howells
  2018-07-11 16:38     ` Eric Biggers
  2018-07-11 17:06     ` Andy Lutomirski
@ 2018-07-12 14:54     ` David Howells
  2018-07-12 15:50       ` Linus Torvalds
                         ` (3 more replies)
  2 siblings, 4 replies; 113+ messages in thread
From: David Howells @ 2018-07-12 14:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Al Viro, Linux API, Linux FS Devel, Linus Torvalds,
	LKML, Jann Horn

Andy Lutomirski <luto@kernel.org> wrote:

> > On Jul 11, 2018, at 12:22 AM, David Howells <dhowells@redhat.com> wrote:
> >
> > Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >>>   sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> >>>   write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
> >>
> >> Imagine some malicious program passes sfd as stdout to a setuid
> >> program. That program gets persuaded to write "s /etc/shadow".  What
> >> happens?  You’re okay as long as *every single fs* gets it right, but
> >> that’s asking a lot.
> >
> > Do note that you must already have CAP_SYS_ADMIN to be able to call
> > fsopen().
> 
> If you're not allowing it already, someone will want user namespace
> root to be able to use this very, very soon.

Yeah, I'm sure.  And I've been thinking on how to deal with it.

I think we *have* to open the source files/devices with the creds of whoever
called fsopen() or fspick() - that way you can't upgrade your privs by passing
your context fd to a suid program.  To enforce this, I think it's simplest for
fscontext_write() to call override_creds() right after taking the uapi_mutex
and then call revert_creds() right before dropping the mutex.

Another thing we might want to look at is to allow a supervisory process to
examine the context before permitting the create/reconfigure action to
proceed.  It might also be possible to do this through the LSM.

David



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 14:54     ` David Howells
@ 2018-07-12 15:50       ` Linus Torvalds
  2018-07-12 16:00         ` Al Viro
  2018-07-12 16:23       ` Andy Lutomirski
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 113+ messages in thread
From: Linus Torvalds @ 2018-07-12 15:50 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 7:54 AM David Howells <dhowells@redhat.com> wrote:
>
> I think we *have* to open the source files/devices with the creds of whoever
> called fsopen() or fspick() - that way you can't upgrade your privs by passing
> your context fd to a suid program.  To enforce this, I think it's simplest for
> fscontext_write() to call override_creds() right after taking the uapi_mutex
> and then call revert_creds() right before dropping the mutex.

No.

Don't play games with override_creds. It's wrong.

You have to use file->f_cred - no games, no garbage.

But "write()" simply is *NOT* a good "command" interface. If you want
to send a command, use an ioctl or a system call.

Because it's not just about credentials. It's not just about fooling a
suid app into writing an error message to a descriptor you wrote. It's
also about things like "splice()", which can write to your target
using a kernel buffer, and thus trick you into doing a command while
we have the context set to kernel addresses.

Are we trying to get away from that issue? Yes. But it's just another
example of why "write()" IS NOT TO BE USED FOR COMMANDS.

Only use write() for data.

That's final. We're not adding yet another clueless fuck-up of an
interface just because people cannot understand this very simple rule:
"write()" is for data, not for commands.

No more excuses.

                 Linus

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 15:50       ` Linus Torvalds
@ 2018-07-12 16:00         ` Al Viro
  2018-07-12 16:07           ` Linus Torvalds
  0 siblings, 1 reply; 113+ messages in thread
From: Al Viro @ 2018-07-12 16:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Andrew Lutomirski, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 08:50:46AM -0700, Linus Torvalds wrote:

> But "write()" simply is *NOT* a good "command" interface. If you want
> to send a command, use an ioctl or a system call.
> 
> Because it's not just about credentials. It's not just about fooling a
> suid app into writing an error message to a descriptor you wrote. It's
> also about things like "splice()", which can write to your target
> using a kernel buffer, and thus trick you into doing a command while
> we have the context set to kernel addresses.

Wait a sec - that's only a problem if your command contains pointer-chasing
et al.  Which is why e.g. /dev/sg is fucked in head.  But for something that
is plain text, what's the problem with splice/write/sendmsg/whatever?

I'm not talking about this particular interface, but "write is bad for
commands" as general policy looks like it misses the point.  If anything, it's
pointer-chasing crap that should be banned everywhere.  Just look at SG_IO -
it's an ioctl, and it's absolute garbage...

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 16:00         ` Al Viro
@ 2018-07-12 16:07           ` Linus Torvalds
  2018-07-12 16:31             ` Al Viro
  0 siblings, 1 reply; 113+ messages in thread
From: Linus Torvalds @ 2018-07-12 16:07 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Andrew Lutomirski, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 9:00 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Wait a sec - that's only a problem if your command contains pointer-chasing
> et al.

No.

It's a problem if anybody ever does something like "let's have a
helper splice thread that uses splice to move data automatically from
one buffer to another".

And yes, it's something people have wanted.

Seriously. I'm putting my foot down. NO COMMANDS IN WRITE DATA!

We have made that mistake in the past. Having done stupid things in
the past is not an excuse for doing so again. Quite the reverse.
Making the same mistake and not learning from your mistakes is the
sign of stupidity.

So I repeat: write is for data. If you want an action, you do it with
ioctl, or you do it with a system call.

              Linus

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 14:54     ` David Howells
  2018-07-12 15:50       ` Linus Torvalds
@ 2018-07-12 16:23       ` Andy Lutomirski
  2018-07-12 16:31         ` Linus Torvalds
                           ` (2 more replies)
  2018-07-12 20:23       ` David Howells
  2018-07-12 21:00       ` David Howells
  3 siblings, 3 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-12 16:23 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, Al Viro, Linux API, Linux FS Devel,
	Linus Torvalds, LKML, Jann Horn, tycho



> On Jul 12, 2018, at 7:54 AM, David Howells <dhowells@redhat.com> wrote:
> 
> Andy Lutomirski <luto@kernel.org> wrote:
> 
>>> On Jul 11, 2018, at 12:22 AM, David Howells <dhowells@redhat.com> wrote:
>>> 
>>> Andy Lutomirski <luto@amacapital.net> wrote:
>>> 
>>>>>  sfd = fsopen("ext4", FSOPEN_CLOEXEC);
>>>>>  write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
>>>> 
>>>> Imagine some malicious program passes sfd as stdout to a setuid
>>>> program. That program gets persuaded to write "s /etc/shadow".  What
>>>> happens?  You’re okay as long as *every single fs* gets it right, but
>>>> that’s asking a lot.
>>> 
>>> Do note that you must already have CAP_SYS_ADMIN to be able to call
>>> fsopen().
>> 
>> If you're not allowing it already, someone will want user namespace
>> root to be able to use this very, very soon.
> 
> Yeah, I'm sure.  And I've been thinking on how to deal with it.
> 
> I think we *have* to open the source files/devices with the creds of whoever
> called fsopen() or fspick() - that way you can't upgrade your privs by passing
> your context fd to a suid program.  To enforce this, I think it's simplest for
> fscontext_write() to call override_creds() right after taking the uapi_mutex
> and then call revert_creds() right before dropping the mutex.
> 

If you make a syscall that attaches a block device to an fscontext, you don’t need any of this.  Heck, someone might actually *want* to grab a block device from a different namespace.

All this override_creds() stuff is maybe okay if we were fixing an old broken thing. But this is brand new.  And having write() call override_creds() and do nontrivial things is a fascinating attack surface.

Just imagine what blows up if I abuse fscontext to open a block device on a path that traverses an AFS mount or /proc/.../fd or similar.  Or if I splice() from a network filesystem into fscontext.

(Al- can’t we just stop allowing splice() at all on things that don’t use iov_iter?)

> Another thing we might want to look at is to allow a supervisory process to
> examine the context before permitting the create/reconfigure action to
> proceed.  It might also be possible to do this through the LSM.

Cc Tycho. He’s working on this exact idea using seccomp. And he’d probably much, much prefer if configuration of an fscontext didn’t use a performance-critical syscall like write().

As a straw man, I suggest:

fsconfigure(contextfd, ADD_BLOCKDEV, dfd, path, flags);

fsconfigure(contextfd, ADD_OPTION, 0, "foo=bar", flags);

Etc.  
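
To make the straw man concrete, here is a minimal userspace model in C of
what a typed, one-command-per-call interface might record.  It is only a
sketch: fsconfigure_model(), struct fscfg_context and the FSCFG_* names are
hypothetical, nothing here touches the kernel, and the flags argument from
the straw man is left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical command codes mirroring ADD_OPTION / ADD_BLOCKDEV. */
enum fscfg_cmd { FSCFG_ADD_OPTION, FSCFG_ADD_BLOCKDEV };

#define FSCFG_MAX_ENTRIES 8

struct fscfg_entry {
	enum fscfg_cmd cmd;
	int dfd;		/* only meaningful for FSCFG_ADD_BLOCKDEV */
	char value[128];	/* option string or path relative to dfd */
};

struct fscfg_context {
	struct fscfg_entry entries[FSCFG_MAX_ENTRIES];
	int count;
};

/*
 * Record one complete, typed command in the context.
 * Returns 0 on success, -1 on an unknown command, a full context, or an
 * empty or over-long value.
 */
static int fsconfigure_model(struct fscfg_context *ctx, enum fscfg_cmd cmd,
			     int dfd, const char *value)
{
	struct fscfg_entry *e;

	assert(ctx != NULL && value != NULL);

	if (cmd != FSCFG_ADD_OPTION && cmd != FSCFG_ADD_BLOCKDEV)
		return -1;
	if (ctx->count >= FSCFG_MAX_ENTRIES)
		return -1;
	if (value[0] == '\0' || strlen(value) >= sizeof(e->value))
		return -1;

	e = &ctx->entries[ctx->count++];
	e->cmd = cmd;
	e->dfd = (cmd == FSCFG_ADD_BLOCKDEV) ? dfd : -1;
	strcpy(e->value, value);	/* length checked above */
	return 0;
}
```

Because each call carries exactly one command and its arguments, nothing has
to be re-buffered or split out of a byte stream before it can be validated.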

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 16:07           ` Linus Torvalds
@ 2018-07-12 16:31             ` Al Viro
  2018-07-12 16:39               ` Linus Torvalds
  0 siblings, 1 reply; 113+ messages in thread
From: Al Viro @ 2018-07-12 16:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Andrew Lutomirski, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 09:07:36AM -0700, Linus Torvalds wrote:
> On Thu, Jul 12, 2018 at 9:00 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Wait a sec - that's only a problem if your command contains pointer-chasing
> > et al.
> 
> No.
> 
> It's a problem if anybody ever does something like "let's have a
> helper splice thread that uses splice to move data automatically from
> one buffer to another".
>
> And yes, it's something people have wanted.
> 
> Seriously. I'm putting my foot down. NO COMMANDS IN WRITE DATA!
> 
> We have made that mistake in the past. Having done stupid things in
> the past is not an excuse for doing so again. Quite the reverse.
> Making the same mistake and not learning from your mistakes is the
> sign of stupidity.
> 
> So I repeat: write is for data. If you want an action, you do it with
> ioctl, or you do it with a system call.

*shrug*

I think you are wrong[1], but it's your decision.  And seriously, ioctl?
_That_ has a great track record...

[1] one man's data is another man's commands, for starters.  All networking
protocols would fit your description.  So would ANSI escape sequences ("move
cursor to line 12 column 45" does sound like a command), so would writing
postscript to printer, etc.

IME it's more about data structures that are not marshalled cleanly - that
tends to go badly wrong.  Again, see SG_IO for recent example...

Anyway, your tree, your policy.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 16:23       ` Andy Lutomirski
@ 2018-07-12 16:31         ` Linus Torvalds
  2018-07-12 16:41         ` Al Viro
  2018-07-12 16:58         ` Al Viro
  2 siblings, 0 replies; 113+ messages in thread
From: Linus Torvalds @ 2018-07-12 16:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Howells, Andrew Lutomirski, Al Viro, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Jann Horn,
	Tycho Andersen

On Thu, Jul 12, 2018 at 9:23 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> (Al- can’t we just stop allowing splice() at all on things that don’t use iov_iter?)

We could add a FMODE_SPLICE_READ/WRITE bit, and let people opt in to
splice. We probably should have.

But again, that really doesn't change the fundamentals.  Using write()
for commands is stupid.

It also means that you have to _parse_ all the damn input at that
level, which is a mistake too. It easily leads to insane decisions
like "you have to use 'write()' calls without buffering", because
re-buffering the stream is a f*cking pain.

Just say no. Seriously. Stop this idiotic discussion.

I'm just happy this came up early, because that way I know to look out
for it and not merge it.

                 Linus

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 16:31             ` Al Viro
@ 2018-07-12 16:39               ` Linus Torvalds
  2018-07-12 17:14                 ` Linus Torvalds
  2018-07-12 17:52                 ` Al Viro
  0 siblings, 2 replies; 113+ messages in thread
From: Linus Torvalds @ 2018-07-12 16:39 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Andrew Lutomirski, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 9:31 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> And seriously, ioctl? _That_ has a great track record...

I agree that a system call is likely saner. Especially since we'd have
one to _start_ this (ie "fsopen()") it would make sense to have the
one to finalize it.

> [1] one man's data is another man's commands, for starters.  All networking
> protocols would fit your description.  So would ANSI escape sequences ("move
> cursor to line 12 column 45" does sound like a command), so would writing
> postscript to printer, etc.

.. and all of that is just data to the kernel.

Yes, vt100 escape sequences etc _are_ commands, and boy have we had
bugs in that area. But there the excuse is "that's how the world is".

The thing is, "reality" is the ultimate argument. You can't argue with
cold hard facts.

But when designing a new interface that doesn't have that kind of
constraints, do it right.

> IME it's more about data structures that are not marshalled cleanly - that
> tends to go badly wrong.  Again, see SG_IO for recent example...

SG_IO actually gets it right. It doesn't do async, but that's part of
the design (and a big part of why it's a lot simpler - the read-write
thing is actually broken too and just forces user space to basically
know SCSI).

              Linus

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 16:23       ` Andy Lutomirski
  2018-07-12 16:31         ` Linus Torvalds
@ 2018-07-12 16:41         ` Al Viro
  2018-07-12 16:58         ` Al Viro
  2 siblings, 0 replies; 113+ messages in thread
From: Al Viro @ 2018-07-12 16:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Howells, Andy Lutomirski, Linux API, Linux FS Devel,
	Linus Torvalds, LKML, Jann Horn, tycho

On Thu, Jul 12, 2018 at 09:23:22AM -0700, Andy Lutomirski wrote:

> If you make a syscall that attaches a block device to an fscontext, you don’t need any of this.  Heck, someone might actually *want* to grab a block device from a different namespace.

Fuck, NO.  The whole notion of "block device of filesystem" is fucking
garbage.  It's up to filesystem driver whether it uses any block
devices.  For backing store or otherwise.  Single or multiple.  Moreover,
it's up to filesystem driver whether it cares if backing store is
a block device, or mtd device, or...

Repeat after me: syscall that attaches a block device to an fscontext
makes as much sense as a syscall that attaches a charset name to the
same.  With a special syscall for attaching a timestamp granularity,
and another for selecting GID semantics on subdirectory creation.

Commit vs. write separation is one thing; fuckloads of special syscalls
for passing vaguely defined classes of mount options (which device
name *is*) is quite different.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 16:23       ` Andy Lutomirski
  2018-07-12 16:31         ` Linus Torvalds
  2018-07-12 16:41         ` Al Viro
@ 2018-07-12 16:58         ` Al Viro
  2018-07-12 17:54           ` Andy Lutomirski
  2 siblings, 1 reply; 113+ messages in thread
From: Al Viro @ 2018-07-12 16:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Howells, Andy Lutomirski, Linux API, Linux FS Devel,
	Linus Torvalds, LKML, Jann Horn, tycho

On Thu, Jul 12, 2018 at 09:23:22AM -0700, Andy Lutomirski wrote:
 
> As a straw man, I suggest:
> 
> fsconfigure(contextfd, ADD_BLOCKDEV, dfd, path, flags);
> 
> fsconfigure(contextfd, ADD_OPTION, 0, "foo=bar", flags);

Bollocks.  First of all, block device *IS* a fucking option.
Always had been.  It's not even that it's passed as a separate
argument for historical reasons - just look at NFS.  That argument
is a detached part of options, parsed (yes, *parsed*) by filesystem
in question in whatever way it prefers.

Look at the things like e.g. cramfs.  That argument is interpreted
as pathname of block device.  Or that of mtd device.  Or the magic
string "mtd" followed by mtd number.

What's more, filesystems can and do live on more than one device.
Like e.g. btrfs.  Or like something journalled with the journal
on separate device.  So you do *NOT* get away from the need to
open stuff while doing mount - not unless you introduce arseloads
of ADD_... shite in your scheme.  And create a huge centralized
pile of code dealing with it.  ADD_NFS_IPV4_SERVER_AND_PATH, etc.?

You can't avoid parsing stuff.  It's one thing to argue at which
*point* you prefer doing that, but it has to be done kernel-side.
Format of filesystem options is fundamentally up to filesystem,
whichever syscall you use.
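
The cramfs case described above can be sketched as a small classifier.  This
is a userspace illustration in C covering only the forms named in the text (a
device pathname, or "mtd" followed by a number); classify_source() and enum
source_kind are hypothetical, and the real driver's parsing may accept more
than this.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum source_kind {
	SOURCE_INVALID,
	SOURCE_MTD_NUMBER,	/* "mtd" followed by an mtd number */
	SOURCE_DEVICE_PATH,	/* pathname of a block or mtd device node */
};

/*
 * Classify a source string the way the text describes.  On
 * SOURCE_MTD_NUMBER the parsed number is stored in *mtd_nr.
 */
static enum source_kind classify_source(const char *src, long *mtd_nr)
{
	assert(src != NULL && mtd_nr != NULL);

	if (strncmp(src, "mtd", 3) == 0 && isdigit((unsigned char)src[3])) {
		char *end;
		long nr = strtol(src + 3, &end, 10);

		if (*end != '\0')
			return SOURCE_INVALID;	/* trailing junk after number */
		*mtd_nr = nr;
		return SOURCE_MTD_NUMBER;
	}
	if (src[0] == '/')
		return SOURCE_DEVICE_PATH;	/* node type needs a later lookup */
	return SOURCE_INVALID;
}
```

Even this toy version has to parse the string before it knows what to open,
which is the point being made: the interpretation belongs to the filesystem.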

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 16:39               ` Linus Torvalds
@ 2018-07-12 17:14                 ` Linus Torvalds
  2018-07-12 17:44                   ` Al Viro
  2018-07-12 17:52                 ` Al Viro
  1 sibling, 1 reply; 113+ messages in thread
From: Linus Torvalds @ 2018-07-12 17:14 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Andrew Lutomirski, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 9:39 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I agree that a system call is likely saner. Especially since we'd have
> one to _start_ this (ie "fsopen()") it would make sense to have the
> one to finalize it.

Side note: if we can make do with just a buffer, then we wouldn't need
"fsopen()". You could literally just open a pipe, and write to it.
It's got 16 pages worth of buffers by default, and you can increase it
(within reason) as root.

Of course, depending on IO patterns, not all the buffer pages are
necessarily fully used, so it's not like you get a buffer of size
PAGE_SIZE*16, but we do merge buffers so you should be fairly close.

Then you really could do without a fsopen(). Just fill a pipe with
data, and do "fsmount()" on the pipe contents.

Added upside? You can use "iov_iter_pipe()" to iterate over all that data.

I'm only half joking.
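
The buffering half of that idea can be tried from userspace today.  A minimal
C sketch, assuming Linux (F_GETPIPE_SZ is Linux-specific): it stages option
text in a pipe and reads it back, with the read standing in for the
hypothetical "fsmount() on the pipe contents".  stage_options_in_pipe() is an
illustrative name, not an existing interface.

```c
#define _GNU_SOURCE		/* for F_GETPIPE_SZ */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ 1032	/* Linux value, in case _GNU_SOURCE came too late */
#endif

/*
 * Write opts into a fresh pipe, then read it back into out.
 * If capacity is non-NULL it receives the pipe's buffer size in bytes
 * (or -1 if the query fails).  Returns bytes read back, or -1 on error.
 */
static ssize_t stage_options_in_pipe(const char *opts, char *out, size_t outsz,
				     int *capacity)
{
	int p[2];
	ssize_t n;
	size_t len;

	assert(opts != NULL && out != NULL && outsz > 0);
	len = strlen(opts);

	if (pipe(p) < 0)
		return -1;

	if (capacity)
		*capacity = fcntl(p[1], F_GETPIPE_SZ);

	/* Keep the payload small so the write cannot block on a full pipe. */
	if (len == 0 || len > 4096 || len >= outsz ||
	    write(p[1], opts, len) != (ssize_t)len) {
		close(p[0]);
		close(p[1]);
		return -1;
	}
	close(p[1]);		/* writer done: reader sees EOF after the data */

	n = read(p[0], out, outsz - 1);
	close(p[0]);
	if (n < 0)
		return -1;
	out[n] = '\0';
	return n;
}
```

The capacity reported by F_GETPIPE_SZ is the figure the "16 pages by default"
remark refers to; the usable amount can be lower depending on write patterns.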

            Linus

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-10 22:44 ` [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
                     ` (3 preceding siblings ...)
  2018-07-11 16:18   ` David Howells
@ 2018-07-12 17:15   ` Greg KH
  2018-07-12 17:20     ` Al Viro
  4 siblings, 1 reply; 113+ messages in thread
From: Greg KH @ 2018-07-12 17:15 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-api, linux-fsdevel, torvalds, linux-kernel

On Tue, Jul 10, 2018 at 11:44:09PM +0100, David Howells wrote:
> Provide an fsopen() system call that starts the process of preparing to
> create a superblock that will then be mountable, using an fd as a context
> handle.  fsopen() is given the name of the filesystem that will be used:
> 
> 	int mfd = fsopen(const char *fsname, unsigned int flags);
> 
> where flags can be 0 or FSOPEN_CLOEXEC.
> 
> For example:
> 
> 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> 	write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
> 	write(sfd, "o noatime");
> 	write(sfd, "o acl");
> 	write(sfd, "o user_attr");
> 	write(sfd, "o iversion");
> 	write(sfd, "o ");
> 	write(sfd, "r /my/container"); // root inside the fs
> 	write(sfd, "x create"); // create the superblock

Ugh, creating configfs again in a syscall form?  I know people love
file descriptors, but can't you do this with a configfs entry instead if
you really want to do this type of thing from userspace in this type of
"style"?

Why reinvent the wheel again?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 17:15   ` Greg KH
@ 2018-07-12 17:20     ` Al Viro
  2018-07-12 18:03       ` Greg KH
  0 siblings, 1 reply; 113+ messages in thread
From: Al Viro @ 2018-07-12 17:20 UTC (permalink / raw)
  To: Greg KH; +Cc: David Howells, linux-api, linux-fsdevel, torvalds, linux-kernel

On Thu, Jul 12, 2018 at 07:15:05PM +0200, Greg KH wrote:
> On Tue, Jul 10, 2018 at 11:44:09PM +0100, David Howells wrote:
> > Provide an fsopen() system call that starts the process of preparing to
> > create a superblock that will then be mountable, using an fd as a context
> > handle.  fsopen() is given the name of the filesystem that will be used:
> > 
> > 	int mfd = fsopen(const char *fsname, unsigned int flags);
> > 
> > where flags can be 0 or FSOPEN_CLOEXEC.
> > 
> > For example:
> > 
> > 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > 	write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
> > 	write(sfd, "o noatime");
> > 	write(sfd, "o acl");
> > 	write(sfd, "o user_attr");
> > 	write(sfd, "o iversion");
> > 	write(sfd, "o ");
> > 	write(sfd, "r /my/container"); // root inside the fs
> > 	write(sfd, "x create"); // create the superblock
> 
> Ugh, creating configfs again in a syscall form?  I know people love
> file descriptors, but can't you do this with a configfs entry instead if
> you really want to do this type of thing from userspace in this type of
> "style"?
> 
> Why reinvent the wheel again?

The damn thing REALLY, REALLY depends upon the fs type.  How would
you map it on configfs?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 17:14                 ` Linus Torvalds
@ 2018-07-12 17:44                   ` Al Viro
  2018-07-12 17:54                     ` Linus Torvalds
  0 siblings, 1 reply; 113+ messages in thread
From: Al Viro @ 2018-07-12 17:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Andrew Lutomirski, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 10:14:05AM -0700, Linus Torvalds wrote:
> On Thu, Jul 12, 2018 at 9:39 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I agree that a system call is likely saner. Especially since we'd have
> > one to _start_ this (ie "fsopen()") it would make sense to have the
> > one to finalize it.
> 
> Side note: if we can make do with just a buffer, then we wouldn't need
> "fsopen()". You could literally just open a pipe, and write to it.
> It's got 16 pages worth of buffers by default, and you can increase it
> (within reason) as root.
> 
> Of course, depending on IO patterns, not all the buffer pages are
> necessarily fully used, so it's not like you get a buffer of size
> PAGE_SIZE*16, but we do merge buffers so you should be fairly close.
> 
> Then you really could do without a fsopen(). Just fill a pipe with
> data, and do "fsmount()" on the pipe contents.
> 
> Added upside? You can use "iov_iter_pipe()" to iterate over all that data.
> 
> I'm only half joking.

One semi-historical note here.

Originally, mount(2) (and it had been there since v1) had only one filesystem
type to deal with.  So it was really just "mount <block device pathname> on
<mountpoint pathname>, read-only or read-write".  3 arguments, two strings and
one flag (flag, BTW, was a later addition).

It didn't last.  I can dig out the archaeological notes and cut'n'paste the
whole horror story here, but that'll be way too long and scary.

By 4.2BSD times there had been essentially an enum encoding the filesystem
type and type-tagged union of structs with type-dependent options.  Plus
some options taking more bits in what used to be "is it r/w?" flag.

Leaving aside the whole "mount new/bind/remount/etc." overloading we have
in mount(2) today, we have a bunch of named filesystems, each with its
own set of options.  The device name stopped being anything special many
decades ago; the type name is what's universally present and that's what
decides how the rest (including "device name") is to be interpreted.

Fundamentally, we start with selecting (by name) a filesystem driver we'll
be talking to.  The rest (device name + string options + flags like noexec
that are not handled on VFS level) is given to that driver, which either
tells us to take a hike or gives us a dentry tree that can be attached.

Separating type name from everything else makes a lot of sense, simply
because it's what determines the parsing and interpretation of the rest.
Speaking of half-joking, I suggested AF_FSTYPE at some point.  Then
fsopen(2) would be connect(2)...

I think that having that (connection used to talk to fs driver, with or
without an already set up fs instance we are talking about) as first-class
object makes sense.  That's completely unrelated to the question of buffering,
of course.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 16:39               ` Linus Torvalds
  2018-07-12 17:14                 ` Linus Torvalds
@ 2018-07-12 17:52                 ` Al Viro
  1 sibling, 0 replies; 113+ messages in thread
From: Al Viro @ 2018-07-12 17:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Andrew Lutomirski, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 09:39:31AM -0700, Linus Torvalds wrote:

> > [1] one man's data is another man's commands, for starters.  All networking
> > protocols would fit your description.  So would ANSI escape sequences ("move
> > cursor to line 12 column 45" does sound like a command), so would writing
> > postscript to printer, etc.
> 
> .. and all of that is just data to the kernel.
> 
> Yes, vt100 escape sequences etc _are_ commands, and boy have we had
> bugs in that area. But there the excuse is "that's how the world is".

... along with "something similar to ncurses-based programs usable over
ssh is a good thing to have, without having said ssh somehow intercept
and marshal ioctls" ;-)  I can just imagine something e.g. RDMA people
would've designed instead... OTOH, I'm eating right now, so better
not go there...

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 17:44                   ` Al Viro
@ 2018-07-12 17:54                     ` Linus Torvalds
  0 siblings, 0 replies; 113+ messages in thread
From: Linus Torvalds @ 2018-07-12 17:54 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Andrew Lutomirski, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 10:44 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Separating type name from everything else makes a lot of sense

I do not dispute that at all.

But you can specify the type name in the "commit" phase, it doesn't
have to be at "fsopen" time.

In fact, doing so would _force_ a certain cleanliness to the
interfaces - it would force the rest to be filesystem-agnostic, rather
than possibly have semantic hacks for some part.

            Linus

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 16:58         ` Al Viro
@ 2018-07-12 17:54           ` Andy Lutomirski
  0 siblings, 0 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-12 17:54 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Andy Lutomirski, Linux API, Linux FS Devel,
	Linus Torvalds, LKML, Jann Horn, tycho



> On Jul 12, 2018, at 9:58 AM, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
>> On Thu, Jul 12, 2018 at 09:23:22AM -0700, Andy Lutomirski wrote:
>> 
>> As a straw man, I suggest:
>> 
>> fsconfigure(contextfd, ADD_BLOCKDEV, dfd, path, flags);
>> 
>> fsconfigure(contextfd, ADD_OPTION, 0, "foo=bar", flags);
> 
> Bollocks.  First of all, block device *IS* a fucking option.
> Always had been.  It's not even that it's passed as a separate
> argument for historical reasons - just look at NFS.  That argument
> is a detached part of options, parsed (yes, *parsed*) by filesystem
> in question in whatever way it prefers.

Fine, then generalize it. fsconfigure(context, ADD_FD, “some fs-specific string explaining what’s going on”, fd);  The point being that there are tons of cases where the filesystem wants to identify some backing store by some device node, and it seems like we should support something along the lines of a modern *at interface.

If I’m writing a daemon that deals with filesystems, I don’t want an API that looks like do_god_knows_what(context, “filesystem specific string that may contain a path to a device node or a network address”). That API will be a pain to use, since that opaque string may come from some random config file and I have no clue what it does. If I want to pass a device node or other object to a filesystem, I want to pass an fd (so I can use openat, SCM_CREDS, etc), and I want it to be crystal clear that I’m passing some object in. And if I tell a filesystem to access the network, I want it to be entirely clear which network namespace is in use.

I realize that doing this right is tricky when there are lots of legacy filesystems that parse opaque strings. That’s fine. We can convert things slowly.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 17:20     ` Al Viro
@ 2018-07-12 18:03       ` Greg KH
  2018-07-12 18:30         ` Andy Lutomirski
  0 siblings, 1 reply; 113+ messages in thread
From: Greg KH @ 2018-07-12 18:03 UTC (permalink / raw)
  To: Al Viro; +Cc: David Howells, linux-api, linux-fsdevel, torvalds, linux-kernel

On Thu, Jul 12, 2018 at 06:20:24PM +0100, Al Viro wrote:
> On Thu, Jul 12, 2018 at 07:15:05PM +0200, Greg KH wrote:
> > On Tue, Jul 10, 2018 at 11:44:09PM +0100, David Howells wrote:
> > > Provide an fsopen() system call that starts the process of preparing to
> > > create a superblock that will then be mountable, using an fd as a context
> > > handle.  fsopen() is given the name of the filesystem that will be used:
> > > 
> > > 	int mfd = fsopen(const char *fsname, unsigned int flags);
> > > 
> > > where flags can be 0 or FSOPEN_CLOEXEC.
> > > 
> > > For example:
> > > 
> > > 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > > 	write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
> > > 	write(sfd, "o noatime");
> > > 	write(sfd, "o acl");
> > > 	write(sfd, "o user_attr");
> > > 	write(sfd, "o iversion");
> > > 	write(sfd, "o ");
> > > 	write(sfd, "r /my/container"); // root inside the fs
> > > 	write(sfd, "x create"); // create the superblock
> > 
> > Ugh, creating configfs again in a syscall form?  I know people love
> > file descriptors, but can't you do this with a configfs entry instead if
> > you really want to do this type of thing from userspace in this type of
> > "style"?
> > 
> > Why reinvent the wheel again?
> 
> The damn thing REALLY, REALLY depends upon the fs type.  How would
> you map it on configfs?

/sys/kernel/config/fs/ext4/ would work, right?  Each fs "type" would be
listed there.

Anyway, the whole "write a bunch of options and then do a 'create'" is
exactly the way configfs works.  Why not use that?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 18:03       ` Greg KH
@ 2018-07-12 18:30         ` Andy Lutomirski
  2018-07-12 18:34           ` Al Viro
  2018-07-12 19:08           ` Greg KH
  0 siblings, 2 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-12 18:30 UTC (permalink / raw)
  To: Greg KH
  Cc: Al Viro, David Howells, linux-api, linux-fsdevel, torvalds, linux-kernel


> On Jul 12, 2018, at 11:03 AM, Greg KH <gregkh@linuxfoundation.org> wrote:
> 
>> On Thu, Jul 12, 2018 at 06:20:24PM +0100, Al Viro wrote:
>>> On Thu, Jul 12, 2018 at 07:15:05PM +0200, Greg KH wrote:
>>>> On Tue, Jul 10, 2018 at 11:44:09PM +0100, David Howells wrote:
>>>> Provide an fsopen() system call that starts the process of preparing to
>>>> create a superblock that will then be mountable, using an fd as a context
>>>> handle.  fsopen() is given the name of the filesystem that will be used:
>>>> 
>>>>    int mfd = fsopen(const char *fsname, unsigned int flags);
>>>> 
>>>> where flags can be 0 or FSOPEN_CLOEXEC.
>>>> 
>>>> For example:
>>>> 
>>>>    sfd = fsopen("ext4", FSOPEN_CLOEXEC);
>>>>    write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
>>>>    write(sfd, "o noatime");
>>>>    write(sfd, "o acl");
>>>>    write(sfd, "o user_attr");
>>>>    write(sfd, "o iversion");
>>>>    write(sfd, "o ");
>>>>    write(sfd, "r /my/container"); // root inside the fs
>>>>    write(sfd, "x create"); // create the superblock
>>> 
>>> Ugh, creating configfs again in a syscall form?  I know people love
>>> file descriptors, but can't you do this with a configfs entry instead if
>>> you really want to do this type of thing from userspace in this type of
>>> "style"?
>>> 
>>> Why reinvent the wheel again?
>> 
>> The damn thing REALLY, REALLY depends upon the fs type.  How would
>> you map it on configfs?
> 
> /sys/kernel/config/fs/ext4/ would work, right?  Each fs "type" would be
> listed there.
> 
> Anyway, the whole "write a bunch of options and then do a 'create'" is
> exactly the way configfs works.  Why not use that?
> 
> 

How do you mount configfs in the first place?  And how do you use this in a mount namespace without a private configfs instance or where you don’t want configfs mounted?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 18:30         ` Andy Lutomirski
@ 2018-07-12 18:34           ` Al Viro
  2018-07-12 18:35             ` Al Viro
  2018-07-12 19:08           ` Greg KH
  1 sibling, 1 reply; 113+ messages in thread
From: Al Viro @ 2018-07-12 18:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Greg KH, David Howells, linux-api, linux-fsdevel, torvalds, linux-kernel

On Thu, Jul 12, 2018 at 11:30:32AM -0700, Andy Lutomirski wrote:

Andi, Greg - alt.tasteless is over -> that way.

And for fsck sake, fix your MUA.  Lines are obscenely long...

> How do you mount configfs in the first place?  And how do you use this in a mount namespace without a private configfs instance or where you don’t want configfs mounted?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 18:34           ` Al Viro
@ 2018-07-12 18:35             ` Al Viro
  0 siblings, 0 replies; 113+ messages in thread
From: Al Viro @ 2018-07-12 18:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Greg KH, David Howells, linux-api, linux-fsdevel, torvalds, linux-kernel

On Thu, Jul 12, 2018 at 07:34:26PM +0100, Al Viro wrote:
> On Thu, Jul 12, 2018 at 11:30:32AM -0700, Andy Lutomirski wrote:
> 
> Andi,

Apologies for misspelling - finger macros strike ;-/

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 18:30         ` Andy Lutomirski
  2018-07-12 18:34           ` Al Viro
@ 2018-07-12 19:08           ` Greg KH
  1 sibling, 0 replies; 113+ messages in thread
From: Greg KH @ 2018-07-12 19:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, David Howells, linux-api, linux-fsdevel, torvalds, linux-kernel

On Thu, Jul 12, 2018 at 11:30:32AM -0700, Andy Lutomirski wrote:
> 
> > On Jul 12, 2018, at 11:03 AM, Greg KH <gregkh@linuxfoundation.org> wrote:
> > 
> >> On Thu, Jul 12, 2018 at 06:20:24PM +0100, Al Viro wrote:
> >>> On Thu, Jul 12, 2018 at 07:15:05PM +0200, Greg KH wrote:
> >>>> On Tue, Jul 10, 2018 at 11:44:09PM +0100, David Howells wrote:
> >>>> Provide an fsopen() system call that starts the process of preparing to
> >>>> create a superblock that will then be mountable, using an fd as a context
> >>>> handle.  fsopen() is given the name of the filesystem that will be used:
> >>>> 
> >>>>    int mfd = fsopen(const char *fsname, unsigned int flags);
> >>>> 
> >>>> where flags can be 0 or FSOPEN_CLOEXEC.
> >>>> 
> >>>> For example:
> >>>> 
> >>>>    sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> >>>>    write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
> >>>>    write(sfd, "o noatime");
> >>>>    write(sfd, "o acl");
> >>>>    write(sfd, "o user_attr");
> >>>>    write(sfd, "o iversion");
> >>>>    write(sfd, "o ");
> >>>>    write(sfd, "r /my/container"); // root inside the fs
> >>>>    write(sfd, "x create"); // create the superblock
> >>> 
> >>> Ugh, creating configfs again in a syscall form?  I know people love
> >>> file descriptors, but can't you do this with a configfs entry instead if
> >>> you really want to do this type of thing from userspace in this type of
> >>> "style"?
> >>> 
> >>> Why reinvent the wheel again?
> >> 
> >> The damn thing REALLY, REALLY depends upon the fs type.  How would
> >> you map it on configfs?
> > 
> > /sys/kernel/config/fs/ext4/ would work, right?  Each fs "type" would be
> > listed there.
> > 
> > Anyway, the whole "write a bunch of options and then do a 'create'" is
> > exactly the way configfs works.  Why not use that?
> > 
> > 
> 
> How do you mount configfs in the first place?  And how do you use this
> in a mount namespace without a private configfs instance or where you
> don’t want configfs mounted?

Ok, fair enough, I missed the part where this is going to replace
mount(2).  Although you could just use mount(2) to mount configfs on a
mount point in the initramfs image and then go from there at boot time :)

/me runs away...

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 14:54     ` David Howells
  2018-07-12 15:50       ` Linus Torvalds
  2018-07-12 16:23       ` Andy Lutomirski
@ 2018-07-12 20:23       ` David Howells
  2018-07-12 20:25         ` Andy Lutomirski
                           ` (2 more replies)
  2018-07-12 21:00       ` David Howells
  3 siblings, 3 replies; 113+ messages in thread
From: David Howells @ 2018-07-12 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Andrew Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Don't play games with override_creds. It's wrong.
> 
> You have to use file->f_creds - no games, no garbage.

You missed the point.

It's all very well to say "use file->f_creds".  The problem is this has to be
handed down all the way through the filesystem and down into the block layer
as appropriate to anywhere there's an LSM call, a CAP_* check or a pathwalk -
but there's not currently any way to do that.

mount_bdev() and blkdev_get_by_path() are examples of this.  At the moment
there is no cred parameter there.  We'd also have to pass the creds down into
path_init() to store in struct nameidata and make sure that every permissions
call that might be invoked during pathwalk in every filesystem uses that, not
current_cred().

I made an attempt to do this a while ago and the patch got rather large before
I gave up.  In many ways, it's what we *should* do, but so many things need an
extra parameter...  If you really want, I can try that again.  It's possible I
can automate it with some perl scripting to parse the error messages from the
compiler.

My suggestion was to use override_creds() to impose the appropriate creds at
the top, be that file->f_creds or fs_context->creds (they would be the same in
any case).

If we want to go down the pass-the-creds-down route, then we can temporarily
do override_creds() until we've made the changes and then remove it later.

> But "write()" simply is *NOT* a good "command" interface. If you want
> to send a command, use an ioctl or a system call.

Okay.

> Because it's not just about credentials. It's not just about fooling a
> suid app into writing an error message to a descriptor you wrote. It's
> also about things like "splice()", which can write to your target
> using a kernel buffer, and thus trick you into doing a command while
> we have the context set to kernel addresses.
> 
> Are we trying to get away from that issue? Yes. But it's just another
> example of why "write()" IS NOT TO BE USED FOR COMMANDS.

Btw, do we protect sysfs, debugfs, tracefs, procfs, etc. writes against
splice?  Some of the things in debugfs are really icky, allowing you to muck
directly with hardware.

David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 20:23       ` David Howells
@ 2018-07-12 20:25         ` Andy Lutomirski
  2018-07-12 20:34         ` Linus Torvalds
  2018-07-12 21:26         ` David Howells
  2 siblings, 0 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-12 20:25 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Andrew Lutomirski, Al Viro, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Jann Horn



> On Jul 12, 2018, at 1:23 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> Don't play games with override_creds. It's wrong.
>> 
>> You have to use file->f_creds - no games, no garbage.
> 
> You missed the point.
> 

> 
> My suggestion was to use override_creds() to impose the appropriate creds at
> the top, be that file->f_creds or fs_context->creds (they would be the same in
> any case).

I think it should be a new syscall and use current’s creds. No override needed.


> Btw, do we protect sysfs, debugfs, tracefs, procfs, etc. writes against
> splice?  Some of the things in debugfs are really icky, allowing you to muck
> directly with hardware.
> 

We try. It has been a perennial source of severe bugs.

This is part of why I’d like to see splice() be an opt in. Also, it’s a major step toward getting rid of set_fs().

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 20:23       ` David Howells
  2018-07-12 20:25         ` Andy Lutomirski
@ 2018-07-12 20:34         ` Linus Torvalds
  2018-07-12 20:36           ` Linus Torvalds
  2018-07-12 21:26         ` David Howells
  2 siblings, 1 reply; 113+ messages in thread
From: Linus Torvalds @ 2018-07-12 20:34 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 1:23 PM David Howells <dhowells@redhat.com> wrote:
>
> It's all very well to say "use file->f_creds".  The problem is this has to be
> handed down all the way through the filesystem and down into the block layer
> as appropriate to anywhere there's an LSM call, a CAP_* check or a pathwalk -
> but there's not currently any way to do that.

.. and the reason is simple: you damn well shouldn't do that.

The unix semantics are that credentials are checked at open time.

If your interface involves checking credentials at write() time, your
interface is garbage shit.

Really.

This is the whole "write() is only for data". If you ever have
credentials mattering at write time, you're doing something wrong.

Really really.

Don't do it.

             Linus
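
For illustration, a small userspace sketch of the open-time rule for an ordinary file: the access check happens when the file is opened, and a later write() through the same fd is not checked against the file mode again.  The helper name write_after_chmod and the /tmp path template are assumptions of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a temp file, drop all permission bits with fchmod(), then write
 * through the already-open fd.  Returns the number of bytes written, or -1.
 */
static ssize_t write_after_chmod(void)
{
	char path[] = "/tmp/open-time-XXXXXX";
	ssize_t n;
	int fd;

	fd = mkstemp(path);		/* access is checked here */
	if (fd < 0)
		return -1;
	if (fchmod(fd, 0) < 0) {	/* mode is 000 from now on */
		close(fd);
		unlink(path);
		return -1;
	}
	n = write(fd, "data", 4);	/* mode bits are not rechecked */
	close(fd);
	unlink(path);
	return n;
}
```

The write still succeeds after the mode has been cleared, which is the behaviour an interface breaks if it starts looking at credentials at write() time.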

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 20:34         ` Linus Torvalds
@ 2018-07-12 20:36           ` Linus Torvalds
  0 siblings, 0 replies; 113+ messages in thread
From: Linus Torvalds @ 2018-07-12 20:36 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 1:34 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This is the whole "write() is only for data". If you ever have
> credentials mattering at write time, you're doing something wrong.
>
> Really really.
>
> Don't do it.

.. and I'd like to repeat: we *have* done things wrong. But that's
simply not an excuse. We've done it wrong in SCSI, we've done it wrong
in various /proc files, we've done it wrong in many places.

But let's not do it wrong AGAIN.

                Linus

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 14:54     ` David Howells
                         ` (2 preceding siblings ...)
  2018-07-12 20:23       ` David Howells
@ 2018-07-12 21:00       ` David Howells
  2018-07-12 21:29         ` Linus Torvalds
  2018-07-13 13:27         ` David Howells
  3 siblings, 2 replies; 113+ messages in thread
From: David Howells @ 2018-07-12 21:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Andy Lutomirski, Al Viro, Linux API, Linux FS Devel,
	Linus Torvalds, LKML, Jann Horn, tycho

Andy Lutomirski <luto@amacapital.net> wrote:

> fsconfigure(contextfd, ADD_BLOCKDEV, dfd, path, flags);
> 
> fsconfigure(contextfd, ADD_OPTION, 0, "foo=bar", flags);

That seems okayish.  I'm not sure we need the flags, but I do want to allow
for binary data in an option.  So perhaps something like:

	int fsconfig(int fd, unsigned int type,
		     const char *key, const void *val, size_t val_len);

for example:

	fd = fsopen("ext4", FSOPEN_CLOEXEC);
	fsconfig(fd, fsconfig_blockdev, "dev.data", "/dev/sda1", ...);
	fsconfig(fd, fsconfig_blockdev, "dev.journal", "/dev/sda2", ...);
	fsconfig(fd, fsconfig_option, "user_xattr", NULL, ...);
	fsconfig(fd, fsconfig_option, "errors", "continue", ...);
	fsconfig(fd, fsconfig_option, "data", "journal", ...);
	fsconfig(fd, fsconfig_security, "selinux.context", "unconfined_u:...");
	fsconfig(fd, fsconfig_create, NULL, NULL, 0);
	mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

	fd = fsopen("nfs", FSOPEN_CLOEXEC);
	fsconfig(fd, fsconfig_namespace, "user", "<usernsfd>", ...);
	fsconfig(fd, fsconfig_namespace, "net", "<netnsfd>", ...);
	fsconfig(fd, fsconfig_option, "server", "foo.com", ...);
	fsconfig(fd, fsconfig_option, "root", "/bar", ...);
	fsconfig(fd, fsconfig_option, "soft", NULL, ...);
	fsconfig(fd, fsconfig_option, "retry", "3", ...);
	fsconfig(fd, fsconfig_option, "wsize", "4096", ...);
	fsconfig(fd, fsconfig_uidmap, "dhowells", "1234", ...);
	fsconfig(fd, fsconfig_security, "selinux.context", "unconfined_u:...");
	fsconfig(fd, fsconfig_create, NULL, NULL, 0);
	mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

This does mean that userspace has to work harder, but it would simplify
the LSM interface internally.

Al Viro <viro@ftp.linux.org.uk> wrote:

> First of all, block device *IS* a fucking option.

Whilst that is true, I still need to be able to separate it out for
unconverted filesystems.

David
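
A userspace mock of the (type, key, val, val_len) shape suggested above, to show why an explicit length matters for binary values.  This is only a sketch: struct ctx, ctx_config, the CTX_* names, the fixed limits and the choice of -EBUSY after creation are all assumptions made for the example and say nothing about how the kernel side would behave.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define CTX_MAX_PARAMS	16
#define CTX_MAX_VAL	64

/* Hypothetical parameter types, loosely mirroring the proposal above. */
enum ctx_type { CTX_BLOCKDEV, CTX_OPTION, CTX_CREATE };

struct ctx_param {
	enum ctx_type type;
	const char *key;
	unsigned char val[CTX_MAX_VAL];	/* copied, may contain NUL bytes */
	size_t val_len;
};

struct ctx {
	struct ctx_param params[CTX_MAX_PARAMS];
	size_t nr;
	int created;	/* set once CTX_CREATE has been seen */
};

/* Record one parameter; returns 0 or a negative errno-style code. */
static int ctx_config(struct ctx *c, enum ctx_type type,
		      const char *key, const void *val, size_t val_len)
{
	struct ctx_param *p;

	assert(c != NULL);
	if (c->created)
		return -EBUSY;		/* sketch: no changes after creation */
	if (type == CTX_CREATE) {
		c->created = 1;
		return 0;
	}
	if (!key || val_len > CTX_MAX_VAL || (val_len && !val))
		return -EINVAL;
	if (c->nr == CTX_MAX_PARAMS)
		return -ENOSPC;

	p = &c->params[c->nr++];
	p->type = type;
	p->key = key;
	p->val_len = val_len;
	if (val_len)
		memcpy(p->val, val, val_len);	/* length-based, not strlen() */
	return 0;
}
```

Because the value is copied by length, a three-byte value such as "a\0b" survives intact, which a write()-style text command could not express without escaping.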

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 20:23       ` David Howells
  2018-07-12 20:25         ` Andy Lutomirski
  2018-07-12 20:34         ` Linus Torvalds
@ 2018-07-12 21:26         ` David Howells
  2018-07-12 21:40           ` Linus Torvalds
                             ` (2 more replies)
  2 siblings, 3 replies; 113+ messages in thread
From: David Howells @ 2018-07-12 21:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Andrew Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The unix semantics are that credentials are checked at open time.

Sigh.

The problem is that there's more than one actual "open" involved.

	fd = fsopen("ext4");				<--- #1
	whatever_interface(fd, "s /dev/sda1");
	whatever_interface(fd, "o journal_path=/dev/sda2");
	do_the_create_thing(fd);			<--- #2 and #3

The initial check to see whether you can mount or not is done at #1.

But later there are two nested file opens.  Internally, deep down inside the
block layer, /dev/sda1 and /dev/sda2 are opened and further permissions checks
are done, whether you like it or not.  But these have no access to the creds
attached to fd as things currently stand.

So we have three choices:

 (1) Pass the creds from ->get_tree() all the way down into pathwalk and make
     sure *every* check that pathwalk does uses it.

 (2) When do_the_create_thing() is invoked, it wraps the call to ->get_tree()
     with override_creds(file->f_cred).

 (3) Forget using an fd to refer to the context.  fsopen() takes absolutely
     everything, perhaps as a kv array and spits out an O_PATH fd.  You don't
     get improved error reporting, you don't get a chance for interaction -
     say with the server, to construct an ID mapping table - and you don't get
     the chance to query the superblock before creating a mount.

     So, something like:

	struct fsopen_param {
		unsigned int type;
		const char *key;
		const void *val;
		unsigned int val_len;
	};

	mfd = fsopen(const char *fs_type,
		     unsigned int flags, /* CLOEXEC */
		     const struct fsopen_param *params,
		     unsigned int param_count,
		     unsigned int ms_flags /* eg. MNT_NOEXEC */);

     For example:

	struct fsopen_param params[] = {
		{ fsopen_source, "dev.fs", "/dev/sda1" },
		{ fsopen_source, "dev.journal", "/dev/sda2" },
		{ fsopen_option, "user_xattr" },
		{ fsopen_option, "data", "journal" },
		{ fsopen_option, "jqfmt", "vfsv1" },
		{ fsopen_security, "selinux.context", "unconfined_u..." },
	};

	mfd = fsopen("ext4", FSOPEN_CLOEXEC, params, ARRAY_SIZE(params),
		     MNT_NOEXEC);

     There would need to be an fsreconfig() also in a similar vein.

David
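
As a sketch of how a key/value array like the one in option (3) could be fed to an unconverted filesystem, the helper below joins the plain options into the comma-separated string that mount(2)-style option parsers typically expect.  The struct param layout, the PARAM_* names and params_to_legacy are hypothetical; other parameter types are simply skipped here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parameter types, named after the example above. */
enum param_type { PARAM_SOURCE, PARAM_OPTION, PARAM_SECURITY };

struct param {
	enum param_type type;
	const char *key;
	const char *val;	/* NULL for a flag-style option */
};

/*
 * Join all PARAM_OPTION entries into "key,key=val,..." form.
 * Returns the string length, or -1 if buf is too small.
 */
static int params_to_legacy(const struct param *p, size_t n,
			    char *buf, size_t size)
{
	size_t i, used = 0;
	int len;

	if (size == 0)
		return -1;
	buf[0] = '\0';
	for (i = 0; i < n; i++) {
		if (p[i].type != PARAM_OPTION)
			continue;	/* this sketch only joins plain options */
		len = snprintf(buf + used, size - used, "%s%s%s%s",
			       used ? "," : "", p[i].key,
			       p[i].val ? "=" : "",
			       p[i].val ? p[i].val : "");
		if (len < 0 || (size_t)len >= size - used)
			return -1;	/* output would have been truncated */
		used += (size_t)len;
	}
	return (int)used;
}
```

For the ext4 example above, the three option entries would come out as "user_xattr,data=journal,jqfmt=vfsv1".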

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 21:00       ` David Howells
@ 2018-07-12 21:29         ` Linus Torvalds
  2018-07-13 13:27         ` David Howells
  1 sibling, 0 replies; 113+ messages in thread
From: Linus Torvalds @ 2018-07-12 21:29 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, Andrew Lutomirski, Al Viro, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Jann Horn,
	Tycho Andersen

On Thu, Jul 12, 2018 at 2:00 PM David Howells <dhowells@redhat.com> wrote:
>
>
> for example:
>
>         fd = fsopen("ext4", FSOPEN_CLOEXEC);
>         fsconfig(fd, fsconfig_blockdev, "dev.data", "/dev/sda1", ...);
>         fsconfig(fd, fsconfig_blockdev, "dev.journal", "/dev/sda2", ...);

Ok, that looks good to me. It also avoids the parsing issue with using
an interface like "write()", where the expectation is that you can
append things etc.

              Linus

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 21:26         ` David Howells
@ 2018-07-12 21:40           ` Linus Torvalds
  2018-07-12 22:32           ` Theodore Y. Ts'o
  2018-07-12 22:54           ` David Howells
  2 siblings, 0 replies; 113+ messages in thread
From: Linus Torvalds @ 2018-07-12 21:40 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 2:26 PM David Howells <dhowells@redhat.com> wrote:
>
> The problem is that there's more than one actual "open" involved.

No. The problem is "write()".

This is not about open, about fsopen, or about anything at all.

This is about the fact that "write()" by definition can happen in a
different - and unexpected - context. Whether that be due to suid or
due to splice, or due to any other random issue is entirely
immaterial.

(The same is true of "read()" too, but very few people try to make
"read()" have side effects, so it's less of an issue. It does happen,
though).

But once you have another interface than "read/write()", the issues go
away. Those other interfaces are synchronous, and now you can decide
"ok, I'll just use current creds".

>  (1) Pass the creds from ->get_tree() all the way down into pathwalk and make
>      sure *every* check that pathwalk does uses it.

No. See above.

If your write() does anything but buffering data, it's not getting merged.

>  (2) When do_the_create_thing() is invoked, it wraps the call to ->get_tree()
>      with override_creds(file->f_cred).

No.

We do not wrap creds in any case. It's just asking for *another* kind
of security issue, where you fool some higher-security thing into
giving you access because it wrapped the higher-security case instead.

>  (3) Forget using an fd to refer to the context.  fsopen() takes absolutely
>      everything, perhaps as a kv array and spits out an O_PATH fd.

That works.

Or you know - do what I told you to do ALL THE TIME, which was to not
use write(), or to only buffer things with write().

But yes, any option that simply avoids read and write is fine.

You can even have a file descriptor. We already have file descriptors
that cannot be read from or written to. It's quite common for special
devices, the whole "open /dev/floppy with O_NONBLOCK only to be able
to do control operations with it" goes back to pretty much day #1.

More recently, we have the whole "FMODE_PATH" kind of file descriptor,
which works as a directory entry, but not for read and write.

So file descriptors can have very useful properties.

But no. We do not use "write()" to implement actions. If you think you
need to check permissions and think you need a "cred", then you're not
using write(). It really is that simple.

Not using write just avoids *all* the problems. If you can fool a suid
application to do arbitrary system calls for you, then it's not the
system call that is the security problem.

                Linus

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 21:26         ` David Howells
  2018-07-12 21:40           ` Linus Torvalds
@ 2018-07-12 22:32           ` Theodore Y. Ts'o
  2018-07-12 22:54           ` David Howells
  2 siblings, 0 replies; 113+ messages in thread
From: Theodore Y. Ts'o @ 2018-07-12 22:32 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Andrew Lutomirski, Al Viro, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 10:26:37PM +0100, David Howells wrote:
> The problem is that there's more than one actual "open" involved.
> 
> 	fd = fsopen("ext4");				<--- #1
> 	whatever_interface(fd, "s /dev/sda1");
> 	whatever_interface(fd, "o journal_path=/dev/sda2");
> 	do_the_create_thing(fd);			<--- #2 and #3
> 
> The initial check to see whether you can mount or not is done at #1.
> 
> But later there are two nested file opens.  Internally, deep down inside the
> block layer, /dev/sda1 and /dev/sda2 are opened and further permissions checks
> are done, whether you like it or not.  But these have no access to the creds
> attached to fd as things currently stand.

So maybe the answer is that you open /dev/sda1 and /dev/sda2 and then
pass the file descriptors to the fsopen object?  We can require that
the fd's be opened with O_RDWR and O_EXCL, which has the benefit where
if you have multiple block devices, you know *which* block device had
a problem with being grabbed for an exclusive open.

Just a thought.

						- Ted
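
A minimal userspace sketch of the idea above: the caller opens every backing device itself with O_RDWR | O_EXCL and learns exactly which path failed before anything is handed to the context.  On Linux, open(2) documents O_EXCL without O_CREAT as meaningful for block devices.  The helper name open_backing and its calling convention are assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Open each path with O_RDWR | O_EXCL.  On failure, store the index of
 * the offending path in *failed, close what was opened so far and return
 * a negative errno value.  Returns 0 when every open succeeded.
 */
static int open_backing(const char *const *paths, size_t n,
			int *fds, size_t *failed)
{
	size_t i;
	int err;

	for (i = 0; i < n; i++) {
		fds[i] = open(paths[i], O_RDWR | O_EXCL);
		if (fds[i] < 0) {
			err = errno;
			*failed = i;		/* tell the caller which one */
			while (i-- > 0)
				close(fds[i]);	/* undo the earlier opens */
			return -err;
		}
	}
	return 0;
}
```

The resulting fds could then be passed in one by one, so an EBUSY on the journal device is not confused with an EBUSY on the data device.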

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 21:26         ` David Howells
  2018-07-12 21:40           ` Linus Torvalds
  2018-07-12 22:32           ` Theodore Y. Ts'o
@ 2018-07-12 22:54           ` David Howells
  2018-07-12 23:21             ` Andy Lutomirski
                               ` (3 more replies)
  2 siblings, 4 replies; 113+ messages in thread
From: David Howells @ 2018-07-12 22:54 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: dhowells, Linus Torvalds, Andrew Lutomirski, Al Viro, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Jann Horn

Theodore Y. Ts'o <tytso@mit.edu> wrote:

> So maybe the answer is that you open /dev/sda1 and /dev/sda2 and then
> pass the file descriptors to the fsopen object?  We can require that
> the fd's be opened with O_RDWR and O_EXCL, which has the benefit where
> if you have multiple block devices, you know *which* block device had
> a problem with being grabbed for an exclusive open.

Would that mean then that doing:

	mount /dev/sda3 /a
	mount /dev/sda3 /b

would then fail on the second command because /dev/sda3 is already open
exclusively?

David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 22:54           ` David Howells
@ 2018-07-12 23:21             ` Andy Lutomirski
  2018-07-12 23:23             ` Jann Horn
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-12 23:21 UTC (permalink / raw)
  To: David Howells
  Cc: Theodore Y. Ts'o, Linus Torvalds, Andrew Lutomirski, Al Viro,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Jann Horn



> On Jul 12, 2018, at 3:54 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Theodore Y. Ts'o <tytso@mit.edu> wrote:
> 
>> So maybe the answer is that you open /dev/sda1 and /dev/sda2 and then
>> pass the file descriptors to the fsopen object?  We can require that
>> the fd's be opened with O_RDWR and O_EXCL, which has the benefit where
>> if you have multiple block devices, you know *which* block device had
>> a problem with being grabbed for an exclusive open.
> 
> Would that mean then that doing:
> 
>    mount /dev/sda3 /a
>    mount /dev/sda3 /b
> 
> would then fail on the second command because /dev/sda3 is already open
> exclusively?
> 

I tend to think that this *should* fail using the new API.  The semantics of the second mount request are bizarre at best.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 22:54           ` David Howells
  2018-07-12 23:21             ` Andy Lutomirski
@ 2018-07-12 23:23             ` Jann Horn
  2018-07-12 23:33               ` Jann Horn
  2018-07-12 23:35             ` David Howells
  2018-07-13  2:35             ` Theodore Y. Ts'o
  3 siblings, 1 reply; 113+ messages in thread
From: Jann Horn @ 2018-07-12 23:23 UTC (permalink / raw)
  To: David Howells
  Cc: Theodore Y. Ts'o, Linus Torvalds, Andy Lutomirski, Al Viro,
	Linux API, linux-fsdevel, kernel list

On Thu, Jul 12, 2018 at 3:54 PM David Howells <dhowells@redhat.com> wrote:
>
> Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> > So maybe the answer is that you open /dev/sda1 and /dev/sda2 and then
> > pass the file descriptors to the fsopen object?  We can require that
> > the fd's be opened with O_RDWR and O_EXCL, which has the benefit where
> > if you have multiple block devices, you know *which* block device had
> > a problem with being grabbed for an exclusive open.
>
> Would that mean then that doing:
>
>         mount /dev/sda3 /a
>         mount /dev/sda3 /b
>
> would then fail on the second command because /dev/sda3 is already open
> exclusively?

Not exactly. mount_bdev() uses FMODE_EXCL, which locks out parallel
usage *with a different filesystem type*. This is the effect:

# strace -e trace=mount mount -t vfat /dev/loop0 mount
mount("/dev/loop0", "/home/jannh/tmp/x/mount", "vfat", MS_MGC_VAL, NULL) = 0
+++ exited with 0 +++
# strace -e trace=mount mount -t ext4 /dev/loop0 mount
mount("/dev/loop0", "/home/jannh/tmp/x/mount", "ext4", MS_MGC_VAL,
NULL) = -1 EBUSY (Device or resource busy)
mount: /home/jannh/tmp/x/mount: /dev/loop0 already mounted on
/home/jannh/tmp/x/mount.
+++ exited with 32 +++

I don't really understand why it's not more strict though...
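
A toy model of the holder rule described above, as I understand it: an exclusive claim is keyed by a holder cookie (mount_bdev() passes the filesystem type), so the same holder may claim the device again while a different holder gets EBUSY.  struct toy_bdev and toy_claim are made up for this sketch and are not the kernel's data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a block device's exclusive-claim state. */
struct toy_bdev {
	const void *holder;	/* NULL when unclaimed */
	unsigned int holders;	/* number of claims by that holder */
};

/* Claim the device for 'holder'; the same holder may claim it again. */
static int toy_claim(struct toy_bdev *b, const void *holder)
{
	assert(holder != NULL);
	if (b->holder && b->holder != holder)
		return -EBUSY;	/* someone else already holds it */
	b->holder = holder;
	b->holders++;
	return 0;
}
```

In this model, two vfat claims on the same device both succeed and a following ext4 claim fails with -EBUSY, which matches the strace output above.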

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 23:23             ` Jann Horn
@ 2018-07-12 23:33               ` Jann Horn
  0 siblings, 0 replies; 113+ messages in thread
From: Jann Horn @ 2018-07-12 23:33 UTC (permalink / raw)
  To: David Howells
  Cc: Theodore Y. Ts'o, Linus Torvalds, Andy Lutomirski, Al Viro,
	Linux API, linux-fsdevel, kernel list

On Thu, Jul 12, 2018 at 4:23 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Jul 12, 2018 at 3:54 PM David Howells <dhowells@redhat.com> wrote:
> >
> > Theodore Y. Ts'o <tytso@mit.edu> wrote:
> >
> > > So maybe the answer is that you open /dev/sda1 and /dev/sda2 and then
> > > pass the file descriptors to the fsopen object?  We can require that
> > > the fd's be opened with O_RDWR and O_EXCL, which has the benefit where
> > > if you have multiple block devices, you know *which* block device had
> > > a problem with being grabbed for an exclusive open.
> >
> > Would that mean then that doing:
> >
> >         mount /dev/sda3 /a
> >         mount /dev/sda3 /b
> >
> > would then fail on the second command because /dev/sda3 is already open
> > exclusively?
>
> Not exactly. mount_bdev() uses FMODE_EXCL, which locks out parallel
> usage *with a different filesystem type*. This is the effect:
>
> # strace -e trace=mount mount -t vfat /dev/loop0 mount
> mount("/dev/loop0", "/home/jannh/tmp/x/mount", "vfat", MS_MGC_VAL, NULL) = 0
> +++ exited with 0 +++
> # strace -e trace=mount mount -t ext4 /dev/loop0 mount
> mount("/dev/loop0", "/home/jannh/tmp/x/mount", "ext4", MS_MGC_VAL, NULL) = -1 EBUSY (Device or resource busy)
> mount: /home/jannh/tmp/x/mount: /dev/loop0 already mounted on /home/jannh/tmp/x/mount.
> +++ exited with 32 +++
>
> I don't really understand why it's not more strict though...

Er, sorry, of course that's the current behavior, not the behavior of
the suggested API.

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 22:54           ` David Howells
  2018-07-12 23:21             ` Andy Lutomirski
  2018-07-12 23:23             ` Jann Horn
@ 2018-07-12 23:35             ` David Howells
  2018-07-12 23:50               ` Andy Lutomirski
  2018-07-13  0:03               ` David Howells
  2018-07-13  2:35             ` Theodore Y. Ts'o
  3 siblings, 2 replies; 113+ messages in thread
From: David Howells @ 2018-07-12 23:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Theodore Y. Ts'o, Linus Torvalds,
	Andrew Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

Andy Lutomirski <luto@amacapital.net> wrote:

> I tend to think that this *should* fail using the new API.  The semantics of
> the second mount request are bizarre at best.

You still have to support existing behaviour lest you break userspace.

David

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 23:35             ` David Howells
@ 2018-07-12 23:50               ` Andy Lutomirski
  2018-07-13  0:03               ` David Howells
  1 sibling, 0 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-12 23:50 UTC (permalink / raw)
  To: David Howells
  Cc: Theodore Y. Ts'o, Linus Torvalds, Andrew Lutomirski, Al Viro,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Jann Horn



> On Jul 12, 2018, at 4:35 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Andy Lutomirski <luto@amacapital.net> wrote:
> 
>> I tend to think that this *should* fail using the new API.  The semantics of
>> the second mount request are bizarre at best.
> 
> You still have to support existing behaviour lest you break userspace.
> 

I assume the existing behavior is that a bind mount is created?  If so, the
new mount(8) tool could do it in user code.

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 23:35             ` David Howells
  2018-07-12 23:50               ` Andy Lutomirski
@ 2018-07-13  0:03               ` David Howells
  2018-07-13  0:24                 ` Andy Lutomirski
  2018-07-13  7:30                 ` David Howells
  1 sibling, 2 replies; 113+ messages in thread
From: David Howells @ 2018-07-13  0:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Theodore Y. Ts'o, Linus Torvalds,
	Andrew Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

Andy Lutomirski <luto@amacapital.net> wrote:

> >> I tend to think that this *should* fail using the new API.  The semantics
> >> of the second mount request are bizarre at best.
> > 
> > You still have to support existing behaviour lest you break userspace.
> > 
> 
> I assume the existing behavior is that a bind mount is created?  If so, the
> new mount(8) tool could do it in user code.

You have a race there.

Also you can't currently directly create a bind mount from userspace as you
can only bind from another path point - which you may not be able to access
(either by permission failure or because it's not in your mount namespace).

David

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-13  0:03               ` David Howells
@ 2018-07-13  0:24                 ` Andy Lutomirski
  2018-07-13  7:30                 ` David Howells
  1 sibling, 0 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-13  0:24 UTC (permalink / raw)
  To: David Howells
  Cc: Theodore Y. Ts'o, Linus Torvalds, Andrew Lutomirski, Al Viro,
	Linux API, linux-fsdevel, Linux Kernel Mailing List, Jann Horn



> On Jul 12, 2018, at 5:03 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Andy Lutomirski <luto@amacapital.net> wrote:
> 
>>>> I tend to think that this *should* fail using the new API.  The semantics
>>>> of the second mount request are bizarre at best.
>>> 
>>> You still have to support existing behaviour lest you break userspace.
>>> 
>> 
>> I assume the existing behavior is that a bind mount is created?  If so, the
>> new mount(8) tool could do it in user code.
> 
> You have a race there.
> 
> Also you can't currently directly create a bind mount from userspace as you
> can only bind from another path point - which you may not be able to access
> (either by permission failure or because it's not in your mount namespace).
> 

Are you trying to preserve the magic bind semantics with the new API?  If you
are, I think it should be by explicit opt in only.  Otherwise you risk having
your shiny new way to specify fs options get ignored when the magic bind mount
happens.

* Re: [PATCH 22/32] vfs: Provide documentation for new mount API [ver #9]
  2018-07-10 22:43 ` [PATCH 22/32] vfs: Provide documentation for new mount API " David Howells
@ 2018-07-13  1:37   ` Randy Dunlap
  2018-07-13  9:45   ` David Howells
  1 sibling, 0 replies; 113+ messages in thread
From: Randy Dunlap @ 2018-07-13  1:37 UTC (permalink / raw)
  To: David Howells, viro; +Cc: linux-fsdevel, torvalds, linux-kernel

On 07/10/2018 03:43 PM, David Howells wrote:
> Provide documentation for the new mount API.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  Documentation/filesystems/mount_api.txt |  439 +++++++++++++++++++++++++++++++
>  1 file changed, 439 insertions(+)
>  create mode 100644 Documentation/filesystems/mount_api.txt

Hi,

I would review this but it sounds like I should just wait for the
next version.

-- 
~Randy

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 22:54           ` David Howells
                               ` (2 preceding siblings ...)
  2018-07-12 23:35             ` David Howells
@ 2018-07-13  2:35             ` Theodore Y. Ts'o
  3 siblings, 0 replies; 113+ messages in thread
From: Theodore Y. Ts'o @ 2018-07-13  2:35 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Andrew Lutomirski, Al Viro, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Jann Horn

On Thu, Jul 12, 2018 at 11:54:41PM +0100, David Howells wrote:
> 
> Would that mean then that doing:
> 
> 	mount /dev/sda3 /a
> 	mount /dev/sda3 /b
> 
> would then fail on the second command because /dev/sda3 is already open
> exclusively?

Good point.  One workaround would be to require an open with O_PATH instead.

					- Ted

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-13  0:03               ` David Howells
  2018-07-13  0:24                 ` Andy Lutomirski
@ 2018-07-13  7:30                 ` David Howells
  2018-07-19  1:30                   ` Eric W. Biederman
  1 sibling, 1 reply; 113+ messages in thread
From: David Howells @ 2018-07-13  7:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Theodore Y. Ts'o, Linus Torvalds,
	Andrew Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

Andy Lutomirski <luto@amacapital.net> wrote:

> > Also you can't currently directly create a bind mount from userspace as you
> > can only bind from another path point - which you may not be able to access
> > (either by permission failure or because it's not in your mount namespace).
> > 
> 
> Are you trying to preserve the magic bind semantics with the new API?

No, I'm pointing out that you can't emulate this by doing a bind mount from
userspace if you can't access the thing you're binding from.

Now, we could create a syscall that just picks up an extant superblock using a
device and attaches it to a mount for you, but that would have to be at least
partially parameterised - which would be very fs-dependent - so that it can
know whether or not you're allowed to create another mount to that sb.

What you're talking about is emulating sget() in userspace - when we have to
do it in the kernel anyway if we still offer mount(2).

David

* Re: [PATCH 22/32] vfs: Provide documentation for new mount API [ver #9]
  2018-07-10 22:43 ` [PATCH 22/32] vfs: Provide documentation for new mount API " David Howells
  2018-07-13  1:37   ` Randy Dunlap
@ 2018-07-13  9:45   ` David Howells
  1 sibling, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-13  9:45 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: dhowells, viro, linux-fsdevel, torvalds, linux-kernel

Randy Dunlap <rdunlap@infradead.org> wrote:

> I would review this but it sounds like I should just wait for the
> next version.

Probably a good idea.  Thanks for the consideration anyway.

David

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-12 21:00       ` David Howells
  2018-07-12 21:29         ` Linus Torvalds
@ 2018-07-13 13:27         ` David Howells
  2018-07-13 15:01           ` Andy Lutomirski
                             ` (2 more replies)
  1 sibling, 3 replies; 113+ messages in thread
From: David Howells @ 2018-07-13 13:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Andy Lutomirski, Andrew Lutomirski, Al Viro, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Jann Horn,
	Tycho Andersen

Whilst I'm at it, do we want the option of doing the equivalent of mountat()?
I.e. offering the option to open all the device files used by a superblock
with dfd and AT_* flags in combination with the filename?

David

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-13 13:27         ` David Howells
@ 2018-07-13 15:01           ` Andy Lutomirski
  2018-07-13 15:40           ` David Howells
  2018-07-17  9:40           ` David Howells
  2 siblings, 0 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-13 15:01 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Andrew Lutomirski, Al Viro, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Jann Horn,
	Tycho Andersen



> On Jul 13, 2018, at 6:27 AM, David Howells <dhowells@redhat.com> wrote:
> 
> Whilst I'm at it, do we want the option of doing the equivalent of mountat()?
> I.e. offering the option to open all the device files used by a superblock
> with dfd and AT_* flags in combination with the filename?
> 

Isn’t that more or less what I was suggesting?  I suggested dfd and path and I
also suggested just an fd and letting the caller open the file itself.

> David

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-13 13:27         ` David Howells
  2018-07-13 15:01           ` Andy Lutomirski
@ 2018-07-13 15:40           ` David Howells
  2018-07-13 17:14             ` Andy Lutomirski
  2018-07-17  9:40           ` David Howells
  2 siblings, 1 reply; 113+ messages in thread
From: David Howells @ 2018-07-13 15:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Linus Torvalds, Andrew Lutomirski, Al Viro, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Jann Horn,
	Tycho Andersen

Andy Lutomirski <luto@amacapital.net> wrote:

> > Whilst I'm at it, do we want the option of doing the equivalent of
> > mountat()?  I.e. offering the option to open all the device files used by
> > a superblock with dfd and AT_* flags in combination with the filename?
> > 
> 
> Isn't that more or less what I was suggesting?

Yes, you suggested that.  I'm asking if we actually need that.

> ... I also suggested just an fd and letting the caller open the file itself.

I'm not entirely sure, but that might prevent the filesystem from being able
to use it, since userspace might then prevent the filesystem getting exclusive
holdership.

David

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-13 15:40           ` David Howells
@ 2018-07-13 17:14             ` Andy Lutomirski
  0 siblings, 0 replies; 113+ messages in thread
From: Andy Lutomirski @ 2018-07-13 17:14 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Andrew Lutomirski, Al Viro, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Jann Horn,
	Tycho Andersen

On Fri, Jul 13, 2018 at 8:40 AM, David Howells <dhowells@redhat.com> wrote:
> Andy Lutomirski <luto@amacapital.net> wrote:
>
>> > Whilst I'm at it, do we want the option of doing the equivalent of
>> > mountat()?  I.e. offering the option to open all the device files used by
>> > a superblock with dfd and AT_* flags in combination with the filename?
>> >
>>
>> Isn't that more or less what I was suggesting?
>
> Yes, you suggested that.  I'm asking if we actually need that.
>

Suppose some program in a container chroots itself and then tries to
create an fscontext backed by "/path/to/blockdev".  The syscall gets
intercepted by a container manager.  That manager now has a somewhat
awkward time of mounting the same fs, although it could use
"/proc/PID/root/path/to/blockdev", I suppose.  Even that approach has
some potentially awkward permission issues.  I would defer to the
people who actually write software like this, but I can imagine fds
being considerably easier to work with.

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-13 13:27         ` David Howells
  2018-07-13 15:01           ` Andy Lutomirski
  2018-07-13 15:40           ` David Howells
@ 2018-07-17  9:40           ` David Howells
  2 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-17  9:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Linus Torvalds, Andrew Lutomirski, Al Viro, Linux API,
	linux-fsdevel, Linux Kernel Mailing List, Jann Horn,
	Tycho Andersen

Andy Lutomirski <luto@amacapital.net> wrote:

> > Whilst I'm at it, do we want the option of doing the equivalent of
> > mountat()?  I.e. offering the option to open all the device files used by
> > a superblock with dfd and AT_* flags in combination with the filename?
> > 
> 
> Isn’t that more or less what I was suggesting?  I suggested dfd and path and
> I also suggested just an fd and letting the caller open the file itself.

Do we need AT_* flags?  There are three that we could use:

	AT_SYMLINK_NOFOLLOW
	AT_NO_AUTOMOUNT
	AT_EMPTY_PATH

AT_EMPTY_PATH I can see, but I don't see it as likely that we'd want to use
the other two for selecting a source?  Note that we can always do:

	fsfd = fsopen("ext4");
	sfd = open("/dev/", O_PATH);
	fsconfig(fsfd, fsconfig_set_path, "journal_path", "sda1", sfd);

or:

	fsfd = fsopen("ext4");
	sfd = open("/dev/sda1", O_PATH);
	fsconfig(fsfd, fsconfig_set_path_empty, "journal_path", "", sfd);

or:

	fsfd = fsopen("ext4");
	jfd = open("/dev/sda1", O_RDWR);
	fsconfig(fsfd, fsconfig_set_fd, "journal_path", NULL, jfd);

assuming the open on the latter doesn't exclude the use by the filesystem.

This way I don't need a second syscall or a 6-arg syscall to handle path
specification.
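
As a rough illustration of why one (value, aux) pair is enough for the two
path forms above, here is a minimal userspace sketch that reduces each form to
an openat()-style (dfd, path, flags) triple.  The sketch_* names and the enum
are hypothetical and not part of the proposed API; the fd-passing form is left
out because it involves no lookup at all.

```c
#include <fcntl.h>
#include <stddef.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* Linux value; glibc hides it without _GNU_SOURCE */
#endif

/* Hypothetical stand-ins for the two path-setting commands. */
enum sketch_path_cmd { SK_PATH, SK_PATH_EMPTY };

/* The triple a path lookup would need. */
struct sketch_lookup {
	int		dfd;
	const char	*path;
	int		at_flags;
};

/* Fill *out from (value, aux); return 0 on success, -1 on bad arguments. */
static int sketch_to_lookup(enum sketch_path_cmd cmd, const char *value,
			    int aux, struct sketch_lookup *out)
{
	switch (cmd) {
	case SK_PATH:
		/* A non-empty path, relative to aux (a dfd or AT_FDCWD). */
		if (!value || !*value)
			return -1;
		*out = (struct sketch_lookup){ aux, value, 0 };
		return 0;
	case SK_PATH_EMPTY:
		/* As above, but an empty path is allowed and names aux itself. */
		if (!value)
			return -1;
		*out = (struct sketch_lookup){ aux, value, AT_EMPTY_PATH };
		return 0;
	}
	return -1;
}
```

Since the flag is implied by the command, the caller never has to pass a
separate flags argument.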

David

* Getting rid of the usage of write() -- was Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #9]
  2018-07-10 22:41 [PATCH 00/32] VFS: Introduce filesystem context [ver #9] David Howells
                   ` (36 preceding siblings ...)
  2018-07-12  0:46 ` David Howells
@ 2018-07-18 21:29 ` David Howells
  37 siblings, 0 replies; 113+ messages in thread
From: David Howells @ 2018-07-18 21:29 UTC (permalink / raw)
  To: torvalds, viro; +Cc: dhowells, linux-fsdevel, linux-kernel

Hi Linus, Al,

I'm thinking of adding in the attached patch as a starting point for replacing
write() as the method by which configuration/actioning is done.

For the moment, it just glues the key and the value back together inside the
kernel and passes that on to the filesystem.  I'm still working on a patch to
pass key,val pairs through, but just the patch below would allow Al to take up
the UAPI bits into linux-next.

David
---
vfs: Add a syscall for configuring and triggering actions on a context

Add a syscall for configuring a filesystem creation context and triggering
actions upon it, to be used in conjunction with fsopen, fspick and fsmount.

    long fsconfig(int fs_fd, unsigned int cmd, const char *key,
                  const void *value, int aux);

Here fs_fd indicates the context, cmd indicates the action to take and key
gives the parameter name for parameter-setting actions.  If needed, value
points to a buffer containing the value and aux supplies additional
information about it.

The following command IDs are proposed:

 (*) fsconfig_set_flag: No value is specified.  The parameter must be
     boolean in nature.  The key may be prefixed with "no" to invert the
     setting. value must be NULL and aux must be 0.

 (*) fsconfig_set_string: A string value is specified.  The parameter may
     expect a boolean, an integer, a string or a path.  A conversion to
     an appropriate type will be attempted (which may include looking up as
     a path).  value points to a NUL-terminated string and aux must be 0.

 (*) fsconfig_set_binary: A binary blob is specified.  value points to
     the blob and aux indicates its size.  The parameter must be expecting
     a blob.

 (*) fsconfig_set_path: A non-empty path is specified.  The parameter must
     be expecting a path object.  value points to a NUL-terminated string
     that is the path and aux is a file descriptor at which to start a
     relative lookup or AT_FDCWD.

 (*) fsconfig_set_path_empty: As fsconfig_set_path, but with AT_EMPTY_PATH
     implied.

 (*) fsconfig_set_fd: An open file descriptor is specified.  value must
     be NULL and aux indicates the file descriptor.

 (*) fsconfig_cmd_create: Trigger superblock creation.

 (*) fsconfig_cmd_reconfigure: Trigger superblock reconfiguration.

For the "set" command IDs, the idea is that the file_system_type will point
to a list of parameters and the types of value that those parameters expect
to take.  The core code can then do the parse and argument conversion and
then give the LSM and FS a cooked option or array of options to use.
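
One way to picture such a parameter list is a sorted name-to-type table that
the core searches before converting the value.  The sketch below is userspace
C under stated assumptions: the sketch_* names and the type enum are
hypothetical, and the types assigned to the ext4 keys are guesses made purely
for illustration.  The key names are borrowed from the usage example in this
description.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical value types a parameter might expect. */
enum sketch_ptype { SK_P_UNKNOWN, SK_P_FLAG, SK_P_STRING, SK_P_PATH, SK_P_FD };

struct sketch_param {
	const char		*name;
	enum sketch_ptype	type;
};

/* Must stay sorted by name in strcmp() order for bsearch() to work. */
static const struct sketch_param sketch_ext4_params[] = {
	{ "data",		SK_P_STRING },
	{ "errors",		SK_P_STRING },
	{ "journal_fd",		SK_P_FD },
	{ "journal_path",	SK_P_PATH },
	{ "user_xattr",		SK_P_FLAG },
};

/* bsearch() comparator: the key is a string, the entry a table element. */
static int sketch_cmp(const void *name, const void *entry)
{
	const struct sketch_param *p = entry;

	return strcmp(name, p->name);
}

/* Return the type expected for key, or SK_P_UNKNOWN if it is not listed. */
static enum sketch_ptype sketch_param_type(const char *key)
{
	const struct sketch_param *p;

	p = bsearch(key, sketch_ext4_params,
		    sizeof(sketch_ext4_params) / sizeof(sketch_ext4_params[0]),
		    sizeof(sketch_ext4_params[0]), sketch_cmp);
	return p ? p->type : SK_P_UNKNOWN;
}
```

With a table like this, the core could reject an unknown key or a mismatched
command before the filesystem sees anything.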

Source specification is also done the same way, using special keys
"source", "source1", "source2", etc.

[!] Note that, for the moment, the key and value are just glued back
together and handed to the filesystem.  Every filesystem that takes options
uses match_token() and co. to parse them, and this will need to be changed -
but not all at once.
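
The gluing step can be sketched in userspace as follows.  This mirrors the
idea only, using malloc() where the kernel would use kmalloc(); sketch_glue()
is a hypothetical name and not a function from the patch.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Build "key=value" from a key and an optional value of v_len bytes.
 * With a NULL value, a plain copy of the key is returned.  The caller
 * frees the result; NULL is returned on allocation failure.
 */
static char *sketch_glue(const char *key, const void *value, size_t v_len)
{
	size_t k_len = strlen(key);
	char *buf;

	if (!value) {
		buf = malloc(k_len + 1);
		if (buf)
			memcpy(buf, key, k_len + 1);	/* includes the NUL */
		return buf;
	}

	/* key + '=' + value + NUL terminator */
	buf = malloc(k_len + 1 + v_len + 1);
	if (!buf)
		return NULL;
	memcpy(buf, key, k_len);
	buf[k_len] = '=';
	memcpy(buf + k_len + 1, value, v_len);
	buf[k_len + 1 + v_len] = '\0';
	return buf;
}
```

A filesystem that still parses options with match_token() would then see the
same "errors=continue" style string it receives today.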

Example usage:

    fd = fsopen("ext4", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);
    fsconfig(fd, fsconfig_set_path_empty, "journal_path", "", journal_fd);
    fsconfig(fd, fsconfig_set_fd, "journal_fd", NULL, journal_fd);
    fsconfig(fd, fsconfig_set_flag, "user_xattr", NULL, 0);
    fsconfig(fd, fsconfig_set_flag, "noacl", NULL, 0);
    fsconfig(fd, fsconfig_set_string, "sb", "1", 0);
    fsconfig(fd, fsconfig_set_string, "errors", "continue", 0);
    fsconfig(fd, fsconfig_set_string, "data", "journal", 0);
    fsconfig(fd, fsconfig_set_string, "context", "unconfined_u:...", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

    fd = fsopen("ext4", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_string, "source", "/dev/sda1", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

    fd = fsopen("afs", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_string, "source", "#grand.central.org:root.cell", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

    fd = fsopen("jffs2", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_string, "source", "mtd0", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
---
 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/fs_context.c                        |  177 ++++++++++------
 fs/fsopen.c                            |  363 +++++++++++++++++++++------------
 include/linux/fs_context.h             |    2 
 include/linux/syscalls.h               |    2 
 include/uapi/linux/fs.h                |   14 +
 samples/mount_api/test-fsmount.c       |   24 +-
 8 files changed, 382 insertions(+), 202 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 1c9b56f80cdf..7bc9a6bae788 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -404,3 +404,4 @@
 390	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 391	i386	fspick			sys_fspick			__ia32_sys_fspick
 392	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
+393	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index d2a4d6db4df6..9caf2f0be723 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -349,6 +349,7 @@
 338	common	fsmount			__x64_sys_fsmount
 339	common	fspick			__x64_sys_fspick
 340	common	fsinfo			__x64_sys_fsinfo
+341	common	fsconfig		__x64_sys_fsconfig
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/fs_context.c b/fs/fs_context.c
index f388ab29d37d..071723cf11c8 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -19,10 +19,10 @@
 #include <linux/slab.h>
 #include <linux/magic.h>
 #include <linux/security.h>
-#include <linux/parser.h>
 #include <linux/mnt_namespace.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/bsearch.h>
 #include <net/net_namespace.h>
 #include <asm/sections.h>
 #include "mount.h"
@@ -45,81 +45,102 @@ struct legacy_fs_context {
 static int legacy_init_fs_context(struct fs_context *fc, struct dentry *dentry);
 static const struct fs_context_operations legacy_fs_context_ops;
 
-static const match_table_t common_set_sb_flag = {
-	{ SB_DIRSYNC,		"dirsync" },
-	{ SB_LAZYTIME,		"lazytime" },
-	{ SB_MANDLOCK,		"mand" },
-	{ SB_POSIXACL,		"posixacl" },
-	{ SB_RDONLY,		"ro" },
-	{ SB_SYNCHRONOUS,	"sync" },
-	{ },
+struct constant_table {
+	const char	*name;
+	int		value;
 };
 
-static const match_table_t common_clear_sb_flag = {
-	{ SB_LAZYTIME,		"nolazytime" },
-	{ SB_MANDLOCK,		"nomand" },
-	{ SB_RDONLY,		"rw" },
-	{ SB_SILENT,		"silent" },
-	{ SB_SYNCHRONOUS,	"async" },
-	{ },
+static const struct constant_table common_set_sb_flag[] = {
+	{ "dirsync",	SB_DIRSYNC },
+	{ "lazytime",	SB_LAZYTIME },
+	{ "mand",	SB_MANDLOCK },
+	{ "posixacl",	SB_POSIXACL },
+	{ "ro",		SB_RDONLY },
+	{ "sync",	SB_SYNCHRONOUS },
 };
 
-static const match_table_t forbidden_sb_flag = {
-	{ 1,	"bind" },
-	{ 1,	"move" },
-	{ 1,	"private" },
-	{ 1,	"remount" },
-	{ 1,	"shared" },
-	{ 1,	"slave" },
-	{ 1,	"unbindable" },
-	{ 1,	"rec" },
-	{ 1,	"noatime" },
-	{ 1,	"relatime" },
-	{ 1,	"norelatime" },
-	{ 1,	"strictatime" },
-	{ 1,	"nostrictatime" },
-	{ 1,	"nodiratime" },
-	{ 1,	"dev" },
-	{ 1,	"nodev" },
-	{ 1,	"exec" },
-	{ 1,	"noexec" },
-	{ 1,	"suid" },
-	{ 1,	"nosuid" },
-	{ },
+static const struct constant_table common_clear_sb_flag[] = {
+	{ "async",	SB_SYNCHRONOUS },
+	{ "nolazytime",	SB_LAZYTIME },
+	{ "nomand",	SB_MANDLOCK },
+	{ "rw",		SB_RDONLY },
+	{ "silent",	SB_SILENT },
 };
 
+static const char *forbidden_sb_flag[] = {
+	"bind",
+	"dev",
+	"exec",
+	"move",
+	"noatime",
+	"nodev",
+	"nodiratime",
+	"noexec",
+	"norelatime",
+	"nostrictatime",
+	"nosuid",
+	"private",
+	"rec",
+	"relatime",
+	"remount",
+	"shared",
+	"slave",
+	"strictatime",
+	"suid",
+	"unbindable",
+};
+
+static int lookup_one(const void *name, const void *entry)
+{
+	const struct constant_table *e = entry;
+	return strcmp(name, e->name);
+}
+
+static int lookup_constant(const struct constant_table tbl[], size_t tbl_size,
+			   const char *name, int not_found)
+{
+	const struct constant_table *e;
+
+	e = bsearch(name, tbl, tbl_size, sizeof(tbl[0]), lookup_one);
+	if (!e)
+		return not_found;
+	return e->value;
+}
+#define lookup_constant(t, n, nf) lookup_constant(t, ARRAY_SIZE(t), (n), (nf))
+
 /*
  * Check for a common mount option that manipulates s_flags.
  */
-static int vfs_parse_sb_flag_option(struct fs_context *fc, char *data)
+static int vfs_parse_sb_flag_option(struct fs_context *fc, const char *key)
 {
-	substring_t args[MAX_OPT_ARGS];
 	unsigned int token;
 
-	token = match_token(data, common_set_sb_flag, args);
+	if (bsearch(key, forbidden_sb_flag, ARRAY_SIZE(forbidden_sb_flag),
+		    sizeof(forbidden_sb_flag[0]),
+		    (int (*)(const void *, const void *))strcmp))
+		return -EINVAL;
+
+	token = lookup_constant(common_set_sb_flag, key, 0);
 	if (token) {
 		fc->sb_flags |= token;
 		return 1;
 	}
 
-	token = match_token(data, common_clear_sb_flag, args);
+	token = lookup_constant(common_clear_sb_flag, key, 0);
 	if (token) {
 		fc->sb_flags &= ~token;
 		return 1;
 	}
 
-	token = match_token(data, forbidden_sb_flag, args);
-	if (token)
-		return -EINVAL;
-
 	return 0;
 }
 
 /**
  * vfs_parse_fs_option - Add a single mount option to a superblock config
  * @fc: The filesystem context to modify
- * @opt: The option to apply.
- * @len: The length of the option.
+ * @key: The parameter name
+ * @value: The parameter value
+ * @v_len: The length of the value
  *
  * A single mount option in string form is applied to the filesystem context
  * being set up.  Certain standard options (for example "ro") are translated
@@ -132,26 +153,51 @@ static int vfs_parse_sb_flag_option(struct fs_context *fc, char *data)
  * Returns 0 on success and a negative error code on failure.  In the event of
  * failure, supplementary error information may have been set.
  */
-int vfs_parse_fs_option(struct fs_context *fc, char *opt, size_t len)
+int vfs_parse_fs_option(struct fs_context *fc, char *key, void *value, size_t v_len)
 {
+	size_t len;
+	char *buf = key;
 	int ret;
 
-	ret = vfs_parse_sb_flag_option(fc, opt);
+	ret = vfs_parse_sb_flag_option(fc, key);
 	if (ret < 0)
 		return ret;
 	if (ret == 1)
 		return 0;
 
-	ret = security_fs_context_parse_option(fc, opt, len);
-	if (ret < 0)
-		return ret;
-	if (ret == 1)
-		return 0;
+	/* Splice together the value and the option and pass to the LSM and FS.
+	 *
+	 * [!] TODO: Need to pass key and value through separately.
+	 */
+	len = strlen(key);
+	if (value) {
+		buf = kmalloc(len + 1 + v_len + 1, GFP_KERNEL);
+		if (!buf)
+			return -ENOMEM;
+		memcpy(buf, key, len);
+		buf[len] = '=';
+		len++;
+		memcpy(buf + len, value, v_len);
+		len += v_len;
+		buf[len] = 0;
+	}
 
+	ret = security_fs_context_parse_option(fc, buf, len);
+	if (ret != 0) {
+		if (ret == 1)
+			/* Param belongs to the LSM; don't pass to the FS */
+			ret = 0;
+		goto out;
+	}
+
+	ret = -EINVAL;
 	if (fc->ops->parse_option)
-		return fc->ops->parse_option(fc, opt, len);
+		ret = fc->ops->parse_option(fc, buf, len);
 
-	return -EINVAL;
+out:
+	if (buf != key)
+		kfree(buf);
+	return ret;
 }
 EXPORT_SYMBOL(vfs_parse_fs_option);
 
@@ -205,15 +251,24 @@ EXPORT_SYMBOL(vfs_set_fs_source);
  */
 int generic_parse_monolithic(struct fs_context *fc, void *data, size_t data_size)
 {
-	char *options = data, *opt;
+	char *options = data, *key;
 	int ret;
 
 	if (!options)
 		return 0;
 
-	while ((opt = strsep(&options, ",")) != NULL) {
-		if (*opt) {
-			ret = vfs_parse_fs_option(fc, opt, strlen(opt));
+	while ((key = strsep(&options, ",")) != NULL) {
+		if (*key) {
+			size_t v_len = 0;
+			char *value = strchr(key, '=');
+
+			if (value) {
+				if (value == key)
+					continue;
+				*value++ = 0;
+				v_len = strlen(value);
+			}
+			ret = vfs_parse_fs_option(fc, key, value, v_len);
 			if (ret < 0)
 				return ret;
 		}
diff --git a/fs/fsopen.c b/fs/fsopen.c
index ebcbae8c6f10..c0d8dbe21063 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -16,138 +16,9 @@
 #include <linux/security.h>
 #include <linux/anon_inodes.h>
 #include <linux/namei.h>
+#include <linux/file.h>
 #include "mount.h"
 
-/*
- * Userspace writes configuration data and commands to the fd and we parse it
- * here.  For the moment, we assume a single option or command per write.  Each
- * line written is of the form
- *
- *	<command_type><space><stuff...>
- *
- *	s /dev/sda1				-- Source device
- *	o noatime				-- Option without value
- *	o cell=grand.central.org		-- Option with value
- *	x create				-- Create a superblock
- *	x reconfigure				-- Reconfigure a superblock
- */
-static ssize_t fscontext_write(struct file *file,
-			       const char __user *_buf, size_t len, loff_t *pos)
-{
-	struct fs_context *fc = file->private_data;
-	const struct cred *cred;
-	char opt[2], *data;
-	ssize_t ret;
-
-	if (len < 3 || len > 4095)
-		return -EINVAL;
-
-	if (copy_from_user(opt, _buf, 2) != 0)
-		return -EFAULT;
-	switch (opt[0]) {
-	case 's':
-	case 'o':
-	case 'x':
-		break;
-	default:
-		return -EINVAL;
-	}
-	if (opt[1] != ' ')
-		return -EINVAL;
-
-	data = memdup_user_nul(_buf + 2, len - 2);
-	if (IS_ERR(data))
-		return PTR_ERR(data);
-
-	/* From this point onwards we need to lock the fd against someone
-	 * trying to mount it.
-	 */
-	ret = mutex_lock_interruptible(&fc->uapi_mutex);
-	if (ret < 0)
-		goto err_free;
-
-	/* All operations take place using whatever privilege was granted to
-	 * the caller of fsopen() or fspick().
-	 */
-	cred = override_creds(fc->cred);
-	
-	if (fc->phase == FS_CONTEXT_AWAITING_RECONF) {
-		if (fc->fs_type->init_fs_context) {
-			ret = fc->fs_type->init_fs_context(fc, fc->root);
-			if (ret < 0) {
-				fc->phase = FS_CONTEXT_FAILED;
-				goto err_unlock;
-			}
-		} else {
-			/* Leave legacy context ops in place */
-		}
-
-		/* Do the security check last because ->init_fs_context may
-		 * change the namespace subscriptions.
-		 */
-		ret = security_fs_context_alloc(fc, fc->root);
-		if (ret < 0) {
-			fc->phase = FS_CONTEXT_FAILED;
-			goto err_unlock;
-		}
-
-		fc->phase = FS_CONTEXT_RECONF_PARAMS;
-	}
-
-	ret = -EINVAL;
-	switch (opt[0]) {
-	case 's':
-		if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
-		    fc->phase != FS_CONTEXT_RECONF_PARAMS)
-			goto wrong_phase;
-		ret = vfs_set_fs_source(fc, data, len - 2);
-		if (ret < 0)
-			goto err_unlock;
-		break;
-
-	case 'o':
-		if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
-		    fc->phase != FS_CONTEXT_RECONF_PARAMS)
-			goto wrong_phase;
-		ret = vfs_parse_fs_option(fc, data, len - 2);
-		if (ret < 0)
-			goto err_unlock;
-		break;
-
-	case 'x':
-		if (strcmp(data, "create") == 0) {
-			if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
-				goto wrong_phase;
-			fc->phase = FS_CONTEXT_CREATING;
-			ret = vfs_get_tree(fc);
-			if (ret == 0)
-				fc->phase = FS_CONTEXT_AWAITING_MOUNT;
-			else
-				fc->phase = FS_CONTEXT_FAILED;
-		} else {
-			ret = -EOPNOTSUPP;
-		}
-		if (ret < 0)
-			goto err_unlock;
-		break;
-
-	default:
-		goto err_unlock;
-	}
-
-	ret = len;
-err_unlock:
-	revert_creds(cred);
-	mutex_unlock(&fc->uapi_mutex);
-err_free:
-	kfree(data);
-	return ret;
-
-wrong_phase:
-	ret = -EBUSY;
-	goto err_unlock;
-}
-
 /*
  * Allow the user to read back any error, warning or informational messages.
  */
@@ -207,7 +78,6 @@ static int fscontext_release(struct inode *inode, struct file *file)
 
 const struct file_operations fscontext_fops = {
 	.read		= fscontext_read,
-	.write		= fscontext_write,
 	.release	= fscontext_release,
 	.llseek		= no_llseek,
 };
@@ -340,3 +210,234 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
 err:
 	return ret;
 }
+
+/*
+ * Check the state and apply the configuration.  Note that this function is
+ * allowed to 'steal' the value by setting *_value to NULL before returning.
+ */
+static int vfs_fsconfig(struct fs_context *fc, enum fsconfig_command cmd,
+			char *key, void **_value, long aux)
+{
+	void *value = *_value;
+	int ret;
+
+	/* We need to reinitialise the context if we have reconfiguration
+	 * pending after creation or a previous reconfiguration.
+	 */
+	if (fc->phase == FS_CONTEXT_AWAITING_RECONF) {
+		if (fc->fs_type->init_fs_context) {
+			ret = fc->fs_type->init_fs_context(fc, fc->root);
+			if (ret < 0) {
+				fc->phase = FS_CONTEXT_FAILED;
+				return ret;
+			}
+		} else {
+			/* Leave legacy context ops in place */
+		}
+
+		/* Do the security check last because ->init_fs_context may
+		 * change the namespace subscriptions.
+		 */
+		ret = security_fs_context_alloc(fc, fc->root);
+		if (ret < 0) {
+			fc->phase = FS_CONTEXT_FAILED;
+			return ret;
+		}
+
+		fc->phase = FS_CONTEXT_RECONF_PARAMS;
+	}
+
+	ret = -EINVAL;
+	switch (cmd) {
+	case fsconfig_set_string:
+		if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+		    fc->phase != FS_CONTEXT_RECONF_PARAMS)
+			return -EBUSY;
+		if (strcmp(key, "source") == 0)
+			return vfs_set_fs_source(fc, value, strlen(value));
+		/* Fall through */
+
+	case fsconfig_set_flag:
+	case fsconfig_set_binary:
+		if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+		    fc->phase != FS_CONTEXT_RECONF_PARAMS)
+			return -EBUSY;
+		return vfs_parse_fs_option(fc, key, value, aux);
+
+	case fsconfig_set_path:
+	case fsconfig_set_path_empty:
+	case fsconfig_set_fd:
+		if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+		    fc->phase != FS_CONTEXT_RECONF_PARAMS)
+			return -EBUSY;
+		BUG(); // TODO
+
+	case fsconfig_cmd_create:
+		if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
+			return -EBUSY;
+		fc->phase = FS_CONTEXT_CREATING;
+		ret = vfs_get_tree(fc);
+		if (ret == 0)
+			fc->phase = FS_CONTEXT_AWAITING_MOUNT;
+		else
+			fc->phase = FS_CONTEXT_FAILED;
+		return ret;
+
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return ret;
+}
+
+/**
+ * sys_fsconfig - Set parameters and trigger actions on a context
+ * @fd: The filesystem context to act upon
+ * @cmd: The action to take
+ * @_key: Where appropriate, the parameter key to set
+ * @_value: Where appropriate, the parameter value to set
+ * @aux: Additional information for the value
+ *
+ * This system call is used to set parameters on a context, including
+ * superblock settings, data source and security labelling.
+ *
+ * Actions include triggering the creation of a superblock and the
+ * reconfiguration of the superblock attached to the specified context.
+ *
+ * When setting a parameter, @cmd indicates the type of value being proposed
+ * and @_key indicates the parameter to be altered.
+ *
+ * @_value and @aux are used to specify the value, should a value be required:
+ *
+ * (*) fsconfig_set_flag: No value is specified.  The parameter must be boolean
+ *     in nature.  The key may be prefixed with "no" to invert the
+ *     setting. @_value must be NULL and @aux must be 0.
+ *
+ * (*) fsconfig_set_string: A string value is specified.  The parameter may
+ *     expect a boolean, an integer, a string or a path.  A conversion to an
+ *     appropriate type will be attempted (which may include looking up as a
+ *     path).  @_value points to a NUL-terminated string and @aux must be 0.
+ *
+ * (*) fsconfig_set_binary: A binary blob is specified.  @_value points to the
+ *     blob and @aux indicates its size.  The parameter must be expecting a
+ *     blob.
+ *
+ * (*) fsconfig_set_path: A non-empty path is specified.  The parameter must be
+ *     expecting a path object.  @_value points to a NUL-terminated string that
+ *     is the path and @aux is a file descriptor at which to start a relative
+ *     lookup or AT_FDCWD.
+ *
+ * (*) fsconfig_set_path_empty: As fsconfig_set_path, but with AT_EMPTY_PATH
+ *     implied.
+ *
+ * (*) fsconfig_set_fd: An open file descriptor is specified.  @_value must be
+ *     NULL and @aux indicates the file descriptor.
+ */
+SYSCALL_DEFINE5(fsconfig,
+		int, fd,
+		unsigned int, cmd,
+		const char __user *, _key,
+		const void __user *, _value,
+		int, aux)
+{
+	struct fs_context *fc;
+	struct fd f;
+	void *value = NULL;
+	char *key = NULL;
+	int ret;
+
+	if (fd < 0)
+		return -EINVAL;
+
+	switch (cmd) {
+	case fsconfig_set_flag:
+		if (!_key || _value || aux)
+			return -EINVAL;
+		break;
+	case fsconfig_set_string:
+		if (!_key || !_value || aux)
+			return -EINVAL;
+		break;
+	case fsconfig_set_binary:
+		if (!_key || !_value || aux <= 0 || aux > 1024 * 1024)
+			return -EINVAL;
+		break;
+	case fsconfig_set_path:
+	case fsconfig_set_path_empty:
+		if (!_key || !_value || (aux != AT_FDCWD && aux < 0))
+			return -EINVAL;
+		break;
+	case fsconfig_set_fd:
+		if (!_key || _value || aux < 0)
+			return -EINVAL;
+		break;
+	case fsconfig_cmd_create:
+	case fsconfig_cmd_reconfigure:
+		if (_key || _value || aux)
+			return -EINVAL;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+	ret = -EINVAL;
+	if (f.file->f_op != &fscontext_fops)
+		goto out_f;
+
+	fc = f.file->private_data;
+
+	if (_key) {
+		key = strndup_user(_key, 256);
+		if (IS_ERR(key)) {
+			ret = PTR_ERR(key);
+			goto out_f;
+		}
+	}
+
+	switch (cmd) {
+	case fsconfig_set_string:
+		value = strndup_user(_value, 256);
+		if (IS_ERR(value)) {
+			ret = PTR_ERR(value);
+			goto out_key;
+		}
+		break;
+	case fsconfig_set_binary:
+		value = memdup_user_nul(_value, aux);
+		if (IS_ERR(value)) {
+			ret = PTR_ERR(value);
+			goto out_key;
+		}
+		break;
+	case fsconfig_set_path:
+	case fsconfig_set_path_empty:
+	case fsconfig_set_fd:
+		ret = -EOPNOTSUPP;
+		goto out_key;
+	default:
+		break;
+	}
+
+	ret = mutex_lock_interruptible(&fc->uapi_mutex);
+	if (ret == 0) {
+		ret = vfs_fsconfig(fc, cmd, key, &value, aux);
+		mutex_unlock(&fc->uapi_mutex);
+	}
+
+	switch (cmd) {
+	case fsconfig_set_string:
+	case fsconfig_set_binary:
+		kfree(value);
+		/* Fall through */
+	default:
+		break;
+	}
+out_key:
+	kfree(key);
+out_f:
+	fdput(f);
+	return ret;
+}
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 305fab41e540..b5dc48c206c4 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -99,7 +99,7 @@ extern struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
 extern struct fs_context *vfs_sb_reconfig(struct path *path, unsigned int ms_flags);
 extern struct fs_context *vfs_dup_fs_context(struct fs_context *src);
 extern int vfs_set_fs_source(struct fs_context *fc, const char *source, size_t len);
-extern int vfs_parse_fs_option(struct fs_context *fc, char *opt, size_t len);
+extern int vfs_parse_fs_option(struct fs_context *fc, char *key, void *value, size_t v_len);
 extern int generic_parse_monolithic(struct fs_context *fc, void *data, size_t data_size);
 extern int vfs_get_tree(struct fs_context *fc);
 extern void put_fs_context(struct fs_context *fc);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index da3575dded79..39260701a267 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -911,6 +911,8 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
 asmlinkage long sys_fsinfo(int dfd, const char __user *path,
 			   struct fsinfo_params __user *params,
 			   void __user *buffer, size_t buf_size);
+asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
+			     const void __user *value, int aux);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index c27576d471c2..be70cbac21b4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -356,4 +356,18 @@ typedef int __bitwise __kernel_rwf_t;
 #define FSPICK_NO_AUTOMOUNT	0x00000004
 #define FSPICK_EMPTY_PATH	0x00000008
 
+/*
+ * The type of fsconfig() call made.
+ */
+enum fsconfig_command {
+	fsconfig_set_flag,		/* Set parameter, supplying no value */
+	fsconfig_set_string,		/* Set parameter, supplying a string value */
+	fsconfig_set_binary,		/* Set parameter, supplying a binary blob value */
+	fsconfig_set_path,		/* Set parameter, supplying an object by path */
+	fsconfig_set_path_empty,	/* Set parameter, supplying an object by (empty) path */
+	fsconfig_set_fd,		/* Set parameter, supplying an object by fd */
+	fsconfig_cmd_create,		/* Invoke superblock creation */
+	fsconfig_cmd_reconfigure,	/* Invoke superblock reconfiguration */
+};
+
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/samples/mount_api/test-fsmount.c b/samples/mount_api/test-fsmount.c
index 44d2dc9fc2a0..ee8db5761a9e 100644
--- a/samples/mount_api/test-fsmount.c
+++ b/samples/mount_api/test-fsmount.c
@@ -16,7 +16,7 @@
 #include <fcntl.h>
 #include <sys/prctl.h>
 #include <sys/wait.h>
-#include <linux/mount.h>
+#include <linux/fs.h>
 #include <linux/unistd.h>
 
 #define E(x) do { if ((x) == -1) { perror(#x); exit(1); } } while(0)
@@ -58,12 +58,6 @@ void mount_error(int fd, const char *s)
 	exit(1);
 }
 
-#define E_write(fd, s)							\
-	do {								\
-		if (write(fd, s, sizeof(s) - 1) == -1)			\
-			mount_error(fd, s);				\
-	} while (0)
-
 static inline int fsopen(const char *fs_name, unsigned int flags)
 {
 	return syscall(__NR_fsopen, fs_name, flags);
@@ -74,6 +68,12 @@ static inline int fsmount(int fsfd, unsigned int flags, unsigned int ms_flags)
 	return syscall(__NR_fsmount, fsfd, flags, ms_flags);
 }
 
+static inline int fsconfig(int fsfd, unsigned int cmd,
+			   const char *key, const void *val, int aux)
+{
+	return syscall(__NR_fsconfig, fsfd, cmd, key, val, aux);
+}
+
 static inline int move_mount(int from_dfd, const char *from_pathname,
 			     int to_dfd, const char *to_pathname,
 			     unsigned int flags)
@@ -83,6 +83,12 @@ static inline int move_mount(int from_dfd, const char *from_pathname,
 		       to_dfd, to_pathname, flags);
 }
 
+#define E_fsconfig(fd, cmd, key, val, aux)				\
+	do {								\
+		if (fsconfig(fd, cmd, key, val, aux) == -1)		\
+			mount_error(fd, key ?: "create");		\
+	} while (0)
+
 int main(int argc, char *argv[])
 {
 	int fsfd, mfd;
@@ -94,8 +100,8 @@ int main(int argc, char *argv[])
 		exit(1);
 	}
 
-	E_write(fsfd, "s #grand.central.org:root.cell.");
-	E_write(fsfd, "x create");
+	E_fsconfig(fsfd, fsconfig_set_string, "source", "#grand.central.org:root.cell.", 0);
+	E_fsconfig(fsfd, fsconfig_cmd_create, NULL, NULL, 0);
 	
 	mfd = fsmount(fsfd, 0, MS_RDONLY);
 	if (mfd < 0)

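As a reading aid, the argument checks that sys_fsconfig() performs before it
looks up the context can be restated as a small userspace predicate.  This is
only a sketch: fsconfig_check_args() and fsconfig_check_args_examples() are
hypothetical names, the "ro" key is made up for illustration, and the enum is
copied from the include/uapi/linux/fs.h hunk above so the block stands alone.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifndef AT_FDCWD
#define AT_FDCWD -100	/* Linux value; normally provided by <fcntl.h> */
#endif

/* Command numbering copied from the enum added by this patch. */
enum {
	fsconfig_set_flag,
	fsconfig_set_string,
	fsconfig_set_binary,
	fsconfig_set_path,
	fsconfig_set_path_empty,
	fsconfig_set_fd,
	fsconfig_cmd_create,
	fsconfig_cmd_reconfigure,
};

/*
 * Hypothetical helper: return 0 if (key, value, aux) would pass the first
 * switch in sys_fsconfig() as posted, or the negative errno it returns.
 */
int fsconfig_check_args(unsigned int cmd, const char *key,
			const void *value, int aux)
{
	switch (cmd) {
	case fsconfig_set_flag:		/* key only */
		return (!key || value || aux) ? -EINVAL : 0;
	case fsconfig_set_string:	/* key and NUL-terminated value */
		return (!key || !value || aux) ? -EINVAL : 0;
	case fsconfig_set_binary:	/* aux is the blob size, at most 1MiB */
		return (!key || !value || aux <= 0 || aux > 1024 * 1024) ?
			-EINVAL : 0;
	case fsconfig_set_path:
	case fsconfig_set_path_empty:	/* aux is a dirfd or AT_FDCWD */
		return (!key || !value || (aux != AT_FDCWD && aux < 0)) ?
			-EINVAL : 0;
	case fsconfig_set_fd:		/* aux is the fd, no value pointer */
		return (!key || value || aux < 0) ? -EINVAL : 0;
	case fsconfig_cmd_create:
	case fsconfig_cmd_reconfigure:	/* pure actions take no arguments */
		return (key || value || aux) ? -EINVAL : 0;
	default:
		return -EOPNOTSUPP;
	}
}

/* Usage: the two calls made by the updated test-fsmount.c pass the checks. */
void fsconfig_check_args_examples(void)
{
	assert(fsconfig_check_args(fsconfig_set_string, "source",
				   "#grand.central.org:root.cell.", 0) == 0);
	assert(fsconfig_check_args(fsconfig_cmd_create, NULL, NULL, 0) == 0);
	/* A flag must not carry a value. */
	assert(fsconfig_check_args(fsconfig_set_flag, "ro", "1", 0) == -EINVAL);
}
```

A checker like this mainly shows which (key, value, aux) combinations each
command accepts; the kernel remains the authority, and in the syscall as
posted the set_path, set_path_empty and set_fd commands still fail later with
-EOPNOTSUPP.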

* Re: [PATCH 24/32] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #9]
  2018-07-13  7:30                 ` David Howells
@ 2018-07-19  1:30                   ` Eric W. Biederman
  0 siblings, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2018-07-19  1:30 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, Theodore Y. Ts'o, Linus Torvalds,
	Andrew Lutomirski, Al Viro, Linux API, linux-fsdevel,
	Linux Kernel Mailing List, Jann Horn

David Howells <dhowells@redhat.com> writes:

> Andy Lutomirski <luto@amacapital.net> wrote:
>
>> > Also you can't currently directly create a bind mount from userspace as you
>> > can only bind from another path point - which you may not be able to access
>> > (either by permission failure or because it's not in your mount namespace).
>> > 
>> 
>> Are you trying to preserve the magic bind semantics with the new API?
>
> No, I'm pointing out that you can't emulate this by doing a bind mount from
> userspace if you can't access the thing you're binding from.
>
> Now, we could create a syscall that just picks up an extant superblock using a
> device and attaches it to a mount for you, but that would have to be at least
> partially parameterised - which would be very fs-dependent - so that it can
> know whether or not you're allowed to create another mount to that sb.
>
> What you're talking about is emulating sget() in userspace - when we have to
> do it in the kernel anyway if we still offer mount(2).

I am just going to chime in and say that it is absolutely a problem in
the current mount interface that when I mount a filesystem with fresh
parameters I don't know if it generates an sget and a new super_block
or if it just increments the refcount on an existing super_block.

It is the kind of problem that is actually security sensitive and has
resulted in a security issue in the current linux kernel with respect to
proc.

So yes we absolutely need to have a clean way of dealing with:

mount /dev/sda3 /tmp
mount /dev/sda3 /mnt

So that the second one is forbidden and fails.  And userspace has to do the
equivalent of sget to get a file descriptor it can bind into the mount
namespace.

The deep problem is that the second mount does not parse the mount
options and userspace does not know that.  So userspace thinks it is
getting one kind of mount and in practice it gets another (sometimes
with different security properties).  Those different security
properties are an out and out bug.  Although any kind of different and
unexpected properties can be a problem.

Eric

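The failure mode described above can be illustrated with a toy model.  This is
not kernel code: struct toy_sb, toy_mount_shared() and toy_mount_exclusive()
are hypothetical, and keying the table on a device string is a simplifying
assumption (the real sget() matches superblocks with a filesystem-supplied
test callback).  The first function mimics a mount that silently reuses an
existing superblock and never looks at the new options; the second refuses
instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for a superblock; every name here is hypothetical. */
struct toy_sb {
	char dev[32];
	char opts[64];
	int refcount;		/* 0 means the slot is free */
};

static struct toy_sb toy_table[8];
#define TOY_NR (sizeof(toy_table) / sizeof(toy_table[0]))

/* Look for a live superblock on this device. */
static struct toy_sb *toy_find(const char *dev)
{
	for (size_t i = 0; i < TOY_NR; i++)
		if (toy_table[i].refcount > 0 &&
		    strcmp(toy_table[i].dev, dev) == 0)
			return &toy_table[i];
	return NULL;
}

/* Claim a free slot and record the options used at creation time. */
static struct toy_sb *toy_create(const char *dev, const char *opts)
{
	for (size_t i = 0; i < TOY_NR; i++) {
		struct toy_sb *sb = &toy_table[i];

		if (sb->refcount == 0) {
			snprintf(sb->dev, sizeof(sb->dev), "%s", dev);
			snprintf(sb->opts, sizeof(sb->opts), "%s", opts);
			sb->refcount = 1;
			return sb;
		}
	}
	return NULL;
}

/* Sharing behaviour: an existing superblock wins and opts are ignored. */
struct toy_sb *toy_mount_shared(const char *dev, const char *opts)
{
	struct toy_sb *sb;

	assert(dev && opts);
	sb = toy_find(dev);
	if (sb) {
		sb->refcount++;	/* the caller's opts are never parsed */
		return sb;
	}
	return toy_create(dev, opts);
}

/* Strict behaviour: refuse rather than hand back a surprising mount. */
struct toy_sb *toy_mount_exclusive(const char *dev, const char *opts, int *err)
{
	struct toy_sb *sb;

	assert(dev && opts && err);
	if (toy_find(dev)) {
		*err = EBUSY;
		return NULL;
	}
	sb = toy_create(dev, opts);
	*err = sb ? 0 : ENOMEM;
	return sb;
}
```

With this model, mounting "/dev/sda3" with "rw" and then again with
"ro,nosuid" through toy_mount_shared() hands back the first superblock with
its options still "rw", whereas toy_mount_exclusive() fails the second attempt
with EBUSY.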

* Re: [MANPAGE PATCH] Add manpages for move_mount(2) and open_tree(2)
  2018-07-10 22:52 ` [MANPAGE PATCH] Add manpages for move_mount(2) and open_tree(2) David Howells
@ 2019-10-09  9:51   ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 113+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-10-09  9:51 UTC (permalink / raw)
  To: David Howells
  Cc: mtk.manpages, viro, linux-api, linux-fsdevel, torvalds,
	linux-kernel, linux-man, Eric W. Biederman

Hello David,

You wrote a series of manual page patches (of which the mail below is one)
for the new mount API about a year before the code patches were actually
released in the kernel.

I'd like to check that these man-pages patches are up to date before
merging them. I think they may not be, since there is one patch for
fsinfo(2) which does not exist in the kernel, and no manual page for
fsconfig(2). I imagine that details may also have changed
in the system calls that were ultimately merged.

Could you write a manual page for fsconfig(2) please?

With respect to the patch below, would you be willing to:
* split it into two pieces, one for each page.
* review the content to see if it accurately reflects what was
  merged into the kernel and then resubmit please?

Thanks,

Michael

On 7/11/18 12:52 AM, David Howells wrote:
> Add manual pages to document the move_mount() and open_tree() system calls.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  man2/move_mount.2 |  274 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  man2/open_tree.2  |  260 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 534 insertions(+)
>  create mode 100644 man2/move_mount.2
>  create mode 100644 man2/open_tree.2
> 
> diff --git a/man2/move_mount.2 b/man2/move_mount.2
> new file mode 100644
> index 000000000..3a819fb84
> --- /dev/null
> +++ b/man2/move_mount.2
> @@ -0,0 +1,274 @@
> +'\" t
> +.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH MOVE_MOUNT 2 2018-06-08 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +move_mount \- Move mount objects around the filesystem topology
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.br
> +.B #include <sys/mount.h>
> +.br
> +.B #include <unistd.h>
> +.br
> +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> +.PP
> +.BI "int move_mount(int " from_dirfd ", const char *" from_pathname ","
> +.BI "               int " to_dirfd ", const char *" to_pathname ","
> +.BI "               unsigned int " flags );
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call.
> +.SH DESCRIPTION
> +The
> +.BR move_mount ()
> +call moves a mount from one place to another; it can also be used to attach an
> +unattached mount created by
> +.BR fsmount "() or " open_tree "() with " OPEN_TREE_CLONE .
> +.PP
> +If
> +.BR move_mount ()
> +is called repeatedly with a file descriptor that refers to a mount object,
> +then the object will be attached or moved on the first call and moved again
> +on each subsequent call, being detached from the previous mountpoint each time.
> +.PP
> +To access the source mount object or the destination mountpoint, no
> +permissions are required on the object itself, but if either pathname is
> +supplied, execute (search) permission is required on all of the directories
> +specified in
> +.IR from_pathname " or " to_pathname .
> +.PP
> +The caller does, however, require the appropriate capabilities or permission
> +to effect a mount.
> +.PP
> +.BR move_mount ()
> +uses
> +.IR from_pathname ", " from_dirfd " and some " flags
> +to locate the mount object to be moved and
> +.IR to_pathname ", " to_dirfd " and some other " flags
> +to locate the destination mountpoint.  Each lookup can be done in one of a
> +variety of ways:
> +.TP
> +[*] By absolute path.
> +The pathname points to an absolute path and the dirfd is ignored.  The file is
> +looked up by name, starting from the root of the filesystem as seen by the
> +calling process.
> +.TP
> +[*] By cwd-relative path.
> +The pathname points to a relative path and the dirfd is
> +.IR AT_FDCWD .
> +The file is looked up by name, starting from the current working directory.
> +.TP
> +[*] By dir-relative path.
> +The pathname points to a relative path and the dirfd indicates a file descriptor
> +pointing to a directory.  The file is looked up by name, starting from the
> +directory specified by
> +.IR dirfd .
> +.TP
> +[*] By file descriptor.
> +The pathname points to "", the dirfd points directly to the mount object to
> +move or the destination mount point and the appropriate
> +.B *_EMPTY_PATH
> +flag is set.
> +.PP
> +.I flags
> +can be used to influence a path-based lookup.  A value for
> +.I flags
> +is constructed by OR'ing together zero or more of the following constants:
> +.TP
> +.BR MOVE_MOUNT_F_EMPTY_PATH
> +.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
> +If
> +.I from_pathname
> +is an empty string, operate on the file referred to by
> +.IR from_dirfd
> +(which may have been obtained using the
> +.BR open (2)
> +.B O_PATH
> +flag or
> +.BR open_tree ()).
> +If
> +.I from_dirfd
> +is
> +.BR AT_FDCWD ,
> +the call operates on the current working directory.
> +In this case,
> +.I from_dirfd
> +can refer to any type of file, not just a directory.
> +This flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.B MOVE_MOUNT_T_EMPTY_PATH
> +As above, but operating on
> +.IR to_pathname " and " to_dirfd .
> +.TP
> +.B MOVE_MOUNT_F_NO_AUTOMOUNT
> +Don't automount the terminal ("basename") component of
> +.I from_pathname
> +if it is a directory that is an automount point.  This allows a mount object
> +that has an automount point at its root to be moved and prevents unintended
> +triggering of an automount point.
> +The
> +.B MOVE_MOUNT_F_NO_AUTOMOUNT
> +flag has no effect if the automount point has already been mounted over.  This
> +flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.B MOVE_MOUNT_T_NO_AUTOMOUNT
> +As above, but operating on
> +.IR to_pathname " and " to_dirfd .
> +This allows an automount point to be manually mounted over.
> +.TP
> +.B MOVE_MOUNT_F_SYMLINKS
> +If
> +.I from_pathname
> +is a symbolic link, then dereference it.  The default for
> +.BR move_mount ()
> +is to not follow symlinks.
> +.TP
> +.B MOVE_MOUNT_T_SYMLINKS
> +As above, but operating on
> +.IR to_pathname " and " to_dirfd .
> +
> +.SH EXAMPLES
> +The
> +.BR move_mount ()
> +function can be used like the following:
> +.PP
> +.RS
> +.nf
> +move_mount(AT_FDCWD, "/a", AT_FDCWD, "/b", 0);
> +.fi
> +.RE
> +.PP
> +This would move the object mounted on "/a" to "/b".  It can also be used in
> +conjunction with
> +.BR open_tree "(2) or " open "(2) with " O_PATH :
> +.PP
> +.RS
> +.nf
> +fd = open_tree(AT_FDCWD, "/mnt", 0);
> +move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);
> +move_mount(fd, "", AT_FDCWD, "/mnt3", MOVE_MOUNT_F_EMPTY_PATH);
> +move_mount(fd, "", AT_FDCWD, "/mnt4", MOVE_MOUNT_F_EMPTY_PATH);
> +.fi
> +.RE
> +.PP
> +This would attach the path point for "/mnt" to fd, then it would move the
> +mount to "/mnt2", then move it to "/mnt3" and finally to "/mnt4".
> +.PP
> +It can also be used to attach new mounts:
> +.PP
> +.RS
> +.nf
> +sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> +write(sfd, "s /dev/sda1");
> +write(sfd, "o user_xattr");
> +mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_NODEV);
> +move_mount(mfd, "", AT_FDCWD, "/home", MOVE_MOUNT_F_EMPTY_PATH);
> +.fi
> +.RE
> +.PP
> +This would open the ext4 filesystem on "/dev/sda1", turn on user
> +extended attribute support and create a mount object for it.  Finally, the new
> +mount object would be attached with
> +.BR move_mount ()
> +to "/home".
> +
> +
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.SH RETURN VALUE
> +On success, 0 is returned.  On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Search permission is denied for one of the directories
> +in the path prefix of
> +.IR from_pathname " or " to_pathname .
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EBADF
> +.IR from_dirfd " or " to_dirfd
> +is not a valid open file descriptor.
> +.TP
> +.B EFAULT
> +.IR from_pathname " or " to_pathname
> +is NULL, or either one points to a location outside the process's accessible
> +address space.
> +.TP
> +.B EINVAL
> +Reserved flag specified in
> +.IR flags .
> +.TP
> +.B ELOOP
> +Too many symbolic links encountered while traversing the pathname.
> +.TP
> +.B ENAMETOOLONG
> +.IR from_pathname " or " to_pathname
> +is too long.
> +.TP
> +.B ENOENT
> +A component of
> +.IR from_pathname " or " to_pathname
> +does not exist, or one is an empty string and the appropriate
> +.B *_EMPTY_PATH
> +was not specified in
> +.IR flags .
> +.TP
> +.B ENOMEM
> +Out of memory (i.e., kernel memory).
> +.TP
> +.B ENOTDIR
> +A component of the path prefix of
> +.IR from_pathname " or " to_pathname
> +is not a directory or one or the other is relative and the appropriate
> +.I *_dirfd
> +is a file descriptor referring to a file other than a directory.
> +.SH VERSIONS
> +.BR move_mount ()
> +was added to Linux in kernel 4.18.
> +.SH CONFORMING TO
> +.BR move_mount ()
> +is Linux-specific.
> +.SH NOTES
> +Glibc does not (yet) provide a wrapper for the
> +.BR move_mount ()
> +system call; call it using
> +.BR syscall (2).
> +.SH SEE ALSO
> +.BR fsmount (2),
> +.BR fsopen (2),
> +.BR open_tree (2)
> diff --git a/man2/open_tree.2 b/man2/open_tree.2
> new file mode 100644
> index 000000000..7e9c86fe3
> --- /dev/null
> +++ b/man2/open_tree.2
> @@ -0,0 +1,260 @@
> +'\" t
> +.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH OPEN_TREE 2 2018-06-08 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +open_tree \- Pick or clone mount object and attach to fd
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.br
> +.B #include <sys/mount.h>
> +.br
> +.B #include <unistd.h>
> +.br
> +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> +.PP
> +.BI "int open_tree(int " dirfd ", const char *" pathname ", unsigned int " flags );
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call.
> +.SH DESCRIPTION
> +.BR open_tree ()
> +picks the mount object specified by the pathname and attaches it to a new file
> +descriptor or clones it and attaches the clone to the file descriptor.  The
> +resultant file descriptor is indistinguishable from one produced by
> +.BR open "(2) with " O_PATH .
> +.PP
> +In the case that the mount object is cloned, the clone will be "unmounted" and
> +destroyed when the file descriptor is closed if it is not otherwise mounted
> +somewhere by calling
> +.BR move_mount (2).
> +.PP
> +To select a mount object, no permissions are required on the object referred
> +to by the path, but execute (search) permission is required on all of the
> +directories in
> +.I pathname
> +that lead to the object.
> +.PP
> +To clone an object, however, the caller must have mount capabilities and
> +permissions.
> +.PP
> +.BR open_tree ()
> +uses
> +.IR pathname ", " dirfd " and " flags
> +to locate the target object in one of a variety of ways:
> +.TP
> +[*] By absolute path.
> +.I pathname
> +points to an absolute path and
> +.I dirfd
> +is ignored.  The object is looked up by name, starting from the root of the
> +filesystem as seen by the calling process.
> +.TP
> +[*] By cwd-relative path.
> +.I pathname
> +points to a relative path and
> +.IR dirfd " is " AT_FDCWD .
> +The object is looked up by name, starting from the current working directory.
> +.TP
> +[*] By dir-relative path.
> +.I pathname
> +points to a relative path and
> +.I dirfd
> +indicates a file descriptor pointing to a directory.  The object is looked up
> +by name, starting from the directory specified by
> +.IR dirfd .
> +.TP
> +[*] By file descriptor.
> +.I pathname
> +is "",
> +.I dirfd
> +indicates a file descriptor and
> +.B AT_EMPTY_PATH
> +is set in
> +.IR flags .
> +The mount attached to the file descriptor is queried directly.  The file
> +descriptor may point to any type of file, not just a directory.
> +
> +.\"______________________________________________________________
> +.PP
> +.I flags
> +can be used to control the operation of the function and to influence a
> +path-based lookup.  A value for
> +.I flags
> +is constructed by OR'ing together zero or more of the following constants:
> +.TP
> +.BR AT_EMPTY_PATH
> +.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
> +If
> +.I pathname
> +is an empty string, operate on the file referred to by
> +.IR dirfd
> +(which may have been obtained from
> +.BR open "(2) with"
> +.BR O_PATH ", from " fsmount (2)
> +or from another
> +.BR open_tree ()).
> +If
> +.I dirfd
> +is
> +.BR AT_FDCWD ,
> +the call operates on the current working directory.
> +In this case,
> +.I dirfd
> +can refer to any type of file, not just a directory.
> +This flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.BR AT_NO_AUTOMOUNT
> +Don't automount the terminal ("basename") component of
> +.I pathname
> +if it is a directory that is an automount point.  This flag allows the
> +automount point itself to be picked up or a mount cloned that is rooted on the
> +automount point.  The
> +.B AT_NO_AUTOMOUNT
> +flag has no effect if the mount point has already been mounted over.
> +This flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.B AT_SYMLINK_NOFOLLOW
> +If
> +.I pathname
> +is a symbolic link, do not dereference it: instead pick up or clone a mount
> +rooted on the link itself.
> +.TP
> +.B OPEN_TREE_CLOEXEC
> +Set the close-on-exec flag for the new file descriptor.  This will cause the
> +file descriptor to be closed automatically when a process exec's.
> +.TP
> +.B OPEN_TREE_CLONE
> +Rather than directly attaching the selected object to the file descriptor,
> +clone the object, set the root of the new mount object to that point and
> +attach the clone to the file descriptor.
> +.TP
> +.B AT_RECURSIVE
> +This is only permitted in conjunction with OPEN_TREE_CLONE.  It causes the
> +entire mount subtree rooted at the selected spot to be cloned rather than just
> +that one mount object.
> +
> +
> +.SH EXAMPLE
> +The
> +.BR open_tree ()
> +function can be used like the following:
> +.PP
> +.RS
> +.nf
> +fd1 = open_tree(AT_FDCWD, "/mnt", 0);
> +fd2 = open_tree(fd1, "",
> +                AT_EMPTY_PATH | OPEN_TREE_CLONE | AT_RECURSIVE);
> +move_mount(fd2, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);
> +.fi
> +.RE
> +.PP
> +This would attach the path point for "/mnt" to fd1, then it would copy the
> +entire subtree at the point referred to by fd1 and attach that to fd2; lastly,
> +it would attach the clone to "/mnt2".
> +
> +
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.SH RETURN VALUE
> +On success, the new file descriptor is returned.  On error, \-1 is returned,
> +and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Search permission is denied for one of the directories
> +in the path prefix of
> +.IR pathname .
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EBADF
> +.I dirfd
> +is not a valid open file descriptor.
> +.TP
> +.B EFAULT
> +.I pathname
> +is NULL or
> +.I pathname
> +points to a location outside the process's accessible address space.
> +.TP
> +.B EINVAL
> +Reserved flag specified in
> +.IR flags .
> +.TP
> +.B ELOOP
> +Too many symbolic links encountered while traversing the pathname.
> +.TP
> +.B ENAMETOOLONG
> +.I pathname
> +is too long.
> +.TP
> +.B ENOENT
> +A component of
> +.I pathname
> +does not exist, or
> +.I pathname
> +is an empty string and
> +.B AT_EMPTY_PATH
> +was not specified in
> +.IR flags .
> +.TP
> +.B ENOMEM
> +Out of memory (i.e., kernel memory).
> +.TP
> +.B ENOTDIR
> +A component of the path prefix of
> +.I pathname
> +is not a directory or
> +.I pathname
> +is relative and
> +.I dirfd
> +is a file descriptor referring to a file other than a directory.
> +.SH VERSIONS
> +.BR open_tree ()
> +was added to Linux in kernel 4.18.
> +.SH CONFORMING TO
> +.BR open_tree ()
> +is Linux-specific.
> +.SH NOTES
> +Glibc does not (yet) provide a wrapper for the
> +.BR open_tree ()
> +system call; call it using
> +.BR syscall (2).
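
Since the NOTES section points callers at syscall(2), a minimal sketch of such a call might look like the following. The wrapper name open_tree_raw() is hypothetical, and the sketch assumes the installed headers define SYS_open_tree; where they do not, it simply fails with ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>          /* AT_FDCWD and the AT_* flags */
#include <sys/syscall.h>    /* SYS_open_tree, if the headers know it */
#include <unistd.h>

/*
 * Hypothetical wrapper (not part of the patch): invoke open_tree via
 * syscall(2).  Returns a file descriptor, or -1 with errno set.
 */
int open_tree_raw(int dirfd, const char *pathname, unsigned int flags)
{
#ifdef SYS_open_tree
    return (int)syscall(SYS_open_tree, dirfd, pathname, flags);
#else
    /* Headers predate the system call: report it as unimplemented. */
    (void)dirfd;
    (void)pathname;
    (void)flags;
    errno = ENOSYS;
    return -1;
#endif
}
```

A caller would check for -1 and inspect errno exactly as for any other raw system call.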
> +.SH SEE ALSO
> +.BR fsmount (2),
> +.BR move_mount (2),
> +.BR open (2)
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [MANPAGE PATCH] Add manpage for fsopen(2), fspick(2) and fsmount(2)
  2018-07-10 22:54 ` [MANPAGE PATCH] Add manpage for fsopen(2), fspick(2) and fsmount(2) David Howells
@ 2019-10-09  9:52   ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 113+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-10-09  9:52 UTC (permalink / raw)
  To: David Howells
  Cc: mtk.manpages, viro, linux-api, linux-fsdevel, torvalds,
	linux-kernel, linux-man, Eric W. Biederman

Hello David,

See my previous mail.

With respect to the patch below, would you be willing to review
the content of this man-pages patch to see if it accurately reflects 
what was merged into the kernel, and then resubmit please?

Thanks,

Michael

On 7/11/18 12:54 AM, David Howells wrote:
> Add a manual page to document the fsopen(), fspick() and fsmount() system
> calls.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  man2/fsmount.2 |    1 
>  man2/fsopen.2  |  357 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  man2/fspick.2  |    1 
>  3 files changed, 359 insertions(+)
>  create mode 100644 man2/fsmount.2
>  create mode 100644 man2/fsopen.2
>  create mode 100644 man2/fspick.2
> 
> diff --git a/man2/fsmount.2 b/man2/fsmount.2
> new file mode 100644
> index 000000000..2bf59fc3e
> --- /dev/null
> +++ b/man2/fsmount.2
> @@ -0,0 +1 @@
> +.so man2/fsopen.2
> diff --git a/man2/fsopen.2 b/man2/fsopen.2
> new file mode 100644
> index 000000000..1bc761ab4
> --- /dev/null
> +++ b/man2/fsopen.2
> @@ -0,0 +1,357 @@
> +'\" t
> +.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH FSOPEN 2 2018-06-07 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +fsopen, fsmount, fspick \- Handle filesystem (re-)configuration and mounting
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.br
> +.B #include <sys/mount.h>
> +.br
> +.B #include <unistd.h>
> +.br
> +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> +.PP
> +.BI "int fsopen(const char *" fsname ", unsigned int " flags );
> +.PP
> +.BI "int fsmount(int " fd ", unsigned int " flags ", unsigned int " ms_flags );
> +.PP
> +.BI "int fspick(int " dirfd ", const char *" pathname ", unsigned int " flags );
> +.fi
> +.PP
> +.IR Note :
> +There are no glibc wrappers for these system calls.
> +.SH DESCRIPTION
> +.PP
> +.BR fsopen ()
> +creates a new filesystem configuration context within the kernel for the
> +filesystem named in the
> +.I fsname
> +parameter and attaches it to a file descriptor, which it then returns.  The
> +file descriptor can be marked close-on-exec by setting
> +.B FSOPEN_CLOEXEC
> +in flags.
> +.PP
> +The
> +file descriptor can then be used to configure the desired filesystem parameters
> +and security parameters by using
> +.BR write (2)
> +to pass parameters to it and then writing a command to actually create the
> +filesystem representation.
> +.PP
> +The file descriptor also serves as a channel by which more comprehensive error,
> +warning and information messages may be retrieved from the kernel using
> +.BR read (2).
> +.PP
> +Once the kernel's filesystem representation has been created, it can be queried
> +by calling
> +.BR fsinfo (2)
> +on the file descriptor.  fsinfo() will spot that the target is actually a
> +creation context and look inside that.
> +.PP
> +.BR fsmount ()
> +can then be called to create a mount object that refers to the newly created
> +filesystem representation, with the propagation and mount restrictions to be
> +applied specified in
> +.IR ms_flags .
> +The mount object is then attached to a new file descriptor that looks like one
> +created by
> +.BR open "(2) with " O_PATH " or " open_tree (2).
> +This can be passed to
> +.BR move_mount (2)
> +to attach the mount object to a mountpoint, thereby completing the process.
> +.PP
> +The file descriptor returned by fsmount() is marked close-on-exec if
> +FSMOUNT_CLOEXEC is specified in
> +.IR flags .
> +.PP
> +After fsmount() has completed, the context created by fsopen() is reset and
> +moved to reconfiguration state, allowing the new superblock to be reconfigured.
> +.PP
> +.BR fspick ()
> +creates a new filesystem context within the kernel, attaches the superblock
> +specified by
> +.IR dirfd ", " pathname " and " flags ,
> +puts it into the reconfiguration state and attaches the context to a new
> +file descriptor that can then be parameterised with
> +.BR write (2)
> +exactly the same as for the context created by fsopen() above.
> +.PP
> +.I flags
> +is an OR'd together mask of
> +.B FSPICK_CLOEXEC
> +which indicates that the returned file descriptor should be marked
> +close-on-exec and
> +.BR FSPICK_SYMLINK_NOFOLLOW ", " FSPICK_NO_AUTOMOUNT " and " FSPICK_EMPTY_PATH
> +which control the pathwalk to the target object (see below).
> +
> +.\"________________________________________________________
> +.SS Writable Command Interface
> +Superblock (re-)configuration is achieved by writing command strings to the
> +context file descriptor using
> +.BR write (2).
> +Each string is prefixed with a specifier indicating the class of command
> +being specified.  The available commands include:
> +.TP
> +\fB"o <option>"\fP
> +Specify a filesystem or security parameter.
> +.I <option>
> +is typically a key or key=val format string.  Since the length of the option is
> +given to write(), the option may include any sort of character, including
> +spaces and commas or even binary data.
> +.TP
> +\fB"s <name>"\fP
> +Specify a device file, network server or other source specification.
> +This may be optional, depending on the filesystem, and it may be possible to
> +provide multiple of them to a filesystem.
> +.TP
> +\fB"x create"\fP
> +End the filesystem configuration phase and try and create a representation in
> +the kernel with the parameters specified.  After this, the context is shifted
> +to the mount-pending state waiting for an fsmount() call to occur.
> +.TP
> +\fB"x reconfigure"\fP
> +End a filesystem reconfiguration phase and try to apply the parameters to the
> +filesystem representation.  After this, the context gets reset and put back to
> +the start of the reconfiguration phase again.
> +.PP
> +With this interface, option strings are not limited to 4096 bytes, either
> +individually or in sum, and they are also not restricted to text-only options.
> +Further, errors may be given individually for each option and not aggregated or
> +dumped into the kernel log.
> +
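
To make the command format described in this section concrete, here is a small sketch of how a caller might assemble one command string and obtain the byte count to pass to write(2). The helper name fs_build_command() is hypothetical and not part of the patch; only the "class character, space, argument" layout comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: build "<class> <arg>" in buf.
 * Returns the number of bytes to hand to write(2) (the trailing NUL is
 * stored for convenience but not counted), or 0 if buf is too small.
 */
size_t fs_build_command(char cls, const char *arg, char *buf, size_t buflen)
{
    size_t arglen, total;

    assert(arg != NULL && buf != NULL);
    arglen = strlen(arg);
    total = 2 + arglen;             /* class char + space + argument */
    if (buflen < total + 1)
        return 0;
    buf[0] = cls;
    buf[1] = ' ';
    memcpy(buf + 2, arg, arglen);
    buf[total] = '\0';
    return total;
}
```

For arguments containing binary data, a variant taking an explicit argument length instead of using strlen() would be needed.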
> +.\"________________________________________________________
> +.SS Message Retrieval Interface
> +The context file descriptor may be queried for message strings at any time by
> +calling
> +.BR read (2)
> +on the file descriptor.  This will return formatted messages that are prefixed
> +to indicate their class:
> +.TP
> +\fB"e <message>"\fP
> +An error message string was logged.
> +.TP
> +\fB"i <message>"\fP
> +An informational message string was logged.
> +.TP
> +\fB"w <message>"\fP
> +A warning message string was logged.
> +.PP
> +Messages are removed from the queue as they're read.
> +
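
A reader of these messages would presumably dispatch on the class prefix. The following sketch shows one way to do that; the enum and function names are hypothetical, and the only thing taken from the text is the "e", "i" and "w" prefixes followed by a space.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a message read from the context fd. */
enum fs_msg_class {
    FS_MSG_ERROR,
    FS_MSG_INFO,
    FS_MSG_WARNING,
    FS_MSG_UNKNOWN
};

/*
 * Classify a "<class> <message>" string; len is the byte count that
 * read(2) returned for it.
 */
enum fs_msg_class fs_classify_message(const char *msg, size_t len)
{
    assert(msg != NULL);
    if (len < 2 || msg[1] != ' ')
        return FS_MSG_UNKNOWN;
    switch (msg[0]) {
    case 'e': return FS_MSG_ERROR;
    case 'i': return FS_MSG_INFO;
    case 'w': return FS_MSG_WARNING;
    default:  return FS_MSG_UNKNOWN;
    }
}
```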
> +.\"________________________________________________________
> +.SH EXAMPLES
> +To illustrate the process, here's an example whereby this can be used to mount
> +an ext4 filesystem on /dev/sdb1 onto /mnt.  Note that the example ignores the
> +fact that
> +.BR write (2)
> +has a length parameter and that errors might occur.
> +.PP
> +.in +4n
> +.nf
> +sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> +write(sfd, "s /dev/sdb1");
> +write(sfd, "o noatime");
> +write(sfd, "o acl");
> +write(sfd, "o user_attr");
> +write(sfd, "o iversion");
> +write(sfd, "x create");
> +fsinfo(sfd, NULL, ...);
> +mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
> +move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> +.fi
> +.in
> +.PP
> +Here, an ext4 context is created first and attached to sfd.  This is then told
> +where its source will be, given a bunch of options and created.
> +.BR fsinfo (2)
> +can then be used to query the filesystem.  Then fsmount() is called to create a
> +mount object and
> +.BR move_mount (2)
> +is called to attach it to its intended mountpoint.
> +.PP
> +And here's an example of mounting from an NFS server:
> +.PP
> +.in +4n
> +.nf
> +sfd = fsopen("nfs", 0);
> +write(sfd, "s example.com/pub/linux");
> +write(sfd, "o nfsvers=3");
> +write(sfd, "o rsize=65536");
> +write(sfd, "o wsize=65536");
> +write(sfd, "o rdma");
> +write(sfd, "x create");
> +mfd = fsmount(sfd, 0, MS_NODEV);
> +move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> +.fi
> +.in
> +.PP
> +Reconfiguration can be achieved by:
> +.PP
> +.in +4n
> +.nf
> +sfd = fspick(AT_FDCWD, "/mnt", FSPICK_NO_AUTOMOUNT | FSPICK_CLOEXEC);
> +write(sfd, "o ro");
> +write(sfd, "x reconfigure");
> +.fi
> +.in
> +.PP
> +or:
> +.PP
> +.in +4n
> +.nf
> +sfd = fsopen(...);
> +...
> +mfd = fsmount(sfd, ...);
> +...
> +write(sfd, "o ro");
> +write(sfd, "x reconfigure");
> +.fi
> +.in
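
The examples above deliberately omit the length argument to write(2) and all error checking. A sketch of what a checked version of each write might look like follows; fs_send_command() is a hypothetical helper, and treating a short write as EIO is an assumption of this sketch rather than documented behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: send one command string to a context fd, passing
 * the length explicitly.  Returns 0 on success, -1 with errno set.
 */
int fs_send_command(int fd, const char *cmd)
{
    size_t len;
    ssize_t n;

    assert(cmd != NULL);
    len = strlen(cmd);
    do {
        n = write(fd, cmd, len);
    } while (n == -1 && errno == EINTR);
    if (n == -1)
        return -1;
    if ((size_t)n != len) {
        /* Assumption: a command must be delivered in one piece. */
        errno = EIO;
        return -1;
    }
    return 0;
}
```

With such a helper, each write(sfd, "...") line in the examples would become a call like fs_send_command(sfd, "o ro") whose result is checked before continuing.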
> +
> +
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.SH RETURN VALUE
> +On success, all three functions return a file descriptor.  On error, \-1 is
> +returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +The error values given below result from filesystem type independent
> +errors.
> +Each filesystem type may have its own special errors and its
> +own special behavior.
> +See the Linux kernel source code for details.
> +.TP
> +.B EACCES
> +A component of a path was not searchable.
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EACCES
> +Mounting a read-only filesystem was attempted without giving the
> +.B MS_RDONLY
> +flag.
> +.TP
> +.B EACCES
> +The block device
> +.I source
> +is located on a filesystem mounted with the
> +.B MS_NODEV
> +option.
> +.\" mtk: Probably: write permission is required for MS_BIND, with
> +.\" the error EPERM if not present; CAP_DAC_OVERRIDE is required.
> +.TP
> +.B EBUSY
> +.I source
> +cannot be reconfigured read-only, because it still holds files open for
> +writing.
> +.TP
> +.B EFAULT
> +One of the pointer arguments points outside the user address space.
> +.TP
> +.B EINVAL
> +.I source
> +had an invalid superblock.
> +.TP
> +.B EINVAL
> +.I ms_flags
> +includes more than one of
> +.BR MS_SHARED ,
> +.BR MS_PRIVATE ,
> +.BR MS_SLAVE ,
> +or
> +.BR MS_UNBINDABLE .
> +.TP
> +.BR EINVAL
> +An attempt was made to bind mount an unbindable mount.
> +.TP
> +.B ELOOP
> +Too many links encountered during pathname resolution.
> +.TP
> +.B EMFILE
> +The process has too many open files to create more.
> +.TP
> +.B ENFILE
> +The system has too many open files to create more.
> +.TP
> +.B ENAMETOOLONG
> +A pathname was longer than
> +.BR MAXPATHLEN .
> +.TP
> +.B ENODEV
> +Filesystem
> +.I fsname
> +not configured in the kernel.
> +.TP
> +.B ENOENT
> +A pathname was empty or had a nonexistent component.
> +.TP
> +.B ENOMEM
> +The kernel could not allocate sufficient memory to complete the call.
> +.TP
> +.B ENOTBLK
> +.I source
> +is not a block device (and a device was required).
> +.TP
> +.B ENOTDIR
> +.IR pathname ,
> +or a prefix of
> +.IR source ,
> +is not a directory.
> +.TP
> +.B ENXIO
> +The major number of the block device
> +.I source
> +is out of range.
> +.TP
> +.B EPERM
> +The caller does not have the required privileges.
> +.SH CONFORMING TO
> +These functions are Linux-specific and should not be used in programs intended
> +to be portable.
> +.SH VERSIONS
> +.BR fsopen "(), " fsmount "() and " fspick ()
> +were added to Linux in kernel 4.18.
> +.SH NOTES
> +Glibc does not (yet) provide a wrapper for the
> +.BR fsopen "(), " fsmount "() or " fspick "()"
> +system calls; call them using
> +.BR syscall (2).
> +.SH SEE ALSO
> +.BR mountpoint (1),
> +.BR move_mount (2),
> +.BR open_tree (2),
> +.BR umount (2),
> +.BR mount_namespaces (7),
> +.BR path_resolution (7),
> +.BR findmnt (8),
> +.BR lsblk (8),
> +.BR mount (8),
> +.BR umount (8)
> diff --git a/man2/fspick.2 b/man2/fspick.2
> new file mode 100644
> index 000000000..2bf59fc3e
> --- /dev/null
> +++ b/man2/fspick.2
> @@ -0,0 +1 @@
> +.so man2/fsopen.2
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [MANPAGE PATCH] Add manpage for fsinfo(2)
  2018-07-10 22:55 ` [MANPAGE PATCH] Add manpage for fsinfo(2) David Howells
@ 2019-10-09  9:52   ` Michael Kerrisk (man-pages)
  2019-10-09 12:02   ` David Howells
  1 sibling, 0 replies; 113+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-10-09  9:52 UTC (permalink / raw)
  To: David Howells
  Cc: mtk.manpages, viro, linux-api, linux-fsdevel, torvalds,
	linux-kernel, linux-man, Eric W. Biederman

Hello David,

See my previous mails.

There is currently no fsinfo(2) system call in the kernel.
Will that call still be added, or was it replaced by fsconfig(2),
which--as far as I can tell--does not have a man-pages patch?

Thanks,

Michael

On 7/11/18 12:55 AM, David Howells wrote:
> Add a manual page to document the fsinfo() system call.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  man2/fsinfo.2       | 1017 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  man2/ioctl_iflags.2 |    6 
>  man2/stat.2         |    7 
>  man2/statx.2        |   13 +
>  man2/utime.2        |    7 
>  man2/utimensat.2    |    7 
>  6 files changed, 1057 insertions(+)
>  create mode 100644 man2/fsinfo.2
> 
> diff --git a/man2/fsinfo.2 b/man2/fsinfo.2
> new file mode 100644
> index 000000000..5710232df
> --- /dev/null
> +++ b/man2/fsinfo.2
> @@ -0,0 +1,1017 @@
> +'\" t
> +.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH FSINFO 2 2018-06-06 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +fsinfo \- Get filesystem information
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.br
> +.B #include <sys/fsinfo.h>
> +.br
> +.B #include <unistd.h>
> +.br
> +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> +.PP
> +.BI "int fsinfo(int " dirfd ", const char *" pathname ","
> +.BI "           struct fsinfo_params *" params ","
> +.BI "           void *" buffer ", size_t " buf_size );
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for
> +.BR fsinfo ();
> +see NOTES.
> +.SH DESCRIPTION
> +.PP
> +fsinfo() retrieves the desired filesystem attribute, as selected by the
> +parameters pointed to by
> +.IR params ,
> +and stores its value in the buffer pointed to by
> +.IR buffer .
> +.PP
> +The parameter structure is optional, defaulting to all the parameters being 0
> +if the pointer is NULL.  The structure looks like the following:
> +.PP
> +.in +4n
> +.nf
> +struct fsinfo_params {
> +    __u32 at_flags;     /* AT_SYMLINK_NOFOLLOW and similar flags */
> +    __u32 request;      /* Requested attribute */
> +    __u32 Nth;          /* Instance of attribute */
> +    __u32 Mth;          /* Subinstance of Nth instance */
> +    __u32 __reserved[6]; /* Reserved params; all must be 0 */
> +};
> +.fi
> +.in
> +.PP
> +The filesystem to be queried is looked up using a combination of
> +.IR dirfd ", " pathname " and " params->at_flags .
> +This is discussed in more detail below.
> +.PP
> +The desired attribute is indicated by
> +.IR params->request .
> +If
> +.I params
> +is NULL, this will default to
> +.BR fsinfo_attr_statfs ,
> +which retrieves some of the information returned by
> +.BR statfs ().
> +The available attributes are described below in the "THE ATTRIBUTES" section.
> +.PP
> +Some attributes can have multiple values and some can even have multiple
> +instances with multiple values.  For example, a network filesystem might use
> +multiple servers.  The names of each of these servers can be retrieved by
> +using
> +.I params->Nth
> +to iterate through all the instances until error
> +.B ENODATA
> +occurs, indicating the end of the list.  Further, each server might have
> +multiple addresses available; these can be enumerated using
> +.I params->Nth
> +to iterate the servers and
> +.I params->Mth
> +to iterate the addresses of the Nth server.
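
The enumeration described here amounts to a loop that raises Nth until ENODATA is seen. Because there is no fsinfo() to call, the sketch below uses a hypothetical callback in place of the real query; fs_count_instances() and fs_fake_query() are illustrative names only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an fsinfo() call made with params->Nth = nth. */
typedef int (*fs_query_nth_fn)(unsigned int nth, void *ctx);

/*
 * Count the instances of an attribute by querying Nth = 0, 1, 2, ...
 * until the query fails with ENODATA.  Returns the count, or -1 if the
 * query fails for any other reason (errno is left as set by the query).
 */
int fs_count_instances(fs_query_nth_fn query, void *ctx)
{
    unsigned int nth;

    assert(query != NULL);
    for (nth = 0; ; nth++) {
        if (query(nth, ctx) >= 0)
            continue;
        if (errno == ENODATA)
            return (int)nth;    /* end of the list */
        return -1;
    }
}

/* Illustration only: pretends *(unsigned int *)ctx instances exist. */
int fs_fake_query(unsigned int nth, void *ctx)
{
    unsigned int count = *(unsigned int *)ctx;

    if (nth < count)
        return 0;
    errno = ENODATA;
    return -1;
}
```

Enumerating addresses per server would nest a second loop of the same shape over Mth inside the loop over Nth.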
> +.PP
> +The amount of data written into the buffer depends on the attribute selected.
> +Some attributes return variable-length strings and some return fixed-size
> +structures.  If either
> +.IR buffer " is NULL or " buf_size " is 0"
> +then the size of the attribute value will be returned and nothing will be
> +written into the buffer.
> +.PP
> +The
> +.I params->__reserved
> +parameters must all be 0.
> +.\"_______________________________________________________
> +.SS
> +Allowance for Future Attribute Expansion
> +.PP
> +To allow for the future expansion and addition of fields to any fixed-size
> +structure attribute,
> +.BR fsinfo ()
> +makes the following guarantees:
> +.RS 4m
> +.IP (1) 4m
> +It will always clear any excess space in the buffer.
> +.IP (2) 4m
> +It will always return the actual size of the data.
> +.IP (3) 4m
> +It will truncate the data to fit it into the buffer rather than giving an
> +error.
> +.IP (4) 4m
> +Any new version of a structure will incorporate all the fields from the old
> +version at the same offsets.
> +.RE
> +.PP
> +So, for example, if the caller is running on an older version of the kernel
> +with an older, smaller version of the structure than was asked for, the kernel
> +will write the smaller version into the buffer and will clear the remainder of
> +the buffer to make sure any additional fields are set to 0.  The function will
> +return the actual size of the data.
> +.PP
> +On the other hand, if the caller is running on a newer version of the kernel
> +with a newer version of the structure that is larger than the buffer, the write
> +to the buffer will be truncated to fit as necessary and the actual size of the
> +data will be returned.
> +.PP
> +Note that this doesn't apply to variable-length string attributes.
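
The guarantees for fixed-size structures can be modelled in user space as follows. This is only a sketch of the documented behaviour (truncate, zero-fill, always return the real size), not the kernel's code, and fs_fill_attr() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the fixed-size attribute guarantees:
 *  - with no buffer (NULL or size 0), only the attribute size is returned;
 *  - otherwise the data is truncated to fit, any excess space in the
 *    buffer is cleared, and the actual size is still returned.
 */
size_t fs_fill_attr(const void *attr, size_t attr_size,
                    void *buffer, size_t buf_size)
{
    size_t n;

    assert(attr != NULL);
    if (buffer == NULL || buf_size == 0)
        return attr_size;
    n = attr_size < buf_size ? attr_size : buf_size;
    memcpy(buffer, attr, n);
    memset((char *)buffer + n, 0, buf_size - n);
    return attr_size;
}
```

A caller comparing the return value with its own sizeof() can tell whether it received a shorter (older) or a truncated (newer) structure.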
> +
> +.\"_______________________________________________________
> +.SS
> +Invoking \fBfsinfo\fR():
> +.PP
> +To access a file's status, no permissions are required on the file itself, but
> +in the case of
> +.BR fsinfo ()
> +with a path, execute (search) permission is required on all of the directories
> +in
> +.I pathname
> +that lead to the file.
> +.PP
> +.BR fsinfo ()
> +uses
> +.IR pathname ", " dirfd " and " params->at_flags
> +to locate the target file in one of a variety of ways:
> +.TP
> +[*] By absolute path.
> +.I pathname
> +points to an absolute path and
> +.I dirfd
> +is ignored.  The file is looked up by name, starting from the root of the
> +filesystem as seen by the calling process.
> +.TP
> +[*] By cwd-relative path.
> +.I pathname
> +points to a relative path and
> +.IR dirfd " is " AT_FDCWD .
> +The file is looked up by name, starting from the current working directory.
> +.TP
> +[*] By dir-relative path.
> +.I pathname
> +points to a relative path and
> +.I dirfd
> +indicates a file descriptor pointing to a directory.  The file is looked up by
> +name, starting from the directory specified by
> +.IR dirfd .
> +.TP
> +[*] By file descriptor.
> +.IR pathname " is " NULL " and " dirfd
> +indicates a file descriptor.  The file attached to the file descriptor is
> +queried directly.  The file descriptor may point to any type of file, not just
> +a directory.
> +.PP
> +.I params->at_flags
> +can be used to influence a path-based lookup.  A value for
> +.I params->at_flags
> +is constructed by OR'ing together zero or more of the following constants:
> +.TP
> +.BR AT_EMPTY_PATH
> +.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
> +If
> +.I pathname
> +is an empty string, operate on the file referred to by
> +.IR dirfd
> +(which may have been obtained using the
> +.BR open (2)
> +.B O_PATH
> +flag).
> +If
> +.I dirfd
> +is
> +.BR AT_FDCWD ,
> +the call operates on the current working directory.
> +In this case,
> +.I dirfd
> +can refer to any type of file, not just a directory.
> +This flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.BR AT_NO_AUTOMOUNT
> +Don't automount the terminal ("basename") component of
> +.I pathname
> +if it is a directory that is an automount point.  This allows the caller to
> +gather attributes of the filesystem holding an automount point (rather than
> +the filesystem it would mount).  This flag can be used in tools that scan
> +directories to prevent mass-automounting of a directory of automount points.
> +The
> +.B AT_NO_AUTOMOUNT
> +flag has no effect if the mount point has already been mounted over.
> +This flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.B AT_SYMLINK_NOFOLLOW
> +If
> +.I pathname
> +is a symbolic link, do not dereference it:
> +instead return information about the link itself, like
> +.BR lstat ().
> +.SH THE ATTRIBUTES
> +.PP
> +There is a range of attributes that can be selected from.  These are:
> +
> +.\" __________________ fsinfo_attr_statfs __________________
> +.TP
> +.B fsinfo_attr_statfs
> +This retrieves the "dynamic"
> +.B statfs
> +information, such as block and file counts, that are expected to change whilst
> +a filesystem is being used.  This fills in the following structure:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_statfs {
> +    __u64 f_blocks;	/* Total number of blocks in fs */
> +    __u64 f_bfree;	/* Total number of free blocks */
> +    __u64 f_bavail;	/* Number of free blocks available to ordinary user */
> +    __u64 f_files;	/* Total number of file nodes in fs */
> +    __u64 f_ffree;	/* Number of free file nodes */
> +    __u64 f_favail;	/* Number of free file nodes available to ordinary user */
> +    __u32 f_bsize;	/* Optimal block size */
> +    __u32 f_frsize;	/* Fragment size */
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +The fields correspond to those of the same name returned by
> +.BR statfs ().
> +
> +.\" __________________ fsinfo_attr_fsinfo __________________
> +.TP
> +.B fsinfo_attr_fsinfo
> +This retrieves information about the
> +.BR fsinfo ()
> +system call itself.  This fills in the following structure:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_fsinfo {
> +    __u32 max_attr;
> +    __u32 max_cap;
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +The
> +.I max_attr
> +value indicates the number of attributes supported by the
> +.BR fsinfo ()
> +system call, and
> +.I max_cap
> +indicates the number of capability bits supported by the
> +.B fsinfo_attr_capabilities
> +attribute.  The first corresponds to
> +.I fsinfo_attr__nr
> +and the second to
> +.I fsinfo_cap__nr
> +in the header file.
> +
> +.\" __________________ fsinfo_attr_ids __________________
> +.TP
> +.B fsinfo_attr_ids
> +This retrieves a number of fixed IDs and other static information otherwise
> +available through
> +.BR statfs ().
> +The following structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_ids {
> +    char  f_fs_name[15 + 1]; /* Filesystem name */
> +    __u64 f_flags;	/* Filesystem mount flags (MS_*) */
> +    __u64 f_fsid;	/* Short 64-bit Filesystem ID */
> +    __u64 f_sb_id;	/* Internal superblock ID */
> +    __u32 f_fstype;	/* Filesystem type from linux/magic.h */
> +    __u32 f_dev_major;	/* As st_dev_* from struct statx */
> +    __u32 f_dev_minor;
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +Most of these are filled in as for
> +.BR statfs (),
> +with the addition of the filesystem's symbolic name in
> +.I f_fs_name
> +and an identifier for use in notifications in
> +.IR f_sb_id .
> +
> +.\" __________________ fsinfo_attr_limits __________________
> +.TP
> +.B fsinfo_attr_limits
> +This retrieves information about the limits of what a filesystem can support.
> +The following structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_limits {
> +    __u64 max_file_size;
> +    __u64 max_uid;
> +    __u64 max_gid;
> +    __u64 max_projid;
> +    __u32 max_dev_major;
> +    __u32 max_dev_minor;
> +    __u32 max_hard_links;
> +    __u32 max_xattr_body_len;
> +    __u16 max_xattr_name_len;
> +    __u16 max_filename_len;
> +    __u16 max_symlink_len;
> +    __u16 __reserved[1];
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +These indicate the maximum supported sizes for a variety of filesystem objects,
> +including the file size, the extended attribute name length and body length,
> +the filename length and the symlink body length.
> +.IP
> +It also indicates the maximum representable values for a User ID, a Group ID,
> +a Project ID, a device major number and a device minor number.
> +.IP
> +And finally, it indicates the maximum number of hard links that can be made to
> +a file.
> +.IP
> +Note that some of these values may be zero if the underlying object or concept
> +is not supported by the filesystem or the medium.
> +
> +.\" __________________ fsinfo_attr_supports __________________
> +.TP
> +.B fsinfo_attr_supports
> +This retrieves information about what bits a filesystem supports in various
> +masks.  The following structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_supports {
> +    __u64 stx_attributes;
> +    __u32 stx_mask;
> +    __u32 ioc_flags;
> +    __u32 win_file_attrs;
> +    __u32 __reserved[1];
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +The
> +.IR stx_attributes " and " stx_mask
> +fields indicate what bits in the struct statx fields of the matching names
> +are supported by the filesystem.
> +.IP
> +The
> +.I ioc_flags
> +field indicates what FS_*_FL flag bits as used through the FS_IOC_GET/SETFLAGS
> +ioctls are supported by the filesystem.
> +.IP
> +The
> +.I win_file_attrs
> +indicates what DOS/Windows file attributes a filesystem supports, if any.
> +
> +.\" __________________ fsinfo_attr_capabilities __________________
> +.TP
> +.B fsinfo_attr_capabilities
> +This retrieves information about what features a filesystem supports as a
> +series of single bit indicators.  The following structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_capabilities {
> +    __u8 capabilities[(fsinfo_cap__nr + 7) / 8];
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +where the bit of interest can be found by:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +	p->capabilities[bit / 8] & (1 << (bit % 8))
> +.fi
> +.in
> +.RE
> +.IP
> +The bits are listed by
> +.I enum fsinfo_capability
> +and
> +.B fsinfo_cap__nr
> +is one more than the last capability bit listed in the header file.
> +.IP
> +Note that the number of capability bits actually supported by the kernel can be
> +found using the
> +.B fsinfo_attr_fsinfo
> +attribute.
> +.IP
> +The capability bits and their meanings are listed below in the "THE
> +CAPABILITIES" section.
> +
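
The capability bit test can be wrapped in a small predicate. The sketch below takes the raw byte array rather than struct fsinfo_capabilities so that it does not depend on the proposed header; fs_cap_is_set() is a hypothetical name, and the caller is assumed to pass a bit number below the number of capability bits available.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if capability bit 'bit' is set in the byte array, else 0.
 * Bit N lives in byte N / 8 at position N % 8, as described above.
 */
int fs_cap_is_set(const unsigned char *capabilities, unsigned int bit)
{
    assert(capabilities != NULL);
    return (capabilities[bit / 8] & (1u << (bit % 8))) != 0;
}
```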
> +.\" __________________ fsinfo_attr_timestamp_info __________________
> +.TP
> +.B fsinfo_attr_timestamp_info
> +This retrieves information about what timestamp resolution and scope is
> +supported by a filesystem for each of the file timestamps.  The following
> +structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_timestamp_info {
> +	__s64 minimum_timestamp;
> +	__s64 maximum_timestamp;
> +	__u16 atime_gran_mantissa;
> +	__u16 btime_gran_mantissa;
> +	__u16 ctime_gran_mantissa;
> +	__u16 mtime_gran_mantissa;
> +	__s8  atime_gran_exponent;
> +	__s8  btime_gran_exponent;
> +	__s8  ctime_gran_exponent;
> +	__s8  mtime_gran_exponent;
> +	__u32 __reserved[1];
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +where
> +.IR minimum_timestamp " and " maximum_timestamp
> +are the limits on the timestamps that the filesystem supports and
> +.IR *time_gran_mantissa " and " *time_gran_exponent
> +indicate the granularity of each timestamp in terms of seconds, using the
> +formula:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +mantissa * pow(10, exponent) Seconds
> +.fi
> +.in
> +.RE
> +.IP
> +where exponent may be negative and the result may be a fraction of a second.
> +.IP
> +Four timestamps are detailed: \fBA\fPccess time, \fBB\fPirth/creation time,
> +\fBC\fPhange time and \fBM\fPodification time.  Capability bits are defined
> +that specify whether each of these exist in the filesystem or not.
> +.IP
> +Note that the timestamp description may be approximated or inaccurate if the
> +file is actually remote or is the union of multiple objects.
> +
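
The granularity formula above works out as follows in code. This is a sketch only: `example_gran_seconds()` is a hypothetical helper, and it uses plain multiplication and division instead of pow() so that it needs no maths library.

```c
#include <stdint.h>

/*
 * Granularity in seconds: mantissa * 10^exponent.
 * The exponent may be negative, giving a fraction of a second.
 */
static double example_gran_seconds(uint16_t mantissa, int8_t exponent)
{
    double value = mantissa;
    int e = exponent;

    for (; e > 0; e--)
        value *= 10.0;
    for (; e < 0; e++)
        value /= 10.0;
    return value;
}
```

With this, a mantissa of 1 and an exponent of -9 describes nanosecond granularity, and a mantissa of 2 with an exponent of 0 describes a two-second granularity.
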
> +.\" __________________ fsinfo_attr_volume_id __________________
> +.TP
> +.B fsinfo_attr_volume_id
> +This retrieves the system's superblock volume identifier as a variable-length
> +string.  This does not necessarily represent a value stored in the medium but
> +might be constructed on the fly.
> +.IP
> +For instance, for a block device this is the block device identifier
> +(e.g. "sdb2"); for AFS this would be the numeric volume identifier.
> +
> +.\" __________________ fsinfo_attr_volume_uuid __________________
> +.TP
> +.B fsinfo_attr_volume_uuid
> +This retrieves the volume UUID, if there is one, as a little-endian binary
> +UUID.  This fills in the following structure:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_volume_uuid {
> +    __u8 uuid[16];
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +
> +.\" __________________ fsinfo_attr_volume_name __________________
> +.TP
> +.B fsinfo_attr_volume_name
> +This retrieves the filesystem's volume name as a variable-length string.  This
> +is expected to represent a name stored in the medium.
> +.IP
> +For a block device, this might be a label stored in the superblock.  For a
> +network filesystem, this might be a logical volume name of some sort.
> +
> +.\" __________________ fsinfo_attr_cell/domain __________________
> +.PP
> +.B fsinfo_attr_cell_name
> +.br
> +.B fsinfo_attr_domain_name
> +.br
> +.IP
> +These two attributes are variable-length string attributes that may be used to
> +obtain information about network filesystems.  An AFS volume, for instance,
> +belongs to a named cell.  CIFS shares may belong to a domain.
> +
> +.\" __________________ fsinfo_attr_realm_name __________________
> +.TP
> +.B fsinfo_attr_realm_name
> +This attribute is a variable-length string that indicates the Kerberos realm that
> +a filesystem's authentication tokens should come from.
> +
> +.\" __________________ fsinfo_attr_server_name __________________
> +.TP
> +.B fsinfo_attr_server_name
> +This attribute is a multiple-value attribute that lists the names of the
> +servers that are backing a network filesystem.  Each value is a variable-length
> +string.  The values are enumerated by calling
> +.BR fsinfo ()
> +multiple times, incrementing
> +.I params->Nth
> +each time until an ENODATA error occurs, thereby indicating the end of the
> +list.
> +
> +.\" __________________ fsinfo_attr_server_address __________________
> +.TP
> +.B fsinfo_attr_server_address
> +This attribute is a multiple-instance, multiple-value attribute that lists the
> +addresses of the servers that are backing a network filesystem.  Each value is
> +a structure of the following type:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_server_address {
> +    struct __kernel_sockaddr_storage address;
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +Where the address may be AF_INET, AF_INET6, AF_RXRPC or any other type as
> +appropriate to the filesystem.
> +.IP
> +The values are enumerated by calling
> +.BR fsinfo ()
> +multiple times, incrementing
> +.I params->Nth
> +to step through the servers and
> +.I params->Mth
> +to step through the addresses of the Nth server each time until ENODATA errors
> +occur, thereby indicating either the end of a server's address list or the end
> +of the server list.
> +.IP
> +Barring the server list changing whilst being accessed, it is expected that a
> +given value of
> +.I params->Nth
> +here selects the same server as that value of
> +.I params->Nth
> +does for
> +.BR fsinfo_attr_server_name .
> +
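
The two-level enumeration described above (step Mth through one server's addresses, then move on to the next Nth) might look like the sketch below. Since fsinfo() is not available to call, `example_query()` is a hypothetical stand-in that pretends there are two servers with two and one addresses respectively; only the loop structure is the point. The sketch also assumes that ENODATA for Mth == 0 means the server list has ended, which would cut the walk short if a server could legitimately have no addresses.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for an fsinfo() query of fsinfo_attr_server_address:
 * two servers, the first with two addresses and the second with one.
 */
static int example_query(unsigned int nth, unsigned int mth)
{
    static const unsigned int addrs_per_server[] = { 2, 1 };
    size_t nr = sizeof(addrs_per_server) / sizeof(addrs_per_server[0]);

    if (nth >= nr || mth >= addrs_per_server[nth]) {
        errno = ENODATA;
        return -1;
    }
    return 0;
}

/*
 * Walk every (Nth, Mth) pair.  Returns the number of addresses seen, or -1
 * on an error other than ENODATA.
 */
static int example_count_addresses(void)
{
    int count = 0;

    for (unsigned int nth = 0; ; nth++) {
        unsigned int mth;

        for (mth = 0; ; mth++) {
            if (example_query(nth, mth) == 0) {
                count++;
                continue;
            }
            if (errno != ENODATA)
                return -1;
            break;
        }
        /* ENODATA on the first address: assume there is no Nth server. */
        if (mth == 0)
            return count;
    }
}
```
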
> +.\" __________________ fsinfo_attr_parameter __________________
> +.TP
> +.B fsinfo_attr_parameter
> +This attribute is a multiple-value attribute that lists the values of the mount
> +parameters for a filesystem as variable-length strings.
> +.IP
> +The parameters are enumerated by calling
> +.BR fsinfo ()
> +multiple times, incrementing
> +.I params->Nth
> +to step through them until error ENODATA is given.
> +.IP
> +Parameter strings are presented in a form akin to the way they're passed to the
> +context created by the
> +.BR fsopen ()
> +system call.  For example, straight text parameters will be rendered as
> +something like:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +"o data=journal"
> +"o noquota"
> +.fi
> +.in
> +.RE
> +.IP
> +Where the initial "word" indicates the option form.
> +
> +.\" __________________ fsinfo_attr_source __________________
> +.TP
> +.B fsinfo_attr_source
> +This attribute is a multiple-value attribute that lists the mount sources for a
> +filesystem as variable-length strings.  Normally only one source will be
> +available, but the possibility of having more than one is allowed for.
> +.IP
> +The sources are enumerated by calling
> +.BR fsinfo ()
> +multiple times, incrementing
> +.I params->Nth
> +to step through them until error ENODATA is given.
> +.IP
> +Source strings are presented in a form akin to the way they're passed to the
> +context created by the
> +.BR fsopen ()
> +system call.  For example, they will be rendered as something like:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +"s /dev/sda1"
> +"s example.com/pub/linux/"
> +.fi
> +.in
> +.RE
> +.IP
> +Where the initial "word" indicates the option form.
> +
> +.\" __________________ fsinfo_attr_name_encoding __________________
> +.TP
> +.B fsinfo_attr_name_encoding
> +This attribute is a variable-length string that indicates the filename encoding
> +used by the filesystem.  The default is "utf8".  Note that this may indicate a
> +non-8-bit encoding if that's what the underlying filesystem actually supports.
> +
> +.\" __________________ fsinfo_attr_name_codepage __________________
> +.TP
> +.B fsinfo_attr_name_codepage
> +This attribute is a variable-length string that indicates the codepage used to
> +translate filenames from the filesystem to the system if this is applicable to
> +the filesystem.
> +
> +.\" __________________ fsinfo_attr_io_size __________________
> +.TP
> +.B fsinfo_attr_io_size
> +This retrieves information about the I/O sizes supported by the filesystem.
> +The following structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_io_size {
> +    __u32 block_size;
> +    __u32 max_single_read_size;
> +    __u32 max_single_write_size;
> +    __u32 best_read_size;
> +    __u32 best_write_size;
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +Where
> +.I block_size
> +indicates the fundamental I/O block size of the filesystem as something
> +O_DIRECT read/write sizes must be a multiple of;
> +.IR max_single_read_size " and " max_single_write_size
> +indicate the maximum sizes for individual unbuffered data transfer operations;
> +and
> +.IR best_read_size " and " best_write_size
> +indicate the recommended I/O sizes.
> +.IP
> +Note that any of these may be zero if inapplicable or indeterminable.
> +
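
One way a caller might use `block_size` together with the "may be zero" note above, as a sketch: `example_direct_len_ok()` is a hypothetical helper, and returning -1 for an unknown block size is an assumption about how a caller would want to handle that case.

```c
#include <stdint.h>

/*
 * Returns 1 if `len` is a multiple of the filesystem's block size (and so a
 * plausible O_DIRECT transfer size), 0 if it is not, and -1 if the block
 * size was reported as zero, meaning inapplicable or indeterminable.
 */
static int example_direct_len_ok(uint32_t block_size, uint64_t len)
{
    if (block_size == 0)
        return -1;
    return len % block_size == 0;
}
```
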
> +
> +
> +.SH THE CAPABILITIES
> +.PP
> +There are a number of capability bits in a bit array that can be retrieved using
> +.BR fsinfo_attr_capabilities .
> +These give information about features of the filesystem driver and the specific
> +filesystem.
> +
> +.\" __________________ fsinfo_cap_is_*_fs __________________
> +.PP
> +.B fsinfo_cap_is_kernel_fs
> +.br
> +.B fsinfo_cap_is_block_fs
> +.br
> +.B fsinfo_cap_is_flash_fs
> +.br
> +.B fsinfo_cap_is_network_fs
> +.br
> +.B fsinfo_cap_is_automounter_fs
> +.IP
> +These indicate the primary type of the filesystem.
> +.B kernel
> +filesystems are special communication interfaces that substitute files for
> +system calls; examples include procfs and sysfs.
> +.B block
> +filesystems require a block device on which to operate; examples include ext4
> +and XFS.
> +.B flash
> +filesystems require an MTD device on which to operate; examples include JFFS2.
> +.B network
> +filesystems require access to the network and contact one or more servers;
> +examples include NFS and AFS.
> +.B automounter
> +filesystems are kernel special filesystems that host automount points and
> +triggers to dynamically create automount points.  Examples include autofs and
> +AFS's dynamic root.
> +
> +.\" __________________ fsinfo_cap_automounts __________________
> +.TP
> +.B fsinfo_cap_automounts
> +The filesystem may have automount points that can be triggered by pathwalk.
> +
> +.\" __________________ fsinfo_cap_adv_locks __________________
> +.TP
> +.B fsinfo_cap_adv_locks
> +The filesystem supports advisory file locks.  For a network filesystem, this
> +indicates that the advisory file locks are cross-client (and also between
> +server and its local filesystem on something like NFS).
> +
> +.\" __________________ fsinfo_cap_mand_locks __________________
> +.TP
> +.B fsinfo_cap_mand_locks
> +The filesystem supports mandatory file locks.  For a network filesystem, this
> +indicates that the mandatory file locks are cross-client (and also between
> +server and its local filesystem on something like NFS).
> +
> +.\" __________________ fsinfo_cap_leases __________________
> +.TP
> +.B fsinfo_cap_leases
> +The filesystem supports leases.  For a network filesystem, this means that the
> +server will tell the client to clean up its state on a file before passing the
> +lease to another client.
> +
> +.\" __________________ fsinfo_cap_*ids __________________
> +.PP
> +.B fsinfo_cap_uids
> +.br
> +.B fsinfo_cap_gids
> +.br
> +.B fsinfo_cap_projids
> +.IP
> +These indicate that the filesystem supports numeric user IDs, group IDs and
> +project IDs respectively.
> +
> +.\" __________________ fsinfo_cap_id_* __________________
> +.PP
> +.B fsinfo_cap_id_names
> +.br
> +.B fsinfo_cap_id_guids
> +.IP
> +These indicate that the filesystem employs textual names and/or GUIDs as
> +identifiers.
> +
> +.\" __________________ fsinfo_cap_windows_attrs __________________
> +.TP
> +.B fsinfo_cap_windows_attrs
> +Indicates that the filesystem supports some Windows FILE_* attributes.
> +
> +.\" __________________ fsinfo_cap_*_quotas __________________
> +.PP
> +.B fsinfo_cap_user_quotas
> +.br
> +.B fsinfo_cap_group_quotas
> +.br
> +.B fsinfo_cap_project_quotas
> +.IP
> +These indicate that the filesystem supports quotas for users, groups and
> +projects respectively.
> +
> +.\" __________________ fsinfo_cap_xattrs/filetypes __________________
> +.PP
> +.B fsinfo_cap_xattrs
> +.br
> +.B fsinfo_cap_symlinks
> +.br
> +.B fsinfo_cap_hard_links
> +.br
> +.B fsinfo_cap_hard_links_1dir
> +.br
> +.B fsinfo_cap_device_files
> +.br
> +.B fsinfo_cap_unix_specials
> +.IP
> +These indicate that the filesystem supports respectively extended attributes;
> +symbolic links; hard links spanning directories; hard links, but only within a
> +directory; block and character device files; and UNIX special files, such as
> +FIFO and socket.
> +
> +.\" __________________ fsinfo_cap_*journal* __________________
> +.PP
> +.B fsinfo_cap_journal
> +.br
> +.B fsinfo_cap_data_is_journalled
> +.IP
> +The first of these indicates that the filesystem has a journal and the second
> +that the file data changes are being journalled.
> +
> +.\" __________________ fsinfo_cap_o_* __________________
> +.PP
> +.B fsinfo_cap_o_sync
> +.br
> +.B fsinfo_cap_o_direct
> +.IP
> +These indicate that O_SYNC and O_DIRECT are supported respectively.
> +
> +.\" __________________ fsinfo_cap_volume_*/*_name __________________
> +.PP
> +.B fsinfo_cap_volume_id
> +.br
> +.B fsinfo_cap_volume_uuid
> +.br
> +.B fsinfo_cap_volume_name
> +.br
> +.B fsinfo_cap_volume_fsid
> +.br
> +.B fsinfo_cap_cell_name
> +.br
> +.B fsinfo_cap_domain_name
> +.br
> +.B fsinfo_cap_realm_name
> +.IP
> +These indicate if various attributes are supported by the filesystem, where
> +.B fsinfo_cap_X
> +here corresponds to
> +.BR fsinfo_attr_X .
> +
> +.\" __________________ fsinfo_cap_iver_* __________________
> +.PP
> +.B fsinfo_cap_iver_all_change
> +.br
> +.B fsinfo_cap_iver_data_change
> +.br
> +.B fsinfo_cap_iver_mono_incr
> +.IP
> +These indicate if
> +.I i_version
> +on an inode in the filesystem is supported and
> +how it behaves.
> +.B all_change
> +indicates that i_version is incremented on metadata changes as well as data
> +changes.
> +.B data_change
> +indicates that i_version is only incremented on data changes, including
> +truncation.
> +.B mono_incr
> +indicates that i_version is incremented by exactly 1 for each change made.
> +
> +.\" __________________ fsinfo_cap_resource_forks __________________
> +.TP
> +.B fsinfo_cap_resource_forks
> +This indicates that the filesystem supports some sort of resource fork or
> +alternate data stream on a file.  This isn't the same as an extended attribute.
> +
> +.\" __________________ fsinfo_cap_name_* __________________
> +.PP
> +.B fsinfo_cap_name_case_indep
> +.br
> +.B fsinfo_cap_name_non_utf8
> +.br
> +.B fsinfo_cap_name_has_codepage
> +.IP
> +These indicate certain facts about the filenames in a filesystem: whether
> +they're case-independent; if they're not UTF-8; and if there's a codepage
> +employed to map the names.
> +
> +.\" __________________ fsinfo_cap_sparse __________________
> +.TP
> +.B fsinfo_cap_sparse
> +This indicates that the filesystem supports sparse files.
> +
> +.\" __________________ fsinfo_cap_not_persistent __________________
> +.TP
> +.B fsinfo_cap_not_persistent
> +This indicates that the filesystem is not persistent, and that any data stored
> +here will not be saved in the event that the filesystem is unmounted, the
> +machine is rebooted or the machine loses power.
> +
> +.\" __________________ fsinfo_cap_no_unix_mode __________________
> +.TP
> +.B fsinfo_cap_no_unix_mode
> +This indicates that the filesystem doesn't support the UNIX mode permissions
> +bits.
> +
> +.\" __________________ fsinfo_cap_has_*time __________________
> +.PP
> +.B fsinfo_cap_has_atime
> +.br
> +.B fsinfo_cap_has_btime
> +.br
> +.B fsinfo_cap_has_ctime
> +.br
> +.B fsinfo_cap_has_mtime
> +.IP
> +These indicate which timestamps a filesystem supports, including: Access
> +time, Birth/creation time, Change time (metadata and data) and Modification
> +time (data only).
> +
> +
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.SH RETURN VALUE
> +On success, the size of the value that the kernel has available is returned,
> +irrespective of whether the buffer is large enough to hold that.  The data
> +written to the buffer will be truncated if it is not.  On error, \-1 is
> +returned, and
> +.I errno
> +is set appropriately.
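
Because the return value is the full size available even when the buffer is too small, a caller can detect truncation and retry with a larger buffer. A sketch of that pattern follows; `example_get()` is a hypothetical stand-in for the system call (it always has the same 20-byte value available), and `example_get_all()` is likewise illustrative rather than a proposed API.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical stand-in for fsinfo(): copies at most `len` bytes and returns
 * the full size available, as the RETURN VALUE text describes.
 */
static ssize_t example_get(void *buffer, size_t len)
{
    static const char value[] = "example-volume-name";
    size_t full = sizeof(value);

    memcpy(buffer, value, len < full ? len : full);
    return (ssize_t)full;
}

/*
 * Fetch the whole value, growing the buffer if the first attempt was
 * truncated.  The caller frees the result; NULL is returned on failure.
 */
static char *example_get_all(size_t guess, size_t *size_out)
{
    size_t len = guess ? guess : 1;
    char *buf = malloc(len);

    while (buf) {
        ssize_t ret = example_get(buf, len);

        if (ret < 0) {
            free(buf);
            return NULL;
        }
        if ((size_t)ret <= len) {
            *size_out = (size_t)ret;
            return buf;
        }
        /* Truncated: retry with the size that was reported. */
        len = (size_t)ret;
        char *bigger = realloc(buf, len);
        if (!bigger)
            free(buf);
        buf = bigger;
    }
    return NULL;
}
```

The loop rather than a single retry allows for the value growing between calls.
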
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Search permission is denied for one of the directories
> +in the path prefix of
> +.IR pathname .
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EBADF
> +.I dirfd
> +is not a valid open file descriptor.
> +.TP
> +.B EFAULT
> +.I pathname
> +is NULL or
> +.IR pathname ", " params " or " buffer
> +point to a location outside the process's accessible address space.
> +.TP
> +.B EINVAL
> +Reserved flag specified in
> +.IR params->at_flags " or one of " params->__reserved[]
> +is not 0.
> +.TP
> +.B EOPNOTSUPP
> +Unsupported attribute requested in
> +.IR params->request .
> +This may be beyond the limit of the supported attribute set or may just not be
> +one that's supported by the filesystem.
> +.TP
> +.B ENODATA
> +Unavailable attribute value requested by
> +.IR params->Nth " and/or " params->Mth .
> +.TP
> +.B ELOOP
> +Too many symbolic links encountered while traversing the pathname.
> +.TP
> +.B ENAMETOOLONG
> +.I pathname
> +is too long.
> +.TP
> +.B ENOENT
> +A component of
> +.I pathname
> +does not exist, or
> +.I pathname
> +is an empty string and
> +.B AT_EMPTY_PATH
> +was not specified in
> +.IR params->at_flags .
> +.TP
> +.B ENOMEM
> +Out of memory (i.e., kernel memory).
> +.TP
> +.B ENOTDIR
> +A component of the path prefix of
> +.I pathname
> +is not a directory or
> +.I pathname
> +is relative and
> +.I dirfd
> +is a file descriptor referring to a file other than a directory.
> +.SH VERSIONS
> +.BR fsinfo ()
> +was added to Linux in kernel 4.18.
> +.SH CONFORMING TO
> +.BR fsinfo ()
> +is Linux-specific.
> +.SH NOTES
> +Glibc does not (yet) provide a wrapper for the
> +.BR fsinfo ()
> +system call; call it using
> +.BR syscall (2).
> +.SH SEE ALSO
> +.BR ioctl_iflags (2),
> +.BR statx (2),
> +.BR statfs (2)
> diff --git a/man2/ioctl_iflags.2 b/man2/ioctl_iflags.2
> index 9c77b08b9..49ba4444e 100644
> --- a/man2/ioctl_iflags.2
> +++ b/man2/ioctl_iflags.2
> @@ -200,9 +200,15 @@ the effective user ID of the caller must match the owner of the file,
>  or the caller must have the
>  .BR CAP_FOWNER
>  capability.
> +.PP
> +The set of flags supported by a filesystem can be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_supports .
>  .SH SEE ALSO
>  .BR chattr (1),
>  .BR lsattr (1),
> +.BR fsinfo (2),
>  .BR mount (2),
>  .BR btrfs (5),
>  .BR ext4 (5),
> diff --git a/man2/stat.2 b/man2/stat.2
> index dad9a01ac..ee4001f85 100644
> --- a/man2/stat.2
> +++ b/man2/stat.2
> @@ -532,6 +532,12 @@ If none of the aforementioned macros are defined,
>  then the nanosecond values are exposed with names of the form
>  .IR st_atimensec .
>  .\"
> +.PP
> +Which timestamps are supported by a filesystem and their ranges and
> +granularities can be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_timestamp_info .
>  .SS C library/kernel differences
>  Over time, increases in the size of the
>  .I stat
> @@ -707,6 +713,7 @@ main(int argc, char *argv[])
>  .BR access (2),
>  .BR chmod (2),
>  .BR chown (2),
> +.BR fsinfo (2),
>  .BR readlink (2),
>  .BR utime (2),
>  .BR capabilities (7),
> diff --git a/man2/statx.2 b/man2/statx.2
> index edac9f6f4..9a57c1b90 100644
> --- a/man2/statx.2
> +++ b/man2/statx.2
> @@ -534,12 +534,25 @@ Glibc does not (yet) provide a wrapper for the
>  .BR statx ()
>  system call; call it using
>  .BR syscall (2).
> +.PP
> +The sets of mask/stx_mask and stx_attributes bits supported by a filesystem
> +can be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_supports .
> +.PP
> +Which timestamps are supported by a filesystem and their ranges and
> +granularities can also be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_timestamp_info .
>  .SH SEE ALSO
>  .BR ls (1),
>  .BR stat (1),
>  .BR access (2),
>  .BR chmod (2),
>  .BR chown (2),
> +.BR fsinfo (2),
>  .BR readlink (2),
>  .BR stat (2),
>  .BR utime (2),
> diff --git a/man2/utime.2 b/man2/utime.2
> index 03a43a416..c6acdbac2 100644
> --- a/man2/utime.2
> +++ b/man2/utime.2
> @@ -181,9 +181,16 @@ on an append-only file.
>  .\" is just a wrapper for
>  .\" .BR utime ()
>  .\" and hence does not allow a subsecond resolution.
> +.PP
> +Which timestamps are supported by a filesystem and their ranges and
> +granularities can be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_timestamp_info .
>  .SH SEE ALSO
>  .BR chattr (1),
>  .BR touch (1),
> +.BR fsinfo (2),
>  .BR futimesat (2),
>  .BR stat (2),
>  .BR utimensat (2),
> diff --git a/man2/utimensat.2 b/man2/utimensat.2
> index d61b43e96..be8925548 100644
> --- a/man2/utimensat.2
> +++ b/man2/utimensat.2
> @@ -633,9 +633,16 @@ instead checks whether the
>  .\" conversely, a process with a read-only file descriptor won't
>  .\" be able to update the timestamps of a file,
>  .\" even if it has write permission on the file.
> +.PP
> +Which timestamps are supported by a filesystem and their ranges and
> +granularities can be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_timestamp_info .
>  .SH SEE ALSO
>  .BR chattr (1),
>  .BR touch (1),
> +.BR fsinfo (2),
>  .BR futimesat (2),
>  .BR openat (2),
>  .BR stat (2),
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [MANPAGE PATCH] Add manpage for fsinfo(2)
  2018-07-10 22:55 ` [MANPAGE PATCH] Add manpage for fsinfo(2) David Howells
  2019-10-09  9:52   ` Michael Kerrisk (man-pages)
@ 2019-10-09 12:02   ` David Howells
  1 sibling, 0 replies; 113+ messages in thread
From: David Howells @ 2019-10-09 12:02 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: dhowells, viro, linux-api, linux-fsdevel, torvalds, linux-kernel,
	linux-man, Eric W. Biederman

Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:

> There is no fsinfo(2) system call in the kernel currently.
> Will that call still be added,

Hopefully, but I'm not sure it'll be ready by the next merge window.

> or was it replaced by fsconfig(2),

They're different things and not interchangeable.

David

